Commit c49cb47 · Parent(s): b9aef14
Configure OCR Time Capsule with default dataset and branding
- CLAUDE.md +186 -0
- css/styles.css +197 -0
- js/app.js +550 -0
- js/dataset-api.js +273 -0
- js/diff-utils.js +219 -0
CLAUDE.md ADDED
@@ -0,0 +1,186 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer.

## Project Overview

OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with an enhanced user experience.

## Architecture

### Technology Stack
- **Frontend Framework**: Alpine.js (lightweight reactivity, ~15KB)
- **Styling**: Tailwind CSS (utility-first, responsive design)
- **Interactions**: HTMX (server-side rendering capabilities)
- **API**: HuggingFace Dataset Viewer API (no backend required)
- **Language**: Vanilla JavaScript (no build process needed)

### Core Components

**index.html** - Main application shell
- Split-pane layout (1/3 image, 2/3 text comparison)
- Three view modes: Side-by-side, Inline diff, Improved only
- Dark mode support with proper contrast
- Responsive design for mobile devices

**js/dataset-api.js** - HuggingFace API wrapper
- Smart caching with 45-minute expiration for signed URLs
- Batch loading (100 rows at a time)
- Automatic column detection for different dataset schemas
- Image URL refresh on expiration
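
The 45-minute expiration mentioned above can be illustrated with a small time-to-live cache. This is only a sketch under assumptions: the class name `TtlCache`, its methods, and the injectable `now` clock are hypothetical and are not taken from `js/dataset-api.js`.

```javascript
// Hypothetical TTL cache; not the actual dataset-api.js implementation.
const CACHE_TTL_MS = 45 * 60 * 1000; // 45 minutes, below the ~1 hour URL lifetime

class TtlCache {
    // `now` is injectable so expiry can be exercised without waiting.
    constructor(ttlMs = CACHE_TTL_MS, now = () => Date.now()) {
        this.ttlMs = ttlMs;
        this.now = now;
        this.entries = new Map(); // key -> { value, storedAt }
    }

    set(key, value) {
        this.entries.set(key, { value, storedAt: this.now() });
    }

    // Returns the cached value, or undefined if it is missing or expired.
    get(key) {
        const entry = this.entries.get(key);
        if (!entry) return undefined;
        if (this.now() - entry.storedAt >= this.ttlMs) {
            this.entries.delete(key); // drop the stale entry so callers refetch
            return undefined;
        }
        return entry.value;
    }
}
```

Expiring entries before the signed URLs themselves expire means a cache hit should rarely hand back a dead image link.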

**js/app.js** - Alpine.js application logic
- Keyboard navigation (J/K, arrows)
- URL state management for shareable links
- Diff mode switching (character/word/line)
- Dark mode persistence in localStorage
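
As a rough illustration of URL state management for shareable links, the sketch below writes and reads the `dataset` and `index` query parameters, which are the two parameters `loadSample()` in `js/app.js` sets. The helper names `buildShareUrl` and `parseShareUrl` are hypothetical, and falling back to index 0 on bad input is an assumption of this sketch.

```javascript
// Hypothetical helpers; only the parameter names come from app.js.
function buildShareUrl(baseUrl, datasetId, index) {
    const url = new URL(baseUrl);
    url.searchParams.set('dataset', datasetId);
    url.searchParams.set('index', String(index));
    return url.toString();
}

function parseShareUrl(href) {
    const params = new URL(href).searchParams;
    const index = Number.parseInt(params.get('index') ?? '', 10);
    return {
        datasetId: params.get('dataset'), // null when the parameter is absent
        // Assumption: fall back to the first sample on a missing or invalid index.
        index: Number.isInteger(index) && index >= 0 ? index : 0,
    };
}
```

In the browser, the built URL would typically be passed to `history.replaceState` so that navigating between samples does not add history entries.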

**js/diff-utils.js** - Text comparison algorithms
- Character-level diff with inline highlighting
- Word-level diff preserving whitespace
- Line-level diff for larger changes
- LCS (Longest Common Subsequence) implementation
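
To show how an LCS table can drive a character-level diff, here is a minimal sketch. The function names `lcsTable` and `diffChars` and the `{ type, value }` operation shape are illustrative assumptions, not the contents of `js/diff-utils.js`. It works on UTF-16 code units and uses O(n·m) time and memory, so very long texts may call for a different approach.

```javascript
// Hypothetical sketch of an LCS-based character diff.
function lcsTable(a, b) {
    // dp[i][j] = LCS length of a.slice(0, i) and b.slice(0, j)
    const dp = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
    for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
            dp[i][j] = a[i - 1] === b[j - 1]
                ? dp[i - 1][j - 1] + 1
                : Math.max(dp[i - 1][j], dp[i][j - 1]);
        }
    }
    return dp;
}

function diffChars(a, b) {
    const dp = lcsTable(a, b);
    const ops = [];
    let i = a.length;
    let j = b.length;
    // Walk back from the bottom-right corner, emitting operations in reverse.
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
            ops.push({ type: 'equal', value: a[i - 1] });
            i--;
            j--;
        } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
            ops.push({ type: 'insert', value: b[j - 1] });
            j--;
        } else {
            ops.push({ type: 'delete', value: a[i - 1] });
            i--;
        }
    }
    return ops.reverse();
}
```

Joining the `equal` and `delete` values reproduces the original text, and joining the `equal` and `insert` values reproduces the improved text, which is a convenient sanity check for any diff of this kind.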

**css/styles.css** - Custom styling
- Dark mode enhancements
- Diff highlighting with accessibility in mind
- Smooth transitions and animations
- Print-friendly styles

## Key Design Decisions

### Why Separate from OCR Time Machine?

1. **Focused Purpose**: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results
2. **Performance**: No Python/Gradio overhead - instant loading and navigation
3. **User Experience**: Custom UI optimized for text comparison workflows
4. **Deployment**: Static files can be hosted anywhere (GitHub Pages, CDN, etc.)

### API vs Backend Trade-offs

**Chose HF Dataset Viewer API because:**
- No backend infrastructure needed
- Automatic image serving with CDN
- Built-in pagination support
- Works with any public HF dataset

**Limitations accepted:**
- Image URLs expire (~1 hour)
- 100 rows max per request
- No write capabilities
- Public datasets only (no auth yet)

### UI/UX Principles

1. **Keyboard-first**: Professional users prefer keyboard navigation
2. **Information density**: Show more content, less chrome
3. **Visual diff**: Color-coded changes are easier to scan than side-by-side
4. **Dark mode**: Essential for extended reading sessions
5. **Responsive**: Works on tablets for field work

## Development Approach

### Phase 1: MVP (Completed)
- Basic dataset loading and navigation
- Side-by-side text comparison
- Keyboard shortcuts
- Dark mode

### Phase 2: Enhancements (Completed)
- Three diff algorithms (char/word/line)
- URL state management
- Image error handling with refresh
- Responsive mobile layout

### Phase 3: Polish (Completed)
- Fixed dark mode contrast issues
- Optimized performance with direct indexing
- Added loading states and error handling
- Comprehensive documentation

## Common Tasks

### Adding Column Name Patterns
```javascript
// In dataset-api.js detectColumns() method
if (!originalTextColumn && ['your_column_name'].includes(name)) {
    originalTextColumn = name;
}
```

### Adding Keyboard Shortcuts
```javascript
// In app.js setupKeyboardNavigation()
case 'your_key':
    // Your action
    break;
```

### Customizing Diff Colors
```javascript
// In diff-utils.js
// Light mode: bg-red-200, text-red-800
// Dark mode: bg-red-950, text-red-300
```

## Performance Optimizations

1. **Direct Dataset Indexing**: Uses `dataset[index]` instead of loading batches into memory
2. **Smart Caching**: Caches API responses for 45 minutes (conservative for signed URLs)
3. **Batch Fetching**: Loads 100 rows at once, caches for smooth navigation
4. **Lazy Loading**: Only fetches data when needed
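
The batch-fetching and lazy-loading points can be sketched together: a row is looked up by index, and the 100-row batch containing it is requested only on first use. Everything here is an assumption for illustration: `BatchedRows` is a hypothetical class, and `fetchBatch(offset, length)` is a stand-in for whatever request the real wrapper makes.

```javascript
// Hypothetical sketch: fetchBatch(offset, length) stands in for a rows request.
const BATCH_SIZE = 100; // matches the 100-row per-request limit

class BatchedRows {
    constructor(fetchBatch, batchSize = BATCH_SIZE) {
        this.fetchBatch = fetchBatch;
        this.batchSize = batchSize;
        this.batches = new Map(); // batch offset -> Promise of an array of rows
    }

    async getRow(index) {
        // Offset of the batch that contains this index (0, 100, 200, ...).
        const offset = Math.floor(index / this.batchSize) * this.batchSize;
        if (!this.batches.has(offset)) {
            // Cache the promise so concurrent callers share one request.
            const pending = Promise.resolve(this.fetchBatch(offset, this.batchSize));
            // Forget failed batches so a later call can retry.
            pending.catch(() => this.batches.delete(offset));
            this.batches.set(offset, pending);
        }
        const rows = await this.batches.get(offset);
        return rows[index - offset];
    }
}
```

Stepping through samples 0 to 99 with this shape would trigger a single request, and the next request would only happen on reaching sample 100.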

## Known Issues & Solutions

### Issue: Navigation buttons were disabled
**Cause**: API response structure wasn't parsed correctly
**Fix**: Updated getTotalRows() to check `size.config.num_rows` and `size.splits[0].num_rows`
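
A defensive lookup along the lines of this fix might look like the sketch below. It assumes only the two paths named above (`size.config.num_rows` and `size.splits[0].num_rows`); the function name `extractTotalRows`, the order in which the paths are tried, and the `null` fallback are hypothetical choices, not the actual `getTotalRows()` code.

```javascript
// Hypothetical sketch of the two lookups named in the fix.
function extractTotalRows(response) {
    const size = response?.size;

    // Assumption: prefer the config-level count when it is present.
    const fromConfig = size?.config?.num_rows;
    if (Number.isInteger(fromConfig)) return fromConfig;

    // Otherwise fall back to the first split's count.
    const fromSplit = size?.splits?.[0]?.num_rows;
    if (Number.isInteger(fromSplit)) return fromSplit;

    return null; // unknown total; the caller decides how to surface this
}
```

Returning `null` instead of `0` lets the caller tell "empty dataset" apart from "could not parse the response", which is the situation that left the navigation buttons disabled.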

### Issue: Dark mode text unreadable
**Cause**: Insufficient contrast in diff highlighting and code blocks
**Fix**:
- Changed diff colors to use `dark:bg-red-950` and `dark:text-red-300`
- Added explicit `text-gray-900 dark:text-gray-100` to all text containers

### Issue: Image loading errors
**Cause**: Signed URLs expire after ~1 hour
**Fix**: Implemented handleImageError() with automatic URL refresh

## Future Enhancements

- [ ] Search/filter within dataset
- [ ] Bookmark favorite samples
- [ ] Export selected texts
- [ ] Support for private datasets (auth)
- [ ] Metrics display (CER/WER)
- [ ] Batch operations
- [ ] PWA support for offline viewing

## Deployment

### Static Hosting (Recommended)
```bash
# Any static file server works; run one of these from the project root
python3 -m http.server 8000
npx serve .
```

### GitHub Pages
1. Push to GitHub repository
2. Enable Pages in settings
3. Access at: `https://[username].github.io/[repo]/ocr-text-explorer/`

### CDN Deployment
- Upload files to any CDN
- No server-side processing needed
- Works with Cloudflare, Netlify, Vercel, etc.

## Testing Datasets

Known working datasets:
- `davanstrien/exams-ocr` - Default dataset with great examples
- Any dataset with image + text columns

Column patterns automatically detected:
- Original: `text`, `ocr`, `original_text`, `ground_truth`
- Improved: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`
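
The column patterns listed above could be matched against a row roughly as follows. This is a sketch, not the real `detectColumns()` method: `detectTextColumns` is a hypothetical name, and preferring earlier patterns in each list when several match is an assumption made here.

```javascript
// Pattern lists copied from the documentation above.
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// Hypothetical helper: returns the first matching column name per role, or null.
function detectTextColumns(row) {
    const names = Object.keys(row);
    // Assumption: earlier patterns take priority when several columns match.
    const firstMatch = (patterns) => patterns.find((p) => names.includes(p)) ?? null;
    return {
        originalText: firstMatch(ORIGINAL_PATTERNS),
        improvedText: firstMatch(IMPROVED_PATTERNS),
    };
}
```

A `null` result for either role is a signal that the dataset uses a column name outside these lists and a new pattern needs to be added.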
css/styles.css ADDED
@@ -0,0 +1,197 @@
/**
 * Custom styles for OCR Text Explorer
 * Extends Tailwind CSS with specific styling needs
 */

/* Custom scrollbar styling */
::-webkit-scrollbar {
    width: 8px;
    height: 8px;
}

::-webkit-scrollbar-track {
    @apply bg-gray-100 dark:bg-gray-800;
}

::-webkit-scrollbar-thumb {
    @apply bg-gray-400 dark:bg-gray-600 rounded;
}

::-webkit-scrollbar-thumb:hover {
    @apply bg-gray-500 dark:bg-gray-500;
}

/* Firefox scrollbar */
* {
    scrollbar-width: thin;
    scrollbar-color: theme('colors.gray.400') theme('colors.gray.100');
}

.dark * {
    scrollbar-color: theme('colors.gray.600') theme('colors.gray.800');
}

/* Smooth transitions for theme switching */
body {
    transition: background-color 0.3s ease, color 0.3s ease;
}

/* Image panel sticky positioning adjustment */
.sticky {
    position: -webkit-sticky;
    position: sticky;
}

/* Diff content styling */
.diff-content {
    line-height: 1.6;
    word-break: break-word;
}

/* Keyboard hint styling */
kbd {
    @apply inline-block px-2 py-1 text-xs font-semibold text-gray-800 bg-gray-100 border border-gray-300 rounded dark:bg-gray-700 dark:text-gray-200 dark:border-gray-600;
    box-shadow: 0 1px 0 rgba(0, 0, 0, 0.1);
}

/* Loading spinner animation (in case Tailwind's animate-spin needs adjustment) */
@keyframes spin {
    to {
        transform: rotate(360deg);
    }
}

.animate-spin {
    animation: spin 1s linear infinite;
}

/* Tab hover effect */
nav button {
    position: relative;
    transition: color 0.2s ease;
}

nav button::after {
    content: '';
    position: absolute;
    bottom: -2px;
    left: 0;
    right: 0;
    height: 2px;
    background-color: transparent;
    transition: background-color 0.2s ease;
}

nav button:hover::after {
    @apply bg-gray-300 dark:bg-gray-600;
}

/* Image loading state */
img {
    @apply bg-gray-200 dark:bg-gray-700;
    min-height: 200px;
}

img[src=""] {
    visibility: hidden;
}

/* Print styles */
@media print {
    header, footer {
        display: none !important;
    }

    .no-print {
        display: none !important;
    }

    main {
        height: auto !important;
    }

    .diff-content {
        page-break-inside: avoid;
    }
}

/* Responsive adjustments */
@media (max-width: 768px) {
    /* Stack panels vertically on mobile */
    main.flex {
        @apply flex-col;
    }

    /* Full width for panels on mobile */
    main > div:first-child {
        @apply w-full max-h-96;
    }

    /* Adjust text size */
    .prose-sm {
        @apply text-xs;
    }

    /* Hide keyboard hints on mobile */
    footer .text-sm:last-child {
        @apply hidden;
    }
}

/* Focus styles for accessibility */
button:focus, input:focus, select:focus {
    @apply outline-none ring-2 ring-blue-500 ring-offset-2 dark:ring-offset-gray-900;
}

/* Custom tooltip styles (if needed later) */
.tooltip {
    @apply invisible absolute z-10 px-2 py-1 text-xs text-white bg-gray-900 rounded shadow-lg dark:bg-gray-700;
}

.tooltip-trigger:hover .tooltip {
    @apply visible;
}

/* Preserve whitespace in diff views */
.whitespace-pre-wrap {
    white-space: pre-wrap;
    word-wrap: break-word;
}

/* Enhanced diff highlighting with better dark mode contrast */
.diff-delete {
    @apply bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300;
    text-decoration: line-through;
    text-decoration-color: currentColor;
    text-decoration-thickness: 2px;
}

.diff-insert {
    @apply bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300;
    position: relative;
}

/* Dark mode specific improvements */
.dark .prose {
    @apply text-gray-200;
}

.dark .prose h3 {
    @apply text-gray-100;
}

/* Remove this - handled inline with classes
.dark pre {
    @apply bg-gray-800 text-gray-200;
} */

/* Line numbers for future enhancement */
.line-numbers {
    counter-reset: line;
}

.line-numbers > div::before {
    counter-increment: line;
    content: counter(line);
    @apply inline-block w-12 mr-4 text-right text-gray-400 dark:text-gray-600 select-none;
}
js/app.js ADDED
@@ -0,0 +1,550 @@
/**
 * Main Alpine.js application for OCR Text Explorer
 */

document.addEventListener('alpine:init', () => {
    Alpine.data('ocrExplorer', () => ({
        // Dataset state
        datasetId: 'davanstrien/exams-ocr',
        datasetConfig: 'default',
        datasetSplit: 'train',

        // Navigation state
        currentIndex: 0,
        totalSamples: null,
        currentSample: null,
        jumpToPage: '',

        // UI state
        loading: false,
        error: null,
        activeTab: 'comparison',
        diffMode: 'char',
        darkMode: false,
        showAbout: false,
        showFlowView: false,
        showDock: false,

        // Flow view state
        flowItems: [],
        flowStartIndex: 0,
        flowVisibleCount: 7,
        flowOffset: 0,

        // Dock state
        dockItems: [],
        dockHideTimeout: null,
        dockStartIndex: 0,
        dockVisibleCount: 10,

        // Computed diff HTML
        diffHtml: '',

        // Statistics
        similarity: 0,
        charStats: { total: 0, added: 0, removed: 0 },
        wordStats: { original: 0, improved: 0 },

        // API instance
        api: null,

        async init() {
            // Initialize API
            this.api = new DatasetAPI();

            // Apply dark mode from localStorage
            this.darkMode = localStorage.getItem('darkMode') === 'true';
            this.$watch('darkMode', value => {
                localStorage.setItem('darkMode', value);
                document.documentElement.classList.toggle('dark', value);
            });
            document.documentElement.classList.toggle('dark', this.darkMode);

            // Setup keyboard navigation
            this.setupKeyboardNavigation();

            // Load initial dataset
            await this.loadDataset();
        },

        setupKeyboardNavigation() {
            document.addEventListener('keydown', (e) => {
                // Ignore if user is typing in input
                if (e.target.tagName === 'INPUT') return;

                switch (e.key) {
                    case 'ArrowLeft':
                        e.preventDefault();
                        if (e.shiftKey && this.showDock) {
                            this.scrollDockLeft();
                        } else {
                            this.previousSample();
                        }
                        break;
                    case 'ArrowRight':
                        e.preventDefault();
                        if (e.shiftKey && this.showDock) {
                            this.scrollDockRight();
                        } else {
                            this.nextSample();
                        }
                        break;
                    case 'k':
                    case 'K':
                        e.preventDefault();
                        this.previousSample();
                        break;
                    case 'j':
                    case 'J':
                        e.preventDefault();
                        this.nextSample();
                        break;
                    case '1':
                        this.activeTab = 'comparison';
                        break;
                    case '2':
                        this.activeTab = 'diff';
                        break;
                    case '3':
                        this.activeTab = 'improved';
                        break;
                    case 'v':
                    case 'V':
                        // Toggle dock with V key
                        if (this.showDock) {
                            this.hideDockPreview();
                        } else {
                            this.showDockPreview();
                        }
                        break;
                }
            });
        },

        async loadDataset() {
            this.loading = true;
            this.error = null;

            try {
                // Validate dataset
                await this.api.validateDataset(this.datasetId);

                // Get dataset info
                const info = await this.api.getDatasetInfo(this.datasetId);
                this.datasetConfig = info.defaultConfig;
                this.datasetSplit = info.defaultSplit;

                // Get total rows
                this.totalSamples = await this.api.getTotalRows(
                    this.datasetId,
                    this.datasetConfig,
                    this.datasetSplit
                );

                // Load first sample
                this.currentIndex = 0;
                await this.loadSample(0);

            } catch (error) {
                this.error = error.message;
            } finally {
                this.loading = false;
            }
        },

        async loadSample(index) {
            try {
                const data = await this.api.getRow(
                    this.datasetId,
                    this.datasetConfig,
                    this.datasetSplit,
                    index
                );

                this.currentSample = data.row;
                this.currentIndex = index;

                // Update diff when sample changes
                this.updateDiff();

                // Update URL without triggering navigation
                const url = new URL(window.location);
                url.searchParams.set('dataset', this.datasetId);
                url.searchParams.set('index', index);
                window.history.replaceState({}, '', url);

            } catch (error) {
                this.error = `Failed to load sample: ${error.message}`;
            }
        },

        async nextSample() {
            if (this.currentIndex < this.totalSamples - 1) {
                await this.loadSample(this.currentIndex + 1);
            }
        },

        async previousSample() {
            if (this.currentIndex > 0) {
                await this.loadSample(this.currentIndex - 1);
            }
        },

        async jumpToSample() {
            // Parse as base 10 so input such as "0x10" is not read as hexadecimal
            const pageNum = parseInt(this.jumpToPage, 10);
            if (!isNaN(pageNum) && pageNum >= 1 && pageNum <= this.totalSamples) {
                // Convert 1-based page number to 0-based index
                await this.loadSample(pageNum - 1);
                // Clear the input after jumping
                this.jumpToPage = '';
            } else {
                // Invalid or out-of-range page number: just reset the input
                this.jumpToPage = '';
            }
        },

        getOriginalText() {
            if (!this.currentSample) return '';
            const columns = this.api.detectColumns(null, this.currentSample);
            return this.currentSample[columns.originalText] || 'No original text found';
        },

        getImprovedText() {
            if (!this.currentSample) return '';
            const columns = this.api.detectColumns(null, this.currentSample);
            return this.currentSample[columns.improvedText] || 'No improved text found';
        },

        getImageData() {
            if (!this.currentSample) return null;
            const columns = this.api.detectColumns(null, this.currentSample);
            return columns.image ? this.currentSample[columns.image] : null;
        },

        getImageSrc() {
            const imageData = this.getImageData();
            return imageData?.src || '';
        },

        getImageDimensions() {
            const imageData = this.getImageData();
            if (imageData?.width && imageData?.height) {
                return `${imageData.width}×${imageData.height}`;
            }
            return null;
        },

        updateDiff() {
            const original = this.getOriginalText();
            const improved = this.getImprovedText();

            // Calculate statistics
            this.calculateStatistics(original, improved);

            // Use diff utility based on mode
            switch (this.diffMode) {
                case 'char':
                    this.diffHtml = createCharacterDiff(original, improved);
                    break;
                case 'word':
                    this.diffHtml = createWordDiff(original, improved);
                    break;
                case 'line':
                    this.diffHtml = createLineDiff(original, improved);
                    break;
            }
        },

        calculateStatistics(original, improved) {
            // Calculate similarity
            this.similarity = calculateSimilarity(original, improved);

            // Character statistics
            const charDiff = this.getCharacterDiffStats(original, improved);
            this.charStats = charDiff;

            // Word statistics
            const originalWords = original.split(/\s+/).filter(w => w.length > 0);
            const improvedWords = improved.split(/\s+/).filter(w => w.length > 0);
            this.wordStats = {
                original: originalWords.length,
                improved: improvedWords.length
            };
        },

        getCharacterDiffStats(original, improved) {
            const dp = computeLCS(original, improved);
            const diff = buildDiff(original, improved, dp);

            let added = 0;
            let removed = 0;
            let unchanged = 0;

            for (const part of diff) {
                if (part.type === 'insert') {
                    added += part.value.length;
                } else if (part.type === 'delete') {
                    removed += part.value.length;
                } else {
                    unchanged += part.value.length;
                }
            }

            return {
                total: original.length,
                added: added,
                removed: removed,
                unchanged: unchanged
            };
        },

        async handleImageError(event) {
            // Try to refresh the image URL
            console.log('Image failed to load, refreshing URL...');
            try {
                const data = await this.api.refreshImageUrl(
                    this.datasetId,
                    this.datasetConfig,
                    this.datasetSplit,
                    this.currentIndex
                );

                // Update the image source
                if (data.row && data.row[this.api.detectColumns(null, data.row).image]?.src) {
                    event.target.src = data.row[this.api.detectColumns(null, data.row).image].src;
                }
            } catch (error) {
                console.error('Failed to refresh image URL:', error);
                // Clear the source; the img[src=""] rule in styles.css then hides the element
                event.target.src = '';
            }
        },

        exportComparison() {
            const original = this.getOriginalText();
            const improved = this.getImprovedText();
            const metadata = {
                dataset: this.datasetId,
                page: this.currentIndex + 1,
                totalPages: this.totalSamples,
                exportDate: new Date().toISOString(),
                similarity: `${this.similarity}%`,
                statistics: {
                    characters: this.charStats,
                    words: this.wordStats
                }
            };

            // Create export content
            let content = `OCR Text Comparison Export\n`;
            content += `==========================\n\n`;
            content += `Dataset: ${metadata.dataset}\n`;
            content += `Page: ${metadata.page} of ${metadata.totalPages}\n`;
            content += `Export Date: ${new Date().toLocaleString()}\n`;
            content += `Similarity: ${metadata.similarity}\n`;
            content += `Characters: ${metadata.statistics.characters.total} total, `;
            content += `${metadata.statistics.characters.added} added, `;
            content += `${metadata.statistics.characters.removed} removed\n`;
            content += `Words: ${metadata.statistics.words.original} → ${metadata.statistics.words.improved}\n`;
            content += `\n${'='.repeat(50)}\n\n`;
            content += `ORIGINAL OCR:\n`;
            content += `${'='.repeat(50)}\n`;
            content += original;
            content += `\n\n${'='.repeat(50)}\n\n`;
            content += `IMPROVED OCR:\n`;
            content += `${'='.repeat(50)}\n`;
            content += improved;

            // Download file
            const blob = new Blob([content], { type: 'text/plain' });
            const url = URL.createObjectURL(blob);
            const a = document.createElement('a');
            a.href = url;
            a.download = `ocr-comparison-${this.datasetId.replace('/', '-')}-page-${this.currentIndex + 1}.txt`;
            document.body.appendChild(a);
            a.click();
            document.body.removeChild(a);
            URL.revokeObjectURL(url);
        },

        // Flow view methods
        async toggleFlowView() {
            this.showFlowView = !this.showFlowView;
            if (this.showFlowView) {
                // Reset to center around current page when opening
                this.flowStartIndex = Math.max(0, this.currentIndex - Math.floor(this.flowVisibleCount / 2));
                await this.loadFlowItems();
            }
        },

        async loadFlowItems() {
            // Load thumbnails from flowStartIndex
            const startIdx = this.flowStartIndex;
            this.flowItems = [];

            // Load visible items
            for (let i = 0; i < this.flowVisibleCount && (startIdx + i) < this.totalSamples; i++) {
                const idx = startIdx + i;
                try {
                    const data = await this.api.getRow(
                        this.datasetId,
                        this.datasetConfig,
                        this.datasetSplit,
                        idx
                    );

                    const columns = this.api.detectColumns(null, data.row);
                    const imageData = columns.image ? data.row[columns.image] : null;

                    this.flowItems.push({
                        index: idx,
                        imageSrc: imageData?.src || '',
                        row: data.row
                    });
                } catch (error) {
                    console.error(`Failed to load flow item ${idx}:`, error);
                }
            }
        },

        scrollFlowLeft() {
            if (this.flowStartIndex > 0) {
                this.flowStartIndex = Math.max(0, this.flowStartIndex - this.flowVisibleCount);
                this.loadFlowItems();
            }
        },

        scrollFlowRight() {
            if (this.flowStartIndex < this.totalSamples - this.flowVisibleCount) {
                this.flowStartIndex = Math.min(
                    this.totalSamples - this.flowVisibleCount,
                    this.flowStartIndex + this.flowVisibleCount
                );
                this.loadFlowItems();
            }
        },

        async jumpToFlowPage(index) {
            this.showFlowView = false;
            await this.loadSample(index);
        },

        async handleFlowImageError(event, index) {
            // Try to refresh the image URL for flow item
            try {
                const data = await this.api.refreshImageUrl(
                    this.datasetId,
                    this.datasetConfig,
                    this.datasetSplit,
                    index
                );

                if (data.row) {
                    const columns = this.api.detectColumns(null, data.row);
                    const imageData = columns.image ? data.row[columns.image] : null;
                    if (imageData?.src) {
                        event.target.src = imageData.src;
                        // Update the flow item
                        const flowItem = this.flowItems.find(item => item.index === index);
                        if (flowItem) {
                            flowItem.imageSrc = imageData.src;
                        }
                    }
                }
            } catch (error) {
                console.error('Failed to refresh flow image URL:', error);
            }
        },

        // Dock methods
        async showDockPreview() {
            // Clear any hide timeout
            if (this.dockHideTimeout) {
                clearTimeout(this.dockHideTimeout);
                this.dockHideTimeout = null;
            }

            this.showDock = true;

            // Center dock around current page
            this.dockStartIndex = Math.max(0,
                Math.min(
                    this.currentIndex - Math.floor(this.dockVisibleCount / 2),
                    this.totalSamples - this.dockVisibleCount
                )
            );

            // Always reload dock items to show current position
            await this.loadDockItems();
        },

        hideDockPreview() {
            // Add a small delay to prevent flickering
            this.dockHideTimeout = setTimeout(() => {
                this.showDock = false;
            }, 300);
        },

        async loadDockItems() {
            // Load thumbnails based on dock start index
            const endIdx = Math.min(this.totalSamples, this.dockStartIndex + this.dockVisibleCount);

            this.dockItems = [];

            for (let i = this.dockStartIndex; i < endIdx; i++) {
                try {
                    const data = await this.api.getRow(
                        this.datasetId,
                        this.datasetConfig,
                        this.datasetSplit,
                        i
                    );

                    const columns = this.api.detectColumns(null, data.row);
                    const imageData = columns.image ? data.row[columns.image] : null;

                    this.dockItems.push({
                        index: i,
                        imageSrc: imageData?.src || '',
                        row: data.row
                    });
                } catch (error) {
                    console.error(`Failed to load dock item ${i}:`, error);
                }
            }
        },

        async scrollDockLeft() {
            if (this.dockStartIndex > 0) {
                this.dockStartIndex = Math.max(0, this.dockStartIndex - Math.floor(this.dockVisibleCount / 2));
                await this.loadDockItems();
            }
        },

        async scrollDockRight() {
            if (this.dockStartIndex < this.totalSamples - this.dockVisibleCount) {
                this.dockStartIndex = Math.min(
                    this.totalSamples - this.dockVisibleCount,
                    this.dockStartIndex + Math.floor(this.dockVisibleCount / 2)
                );
                await this.loadDockItems();
            }
        },

        async jumpToDockPage(index) {
            this.showDock = false;
            await this.loadSample(index);
        },

        // Watch for diff mode changes
        initWatchers() {
            this.$watch('diffMode', () => this.updateDiff());
            this.$watch('currentSample', () => this.updateDiff());
        }
    }));
});

// Initialize watchers after Alpine loads
document.addEventListener('alpine:initialized', () => {
    Alpine.store('ocrExplorer')?.initWatchers?.();
});
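The flow and dock methods above all compute a start index for a fixed-size window of thumbnails and clamp it to the dataset bounds. The following is a minimal sketch of that arithmetic as pure functions, which makes the edge cases easier to check in isolation. The helper names `centeredWindowStart` and `shiftWindowStart` are hypothetical and do not appear in `js/app.js`; the sketch also clamps the upper bound to zero when the dataset is smaller than the window, which is an assumption about the desired behavior.

```javascript
// Hypothetical helpers; these names are not part of js/app.js.

// First index of a window of `visibleCount` thumbnails centered on
// `currentIndex`, clamped so the window never runs past either end.
function centeredWindowStart(currentIndex, visibleCount, total) {
  const centered = currentIndex - Math.floor(visibleCount / 2);
  const lastStart = Math.max(0, total - visibleCount);
  return Math.max(0, Math.min(centered, lastStart));
}

// Move the window by `step` items (negative values scroll left),
// applying the same clamping.
function shiftWindowStart(start, step, visibleCount, total) {
  const lastStart = Math.max(0, total - visibleCount);
  return Math.max(0, Math.min(start + step, lastStart));
}

console.log(centeredWindowStart(50, 7, 100));
console.log(centeredWindowStart(99, 7, 100));
console.log(shiftWindowStart(90, 7, 7, 100));
```

Keeping the clamp in one place would let the flow view and the dock share it, although the committed code inlines the expressions in each method.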
js/dataset-api.js
ADDED
|
@@ -0,0 +1,273 @@
/**
 * HuggingFace Dataset Viewer API wrapper
 * Handles fetching data from the datasets-server API with caching and error handling
 */

class DatasetAPI {
    constructor() {
        this.baseURL = 'https://datasets-server.huggingface.co';
        this.cache = new Map();
        this.cacheExpiry = 45 * 60 * 1000; // 45 minutes (conservative for signed URLs)
        this.rowsPerFetch = 100; // API maximum
    }

    /**
     * Check if a dataset is valid and has viewer enabled
     */
    async validateDataset(datasetId) {
        try {
            const response = await fetch(`${this.baseURL}/is-valid?dataset=${encodeURIComponent(datasetId)}`);
            if (!response.ok) {
                throw new Error(`Failed to validate dataset: ${response.statusText}`);
            }
            const data = await response.json();

            if (!data.viewer) {
                throw new Error('Dataset viewer is not available for this dataset');
            }

            return true;
        } catch (error) {
            throw new Error(`Dataset validation failed: ${error.message}`);
        }
    }

    /**
     * Get dataset info including splits and configs
     */
    async getDatasetInfo(datasetId) {
        const cacheKey = `info_${datasetId}`;
        const cached = this.getFromCache(cacheKey);
        if (cached) return cached;

        try {
            const response = await fetch(`${this.baseURL}/splits?dataset=${encodeURIComponent(datasetId)}`);
            if (!response.ok) {
                throw new Error(`Failed to get dataset info: ${response.statusText}`);
            }
            const data = await response.json();

            // Extract the default config and split
            const defaultConfig = data.splits[0]?.config || 'default';
            const defaultSplit = data.splits.find(s => s.split === 'train')?.split || data.splits[0]?.split || 'train';

            const info = {
                configs: [...new Set(data.splits.map(s => s.config))],
                splits: [...new Set(data.splits.map(s => s.split))],
                defaultConfig,
                defaultSplit,
                raw: data
            };

            this.setCache(cacheKey, info);
            return info;
        } catch (error) {
            throw new Error(`Failed to get dataset info: ${error.message}`);
        }
    }

    /**
     * Get the total number of rows in a dataset
     */
    async getTotalRows(datasetId, config, split) {
        const cacheKey = `size_${datasetId}_${config}_${split}`;
        const cached = this.getFromCache(cacheKey);
        if (cached) return cached;

        try {
            // First try to get from the size endpoint
            const sizeResponse = await fetch(
                `${this.baseURL}/size?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}`
            );

            if (sizeResponse.ok) {
                const sizeData = await sizeResponse.json();
                // Prefer the entry for the requested split; size.config.num_rows
                // covers every split in the config, so use it only as a fallback
                const size = sizeData.size?.splits?.find(s => s.split === split)?.num_rows ||
                    sizeData.size?.config?.num_rows ||
                    0;
                this.setCache(cacheKey, size);
                return size;
            }

            // Fallback: get first rows and check num_rows_total
            const rowsResponse = await fetch(
                `${this.baseURL}/first-rows?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}`
            );

            if (!rowsResponse.ok) {
                throw new Error('Unable to determine dataset size');
            }

            const rowsData = await rowsResponse.json();
            const size = rowsData.num_rows_total || rowsData.rows?.length || 0;
            this.setCache(cacheKey, size);
            return size;
        } catch (error) {
            console.warn('Failed to get total rows:', error);
            return null;
        }
    }

    /**
     * Fetch rows from the dataset
     */
    async fetchRows(datasetId, config, split, offset, length = this.rowsPerFetch) {
        const cacheKey = `rows_${datasetId}_${config}_${split}_${offset}_${length}`;
        const cached = this.getFromCache(cacheKey);
        if (cached) return cached;

        try {
            const response = await fetch(
                `${this.baseURL}/rows?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}&offset=${offset}&length=${length}`
            );

            if (!response.ok) {
                if (response.status === 403) {
                    throw new Error('Access denied. This dataset may be private or gated.');
                }
                throw new Error(`Failed to fetch rows: ${response.statusText}`);
            }

            const data = await response.json();

            // Extract column information
            const columns = this.detectColumns(data.features, data.rows[0]?.row);

            const result = {
                rows: data.rows,
                features: data.features,
                columns,
                numRowsTotal: data.num_rows_total,
                partial: data.partial || false
            };

            this.setCache(cacheKey, result);
            return result;
        } catch (error) {
            throw new Error(`Failed to fetch rows: ${error.message}`);
        }
    }

    /**
     * Get a single row by index with smart batching
     */
    async getRow(datasetId, config, split, index) {
        // Calculate which batch this index falls into
        const batchStart = Math.floor(index / this.rowsPerFetch) * this.rowsPerFetch;
        const batchData = await this.fetchRows(datasetId, config, split, batchStart, this.rowsPerFetch);

        const localIndex = index - batchStart;
        if (localIndex >= 0 && localIndex < batchData.rows.length) {
            return {
                row: batchData.rows[localIndex].row,
                columns: batchData.columns,
                numRowsTotal: batchData.numRowsTotal
            };
        }

        throw new Error(`Row ${index} not found`);
    }

    /**
     * Detect column names for image and text data
     */
    detectColumns(features, sampleRow) {
        // Candidate names shared by both detection paths so they stay consistent
        const originalCandidates = ['text', 'ocr', 'original_text', 'original', 'ground_truth'];
        const improvedCandidates = ['markdown', 'new_ocr', 'corrected_text', 'improved', 'vlm_ocr', 'corrected'];

        let imageColumn = null;
        let originalTextColumn = null;
        let improvedTextColumn = null;

        // Try to detect from features first
        for (const feature of features || []) {
            const name = feature.name;
            const type = feature.type;

            // Detect image column
            if (type._type === 'Image' || type.dtype === 'image' || type.feature?._type === 'Image') {
                imageColumn = name;
            }

            // Detect text columns based on common patterns
            if (!originalTextColumn && originalCandidates.includes(name)) {
                originalTextColumn = name;
            }

            if (!improvedTextColumn && improvedCandidates.includes(name)) {
                improvedTextColumn = name;
            }
        }

        // Fallback: detect from sample row
        if (sampleRow) {
            const keys = Object.keys(sampleRow);

            if (!imageColumn) {
                for (const key of keys) {
                    if (sampleRow[key]?.src && sampleRow[key]?.height !== undefined) {
                        imageColumn = key;
                        break;
                    }
                }
            }

            // Additional text column detection from row data
            if (!originalTextColumn) {
                originalTextColumn = keys.find(k => originalCandidates.includes(k)) || null;
            }

            if (!improvedTextColumn) {
                improvedTextColumn = keys.find(k => improvedCandidates.includes(k)) || null;
            }
        }

        return {
            image: imageColumn,
            originalText: originalTextColumn,
            improvedText: improvedTextColumn
        };
    }

    /**
     * Refresh expired image URL by re-fetching the row
     */
    async refreshImageUrl(datasetId, config, split, index) {
        // Clear cache for this specific row batch
        const batchStart = Math.floor(index / this.rowsPerFetch) * this.rowsPerFetch;
        const cacheKey = `rows_${datasetId}_${config}_${split}_${batchStart}_${this.rowsPerFetch}`;
        this.cache.delete(cacheKey);

        // Re-fetch the row
        return await this.getRow(datasetId, config, split, index);
    }

    /**
     * Cache management utilities
     */
    getFromCache(key) {
        const cached = this.cache.get(key);
        if (!cached) return null;

        if (Date.now() - cached.timestamp > this.cacheExpiry) {
            this.cache.delete(key);
            return null;
        }

        return cached.data;
    }

    setCache(key, data) {
        this.cache.set(key, {
            data,
            timestamp: Date.now()
        });
    }

    clearCache() {
        this.cache.clear();
    }
}

// Export for use in other scripts
window.DatasetAPI = DatasetAPI;
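`DatasetAPI` combines two ideas: every row index is mapped to a batch aligned on `rowsPerFetch`, and each batch is cached with a timestamp so that signed image URLs are not reused after they may have expired. The following is a minimal, network-free sketch of both ideas. `TtlCache` and `batchFor` are hypothetical names introduced for illustration, and the injectable `now` clock is an assumption added so that expiry can be exercised without waiting.

```javascript
// Hypothetical illustration; TtlCache and batchFor are not part of js/dataset-api.js.

// Map with per-entry timestamps; entries older than ttlMs count as missing.
class TtlCache {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, an assumption made for testability
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (this.now() - entry.timestamp > this.ttlMs) {
      this.entries.delete(key); // drop the stale entry on read
      return null;
    }
    return entry.data;
  }

  set(key, data) {
    this.entries.set(key, { data, timestamp: this.now() });
  }
}

// Which aligned batch holds `index`, and where the row sits inside that batch.
function batchFor(index, batchSize = 100) {
  const batchStart = Math.floor(index / batchSize) * batchSize;
  return { batchStart, localIndex: index - batchStart };
}

let fakeTime = 0;
const cache = new TtlCache(1000, () => fakeTime);
cache.set('rows_0_100', ['row data']);
fakeTime = 1001;
console.log(cache.get('rows_0_100'), batchFor(250));
```

Because batches are aligned, neighboring indices such as 200 through 299 resolve to the same cache key, so paging through a dataset triggers one request per hundred rows rather than one per row.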
js/diff-utils.js
ADDED
|
@@ -0,0 +1,219 @@
/**
 * Text comparison utilities for OCR Text Explorer
 * Provides character, word, and line-level diff visualization
 */

/**
 * Create character-level diff with inline highlighting
 */
function createCharacterDiff(original, improved) {
    if (!original || !improved) {
        return '<p class="text-gray-500">No text to compare</p>';
    }

    const dp = computeLCS(original, improved);
    const diff = buildDiff(original, improved, dp);

    let html = '<div class="font-mono text-sm whitespace-pre-wrap text-gray-900 dark:text-gray-100">';

    for (const part of diff) {
        if (part.type === 'equal') {
            html += escapeHtml(part.value);
        } else if (part.type === 'delete') {
            html += `<span class="bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(part.value)}</span>`;
        } else if (part.type === 'insert') {
            html += `<span class="bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(part.value)}</span>`;
        }
    }

    html += '</div>';
    return html;
}

/**
 * Create word-level diff
 */
function createWordDiff(original, improved) {
    if (!original || !improved) {
        return '<p class="text-gray-500">No text to compare</p>';
    }

    // Split into words while preserving whitespace
    const originalWords = splitIntoWords(original);
    const improvedWords = splitIntoWords(improved);

    const dp = computeLCS(originalWords, improvedWords);
    const diff = buildDiff(originalWords, improvedWords, dp);

    let html = '<div class="font-mono text-sm whitespace-pre-wrap text-gray-900 dark:text-gray-100">';

    for (const part of diff) {
        if (part.type === 'equal') {
            html += escapeHtml(part.value.join(''));
        } else if (part.type === 'delete') {
            html += `<span class="bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(part.value.join(''))}</span>`;
        } else if (part.type === 'insert') {
            html += `<span class="bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(part.value.join(''))}</span>`;
        }
    }

    html += '</div>';
    return html;
}

/**
 * Create line-level diff
 */
function createLineDiff(original, improved) {
    if (!original || !improved) {
        return '<p class="text-gray-500">No text to compare</p>';
    }

    const originalLines = original.split('\n');
    const improvedLines = improved.split('\n');

    const dp = computeLCS(originalLines, improvedLines);
    const diff = buildDiff(originalLines, improvedLines, dp);

    let html = '<div class="font-mono text-sm text-gray-900 dark:text-gray-100">';

    for (const part of diff) {
        if (part.type === 'equal') {
            for (const line of part.value) {
                html += `<div class="py-1">${escapeHtml(line)}</div>`;
            }
        } else if (part.type === 'delete') {
            for (const line of part.value) {
                html += `<div class="py-1 bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(line)}</div>`;
            }
        } else if (part.type === 'insert') {
            for (const line of part.value) {
                html += `<div class="py-1 bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(line)}</div>`;
            }
        }
    }

    html += '</div>';
    return html;
}

/**
 * Compute Longest Common Subsequence using dynamic programming
 */
function computeLCS(a, b) {
    const m = a.length;
    const n = b.length;
    const dp = Array(m + 1).fill(null).map(() => Array(n + 1).fill(0));

    for (let i = 1; i <= m; i++) {
        for (let j = 1; j <= n; j++) {
            if (a[i - 1] === b[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
            } else {
                dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
    }

    return dp;
}

/**
 * Build diff from LCS table
 */
function buildDiff(a, b, dp) {
    const diff = [];
    let i = a.length;
    let j = b.length;

    // Walk the table backwards from the bottom-right corner
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
            // Elements are equal
            if (diff.length > 0 && diff[diff.length - 1].type === 'equal') {
                diff[diff.length - 1].value.unshift(a[i - 1]);
            } else {
                diff.push({ type: 'equal', value: [a[i - 1]] });
            }
            i--;
            j--;
        } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
            // Element in b but not in a (insertion)
            if (diff.length > 0 && diff[diff.length - 1].type === 'insert') {
                diff[diff.length - 1].value.unshift(b[j - 1]);
            } else {
                diff.push({ type: 'insert', value: [b[j - 1]] });
            }
            j--;
        } else {
            // Element in a but not in b (deletion)
            if (diff.length > 0 && diff[diff.length - 1].type === 'delete') {
                diff[diff.length - 1].value.unshift(a[i - 1]);
            } else {
                diff.push({ type: 'delete', value: [a[i - 1]] });
            }
            i--;
        }
    }

    // Parts were collected back to front, so restore reading order
    diff.reverse();

    // Convert arrays to strings for character diff
    if (typeof a === 'string') {
        diff.forEach(part => {
            part.value = part.value.join('');
        });
    }

    return diff;
}

/**
 * Split text into words while preserving whitespace
 */
function splitIntoWords(text) {
    const words = [];
    let current = '';
    let inWord = false;

    for (const char of text) {
        if (/\s/.test(char)) {
            if (inWord && current) {
                words.push(current);
                current = '';
                inWord = false;
            }
            words.push(char);
        } else {
            current += char;
            inWord = true;
        }
    }

    if (current) {
        words.push(current);
    }

    return words;
}

/**
 * Escape HTML special characters
 */
function escapeHtml(text) {
    const div = document.createElement('div');
    div.textContent = text;
    return div.innerHTML;
}

/**
 * Calculate similarity percentage between two texts
 */
function calculateSimilarity(original, improved) {
    if (!original || !improved) return 0;

    const dp = computeLCS(original, improved);
    const lcsLength = dp[original.length][improved.length];
    const maxLength = Math.max(original.length, improved.length);

    return Math.round((lcsLength / maxLength) * 100);
}
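`calculateSimilarity` only needs the length of the longest common subsequence, yet `computeLCS` allocates the full `(m + 1) × (n + 1)` table because `buildDiff` needs it for backtracking. The sketch below is a variant, not the file's implementation: it keeps just two rows of the table, which is enough when only the length is required. The names `lcsLength` and `similarityPercent` are hypothetical.

```javascript
// Hypothetical variant; lcsLength and similarityPercent are not part of js/diff-utils.js.

// Length of the longest common subsequence using two rolling rows
// instead of the full dynamic-programming table.
function lcsLength(a, b) {
  let prev = new Array(b.length + 1).fill(0);
  let curr = new Array(b.length + 1).fill(0);

  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1]);
    }
    // The row just filled becomes the previous row for the next iteration
    [prev, curr] = [curr, prev];
  }

  return prev[b.length];
}

// Same formula as calculateSimilarity: LCS length over the longer input.
function similarityPercent(original, improved) {
  if (!original || !improved) return 0;
  const maxLength = Math.max(original.length, improved.length);
  return Math.round((lcsLength(original, improved) / maxLength) * 100);
}

console.log(similarityPercent('kitten', 'sitting'));
```

The time cost is unchanged at O(m·n), so this only addresses memory; the diff views still need the full table for backtracking.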