ggunio committed on
Commit a9ec422 · verified · 1 Parent(s): 9040409

Update Space README with Phase 2 status (Sept 21, 2025)

  1. README.md +129 -19
README.md CHANGED
@@ -1,36 +1,146 @@
1
  ---
2
- title: B2NL Tokenizer v6.1.1 Demo
3
  emoji: 🌍
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: gradio
7
- sdk_version: 5.46.1
8
  app_file: app.py
9
  pinned: true
 
10
  models:
11
  - ggunio/B2NL-v6.1.1
12
- license: apache-2.0
13
  ---
14
 
15
- # B2NL v6.1.1: Byte-to-Natural-Language Tokenizer
16
 
17
- ## 🎉 97.71% Reconstruction Achieved!
18
 
19
- This space demonstrates our breakthrough byte-level tokenizer that achieves **100% byte-exact reconstruction** for all tested languages without any vocabulary files.
20
 
21
- ### Key Features
22
- - **No Vocabulary**: Pure byte-level learning
23
- - **97.71% Overall Accuracy**: Near-perfect reconstruction
24
- - **6 Languages**: 100% byte-exact for each
25
- - **301.7M Parameters**: Efficient size
26
- - **Pure Learning**: No linguistic rules
27
 
28
- ### Phase 1 Complete
29
- We've successfully completed Phase 1 training with outstanding results. Phase 2 (compression) starting soon!
30
 
31
- ### Links
32
- - [Model](https://huggingface.co/ggunio/B2NL-v6.1.1)
33
- - [GitHub](https://github.com/Woojiggun/intelligent-tokenizer)
34
 
35
- ### Support Us
36
- We need GPU resources to train on 204 languages. If you can help, please reach out!
 
1
  ---
2
+ title: Intelligent Tokenizer V6 Demo
3
  emoji: 🌍
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: gradio
7
+ sdk_version: 4.0.0
8
  app_file: app.py
9
  pinned: true
10
+ license: apache-2.0
11
  models:
12
  - ggunio/B2NL-v6.1.1
 
13
  ---
14
 
15
+ # 🌍 B2NL v6.1.1: Byte-to-Natural-Language Tokenizer Demo
16
+
17
+ ## 📢 Status Update (2025-09-21)
18
+
19
+ ### ✅ Phase 1: COMPLETE - 97.71% Reconstruction Achieved!
20
+ ### 🔄 Phase 2: IN PROGRESS - Dynamic Compression Training
21
+ ### 📅 Next Update: September 28, 2025 (Phase 2 Results)
22
+
23
+ ---
24
+
25
+ ## 🎯 Current Model Status
26
+
27
+ This demo shows **B2NL (ByToNL) v6.1.1**, a byte-level tokenizer that achieved the following in Phase 1:
28
+ - **97.71% overall reconstruction rate**
29
+ - **100% byte-exact reconstruction** for all 6 test languages
30
+ - **No vocabulary files** - pure byte-level learning
31
+
32
+ ### ⚠️ Important Notes:
33
+ 1. **Current Scope**: 6 languages (NOT 204 yet)
34
+ 2. **Phase 2 Training**: Dynamic compression (1:1 to 50:1) in progress
35
+ 3. **204 Languages**: Will begin AFTER successful validation
36
+
37
+ ---
38
+
39
+ ## 📊 Phase 1 Results (COMPLETE)
40
+
41
+ | Language | Byte-Exact | Character-Level | Edit Similarity | Status |
42
+ |----------|------------|-----------------|-----------------|--------|
43
+ | English | 100.00% | 100.00% | 98.88% | ✅ Perfect |
44
+ | Korean | 100.00% | 100.00% | 97.30% | ✅ Perfect |
45
+ | Japanese | 100.00% | 100.00% | 96.55% | ✅ Perfect |
46
+ | Chinese | 100.00% | 100.00% | 96.30% | ✅ Perfect |
47
+ | Arabic | 100.00% | 100.00% | 98.36% | ✅ Perfect |
48
+ | Spanish | 100.00% | 100.00% | 98.88% | ✅ Perfect |
49
+
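The three metrics in the table are not defined in this README. As a rough sketch of how such scores could be computed for one input/output pair, the Python below treats byte-exact as full equality of the UTF-8 bytes, character-level as the share of aligned positions whose characters match, and edit similarity as `difflib.SequenceMatcher.ratio()`. All three definitions and all function names are assumptions for illustration, not the project's evaluation code.

```python
from difflib import SequenceMatcher


def byte_exact(original: str, reconstructed: str) -> bool:
    """True only if the UTF-8 encodings are identical byte for byte."""
    return original.encode("utf-8") == reconstructed.encode("utf-8")


def char_accuracy(original: str, reconstructed: str) -> float:
    """Share of positions with matching characters (assumed definition).

    The longer string sets the denominator, so missing or extra
    characters count as errors.
    """
    longest = max(len(original), len(reconstructed))
    if longest == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(original, reconstructed))
    return matches / longest


def edit_similarity(original: str, reconstructed: str) -> float:
    """Similarity in [0, 1] from difflib (a stand-in, not Levenshtein)."""
    return SequenceMatcher(None, original, reconstructed).ratio()


if __name__ == "__main__":
    src = "안녕하세요 world"
    out = "안녕하세요 werld"
    print(byte_exact(src, out))
    print(f"{char_accuracy(src, out):.4f}")
    print(f"{edit_similarity(src, out):.4f}")
```

Under these definitions a single wrong byte makes `byte_exact` false while the other two scores stay close to 1, so the three columns measure different things.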
50
+ ---
51
+
52
+ ## 🔄 Phase 2: Compression Training (IN PROGRESS)
53
+
54
+ Currently training with dynamic compression ratios:
55
+ - **High accuracy (>95%)**: Apply 30-50:1 compression
56
+ - **Medium accuracy (90-95%)**: Apply 10-30:1 compression
57
+ - **Low accuracy (<90%)**: Apply 1-10:1 compression
58
+
59
+ **Target**: 3:1 average compression while maintaining >95% reconstruction
60
+
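The accuracy-to-ratio bands above could be expressed as a small lookup. This is a minimal Python sketch, not the training code: the function name `compression_range`, the handling of the exact 90% and 95% boundaries (both placed in the medium band), and returning the band as a `(min, max)` tuple are assumptions.

```python
def compression_range(accuracy_pct: float) -> tuple[int, int]:
    """Map reconstruction accuracy (percent) to a (min, max) compression band.

    Bands follow the list above; values mean N:1 compression.
    Assumption: exactly 90% and exactly 95% fall in the medium band.
    """
    if not 0.0 <= accuracy_pct <= 100.0:
        raise ValueError("accuracy_pct must be between 0 and 100")
    if accuracy_pct > 95.0:
        return (30, 50)   # high accuracy
    if accuracy_pct >= 90.0:
        return (10, 30)   # medium accuracy
    return (1, 10)        # low accuracy


if __name__ == "__main__":
    for acc in (97.71, 92.0, 80.0):
        low, high = compression_range(acc)
        print(f"{acc:.2f}% -> {low}:1 to {high}:1")
```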
61
+ ---
62
+
63
+ ## 🚀 How to Use This Demo
64
+
65
+ 1. **Enter any text** in the input box
66
+ 2. **Choose generation mode**:
67
+ - Teacher Forcing: Better quality (uses ground truth)
68
+ - Autoregressive: Realistic inference
69
+ 3. **Click "Tokenize & Reconstruct"**
70
+ 4. See the reconstruction quality and compression ratio!
71
+
72
+ ---
73
+
74
+ ## 🎯 Key Features
75
+
76
+ ### Zero Vocabulary
77
+ - No vocabulary files needed
78
+ - Works with ANY text (any language, emoji, code)
79
+ - Direct byte-level processing
80
+
81
+ ### Universal Coverage
82
+ - Tested on 6 diverse languages
83
+ - Plans for 204 languages (pending validation)
84
+ - Handles mixed languages seamlessly
85
+
86
+ ### Efficient Architecture
87
+ - 301.7M parameters (lightweight)
88
+ - 5-layer encoder + 8-layer decoder
89
+ - Fast inference on CPU/GPU
90
+
91
+ ---
92
+
93
+ ## 📅 Timeline
94
+
95
+ ### This Week (Sept 21-28, 2025)
96
+ - Phase 2 compression training
97
+ - Dynamic ratio testing
98
+ - Performance monitoring
99
+
100
+ ### Next Week (Sept 28 - Oct 5, 2025)
101
+ - **Phase 2 Results Release**
102
+ - Compression achievements
103
+ - Decision on 204-language expansion
104
+
105
+ ### Future (With GPU Support)
106
+ - 204-language training
107
+ - 2 weeks on A100 needed
108
+ - Full FLORES-200 dataset
109
+
110
+ ---
111
+
112
+ ## 💡 Try These Examples
113
+
114
+ The demo includes examples in:
115
+ - 🇬🇧 English
116
+ - 🇰🇷 Korean
117
+ - 🇯🇵 Japanese
118
+ - 🇨🇳 Chinese
119
+ - 🇸🇦 Arabic
120
+ - 🇪🇸 Spanish
121
+ - 🇫🇷 French
122
+ - 🇷🇺 Russian
123
+ - 🚀 Even emojis!
124
 
125
+ ---
126
 
127
+ ## 📊 Technical Details
128
 
129
+ - **Parameters**: 301,739,670 (301.7M)
130
+ - **Encoder**: 5 layers (768→896→1024→1152→1280)
131
+ - **Decoder**: 8 layers (1280d)
132
+ - **Vocab Size**: 260 (256 bytes + 4 special tokens)
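A 260-entry vocabulary of 256 byte values plus 4 special tokens means any UTF-8 text maps to IDs without a vocabulary file. The sketch below illustrates that mapping in Python. The special-token names and their IDs (256 to 259) are hypothetical, since the README does not list them, and the learned encoder and decoder are not involved here.

```python
# Hypothetical special tokens: the README states there are 4 but not which.
PAD, BOS, EOS, MASK = 256, 257, 258, 259
VOCAB_SIZE = 256 + 4  # 260, as listed above


def encode(text: str) -> list[int]:
    """UTF-8 bytes become IDs 0-255, wrapped in BOS/EOS."""
    return [BOS, *text.encode("utf-8"), EOS]


def decode(ids: list[int]) -> str:
    """Drop special tokens and decode the remaining byte IDs as UTF-8."""
    return bytes(i for i in ids if i < 256).decode("utf-8")


if __name__ == "__main__":
    sample = "Hello, 세계 🌍"
    ids = encode(sample)
    print(len(ids), max(ids) < VOCAB_SIZE)
    print(decode(ids) == sample)
```

This plain byte mapping is lossless by construction. The reconstruction rates reported in this README concern the model's learned round trip, which this sketch does not reproduce.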
 
 
133
 
134
+ ---
135
+
136
+ ## 🤝 Support & Links
137
+
138
+ - **Model**: [ggunio/B2NL-v6.1.1](https://huggingface.co/ggunio/B2NL-v6.1.1)
139
+ - **GitHub**: [Repository](https://github.com/Woojiggun/intelligent-tokenizer)
140
+ - **Paper**: Coming soon after Phase 3
141
+
142
+ ---
143
 
144
+ **Note: This is a research project. The current model covers 6 languages only. The 204-language expansion is pending validation and GPU resources.**
 
 
145
 
146
+ 🚀 Watch this space for Phase 2 results next week!