ggunio commited on
Commit
7e2a1a9
ยท
verified ยท
1 Parent(s): b91522f

Update: B2NL is Tokenizer-Free Revolution

Browse files
Files changed (1) hide show
  1. README.md +56 -54
README.md CHANGED
@@ -1,10 +1,10 @@
1
  ---
2
- title: B2NL Embedding Preprocessor Demo
3
  emoji: ๐Ÿš€
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: gradio
7
- sdk_version: 5.46.1
8
  app_file: app.py
9
  pinned: true
10
  license: apache-2.0
@@ -12,84 +12,86 @@ models:
12
  - ggunio/B2NL-v6.1.1
13
  ---
14
 
15
- # ๐Ÿš€ B2NL v6.1.1: Intelligent Embedding Preprocessor
16
 
17
- ## โš ๏ธ NOT a Tokenizer - It's an Embedding Preprocessor!
18
-
19
- This demo shows how B2NL **reduces the number of embeddings** sent to LLMs by intelligently grouping bytes.
20
 
21
  ---
22
 
23
- ## ๐ŸŽฏ What You're Actually Seeing
24
 
25
- When you input text, B2NL:
26
- 1. **Converts to bytes** (UTF-8)
27
- 2. **Groups bytes intelligently**
28
- 3. **Outputs fewer embeddings**
29
-
30
- ### Example:
31
- ```
32
- Input: "์•ˆ๋…•ํ•˜์„ธ์š”. ์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์ข‹๋„ค์š”."
33
- Traditional: 44 bytes โ†’ 44 embeddings
34
- B2NL Current: 44 bytes โ†’ 18 embeddings (2.4x reduction!)
35
- B2NL Target: 44 bytes โ†’ 4 embeddings (11x reduction!)
36
- ```
37
 
38
  ---
39
 
40
- ## ๐Ÿ“Š Current Performance (Phase 2, Epoch 51)
41
 
42
- | Language | Embeddings Reduction | Goal |
43
- |----------|---------------------|------|
44
- | Korean | 2.4x | 20x |
45
- | Chinese | 3.0x | 15x |
46
- | Japanese | 3.0x | 15x |
47
- | Arabic | 1.8x | 10x |
48
- | English | 1.0x | 5x |
49
- | Spanish | 1.0x | 5x |
50
 
51
  ---
52
 
53
- ## ๐Ÿ’ก Why This Matters
54
 
55
- **For LLMs:**
56
- - 2-20x fewer embeddings to process
57
- - Faster inference
58
- - Less memory usage
59
- - Longer effective context
60
 
61
- **For Users:**
62
- - Same quality (100% reversible)
63
- - Faster responses
64
- - Lower costs
65
 
66
  ---
67
 
68
- ## ๐Ÿ”ฌ Technical Details
69
 
70
- - **Not replacing tokenizers** - Works BEFORE tokenization
71
- - **Language agnostic** - Pure byte-level learning
72
- - **100% reversible** - Perfect reconstruction
73
- - **Dynamic compression** - Adapts to content
 
74
 
75
  ---
76
 
77
- ## ๐Ÿ“ˆ Phase 2 Progress
 
 
 
 
 
 
78
 
79
- Currently training with dynamic compression (1-50:1 ratios):
80
- - High quality text: More compression
81
- - Complex text: Less compression
82
- - Always maintains >95% reconstruction
 
83
 
84
  ---
85
 
86
- ## ๐ŸŽฎ Try It!
87
 
88
- 1. Enter any text
89
- 2. See the reconstruction quality
90
- 3. Check the **compression ratio** (bytes:embeddings)
91
- 4. That ratio = how much faster your LLM could be!
 
 
 
 
92
 
93
  ---
94
 
95
- **Remember: The "tokens" shown are actually embedding groups that would be sent to an LLM, not traditional tokens!**
 
 
 
1
  ---
2
+ title: B2NL Tokenizer-Free Demo
3
  emoji: ๐Ÿš€
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: gradio
7
+ sdk_version: 4.0.0
8
  app_file: app.py
9
  pinned: true
10
  license: apache-2.0
 
12
  - ggunio/B2NL-v6.1.1
13
  ---
14
 
15
+ # ๐Ÿš€ B2NL: The Tokenizer-Free Revolution
16
 
17
+ ## No Vocabulary Files. No Rules. Just Intelligence.
 
 
18
 
19
  ---
20
 
21
+ ## ๐ŸŽฏ What You're Testing
22
 
23
+ **B2NL replaces traditional tokenizers entirely:**
24
+ - Input text โ†’ Bytes โ†’ Intelligent grouping โ†’ Tokens
25
+ - No vocabulary needed (vs GPT's 100K+ vocabulary)
26
+ - Works with ANY language/emoji/symbol
 
 
 
 
 
 
 
 
27
 
28
  ---
29
 
30
+ ## ๐Ÿ“Š Live Compression Stats (Phase 2, Epoch 51)
31
 
32
+ When you type Korean text:
33
+ ```
34
+ "์•ˆ๋…•ํ•˜์„ธ์š”" (4 characters)
35
+ โ†’ Traditional: 12 bytes โ†’ 12 tokens
36
+ โ†’ GPT-4: 12 bytes โ†’ ~5 tokens
37
+ โ†’ B2NL Now: 12 bytes โ†’ 5 tokens (2.4x compression)
38
+ โ†’ B2NL Goal: 12 bytes โ†’ 1 token (12x compression!)
39
+ ```
40
 
41
  ---
42
 
43
+ ## ๐Ÿ’ฌ Try These Examples
44
 
45
+ ### Korean (Watch the compression!):
46
+ - Short: "์•ˆ๋…•ํ•˜์„ธ์š”"
47
+ - Medium: "์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์ข‹๋„ค์š”"
48
+ - Long: "์ธ๊ณต์ง€๋Šฅ์ด ์„ธ์ƒ์„ ๋ฐ”๊พธ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค"
 
49
 
50
+ ### See the "Statistics" box:
51
+ - **Tokens**: Number of embeddings generated
52
+ - **Compression**: How much we compressed (goal: 20:1 for Korean!)
 
53
 
54
  ---
55
 
56
+ ## ๐Ÿ“ˆ Current Performance
57
 
58
+ | What you type | Traditional | B2NL Now | B2NL Target |
59
+ |---------------|-------------|----------|-------------|
60
+ | Korean word | 3-5 tokens | 2 tokens | 0.3 tokens |
61
+ | Chinese char | 1-3 tokens | 1 token | 0.2 tokens |
62
+ | English word | 1-2 tokens | 1 token | 0.5 tokens |
63
 
64
  ---
65
 
66
+ ## ๐Ÿ”ฅ Why This Changes Everything
67
+
68
+ **For LLM Users:**
69
+ - Korean/Chinese/Japanese: 3-20x longer context
70
+ - All languages: Faster inference
71
+ - No tokenizer downloads
72
+ - Perfect reconstruction
73
 
74
+ **For Developers:**
75
+ - No vocabulary management
76
+ - No OOV problems
77
+ - Universal API
78
+ - Tiny model (301M params)
79
 
80
  ---
81
 
82
+ ## ๐ŸŽฎ How to Interpret Results
83
 
84
+ 1. **Reconstruction Accuracy**: Should be 95-100%
85
+ 2. **Token Count**: Lower is better!
86
+ 3. **Compression Ratio**: Higher is better!
87
+
88
+ Current Status:
89
+ - โœ… Phase 1: 97.71% reconstruction (DONE)
90
+ - ๐Ÿ”„ Phase 2: Learning compression (IN PROGRESS)
91
+ - โณ Phase 3: 204 languages (PLANNED)
92
 
93
  ---
94
 
95
+ **Remember: This is replacing tokenizers entirely. The "tokens" shown are intelligent byte groups, not vocabulary lookups!**
96
+
97
+ ๐Ÿš€ **The future is tokenizer-free!**