Spaces:

ggunio
/

intelligent-tokenizer-v6-demo

Sleeping

App Files Files Community

ggunio commited on Sep 20

Commit

7e2a1a9

verified ·

1 Parent(s): b91522f

Update: B2NL is Tokenizer-Free Revolution

Browse files

Files changed (1) hide show

README.md +56 -54

README.md CHANGED Viewed

@@ -1,10 +1,10 @@
 ---
-title: B2NL Embedding Preprocessor Demo
 emoji: 🚀
 colorFrom: blue
 colorTo: green
 sdk: gradio
-sdk_version: 5.46.1
 app_file: app.py
 pinned: true
 license: apache-2.0
@@ -12,84 +12,86 @@ models:
 - ggunio/B2NL-v6.1.1
 ---
-# 🚀 B2NL v6.1.1: Intelligent Embedding Preprocessor
-## ⚠️ NOT a Tokenizer - It's an Embedding Preprocessor!
-This demo shows how B2NL **reduces the number of embeddings** sent to LLMs by intelligently grouping bytes.
 ---
-## 🎯 What You're Actually Seeing
-When you input text, B2NL:
-1. **Converts to bytes** (UTF-8)
-2. **Groups bytes intelligently**
-3. **Outputs fewer embeddings**
-### Example:
-```
-Input: "안녕하세요. 오늘 날씨가 좋네요."
-Traditional: 44 bytes → 44 embeddings
-B2NL Current: 44 bytes → 18 embeddings (2.4x reduction!)
-B2NL Target: 44 bytes → 4 embeddings (11x reduction!)
-```
 ---
-## 📊 Current Performance (Phase 2, Epoch 51)
-| Language | Embeddings Reduction | Goal |
-|----------|---------------------|------|
-| Korean   | 2.4x | 20x |
-| Chinese  | 3.0x | 15x |
-| Japanese | 3.0x | 15x |
-| Arabic   | 1.8x | 10x |
-| English  | 1.0x | 5x |
-| Spanish  | 1.0x | 5x |
 ---
-## 💡 Why This Matters
-**For LLMs:**
-- 2-20x fewer embeddings to process
-- Faster inference
-- Less memory usage
-- Longer effective context
-**For Users:**
-- Same quality (100% reversible)
-- Faster responses
-- Lower costs
 ---
-## 🔬 Technical Details
-- **Not replacing tokenizers** - Works BEFORE tokenization
-- **Language agnostic** - Pure byte-level learning
-- **100% reversible** - Perfect reconstruction
-- **Dynamic compression** - Adapts to content
 ---
-## 📈 Phase 2 Progress
-Currently training with dynamic compression (1-50:1 ratios):
-- High quality text: More compression
-- Complex text: Less compression
-- Always maintains >95% reconstruction
 ---
-## 🎮 Try It!
-1. Enter any text
-2. See the reconstruction quality
-3. Check the **compression ratio** (bytes:embeddings)
-4. That ratio = how much faster your LLM could be!
 ---
-**Remember: The "tokens" shown are actually embedding groups that would be sent to an LLM, not traditional tokens!**

 ---
+title: B2NL Tokenizer-Free Demo
 emoji: 🚀
 colorFrom: blue
 colorTo: green
 sdk: gradio
+sdk_version: 4.0.0
 app_file: app.py
 pinned: true
 license: apache-2.0
 - ggunio/B2NL-v6.1.1
 ---
+# 🚀 B2NL: The Tokenizer-Free Revolution
+## No Vocabulary Files. No Rules. Just Intelligence.
 ---
+## 🎯 What You're Testing
+**B2NL replaces traditional tokenizers entirely:**
+- Input text → Bytes → Intelligent grouping → Tokens
+- No vocabulary needed (vs GPT's 100K+ vocabulary)
+- Works with ANY language/emoji/symbol
 ---
+## 📊 Live Compression Stats (Phase 2, Epoch 51)
+When you type Korean text:
+```
+"안녕하세요" (4 characters)
+→ Traditional: 12 bytes → 12 tokens
+→ GPT-4: 12 bytes → ~5 tokens
+→ B2NL Now: 12 bytes → 5 tokens (2.4x compression)
+→ B2NL Goal: 12 bytes → 1 token (12x compression!)
+```
 ---
+## 💬 Try These Examples
+### Korean (Watch the compression!):
+- Short: "안녕하세요"
+- Medium: "오늘 날씨가 좋네요"
+- Long: "인공지능이 세상을 바꾸고 있습니다"
+### See the "Statistics" box:
+- **Tokens**: Number of embeddings generated
+- **Compression**: How much we compressed (goal: 20:1 for Korean!)
 ---
+## 📈 Current Performance
+| What you type | Traditional | B2NL Now | B2NL Target |
+|---------------|-------------|----------|-------------|
+| Korean word   | 3-5 tokens  | 2 tokens | 0.3 tokens  |
+| Chinese char  | 1-3 tokens  | 1 token  | 0.2 tokens  |
+| English word  | 1-2 tokens  | 1 token  | 0.5 tokens  |
 ---
+## 🔥 Why This Changes Everything
+**For LLM Users:**
+- Korean/Chinese/Japanese: 3-20x longer context
+- All languages: Faster inference
+- No tokenizer downloads
+- Perfect reconstruction
+**For Developers:**
+- No vocabulary management
+- No OOV problems
+- Universal API
+- Tiny model (301M params)
 ---
+## 🎮 How to Interpret Results
+1. **Reconstruction Accuracy**: Should be 95-100%
+2. **Token Count**: Lower is better!
+3. **Compression Ratio**: Higher is better!
+Current Status:
+- ✅ Phase 1: 97.71% reconstruction (DONE)
+- 🔄 Phase 2: Learning compression (IN PROGRESS)
+- ⏳ Phase 3: 204 languages (PLANNED)
 ---
+**Remember: This is replacing tokenizers entirely. The "tokens" shown are intelligent byte groups, not vocabulary lookups!**
+🚀 **The future is tokenizer-free!**