edbeeching committed · c80506e
Parent(s): f00ab9d
polishing 2
app.py CHANGED
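Each hunk header below, e.g. `@@ -561,75 +561,53 @@ def main():`, means the change replaces 75 lines starting at old line 561 with 53 lines starting at new line 561; the trailing text names the enclosing function. The same format can be produced with Python's stdlib `difflib` — a minimal sketch with made-up one-line inputs standing in for the real `app.py` change:

```python
import difflib

# Hypothetical before/after lines, not the actual file contents.
old = ['gr.Markdown("# DataForge")\n']
new = ['gr.Markdown("# DataForge - Synthetic Data Generation")\n']

# unified_diff emits ---/+++ file headers, then @@ hunk headers, then -/+ lines.
diff = list(difflib.unified_diff(old, new, fromfile="app.py", tofile="app.py"))
for line in diff:
    print(line, end="")
```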
@@ -561,75 +561,53 @@ def main():
| 561 |         with main_interface:
| 562 |             with gr.Group():
| 563 |                 with gr.Row():
| 564 | -                   gr.Markdown("# DataForge - Synthetic Data Generation")
| 565 |                 with gr.Row():
| 566 | -                   gr.
| 567-610 | -               …
| 611 | -                   **Conversational Data**
| 612 | -                   - Input: Conversation starters → Output: Multi-turn dialogues
| 613 | -                   - Models: `meta-llama/Llama-3.2-3B-Instruct` or `mistralai/Mistral-7B-Instruct-v0.3`
| 614 | -                   - Temperature: 0.7-0.9 for natural variety
| 615 | -
| 616 | -                   **Code Generation**
| 617 | -                   - Input: Problem descriptions → Output: Code solutions with explanations
| 618 | -                   - Models: `Qwen/Qwen2.5-Coder-3B-Instruct` or `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct`
| 619 | -                   - Temperature: 0.1-0.3 for accurate code
| 620 | -
| 621 | -                   **Creative Writing**
| 622 | -                   - Input: Story prompts → Output: Creative narratives
| 623 | -                   - Models: `meta-llama/Llama-3.2-3B-Instruct` or `mistralai/Mistral-7B-Instruct-v0.3`
| 624 | -                   - Temperature: 0.8-1.2 for creativity
| 625 | -
| 626 | -                   **Example Dataset Names to Try:**
| 627 | -                   ```
| 628 | -                   simplescaling/s1K-1.1                        # Simple Q&A pairs
| 629 | -                   HuggingFaceH4/ultrachat_200k                 # Conversations
| 630 | -                   iamtarun/python_code_instructions_18k_alpaca # Code tasks
| 631 | -                   ```
| 632 | -                   """)
| 633 |
| 634 |             with gr.Tabs():
| 635 |                 with gr.TabItem("Generate Data"):
|
@@ -647,7 +625,7 @@ def main():
| 647 |                 )
| 648 |                 # model_token = gr.Textbox(label="Model Token (Optional)", type="password", placeholder="Your HF token with read/write access to the model...")
| 649 |                 with gr.Row():
| 650 | -                   system_prompt = gr.Textbox(label="System Prompt (Optional)",
| 651 |                 gr.Markdown("### Generation Parameters")
| 652 |                 with gr.Row():
| 653 |                     with gr.Column():
| 561 |         with main_interface:
| 562 |             with gr.Group():
| 563 |                 with gr.Row():
| 564 | +                   gr.Markdown("# DataForge - Synthetic Data Generation")
| 565 |                 with gr.Row():
| 566 | +                   with gr.Column(scale=1):
| 567 | +                       gr.Markdown("""
| 568 | +                       **DataForge** - Scalable synthetic data generation framework built on DataTrove. Supports distributed Slurm processing with 20+ models.
| 569 | +
| 570 | +                       **Free for PRO users** (10K samples) • **100 samples** for free users • All datasets are **PUBLIC** under [synthetic-data-universe](https://huggingface.co/synthetic-data-universe)
| 571 | +                       """)
| 572 | +                   with gr.Column(scale=1):
| 573 | +                       with gr.Accordion("Usage Guide", open=False):
| 574 | +                           gr.Markdown("""
| 575 | +                           **Step-by-Step Process:**
| 576 | +                           1. **Load Dataset**: Enter a HF dataset name
| 577 | +                           2. **Load Info**: Click "Load Dataset Info"
| 578 | +                           3. **Choose Model**: Select from 20+ models
| 579 | +                           4. **Configure**: Set generation parameters
| 580 | +                           5. **Submit**: Monitor progress in Statistics tab
| 581 | +
| 582 | +                           **Requirements:**
| 583 | +                           - Input dataset must be public on HF Hub
| 584 | +                           - Model must be publicly accessible
| 585 | +                           - Free users: 100 samples max, PRO: 10K max
| 586 | +                           - Token limit: 8,192 per sample
| 587 | +                           """)
| 588 | +                       with gr.Accordion("Examples", open=False):
| 589 | +                           gr.Markdown("""
| 590 | +                           **Popular Use Cases:**
| 591 | +
| 592 | +                           **Educational**: Q&A datasets
| 593 | +                           - Models: Qwen3-4B, Phi-3.5-mini
| 594 | +                           - Temperature: 0.3-0.5
| 595 | +
| 596 | +                           **Conversational**: Multi-turn dialogues
| 597 | +                           - Models: Llama-3.2-3B, Mistral-7B
| 598 | +                           - Temperature: 0.7-0.9
| 599 | +
| 600 | +                           **Code**: Problem → Solution
| 601 | +                           - Models: Qwen2.5-Coder, DeepSeek-Coder
| 602 | +                           - Temperature: 0.1-0.3
| 603 | +
| 604 | +                           **Example datasets to try:**
| 605 | +                           ```
| 606 | +                           simplescaling/s1K-1.1
| 607 | +                           HuggingFaceH4/ultrachat_200k
| 608 | +                           iamtarun/python_code_instructions_18k_alpaca
| 609 | +                           ```
| 610 | +                           """)
| 611 |
| 612 |         with gr.Tabs():
| 613 |             with gr.TabItem("Generate Data"):

| 625 |                 )
| 626 |                 # model_token = gr.Textbox(label="Model Token (Optional)", type="password", placeholder="Your HF token with read/write access to the model...")
| 627 |                 with gr.Row():
| 628 | +                   system_prompt = gr.Textbox(label="System Prompt (Optional)", placeholder="Optional system prompt... e.g., You are a helpful assistant.", info="Sets the AI's role/behavior. Leave empty for default model behavior.")
| 629 |                 gr.Markdown("### Generation Parameters")
| 630 |                 with gr.Row():
| 631 |                     with gr.Column():