edbeeching committed on
Commit
c80506e
Β·
1 Parent(s): f00ab9d

polishing 2

Browse files
Files changed (1) hide show
  1. app.py +47 -69
app.py CHANGED
@@ -561,75 +561,53 @@ def main():
561
  with main_interface:
562
  with gr.Group():
563
  with gr.Row():
564
- gr.Markdown("# DataForge - Synthetic Data Generation")
565
  with gr.Row():
566
- gr.Markdown("""
567
- **DataForge** - Scalable synthetic data generation framework built on DataTrove. Supports distributed Slurm processing with 20+ models.
568
-
569
- **Free for PRO users** (10K samples) β€’ **100 samples** for free users β€’ All datasets are **PUBLIC** under [synthetic-data-universe](https://huggingface.co/synthetic-data-universe)
570
- """)
571
- with gr.Accordion("Complete Usage Guide", open=False):
572
- with gr.Row():
573
- gr.Markdown("""
574
- **Step-by-Step Process:**
575
- 1. **Load Dataset**: Enter a Hugging Face dataset name (e.g., `simplescaling/s1K-1.1`)
576
- 2. **Load Info**: Click "Load Dataset Info" to populate configs, columns, and splits
577
- 3. **Choose Model**: Select from 20+ popular instruction-tuned models
578
- 4. **Configure**: Set generation parameters (temperature, tokens, etc.)
579
- 5. **Submit**: Click submit and monitor progress in the Statistics tab
580
-
581
- **Pro Tips:**
582
- - Use temperature 0.7-1.0 for creative tasks, 0.1-0.3 for factual content
583
- - Start with fewer samples to test your prompt before scaling up
584
- - Check existing datasets in [synthetic-data-universe](https://huggingface.co/synthetic-data-universe) for inspiration
585
- """)
586
- gr.Markdown("""
587
- **Requirements & Limits:**
588
- - Input dataset must be **publicly accessible** on HF Hub
589
- - Model must be **publicly accessible** (not gated)
590
- - **Sample Limits:**
591
- - Free users: 100 samples max
592
- - PRO users: 10,000 samples max
593
- - **Token Limit:** 8,192 generated tokens per sample
594
- - **Processing Time:** Varies by model size and queue status
595
-
596
- **Privacy & Usage:**
597
- - All outputs are **PUBLIC** on Hugging Face Hub
598
- - Datasets appear under `synthetic-data-universe` organization
599
- - Perfect for research, training data, and open-source projects
600
- """)
601
-
602
- with gr.Accordion("Examples & Use Cases", open=False):
603
- gr.Markdown("""
604
- **Popular Use Cases:**
605
-
606
- **Educational Content Generation**
607
- - Input: Questions dataset β†’ Output: Detailed explanations and answers
608
- - Models: `Qwen/Qwen3-4B-Instruct-2507` or `microsoft/Phi-3.5-mini-instruct`
609
- - Temperature: 0.3-0.5 for factual accuracy
610
-
611
- **Conversational Data**
612
- - Input: Conversation starters β†’ Output: Multi-turn dialogues
613
- - Models: `meta-llama/Llama-3.2-3B-Instruct` or `mistralai/Mistral-7B-Instruct-v0.3`
614
- - Temperature: 0.7-0.9 for natural variety
615
-
616
- **Code Generation**
617
- - Input: Problem descriptions β†’ Output: Code solutions with explanations
618
- - Models: `Qwen/Qwen2.5-Coder-3B-Instruct` or `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct`
619
- - Temperature: 0.1-0.3 for accurate code
620
-
621
- **Creative Writing**
622
- - Input: Story prompts β†’ Output: Creative narratives
623
- - Models: `meta-llama/Llama-3.2-3B-Instruct` or `mistralai/Mistral-7B-Instruct-v0.3`
624
- - Temperature: 0.8-1.2 for creativity
625
-
626
- **Example Dataset Names to Try:**
627
- ```
628
- simplescaling/s1K-1.1 # Simple Q&A pairs
629
- HuggingFaceH4/ultrachat_200k # Conversations
630
- iamtarun/python_code_instructions_18k_alpaca # Code tasks
631
- ```
632
- """)
633
 
634
  with gr.Tabs():
635
  with gr.TabItem("Generate Data"):
@@ -647,7 +625,7 @@ def main():
647
  )
648
  # model_token = gr.Textbox(label="Model Token (Optional)", type="password", placeholder="Your HF token with read/write access to the model...")
649
  with gr.Row():
650
- system_prompt = gr.Textbox(label="System Prompt (Optional)", lines=3, placeholder="Optional system prompt... e.g., You are a helpful assistant.", info="Sets the AI's role/behavior. Leave empty for default model behavior.")
651
  gr.Markdown("### Generation Parameters")
652
  with gr.Row():
653
  with gr.Column():
 
561
  with main_interface:
562
  with gr.Group():
563
  with gr.Row():
564
+ gr.Markdown("# DataForge - Synthetic Data Generation")
565
  with gr.Row():
566
+ with gr.Column(scale=1):
567
+ gr.Markdown("""
568
+ **DataForge** - Scalable synthetic data generation framework built on DataTrove. Supports distributed Slurm processing with 20+ models.
569
+
570
+ **Free for PRO users** (10K samples) β€’ **100 samples** for free users β€’ All datasets are **PUBLIC** under [synthetic-data-universe](https://huggingface.co/synthetic-data-universe)
571
+ """)
572
+ with gr.Column(scale=1):
573
+ with gr.Accordion("Usage Guide", open=False):
574
+ gr.Markdown("""
575
+ **Step-by-Step Process:**
576
+ 1. **Load Dataset**: Enter a HF dataset name
577
+ 2. **Load Info**: Click "Load Dataset Info"
578
+ 3. **Choose Model**: Select from 20+ models
579
+ 4. **Configure**: Set generation parameters
580
+ 5. **Submit**: Monitor progress in Statistics tab
581
+
582
+ **Requirements:**
583
+ - Input dataset must be public on HF Hub
584
+ - Model must be publicly accessible
585
+ - Free users: 100 samples max, PRO: 10K max
586
+ - Token limit: 8,192 per sample
587
+ """)
588
+ with gr.Accordion("Examples", open=False):
589
+ gr.Markdown("""
590
+ **Popular Use Cases:**
591
+
592
+ **Educational**: Q&A datasets
593
+ - Models: Qwen3-4B, Phi-3.5-mini
594
+ - Temperature: 0.3-0.5
595
+
596
+ **Conversational**: Multi-turn dialogues
597
+ - Models: Llama-3.2-3B, Mistral-7B
598
+ - Temperature: 0.7-0.9
599
+
600
+ **Code**: Problem β†’ Solution
601
+ - Models: Qwen2.5-Coder, DeepSeek-Coder
602
+ - Temperature: 0.1-0.3
603
+
604
+ **Example datasets to try:**
605
+ ```
606
+ simplescaling/s1K-1.1
607
+ HuggingFaceH4/ultrachat_200k
608
+ iamtarun/python_code_instructions_18k_alpaca
609
+ ```
610
+ """)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
611
 
612
  with gr.Tabs():
613
  with gr.TabItem("Generate Data"):
 
625
  )
626
  # model_token = gr.Textbox(label="Model Token (Optional)", type="password", placeholder="Your HF token with read/write access to the model...")
627
  with gr.Row():
628
+ system_prompt = gr.Textbox(label="System Prompt (Optional)", placeholder="Optional system prompt... e.g., You are a helpful assistant.", info="Sets the AI's role/behavior. Leave empty for default model behavior.")
629
  gr.Markdown("### Generation Parameters")
630
  with gr.Row():
631
  with gr.Column():