support reasoning tag
README.md CHANGED

@@ -8,7 +8,7 @@ sdk_version: 1.44.1
 app_file: app.py
 pinned: false
 license: apache-2.0
-short_description: Run GGUF models
+short_description: Run GGUF models with llama.cpp
 ---
 
 This Streamlit app enables **chat-based inference** on various GGUF models using `llama.cpp` and `llama-cpp-python`.
@@ -26,6 +26,8 @@ This Streamlit app enables **chat-based inference** on various GGUF models using
 - Model selection in the sidebar
 - Customizable system prompt and generation parameters
 - Chat-style UI with streaming responses
+- **Markdown output rendering** for readable, styled output
+- **DeepSeek-compatible `<think>` tag handling** — shows model reasoning in a collapsible expander
 
 ### 🧠 Memory-Safe Design (for HuggingFace Spaces):
 - Loads only **one model at a time** to prevent memory bloat
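For context on the README's "one model at a time" note, here is a minimal sketch of that single-model pattern. It is illustrative only: the names `load_single_model` and `_llm` are assumptions and do not come from app.py, which only shows that `gc` is imported.

```python
import gc
from llama_cpp import Llama

_llm = None  # the one model kept in memory (hypothetical module-level handle)

def load_single_model(gguf_path: str) -> Llama:
    """Illustrative sketch: drop any previously loaded model before loading a new one."""
    global _llm
    if _llm is not None:
        del _llm          # release the old llama.cpp context
        _llm = None
        gc.collect()      # encourage Python to return the memory before reallocating
    _llm = Llama(model_path=gguf_path, n_ctx=2048)
    return _llm
```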
app.py CHANGED

@@ -4,6 +4,7 @@ from huggingface_hub import hf_hub_download
 import os
 import gc
 import shutil
+import re
 
 # Available models
 MODELS = {
@@ -184,6 +185,13 @@ if user_input:
         if "choices" in chunk:
             delta = chunk["choices"][0]["delta"].get("content", "")
             full_response += delta
-
+            visible = re.sub(r"<think>.*?</think>", "", full_response, flags=re.DOTALL)
+            response_area.markdown(visible)
 
     st.session_state.chat_history.append({"role": "assistant", "content": full_response})
+
+    thinking = re.findall(r"<think>(.*?)</think>", full_response, flags=re.DOTALL)
+    if thinking:
+        with st.expander("🧠 Model's Internal Reasoning"):
+            for t in thinking:
+                st.markdown(t.strip())
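To make the intent of the new regex handling concrete, here is a small self-contained sketch of how the two patterns from this diff split a DeepSeek-style response into visible text and reasoning. The sample string is invented for illustration; only the regexes come from the change.

```python
import re

# Example response in the DeepSeek <think> format (invented sample)
full_response = "<think>User asked for 2+2; that is 4.</think>The answer is 4."

# Same pattern as the diff: strip reasoning from what the chat bubble renders...
visible = re.sub(r"<think>.*?</think>", "", full_response, flags=re.DOTALL)
# ...and collect the reasoning separately for the expander
thinking = re.findall(r"<think>(.*?)</think>", full_response, flags=re.DOTALL)

print(visible)   # -> "The answer is 4."
print(thinking)  # -> ["User asked for 2+2; that is 4."]
```

Because the substitution uses `re.DOTALL` with a non-greedy `.*?`, multi-line reasoning blocks are removed cleanly, and partial `<think>` content that arrives mid-stream simply stays visible until the closing tag appears.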