Intel/DeepSeek-V3.1-Terminus-int8-AutoRound

8-bit precision


Xuehao committed on Sep 26

Commit 2f5d1e5 · verified · 1 Parent(s): d957cbb

Update README.md

Files changed (1)

README.md +61 -1


@@ -6,7 +6,67 @@ base_model:
 ## Model Details
-This model is a mixed int8 model with group_size 128 and symmetric quantization of deepseek-ai/DeepSeek-V3.1-Terminus generated by intel/auto-round via RTN(no algorithm tuning). Please refer to Section Generate the model for more details. Please follow the license of the original model.
 ## Generate the Model

 ## Model Details
+This model is an int8 model with group_size 128 and symmetric quantization of deepseek-ai/DeepSeek-V3.1-Terminus, generated by intel/auto-round via RTN (round-to-nearest, no algorithm tuning). Please refer to the "Generate the Model" section for more details, and follow the license of the original model.
+## Model Version(s)
+The model was quantized with auto-round v0.8.0.
+## How To Use
+### INT8 Inference
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+quantized_model_dir = "Intel/DeepSeek-V3.1-Terminus-int8-AutoRound"
+model = AutoModelForCausalLM.from_pretrained(
+        quantized_model_dir,
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+)
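+# Left-pad so every prompt ends where generation starts (batched decoder-only generation).
+tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True, padding_side="left")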
+prompts = [
+        "9.11和9.8哪个数字大",
+        "strawberry中有几个r?",
+        "There is a girl who likes adventure,",
+        "Please give a brief introduction of DeepSeek company.",
+        ]
+texts = []
+for prompt in prompts:
+    messages = [
+            {"role": "system", "content": "You are a helpful assistant."},
+            {"role": "user", "content": prompt}
+    ]
+    text = tokenizer.apply_chat_template(
+            messages,
+            tokenize=False,
+            add_generation_prompt=True
+            )
+    texts.append(text)
+inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
+outputs = model.generate(
+        input_ids=inputs["input_ids"].to(model.device),
+        attention_mask=inputs["attention_mask"].to(model.device),
+        max_length=200,  # counts prompt tokens too; change this to align with the official usage
+        num_return_sequences=1,
+        do_sample=False,  # greedy decoding; change this to align with the official usage
+        )
+generated_ids = [
+        output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
+        ]
+decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+for i, prompt in enumerate(prompts):
+    print(f"Prompt: {prompt}")
+    print(f"Generated: {decoded_outputs[i]}")
+    print("-"*50)
+```
 ## Generate the Model
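
Note on the Model Details line above: it describes symmetric int8 quantization with group_size 128 via RTN. As a rough illustration of what that combination means numerically, here is a minimal pure-Python sketch. It is not auto-round's implementation; the function names (`quantize_group_rtn`, `quantize_row_rtn`, `dequantize_row`) are hypothetical, and it assumes one scale per group of 128 consecutive weights, no zero-point, and a clamp to [-127, 127].

```python
def quantize_group_rtn(weights, bits=8):
    """Symmetric round-to-nearest quantization of one group of floats.

    Hypothetical helper for illustration only (not an auto-round API).
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    max_abs = max(abs(w) for w in weights)
    # Symmetric scheme: a single scale, no zero-point.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # RTN: round each scaled weight to the nearest integer, then clamp.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def quantize_row_rtn(row, group_size=128, bits=8):
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group_rtn(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Reconstruct approximate float weights from (ints, scale) pairs."""
    return [q * scale for qs, scale in groups for q in qs]


if __name__ == "__main__":
    import random

    random.seed(0)
    row = [random.uniform(-1.0, 1.0) for _ in range(256)]
    groups = quantize_row_rtn(row)  # 256 weights -> 2 groups of 128
    restored = dequantize_row(groups)
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    print(f"groups: {len(groups)}, max abs error: {max_err:.6f}")
```

Under these assumptions the per-weight reconstruction error is bounded by half of that group's scale, which is why a smaller group size generally tracks the weights more closely at the cost of storing more scales.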