---
base_model:
- Qwen/Qwen3-Next-80B-A3B-Thinking
pipeline_tag: text-generation
license: apache-2.0
---

## Model Details

This model is an INT4 model with group_size 128 and symmetric quantization of [Qwen/Qwen3-Next-80B-A3B-Thinking](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking), generated by [intel/auto-round](https://github.com/intel/auto-round). Please follow the license of the original model.

## How To Use

For vLLM, this PR is required: https://github.com/vllm-project/vllm/pull/24818

### INT4 Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_dir = "Intel/Qwen3-Next-80B-A3B-Thinking-int4-AutoRound"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
)
# keep only the newly generated tokens (drop the prompt)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)

"""
thinking content: Okay, the user asked for a short introduction to large language models. Let me start by recalling what I know. LLMs are a type of AI that processes and generates human-like text. They're trained on vast amounts of text data to understand language patterns.

First, I should define what an LLM is simply. Maybe start with "Large Language Models (LLMs) are advanced AI systems..." Then explain they're trained on huge datasets to predict the next word in a sequence. That's the core idea.

Wait, the user might not know terms like "transformer architecture." Should I mention it? Maybe briefly, but keep it simple. Say they use deep learning techniques, specifically transformer models, which are good at handling context.

Key points to cover: scale (billions of parameters), training data (diverse text sources), and capabilities like answering questions, writing, translating. Also, mention they're used in applications like chatbots, search engines, etc.

Need to avoid jargon. Keep it short as requested. Maybe structure it in a few sentences. Start with a definition, then how they work, their scale, and common uses. End with a note about their impact.

Check if there's anything missing. Oh, maybe clarify that they don't "understand" like humans but predict patterns. But the user asked for a short intro, so maybe skip that detail unless necessary. Focus on what they do, not the nuances of understanding.

Also, should I mention examples? Like GPT, BERT? Maybe not necessary for a short intro. Keep it general.

Possible structure:

- Definition: LLMs are AI systems trained on massive text data to generate and understand human language.
- How: Use deep learning, predict next word, handle context.
- Scale: Billions of parameters, trained on diverse internet text.
- Applications: Chatbots, translation, content creation, etc.
- Impact: Transforming how we interact with technology.

Make sure it's concise. Maybe 3-4 sentences. Let me draft:

"Large Language Models (LLMs) are AI systems trained on vast amounts of text data to understand and generate human-like language. They use deep learning techniques, particularly transformer architectures, to predict the next word in a sequence based on context. With billions of parameters, LLMs power applications like chatbots, translation tools, and content creation, revolutionizing how we interact with technology."

That's short. Check if it covers the basics without too much detail. Yes. Maybe add "state-of-the-art" but "advanced" is fine. "Revolutionizing" might be a bit strong, but it's common to say that. Alternatively, "changing how we interact" is good.

Also, should I mention they're not perfect? Probably not for a short intro. The user just wants a basic overview.

Yes, that should work. Keep it simple, clear, and concise.
content: Large Language Models (LLMs) are advanced AI systems trained on massive amounts of text data to understand and generate human-like language. Using deep learning techniques (like transformer architectures), they predict the next word in a sequence based on context, enabling tasks like answering questions, writing stories, translating languages, and more. With billions of parameters, LLMs power tools like chatbots, search engines, and content creators, transforming how we interact with technology daily."""
```

### Generate the model

```bash
auto_round --model Qwen/Qwen3-Next-80B-A3B-Thinking --scheme W4A16 --output_dir tmp_autoround
```

## Evaluate Results

| benchmark | n-shot | backend | Intel/Qwen3-Next-80B-A3B-Thinking-int4-AutoRound | Qwen/Qwen3-Next-80B-A3B-Thinking |
| :-------: | :----: | :-----: | :----------------------------------------------: | :------------------------------: |
|   gsm8k   |   5    |  vllm   |                      0.8446                      |              0.8453              |
| mmlu_pro  |   5    |  vllm   |                      0.7079                      |              0.7271              |

## Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

## Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

- [Intel Neural Compressor](https://github.com/intel/neural-compressor)

## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

## Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)
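
The thinking/answer split used in the INT4 Inference example can be isolated into a small helper that works on plain token id lists, which makes it easy to check without loading the model. This is only a sketch: the function name `split_thinking` and the constant `THINK_END_ID` are hypothetical names introduced here, and the id 151668 is taken from the inference example rather than looked up from the tokenizer.

```python
from typing import List, Tuple

# Assumption: id of the end-of-thinking marker, copied from the inference example.
THINK_END_ID = 151668


def split_thinking(
    output_ids: List[int], end_id: int = THINK_END_ID
) -> Tuple[List[int], List[int]]:
    """Split generated ids into (thinking_ids, answer_ids).

    The split point is just after the LAST occurrence of end_id, so the
    marker itself stays in the thinking part. If the marker is absent,
    everything is treated as the answer.
    """
    try:
        # Searching the reversed list emulates a "rindex" on a Python list.
        index = len(output_ids) - output_ids[::-1].index(end_id)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]


if __name__ == "__main__":
    thinking, answer = split_thinking([11, 12, THINK_END_ID, 21, 22])
    print(thinking, answer)
```

Both halves can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` as in the example. When generation stops before the marker is produced (for instance because `max_new_tokens` was reached), the helper returns an empty thinking part and the whole output as the answer, matching the `except ValueError` branch of the original code.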