ChemDFM-v2.0-14B
ChemDFM-v2.0 is the latest non-thinking model of ChemDFM, the pioneering open-source dialogue foundation model for chemistry and molecular science.
To achieve better chemical capabilities, we upgrade both the domain pre-training stage and the instruction tuning stage. In the domain pre-training stage, we introduce web-scale molecules and reactions into the corpus along with their functional-group information and properties, which enables ChemDFM to acquire chemical knowledge at a finer level of granularity. In the instruction tuning stage, we significantly improve the diversity of our instruction tuning dataset by introducing more tasks and increasing the variability in the phrasing of the instructions.
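As an illustration of the kind of functional-group annotation described above, such labels can be derived with RDKit. The sketch below is hypothetical and only hints at the idea; it is not ChemDFM's actual corpus-construction code:

from rdkit import Chem
from rdkit.Chem import Fragments, rdMolDescriptors

# Hypothetical sketch: count a few functional groups for a molecule,
# i.e. the kind of finer-grained annotation attached to the corpus.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print("esters:", Fragments.fr_ester(mol))
print("carboxylic acids:", Fragments.fr_COO(mol))
print("aromatic rings:", rdMolDescriptors.CalcNumAromaticRings(mol))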
News
2025-10-26: The parameters of ChemDFM-R-14B are open-sourced!
2025-10-26: ChemDFM-v2.0-14B is released! The improved domain pre-training and instruction tuning procedures are implemented on Qwen2.5-14B to build a more advanced general LLM for chemistry. More details can be found here.
2025-07-29: The paper of ChemDFM-R-14B is released on arXiv: ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge.
2024-11-09: ChemDFM-v1.5-8B is released! We implemented our domain pre-training and instruction tuning procedures on the stronger base model LLaMA-3-8B.
2024-03-12: The parameters of ChemDFM-v1.0-13B are open-sourced!
2024-01-26: The paper of ChemDFM-13B is released on arXiv: ChemDFM: Dialogue Foundation Model for Chemistry
Local Inference
Here is an example of how to load and run ChemDFM-v2.0 locally:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

# Load the tokenizer and the model in fp16 on GPU
model_name_or_id = "OpenDFM/ChemDFM-v2.0-14B"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id)
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.float16).to("cuda")

instruction = "Can you please give detailed descriptions of the molecule below?\nCl.O=C1c2c(O)cccc2-c2nn(CCNCCO)c3ccc(NCCNCCO)c1c23"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": instruction},
]

# Render the conversation with the model's chat template, then tokenize
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

generation_config = GenerationConfig(
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.9,
    max_new_tokens=1024,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id,
)

outputs = model.generate(**inputs, generation_config=generation_config)

# The decoded output contains the prompt; strip it to keep only the reply
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
input_text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
generated_text = generated_text[len(input_text):].strip()
print(f"{generated_text=}")
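If you would rather stream tokens to the console as they are produced, the same generate call can be wrapped with transformers' TextStreamer. This is an optional convenience on top of the example above, not part of the reference example:

from transformers import TextStreamer

# Print tokens as they are generated; skip_prompt suppresses the echoed input
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, generation_config=generation_config, streamer=streamer)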
SMILES Preprocessing
When your input involves SMILES notation, we recommend preprocessing it with the rdkit package to obtain the canonical SMILES. Here is an example:
from rdkit import Chem

def canonicalize_smiles(smiles):
    # Return the canonical SMILES, or None if the input cannot be parsed
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True, kekuleSmiles=False)
or directly:
from rdkit import Chem

def canonicalize_smiles(smiles):
    # One-call equivalent; note that it raises an error on unparsable
    # SMILES instead of returning None
    return Chem.CanonSmiles(smiles, useChiral=True)
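For example, either helper can be applied to the SMILES from the inference example above before building the prompt (the variable names here simply mirror that snippet):

# Canonicalize the SMILES before inserting it into the instruction
raw_smiles = "Cl.O=C1c2c(O)cccc2-c2nn(CCNCCO)c3ccc(NCCNCCO)c1c23"
instruction = f"Can you please give detailed descriptions of the molecule below?\n{canonicalize_smiles(raw_smiles)}"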
Citation
@article{zhao2025developing,
title={Developing ChemDFM as a large language foundation model for chemistry},
author={Zhao, Zihan and Ma, Da and Chen, Lu and Sun, Liangtai and Li, Zihao and Xia, Yi and Chen, Bo and Xu, Hongshen and Zhu, Zichen and Zhu, Su and others},
journal={Cell Reports Physical Science},
volume={6},
number={4},
year={2025},
publisher={Elsevier}
}
@misc{zhao2025chemdfmr,
title={ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge},
author={Zihan Zhao and Bo Chen and Ziping Wan and Lu Chen and Xuanze Lin and Shiyang Yu and Situo Zhang and Da Ma and Zichen Zhu and Danyang Zhang and Huayang Wang and Zhongyang Dai and Liyang Wen and Xin Chen and Kai Yu},
year={2025},
eprint={2507.21990},
archivePrefix={arXiv},
primaryClass={cs.CE},
url={https://arxiv.org/abs/2507.21990},
}
Disclaimer
The current version of ChemDFM may generate incorrect or misleading information. Please use it with caution and verify the results with domain experts before making any decisions based on them.
Contact
If you have any questions or further requests, please contact Zihan Zhao, Bo Chen, and Lu Chen.