McGill-NLP/longcot-8k-1.5b

TL;DR

  • Markovian Thinking for RL in reasoning LLMs: replace the trivial MDP where state = prompt + all past thinking tokens (quadratic compute) with a bounded, fixed-size state, yielding linear compute in thinking tokens and constant memory by design.
  • Delethink RL trains the model to “think” in fixed-size chunks with a bounded state (see the sketch after this list).
  • This 1.5B model is trained with standard LongCoT RL and an 8K-token thinking budget, using the entire context for thinking (see Model Summary below).
  • Initialized from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. See the paper for full details.
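
To make the chunked-rollout idea concrete, here is a minimal, illustrative Python sketch of a Delethink-style rollout loop. It is not the authors' implementation: `generate` stands in for any bounded-context decode call (for example, an SGLang engine), and the chunk size, carryover length, and chunk count are placeholder values, not the paper's settings.

def delethink_rollout(prompt_ids, generate, eos_id, chunk_size=8192, carry=512, max_chunks=3):
    # Bounded-size Markovian state: the prompt plus (later) a short carryover.
    state = list(prompt_ids)
    thinking = []                        # full thinking trace, kept outside the state
    for _ in range(max_chunks):
        # Decode one fixed-size chunk from the bounded state.
        chunk = generate(state, chunk_size - len(state))
        thinking += chunk
        if eos_id in chunk:              # model signalled that it is done thinking
            break
        # Reset the state to the original prompt plus the tail of the last chunk,
        # so the active context stays bounded however long the full trace gets.
        state = list(prompt_ids) + chunk[-carry:]
    return thinking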

Links

  • Paper: The Markovian Thinker (https://arxiv.org/abs/2510.06557)

Model Summary

  • Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  • Objective: Reinforcement learning with standard LongCoT, trained for 1,000 steps.
  • Thinking budget: 8K tokens; thinking uses the entire context window.
  • Intended use: Math/logic reasoning with step-by-step derivations; final answer typically formatted inside LaTeX \boxed{}.
  • Library compatibility: Works well with SGLang for chunked inference; also usable with Transformers for standard generation.

Intended Uses and Limitations

  • Intended uses:
    • Long-form reasoning on math and related tasks.
    • Bounded-context rollouts with repeated chunking and short carryovers.
  • Not intended for:
    • Safety-sensitive applications without human oversight.
    • Use cases requiring faithful, verifiable citations to external sources.
  • Limitations:
    • May hallucinate, make arithmetic/algebraic mistakes, or produce inconsistent plans.
    • The chunked rollout procedure is needed to realize Delethink’s efficiency advantages.

Prompting

  • Use the model’s chat template and request a step-by-step solution with a final boxed answer:
    • “Please reason step by step, and put your final answer within \boxed{}.”
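
For standard (non-chunked) generation with the Transformers library noted in the Model Summary, a minimal sketch might look like the following; the sampling settings and the short example question are illustrative, not prescriptive.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "McGill-NLP/longcot-8k-1.5b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{
    "role": "user",
    "content": "What is 12 * 17? Please reason step by step, and put your final answer within \\boxed{}.",
}]
# Apply the model's chat template and move the input to the model's device.
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=8192, do_sample=True, temperature=0.6, top_p=1.0)
# Decode only the newly generated tokens.
print(tok.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))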

Quickstart (SGLang)

import sglang as sgl

def main():
    # Launch an offline SGLang engine for this model.
    llm = sgl.Engine(
        model_path="McGill-NLP/longcot-8k-1.5b",
        dtype="bfloat16",
        attention_backend="flashinfer",
        mem_fraction_static=0.8,
        log_level="WARNING",
    )

    prompt = (
        r"There exist real numbers $x$ and $y$, both greater than 1, such that "
        r"$\log_x\left(y^x\right)=\log_y\left(x^{4y}\right)=10$. Find $xy$."
        "\n\nPlease reason step by step, and put your final answer within \\boxed{}."
    )
    tok = llm.tokenizer_manager.tokenizer
    # Apply the chat template and tokenize, appending the generation prompt.
    query_ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=True,
        add_generation_prompt=True,
    )

    params = {"temperature": 0.6, "max_new_tokens": 8192}
    # Engine.generate returns a dict containing the generated text and metadata.
    out = llm.generate(input_ids=query_ids, sampling_params=params)
    print(out["text"])

if __name__ == "__main__":
    main()
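
Since the final answer is typically placed inside \boxed{} (see Model Summary), a small post-processing helper can pull it out of the generated text. The snippet below is an illustrative sketch; its simple regex does not handle nested braces.

import re

def extract_boxed(text):
    # Return the contents of the last \boxed{...} in the completion, if any.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

# Example: answer = extract_boxed(out["text"])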

Suggested generation settings

  • temperature: 0.6
  • top_p: 1.0
  • top_k: -1
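
Expressed as an SGLang sampling_params dict (the max_new_tokens value mirrors the quickstart and is an assumption, not a requirement):

params = {
    "temperature": 0.6,
    "top_p": 1.0,
    "top_k": -1,            # -1 disables top-k filtering
    "max_new_tokens": 8192,
}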

Safety and Use

  • This model can produce incorrect or misleading reasoning steps and answers. Always verify results.
  • Do not deploy in high-stakes domains without human oversight.

Citation

@misc{Aghajohari2025:TheMarkovianThinker,
      title={The Markovian Thinker}, 
      author={Milad Aghajohari and Kamran Chitsaz and Amirhossein Kazemnejad and Sarath Chandar and Alessandro Sordoni and Aaron Courville and Siva Reddy},
      year={2025},
      eprint={2510.06557},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.06557}, 
}