---
base_model:
- Qwen/Qwen2.5-7B-Instruct
datasets:
- miromind-ai/MiroRL-GenQA
language:
- en
license: apache-2.0
tags:
- agent
- deepresearch
- llm
- rl
- reinforcementlearning
pipeline_tag: text-generation
library_name: transformers
---

# Model Card for PokeeResearch

## Model Details

### Model Description

**PokeeResearch-7B** is a **7-billion-parameter deep research agent** developed by **Pokee AI** to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs. The model integrates **Reinforcement Learning from AI Feedback (RLAIF)** with a **robust reasoning scaffold**, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads.

- **Developed by:** Pokee AI
- **Model type:** Tool-augmented large language model (LLM) research agent
- **Language(s):** Primarily English; the Qwen2.5 backbone also supports Chinese and other languages
- **License:** Apache 2.0
- **Finetuned from model:** Qwen2.5-7B-Instruct

### Model Sources

- **Repository:** [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
- **Paper:** [*PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold*](https://arxiv.org/pdf/2510.15862), Pokee AI, October 2025
- **Project Page:** [https://pokee.ai/deepresearch-preview](https://pokee.ai/deepresearch-preview)

---

## Uses

### Direct Use

PokeeResearch-7B is designed for **deep research automation**, where the model autonomously:

- Decomposes complex user queries
- Retrieves and reads from external sources
- Synthesizes factual, verifiable, and grounded answers

It can be used as a **standalone research assistant** or integrated into **multi-agent systems** to support academic, enterprise, or product-level research tasks.
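For a quick, tool-free sanity check, the checkpoint can be loaded with the `transformers` library declared in this card's metadata. The sketch below is illustrative only: the Hub repository id is a placeholder (replace it with the actual checkpoint path), and the full agentic workflow (search tools, verification, and research-thread synthesis) is implemented in the PokeeResearchOSS repository linked above, not reproduced here.

```python
# Minimal loading sketch (illustrative). The model id below is a placeholder,
# not a confirmed Hub path; the complete agent loop lives in PokeeResearchOSS.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PokeeAI/PokeeResearch-7B"  # placeholder; replace with the real checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Plain chat-style generation, without the research/verification tool scaffold.
messages = [{"role": "user", "content": "Summarize the key idea behind RLAIF in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For the complete research-agent behavior (tool calls, self-correction, multi-thread synthesis), follow the repository README referenced in the *How to Get Started* section below.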
### Downstream Use

PokeeResearch-7B can be **fine-tuned** or **extended** for:

- Domain-specific scientific discovery
- Autonomous document retrieval and synthesis
- Multi-source verification and summarization pipelines
- Integration into reinforcement learning research agents (RLHF/RLAIF frameworks)

### Out-of-Scope Use

The model should **not** be used for:

- Generating unverified or speculative claims
- Automated decision-making in high-stakes domains (medical, legal, or financial)
- Applications requiring strict factual precision without external verification
- Generating content without citation or evidence tracing

---

## Bias, Risks, and Limitations

PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include:

- Dependence on **external data quality** and **retrieval accuracy**
- Potential **semantic bias** introduced by AI-based feedback signals
- Limited coverage for **non-English** or **multi-modal** reasoning tasks
- Risk of **hallucinated synthesis** when sources conflict or lack clarity

### Recommendations

Users should:

- Cross-verify answers, especially in multi-hop reasoning cases
- Monitor output for citation accuracy and alignment with source data
- Refrain from using outputs as sole evidence in decision-critical contexts

---

## How to Get Started with the Model

Please refer to the repository README for setup and usage instructions: [https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md](https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md)

---

## Training Details

### Training Data

- **Dataset:** MiroRL-GenQA (MiroMind AI, 2025)
- **Data characteristics:** Complex, multi-turn question–answer pairs requiring multi-step reasoning
- **Data filtering:** No data from the evaluation benchmarks was included; the model was trained only on open-domain text Q&A samples

### Training Procedure

#### Preprocessing

- Normalization and tokenization aligned with the Qwen2.5 tokenizer
- Structured prompt–response pairs in the agent's research/verification format, with dedicated tags delimiting the stages of each turn

#### Training Hyperparameters

- **Algorithm:** RLOO (REINFORCE Leave-One-Out)
- **Batch size:** 64
- **Research threads per prompt:** 8
- **Learning rate:** 3e-6
- **Context limit:** 32,768 tokens
- **Steps:** 140 fine-tuning iterations
- **Regularization:** None (no entropy or KL regularization)
- **Precision regime:** bf16 mixed precision

#### Reward Design

- Combined reward signal from:
  - **AI feedback** (semantic equivalence judged by an external LLM)
  - **Format adherence reward** (ensures correct agent behavior)
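As a concrete illustration of the training signal, the sketch below shows how an RLOO leave-one-out baseline can be computed over the 8 research threads sampled per prompt, and how a judge-based correctness reward could be combined with a format-adherence reward. The combination weight, reward scale, and function names are illustrative assumptions, not the exact PokeeResearch implementation.

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """REINFORCE Leave-One-Out: each thread's baseline is the mean reward
    of the *other* threads sampled for the same prompt.

    rewards: shape (k,), one scalar reward per research thread (k = 8 here).
    """
    k = rewards.shape[0]
    # Leave-one-out mean: (sum of all rewards - own reward) / (k - 1)
    loo_baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - loo_baseline

def combined_reward(judge_correct: bool, format_ok: bool,
                    format_weight: float = 0.1) -> float:
    """Illustrative combination of the two reward components described above;
    the actual weighting used for PokeeResearch-7B is not specified here."""
    return float(judge_correct) + format_weight * float(format_ok)

# Example: 8 threads for one prompt, scored by an (assumed) LLM judge + format check.
rewards = np.array([
    combined_reward(c, f)
    for c, f in [(True, True), (False, True), (True, True), (False, False),
                 (True, True), (False, True), (True, False), (False, True)]
])
print(rloo_advantages(rewards))
```

The leave-one-out baseline keeps the policy-gradient estimate unbiased without training a separate value network, which is why it pairs naturally with sampling several research threads per prompt; consistent with the hyperparameters above, no KL or entropy regularization is added on top.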
#### Speeds, Sizes, Times

- **Model size:** 7 billion parameters
- **Training duration:** ~5 days on 8 × NVIDIA A100 80GB GPUs
- **Checkpoint size:** ~13 GB

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

10 open-domain research and QA benchmarks:

- NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle, GAIA, BrowseComp, Humanity's Last Exam

#### Factors

- Benchmarks differ in reasoning depth, retrieval dependence, and factual-precision requirements.
- Evaluations are disaggregated by dataset difficulty and task type (single-hop vs. multi-hop).

#### Metrics

- Mean accuracy (mean@4), averaged over four independent research threads per question

### Results

**PokeeResearch-7B** and its **RTS variant** outperform all baselines at the 7B scale across all 10 benchmarks.

Highlights (mean@4 accuracy; **PR** and **PR+** denote PokeeResearch-7B without and with Research Threads Synthesis, respectively):

| **Method** | **HLE** | **GAIA** | **BrowseComp** | **BAMB** | **2WIKI** | **TQ** | **NQ** | **POPQA** | **MUSIQUE** | **HOTPOTQA** |
|---|---|---|---|---|---|---|---|---|---|---|
| R1searcher | 5.4 | 8.3 | 1.0 | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 |
| SearchR1 | 13.0 | 18.7 | 0.4 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 |
| ZeroSearch | 8.6 | 9.9 | 1.4 | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 |
| ASearcher | 13.8 | 22.1 | 3.2 | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 |
| DeepResearcher | 6.0 | 24.03 | 1.8 | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 |
| **PR** | **15.2** | **36.9** | **5.4** | **74.5** | **74.0** | **91.3** | **75.1** | **59.8** | **39.8** | **71.2** |
| **PR+** | **17.6** | **41.3** | **8.4** | **75.0** | **75.0** | **91.8** | **75.0** | **60.0** | **41.4** | **71.6** |

#### Summary

PokeeResearch-7B variants achieve **state-of-the-art performance among 7B-scale open deep research agents**, validating the RLAIF and reasoning-scaffold design for robust, verifiable research workflows.

---

## Technical Specifications

### Model Architecture and Objective

- **Base architecture:** Transformer decoder (Qwen2.5-7B-Instruct backbone)
- **Objective:** Reinforcement learning from AI feedback, maximizing semantic correctness and alignment with human-style reasoning

### Compute Infrastructure

#### Hardware

- 8 × NVIDIA A100 80GB GPUs for training; 1 × A100 80GB for inference

---

## Citation

**BibTeX:**

```bibtex
@article{pokee2025deepresearch,
  title={PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
  author={Yi Wan and Jiuqi Wang and Liam Li and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
  journal={Pokee AI Technical Report},
  year={2025},
  url={https://arxiv.org/pdf/2510.15862}
}
```

**APA:**

Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold.* Pokee AI.

---

## Glossary

- **RLAIF:** Reinforcement Learning from AI Feedback – optimization using LLM-based reward signals.
- **RLOO:** REINFORCE Leave-One-Out – an unbiased policy-gradient variant for on-policy learning.
- **RTS:** Research Threads Synthesis – synthesis of multiple independent reasoning threads at inference time.

---

## More Information

For technical details, visit: [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)

For inquiries, contact: hello@pokee.ai

---

## Model Card Authors

**Yi Wan**, **Jiuqi Wang**, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team

## Model Card Contact

Pokee AI Team — hello@pokee.ai