BAAI
/

Commit 547ce1e by wolfwjs · verified · 1 parent: 63d72b1

Create README.md

Files changed (1)
  1. README.md +106 -0
README.md ADDED
@@ -0,0 +1,106 @@
+ ---
+ license: apache-2.0
+ ---
+ <div align='center'>
+ <h1>Emu3.5: Native Multimodal Models are World Learners</h1>
+
+ Emu3.5 Team, BAAI
+
+ [Project Page](https://emu.world/) | [🤗HF Models](https://huggingface.co/collections/BAAI/emu35) | [Paper](https://arxiv.org/pdf/2510.26583)
+ </div>
+
+
+ <div align='center'>
+ <img src="https://github.com/baaivision/Emu3.5/blob/main/assets/arch.png?raw=True" class="interpolation-image" alt="arch." height="100%" width="100%" />
+ </div>
+
+
+ <div align='center'>
+ <img src="https://github.com/baaivision/Emu3.5/blob/main/assets/co.png?raw=True" class="interpolation-image" alt="arch." height="90%" width="90%" />
+ </div>
+
+
+ | 🔹 | **Core Concept** | **Description** |
+ | :-: | :--------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------- |
+ | 🧠 | **Unified World Modeling** | Predicts the **next state jointly across vision and language**, enabling coherent **world modeling** and **generation**. |
+ | 🧩 | **End-to-End Pretraining** | Trained with a **unified next-token prediction** objective over **interleaved vision–language sequences**. |
+ | 📚 | **10T+ Multimodal Tokens** | Pre-trained on **over 10 trillion interleaved tokens** from **video frames** and **transcripts**, capturing **spatiotemporal structure**. |
+ | 🔄 | **Native Multimodal I/O** | Processes and generates **interleaved visual–text sequences** without **modality adapters** or **task-specific heads**. |
+ | 🎯 | **RL Post-Training** | Large-scale **reinforcement learning** enhances **reasoning**, **compositionality**, and **generation quality**. |
+ | ⚡ | **Discrete Diffusion Adaptation (DiDA)** | Converts **sequential decoding** into **bidirectional parallel prediction**, achieving **≈20× faster inference without performance loss**. |
+ | 🖼️ | **Versatile Generation** | Excels in **long-horizon vision–language generation**, **any-to-image (X2I)** synthesis, and **text-rich image creation**. |
+ | 🌐 | **Generalizable World Modeling** | Enables **spatiotemporally consistent world exploration** and **open-world embodied manipulation** across diverse scenarios. |
+ | 🏆 | **Performance Benchmark** | Matches **Gemini 2.5 Flash Image (Nano Banana)** on **image generation/editing** and **outperforms** it on **interleaved generation tasks**. |
+
+
+
+ ## Table of Contents
+
+ 1. [Model & Weights](#1-model--weights)
+ 2. [Quick Start](#2-quick-start)
+ 3. [Schedule](#3-schedule)
+ 4. [Citation](#4-citation)
+
+ ## 1. Model & Weights
+
+ | Model name | HF Weight |
+ | ------------------------ | --------- |
+ | Emu3.5 | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5/tree/main) |
+ | Emu3.5-Image | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-Image/tree/main) |
+ | Emu3.5-VisionTokenizer | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer/tree/main) |
+
+ ## 2. Quick Start
+
+ ### Environment Setup
+
+ ```bash
+ git clone https://github.com/baaivision/Emu3.5
+ cd Emu3.5
+ pip install -r requirements.txt
+ pip install flash_attn==2.8.3 --no-build-isolation
+ ```
+ ### Configuration
+
+ Edit `configs/config.py` to set:
+
+ - Paths: `model_path`, `vq_path`
+ - Task template: `task_type`, one of `t2i`, `x2i`, `howto`, `story`, `explore`, or `vla`; `use_image` controls `<|IMAGE|>` usage (set it to true when reference images are provided)
+ - Sampling: `sampling_params` (`classifier_free_guidance`, `temperature`, `top_k`/`top_p`, etc.)
+
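As a rough illustration of the settings listed above, the sketch below collects them in a plain dictionary and sanity-checks them. Only the setting names and the six task types come from the README; the dictionary layout, the `check_config` helper, the paths, and every numeric value are hypothetical placeholders, and the actual `configs/config.py` in the repository may be organized differently.

```python
# Hypothetical sketch: setting names follow the README, all values are placeholders.
VALID_TASK_TYPES = {"t2i", "x2i", "howto", "story", "explore", "vla"}

example_cfg = {
    "model_path": "./weights/Emu3.5-Image",          # assumed local checkpoint directory
    "vq_path": "./weights/Emu3.5-VisionTokenizer",   # assumed local tokenizer directory
    "task_type": "t2i",
    "use_image": False,   # set to True when reference images are provided
    "sampling_params": {  # placeholder values, not recommended defaults
        "classifier_free_guidance": 3.0,
        "temperature": 1.0,
        "top_k": 50,
        "top_p": 1.0,
    },
}


def check_config(cfg):
    """Return a list of problems found in cfg; an empty list means none were found."""
    problems = []
    for key in ("model_path", "vq_path", "task_type", "use_image", "sampling_params"):
        if key not in cfg:
            problems.append(f"missing setting: {key}")
    if "task_type" in cfg and cfg["task_type"] not in VALID_TASK_TYPES:
        problems.append(f"unknown task_type: {cfg['task_type']!r}")
    if "use_image" in cfg and not isinstance(cfg["use_image"], bool):
        problems.append("use_image must be True or False")
    return problems


print(check_config(example_cfg))
```

A check like this can catch a mistyped `task_type` before a long model load; it is not part of the released code.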
+ ### Run Inference
+
+ ```bash
+ python inference.py --cfg configs/config.py
+ ```
+
+ Protobuf outputs are written to `outputs/<exp_name>/proto/`. For better throughput, we recommend ≥2 GPUs.
+
+ ### Visualize Protobuf Outputs
+
+ To visualize generated protobuf files:
+
+ ```bash
+ python src/utils/vis_proto.py --input <input_proto_file> --output <output_dir>
+ ```
+
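When a run produces several protobuf files, the visualization command above has to be repeated once per file. The following is a minimal sketch of how those commands could be assembled, assuming the files sit directly inside the proto directory. The `build_vis_commands` helper and the one-subdirectory-per-file output layout are illustrative choices, not part of the repository; only the script path and the `--input`/`--output` flags come from the README.

```python
from pathlib import Path


def build_vis_commands(proto_dir, output_dir):
    """Return one vis_proto.py argument list per regular file found in proto_dir."""
    commands = []
    for proto_file in sorted(Path(proto_dir).iterdir()):
        if not proto_file.is_file():
            continue  # skip subdirectories
        commands.append([
            "python", "src/utils/vis_proto.py",
            "--input", str(proto_file),
            # Assumption: give each input its own output subdirectory.
            "--output", str(Path(output_dir) / proto_file.stem),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, with `proto_dir` pointing at `outputs/<exp_name>/proto/` for the experiment in question.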
+ ## 3. Schedule
+
+ - [x] Inference Code
+ - [ ] Advanced Image Decoder
+ - [ ] Discrete Diffusion Adaptation (DiDA)
+
+
+ ## 4. Citation
+
+ ```bibtex
+ @misc{cui2025emu35nativemultimodalmodels,
+ title={Emu3.5: Native Multimodal Models are World Learners},
+ author={Yufeng Cui and Honghao Chen and Haoge Deng and Xu Huang and Xinghang Li and Jirong Liu and Yang Liu and Zhuoyan Luo and Jinsheng Wang and Wenxuan Wang and Yueze Wang and Chengyuan Wang and Fan Zhang and Yingli Zhao and Ting Pan and Xianduo Li and Zecheng Hao and Wenxuan Ma and Zhuo Chen and Yulong Ao and Tiejun Huang and Zhongyuan Wang and Xinlong Wang},
+ year={2025},
+ eprint={2510.26583},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2510.26583},
+ }
+ ```
+