✨ Compresses long sequences visually to bypass token limits ✨ Reduces computational and memory costs ✨ Preserves meaning through multimodal encoding ✨ Built on GLM-4.1V-9B-Base
✨ Any prior in → 3D world out ✨ Mix camera, intrinsics, depth as priors ✨ Predict point clouds, normals, Gaussians & more in one pass ✨ Unified architecture for all 3D tasks
If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. They're a black box, which makes replicating SOTA work impossible. We wanted to change that.
FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.
In the paper, we share how we built it: 🔍 finding and cleaning data at scale 🧹 removing duplicates across sources 🤗 decontaminating against 66 public benchmarks
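For intuition, here's a minimal sketch of what image-level decontamination can look like: hash every benchmark image once, then drop training images whose hash collides. The perceptual-hash approach via imagehash is an illustrative assumption here, not necessarily the pipeline described in the paper:

# Illustrative only: perceptual-hash decontamination against benchmark images.
# The actual FineVision pipeline is described in the paper.
from PIL import Image
import imagehash

def build_benchmark_index(benchmark_image_paths):
    # Hash every benchmark image once so each lookup is O(1).
    return {str(imagehash.phash(Image.open(p))) for p in benchmark_image_paths}

def is_contaminated(sample_image_path, benchmark_hashes):
    # Flag a training image whose perceptual hash collides with a benchmark image.
    return str(imagehash.phash(Image.open(sample_image_path))) in benchmark_hashes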
My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets. NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!
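If you want to probe visual diversity yourself, one common proxy is the spread of image embeddings: embed the images and average their pairwise cosine distances. This is a hedged sketch of that idea, not necessarily the paper's exact method; images is assumed to be a list of at least two PIL images:

# Sketch: mean pairwise cosine distance of CLIP embeddings as a diversity proxy.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def diversity_score(images):
    # Higher score = more visually diverse sample of images.
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    sim = emb @ emb.T                           # pairwise cosine similarities
    n = len(images)
    mean_off_diag = (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))
    return 1 - mean_off_diag.item()             # distance = 1 - similarity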
🎉 To celebrate the paper, I’m also releasing a concatenated and shuffled version of the full dataset! 👉 HuggingFaceM4/FineVision_full_shuffled
It’s ready to stream, so you can start training your own models right away:
from datasets import load_dataset

# Stream the dataset: samples arrive on the fly, no full download required.
d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)

# Peek at the first sample.
print(next(iter(d)))
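The release is already shuffled, so this is optional, but streaming datasets also support an on-the-fly buffered shuffle if you want fresh ordering per epoch. A quick sketch; the buffer size is an arbitrary choice:

# Optional: buffered shuffle on top of the pre-shuffled release.
# Larger buffers mix better but use more RAM.
d = d.shuffle(seed=42, buffer_size=10_000)

# Inspect the first few samples without downloading the whole dataset.
for i, sample in enumerate(d):
    print(sample.keys())
    if i == 2:
        break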
A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!
✨ Trained on Honey-Data-15M, a 15M-sample SFT corpus with dual-level CoT reasoning ✨ Backed by HoneyPipe, a transparent & reproducible open data curation suite
🌎 AI ethics and sustainability are two sides of the same coin.
In our new blog post with Dr. Sasha Luccioni, we argue that separating them (as is too often the case) means missing the bigger picture of how AI systems impact both people and the planet.
Ethical and sustainable AI development can’t be pursued in isolation. The same choices that affect who benefits or is harmed by AI systems also determine how much energy and resources they consume.
We explore how two key concepts, evaluation and transparency, can serve as bridges between these domains:
📊 Evaluation, by moving beyond accuracy and performance metrics to include environmental and social costs, as we’ve done with tools like the AI Energy Score (a toy sketch of the idea follows this list).
🔍 Transparency, by enabling reproducibility, accountability, and environmental reporting through open tools like the Environmental Transparency Space.
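To make the evaluation point concrete, here's a minimal sketch of what reporting environmental cost alongside accuracy can look like. It uses codecarbon for the energy estimate; this is not the AI Energy Score methodology, and model_fn / eval_samples are hypothetical placeholders:

# Sketch: fold an energy/carbon estimate into an ordinary eval loop.
from codecarbon import EmissionsTracker

def evaluate_with_footprint(model_fn, eval_samples):
    tracker = EmissionsTracker()   # measures energy use on this machine
    tracker.start()
    correct = sum(model_fn(x) == y for x, y in eval_samples)
    emissions_kg = tracker.stop()  # estimated kg CO2-eq for the run
    return {"accuracy": correct / len(eval_samples), "co2_kg": emissions_kg}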
AI systems mirror our priorities. If we separate ethics from sustainability, we risk building technologies that are efficient but unjust, or fair but unsustainable.