<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="description" content="MMLongBench-Doc">
<meta name="keywords" content="multimodal chatbot">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>MMLongBench-Doc</title>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.1/css/bulma.min.css">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.1/css/all.min.css">
<link rel="stylesheet" href="static/css/index.css">
<link rel="icon" href="https://mayubo2333.github.io/MMLongBench-Doc/static/images/logo.png">
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.1/js/all.min.js"></script>
<script type="module" src="https://gradio.s3-us-west-2.amazonaws.com/3.27.0/gradio.js"></script>
<style>
.section {
margin-bottom: -30px; /* Adjust this value as needed to reduce the space */
}
.expandable-card .card-text-container {
max-height: 200px;
overflow-y: hidden;
position: relative;
}
.expandable-card.expanded .card-text-container {
max-height: none;
}
.expand-btn {
position: relative;
display: none;
background-color: rgba(255, 255, 255, 0.8);
/* margin-top: -20px; */
/* justify-content: center; */
color: #510c75;
border-color: transparent;
}
.expand-btn:hover {
background-color: rgba(200, 200, 200, 0.8);
text-decoration: none;
border-color: transparent;
color: #510c75;
}
.expand-btn:focus {
outline: none;
text-decoration: none;
}
.expandable-card:not(.expanded) .card-text-container:after {
content: "";
position: absolute;
bottom: 0;
left: 0;
width: 100%;
height: 90px;
background: linear-gradient(rgba(255, 255, 255, 0.2), rgba(255, 255, 255, 1));
}
.expandable-card:not(.expanded) .expand-btn {
margin-top: -40px;
}
.card-body {
padding-bottom: 5px;
}
.vertical-flex-layout {
justify-content: center;
align-items: center;
height: 100%;
display: flex;
flex-direction: column;
gap: 5px;
}
.figure-img {
max-width: 100%;
height: auto;
}
.adjustable-font-size {
font-size: calc(0.5rem + 2vw);
}
.chat-history {
flex-grow: 1;
overflow-y: auto;
/* overflow-x: hidden; */
padding: 5px;
border-bottom: 1px solid #ccc;
margin-bottom: 10px;
}
#gradio pre {
background-color: transparent;
}
/* Rainbow text effect via a color gradient */
.rainbow-text {
background: linear-gradient(to right, #3498db, #2ecc71);
-webkit-background-clip: text;
color: transparent;
display: inline-block;
font-weight: bold;
}
</style>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title"> <span class="rainbow-text">MMLongBench-Doc</span>: Benchmarking Long-context Document Understanding with Visualizations</h1>
<div class="is-size-5 publication-authors">
<span class="author-block"> <a href="https://mayubo2333.github.io">Yubo Ma</a><sup>1</sup>,</span>
<span class="author-block"> <a href="https://yuhangzang.github.io">Yuhang Zang</a><sup>&dagger;2</sup><sup>,</span>
<span class="author-block"> <a href="https://cliangyu.com">Liangyu Chen</a><sup>1</sup>,</span>
<span class="author-block"> <a href="https://chenmeiqii.github.io">Meiqi Chen</a><sup>3</sup>,</span>
<span class="author-block"> <a href="https://yzjiao.github.io">Yizhu Jiao</a><sup>4</sup>, </span>
<span class="author-block"> <a href="index.html">Xinze Li</a><sup>1</sup>, </span>
<span class="author-block"> <a href="https://xinyuanlu00.github.io">Xinyuan Lu</a><sup>5</sup>, </span>
<span class="author-block"> <a href="https://liuziyu77.github.io">Ziyu Liu</a><sup>6</sup>, </span>
<span class="author-block"> <a href="index.html">Yan Ma</a><sup>7</sup>, </span>
<span class="author-block"> <a href="https://lightdxy.github.io">Xiaoyi Dong</a><sup>2</sup>, </span>
<span class="author-block"> <a href="https://panzhang0212.github.io">Pan Zhang</a><sup>2</sup>, </span>
<span class="author-block"> <a href="http://www.liangmingpan.com">Liangming Pan</a><sup>8</sup>, </span>
<span class="author-block"> <a href="index.html">Yu-Gang Jiang</a><sup>9</sup>, </span>
<span class="author-block"> <a href="https://myownskyw7.github.io">Jiaqi Wang</a><sup>2</sup>, </span>
<span class="author-block"> <a href="https://sites.google.com/view/yixin-homepage">Yixin Cao</a><sup>&dagger;9</sup><sup>,</span>
<span class="author-block"> <a href="https://personal.ntu.edu.sg/axsun">Aixin Sun</a><sup>1</sup> </span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block"><sup>1</sup>S-Lab, Nanyang Technological University</span>
<span class="author-block"><sup>2</sup>Shanghai AI Laboratory</span>
<span class="author-block"><sup>3</sup>Peking University</span>
<span class="author-block"><sup>4</sup>University of Illinois Urbana-Champaign</span>
<span class="author-block"><sup>5</sup>National University of Singapore</span>
<span class="author-block"><sup>6</sup>Wuhan University</span>
<span class="author-block"><sup>7</sup>Singapore Management University</span>
<span class="author-block"><sup>8</sup>University of California, Santa Barbara</span>
<span class="author-block"><sup>9</sup>Fudan University</span>
</div>
<div class="is-size-6 publication-authors">
<span class="author-block"><sup>&dagger;</sup>Corresponding authors.</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block"> <a href="https://arxiv.org/abs/2407.01523"
class="external-link button is-normal is-rounded is-dark"> <span class="icon"> <i class="ai ai-arxiv"></i> </span> <span>arXiv</span> </a> </span>
<!-- Code Link. -->
<span class="link-block"> <a href="https://github.com/mayubo2333/MMLongBench-Doc"
class="external-link button is-normal is-rounded is-dark"> <span class="icon"> <i class="fab fa-github"></i> </span> <span>Code</span> </a> </span>
<!-- HuggingFace Link. -->
<span class="link-block"> <a href="https://huggingface.co/datasets/yubo2333/MMLongBench-Doc"
class="external-link button is-normal is-rounded is-dark"><span class="icon">🤗</span><span>Dataset</span> </a></span>
<!-- Twitter Link. -->
<span class="link-block"> <a href="https://x.com/mayubo2333/status/1811015423921639855"
class="external-link button is-normal is-rounded is-dark"><span class="icon">🌐</span><span>Twitter</span> </a></span>
<section class="section">
<div class="container is-max-desktop">
<div style="text-align: center;">
<img id="pipeline" width="105%" src="static/images/top_figure.png">
</div>
<div class="columns is-centered has-text-centered">
<div class="column is-six-fifths">
<div style="text-align: center;">
</div><br>
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents <b>MMLongBench-Doc</b>, a long-context, multi-modal benchmark comprising 1,091 expert-annotated questions. Distinct from previous datasets, it is constructed upon 135 lengthy PDF-formatted documents with an average of 47.5 pages and 21,214 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (<i>i.e.</i>, page number). Moreover, 33.0% of the questions are <i>cross-page questions</i> requiring evidence across multiple pages. 22.5% of the questions are designed to be <i>unanswerable</i> for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 44.9%, while the second-best, GPT-4V, scores 30.5%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs.
</p>
</div>
</div>
</div>
</div>
</section>
<section class="section" style="background-color:#efeff081" id="Highlight">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-six-fifths">
<h2 class="title is-3">🔥Highlight</h2>
<div class="content has-text-justified">
<p style="font-size: 15px;">
<ul>
<li><b>Multi-modality: </b>All selected documents are PDF-formatted with rich layouts and multi-modal components including text, tables, charts, and images. Questions are carefully annotated from this multi-modal evidence.</li>
<li><b>Long-context: </b>Documents average 47.5 pages and 21,214 textual tokens. Additionally, 33.0% of the questions are <i>cross-page</i> questions that require collecting and reasoning over evidence from multiple pages.</li>
<li><b>Challenging: </b> Experiments on 14 LVLMs demonstrate that long-context document understanding greatly challenges current models. Even the best-performing LVLM, GPT-4o, achieves an overall F1 score of only 44.9%.</li>
</ul>
</p>
</div>
</div>
</div>
</div>
</section><br>
<section class="section" id="Benchmark Overview">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-six-fifths">
<h2 class="title is-3"> <span class="rainbow-text">MMLongBench-Doc</span> Overview</h2>
</div>
</div>
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full-width">
<div class="content has-text-justified">
<p>
Automatically understanding lengthy documents (long-context Document Understanding, DU) is a long-standing task with urgent, practical needs. Although many LVLMs now claim (and show promising cases of) long-context DU capabilities, a unified, quantitative evaluation of existing models has been missing due to the absence of a suitable benchmark.
<br>
To bridge this gap, we construct <span class="rainbow-text">MMLongBench-Doc</span>, which comprises 135 documents and 1,091 questions, each accompanied by a short, deterministic reference answer and detailed meta-information. The documents have an average of 47.5 pages and 21,214 tokens, cover 7 diverse domains, and are PDF-formatted with rich layouts and multi-modal components. The questions are either curated from existing datasets or newly annotated by expert-level annotators. Towards a comprehensive evaluation, the questions cover different evidence sources (text, table, chart, image, <i>etc.</i>) and different locations (page indices) in the documents. Notably, 33.0% of the questions are cross-page questions necessitating comprehension and reasoning over evidence across multiple pages, and 22.5% are designed to be unanswerable, which reduces shortcuts in the benchmark and probes LVLMs' hallucinations.
</p>
<div style="text-align: center;">
<img id="teaser" width="80%" src="static/images/dataset_overview.png">
</div>
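<p>
A minimal loading sketch with the Hugging Face <code>datasets</code> library is shown below; only the dataset name comes from the link above, while the split handling and the field names mentioned in the comments are assumptions, so please consult the dataset card for the actual schema.
</p>
<pre><code class="language-python"># Minimal sketch (assumptions noted): load the MMLongBench-Doc question set.
from datasets import load_dataset

dataset = load_dataset("yubo2333/MMLongBench-Doc")   # dataset name from the link above
split = next(iter(dataset.values()))                 # take whichever split is provided

print(len(split), "questions")
print(sorted(split[0].keys()))  # inspect the actual field names (e.g. question/answer) first
</code></pre>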
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section" id="Construction">
<div class="columns is-centered has-text-centered">
<div class="column is-six-fifths">
<h2 class="title is-3"> <span class="rainbow-text">MMLongBench-Doc</span> Construction</h2>
</div>
</div>
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full-width">
<div class="content has-text-justified">
<p>
The annotation pipeline of <span class="rainbow-text">MMLongBench-Doc</span> includes three stages. <b>(1) Document Collection:</b> We firstly curate documents from both existing datasets and new resources. <b>(2) Q&A Collection:</b> We manually edit existing questions (evaluate and revise/remove them if necessary) and/or create new questions. Answers and other meta-information are also annotated in this stage. <b>(3) Quality Control:</b> We further adopt a semi-automatic, three-step quality control to revise/remove these unqualified questions and correct problematic answers.
<div style="text-align: center;">
<img id="teaser" width="100%" src="static/images/annotation_pipeline.png">
</div>
</p>
</div>
</div>
</div>
</div>
</section>
<section class="section" id="Evaluation">
<div class="columns is-centered has-text-centered">
<div class="column is-six-fifths">
<h2 class="title is-3"> <span class="rainbow-text">MMLongBench-Doc</span> Evaluation</h2>
</div>
</div>
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full-width">
<div class="content has-text-justified">
<p>
We conduct extensive experiments on <span class="rainbow-text">MMLongBench-Doc</span> to evaluate the long-context DU abilities of 14 LVLMs (fed with PNG-formatted page screenshots). For comparison, we also evaluate 10 LLMs (fed with OCR-parsed, TXT-formatted tokens); both input formats are sketched after the findings below. Our key findings are summarized as follows.
<ul>
<li> Our benchmark poses significant challenges to current LVLMs. The best-performing LVLM, GPT-4o, achieves an overall F1 score of only 44.9%, while the second-best LVLM, GPT-4V, scores 30.5%. Among the evaluated open-source LVLMs, only InternVL-Chat-v1.5 achieves an F1 score above 10%.</li>
<li> 12 out of 14 LVLMs (all except GPT-4o and GPT-4V) perform worse than their LLM counterparts, even though the LLMs are fed with lossy, OCR-parsed texts. We speculate that the scarcity of related pre-training corpora (<i>i.e.</i>, documents that are very long or contain many images) hinders the long-context DU capabilities of LVLMs. We leave related explorations to future work.</li>
</ul>
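<p>
Below is a minimal sketch of the two input formats compared above, assuming PyMuPDF for page rendering and text extraction: page screenshots for LVLMs and parsed text for LLMs. The benchmark's actual preprocessing (rendering resolution, OCR tooling, and truncation) may differ.
</p>
<pre><code class="language-python"># Illustrative sketch only: turn one PDF document into the two evaluated input formats.
# PyMuPDF is an assumption here; the paper's exact preprocessing may differ.
import fitz  # PyMuPDF

def prepare_inputs(pdf_path: str, dpi: int = 144):
    screenshots, texts = [], []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=dpi)      # render the page as an image
            png_path = f"page_{i:03d}.png"
            pix.save(png_path)                  # PNG screenshot input for LVLMs
            screenshots.append(png_path)
            texts.append(page.get_text())       # lossy parsed-text input for LLMs
    return screenshots, "\n".join(texts)
</code></pre>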
<div style="text-align: center;">
<img id="teaser" width="85%" src="static/images/eval_result.png">
</div>
Further fine-grained analysis reveals that:
<ul>
<li> <b>Document Type:</b> Performance varies across document types. Generally speaking, all models demonstrate decent performance on industrial documents (which have more standardized formats and less non-textual information), but most models perform significantly worse on specialized domains such as academic papers and financial reports.</li>
<li> <b>Evidence Source:</b> Only GPT-4o exhibits relatively balanced performance across different evidence sources. LLMs demonstrate better or comparable performance relative to LVLMs on text- and table-related questions, but worse performance on questions about other sources. This highlights the limitations of OCR when dealing with charts and images.</li>
</ul>
<br><br>
<div style="text-align: center;">
<img id="teaser" width="75%" src="static/images/radar.png">
</div>
</p>
</div>
</div>
</div>
</div>
</section>
<section class="section" id="Case Study">
<div class="columns is-centered has-text-centered">
<div class="column is-six-fifths">
<h2 class="title is-3"> Case Study</span> </h2>
</div>
</div>
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full-width">
<div class="content has-text-justified">
<p>
Comparing the responses from six models reveals that GPT-4o outperforms all other models by a significant margin in terms of perceiving visual information, reasoning over multiple pages, and generating richer responses.
<div style="text-align: center;">
<img id="teaser" width="100%" src="static/images/case_study.png">
</div>
</p>
</div>
</div>
</div>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This website is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>