|
|
<!DOCTYPE html> |
|
|
<html lang="en"> |
|
|
<head> |
|
|
<meta charset="utf-8"> |
|
|
<meta name="description" content="Causal Graphical Models for Vision-Language Compositional Understanding"> |
|
|
<meta name="keywords" content="Vision-and-Language, Compositionality, Retrieval"> |
|
|
<meta name="viewport" content="width=device-width, initial-scale=1"> |
|
|
<title>Causal Graphical Models for Vision-Language Compositional Understanding</title> |
|
|
|
|
|
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet"> |
|
|
|
|
|
<link rel="stylesheet" href="static/css/bulma.min.css"> |
|
|
<link rel="stylesheet" href="static/css/bulma-carousel.min.css"> |
|
|
<link rel="stylesheet" href="static/css/bulma-slider.min.css"> |
|
|
<link rel="stylesheet" href="static/css/fontawesome.all.min.css"> |
|
|
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css"> |
|
|
<link rel="stylesheet" href="static/css/index.css"> |
|
|
<link rel="icon" href="static/images/favicon.png"> |
|
|
|
|
|
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script> |
|
|
<script defer src="static/js/fontawesome.all.min.js"></script> |
|
|
<script src="static/js/bulma-carousel.min.js"></script> |
|
|
<script src="static/js/bulma-slider.min.js"></script> |
|
|
<script src="static/js/index.js"></script> |
|
|
|
|
|
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css"> |
|
|
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@300&display=swap" rel="stylesheet"> |
|
|
<style> |
|
|
body { |
|
|
font-family: 'Roboto', sans-serif; |
|
|
background-color: #e8f5e9; |
|
|
color: #333; |
|
|
line-height: 1.6; |
|
|
} |
|
|
.jumbotron { |
|
|
background: linear-gradient(135deg, #388e3c, #66bb6a); |
|
|
color: white; |
|
|
padding: 2rem 1rem; |
|
|
margin-bottom: 1rem; |
|
|
border-radius: 0.3rem; |
|
|
} |
|
|
.display-4 { |
|
|
font-size: 3rem; |
|
|
font-weight: 700; |
|
|
} |
|
|
.lead { |
|
|
font-size: 1rem; |
|
|
font-weight: 300; |
|
|
color: white; |
|
|
} |
|
|
.section { |
|
|
padding: 1.5rem 0; |
|
|
} |
|
|
.section-title { |
|
|
border-bottom: 2px solid #2e7d32; |
|
|
margin-bottom: 1rem; |
|
|
padding-bottom: 0.5rem; |
|
|
color: #1b5e20; |
|
|
} |
|
|
.qualitative-img { |
|
|
max-width: 100%; |
|
|
border-radius: 8px; |
|
|
transition: transform 0.3s ease-in-out; |
|
|
} |
|
|
.qualitative-img:hover { |
|
|
transform: scale(1.05); |
|
|
} |
|
|
.bibtex-block { |
|
|
background-color: #c8e6c9; |
|
|
padding: 1rem; |
|
|
border-radius: 0.25rem; |
|
|
overflow-x: auto; |
|
|
font-family: monospace; |
|
|
} |
|
|
.footer { |
|
|
text-align: center; |
|
|
padding: 1rem 0; |
|
|
background-color: #a5d6a7; |
|
|
} |
|
|
.lead a { |
|
|
color: white; |
|
|
text-decoration: none; |
|
|
} |
|
|
.lead a:hover { |
|
|
text-decoration: underline; |
|
|
} |
|
|
.author-link { |
|
|
font-family: monospace; |
|
|
font-style: italic; |
|
|
margin: 0 10px; |
|
|
} |
|
|
.iclr-space { |
|
|
margin: 10px 0; |
|
|
font-size: 20px; |
|
|
color: #333; |
|
|
} |
|
|
title { |
|
|
font-weight: bold; |
|
|
} |
|
|
.button-container { |
|
|
display: flex; |
|
|
justify-content: center; |
|
|
gap: 10px; |
|
|
margin-top: 20px; |
|
|
} |
|
|
.icon-button { |
|
|
background-color: #333; |
|
|
color: white; |
|
|
border: none; |
|
|
padding: 10px 20px; |
|
|
border-radius: 20px; |
|
|
display: flex; |
|
|
align-items: center; |
|
|
gap: 5px; |
|
|
cursor: pointer; |
|
|
} |
|
|
.icon { |
|
|
height: 20px; |
|
|
} |
|
|
.section-content { |
|
|
max-width: 800px; |
|
|
margin: 0 auto; |
|
|
} |
|
|
|
|
|
.init-content { |
|
|
max-width: 700px; |
|
|
margin: 0 auto; |
|
|
} |
|
|
</style> |
|
|
</head> |
|
|
<body> |
|
|
|
|
|
<div class="jumbotron text-center"> |
|
|
<img src="static/images/logo.png" alt="ICLR 2025 Logo" class="img-fluid mb-3" style="max-height: 80px;"> |
|
|
<h1 class="display-4">Causal Graphical Models for Vision-Language Compositional Understanding</h1> |
|
|
<p class="lead"> |
|
|
<span class="iclr-space" style="margin-bottom: 2rem;">ICLR 2025<br></span> |
|
|
<span class="author-link"><a href="https://github.com/FiorenzoParascandolo1" target="_blank">Fiorenzo Parascandolo</a></span> |
|
|
<span class="author-link"><a href="https://nicholasmoratelli.github.io" target="_blank">Nicholas Moratelli</a></span> |
|
|
<span class="author-link"><a href="https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=144" target="_blank">Enver Sangineto</a></span> |
|
|
<span class="author-link"><a href="https://www.lorenzobaraldi.com/" target="_blank">Lorenzo Baraldi</a></span> |
|
|
<span class="author-link"><a href="https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=1" target="_blank">Rita Cucchiara</a></span> <br> |
|
|
University of Modena and Reggio Emilia <br> |
|
|
</p> |
|
|
|
|
|
<div class="button-container"> |
|
|
<span class="link-block"> |
|
|
<a href="https://github.com/aimagelab/COGT" class="external-link button is-normal is-rounded is-dark"> |
|
|
<span class="icon"> |
|
|
<i class="fab fa-github"></i> |
|
|
</span> |
|
|
<span>Code</span> |
|
|
</a> |
|
|
</span> |
|
|
<span class="link-block"> |
|
|
<a href="https://arxiv.org/pdf/2412.09353" class="external-link button is-normal is-rounded is-dark"> |
|
|
<span class="icon"> |
|
|
<i class="ai ai-arxiv"></i> |
|
|
</span> |
|
|
<span>arXiv</span> |
|
|
</a> |
|
|
</span> |
|
|
<span class="link-block"> |
|
|
<a class="external-link button is-normal is-rounded is-dark"> |
|
|
🤗 Models |
|
|
</a> |
|
|
</span> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
<div class="container section"> |
|
|
<div class="init-content"> |
|
|
|
|
|
<p><i> |
|
|
This paper introduces <u><b>COGT</b></u>, a novel approach for enhancing the compositional understanding of Vision-Language Models |
|
|
by modeling the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM). |
|
|
</i></p> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
<div class="container section"> |
|
|
<div class="section-content"> |
|
|
<h1 class="section-title">Abstract</h1> |
|
|
<p> |
|
|
Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional
properties of human language, usually modeling an image caption as a "bag of words". In this paper, we model
the dependency relations among textual and visual tokens using a <i><b>Causal Graphical Model (CGM)</b></i>, built using a
<i><b>dependency parser</b></i>, and we train a decoder conditioned on the VLM visual encoder. Unlike standard
autoregressive or parallel prediction, our decoder's generative process is partially ordered, following the CGM
structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence,
discarding spurious correlations. Through extensive experiments on five compositional benchmarks, we show that
our method outperforms all state-of-the-art compositional approaches by a large margin,
and that it also improves over methods trained on much larger datasets.
|
|
</p> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
<div class="container section"> |
|
|
<div class="section-content"> |
|
|
<h1 class="section-title">Method</h1> |
|
|
|
|
|
<h3 class="section-title" style="font-size: 1.5em; margin-top: 2rem;">Causal Graphical Model (CGM) Construction</h3> |
|
|
|
|
<div style="text-align: center; margin-top: 2rem; margin-bottom: 2rem;"> |
|
|
<img src="static/images/method.png" alt="Qualitative Result 1" class="qualitative-img"> |
|
|
</div> |
|
|
<p> |
|
|
We use an off-the-shelf <i>dependency parser</i>, which creates a syntactic tree from a given textual sentence. Specifically, given a caption, a dependency parser automatically builds a <i>Dependency Tree</i> (DT), in which each node is associated with a caption word and each edge represents a syntactic dependency relation between two words. |
|
|
The DT, together with the visual features extracted from the image by a frozen visual encoder, is used to build a CGM, which describes the dependency relations among image patches and textual tokens. Our token prediction strategy is based on the dependency relations contained in this CGM.
|
|
The rationale behind this approach is illustrated in the figure using the caption "A brown bird has a small yellow head". For instance, in the resulting DT, the adjective "brown" depends on the noun "bird". |
|
|
</p> |
|
|
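<p>
As an illustration of this first step, the sketch below extracts a Dependency Tree from a caption with an off-the-shelf parser. Using spaCy is an assumption made here for illustration only; the actual parser and CGM-construction code are those released in the COGT repository.
</p>
<pre>
# Minimal sketch (assumption: spaCy as the off-the-shelf dependency parser).
import spacy

nlp = spacy.load("en_core_web_sm")               # small English pipeline
doc = nlp("A brown bird has a small yellow head")

# Each (head, child) pair is a syntactic dependency edge of the DT.
for token in doc:
    print(token.head.text, "->", token.text, token.dep_)
# e.g. "bird -> brown amod": the adjective "brown" depends on the noun "bird".
</pre>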
|
|
|
|
|
<h3 class="section-title" style="font-size: 1.5em; margin-top: 2rem;">Dependency Guided Attention for Token Prediction</h3> |
|
|
|
|
<div style="text-align: center; margin-top: 2rem; margin-bottom: 2rem;"> |
|
|
<img src="static/images/architecture.png" alt="Qualitative Result 1" class="qualitative-img"> |
|
|
</div> |
|
|
<p>
This figure shows the high-level architecture of our decoder. Each block of \(\mathcal{D}\) is composed of two layers.
|
|
In the first layer, we compute the self-attention of each masked embedding \(\mathbf{m}_j\) with itself, jointly with the attention of \(\mathbf{m}_j\) with all the visible embeddings \(\mathbf{v}_{i_1}, ..., \mathbf{v}_{i_k}\), where |
|
|
\[\mathbf{PA}(W_j) = \{ W_{i_1}, ..., W_{i_k}, S_j, Z_1, ..., Z_m \}.\] |
|
|
Note that there is no attention between \(\mathbf{m}_{j_1}\) and \(\mathbf{m}_{j_2}\), with \(j_1 \neq j_2\). |
|
|
In the same layer, we compute the self-attention of each visible embedding \(\mathbf{v}_j\) with itself, jointly with the attention of \(\mathbf{v}_j\) with \(\mathbf{v}_{i_1}, ..., \mathbf{v}_{i_k}\). |
|
|
Note that there is no information leak, since \(\mathbf{m}_j\), later used for the final prediction, has no direct or indirect access to \(\mathbf{v}_j\). |
|
|
We call this <em>Dependency Guided Attention</em> to differentiate it from the standard self-attention. |
|
|
In the second layer of each block of \(\mathcal{D}\), both the masked (\(\mathbf{m}_j\)) and the visible (\(\mathbf{v}_j\)) embeddings attend to the visual features in \(\mathcal{Z}\) through cross-attention, thereby implementing the dependence of \(W_j\) on \(Z_1, ..., Z_m\).
Finally, after the last block of \(\mathcal{D}\), we discard the visible-token embeddings and feed each masked-token final embedding to a linear layer that computes a posterior distribution over the vocabulary of textual terms.
|
|
</p> |
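<p>
To make the masking scheme concrete, the sketch below builds the boolean attention mask used by the first layer of a block. It is a minimal illustration assuming that masked and visible embeddings are concatenated in a single sequence and that the parents \(\mathbf{PA}(W_j)\) are given as index lists; it omits the visual tokens handled by the cross-attention layer, and the released COGT code may organize this differently.
</p>
<pre>
import torch

def dependency_guided_mask(parents, n):
    # parents[j]: indices of the textual parents of word j in the CGM.
    # Positions 0..n-1 are the masked embeddings m_j, positions n..2n-1 the visible v_j.
    # mask[q, k] = True means that query q may attend to key k.
    mask = torch.zeros(2 * n, 2 * n, dtype=torch.bool)
    for j in range(n):
        mask[j, j] = True                # m_j attends to itself ...
        for i in parents[j]:
            mask[j, n + i] = True        # ... and to its parents' visible embeddings
                                         # (never to v_j, never to other masked embeddings).
        mask[n + j, n + j] = True        # v_j attends to itself ...
        for i in parents[j]:
            mask[n + j, n + i] = True    # ... and to its parents' visible embeddings.
    return mask

# Toy example: in "A brown bird", "A" (0) and "brown" (1) both depend on "bird" (2).
print(dependency_guided_mask(parents=[[2], [2], []], n=3).int())
</pre>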
|
|
</div> |
|
|
</div> |
|
|
|
|
|
<div class="container section"> |
|
|
<div class="section-content"> |
|
|
<h1 class="section-title">Qualitative Results</h1> |
|
|
<div style="text-align: center; margin-top: 2rem;"> |
|
|
<img src="static/images/sugar_crepe.png" alt="Qualitative Result 1" class="qualitative-img"> |
|
|
<p class="caption">Qualitative results on sample images of SugarCrepe.</p> |
|
|
</div> |
|
|
<div style="text-align: center; margin-top: 2rem;"> |
|
|
<img src="static/images/color_swap.png" alt="Qualitative Result 2" class="qualitative-img"> |
|
|
<p class="caption">Qualitative results on sample images of ColorSwap.</p> |
|
|
</div> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
<div class="container section"> |
|
|
<h1 class="section-title">BibTeX</h1> |
|
|
<div class="bibtex-block"> |
|
|
<pre> |
|
|
@InProceedings{parascandolo2024causal, |
|
|
title={Causal Graphical Models for Vision-Language Compositional Understanding}, |
|
|
author={Parascandolo, Fiorenzo and Moratelli, Nicholas and Sangineto, Enver and Baraldi, Lorenzo and Cucchiara, Rita}, |
|
|
booktitle={Proceedings of The Thirteenth International Conference on Learning Representations, ICLR}, |
|
|
year={2025} |
|
|
} |
|
|
</pre> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
<footer class="footer"> |
|
|
<p>© 2025 Causal Graphical Models for Vision-Language Compositional Understanding</p> |
|
|
</footer> |
|
|
|
|
|
<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script> |
|
|
<script src="https://cdn.jsdelivr.net/npm/@popperjs/core@2.9.3/dist/umd/popper.min.js"></script> |
|
|
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script> |
|
|
<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML"></script> |
|
|
|
|
|
</body> |
|
|
</html> |