Spaces:
Running
Running
| <html> | |
| <head> | |
| <title>Hibiki</title> | |
| <link | |
| href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css" | |
| rel="stylesheet" | |
| /> | |
| <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.3/font/bootstrap-icons.min.css"> | |
| <meta charset="utf-8" /> | |
| <meta name="viewport" content="width=device-width, initial-scale=1" /> | |
| <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script> | |
| <script src="https://unpkg.com/wavesurfer.js@7"></script> | |
| <script src="helper.js" defer></script> | |
| <script> | |
| function _setup_callback(elem, elems) { | |
| elem.addEventListener("play", function () { | |
| for (other of elems) { | |
| if (other !== elem) { | |
| other.pause(); | |
| } | |
| } | |
| }); | |
| } | |
| document.addEventListener('DOMContentLoaded', function () { | |
| var elems = document.body.getElementsByTagName("audio"); | |
| for (elem of elems) { | |
| _setup_callback(elem, elems); | |
| } | |
| }); | |
| </script> | |
| <style> | |
| td { | |
| vertical-align: middle; | |
| text-align: center; | |
| } | |
| audio { | |
| width: 20vw; | |
| min-width: 100px; | |
| max-width: 100%; | |
| } | |
| h1, h2, h3, h4, h5, h6, body, b, strong, th { | |
| color: #595959; | |
| } | |
| .ratio-8x5 { | |
| --bs-aspect-ratio: 62.5%; | |
| } | |
| .btn-secondary { | |
| padding: 0.1rem 0.8rem; | |
| font-size: small | |
| } | |
| .container { | |
| max-width: 1620px; | |
| } | |
| </style> | |
| </head> | |
| <body> | |
| <div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded"> | |
| <div class="text-center"> | |
| <h1>High-Fidelity Simultaneous Speech-To-Speech Translation</h1> | |
| <p class="lead"> | |
| <a href="https://kyutai.org">Kyutai</a> | |
| - code on <a href="https://github.com/kyutai-labs/hibiki">github</a> | |
| </p> | |
| </div> | |
| <p> | |
| <b>Abstract.</b> | |
| We introduce <i>Hibiki</i> ('echo' in Japanese) | |
| Hibiki leverages a multistream language model to synchronously process | |
| source and target speech, and jointly produces text and audio tokens to | |
| perform speech-to-text and speech-to-speech translation. | |
| We furthermore address the fundamental challenge of <i>simultaneous</i> interpretation, | |
| which unlike its <i>consecutive</i> counterpart---where one waits for | |
| the end of the source utterance to start translating--- adapts its flow | |
| to accumulate just enough context to produce a correct translation in | |
| real-time, chunk by chunk. <br /> | |
| To do so, we introduce a weakly-supervised method that leverages the | |
| perplexity of an off-the-shelf text translation system to identify | |
| optimal delays on a per-word basis and create aligned synthetic data. | |
| After supervised training, Hibiki performs adaptive, simultaneous | |
| speech translation with vanilla temperature sampling. On a | |
| French-English simultaneous speech translation task, Hibiki demonstrates | |
| state-of-the-art performance in translation quality, speaker fidelity | |
| and naturalness. Moreover, the simplicity of its inference process | |
| makes it compatible with batched translation and even real-time | |
| on-device deployment. | |
| </p> | |
| </div> | |
| <div class="container shadow p-5 mb-5 bg-white rounded"> | |
| <h3>In the Wild Examples<a id="vis"/></h3> | |
| <p class="mb-0"> | |
| </p> | |
| <div class="container pt-3 table-responsive"> | |
| <table class="table table-hover" width="100%"> | |
| <tr> | |
| <td witdth="50%"> | |
| <video class="embed-responsive-item" style="max-width: 80%; min-width: 400px;" controls> | |
| <source src="videos/RPckvIkNWhE_ss301_to390_babel_numerique_arte.mp4" type="video/mp4"> | |
| Your browser does not support HTML video. | |
| </video> | |
| </td> | |
| <td width="50%"> | |
| <video class="embed-responsive-item" style="max-width: 80%; min-width: 400px;" controls> | |
| <source src="videos/uNAmODXvAiQ_ss9_message_a_caractere_informatif.mp4" type="video/mp4"> | |
| Your browser does not support HTML video. | |
| </video> | |
| </td> | |
| <tr> | |
| <td> | |
| This example comes from a video explaining automated translation. | |
| (<a href="https://www.youtube.com/watch?v=RPckvIkNWhE" target="_blank">source</a>, original video (c) Arte) | |
| </td> | |
| <td> | |
| This example comes from a humoristic video. The source voice is high pitch on purpose, | |
| it is a good showcase of how well Hibiki replicates pitch and prosody and how robust it is to | |
| background noise <b>as no denoising is applied to the audio which is fed raw to Hibiki</b>. | |
| (<a href="https://www.youtube.com/watch?v=uNAmODXvAiQ" target="_blank">source</a>, original video (c) Canal+) | |
| </td> | |
| </tr> | |
| </table> | |
| </div> | |
| </div> | |
| <div class="container shadow p-5 mb-5 bg-white rounded"> | |
| <h3>Examples with Ground Truth Interpretation<a id="vis"/></h3> | |
| <p class="mb-0"> | |
| These samples come from the VoxPopuli dataset where the ground truth is real human | |
| interpretation. | |
| The volume for the sources has been reduced so that it's easier to hear the translations. | |
| </p> | |
| <div class="container pt-3 table-responsive"> | |
| <table class="table table-hover" id="vis-table2"></table> | |
| </div> | |
| </div> | |
| <div class="container shadow p-5 mb-5 bg-white rounded"> | |
| <h3>Multistream Visualization<a id="vis"/></h3> | |
| <p class="mb-0"> | |
| The audio for the source and translated versions are on different channels. Use headphones | |
| to hear both at the same time. These samples are the same as in the voxpopuli section with CFG | |
| set to 3. | |
| </p> | |
| <div class="container pt-3 table-responsive"> | |
| <table class="table table-hover" id="vis-table"></table> | |
| </div> | |
| </div> | |
| <div class="container shadow p-5 mb-5 bg-white rounded"> | |
| <h3>Impact of Classifier-Free Guidance<a id="voxpopuli"/></h3> | |
| <p class="mb-0"> | |
| Samples taken from the VoxPopuli dataset. The Hibiki samples are presented with different levels | |
| of classifier-free guidance (CFG). The higher the CFG value, the closer the generated voice will | |
| be to the original voice. This results in very strong accents for the generations with the higher | |
| values. | |
| </p> | |
| <div class="container pt-3 table-responsive"> | |
| <table | |
| class="table table-hover" | |
| id="voxpopuli-table" | |
| > | |
| <thead> | |
| <tr> | |
| <th style="text-align: center">Source</th> | |
| <th style="text-align: center">Hibiki CFG-1</th> | |
| <th style="text-align: center">Hibiki CFG-3</th> | |
| <th style="text-align: center">Hibiki CFG-10</th> | |
| <th style="text-align: center">Seamless</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| </div> | |
| <div class="container shadow p-5 mb-5 bg-white rounded"> | |
| <h3>Long-form Simultaneous Translations<a id="ntrex"/></h3> | |
| <p class="mb-0"> | |
| Samples taken from the audio NTREX dataset. | |
| </p> | |
| <div class="container pt-3 table-responsive"> | |
| <table | |
| class="table table-hover" | |
| id="ntrex-table" | |
| > | |
| <thead> | |
| <tr> | |
| <th style="text-align: center;min-width: 200px;">Source</th> | |
| <th style="text-align: center;">Hibiki</th> | |
| <th style="text-align: center">Seamless</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td></tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| </div> | |
| <div class="container shadow p-5 mb-5 bg-white rounded"> | |
| <h3>Short-form Simultaneous Translations<a id="cvss-c"/></h3> | |
| <p class="mb-0"> | |
| Samples taken from the CVSS-C dataset. | |
| </p> | |
| <div class="container pt-3 table-responsive"> | |
| <table | |
| class="table table-hover" | |
| id="cvss-table" | |
| > | |
| <thead> | |
| <tr> | |
| <th style="text-align: center;min-width: 200px;">Source</th> | |
| <th style="text-align: center;">Hibiki</th> | |
| <th style="text-align: center">Seamless</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td></tr> | |
| <tr> <td></td> <td></td> <td></td></tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| </div> | |
| <div class="container p-5 mb-5 bg-white rounded"> | |
| <p class="mb-0"> | |
| This page was adapted from the <a href="https://google-research.github.io/seanet/soundstorm/examples">SoundStorm project page</a>. | |
| </p> | |
| </div> | |
| </body> | |
| </html> | |