Ayuto committed · Commit 909e414 · 0 Parent(s)
Files changed (8)
  1. .gitattributes +35 -0
  2. .gitignore +2 -0
  3. .python-version +1 -0
  4. README.md +56 -0
  5. README_spaces.md +62 -0
  6. app.py +227 -0
  7. pyproject.toml +7 -0
  8. requirements.txt +7 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,2 @@
+ models/
+ __pycache__/
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
README.md ADDED
@@ -0,0 +1,56 @@
+ ---
+ title: Miipher 2 HuBERT HiFi GAN V0.1
+ emoji: 🎤
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 5.38.0
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ models:
+ - Atotti/miipher-2-HuBERT-HiFi-GAN-v0.1
+ ---
+
+ # 🎤 Miipher-2 Speech Enhancement Demo
+
+ This is a Gradio demo for **Miipher-2**, a high-quality speech enhancement model that combines HuBERT, Parallel Adapters, and a HiFi-GAN vocoder.
+
+ ## Features
+
+ - **Real-time speech enhancement** - Remove noise, reverb, and other degradations
+ - **Multilingual support** - Built on mHuBERT-147 for 147 languages
+ - **High-quality output** - 22.05kHz audio output
+ - **Easy to use** - Simple drag-and-drop or microphone input
+
+ ## Model Details
+
+ - **Paper**: [Miipher-2: High-Quality Speech Enhancement](https://arxiv.org/abs/2505.04457)
+ - **Model**: [Atotti/miipher-2-HuBERT-HiFi-GAN-v0.1](https://huggingface.co/Atotti/miipher-2-HuBERT-HiFi-GAN-v0.1)
+ - **GitHub**: [open-miipher-2](https://github.com/your-repo/open-miipher-2)
+
+ ## How to Use
+
+ 1. **Upload** an audio file or record using your microphone
+ 2. Click the **"Enhance Audio"** button
+ 3. **Download** the enhanced result
+
+ ## Technical Details
+
+ The model uses:
+ - **SSL Backbone**: mHuBERT-147 (multilingual)
+ - **Adapter**: Parallel adapters inserted at layer 6
+ - **Vocoder**: HiFi-GAN trained on SSL features
+ - **Input**: Any sample rate (auto-resampled to 16kHz)
+ - **Output**: 22.05kHz enhanced audio
+
+ ## Citation
+
+ ```bibtex
+ @article{miipher2024,
+ title={Miipher-2: High-Quality Speech Enhancement via Self-Supervised Learning},
+ author={Your Name and Others},
+ journal={arXiv preprint arXiv:2505.04457},
+ year={2024}
+ }
+ ```
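
The input/output contract under Technical Details above matches the preprocessing done in `app.py` later in this commit. A minimal sketch of that preprocessing step, using only torchaudio calls that also appear in `app.py` (the example file path is hypothetical):

```python
import torch
import torchaudio

SAMPLE_RATE_INPUT = 16000  # the model consumes 16 kHz mono, per the README above


def prepare_input(audio_path: str) -> torch.Tensor:
    """Load an audio file of any sample rate, resample to 16 kHz, and downmix to mono."""
    waveform, sr = torchaudio.load(audio_path)  # (channels, samples)
    if sr != SAMPLE_RATE_INPUT:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE_INPUT)
    return waveform.mean(0, keepdim=True)  # (1, samples), mono


# Hypothetical usage; the demo writes the enhanced result back at 22.05 kHz.
# waveform = prepare_input("examples/noisy_speech_1.wav")
```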
README_spaces.md ADDED
@@ -0,0 +1,62 @@
+ ---
+ title: Miipher-2 Speech Enhancement Demo
+ emoji: 🎵
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 4.0.0
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ ---
+
+ # Miipher-2 Speech Enhancement Demo
+
+ Miipher-2 is a speech enhancement system that uses Parallel Adapters inserted into mHuBERT layers to improve audio quality.
+
+ ## Features
+
+ - **Real-time speech enhancement** from noisy or degraded audio
+ - **Parallel Adapter architecture** for efficient fine-tuning
+ - **Lightning SSL-Vocoder** for high-quality audio synthesis
+ - **Easy-to-use Gradio interface**
+
+ ## Model Architecture
+
+ 1. **SSL Feature Extractor**: mHuBERT-147 (Layer 6)
+ 2. **Parallel Adapter**: Lightweight feedforward network
+ 3. **Lightning SSL-Vocoder**: HiFi-GAN-based vocoder
+
+ ## Usage
+
+ 1. Upload an audio file or record using your microphone
+ 2. Click the **"Enhance Audio"** button
+ 3. Listen to the enhanced audio output
+
+ ## Models
+
+ The demo automatically downloads the unified model from:
+ - Complete Model: `Atotti/miipher-2-HuBERT-HiFi-GAN-v0.1` (includes both the Adapter and the Vocoder)
+
+ ## Technical Details
+
+ - **Input**: Audio files (WAV, MP3, FLAC)
+ - **Output**: Enhanced audio at 22050 Hz
+ - **Supported Languages**: Primarily trained on Japanese, but works with other languages
+ - **Processing**: Real-time inference on CPU/GPU
+
+ ## License
+
+ Apache-2.0
+
+ ## Citation
+
+ If you use Miipher-2 in your research, please cite:
+
+ ```bibtex
+ @article{miipher2,
+ title={Miipher-2: Speech Enhancement with Parallel Adapters},
+ author={Your Name},
+ year={2024}
+ }
+ ```
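
The Models section above says the demo downloads the unified checkpoint on startup; a minimal sketch of that step, using the repo id and checkpoint filenames defined in `app.py` below (the local cache directory mirrors `app.py` and is otherwise an assumption):

```python
from huggingface_hub import hf_hub_download

# Repo id and filenames match the constants defined in app.py in this commit.
MODEL_REPO_ID = "Atotti/miipher-2-HuBERT-HiFi-GAN-v0.1"

adapter_path = hf_hub_download(
    repo_id=MODEL_REPO_ID,
    filename="checkpoint_199k_fixed.pt",   # Parallel Adapter weights
    cache_dir="./models",                  # assumed local cache, as in app.py
)
vocoder_path = hf_hub_download(
    repo_id=MODEL_REPO_ID,
    filename="epoch=77-step=137108.ckpt",  # HiFi-GAN vocoder checkpoint
    cache_dir="./models",
)
print(adapter_path, vocoder_path)
```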
app.py ADDED
@@ -0,0 +1,227 @@
+ import gradio as gr
+ import torch
+ import torchaudio
+ import numpy as np
+ from pathlib import Path
+ from huggingface_hub import hf_hub_download
+ from omegaconf import DictConfig
+
+ from miipher_2.model.feature_cleaner import FeatureCleaner
+ from miipher_2.lightning_vocoders.lightning_module import HiFiGANLightningModule
+
+ # Model configuration
+ MODEL_REPO_ID = "Atotti/miipher-2-HuBERT-HiFi-GAN-v0.1"
+ ADAPTER_FILENAME = "checkpoint_199k_fixed.pt"
+ VOCODER_FILENAME = "epoch=77-step=137108.ckpt"
+ DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ SAMPLE_RATE_INPUT = 16000
+ SAMPLE_RATE_OUTPUT = 22050
+
+ # Cache for models
+ models_cache = {}
+
+ def download_models():
+     """Download models from Hugging Face Hub"""
+     print("Downloading models from Hugging Face Hub...")
+
+     adapter_path = hf_hub_download(
+         repo_id=MODEL_REPO_ID,
+         filename=ADAPTER_FILENAME,
+         cache_dir="./models"
+     )
+
+     vocoder_path = hf_hub_download(
+         repo_id=MODEL_REPO_ID,
+         filename=VOCODER_FILENAME,
+         cache_dir="./models"
+     )
+
+     return adapter_path, vocoder_path
+
+ def load_models():
+     """Load models into memory"""
+     if "cleaner" in models_cache and "vocoder" in models_cache:
+         return models_cache["cleaner"], models_cache["vocoder"]
+
+     adapter_path, vocoder_path = download_models()
+
+     # Model configuration
+     model_config = DictConfig({
+         "hubert_model_name": "utter-project/mHuBERT-147",
+         "hubert_layer": 6,
+         "adapter_hidden_dim": 768
+     })
+
+     # Initialize FeatureCleaner
+     print("Loading FeatureCleaner...")
+     cleaner = FeatureCleaner(model_config).to(DEVICE).eval()
+
+     # Load adapter weights
+     adapter_checkpoint = torch.load(adapter_path, map_location=DEVICE, weights_only=False)
+     cleaner.load_state_dict(adapter_checkpoint["model_state_dict"])
+
+     # Load vocoder
+     print("Loading vocoder...")
+     vocoder = HiFiGANLightningModule.load_from_checkpoint(
+         vocoder_path, map_location=DEVICE
+     ).to(DEVICE).eval()
+
+     # Cache models
+     models_cache["cleaner"] = cleaner
+     models_cache["vocoder"] = vocoder
+
+     return cleaner, vocoder
+
+ @torch.inference_mode()
+ def enhance_audio(audio_path, progress=gr.Progress()):
+     """Enhance audio using Miipher-2 model"""
+     try:
+         progress(0, desc="Loading models...")
+         cleaner, vocoder = load_models()
+
+         progress(0.2, desc="Loading audio...")
+         # Load audio
+         waveform, sr = torchaudio.load(audio_path)
+
+         # Resample to 16kHz if needed
+         if sr != SAMPLE_RATE_INPUT:
+             waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE_INPUT)
+
+         # Convert to mono if stereo
+         waveform = waveform.mean(0, keepdim=True)
+
+         # Move to device
+         waveform = waveform.to(DEVICE)
+
+         progress(0.4, desc="Extracting features...")
+         # Extract features using FeatureCleaner
+         with torch.no_grad(), torch.autocast(device_type=DEVICE.type, dtype=torch.float16, enabled=(DEVICE.type == "cuda")):
+             features = cleaner(waveform)
+
+         # Ensure correct shape for vocoder
+         if features.dim() == 2:
+             features = features.unsqueeze(0)
+
+         progress(0.7, desc="Generating enhanced audio...")
+         # Generate audio using vocoder
+         # Match the Lightning SSL-Vocoder input format: (batch, seq_len, input_channels)
+         batch = {"input_feature": features.transpose(1, 2)}
+         enhanced_audio = vocoder.generator_forward(batch)
+
+         # Convert to numpy
+         enhanced_audio = enhanced_audio.squeeze(0).cpu().to(torch.float32).detach().numpy()
+
+         progress(1.0, desc="Enhancement complete!")
+
+         # Save audio using torchaudio to avoid Gradio format issues
+         enhanced_audio = np.clip(enhanced_audio, -1.0, 1.0)
+         enhanced_audio_tensor = torch.from_numpy(enhanced_audio)
+
+         # Ensure 2D tensor: (channels, samples)
+         if enhanced_audio_tensor.dim() == 1:
+             enhanced_audio_tensor = enhanced_audio_tensor.unsqueeze(0)
+
+         # Save to temporary file using torchaudio
+         import tempfile
+         with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_file:
+             torchaudio.save(tmp_file.name, enhanced_audio_tensor, SAMPLE_RATE_OUTPUT)
+             return tmp_file.name
+
+     except Exception as e:
+         raise gr.Error(f"Error during enhancement: {str(e)}")
+
+ # Create Gradio interface
+ def create_interface():
+     title = "🎤 Miipher-2 Speech Enhancement"
+
+     description = """
+ <div style="text-align: center;">
+ <p>High-quality speech enhancement using <b>Miipher-2</b> (HuBERT + Parallel Adapter + HiFi-GAN)</p>
+ <p>📄 <a href="https://arxiv.org/abs/2505.04457">Paper</a> |
+ 🤗 <a href="https://huggingface.co/Atotti/miipher-2-HuBERT-HiFi-GAN-v0.1">Model</a> |
+ 💻 <a href="https://github.com/your-repo/open-miipher-2">GitHub</a></p>
+ </div>
+     """
+
+     article = """
+ ## How it works
+
+ 1. **Upload** a noisy or degraded audio file
+ 2. **Process** using Miipher-2 model
+ 3. **Download** the enhanced audio
+
+ ### Model Details
+ - **SSL Backbone**: mHuBERT-147 (Multilingual)
+ - **Adapter**: Parallel adapters at layer 6
+ - **Vocoder**: HiFi-GAN trained on SSL features
+ - **Input**: Any sample rate (automatically resampled to 16kHz)
+ - **Output**: 22.05kHz high-quality audio
+
+ ### Tips
+ - Works best with speech audio
+ - Supports various noise types (background noise, reverb, etc.)
+ - Processing time depends on audio length and hardware
+     """
+
+     examples = [
+         ["examples/noisy_speech_1.wav"],
+         ["examples/noisy_speech_2.wav"],
+         ["examples/reverb_speech.wav"],
+     ]
+
+     with gr.Blocks(title=title, theme=gr.themes.Soft()) as demo:
+         gr.Markdown(f"# {title}")
+         gr.Markdown(description)
+
+         with gr.Row():
+             with gr.Column():
+                 input_audio = gr.Audio(
+                     label="Input Audio (Noisy/Degraded)",
+                     type="filepath",
+                     sources=["upload", "microphone"]
+                 )
+
+                 enhance_btn = gr.Button("🚀 Enhance Audio", variant="primary")
+
+             with gr.Column():
+                 output_audio = gr.Audio(
+                     label="Enhanced Audio",
+                     type="filepath",
+                     interactive=False
+                 )
+
+         # Add examples if they exist
+         examples_dir = Path("examples")
+         if examples_dir.exists():
+             example_files = list(examples_dir.glob("*.wav")) + list(examples_dir.glob("*.mp3"))
+             if example_files:
+                 gr.Examples(
+                     examples=[[str(f)] for f in example_files[:3]],
+                     inputs=input_audio,
+                     outputs=output_audio,
+                     fn=enhance_audio,
+                     cache_examples=True
+                 )
+
+         gr.Markdown(article)
+
+         # Connect the enhancement function
+         enhance_btn.click(
+             fn=enhance_audio,
+             inputs=input_audio,
+             outputs=output_audio,
+             show_progress=True
+         )
+
+     return demo
+
+ # Launch the app
+ if __name__ == "__main__":
+     # Pre-load models
+     print("Pre-loading models...")
+     load_models()
+     print("Models loaded successfully!")
+
+     # Create and launch interface
+     demo = create_interface()
+     demo.launch()
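
For running the same pipeline outside the Gradio UI, here is a minimal sketch that reuses `load_models()` and the constants from `app.py` above. It assumes `app.py` and the `miipher_2` package are importable; the input and output file paths are hypothetical:

```python
# Offline enhancement sketch reusing the app.py pipeline (no Gradio UI).
import torch
import torchaudio

from app import load_models, DEVICE, SAMPLE_RATE_INPUT, SAMPLE_RATE_OUTPUT

cleaner, vocoder = load_models()

# Load a (hypothetical) noisy recording: resample to 16 kHz, downmix to mono, move to device.
waveform, sr = torchaudio.load("noisy.wav")
if sr != SAMPLE_RATE_INPUT:
    waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE_INPUT)
waveform = waveform.mean(0, keepdim=True).to(DEVICE)

with torch.inference_mode():
    features = cleaner(waveform)                          # SSL features from the adapted mHuBERT
    if features.dim() == 2:
        features = features.unsqueeze(0)
    batch = {"input_feature": features.transpose(1, 2)}   # (batch, seq_len, channels), as in app.py
    enhanced = vocoder.generator_forward(batch)

audio = enhanced.squeeze(0).cpu().float().clamp(-1.0, 1.0)
if audio.dim() == 1:
    audio = audio.unsqueeze(0)                            # (channels, samples) for torchaudio.save
torchaudio.save("enhanced.wav", audio, SAMPLE_RATE_OUTPUT)
```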
pyproject.toml ADDED
@@ -0,0 +1,7 @@
+ [project]
+ name = "miipher-demo"
+ version = "0.1.0"
+ description = "Add your description here"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = []
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ # UI framework
+ gradio>=4.0.0
+ # Hugging Face Hub
+ huggingface_hub>=0.16.0
+ # miipher-2 implementation
+ git+https://github.com/Atotti/miipher-2.git
+