AIvry committed
Commit
f76a9ce
·
verified ·
1 Parent(s): 480d8a7

Delete hf_readme.md

Files changed (1)
  1. hf_readme.md +0 -136
hf_readme.md DELETED
@@ -1,136 +0,0 @@
---
title: MAPSS Multi Source Audio Perceptual Separation Scores
emoji: 🎡
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: mit
---

# MAPSS: Multi-source Audio Perceptual Separation Scores

Evaluate audio source separation quality using Perceptual Similarity (PS) and Perceptual Matching (PM) metrics.

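You can also drive the Space programmatically instead of through the UI. Below is a minimal sketch using `gradio_client`; the Space id `AIvry/MAPSS` and the `/predict` endpoint name are assumptions, so check the Space's "Use via API" panel for the real values:

```python
from gradio_client import Client, handle_file

client = Client("AIvry/MAPSS")             # hypothetical Space id; replace with the real one
result = client.predict(
    handle_file("your_mixture.zip"),       # ZIP in the layout described below
    api_name="/predict",                   # assumed default endpoint name
)
print(result)                              # typically a path to the results ZIP
```
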
## Features

- **Perceptual Similarity (PS)**: Measures how similar separated outputs are to reference sources in perceptual embedding space
- **Perceptual Matching (PM)**: Evaluates robustness against a comprehensive set of audio distortions
- **Multiple embedding models**: Support for WavLM, Wav2Vec2, HuBERT, AST, and more
- **Automatic output-to-reference matching**: Uses a correlation-based cost matrix with the Hungarian algorithm
- **GPU-optimized processing**: Efficient batch processing with memory management
- **Diffusion maps**: Dimensionality reduction for perceptual-space analysis

## Input Format

Upload a ZIP file containing:
```
your_mixture.zip
├── references/          # Original clean sources
│   ├── speaker1.wav
│   ├── speaker2.wav
│   └── ...
└── outputs/             # Separated outputs from your algorithm
    ├── separated1.wav
    ├── separated2.wav
    └── ...
```

### Audio Requirements

- Format: WAV files
- Sample rate: any (automatically resampled to 16 kHz)
- Channels: mono or stereo (stereo is downmixed to mono)
- Number of files: the number of outputs must equal the number of references

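For convenience, here is one way to assemble a compliant ZIP from two local folders using only the Python standard library; the folder names are placeholders, and only the archive-internal `references/` and `outputs/` paths matter:

```python
import zipfile
from pathlib import Path

def package_mixture(ref_dir: str, out_dir: str, zip_path: str = "your_mixture.zip") -> str:
    """Zip reference and output WAVs into the references/ + outputs/ layout."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for wav in sorted(Path(ref_dir).glob("*.wav")):
            zf.write(wav, f"references/{wav.name}")   # store under references/
        for wav in sorted(Path(out_dir).glob("*.wav")):
            zf.write(wav, f"outputs/{wav.name}")      # store under outputs/
    return zip_path

package_mixture("my_refs", "my_separated")            # placeholder folder names
```
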
## Output Format

The tool generates a ZIP file containing:
- `ps_scores_{model}.csv`: PS scores for each speaker/source (0-1, higher is better)
- `pm_scores_{model}.csv`: PM scores for each speaker/source (0-1, higher is better)
- `params.json`: Experiment parameters used
- `manifest_canonical.json`: File mapping and processing details

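A quick way to inspect the returned archive; a minimal sketch assuming a WavLM run and a downloaded file named `mapss_results.zip` (both names are placeholders; adjust them to your run):

```python
import json
import zipfile
import pandas as pd

# "mapss_results.zip" and the "wavlm" suffix are placeholders;
# use the archive and model name from your own run.
with zipfile.ZipFile("mapss_results.zip") as zf:
    ps = pd.read_csv(zf.open("ps_scores_wavlm.csv"))
    pm = pd.read_csv(zf.open("pm_scores_wavlm.csv"))
    params = json.loads(zf.read("params.json"))

print(params)
print(ps.describe())   # distribution of PS scores across sources
print(pm.describe())
```
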
### Score Interpretation

- **PS Score**: Perceptual Similarity
  - 1.0 = Perfect separation (output identical to reference)
  - 0.5 = Moderate separation quality
  - 0.0 = Poor separation (output closer to other sources)

- **PM Score**: Perceptual Matching (robustness)
  - 1.0 = Highly robust to distortions
  - 0.5 = Moderate robustness
  - 0.0 = Not robust (easily confused with distorted versions)

## Available Models

| Model | Description | Default Layer | Use Case |
|-------|-------------|---------------|----------|
| `raw` | Raw waveform features | N/A | Baseline comparison |
| `wavlm` | WavLM Large | 24 | Best overall performance |
| `wav2vec2` | Wav2Vec2 Large | 24 | Strong performance |
| `hubert` | HuBERT Large | 24 | Good for speech |
| `wavlm_base` | WavLM Base | 12 | Faster, good quality |
| `wav2vec2_base` | Wav2Vec2 Base | 12 | Faster processing |
| `hubert_base` | HuBERT Base | 12 | Faster for speech |
| `wav2vec2_xlsr` | Wav2Vec2 XLSR-53 | 24 | Multilingual |
| `ast` | Audio Spectrogram Transformer | 12 | General audio |

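For intuition about the Default Layer column, the sketch below pulls layer-24 embeddings from WavLM Large with Hugging Face `transformers`. The checkpoint `microsoft/wavlm-large` is an assumption; the Space's internal model-to-checkpoint mapping is not documented here:

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModel

# "microsoft/wavlm-large" is an assumed checkpoint, not confirmed by this README.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = AutoModel.from_pretrained("microsoft/wavlm-large").eval()

wav, sr = torchaudio.load("speaker1.wav")
wav = torchaudio.functional.resample(wav.mean(dim=0), sr, 16000)  # mono, 16 kHz

inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the projected convolutional front-end;
# entries 1..24 are the transformer layers, so [24] is the top layer.
frames = out.hidden_states[24].squeeze(0)   # shape: (num_frames, 1024)
```
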
## Parameters

- **Model**: Select the embedding model for feature extraction
- **Layer**: Which transformer layer to use (auto-selected by default)
- **Alpha**: Diffusion maps density-normalization parameter (0.0-1.0, default: 1.0)
  - 0.0 = No density normalization
  - 1.0 = Full density normalization (recommended)

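To see what Alpha controls, here is a generic diffusion-maps sketch (illustrative, not the Space's implementation): alpha is the exponent of the density normalization applied to the Gaussian kernel before the diffusion operator is built.

```python
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_map(X, alpha=1.0, n_components=8, eps=None):
    """Generic diffusion-maps embedding (illustrative, not MAPSS's own code)."""
    D2 = cdist(X, X, "sqeuclidean")
    if eps is None:
        eps = np.median(D2)                      # common bandwidth heuristic
    K = np.exp(-D2 / eps)                        # Gaussian affinity kernel
    d = K.sum(axis=1)
    K = K / np.outer(d, d) ** alpha              # alpha = 0: raw kernel; alpha = 1: density removed
    P = K / K.sum(axis=1, keepdims=True)         # row-normalize to a Markov matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    idx = order[1:n_components + 1]              # skip the trivial eigenvalue 1
    return vecs[:, idx].real * vals[idx].real    # diffusion coordinates
```

With alpha = 1.0 the embedding reflects the shape of the data manifold rather than how densely each region was sampled, which is why full normalization is the recommended setting.
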
## How It Works

1. **Feature Extraction**: Audio signals are processed through pre-trained self-supervised models to extract perceptual embeddings
2. **Voice Activity Detection**: Automatic detection of voiced segments using energy-based masking
3. **Diffusion Maps**: Embeddings are projected with diffusion maps for robust dimensionality reduction
4. **PS Computation**: Measures the Mahalanobis distance from each separated output to its own reference versus the other sources (see the sketch after this list)
5. **PM Computation**: Evaluates against comprehensive distortions, including:
   - Noise (white, pink, brown at various SNRs)
   - Filtering (lowpass, highpass, notch, comb)
   - Effects (reverb, echo, tremolo, vibrato)
   - Distortions (clipping, pitch shift, time stretch)
6. **Scoring**: Frame-level scores are computed and aggregated

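To make step 4 concrete, the toy score below applies the Mahalanobis idea: frames of an output should lie close to the embedding cloud of their own reference and far from the clouds of competing sources. It illustrates the geometry only and is not the actual MAPSS formula:

```python
import numpy as np

def _cloud_stats(F, ridge=1e-6):
    """Mean and inverse covariance of a (frames x dims) embedding cloud."""
    mu = F.mean(axis=0)
    cov = np.cov(F, rowvar=False) + ridge * np.eye(F.shape[1])  # regularized
    return mu, np.linalg.inv(cov)

def ps_like_score(out_frames, ref_frames, competitor_frames):
    """Toy PS-style score in (0, 1): near 1 when the output sits on its
    reference cloud and far from every competing source's cloud."""
    mu, cov_inv = _cloud_stats(ref_frames)
    diffs = out_frames - mu
    d_ref = np.mean(np.einsum("nd,de,ne->n", diffs, cov_inv, diffs))
    d_comp = []
    for C in competitor_frames:                 # list of other sources' frames
        mu_c, cinv_c = _cloud_stats(C)
        dc = out_frames - mu_c
        d_comp.append(np.mean(np.einsum("nd,de,ne->n", dc, cinv_c, dc)))
    d_other = min(d_comp)                       # nearest competing cloud
    return d_other / (d_ref + d_other)
```
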
## Technical Details

- **Loudness normalization**: ITU-R BS.1770 standard (-23 LUFS)
- **Frame-based processing**: 20 ms windows with a 20 ms hop (non-overlapping frames)
- **Correlation-based assignment**: Hungarian algorithm for optimal matching (sketched below)
- **Memory optimization**: Batch processing with automatic GPU memory management
- **Robust statistics**: Covariance regularization and outlier handling

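The correlation-based assignment is easy to reproduce; a minimal sketch with SciPy's Hungarian solver, assuming time-domain correlation as the affinity (MAPSS may compute the correlation in another domain):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_outputs_to_references(outputs, references):
    """Assign each separated output to one reference by maximizing
    pairwise correlation via the Hungarian algorithm."""
    cost = np.zeros((len(outputs), len(references)))
    for i, out in enumerate(outputs):
        for j, ref in enumerate(references):
            n = min(len(out), len(ref))
            r = np.corrcoef(out[:n], ref[:n])[0, 1]
            cost[i, j] = -abs(r)                # negate: the solver minimizes cost
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))  # output idx -> reference idx
```

`linear_sum_assignment` returns the permutation that maximizes total absolute correlation, which is how outputs and references are paired before scoring.
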
## Citation

If you use MAPSS in your research, please cite:

```bibtex
@article{mapss2024,
  title={MAPSS: Multi-source Audio Perceptual Separation Scores},
  author={Your Name},
  journal={arXiv preprint},
  year={2024}
}
```

## Limitations

- Processing time scales with audio length and model size
- Memory requirements depend on the number of sources and audio length
- Currently optimized for speech separation (music separation support is in development)
- Maximum recommended sources: 10 per mixture

## License

Code: MIT License
Paper: CC-BY-4.0

## Support

For issues, questions, or contributions, please visit the [GitHub repository](https://github.com/yourusername/mapss).