AIvry committed on
Commit 4ee0d45 · verified · 1 Parent(s): f76a9ce

Upload 2 files

Files changed (2)
  1. README.md +129 -7
  2. requirements.txt +26 -0
README.md CHANGED
@@ -1,14 +1,136 @@
  ---
- title: MAPSS Measures
- emoji: 🔥
- colorFrom: green
- colorTo: indigo
  sdk: gradio
- sdk_version: 5.45.0
  app_file: app.py
  pinned: false
  license: mit
- short_description: Granular leakage and distortion metrics in source separation
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: MAPSS Multi Source Audio Perceptual Separation Scores
+ emoji: 🎵
+ colorFrom: blue
+ colorTo: purple
  sdk: gradio
+ sdk_version: 4.0.0
  app_file: app.py
  pinned: false
  license: mit
  ---
 
+ # MAPSS: Multi-source Audio Perceptual Separation Scores
+
+ Evaluate audio source separation quality using Perceptual Similarity (PS) and Perceptual Matching (PM) metrics.
+
+ ## Features
+
+ - **Perceptual Similarity (PS)**: Measures how close each separated output is to its reference source in a perceptual embedding space
+ - **Perceptual Matching (PM)**: Evaluates robustness against a broad set of audio distortions
+ - **Multiple embedding models**: Supports WavLM, Wav2Vec2, HuBERT, AST, and a raw-waveform baseline
+ - **Automatic output-to-reference matching**: Uses the Hungarian algorithm on a correlation matrix
+ - **GPU-optimized processing**: Batch processing with GPU memory management
+ - **Diffusion maps**: Dimensionality reduction for perceptual space analysis
+
+ ## Input Format
+
+ Upload a ZIP file containing:
+ ```
+ your_mixture.zip
+ ├── references/          # Original clean sources
+ │   ├── speaker1.wav
+ │   ├── speaker2.wav
+ │   └── ...
+ └── outputs/             # Separated outputs from your algorithm
+     ├── separated1.wav
+     ├── separated2.wav
+     └── ...
+ ```
+
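Before uploading, the archive layout described above can be checked locally. The following is a minimal sketch using only the Python standard library; the helper names `list_wavs`, `check_layout`, and `build_demo_zip` are hypothetical and are not part of MAPSS, and the check covers only the folder names and the equal-count rule stated in this README.

```python
import io
import zipfile
from pathlib import PurePosixPath


def list_wavs(zf: zipfile.ZipFile, folder: str) -> list[str]:
    """Return sorted .wav member names that sit under a directory named `folder`."""
    found = []
    for name in zf.namelist():
        if name.endswith("/"):
            continue  # skip explicit directory entries
        path = PurePosixPath(name)
        if folder in path.parts[:-1] and path.suffix.lower() == ".wav":
            found.append(name)
    return sorted(found)


def check_layout(zf: zipfile.ZipFile) -> tuple[list[str], list[str]]:
    """Validate the references/ + outputs/ layout and return both file lists."""
    refs = list_wavs(zf, "references")
    outs = list_wavs(zf, "outputs")
    if not refs or not outs:
        raise ValueError("expected WAV files under both references/ and outputs/")
    if len(refs) != len(outs):
        raise ValueError(f"{len(refs)} references but {len(outs)} outputs")
    return refs, outs


def build_demo_zip() -> bytes:
    """Create a tiny in-memory archive with the expected layout (empty files)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in ("references/speaker1.wav", "references/speaker2.wav",
                     "outputs/separated1.wav", "outputs/separated2.wav"):
            zf.writestr(name, b"")
    return buf.getvalue()


with zipfile.ZipFile(io.BytesIO(build_demo_zip())) as demo:
    references, outputs = check_layout(demo)
    print(len(references), "references,", len(outputs), "outputs")
```

The sketch does not open the audio itself, so it will not catch files that carry a `.wav` extension but are not valid WAV data.
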
+ ### Audio Requirements
+ - Format: WAV files
+ - Sample rate: Any (automatically resampled to 16 kHz)
+ - Channels: Mono or stereo (stereo is converted to mono)
+ - Number of files: The number of references must equal the number of outputs
+
+ ## Output Format
+
+ The tool generates a ZIP file containing:
+ - `ps_scores_{model}.csv`: PS scores for each speaker/source (0-1, higher is better)
+ - `pm_scores_{model}.csv`: PM scores for each speaker/source (0-1, higher is better)
+ - `params.json`: Experiment parameters used
+ - `manifest_canonical.json`: File mapping and processing details
+
+ ### Score Interpretation
+ - **PS Score**: Perceptual Similarity
+   - 1.0 = Perfect separation (output identical to reference)
+   - 0.5 = Moderate separation quality
+   - 0.0 = Poor separation (output closer to other sources)
+
+ - **PM Score**: Perceptual Matching (robustness)
+   - 1.0 = Highly robust to distortions
+   - 0.5 = Moderate robustness
+   - 0.0 = Not robust (easily confused with distorted versions)
+
+ ## Available Models
+
+ | Model | Description | Default Layer | Use Case |
+ |-------|-------------|---------------|----------|
+ | `raw` | Raw waveform features | N/A | Baseline comparison |
+ | `wavlm` | WavLM Large | 24 | Best overall performance |
+ | `wav2vec2` | Wav2Vec2 Large | 24 | Strong performance |
+ | `hubert` | HuBERT Large | 24 | Good for speech |
+ | `wavlm_base` | WavLM Base | 12 | Faster, good quality |
+ | `wav2vec2_base` | Wav2Vec2 Base | 12 | Faster processing |
+ | `hubert_base` | HuBERT Base | 12 | Faster for speech |
+ | `wav2vec2_xlsr` | Wav2Vec2 XLSR-53 | 24 | Multilingual |
+ | `ast` | Audio Spectrogram Transformer | 12 | General audio |
+
+ ## Parameters
+
+ - **Model**: Select the embedding model for feature extraction
+ - **Layer**: Which transformer layer to use (auto-selected by default)
+ - **Alpha**: Diffusion maps parameter (0.0-1.0, default: 1.0)
+   - 0.0 = No normalization
+   - 1.0 = Full normalization (recommended)
+
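For orientation, the role of Alpha can be written out using the standard diffusion-maps construction. This is the textbook form and is assumed here; the README does not confirm that MAPSS uses exactly this kernel or normalization. Given a symmetric affinity kernel $K_{ij}$ between embedded frames $i$ and $j$:

```latex
d_i = \sum_j K_{ij}, \qquad
K^{(\alpha)}_{ij} = \frac{K_{ij}}{d_i^{\alpha}\, d_j^{\alpha}}, \qquad
P_{ij} = \frac{K^{(\alpha)}_{ij}}{\sum_{l} K^{(\alpha)}_{il}}
```

With $\alpha = 0$ the kernel is left as is before row normalization ($K^{(0)} = K$), so the sampling density of the frames influences the embedding. With $\alpha = 1$ the kernel is fully divided by the density estimates $d_i d_j$, which removes that influence and is consistent with the recommended default above.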
+ ## How It Works
+
+ 1. **Feature Extraction**: Audio signals are passed through pre-trained self-supervised models to extract perceptual embeddings
+ 2. **Voice Activity Detection**: Voiced segments are detected automatically using energy-based masking
+ 3. **Diffusion Maps**: Embeddings are projected with diffusion maps for dimensionality reduction
+ 4. **PS Computation**: Compares the Mahalanobis distance from each separated output to its own reference with its distance to the other sources
+ 5. **PM Computation**: Evaluates each output against a set of distortions, including:
+    - Noise (white, pink, brown at various SNRs)
+    - Filtering (lowpass, highpass, notch, comb)
+    - Effects (reverb, echo, tremolo, vibrato)
+    - Distortions (clipping, pitch shift, time stretch)
+ 6. **Scoring**: Frame-level scores are computed and aggregated
+
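To make step 2 concrete, here is a minimal sketch of energy-based masking over 20 ms frames at 16 kHz. The function names and the rule "a frame is active if it is within 40 dB of the loudest frame" are assumptions chosen for illustration; the threshold actually used by MAPSS is not stated in this README.

```python
import math


def frame_energy_db(samples: list[float], frame_len: int) -> list[float]:
    """Mean-square energy in dB for consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # epsilon avoids log(0)
    return energies


def energy_vad(samples: list[float], sample_rate: int = 16_000,
               frame_ms: int = 20, threshold_db: float = -40.0) -> list[bool]:
    """Mark a frame active when its energy is within |threshold_db| of the loudest frame.

    The relative threshold is an illustrative assumption, not the MAPSS setting.
    """
    frame_len = sample_rate * frame_ms // 1000  # 320 samples at 16 kHz
    energies = frame_energy_db(samples, frame_len)
    if not energies:
        return []
    peak = max(energies)
    return [e >= peak + threshold_db for e in energies]


# One silent frame, one frame of a 440 Hz tone, one silent frame.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16_000) for n in range(320)]
mask = energy_vad([0.0] * 320 + tone + [0.0] * 320)
print(mask)
```

A relative threshold keeps the mask independent of the absolute recording level, which is one reason to prefer it over a fixed dB floor in a sketch like this.
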
+ ## Technical Details
+
+ - **Loudness normalization**: Loudness measured per ITU-R BS.1770 and normalized to -23 LUFS
+ - **Frame-based processing**: 20 ms windows with a 20 ms hop
+ - **Correlation-based assignment**: Hungarian algorithm for optimal matching
+ - **Memory optimization**: Batch processing with automatic GPU memory management
+ - **Robust statistics**: Covariance regularization and outlier handling
+
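The correlation-based assignment can be illustrated with a small standard-library sketch. It builds a matrix of absolute Pearson correlations between references and outputs and picks the permutation with the highest total. Note the substitution: an exhaustive search over permutations stands in for the Hungarian algorithm here, which is only practical for a handful of sources. The function names, and the use of the absolute value to tolerate sign-flipped outputs, are assumptions and not taken from the MAPSS code.

```python
import math
from itertools import permutations


def pearson(a: list[float], b: list[float]) -> float:
    """Pearson correlation of two signals, truncated to the shorter length."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def match_outputs(refs: list[list[float]], outs: list[list[float]]) -> list[int]:
    """Return `assignment`, where assignment[i] is the output index matched to reference i."""
    n = len(refs)
    if len(outs) != n:
        raise ValueError("need the same number of references and outputs")
    # Absolute correlation, so a polarity-inverted output still matches (assumption).
    corr = [[abs(pearson(r, o)) for o in outs] for r in refs]
    # Brute-force stand-in for the Hungarian algorithm: try every permutation.
    best = max(permutations(range(n)),
               key=lambda p: sum(corr[i][p[i]] for i in range(n)))
    return list(best)


ref0 = [math.sin(0.10 * n) for n in range(200)]
ref1 = [math.sin(0.37 * n) for n in range(200)]
# Outputs arrive in swapped order, scaled, and one is sign-flipped.
outputs = [[0.8 * x for x in ref1], [-0.5 * x for x in ref0]]
print(match_outputs([ref0, ref1], outputs))
```

For more than a few sources, SciPy's `scipy.optimize.linear_sum_assignment` solves the same assignment problem in polynomial time when given the negated correlation matrix as the cost.
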
+ ## Citation
+
+ If you use MAPSS in your research, please cite:
+
+ ```bibtex
+ @article{mapss2024,
+   title={MAPSS: Multi-source Audio Perceptual Separation Scores},
+   author={Your Name},
+   journal={arXiv preprint},
+   year={2024}
+ }
+ ```
+
121
+
122
+ ## Limitations
123
+
124
+ - Processing time scales with audio length and model size
125
+ - Memory requirements depend on number of sources and audio length
126
+ - Currently optimized for speech separation (music separation support in development)
127
+ - Maximum recommended sources: 10 per mixture
128
+
129
+ ## License
130
+
131
+ Code: MIT License
132
+ Paper: CC-BY-4.0
133
+
134
+ ## Support
135
+
136
+ For issues, questions, or contributions, please visit the [GitHub repository](https://github.com/yourusername/mapss).
requirements.txt ADDED
@@ -0,0 +1,26 @@
+ # Core dependencies
+ gradio>=4.0.0
+ torch>=2.0.0
+ torchaudio>=2.0.0
+ transformers>=4.35.0
+ accelerate>=0.24.0
+
+ # Audio processing
+ librosa>=0.10.0
+ soundfile>=0.12.0
+ pyloudnorm>=0.1.0
+ scipy>=1.11.0
+ numpy>=1.24.0
+
+ # Data handling
+ pandas>=2.0.0
+
+ # Model specific
+ safetensors>=0.4.0
+ sentencepiece>=0.1.99 # For some tokenizers
+
+ # Optional optimizations
+ triton>=2.1.0 # For faster attention if available
+
+ # Memory management
+ psutil>=5.9.0