Chapter 3

Audio Tasks

Discover the tasks and baselines offered with the Vibravox dataset.


Speech Enhancement

Task

This task focuses on denoising and bandwidth extension (also known as audio super-resolution), both of which are needed to improve the quality of body-conducted speech. The model is presented with a pair of audio clips (a body-conducted recording and the corresponding clean, full-bandwidth airborne recording) and asked to enhance the body-conducted audio by removing noise and regenerating the mid and high frequencies from low-frequency content only.
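To make the input/target relationship concrete, here is a toy sketch in plain Python. It is not the EBEN model and does not use Vibravox data: the sample rate, the two sine tones, and the helper names `tone` and `band_energy` are all assumptions chosen for illustration. It only shows that a band-limited input lacks the high-frequency energy present in the full-bandwidth reference, which is the content a bandwidth-extension model has to regenerate.

```python
import math

SAMPLE_RATE = 16_000  # Hz, assumed for this toy example only
N_SAMPLES = 320       # gives DFT bins spaced exactly 50 Hz apart


def tone(freq_hz, amplitude=1.0):
    """Return a sine wave of the given frequency and amplitude."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n in range(N_SAMPLES)
    ]


def band_energy(signal, low_hz, high_hz):
    """Sum the squared DFT magnitudes of the bins within [low_hz, high_hz)."""
    n = len(signal)
    energy = 0.0
    for k in range(n // 2 + 1):
        if not low_hz <= k * SAMPLE_RATE / n < high_hz:
            continue
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(signal))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(signal))
        energy += re * re + im * im
    return energy


# Toy "airborne" reference: a low-frequency plus a high-frequency component.
reference = [lo + hi for lo, hi in zip(tone(200), tone(3000, amplitude=0.5))]
# Toy "body-conducted" input: only the low-frequency component survives.
body_conducted = tone(200)

# Energy below and above an arbitrary 2 kHz split point.
low_ref = band_energy(reference, 0, 2000)
low_in = band_energy(body_conducted, 0, 2000)
high_ref = band_energy(reference, 2000, 8000)
high_in = band_energy(body_conducted, 2000, 8000)

print(f"low band:  input={low_in:.1f}  reference={low_ref:.1f}")
print(f"high band: input={high_in:.1f}  reference={high_ref:.1f}")
```

In this toy pair the low band is identical in both signals, while the high band is empty in the input. Real body-conducted recordings are noisier and less cleanly band-limited than this.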

Please refer to the Vibravox paper for more information.

Pre-trained models on HuggingFace

Please follow this link to go to the card of our EBEN models: https://huggingface.co/Cnam-LMSSC/vibravox_EBEN_models

Training code

Please follow this link to get the training code of our models: https://github.com/jhauret/vibravox

Audio Samples

[Audio sample table. Columns (sensors): Forehead, In-ear Rigid, In-ear Soft, Temple, Throat. Rows: Input, Enhanced by EBEN, Reference audio.]

Vibravox enhanced by EBEN

Explore the entire test set enhanced by the EBEN models.

Speech Recognition

Task

The model is presented with an audio file and asked to transcribe it to written text (either normalized text or phonemized text). The most common evaluation metrics are the word error rate (WER), the character error rate (CER), and the phoneme error rate (PER).
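All three metrics are conventionally computed the same way: the edit (Levenshtein) distance between the reference and the hypothesis, divided by the reference length, with only the unit of tokenization changing. The sketch below illustrates that standard definition in plain Python. The function names are hypothetical and this is not the evaluation code of the Vibravox repository, which may apply additional text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are individual characters."""
    return error_rate(list(reference), list(hypothesis))


def per(ref_phonemes, hyp_phonemes):
    """Phoneme error rate: tokens are phoneme symbols, given as lists."""
    return error_rate(ref_phonemes, hyp_phonemes)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, an error rate can exceed 1.0 when the hypothesis contains many insertions.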

Please refer to the Vibravox paper for more information.

Pre-trained models on HuggingFace

Please follow this link to go to the card of our phonemizers: https://huggingface.co/Cnam-LMSSC/vibravox_phonemizers

Training code

Please follow this link to get the training code of our models: https://github.com/jhauret/vibravox

Speaker Verification

Task

Given an input audio clip and a reference audio clip of a known speaker, the model's objective is to compare the two clips and verify whether they come from the same individual. This typically involves extracting a speaker embedding from each clip with a deep neural network trained on a large dataset of voices, then measuring the similarity between the two embeddings with a technique such as cosine similarity or a learned distance metric. This task is crucial in applications requiring secure access control, such as biometric authentication systems, where a person's voice acts as a unique identifier.
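The comparison step can be sketched as follows, assuming the embeddings have already been extracted by some speaker encoder. The function names, the example vectors, and the 0.5 threshold are hypothetical; in practice the threshold is tuned on held-out trials, and this is not the testing code of the Vibravox repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(test_embedding, reference_embedding, threshold=0.5):
    """Accept the trial if the similarity reaches the (assumed) threshold."""
    return cosine_similarity(test_embedding, reference_embedding) >= threshold


# Made-up 3-dimensional embeddings; real ones have hundreds of dimensions.
enrolled = [1.0, 2.0, 3.0]
close_match = [1.1, 1.9, 3.2]
unrelated = [3.0, -2.0, 0.1]

print(same_speaker(close_match, enrolled))
print(same_speaker(unrelated, enrolled))
```

Raising the threshold reduces false acceptances at the cost of more false rejections, so the operating point depends on the security requirements of the application.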

Please refer to the Vibravox paper for more information.

Testing code

Please follow this link to get the testing code of our model: https://github.com/jhauret/vibravox