VibraVox Dataset

Download

The dataset was released in July 2024 and is available on HuggingFace.

Complementary resources to reproduce experiments are also available on GitHub.

The scientific paper describing the Vibravox corpus and the results obtained for different speech processing tasks is available on arXiv (submitted to Speech Communication, under review).

A general purpose dataset of speech captured with body-conduction transducers

Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings made with five different body-conduction audio sensors: two in-ear microphones, two bone-conduction vibration pickups and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference.

The Vibravox corpus contains 45.5 hours of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high-order ambisonics 3D spatializer. The corpus includes annotations on the recording conditions and linguistic transcriptions.

Image of vibravox sensors on subject

Tasks

We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performances on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.

Credits

If you use the Vibravox dataset for research, please cite this paper:

@article{jhauret-et-al-2024-vibravox,
     title={{Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors}},
     author={Hauret, Julien and Olivier, Malo and Joubaud, Thomas and Langrenne, Christophe and
       Poir{\'e}e, Sarah and Zimpfer, Véronique and Bavu, {\'E}ric},
     year={2024},
     eprint={2407.11828},
     archivePrefix={arXiv},
     primaryClass={eess.AS},
     url={https://arxiv.org/abs/2407.11828},
}

and this repository, which is linked to a DOI:

@misc{cnamlmssc2024vibravoxdataset,
   author={Hauret, Julien and Olivier, Malo and Langrenne, Christophe and
       Poir{\'e}e, Sarah and Bavu, {\'E}ric},
   title        = { {Vibravox} (Revision 7990b7d) },
   year         = 2024,
   url          = { https://huggingface.co/datasets/Cnam-LMSSC/vibravox },
   doi          = { 10.57967/hf/2727 },
   publisher    = { Hugging Face }
}


Chapter 1

About

Discover why we investigate Body Conduction Microphones and speech enhancement techniques for communications.

Presentation on the subject


Motivations

Context

Unlike traditional microphones, which rely on airborne sound waves, body-conduction microphones capture speech signals directly from the body, offering advantages in noisy environments by eliminating the capture of ambient noise. Although body-conduction microphones have been available for decades, their limited bandwidth has restricted their widespread adoption. However, two tracks of improvement now make it plausible that this technology will reach a wide public for speech capture and communication in noisy environments.

Research progress

On the one hand, research on the physics and electronics side is progressing, notably with new skin-attachable sensors. Like earlier bone and throat microphones, these wearable sensors detect skin acceleration, which is highly and linearly correlated with voice pressure. They improve on the state of the art through superior sensitivity over the voice frequency range, which helps improve the signal-to-noise ratio, and through superior skin conformity, which facilitates adhesion to curved skin surfaces. However, they cannot capture the full bandwidth of the speech signal due to the inherent low-pass filtering of body tissues. They are also not yet available for purchase, as the manufacturing process still needs to be stabilized.

Deep Learning

On the other hand, deep learning methods have shown outstanding performance in a wide range of tasks and can overcome this last drawback. For speech enhancement, recent works have managed to regenerate mid and high frequencies from low-frequency content. For robust speech recognition, models like Whisper have pushed the limits of usable signals.

The need for an open dataset for research purposes

The availability of large-scale datasets plays a critical role in advancing research and development in speech enhancement and recognition using body-conduction microphones. Such datasets allow researchers to train and evaluate deep learning models, which have been a key missing ingredient in achieving high-quality, intelligible speech with these microphones. They are still lacking: the largest is the ESMB corpus, which represents 128 hours of recordings but only uses a bone-conduction microphone. Other private datasets exist, but they are too limited and not open source.

People

Julien HAURET


Bio

Julien Hauret is a PhD candidate at Cnam Paris, pursuing research in machine learning applied to speech processing. He holds two MSc degrees from ENS Paris-Saclay, one in Electrical Engineering (2020) and the other in Applied Mathematics (2021). His research background includes experience at Columbia University, the French Ministry of the Armed Forces and the Pulse Audition start-up. Additionally, he has lectured for two consecutive years on algorithms and data structures at the École des Ponts ParisTech. His research focuses on the use of deep learning for speech enhancement applied to body-conducted speech. With a passion for interdisciplinary collaboration, Julien aims to improve human communication through technology.

Role

Co-coordinator of the project. Implemented the recording software and designed the recording procedure. Co-designed the website. Responsible for GDPR compliance. Participated in the selection of microphones. Led the Speech Enhancement task, co-coordinated the Automatic Speech Recognition, and provided support for the Speaker Verification task. Core contributor and manager of participant recording. Oversaw the GitHub project. Co-managed the dataset creation, post-filtering process, and upload to HuggingFace, as well as implemented the retained solution. Responsible for model training on Jean-Zay HPC and their upload/documentation to the Hugging Face Hub. Main contributor to the research article.

Éric BAVU


Bio

Éric Bavu is a Full Professor of Acoustics and Signal Processing at the Laboratoire de Mécanique des Structures et des Systèmes Couplés (LMSSC) within the Conservatoire National des Arts et Métiers (Cnam), Paris, France. He completed his undergraduate studies at École Normale Supérieure de Cachan, France, from 2001 to 2005. In 2005, he earned an M.Sc. in Acoustics, Signal Processing, and Computer Science Applied to Music from Université Pierre et Marie Curie (UPMC, now Sorbonne Université), followed by a Ph.D. in Acoustics jointly awarded by Université de Sherbrooke, Canada, and UPMC, France, in 2008. He also conducted post-doctoral research on biological soft tissue imaging at the Langevin Institute at École Supérieure de Physique et Chimie ParisTech (ESPCI), France. Since 2009, he has supervised six Ph.D. students at LMSSC, focusing on time-domain audio signal processing for inverse problems, 3D audio, and deep learning for audio. His current research interests encompass deep learning methods applied to inverse problems in acoustics, moving sound source localization and tracking, speech enhancement, and speech recognition.

Role

Co-coordinator of the project. Responsible for the selection, calibration, and adjustment of the microphones. Co-designed the website. Implemented the backend for sound spatialization. Co-coordinated the Automatic Speech Recognition and the Speaker Verification tasks. Assisted with participant recording. Co-managed the dataset creation, post-filtering process, and upload to HuggingFace. GitHub Contributor. Produced the HuggingFace Dataset card. Main contributor to the research article.

Malo OLIVIER


Bio

Malo Olivier is an engineering student at INSA Lyon who completed an internship at the Laboratoire de Mécanique des Structures et des Systèmes Couplés (LMSSC) at the Conservatoire National des Arts et Métiers (Cnam), Paris, France. He is pursuing graduate studies in the Computer Science department of INSA Lyon, from which he will graduate in 2024. Malo has solid experience implementing solutions ranging from information systems to deep neural network architectures, including web applications. He plans to pursue a Ph.D. in Artificial Intelligence, specializing in deep neural networks applied to scientific domains, and hopes his engineering profile highlights his implementation skills within projects of high interest.

Role

Core contributor to participant recording. Assisted in exploring the Automatic Speech Recognition task, dataset creation, post-filtering process, and upload to HuggingFace. GitHub Contributor. Contributed to the article review process.

Thomas JOUBAUD


Bio

Thomas Joubaud has been a Research Associate at the Acoustics and Soldier Protection department of the French-German Research Institute of Saint-Louis (ISL), France, since 2019. In 2013, he graduated from École Centrale Marseille, France, and received a master's degree in Mechanics, Physics and Engineering, specialized in Acoustical Research, from Aix-Marseille University, France. He earned a Ph.D. in Mechanics, specialized in Acoustics, from the Conservatoire National des Arts et Métiers (Cnam), Paris, France, in 2017; the thesis was carried out in collaboration with and within the ISL. From 2017 to 2019, he worked as a post-doctoral research engineer with Orange SA in Cesson-Sévigné, France. His research interests include audio signal processing, hearing protection, psychoacoustics (especially speech intelligibility and sound localization), and high-level continuous and impulse noise measurement.

Role

Microphone selection assistance. Co-coordinated the Speaker Verification task. GitHub Contributor. Contributed to the article review process.

Christophe LANGRENNE


Bio

Christophe Langrenne is a scientific researcher at the Laboratoire de Mécanique des Structures et des Systèmes Couplés (LMSSC) at the Conservatoire National des Arts et Métiers (Cnam), Paris, France. After completing his PhD on the regularization of inverse problems, he developed a fast multipole method (FMM) algorithm for solving large-scale scattering and propagation problems. Also interested in 3D audio, he has co-supervised three PhD students on this topic, in particular on Ambisonics (recording and decoding) and binaural rendering (front/back confusion).

Role

Participated in the microphone adjustment. Core contributor of participant recording. Contributed to the article review process.

Sarah POIRÉE


Bio

Sarah Poirée is a technician at the Laboratoire de Mécanique des Structures et des Systèmes Couplés (LMSSC) within the Conservatoire National des Arts et Métiers (Cnam), Paris, France. Her activities focus on the design and development of experimental setups. Notably, she contributed to the creation of the 3D sound spatialization system used during the recording of the Vibravox dataset.

Role

Core contributor of participant recording. Contributed to the article review process.

Véronique ZIMPFER


Bio

Véronique Zimpfer has been a Scientific Researcher at the Acoustics and Soldier Protection department of the French-German Research Institute of Saint-Louis (ISL), Saint-Louis, France, since 1997. She holds an M.Sc. in Signal Processing from Grenoble INP, France, and obtained a PhD in Acoustics from INSA Lyon, France, in 2000. Her expertise lies at the intersection of communication in noisy environments and auditory protection. Her research focuses on improving adaptive auditory protectors, refining radio communication strategies through unconventional microphone methods, and enhancing auditory perception while utilizing protective gear.

Role

Microphone selection assistance. Contributed to the article review process.

Philippe CHENEVEZ


Bio

Philippe Chenevez is an audiovisual and acoustics professional, having graduated from the École Louis Lumière in 1984 with a BTS in audiovisual engineering, and from the Cnam in 1996 with an engineering degree in acoustics. He held the position of Technical Director at VDB from 1990 to 1998, where he specialized in HF and LF electronics, focusing on maintenance and development. In 2006, he founded CINELA, a renowned manufacturer of wind and vibration protection for recording microphones, making a significant contribution to the audiovisual industry with his innovative products.

Role

Responsible for the pre-amplification of the microphones.

Jean-Baptiste DOC


Bio

Jean-Baptiste Doc received a Ph.D. in acoustics from the University of Le Mans, France, in 2012. He is currently an Associate Professor at the Laboratoire de Mécanique des Structures et des Systèmes Couplés, Conservatoire National des Arts et Métiers, Paris, France. His research interests include the modeling and optimization of complex-shaped waveguides, their acoustic radiation, and the analysis of sound production mechanisms in wind instruments.

Role

Participated in the microphone adjustment.

Chapter 2

Documentation


Hardware

Browse all details of the hardware used for the Vibravox project.

Image of vibravox sensors on subject


Audio sensors

Browse all audio sensors used for the Vibravox project.

Participants wearing the sensors:

Close-up on the sensors:


Reference airborne microphone

Image of airborne microphone 1

Reference

The reference of the air-conduction microphone is the Shure WH20XLR.

This microphone is available for sale at Thomann. The technical documentation can be found here.

Image of airborne microphone

Rigid in-ear microphone

Image of rigid in-ear microphone 1

Reference

This rigid in-ear microphone is integrated into the Acoustically Transparent Earpieces product manufactured by the German company inear.de.

Technical details are given in the AES publication by Denk et al.: A one-size-fits-all earpiece with multiple microphones and drivers for hearing device research.

For the VibraVox dataset, we only used the Knowles SPH1642HT5H-1 top-port MEMS in-ear microphone, the technical documentation for which is available at Knowles.

Image of rigid in-ear microphone

Soft in-ear microphone

Soft in-ear microphone image 1

Reference

This microphone is a prototype produced jointly by the Cotral company, the ISL (Institut franco-allemand de recherches de Saint-Louis) and the LMSSC (Laboratoire de Mécanique des Structures et des Systèmes Couplés). It consists of an Alvis mk5 earmold combined with an STMicroelectronics MP34DT01 microphone. Several measurements were carried out to ensure optimum acoustic sealing of the in-ear microphone and to select the most suitable earmold.

Soft in-ear microphone image

Pre-amplification

This microphone required a pre-amplification circuit.

Laryngophone

Image of the throat microphone 1

Reference

The reference of the throat microphone is Dual Transponder Throat Microphone - 3.5mm (1/8") Connector - XVTM822D-D35 manufactured by ixRadio. This microphone is available for sale on ixRadio.

Image of the throat microphone

Forehead accelerometer

Accelerometer image 1

Reference

To offer a wide variety of body-conduction microphones, we incorporated a Knowles BU23173-000 accelerometer positioned on the forehead and secured in place with a custom 3D-printed headband.

Accelerometer image

Pre-amplification

A dedicated preamplifier was developed for this particular sensor.

Hold in position

The designed headband is inspired by a headlamp design. A custom 3D-printed piece was necessary to accommodate the sensor to the headband.

GIF of the helmet

Temple vibration pickup

Image of the AKG microphone 1

Reference

The reference of the temple contact microphone is the C411, manufactured by AKG. This microphone is available for sale at Thomann. It is typically used for string instruments, but the VibraVox project uses it as a bone-conduction microphone.

Image of the AKG microphone

Hold in position

This microphone was positioned on the temple using a custom 3D-printed piece. The design of this piece was based on a 3D scan of the Aftershokz helmet, with necessary modifications made to accommodate the sensor with a spherical link.

GIF of the helmet

Recorder

Reference

All of the microphones were connected to a Zoom F8n multitrack field recorder for synchronized recording.

Image of Zoom F8n

Parameters

Microphone   | Track | Trim (dB) | High-pass filter cutoff frequency (Hz) | Input limiter | Phantom powering
Temple       | 1     | 65        | 20                                     | Advanced mode |
Throat       | 2     | 24        | 20                                     | Advanced mode |
Rigid in-ear | 3     | 20        | 20                                     | Advanced mode |
Soft in-ear  | 5     | 30        | 20                                     | Advanced mode |
Forehead     | 6     | 56        | 20                                     | Advanced mode |
Airborne     | 7     | 52        | 20                                     | Advanced mode |

Sound Spatializer

For all the ambient noise samples used in the dataset, the spatialization process was carried out using the Spherebedev 3D sound spatialization sphere, developed during Pierre Lecomte's PhD in our lab, and the ambitools library, also developed by Pierre Lecomte during his PhD at Cnam.

The Spherebedev system is a spherical loudspeaker array with a radius of 1.07 meters, composed of 56 loudspeakers placed around the participants. To ensure precise spatialization in the full range of audio, two nested systems were used:

  • A low-frequency system with 6 high-performance loudspeakers (ScanSpeak, up to 200 Hz) for accurate bass reproduction.
  • A high-frequency system consisting of 50 loudspeakers (Aura, 2 inches, for frequencies above 200 Hz).

The multichannel audio used for higher-order ambisonics resynthesis includes third-order ambisonic recordings captured using a Zylia ZM-1S microphone, and fifth-order ambisonic recordings captured with Memsbedev, a custom prototype ambisonic microphone built in our lab at LMSSC.

Image of sound spatializer microphone


Software

Browse all details of the software used for the Vibravox project.


Frontend

The frontend, built with the tkinter library, consists of 9 sequential windows. The user interface is duplicated on a Wacom tablet used by the participant in the center of the spatialization sphere. Multiple threads were required to allow simultaneous actions, such as updating a progress bar while waiting for a button to be clicked.

UI Windows
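
To illustrate this threading constraint, here is a minimal sketch (not the actual Vibravox frontend; the worker function and timings are placeholders) of a tkinter progress bar updated from a background thread through a queue, since tkinter widgets must only be touched from the main thread:

# Minimal sketch: a tkinter progress bar updated while a worker thread
# simulates a recording task (placeholder logic, not the Vibravox frontend).
import queue
import threading
import time
import tkinter as tk
from tkinter import ttk

progress_queue = queue.Queue()

def worker():
    """Hypothetical background task reporting progress from 0 to 100."""
    for percent in range(101):
        time.sleep(0.05)              # stand-in for actual recording work
        progress_queue.put(percent)

root = tk.Tk()
root.title("Recording...")
bar = ttk.Progressbar(root, maximum=100, length=300)
bar.pack(padx=20, pady=20)

def poll_queue():
    # The worker communicates through a queue polled with `after`,
    # so only the main thread ever updates the widget.
    try:
        while True:
            bar["value"] = progress_queue.get_nowait()
    except queue.Empty:
        pass
    root.after(50, poll_queue)

threading.Thread(target=worker, daemon=True).start()
poll_queue()
root.mainloop()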

Backend

The backend comprises:

  • A dynamic reader implemented with the linecache library, to avoid loading the entire corpus into memory when fetching a new line of text.

  • A cryptography module using cryptography.fernet to encrypt and decrypt the participant identity, necessary to guarantee the right to be forgotten.

  • An SSH client built with paramiko to send instructions to the spatialization sphere when playing the sound, changing tracks, and locating the reading head with the jack_transport and ladish_control bash commands.

  • A timer with start, pause, resume and reset methods.

  • A non-blocking streaming recorder implemented with the sounddevice, soundfile and queue libraries (see the sketch after this list).
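
As a sketch of the last point, the non-blocking pattern below follows the approach documented for the sounddevice library: the audio callback pushes blocks into a queue while the main thread writes them to disk. The file name, sample rate and channel count are placeholders, not the Vibravox settings.

# Minimal sketch of a non-blocking streaming recorder (placeholder parameters).
import queue
import sounddevice as sd
import soundfile as sf

audio_queue = queue.Queue()

def callback(indata, frames, time_info, status):
    """Called by PortAudio from a separate thread for each audio block."""
    if status:
        print(status)
    audio_queue.put(indata.copy())

samplerate, channels = 48_000, 1                 # placeholder values
with sf.SoundFile("take.wav", mode="w", samplerate=samplerate,
                  channels=channels, subtype="PCM_24") as file:
    with sd.InputStream(samplerate=samplerate, channels=channels,
                        callback=callback):
        print("Recording... press Ctrl+C to stop")
        try:
            while True:
                file.write(audio_queue.get())    # drain the queue to disk
        except KeyboardInterrupt:
            print("Recording stopped")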

Noise

Noise for the QiN (Quiet in Noise) step

After standardizing the sampling frequency, channels and format of the AudioSet files, single-channel noise is obtained from 32 ten-second samples of each of the following 90 classes:

[‘Drill’, ‘Truck’, ‘Cheering’, ‘Tools’, ‘Civil defense siren’, ‘Police car (siren)’, ‘Helicopter’, ‘Vibration’, ‘Drum kit’, ‘Telephone bell ringing’, ‘Drum roll’, ‘Waves, surf’, ‘Emergency vehicle’, ‘Siren’, ‘Aircraft engine’, ‘Idling’, ‘Fixed-wing aircraft, airplane’, ‘Vehicle horn, car horn, honking’, ‘Jet engine’, ‘Light engine (high frequency)’, ‘Heavy engine (low frequency)’, ‘Engine knocking’, ‘Engine starting’, ‘Motorboat, speedboat’, ‘Motor vehicle (road)’, ‘Motorcycle’, ‘Boat, Water vehicle’, ‘Fireworks’, ‘Stream’, ‘Train horn’, ‘Foghorn’, ‘Chainsaw’, ‘Wind noise (microphone)’, ‘Wind’, ‘Traffic noise, roadway noise’, ‘Environmental noise’, ‘Race car, auto racing’, ‘Railroad car, train wagon’, ‘Scratching (performance technique)’, ‘Vacuum cleaner’, ‘Tubular bells’, ‘Church bell’, ‘Jingle bell’, ‘Car alarm’, ‘Car passing by’, ‘Alarm’, ‘Alarm clock’, ‘Smoke detector, smoke alarm’, ‘Fire alarm’, ‘Thunderstorm’, ‘Hammer’, ‘Jackhammer’, ‘Steam whistle’, ‘Distortion’, ‘Air brake’, ‘Sewing machine’, ‘Applause’, ‘Drum machine’, “Dental drill, dentist’s drill”, ‘Gunshot, gunfire’, ‘Machine gun’, ‘Cap gun’, ‘Bee, wasp, etc.’, ‘Beep, bleep’, ‘Frying (food)’, ‘Sampler’, ‘Meow’, ‘Toilet flush’, ‘Whistling’, ‘Glass’, ‘Coo’, ‘Mechanisms’, ‘Rub’, ‘Boom’, ‘Frog’, ‘Coin (dropping)’, ‘Crowd’, ‘Crackle’, ‘Theremin’, ‘Whoosh, swoosh, swish’, ‘Raindrop’, ‘Engine’, ‘Rail transport’, ‘Vehicle’, ‘Drum’, ‘Car’, ‘Animal’, ‘Inside, small room’, ‘Laughter’, ‘Train’]

This represents 8 hours of audio.

Loudness normalization

# EBU R128 loudness normalization of the assembled noise file
# (target -15 LUFS, loudness range 5 LU, true peak -2 dBTP, dynamic mode).
from ffmpeg_normalize import FFmpegNormalize

normalizer = FFmpegNormalize(normalization_type="ebu",
                             target_level=-15.0,
                             loudness_range_target=5,
                             true_peak=-2,
                             dynamic=True,
                             print_stats=False,
                             sample_rate=48_000,
                             progress=True)

normalizer.add_media_file(input_file='tot.rf64',
                          output_file='tot_normalized.wav')
normalizer.run_normalization()

Spatialization

The direction of sound is sampled uniformly on the unit sphere using the inverse cumulative distribution function.
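
Drawing the azimuth uniformly and mapping a second uniform variable through the inverse CDF of the polar angle (an arccosine) avoids the clustering of directions at the poles that a naive uniform draw of the elevation would produce. A minimal numpy sketch (illustrative variable names):

# Minimal sketch: uniform sampling of directions on the unit sphere
# via the inverse cumulative distribution function.
import numpy as np

def sample_directions(n, seed=0):
    """Draw n directions uniformly distributed on the unit sphere."""
    rng = np.random.default_rng(seed)
    u, v = rng.uniform(size=(2, n))
    azimuth = 2 * np.pi * u              # uniform in [0, 2*pi)
    polar = np.arccos(1 - 2 * v)         # inverse CDF of the polar angle
    x = np.sin(polar) * np.cos(azimuth)
    y = np.sin(polar) * np.sin(azimuth)
    z = np.cos(polar)
    return np.stack([x, y, z], axis=-1)  # shape (n, 3), unit vectors

print(sample_directions(5))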

Noise for the SiN (Speak in Noise) step

The noise used for the final stage of the recording was captured with a ZYLIA ZR-1 Portable Recorder. It consists of applause, demonstrations and opera, for a total of 2 hours and 40 minutes.

ZYLIA ZR-1 Portable

Recording protocol

Procedure

The recording process consists of four steps:

  • Speak in Silence: For a duration of 15 minutes, the participant reads sentences sourced from the French Wikipedia. Each utterance generates a new recording and the transcriptions are preserved.

  • Quiet in Noise: During 2 minutes and 24 seconds, the participant remains silent in a noisy environment created from the AudioSet samples. These samples have been selected from relevant classes, normalized in loudness, pseudo-spatialized and are played from random directions using a spatialization sphere equipped with 56 loudspeakers. The objective of this phase is to gather realistic background noises that will be combined with the Speak in Silence recordings to maintain a clean reference.

  • Quiet in Silence: The procedure is repeated for 54 seconds in complete silence to record solely physiological and microphone noises. These samples can be valuable for tasks such as heart rate tracking or simply analyzing the noise properties of the various microphones.

  • Speak in Noise: The final phase (54 seconds) will primarily serve to test the different systems (Speech Enhancement, Automatic Speech Recognition, Speaker Identification) that will be developed based on the recordings from the first three phases. This real-world testing will provide valuable insights into the performance and effectiveness of these systems in practical scenarios. The noise was recorded using the ZYLIA ZR-1 Portable Recorder from spatialized scenes and replayed in the spatialization sphere with ambisonic processing.

Post-processing

Post-processing indicators / Post-processing filtering

Please refer to the Vibravox paper for more information on the dataset curation.

Analysis

Coherence functions

The coherence functions of all microphones are shown in the figure below during an active speech phase.
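
Such coherence functions are typically estimated between each body-conduction channel and the airborne reference. A minimal sketch with scipy is shown below; the synthetic signals are placeholders for time-aligned recordings from the dataset:

# Minimal sketch: magnitude-squared coherence between two channels,
# estimated with Welch's method (synthetic placeholder signals).
import numpy as np
from scipy.signal import coherence

fs = 48_000
rng = np.random.default_rng(0)
airborne = rng.standard_normal(10 * fs)                # placeholder reference channel
body = airborne + 0.5 * rng.standard_normal(10 * fs)   # placeholder body-conduction channel

freqs, coh = coherence(airborne, body, fs=fs, nperseg=4096)
print(freqs.shape, coh.max())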

Consent form

Ensure compliance with GDPR

A consent form for participation in the VibraVox dataset has been drafted and approved by the Cnam lawyer. This form mentions that the dataset will be released under the Creative Commons BY 4.0 license, which allows anyone to share and adapt the data on the condition that the original authors are cited. All CNIL requirements have been checked, including the right to be forgotten. This form must be signed by each VibraVox participant.

The voice recordings collected during this experiment are intended to be used for research into noise-resilient microphones, as part of the thesis project of Mr. Julien HAURET, PhD candidate at the Conservatoire national des arts et métiers (Cnam) (julien.hauret@lecnam.net).

The purpose of this form is to obtain the consent of each of the participants in this project to the collection and storage of their voice recordings, necessary for the production of the results of this research project.

The recordings collected will be anonymized and shared publicly at vibravox.cnam.fr under a Creative Commons BY 4.0 license, it being understood that this license allows anyone to share and adapt your data.

This processing of personal data is recorded in a computerized file by Cnam.

The data may be kept by Cnam for up to 50 years.

The recipients of the data collected will be the above-mentioned doctoral researcher and his thesis supervisor, Mr. Éric Bavu.

In accordance with the General Data Protection Regulation EU 2016/679 (GDPR) and laws no. 2018-493 of June 20, 2018 relating to the protection of personal data and no. 2004-801 of August 6, 2004 relating to the protection of individuals with regard to the processing of personal data and amending law no. 78-17 of January 6, 1978 relating to data processing, you have the right to access, rectify, oppose, delete, limit and port your personal data, i.e. your voice recordings.

To exercise these rights, or if you have any questions about the processing of your data under this scheme, please contact vibravox@cnam.fr. Although your right to be forgotten remains applicable for the duration of the data retention period, we advise you to exercise this right before the final publication of the database, which is scheduled for 01/10/2023.

If, after contacting us, you feel that your rights have not been respected, you can contact the Cnam-établissement public data protection officer directly at ep_dpo@lecnam.netonmicrosoft.com. You can also lodge a complaint with the CNIL.

Chapter 3

Audio Tasks

Discover the tasks and baselines offered with VibraVox dataset.


Speech Enhancement

Task

This task is mainly oriented towards denoising and bandwidth extension (also known as audio super-resolution), which are required to enhance the audio quality of body-conducted speech. The model is presented with a pair of audio clips (one captured by a body-conduction sensor and the corresponding clean, full-bandwidth airborne-captured speech), and asked to enhance the audio by denoising it and regenerating the mid and high frequencies from the low-frequency content only.
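
A minimal sketch of how such a pair can be pulled from the HuggingFace dataset is given below; the subset name ("speech_clean") and the column names are written as assumptions and should be checked against the dataset card rather than treated as guaranteed identifiers.

# Minimal sketch: fetching one (body-conducted, airborne reference) pair.
# Subset and column names are assumptions; check the dataset card at
# https://huggingface.co/datasets/Cnam-LMSSC/vibravox before relying on them.
from datasets import load_dataset

dataset = load_dataset("Cnam-LMSSC/vibravox", "speech_clean",
                       split="test", streaming=True)
sample = next(iter(dataset))

body_conducted = sample["audio.throat_microphone"]["array"]   # model input
airborne_ref = sample["audio.headset_microphone"]["array"]    # clean target
print(len(body_conducted), len(airborne_ref))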

Please refer to the Vibravox paper for more information.

Pre-trained models on HuggingFace

Please follow this link to go to the card of our EBEN models: https://huggingface.co/Cnam-LMSSC/vibravox_EBEN_models

Training code

Please follow this link to get the training code of our models: https://github.com/jhauret/vibravox

Audio Samples

Audio samples are provided for each sensor (Forehead, In-ear rigid, In-ear soft, Temple, Throat): the raw input, the version enhanced by EBEN, and the reference audio.

Vibravox enhanced by EBEN

Explore the entire test set enhanced by the EBEN models:

Speech recognition

Task

The model is presented with an audio file and asked to transcribe it to written text (either normalized text or phonemized text). The most common evaluation metrics are the word error rate (WER), character error rate (CER), and phoneme error rate (PER).
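
For instance, WER and CER can be computed with the jiwer library (a minimal sketch on made-up strings):

# Minimal sketch: word and character error rates with jiwer (made-up strings).
import jiwer

reference = "le corpus contient de la parole en français"
hypothesis = "le corpus contient de la parole en france"

print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))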

Please refer to the Vibravox paper for more information.

Pre-trained models on HuggingFace

Please follow this link to go to the card of our phonemizers: https://huggingface.co/Cnam-LMSSC/vibravox_phonemizers

Training code

Please follow this link to get the training code of our models: https://github.com/jhauret/vibravox

Speaker Verification

Task

Given an input audio clip and a reference audio clip of a known speaker, the model's objective is to compare the two clips and verify whether they are from the same individual. This often involves extracting embeddings from a deep neural network trained on a large dataset of voices. The model then measures the similarity between these embeddings using techniques like cosine similarity or a learned distance metric. This task is crucial in applications requiring secure access control, such as biometric authentication systems, where a person's voice acts as a unique identifier.
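
A minimal sketch of the scoring step, assuming embeddings have already been extracted (the embedding dimension and decision threshold below are illustrative placeholders):

# Minimal sketch: cosine-similarity scoring of two speaker embeddings.
# Random vectors and the threshold are illustrative placeholders.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
enrollment_embedding = rng.standard_normal(192)   # e.g. a typical speaker-embedding size
test_embedding = rng.standard_normal(192)

score = cosine_similarity(enrollment_embedding, test_embedding)
print("score:", score, "same speaker:", score > 0.25)   # illustrative threshold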

Please refer to the Vibravox paper for more information.

Testing code

Please follow this link to get the testing code of our model: https://github.com/jhauret/vibravox

Chapter 6

License

The Vibravox dataset has been released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Anyone is free to share, copy, and redistribute the dataset in any medium or format, as well as adapt, remix, transform, and build upon it for any purpose, even commercially.

The primary condition is that proper credit must be given to the creators, a link to the license must be provided, and any changes made must be indicated. This fosters broad reuse and innovation while ensuring that we, as original creators, are acknowledged for our contribution to open science.