Motivations
Context
Unlike traditional microphones, which rely on airborne sound waves, body-conduction microphones capture speech signals directly from the body. This makes them attractive in noisy environments, since they pick up very little ambient noise. Although body-conduction microphones have been available for decades, their limited bandwidth has restricted their widespread use. Two tracks of improvement may now bring this technology to a wider public for speech capture and communication in noisy environments.
Research progress
On the one hand, research on the physics and electronics of these devices is progressing, notably with skin-attachable sensors. Like earlier bone and throat microphones, these new wearable sensors detect skin acceleration, which is highly and linearly correlated with voice pressure. They improve on the state of the art in two ways: higher sensitivity over the voice frequency range, which improves the signal-to-noise ratio, and better skin conformity, which facilitates adhesion to curved skin surfaces. However, they cannot capture the full bandwidth of the speech signal because body tissues inherently act as a low-pass filter. They are also not yet commercially available, as the manufacturing process still needs to be stabilized.
Deep learning
On the other hand, deep learning methods have shown outstanding performance in a wide range of tasks and can help overcome the bandwidth limitation. For speech enhancement, several works have regenerated mid and high frequencies from low-frequency content. For robust speech recognition, models like Whisper have pushed the limits of what counts as a usable signal.
The need for an open dataset for research purposes
Large-scale datasets play a critical role in advancing research and development in speech enhancement and recognition using body-conduction microphones. They allow researchers to train and evaluate deep learning models, and their absence has been a key obstacle to achieving high-quality, intelligible speech with such microphones. Such datasets are still lacking. The largest is the ESMB corpus, which comprises 128 hours of recordings but uses only a bone microphone. Other private datasets exist, but they are too limited and not openly available.