AISHELL6-Whisper: A Chinese Mandarin Audio-Visual Whisper Speech Dataset

A Large-Scale Open-Source Audio-Visual Whisper Speech Dataset with Speech Recognition Baselines

Abstract

Whisper speech recognition is crucial not only for ensuring privacy in sensitive communications, but also for providing a communication bridge for patients under vocal restraint and for enabling discreet interaction in noise-sensitive environments.

The development of Chinese Mandarin audio-visual whispered speech recognition is hindered by the lack of large-scale datasets. We present AISHELL6-Whisper, a large-scale open-source audio-visual whisper speech dataset containing 30 hours each of whisper and parallel normal speech, with synchronized frontal facial videos.

Moreover, we propose an audio-visual speech recognition (AVSR) baseline based on the Whisper-Flamingo framework, which integrates a parallel training strategy to align embeddings across speech types and employs a projection layer to adapt to whisper speech's spectral properties. The model achieves a Character Error Rate (CER) of 4.13% for whispered speech and 1.11% for normal speech on the test set of our dataset.

Dataset Overview

The AISHELL6-Whisper corpus was collected in a controlled studio environment and contains parallel recordings of whispered and normal speech. It comprises 167 speakers, each reading approximately 20 minutes of poetry texts, with no content overlap between speakers.

Among them, 121 participants were recorded with both a high-fidelity microphone and a synchronized RGBD camera, while the remaining 46 participants were recorded with audio only. Audio was captured with a single-channel high-fidelity microphone (Neumann U87) at a 48 kHz sampling rate, with a background noise level below 20 dB.

The microphone was positioned below the speaker's chin to ensure sound quality without obscuring the speaker's lip movements. Video recordings were captured using an RGBD camera placed one meter directly in front of the speaker, at a resolution of 1280×720 and 25 fps.

Dataset Statistics

Data Split Statistics

Set    Video  Num of Spk  Type     Time (hrs)  Utterances
Train  Yes    82          Normal   14.8339     10012
Train  Yes    82          Whisper  14.8359     9971
Train  No     28          Normal   4.8887      3335
Train  No     28          Whisper  5.0526      3332
Valid  Yes    19          Normal   3.3252      2331
Valid  Yes    19          Whisper  3.3983      2304
Valid  No     10          Normal   1.7310      1190
Valid  No     10          Whisper  1.8038      1191
Test   Yes    20          Normal   3.4126      2423
Test   Yes    20          Whisper  3.3601      2351
Test   No     8           Normal   1.3477      940
Test   No     8           Whisper  1.3021      942

Comparison with Existing Whisper Speech Datasets

Dataset            Type  Language  Time (hrs)  Parallel  Num of Spk
UTVE-I             ASR   English   <1          Yes       12
AVWD               AVSR  Chinese   <2.44       Yes       10
CHAINs             ASR   English   <3          Yes       36
AV-Whisper         AVSR  English   <10         Yes       11
iWhisper-Mandarin  ASR   Chinese   15          Yes       80
wTIMIT             ASR   English   26          Yes       48
AISHELL6-Whisper   AVSR  Chinese   30          Yes       167

Spectrogram Comparison

[Figure] Comparison of spectrograms for normal speech (left) and whisper speech (right)

Baseline Model & Experimental Results

We implemented Whisper-Flamingo as the baseline for our audio-visual whisper speech recognition system. This model incorporates visual features from AV-HuBERT into the OpenAI Whisper speech recognition and translation framework through gated cross-attention.

We employ a parallel training strategy to align whisper speech embeddings with normal speech embeddings, and apply an additional projection layer on top of the whisper speech features to further improve the recognition accuracy of whisper speech.
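The two ideas above can be sketched as follows. This is a simplified NumPy illustration with made-up dimensions and random stand-in embeddings, not the actual Whisper-Flamingo implementation: the projection layer is shown as a plain linear map on the whisper branch, and alignment is shown as a mean-squared error between the two branches' embeddings for a parallel utterance pair.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: T frames, D-dimensional encoder embeddings (hypothetical).
T, D = 50, 64

# Stand-ins for encoder outputs of one parallel utterance pair
# (same text read in whispered vs. normal phonation).
whisper_emb = rng.standard_normal((T, D))
normal_emb = rng.standard_normal((T, D))

# Projection layer: a learnable linear map applied only to the whisper
# branch, compensating for whisper speech's different spectral properties.
W = rng.standard_normal((D, D)) * 0.01
b = np.zeros(D)

def project(x, W, b):
    """Apply the whisper-branch projection layer."""
    return x @ W + b

def alignment_loss(w_emb, n_emb):
    """Mean-squared error between the two branches' embeddings,
    pushing whisper features toward their parallel normal-speech
    counterparts during parallel training."""
    return float(np.mean((w_emb - n_emb) ** 2))

loss = alignment_loss(project(whisper_emb, W, b), normal_emb)
```

In the real model this loss would be combined with the standard recognition objective and the projection weights learned jointly; the sketch only shows how the parallel pair enters the alignment term.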

Model Architecture

[Figure] Model architecture diagram

Recognition Performance on AISHELL6-Whisper

Model                Parallel training  Projection layer  Video  Whisper Speech CER  Normal Speech CER
Whisper (Large-V3)   -                  -                 -      18.93%              3.95%
+Finetune            -                  -                 -      6.69%               1.62%
+Parallel training   Yes                -                 -      4.53%               0.98%
+Projection layer    Yes                Yes               -      4.34%               1.14%
+Video               Yes                -                 Yes    4.21%               1.08%
+Video (Proposed)    Yes                Yes               Yes    4.13%               1.11%

Download

Note: Our released dataset contains audio-visual data with synchronized transcripts. The dataset is available for research purposes only.

To obtain the AISHELL6-Whisper dataset, please contact dataset@aishelltech.com from your institutional e-mail address, including your affiliation and intended use of the data. Please allow up to 5 business days for your request to be processed.

The dataset includes:

  • Audio recordings (48 kHz, WAV format)
  • Video recordings (1280×720, 25 fps, MP4 format)
  • Text transcripts with timestamps
  • Speaker metadata
  • Preprocessed lip movement videos
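As an illustration of working with the parallel recordings, the sketch below pairs normal and whisper utterances by utterance ID. The directory layout in the example is a hypothetical assumption for illustration only; the dataset's actual structure may differ, so adapt the path parsing accordingly.

```python
from pathlib import PurePosixPath

def pair_parallel_utterances(wav_paths):
    """Group wav files into {utterance_id: {style: path}} entries.

    Assumes a hypothetical layout such as
        S0120/normal/S0120_0001.wav
        S0120/whisper/S0120_0001.wav
    where the parent directory names the speaking style.
    """
    pairs = {}
    for p in map(PurePosixPath, wav_paths):
        utt_id = p.stem        # e.g. "S0120_0001"
        style = p.parent.name  # "normal" or "whisper" in this layout
        pairs.setdefault(utt_id, {})[style] = str(p)
    # Keep only utterances recorded in both speaking styles.
    return {u: d for u, d in pairs.items()
            if "normal" in d and "whisper" in d}

paths = [
    "S0120/normal/S0120_0001.wav",
    "S0120/whisper/S0120_0001.wav",
    "S0120/normal/S0120_0002.wav",  # no whisper counterpart; dropped
]
paired = pair_parallel_utterances(paths)
```

Filtering to complete pairs mirrors how the parallel training strategy consumes the data: each training example needs both renditions of the same text.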

License

The AISHELL6-Whisper dataset is licensed under the CC BY-NC-SA 4.0 license. This means you may share and adapt the dataset for non-commercial purposes, provided you give appropriate attribution and distribute your contributions under the same license. Detailed terms can be found in the LICENSE file.

Demo Samples

Parallel Normal and Whisper Speech Samples

Speaker ID  Text (normal and whisper audio for each sample)
S0120       今天天气真好,适合出去散步 ("The weather is lovely today, perfect for a walk")
S0187       科学研究需要耐心和毅力 ("Scientific research requires patience and perseverance")
S0256       人工智能技术正在改变世界 ("Artificial intelligence technology is changing the world")

Audio-Visual Samples

Normal Speech with Video

Speaker: S0120

Whisper Speech with Video

Speaker: S0120