AISHELL6-Whisper: A Chinese Mandarin Audio-Visual Whisper Speech Dataset
A Large-Scale Open-Source Audio-Visual Whisper Speech Dataset with Speech Recognition Baselines
Abstract
Whisper speech recognition is crucial not only for ensuring privacy in sensitive communications but also for providing a critical communication bridge for patients under vocal restraint and enabling discreet interaction in noise-sensitive environments.
The development of Chinese Mandarin audio-visual whispered speech recognition is hindered by the lack of large-scale datasets. We present AISHELL6-Whisper, a large-scale open-source audio-visual whisper speech dataset containing 30 hours each of whisper and parallel normal speech, with synchronized frontal facial videos.
Moreover, we propose an audio-visual speech recognition (AVSR) baseline based on the Whisper-Flamingo framework, which integrates a parallel training strategy to align embeddings across speech types and employs a projection layer to adapt to whisper speech's spectral properties. The model achieves a Character Error Rate (CER) of 4.13% for whispered speech and 1.11% for normal speech on the test set of our dataset.
Dataset Overview
The AISHELL6-Whisper corpus was collected in a controlled studio environment, containing parallel recordings of whispered and normal speech. It comprises 167 speakers, each reading approximately 20 minutes of poetry texts without any overlap in content.
Among them, 121 participants were recorded using both a high-fidelity microphone and a synchronized RGBD camera, while the remaining 46 participants were recorded with audio only. Audio was captured with a single-channel high-fidelity microphone (Neumann U87) at a 48 kHz sampling rate, with a background noise level below 20 dB.
The microphone was positioned below the speaker's chin to ensure sound quality without obscuring the speaker's lip movements. Video recordings were captured using an RGBD camera placed one meter directly in front of the speaker, with a resolution of 1280×720 at 25 fps.
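Since Whisper-family models expect 16 kHz input while the corpus audio is released at 48 kHz, a downsampling step is typically needed before feature extraction. The sketch below uses naive 3:1 decimation for brevity; a real pipeline should apply an anti-aliasing low-pass filter first (e.g. polyphase resampling), and the synthetic tone is only for illustration.

```python
import numpy as np

def downsample_48k_to_16k(audio: np.ndarray) -> np.ndarray:
    """Naive 3:1 decimation from 48 kHz to 16 kHz.

    A production pipeline should low-pass filter before decimating to
    avoid aliasing; this sketch simply keeps every 3rd sample.
    """
    return audio[::3]

# One second of a 440 Hz tone sampled at 48 kHz (synthetic stand-in).
t = np.arange(48000) / 48000.0
tone = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)
tone_16k = downsample_48k_to_16k(tone)
print(tone_16k.shape)  # (16000,)
```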
Dataset Statistics
Data Split Statistics
Set | Video | Num of Spk | Type | Time (hrs) | Utterances |
---|---|---|---|---|---|
Train | Yes | 82 | Normal | 14.8339 | 10012 |
Train | Yes | 82 | Whisper | 14.8359 | 9971 |
Train | No | 28 | Normal | 4.8887 | 3335 |
Train | No | 28 | Whisper | 5.0526 | 3332 |
Valid | Yes | 19 | Normal | 3.3252 | 2331 |
Valid | Yes | 19 | Whisper | 3.3983 | 2304 |
Valid | No | 10 | Normal | 1.7310 | 1190 |
Valid | No | 10 | Whisper | 1.8038 | 1191 |
Test | Yes | 20 | Normal | 3.4126 | 2423 |
Test | Yes | 20 | Whisper | 3.3601 | 2351 |
Test | No | 8 | Normal | 1.3477 | 940 |
Test | No | 8 | Whisper | 1.3021 | 942 |
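The per-split durations in the table sum to roughly 30 hours for each speech type, matching the headline figure. A quick sanity check over the values copied from the table:

```python
# Per-split durations in hours, copied from the data split table.
normal_hours = [14.8339, 4.8887, 3.3252, 1.7310, 3.4126, 1.3477]
whisper_hours = [14.8359, 5.0526, 3.3983, 1.8038, 3.3601, 1.3021]

total_normal = sum(normal_hours)
total_whisper = sum(whisper_hours)
print(round(total_normal, 4), round(total_whisper, 4))  # 29.5391 29.7528
```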
Comparison with Existing Whisper Speech Datasets
Dataset | Type | Language | Time (hrs) | Parallel | Num of Spk |
---|---|---|---|---|---|
UTVE-I | ASR | English | <1 | Yes | 12 |
AVWD | AVSR | Chinese | <2.44 | Yes | 10 |
CHAINs | ASR | English | <3 | Yes | 36 |
AV-Whisper | AVSR | English | <10 | Yes | 11 |
iWhisper-Mandarin | ASR | Chinese | 15 | Yes | 80 |
wTIMIT | ASR | English | 26 | Yes | 48 |
AISHELL6-Whisper | AVSR | Chinese | 30 | Yes | 167 |
Spectrogram Comparison

Baseline Model & Experimental Results
We implemented Whisper-Flamingo as the baseline of our audio-visual whisper speech recognition systems. This model incorporates visual features from AV-HuBERT into the OpenAI Whisper speech recognition and translation framework through gated cross-attention.
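Gated cross-attention in Flamingo-style models lets decoder states attend to visual features, with a tanh gate (initialized at zero) scaling the attended output before the residual add, so training starts from the pretrained audio-only behaviour. The single-head numpy sketch below is illustrative only; dimensions and names are assumptions, not the actual Whisper-Flamingo implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(x, visual, Wq, Wk, Wv, gate):
    """Single-head gated cross-attention (illustrative sketch).

    x:      (T, d) decoder hidden states (queries)
    visual: (S, d) visual features from the video encoder (keys/values)
    gate:   scalar; initialized to 0 so the layer is an identity at init.
    """
    q, k, v = x @ Wq, visual @ Wk, visual @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return x + np.tanh(gate) * attn  # gate = 0 -> output equals x

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))
visual = rng.normal(size=(12, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = gated_cross_attention(x, visual, Wq, Wk, Wv, gate=0.0)
print(np.allclose(out, x))  # True: the zero gate leaves x unchanged
```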
We employ a parallel training strategy to align whisper speech embeddings with those of the corresponding normal speech, and apply an additional projection layer on top of the whisper speech features to further improve the recognition accuracy of whisper speech.
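The projection layer and parallel alignment can be sketched as a linear map applied to the whisper-speech encoder features, trained so that they move toward the features of the parallel normal-speech utterance. An MSE alignment term and time-aligned features are assumptions of this sketch; the baseline's exact loss and alignment may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 10, 16  # frames, feature dim (illustrative values)

# Encoder features for the same utterance read in both speech types,
# assumed aligned frame-by-frame for simplicity.
normal_feats = rng.normal(size=(T, d))
whisper_feats = rng.normal(size=(T, d))

# Projection layer applied only to whisper-speech features,
# initialized near the identity.
W = np.eye(d) + 0.01 * rng.normal(size=(d, d))
b = np.zeros(d)
projected = whisper_feats @ W + b

# Alignment term pulling projected whisper features toward normal ones.
align_loss = np.mean((projected - normal_feats) ** 2)
print(projected.shape, align_loss > 0)
```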
Model Architecture

Recognition Performance on AISHELL6-Whisper
Model | Parallel training | Projection layer | Video | Whisper Speech CER | Normal Speech CER |
---|---|---|---|---|---|
Whisper (Large-V3) | - | - | - | 18.93% | 3.95% |
+Finetune | - | - | - | 6.69% | 1.62% |
+Parallel training | Yes | - | - | 4.53% | 0.98% |
+Projection layer | Yes | Yes | - | 4.34% | 1.14% |
+Video | Yes | - | Yes | 4.21% | 1.08% |
+Video (Proposed) | Yes | Yes | Yes | 4.13% | 1.11% |
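The CER values above are character-level Levenshtein distance divided by the number of reference characters. A minimal implementation for a single utterance pair:

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance / reference length."""
    # Standard dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rc != hc)))   # substitution
        prev = cur
    return prev[-1] / len(ref)

# One character dropped from a 6-character reference -> CER = 1/6.
print(cer("今天天气真好", "今天气真好"))
```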
Download
To obtain the AISHELL6-Whisper dataset, please contact dataset@aishelltech.com from your institutional e-mail address, stating your affiliation and the intended use of the data. Please allow up to 5 business days for your request to be processed.
The dataset includes:
- Audio recordings (48 kHz, WAV format)
- Video recordings (1280×720, 25 fps, MP4 format)
- Text transcripts with timestamps
- Speaker metadata
- Preprocessed lip movement videos
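Once access is granted, locating parallel utterances and checking audio headers is straightforward. The sketch below assumes a hypothetical `<root>/<speaker>/<type>/<utt_id>` layout (the actual release structure may differ) and uses the stdlib `wave` module on an in-memory WAV to show the 48 kHz mono format described above.

```python
import io
import wave

def utt_paths(speaker: str, utt_id: str, root: str = "AISHELL6-Whisper"):
    """Hypothetical file layout; the actual release structure may differ."""
    return {
        "normal_wav": f"{root}/{speaker}/normal/{utt_id}.wav",
        "whisper_wav": f"{root}/{speaker}/whisper/{utt_id}.wav",
        "video": f"{root}/{speaker}/video/{utt_id}.mp4",
    }

# Round-trip a 48 kHz mono WAV in memory to show header handling.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # single-channel microphone
    w.setsampwidth(2)       # 16-bit PCM
    w.setframerate(48000)   # corpus sampling rate
    w.writeframes(b"\x00\x00" * 480)  # 10 ms of silence
buf.seek(0)
with wave.open(buf, "rb") as r:
    rate, nframes = r.getframerate(), r.getnframes()
print(rate, nframes)  # 48000 480

paths = utt_paths("S0120", "U0001")
print(paths["whisper_wav"])
```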
License
The AISHELL6-Whisper dataset is licensed under the CC BY-NC-SA 4.0 license. This means you may share and adapt the dataset for non-commercial purposes, provided you give appropriate attribution and distribute your contributions under the same license. Detailed terms can be found in the LICENSE file.
Demo Samples
Parallel Normal and Whisper Speech Samples
Speaker ID | Text | Normal Speech | Whisper Speech |
---|---|---|---|
S0120 | 今天天气真好,适合出去散步 ("The weather is lovely today, perfect for a walk") | (audio sample) | (audio sample) |
S0187 | 科学研究需要耐心和毅力 ("Scientific research requires patience and perseverance") | (audio sample) | (audio sample) |
S0256 | 人工智能技术正在改变世界 ("Artificial intelligence technology is changing the world") | (audio sample) | (audio sample) |
Audio-Visual Samples
Normal Speech with Video
Speaker: S0120
Whisper Speech with Video
Speaker: S0120