AISHELL6-Whisper: A Chinese Mandarin Audio-Visual Whisper Speech Dataset

A Large-Scale Open-Source Audio-Visual Whisper Speech Dataset with Speech Recognition Baselines

Abstract

Whisper speech recognition is crucial not only for ensuring privacy in sensitive communications, but also for providing a communication bridge for patients under vocal restraint and for enabling discreet interaction in noise-sensitive environments.

The development of Chinese Mandarin audio-visual whispered speech recognition is hindered by the lack of large-scale datasets. We present AISHELL6-Whisper, a large-scale open-source audio-visual whisper speech dataset containing 30 hours each of whisper and parallel normal speech, with synchronized frontal facial videos.

Moreover, we propose an audio-visual speech recognition (AVSR) baseline based on the Whisper-Flamingo framework, which integrates a parallel training strategy to align embeddings across speech types and employs a projection layer to adapt to whisper speech's spectral properties. The model achieves a Character Error Rate (CER) of 4.13% for whispered speech and 1.11% for normal speech on the test set of our dataset.

Dataset Overview

The AISHELL6-Whisper corpus was collected in a controlled studio environment and contains parallel recordings of whispered and normal speech. It comprises 167 speakers, each reading approximately 20 minutes of poetry texts, with no content overlap between speakers.

Among them, 121 participants were recorded with both a high-fidelity microphone and a synchronized RGBD camera, while the remaining 46 participants were recorded with audio only. Audio was captured with a single-channel high-fidelity microphone (Neumann U87) at a 48 kHz sampling rate, with a background noise level below 20 dB.

The microphone was positioned below the speaker's chin to ensure sound quality without obscuring the speaker's lip movements. Video recordings were captured using an RGBD camera placed one meter directly in front of the speaker, at a resolution of 1280×720 and 25 fps.

Dataset Statistics

Data Split Statistics

Set    Video  Num of Spk  Type     Time (hrs)  Utterances
Train  Yes    82          Normal   14.8339     10012
Train  Yes    82          Whisper  14.8359     9971
Train  No     28          Normal   4.8887      3335
Train  No     28          Whisper  5.0526      3332
Valid  Yes    19          Normal   3.3252      2331
Valid  Yes    19          Whisper  3.3983      2304
Valid  No     10          Normal   1.7310      1190
Valid  No     10          Whisper  1.8038      1191
Test   Yes    20          Normal   3.4126      2423
Test   Yes    20          Whisper  3.3601      2351
Test   No     8           Normal   1.3477      940
Test   No     8           Whisper  1.3021      942

Comparison with Existing Whisper Speech Datasets

Dataset            Type  Language  Time (hrs)  Parallel  Num of Spk
UTVE-I             ASR   English   <1          Yes       12
AVWD               AVSR  Chinese   <2.44       Yes       10
CHAINs             ASR   English   <3          Yes       36
AV-Whisper         AVSR  English   <10         Yes       11
iWhisper-Mandarin  ASR   Chinese   15          Yes       80
wTIMIT             ASR   English   26          Yes       48
AISHELL6-Whisper   AVSR  Chinese   30          Yes       167

Spectrogram Comparison

[Figure] Comparison of spectrograms for normal speech (left) and whisper speech (right)

Baseline Model & Experimental Results

We implemented Whisper-Flamingo as the baseline for our audio-visual whisper speech recognition system. This model incorporates visual features from AV-HuBERT into the OpenAI Whisper speech recognition and translation framework through gated cross-attention.

We employ a parallel training strategy to align whisper speech embeddings with normal speech embeddings, and apply an additional projection layer on top of the whisper speech features to further improve the recognition accuracy of whisper speech.
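The two ideas above can be sketched as follows. This is a simplified NumPy illustration with made-up dimensions and random stand-in embeddings, not the actual Whisper-Flamingo implementation: the projection layer is shown as a plain linear map on the whisper branch, and alignment is shown as a mean-squared error between the two branches' embeddings for a parallel utterance pair.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: T frames, D-dimensional encoder embeddings (hypothetical).
T, D = 50, 64

# Stand-ins for encoder outputs of one parallel utterance pair
# (same text read in whispered vs. normal phonation).
whisper_emb = rng.standard_normal((T, D))
normal_emb = rng.standard_normal((T, D))

# Projection layer: a learnable linear map applied only to the whisper
# branch, compensating for whisper speech's different spectral properties.
W = rng.standard_normal((D, D)) * 0.01
b = np.zeros(D)

def project(x, W, b):
    """Apply the whisper-branch projection layer."""
    return x @ W + b

def alignment_loss(w_emb, n_emb):
    """Mean-squared error between the two branches' embeddings,
    pushing whisper features toward their parallel normal-speech
    counterparts during parallel training."""
    return float(np.mean((w_emb - n_emb) ** 2))

loss = alignment_loss(project(whisper_emb, W, b), normal_emb)
```

In the real model this loss would be combined with the standard recognition objective and the projection weights learned jointly; the sketch only shows how the parallel pair enters the alignment term.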

Model Architecture

[Figure] Model architecture diagram

Recognition Performance on AISHELL6-Whisper

Model                Parallel training  Projection layer  Video  Whisper Speech CER  Normal Speech CER
Whisper (Large-V3)   -                  -                 -      18.93%              3.95%
+Finetune            -                  -                 -      6.69%               1.62%
+Parallel training   Yes                -                 -      4.53%               0.98%
+Projection layer    Yes                Yes               -      4.34%               1.14%
+Video               Yes                -                 Yes    4.21%               1.08%
+Video (Proposed)    Yes                Yes               Yes    4.13%               1.11%

Download

Note: Our released dataset contains audio-visual data with synchronized transcripts. The dataset is available for research purposes only.

To obtain the AISHELL6-Whisper dataset, please contact dataset@aishelltech.com from your institutional e-mail address, including your affiliation and intended use of the data. Please allow up to 5 business days for your request to be processed.

The dataset includes:

  • Audio recordings (48 kHz, WAV format)
  • Video recordings (1280×720, 25 fps, MP4 format)
  • Text transcripts with timestamps
  • Speaker metadata
  • Preprocessed lip movement videos
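As an illustration of working with the parallel recordings, the sketch below pairs normal and whisper utterances by utterance ID. The directory layout in the example is a hypothetical assumption for illustration only; the dataset's actual structure may differ, so adapt the path parsing accordingly.

```python
from pathlib import PurePosixPath

def pair_parallel_utterances(wav_paths):
    """Group wav files into {utterance_id: {style: path}} entries.

    Assumes a hypothetical layout such as
        S0120/normal/S0120_0001.wav
        S0120/whisper/S0120_0001.wav
    where the parent directory names the speaking style.
    """
    pairs = {}
    for p in map(PurePosixPath, wav_paths):
        utt_id = p.stem        # e.g. "S0120_0001"
        style = p.parent.name  # "normal" or "whisper" in this layout
        pairs.setdefault(utt_id, {})[style] = str(p)
    # Keep only utterances recorded in both speaking styles.
    return {u: d for u, d in pairs.items()
            if "normal" in d and "whisper" in d}

paths = [
    "S0120/normal/S0120_0001.wav",
    "S0120/whisper/S0120_0001.wav",
    "S0120/normal/S0120_0002.wav",  # no whisper counterpart; dropped
]
paired = pair_parallel_utterances(paths)
```

Filtering to complete pairs mirrors how the parallel training strategy consumes the data: each training example needs both renditions of the same text.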

License

The AISHELL6-Whisper dataset is licensed under the CC BY-NC-SA 4.0 license. This means you may share and adapt the dataset for non-commercial purposes, provided you give appropriate attribution and distribute your contributions under the same license. Detailed terms can be found in the LICENSE file.

Demo Samples

Parallel Normal and Whisper Speech Samples

Speaker ID  Text (normal and whisper audio for each sample)
S0120       今天天气真好,适合出去散步 ("The weather is lovely today, perfect for a walk")
S0187       科学研究需要耐心和毅力 ("Scientific research requires patience and perseverance")
S0256       人工智能技术正在改变世界 ("Artificial intelligence technology is changing the world")

Audio-Visual Samples

Normal Speech with Video

Speaker: S0120

Whisper Speech with Video

Speaker: S0120