AISHELL6-Whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset

Dataset Overview

30 h Normal/Whisper

Parallel voiced speech to aid transfer and multi-task learning.

167 Speakers

Diverse Mandarin accents with balanced gender distribution.

RGB Video

Frontal facial capture synchronized at 25 fps for lip-reading research.

Large-scale whisper speech aligned with normal voice and video cues

AISHELL6-Whisper is a large-scale open-source audio-visual whisper speech dataset containing roughly 30 hours each of whisper speech and parallel normal speech, with synchronized frontal facial video for 121 of its 167 speakers.

Controlled studio sessions capturing whispered and normal speech in parallel for each participant.
167 speakers delivering 10–20 minutes of unique poetry readings without textual overlap.
121 sessions feature simultaneous high-fidelity audio and RGB video; 46 sessions provide premium audio-only recordings.

🎙️ Recording environment

Collected in a controlled studio environment with parallel whispered and normal speech sessions.

🔊 Audio

Single-channel Neumann U87 microphone, 48 kHz sampling rate, background noise below 20 dB.
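As a minimal sketch, the audio specification above can be turned into a format check for a downloaded recording. It assumes the audio is distributed as PCM WAV files, which the description does not state, and it takes the 16-bit depth from the comparison table further down. The function name `check_wav_format` is hypothetical.

```python
import wave

# Expected format: mono and 48 kHz come from the audio description,
# 16-bit from the comparison table. PCM WAV packaging is an ASSUMPTION.
EXPECTED_CHANNELS = 1
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM


def check_wav_format(path):
    """Return a list of mismatches; an empty list means the file conforms."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels: {wav.getnchannels()}")
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate: {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width: {wav.getsampwidth()} bytes")
    return problems
```

Running the check over every file before training is a cheap way to catch recordings that were resampled or re-encoded during download or preprocessing.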

📷 Video

RGBD camera placed one meter in front of the speaker, 1280×720 resolution, 25 fps.
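For lip-reading work it helps to map video frames onto audio samples. The sketch below works through the arithmetic for the stated rates (48 kHz audio, 25 fps video). It assumes both streams start at the same instant with zero offset, which the description implies by calling them synchronized but does not spell out. The helper name is hypothetical.

```python
AUDIO_RATE_HZ = 48_000   # from the audio description
VIDEO_FPS = 25           # from the video description

# 48000 / 25 divides evenly, so each frame spans a whole number of samples.
SAMPLES_PER_FRAME = AUDIO_RATE_HZ // VIDEO_FPS


def frame_to_sample_range(frame_index):
    """Return the half-open [start, end) audio sample range for one video frame.

    ASSUMPTION: audio sample 0 and video frame 0 are captured at the same time.
    """
    start = frame_index * SAMPLES_PER_FRAME
    return start, start + SAMPLES_PER_FRAME


print(SAMPLES_PER_FRAME)          # 1920
print(frame_to_sample_range(25))  # (48000, 49920), i.e. the frame at t = 1 s
```

If the audio is downsampled (for example to 16 kHz), the same function applies after changing `AUDIO_RATE_HZ`, as long as the new rate is still divisible by the frame rate.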

Split statistics for AISHELL6-Whisper (hours and utterances per condition).
Set	Video	Num of Spk	Type	Time (hrs)	Utterances
Train	Yes	82	Normal	14.8339	10012
Train	Yes	82	Whisper	14.8359	9971
Train	No	28	Normal	4.8887	3335
Train	No	28	Whisper	5.0526	3332
Valid	Yes	19	Normal	3.3252	2331
Valid	Yes	19	Whisper	3.3983	2304
Valid	No	10	Normal	1.7310	1190
Valid	No	10	Whisper	1.8038	1191
Test	Yes	20	Normal	3.4126	2423
Test	Yes	20	Whisper	3.3601	2351
Test	No	8	Normal	1.3477	940
Test	No	8	Whisper	1.3021	942
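The headline figures (167 speakers, 121 with video and 46 audio-only, about 30 hours per condition) can be cross-checked against the split table. The sketch below copies the table rows by hand and sums them. The tuple layout and the function name `summarize` are illustrative choices, not part of the dataset's tooling.

```python
# (set, has_video, speakers, type, hours, utterances), copied from the split table.
ROWS = [
    ("Train", True, 82, "Normal", 14.8339, 10012),
    ("Train", True, 82, "Whisper", 14.8359, 9971),
    ("Train", False, 28, "Normal", 4.8887, 3335),
    ("Train", False, 28, "Whisper", 5.0526, 3332),
    ("Valid", True, 19, "Normal", 3.3252, 2331),
    ("Valid", True, 19, "Whisper", 3.3983, 2304),
    ("Valid", False, 10, "Normal", 1.7310, 1190),
    ("Valid", False, 10, "Whisper", 1.8038, 1191),
    ("Test", True, 20, "Normal", 3.4126, 2423),
    ("Test", True, 20, "Whisper", 3.3601, 2351),
    ("Test", False, 8, "Normal", 1.3477, 940),
    ("Test", False, 8, "Whisper", 1.3021, 942),
]


def summarize(rows):
    """Aggregate speakers, hours, and utterances across all splits."""
    # Each speaker group appears twice (Normal and Whisper), so count
    # speakers from the Normal rows only to avoid double counting.
    normal = [r for r in rows if r[3] == "Normal"]
    whisper = [r for r in rows if r[3] == "Whisper"]
    return {
        "speakers": sum(r[2] for r in normal),
        "video_speakers": sum(r[2] for r in normal if r[1]),
        "audio_only_speakers": sum(r[2] for r in normal if not r[1]),
        "normal_hours": round(sum(r[4] for r in normal), 4),
        "whisper_hours": round(sum(r[4] for r in whisper), 4),
        "normal_utterances": sum(r[5] for r in normal),
        "whisper_utterances": sum(r[5] for r in whisper),
    }


summary = summarize(ROWS)
print(summary)
```

The sums come to 167 speakers (121 with video, 46 audio-only), 29.5391 hours of normal speech across 20231 utterances, and 29.7528 hours of whisper speech across 20091 utterances, consistent with the figures quoted in the overview.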

Comparison with existing whisper speech corpora. ("Parallel" indicates the availability of both whisper and normal speech recordings; "Time (hrs)" refers to the duration of whisper speech.)
Dataset	Modality	Language	Time (hrs)	Parallel	Num of Spk	Sampling Rate	Bit Depth
UTVE-I	Audio	English	<1	Yes	12	44.1 kHz	16
UTVE-II	Audio	English	1	No	112	44.1 kHz	16
AVWD	Audio-visual	Chinese	<2.44	Yes	10	44.1 kHz	16
CHAINS	Audio	English	<3	Yes	36	44.1 kHz	16
Whi-spe	Audio	Serbian	<5	Yes	10	22.05 kHz	16
AV-Whisper	Audio-visual	English	<10	Yes	11	48 kHz	-
CIAIR	Audio-visual	Japanese	15	Yes	123	16 kHz	16
iWhisper-Mandarin	Audio	Chinese	15	Yes	80	16 kHz	-
wSPIRE	Audio	English	18	Yes	88	44.1 kHz (processed at 16 kHz)	16
wTIMIT	Audio	English	26	Yes	48	44.1 kHz	16
AISHELL6-Whisper	Audio-visual	Chinese	30	Yes	167	48 kHz	16

Demo Samples

Speaker ID: S0128

微风轻拂树叶沙沙作响带来了一丝丝宁静 (A gentle breeze rustles the leaves, bringing a touch of tranquility.)

Normal Speech

Whisper Speech

Speaker ID: S0075

小船在湖上荡漾宛如一片白云在水面徘徊 (A small boat drifts on the lake, like a white cloud lingering over the water.)

Normal Speech

Whisper Speech

Speaker ID: S0218

公园里儿童的嬉笑声犹如天籁让人感受到生活的美好 (The laughter of children in the park is like heavenly music, a reminder of the beauty of life.)

Normal Speech

Whisper Speech

Speaker ID: S0140

寂静的山谷传来悠远的钟声带来了心灵的平静 (A distant bell echoes from the silent valley, bringing peace of mind.)

Normal Speech

Whisper Speech

Publications

Preprint Submitted to ICASSP 2026