AISHELL6-Whisper: Chinese Mandarin Audio-Visual Whisper Speech Dataset

A Large-Scale Open-Source Audio-Visual Whisper Speech Dataset

30 h Parallel Whisper/Normal Speech · 167 Speakers · RGB Video Sync

Dataset Overview

30 h Normal/Whisper

Parallel voiced speech to aid transfer and multi-task learning.

167 Speakers

Diverse Mandarin accents with balanced gender distribution.

RGB Video

Frontal facial capture synchronized at 25 fps for lip-reading research.

Large-scale whisper speech aligned with normal voice and video cues

AISHELL6-Whisper is a large-scale open-source audio-visual whisper speech dataset. It contains about 30 hours of whisper speech and about 30 hours of parallel normal speech, with synchronized frontal facial videos.

  • Controlled studio sessions capturing whispered and normal speech in parallel for each participant.
  • 167 speakers delivering 10–20 minutes of unique poetry readings without textual overlap.
  • 121 sessions feature simultaneous high-fidelity audio and RGB video; 46 sessions provide premium audio-only recordings.

🎙️ Recording environment

Collected in a controlled studio environment with parallel whispered and normal speech sessions.

🔊 Audio

Single-channel Neumann U87 microphone, 48 kHz sampling rate, background noise below 20 dB.

📷 Video

RGBD camera placed one meter in front of the speaker, 1280×720 resolution, 25 fps.
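
For readers aligning the two streams, the 48 kHz audio rate and 25 fps video rate stated above give a whole number of audio samples per video frame. The sketch below works through that arithmetic. The helper names are illustrative, and it assumes each audio and video recording starts at the same instant, which this page does not state.

```python
AUDIO_SAMPLE_RATE_HZ = 48_000  # audio sampling rate from the spec above
VIDEO_FPS = 25                 # video frame rate from the spec above

# 48000 / 25 divides evenly, so each video frame spans a whole number of samples.
assert AUDIO_SAMPLE_RATE_HZ % VIDEO_FPS == 0
SAMPLES_PER_FRAME = AUDIO_SAMPLE_RATE_HZ // VIDEO_FPS


def frame_to_sample_range(frame_index: int) -> tuple[int, int]:
    """Return the half-open [start, end) audio sample range for one video frame.

    Assumes audio sample 0 and video frame 0 begin at the same instant
    (an assumption made for illustration only).
    """
    start = frame_index * SAMPLES_PER_FRAME
    return start, start + SAMPLES_PER_FRAME


def sample_to_frame(sample_index: int) -> int:
    """Return the index of the video frame that covers a given audio sample."""
    return sample_index // SAMPLES_PER_FRAME


if __name__ == "__main__":
    print(SAMPLES_PER_FRAME)          # samples of audio per video frame
    print(frame_to_sample_range(25))  # audio span of the frame at t = 1 s
```

Under these assumptions each video frame covers 1920 audio samples (40 ms), which is a convenient hop size when pairing lip-region frames with acoustic features.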

Split statistics for AISHELL6-Whisper (hours and utterances per condition).

Set    Video  Num of Spk  Type     Time (hrs)  Utterances
Train  Yes    82          Normal   14.8339     10012
Train  Yes    82          Whisper  14.8359     9971
Train  No     28          Normal   4.8887      3335
Train  No     28          Whisper  5.0526      3332
Valid  Yes    19          Normal   3.3252      2331
Valid  Yes    19          Whisper  3.3983      2304
Valid  No     10          Normal   1.731       1190
Valid  No     10          Whisper  1.8038      1191
Test   Yes    20          Normal   3.4126      2423
Test   Yes    20          Whisper  3.3601      2351
Test   No     8           Normal   1.3477      940
Test   No     8           Whisper  1.3021      942

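
The per-split figures can be cross-checked against the headline numbers (30 h per condition, 167 speakers, 121 sessions with video and 46 audio-only). The sketch below simply sums the rows of the split statistics table; the row layout and function names are illustrative and not part of the dataset's tooling.

```python
# Rows copied from the split statistics table:
# (set, has_video, num_speakers, speech_type, hours, utterances)
ROWS = [
    ("Train", True, 82, "Normal", 14.8339, 10012),
    ("Train", True, 82, "Whisper", 14.8359, 9971),
    ("Train", False, 28, "Normal", 4.8887, 3335),
    ("Train", False, 28, "Whisper", 5.0526, 3332),
    ("Valid", True, 19, "Normal", 3.3252, 2331),
    ("Valid", True, 19, "Whisper", 3.3983, 2304),
    ("Valid", False, 10, "Normal", 1.731, 1190),
    ("Valid", False, 10, "Whisper", 1.8038, 1191),
    ("Test", True, 20, "Normal", 3.4126, 2423),
    ("Test", True, 20, "Whisper", 3.3601, 2351),
    ("Test", False, 8, "Normal", 1.3477, 940),
    ("Test", False, 8, "Whisper", 1.3021, 942),
]


def total_hours(speech_type):
    """Sum recorded hours across all splits for one speech type."""
    return sum(hours for _, _, _, kind, hours, _ in ROWS if kind == speech_type)


def total_utterances(speech_type):
    """Sum utterance counts across all splits for one speech type."""
    return sum(utts for _, _, _, kind, _, utts in ROWS if kind == speech_type)


def total_speakers(has_video=None):
    """Count speakers, optionally restricted to sessions with or without video.

    Normal and whisper rows describe the same speakers, so only the
    Normal rows are counted to avoid doubling.
    """
    return sum(
        spk
        for _, video, spk, kind, _, _ in ROWS
        if kind == "Normal" and (has_video is None or video == has_video)
    )


if __name__ == "__main__":
    for kind in ("Normal", "Whisper"):
        print(f"{kind}: {total_hours(kind):.4f} h, {total_utterances(kind)} utterances")
    print("Speakers:", total_speakers(), "| with video:", total_speakers(True),
          "| audio-only:", total_speakers(False))
```

The sums come to roughly 29.5 hours of normal speech and 29.8 hours of whisper speech, consistent with the rounded 30 h figure, and the speaker counts add up to 167 (121 with video, 46 audio-only).
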
Comparison with existing whisper speech corpora. ("Parallel" indicates that both whisper and normal speech recordings are available; "Time (hrs)" is the duration of whisper speech.)

Dataset            Modality      Language  Time (hrs)  Parallel  Num of Spk  Sampling Rate                   Bit Depth
UTVE-I             Audio         English   <1          Yes       12          44.1 kHz                        16
UTVE-II            Audio         English   1           No        112         44.1 kHz                        16
AVWD               Audio-visual  Chinese   <2.44       Yes       10          44.1 kHz                        16
CHAINS             Audio         English   <3          Yes       36          44.1 kHz                        16
Whi-spe            Audio         Serbian   <5          Yes       10          22.05 kHz                       16
AV-Whisper         Audio-visual  English   <10         Yes       11          48 kHz                          -
CIAIR              Audio-visual  Japanese  15          Yes       123         16 kHz                          16
iWhisper-Mandarin  Audio         Chinese   15          Yes       80          16 kHz                          -
wSPIRE             Audio         English   18          Yes       88          44.1 kHz (processed at 16 kHz)  16
wTIMIT             Audio         English   26          Yes       48          44.1 kHz                        16
AISHELL6-Whisper   Audio-visual  Chinese   30          Yes       167         48 kHz                          16

Demo Samples

Speaker ID: S0128

微风轻拂树叶沙沙作响带来了一丝丝宁静 (A gentle breeze brushes the leaves, which rustle softly and bring a touch of tranquility.)

Normal Speech
Whisper Speech
Speaker ID: S0075

小船在湖上荡漾宛如一片白云在水面徘徊 (A small boat drifts on the lake like a white cloud lingering over the water.)

Normal Speech
Whisper Speech
Speaker ID: S0218

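公园里儿童的嬉笑声犹如天籁让人感受到生活的美好 (The laughter of children in the park is like heavenly music, reminding people of the beauty of life.)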

Normal Speech
Whisper Speech
Speaker ID: S0140

寂静的山谷传来悠远的钟声带来了心灵的平静 (From the silent valley comes the distant sound of a bell, bringing peace of mind.)

Normal Speech
Whisper Speech

Publications

Preprint Submitted to ICASSP 2026

AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines

Li, Cancan · Su, Fei · Liu, Juan · Bu, Hui · Wan, Yulong · Suo, Hongbin · Li, Ming

    @article{li2025aishell6,
      title={AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines},
      author={Li, Cancan and Su, Fei and Liu, Juan and Bu, Hui and Wan, Yulong and Suo, Hongbin and Li, Ming},
      journal={arXiv preprint arXiv:2509.23833},
      year={2025}
    }

Applications

AISHELL6-Whisper can be applied in the following areas:

Speech Recognition

AISHELL6-Whisper can be used for speech recognition tasks in Chinese.

Download

The AISHELL6-Whisper dataset is licensed under CC BY-NC-SA 4.0. You may share and adapt the dataset for non-commercial purposes, provided you give appropriate attribution and distribute your contributions under the same license. Commercial use is not permitted.

Hugging Face

Download from Hugging Face Datasets

AI-SHELL

Download from the AI-SHELL Platform

Acknowledgements