ASR-GigaSpeech-MulDomESC: A Multi-domain English Speech Corpus

GigaSpeech, prepared and released by SpeechColab, is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 33,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics such as arts, science and sports. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription.

For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage; for all our other, smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
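The word error rate cap can be illustrated with a minimal sketch. It assumes that each candidate segment has a reference transcript and a second transcript to compare it against (for example, a decoder output); how GigaSpeech obtains that pair is described in the paper, not here. The function names and the `WER_CAP` table are hypothetical and are not part of any GigaSpeech tooling; only the 4% and 0% thresholds come from the text above.

```python
from typing import Dict

# Caps from the description: 4% for the 10000h (XL) subset, 0% for the others.
# The dictionary layout itself is an illustrative assumption.
WER_CAP: Dict[str, float] = {
    "10h": 0.0,
    "250h": 0.0,
    "1000h": 0.0,
    "2500h": 0.0,
    "10000h": 0.04,
}


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: empty vs. empty is a perfect match.
        return 0.0 if not hyp else 1.0

    # Standard dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def keep_segment(reference: str, hypothesis: str, subset: str) -> bool:
    """Return True if the segment's WER does not exceed the subset's cap."""
    return word_error_rate(reference, hypothesis) <= WER_CAP[subset]
```

Under these assumptions, a segment with a single word mismatch would be rejected for every subset capped at 0%, and could still be accepted for the 10000h subset if the segment is long enough for that one error to stay within 4%.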

For details of how we created the dataset, please refer to our Interspeech paper: "GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio". A preprint is available on arXiv (https://arxiv.org/abs/2106.06909).

Please also check out our GitHub repository for applications and the leaderboard (https://github.com/SpeechColab/GigaSpeech).

Total Size: 435GB
