Total Size: 435 GB


Overview

Dataset Type: ASR Speech Corpus
Language: English
Speech Type: Spontaneous conversation and reading
Content: Various topics
Audio Parameters: 16 kHz, 16-bit, mono
File Format: OPUS
Recording Devices: Various equipment
Recording Environment: Various environments
License:


GigaSpeech

GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

GigaSpeech, prepared and released by SpeechColab, is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 33,000 hours of transcribed audio were first collected from audiobooks, podcasts, and YouTube, covering both read and spontaneous speaking styles and a variety of topics, such as arts, science, and sports. A new forced-alignment and segmentation pipeline creates sentence segments suitable for speech recognition training and filters out segments with low-quality transcriptions.
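As a rough sanity check on the listed total size, the uncompressed PCM footprint of the full collection can be estimated from the stated audio parameters (16 kHz, 16-bit, mono). This is an illustrative back-of-the-envelope calculation, not a figure from the source; the resulting ratio is simply the compression implied by the 435 GB OPUS release:

```python
# Back-of-the-envelope sketch (not from the source text):
# raw PCM size of the ~33,000-hour collection at 16 kHz / 16-bit / mono,
# compared with the listed 435 GB OPUS archive.
HOURS = 33_000
BYTES_PER_SECOND = 16_000 * 2 * 1        # sample rate * 2 bytes/sample * 1 channel

raw_bytes = HOURS * 3600 * BYTES_PER_SECOND
raw_tb = raw_bytes / 1e12                # ~3.8 TB uncompressed
compression_ratio = raw_bytes / 435e9    # implied OPUS compression, roughly 8-9x
```

The roughly 8-9x reduction is consistent with speech-oriented lossy compression at this sample rate.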

For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
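The five training subsets above can be sketched as a simple lookup. Note that only "XL" is named in the text; the other subset names (`XS`, `S`, `M`, `L`) follow the naming convention in the GigaSpeech repository and are assumptions here:

```python
# Training subsets described above, keyed by (assumed) subset name.
# Only "XL" is explicitly named in the text; the rest follow the
# SpeechColab/GigaSpeech repo convention.
SUBSET_HOURS = {"XS": 10, "S": 250, "M": 1000, "L": 2500, "XL": 10000}

def pick_subset(target_hours: float) -> str:
    """Return the smallest subset providing at least target_hours of audio."""
    candidates = [(h, name) for name, h in SUBSET_HOURS.items() if h >= target_hours]
    if not candidates:
        raise ValueError(f"no subset has {target_hours} hours of audio")
    return min(candidates)[1]
```

For example, a 300-hour training budget would point to the 1000-hour subset, the smallest one that covers it.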


For details of how we created the dataset, please refer to our Interspeech paper: "GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio". A preprint is available on arXiv (https://arxiv.org/abs/2106.06909).

Please also check out our Github repository for applications and leaderboard (https://github.com/SpeechColab/GigaSpeech).
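GigaSpeech transcripts encode punctuation and non-speech events as uppercase tags (e.g. `<COMMA>`, `<NOISE>`), per the conventions documented in the GitHub repository; the exact tag set below is an assumption based on that convention, not stated in this page. A minimal sketch of turning a tagged transcript into readable text:

```python
import re

# Punctuation tags map to symbols; garbage/non-speech tags are dropped.
# Tag names follow the SpeechColab/GigaSpeech convention (assumed here).
PUNCT = {"<COMMA>": ",", "<PERIOD>": ".",
         "<QUESTIONMARK>": "?", "<EXCLAMATIONPOINT>": "!"}
GARBAGE = {"<SIL>", "<NOISE>", "<MUSIC>", "<OTHER>"}

def normalize(transcript: str) -> str:
    words = []
    for tok in transcript.split():
        if tok in GARBAGE:
            continue                       # drop non-speech event tags
        words.append(PUNCT.get(tok, tok))  # map punctuation tags, keep words
    text = " ".join(words)
    # Attach punctuation to the preceding word: "WORD ," -> "WORD,"
    return re.sub(r" ([,.?!])", r"\1", text)
```

For WER scoring one would typically strip the punctuation entirely rather than render it; this sketch only shows the tag handling.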
