MagicData

Total Size: 14.4GB

Dataset Overview

Dataset Type: ASR speech corpus
Language: Mandarin Chinese (China)
Speech Style: Spontaneous conversations
Content: Automotive Virtual Assistant, Conversational Speech
Audio Parameters: 16 kHz, 16-bit
File Format: WAV (PCM), TXT (UTF-8)
Recording Equipment: Mobile
Recording Environment: Indoor

Open Source, ASR Corpus, 180 hours

ASR-RAMC-BigCCSC: A Chinese Conversational Speech Corpus

180 hours of transcribed Mandarin Chinese conversational speech

This dataset consists of 180 hours of transcribed Mandarin Chinese spontaneous conversational speech contributed by 663 speakers.

Description

Open-Source MagicData-RAMC:

A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset
180 hours of Mandarin Chinese dialogue, split into 150, 10, and 20 hours for the training, development, and test sets, respectively.

Collection setup
The recordings were made in small rooms under 20 m² in area, with a reverberation time (RT60) of less than 0.4 seconds. The environments were relatively quiet during recording, with an ambient noise level below 40 dB(A).

The audio was recorded by Magic Data Technology Co., Ltd. on mainstream smartphones, with an Android-to-iOS ratio of roughly 1:1. All recording devices operated at 16 kHz, 16-bit to guarantee high recording quality.
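As a quick sanity check on the stated audio parameters, the sketch below reads a WAV header with Python's standard `wave` module and compares it against 16 kHz, 16-bit. The function names `check_wav_format` and `make_silent_wav` are hypothetical helpers introduced for illustration, and the mono test signal is synthetic; nothing here reflects the corpus's actual file layout or channel count.

```python
import io
import wave


def check_wav_format(source, expected_rate=16000, expected_sample_width=2):
    """Return True if the WAV source matches the expected sample rate and
    sample width (2 bytes = 16-bit). `source` may be a path or a file object."""
    with wave.open(source, "rb") as wav:
        return (
            wav.getframerate() == expected_rate
            and wav.getsampwidth() == expected_sample_width
        )


def make_silent_wav(rate, sample_width, n_frames=160):
    """Build a short silent in-memory WAV for demonstration only.
    The single channel is an arbitrary choice for this synthetic example."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * sample_width * n_frames)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(check_wav_format(make_silent_wav(16000, 2)))
    print(check_wav_format(make_silent_wav(8000, 2)))
```

To check a downloaded file, pass its path to `check_wav_format` instead of the in-memory buffer.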

The dataset was transcribed using Annotator, an AI-assisted annotation platform. Hesitations, repetitions, punctuation, non-speech events, and voice activity timestamps are annotated to provide high-quality, informative labels.

The gender and demographic distributions are balanced.

Figure 1 Gender Distribution Chart in MagicData-RAMC
Figure 2 Region Distribution Chart in MagicData-RAMC
Figure 3 Province Distribution Chart in MagicData-RAMC

There is a wide variety of topics.

It contains 351 multi-turn dialogues, each of which is a coherent and compact conversation centered around one theme.

It covers 15 topics, including humanities, entertainment, sports, military, finance, religion, family life, politics, education, digital devices, environment, science, professional development, art and ordinary life.

It is suitable for exploring speech processing techniques in dialog scenarios.

Baseline

Automatic Speech Recognition
We used the ESPnet2 [6] toolkit to train a Conformer [7] model. The training data comprises 755 hours of MagicData-READ and 150 hours of MagicData-RAMC.
The model achieved character error rates of 16.5% and 19.1% on the development set and test set, respectively.
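For readers unfamiliar with the metric, character error rate is conventionally the character-level edit distance between the reference and the hypothesis, divided by the number of reference characters. The sketch below is a minimal illustration of that definition, not the ESPnet2 scoring pipeline; the function names are hypothetical, and stripping spaces before scoring is an assumption made here for simplicity.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters.
    Spaces are removed first (an assumption for this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One of six reference characters is missing from the hypothesis.
    print(character_error_rate("今天天气很好", "今天天气好"))
```

Corpus-level CER is normally computed by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.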


Keyword Search Task
We searched for the 200 keywords provided by MagicData-RAMC using the Conformer model and a dynamic time alignment algorithm [8].
The precision and recall rates are 86.98% and 89.57% on the development set, and 85.87% and 88.79% on the test set.
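Precision and recall for keyword search follow the usual definitions: precision is the fraction of detected keyword occurrences that are correct, and recall is the fraction of true occurrences that were detected. The sketch below computes both from hit counts; the function name and the example counts are illustrative and are not taken from the baseline.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from detection counts.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    """
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    if detected == 0 or actual == 0:
        raise ValueError("need at least one detection and one true occurrence")
    return true_positives / detected, true_positives / actual


if __name__ == "__main__":
    # Illustrative counts only: 8 correct hits, 2 false alarms, 2 misses.
    precision, recall = precision_recall(8, 2, 2)
    print(f"precision={precision:.2%} recall={recall:.2%}")
```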


Speaker Diarization Task
We used the Kaldi [9] toolkit to build a speaker diarization system consisting of speaker activity detection, a speaker embedding extractor, and Bayesian HMM clustering [10]. The timestamps are provided by MagicData-RAMC.
We achieved diarization error rates of 5.57% and 7.96% on the development set and test set, respectively.
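Diarization error rate is conventionally the sum of false-alarm, missed-speech, and speaker-confusion durations divided by the total reference speech duration. The sketch below encodes that formula; the function name and example durations are hypothetical, and scoring details such as collars or overlap handling, which this page does not specify, are left out.

```python
def diarization_error_rate(false_alarm, missed_speech, speaker_confusion,
                           total_speech):
    """DER = (false alarm + missed speech + speaker confusion) / total speech.
    All arguments are durations in the same unit (e.g. seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed_speech + speaker_confusion) / total_speech


if __name__ == "__main__":
    # Illustrative durations in seconds, not results from the baseline.
    der = diarization_error_rate(1.0, 2.0, 3.0, 100.0)
    print(f"DER={der:.2%}")
```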
