MagicData

Total Size: 579M

Dataset Overview

Dataset Type: ASR speech corpus
Language: zh-CN, Mandarin Chinese mixed with English phrases
Speech Style: spontaneous conversation
Content: conversations
Audio Parameters: 16 kHz, 16 bits, mono
File Format: WAV (PCM), TXT (UTF-8)
Recording Equipment: mobile
Recording Environment: indoor environment
Open Source
ASR dataset
10 hours

ASR-SECoMiCSC: A Chinese-English Code-Mixing Conversational Speech Corpus

Here we present a conversational dataset in Mandarin Chinese, code-mixed with English words and phrases.

The total duration of the original dataset is about 22.54 hours, with an effective duration of about 9.57 hours. We split the dataset into two parts: the DEV set and the TEST set.

Only the TEST part is released here for open access; its total duration is about 10 hours. The dataset contains audio files (.wav) with segments and manually annotated transcriptions.
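After downloading, the audio files can be checked against the audio parameters stated for this corpus (16 kHz, 16 bits, mono, PCM WAV). The following is a minimal sketch using only Python's standard `wave` module. The helper names `check_wav` and `summarize` and the directory argument are hypothetical, and the layout of the archive is an assumption: the sketch simply scans a directory tree for `.wav` files.

```python
import wave
from pathlib import Path

# Parameters as stated for this corpus: 16 kHz, 16 bits, mono.
EXPECTED_RATE_HZ = 16_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits
EXPECTED_CHANNELS = 1            # mono


def check_wav(path):
    """Return (matches_spec, duration_seconds) for one PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        matches = (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration


def summarize(root):
    """Scan a directory tree (hypothetical layout) for .wav files.

    Returns (file_count, mismatched_paths, total_hours).
    """
    mismatched = []
    total_seconds = 0.0
    count = 0
    for path in sorted(Path(root).rglob("*.wav")):
        matches, duration = check_wav(path)
        count += 1
        total_seconds += duration
        if not matches:
            mismatched.append(path)
    return count, mismatched, total_seconds / 3600
```

Note that `summarize` reports raw file duration, which may differ from the effective (segmented speech) duration quoted for the corpus.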

The audio data were collected from 10 participants (4 males and 6 females) aged 21 to 25. In total, 42 audio recordings were collected, corresponding to 42 annotated transcripts.

In our testing and evaluation, the word correct rate of this dataset is above 99%. Before accessing this dataset, please note our usage agreement.

Recommended Applications: ASR, chatbots, TTS, low-resource research
