Total Size: 715M

Dataset Overview

Dataset Type: ASR speech corpus

Language: zh-CN, Mandarin Chinese mixed with English phrases

Speech Style: spontaneous conversation

Content: conversations

Audio Parameters: 16 kHz, 16-bit, mono

File Format: WAV (PCM), TXT (UTF-8)

Recording Equipment: mobile

Recording Environment: indoor environment

License: MAGIC DATA OPEN-SOURCE LICENSE

Open Source

ASR Corpus

12 hours

ASR-DevCECoMiCSC: A DEV Set of Chinese-English Code-Mixing Conversational Speech Corpus

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Here we present a conversational dataset in Mandarin Chinese, code-mixed with English words and phrases.

The total duration of the original dataset is about 22.54 hours, with an effective duration of about 9.57 hours. We split the dataset into two parts: the DEV set and the test set.

Only the DEV set, with a total duration of about 12 hours, is released here for open access. The dataset contains audio files (.wav) with segments and manually annotated transcriptions.
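As a quick sanity check after downloading, the audio parameters listed in the overview (16 kHz, 16-bit, mono, PCM WAV) can be verified per file. The sketch below uses only Python's standard `wave` module; the function name `check_wav` and the constant names are illustrative assumptions and are not part of the dataset or its tooling.

```python
import wave

# Expected values taken from the "Audio Parameters" entry of the overview.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1            # mono


def check_wav(path):
    """Return (matches_spec, duration_seconds) for one PCM WAV file.

    Hypothetical helper: it only reads the WAV header fields and does
    not inspect the audio content itself.
    """
    with wave.open(path, "rb") as wav:
        matches = (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
        # Duration is the frame count divided by the sampling rate.
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration
```

Calling `check_wav` on each .wav file and summing the returned durations gives a rough total to compare against the stated 12 hours; the actual directory layout and file names of the release are not described here, so any path passed in is up to the user.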

The audio data was collected from 10 participants (4 male and 6 female) aged 21 to 25. In total, 42 audio recordings were collected, corresponding to 42 annotated texts.

In our testing and evaluation, the word correct rate of this dataset is above 99%.

For any access to this dataset, please note our usage agreement.

Recommended Applications: ASR, chatbot, TTS, low-resource research


