MagicData
Total Size: 579M

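Overview

Dataset type: Speech recognition (ASR) audio dataset
Language: zh-CN, Mandarin Chinese mixed with English phrases
Speech type: Free conversation
Content: Conversation
Audio parameters: 16 kHz, 16-bit, single channel
File format: WAV (PCM), TXT (UTF-8)
Recording device: Mobile phone
Recording environment: Indoor

Open-source dataset
ASR dataset
10 hours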

ASR-SECoMiCSC: A Chinese-English Code-Mixing Conversational Speech Corpus

Here we present a conversational speech dataset in Mandarin Chinese, code-mixed with English words and phrases.

The total duration of the original dataset is about 22.54 hours, with an effective duration of about 9.57 hours. We split the dataset into two parts: the DEV set and the TEST set.

Only the TEST part is released here for open access; its total duration is about 10 hours. The dataset contains audio files (.wav) with segments and manually annotated transcriptions.

The audio data was collected from 10 participants (4 male and 6 female) aged 21 to 25. In total, 42 audio recordings were collected, corresponding to 42 annotated transcripts.

In our testing and evaluation, the word correct rate of this dataset is above 99%. Before accessing this dataset, please note our usage agreement.

Recommended Applications: ASR, chatbot, TTS, low-resource research
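
The overview lists the audio as 16 kHz, 16-bit, single-channel WAV (PCM). As a minimal sketch of how a downloaded file could be checked against those parameters, the following uses only Python's standard `wave` module. The function names `describe_wav` and `matches_overview` are illustrative and not part of any MagicData tooling, and the dataset's actual directory layout and file names are not assumed here.

```python
import wave

# Expected parameters, taken from the dataset overview:
# 16 kHz sample rate, 16-bit samples (2 bytes), single channel.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2
EXPECTED_CHANNELS = 1


def describe_wav(path):
    """Read the header of a PCM WAV file and return its basic parameters."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            # Duration in seconds, derived from frame count and sample rate.
            "duration_s": frames / rate if rate else 0.0,
        }


def matches_overview(info):
    """Return True if the parameters match those listed in the overview."""
    return (
        info["rate_hz"] == EXPECTED_RATE_HZ
        and info["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH_BYTES
        and info["channels"] == EXPECTED_CHANNELS
    )
```

Calling `matches_overview(describe_wav(path))` on each `.wav` file, with `path` pointing at wherever the archive was unpacked, flags any file whose header differs from the listed format. Summing `duration_s` over all files gives a rough check against the stated duration of about 10 hours.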
