Total Size: 579M

Dataset Overview

Dataset Type

ASR speech corpus

Language

zh-CN, Mandarin Chinese mixed with English phrases

Speech Style

spontaneous conversation

Content

conversations

Audio Parameters

16 kHz, 16 bits, mono

File Format

WAV (PCM), TXT (UTF-8)

Recording Equipment

mobile

Recording Environment

indoor environment
Open Source
ASR Corpus
10 hours

Testset of Chinese-English Code-Mixing Conversational Speech Corpus

This is a conversational dataset of Mandarin Chinese mixed with English words and phrases.

The total duration is about 22.54 hours, and the effective duration is about 9.57 hours.

In the test set, the total duration is about 10 hours.

The full corpus is split into two parts: a DEV set and a test set. This release contains only the test set.

The dataset contains audio files (.wav) with segments and transcriptions that were finely annotated by hand.
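
Since the listing states the audio is 16 kHz, 16-bit, mono PCM WAV, a quick sanity check after download can catch files that were resampled or converted along the way. The following is a minimal sketch using only Python's standard `wave` module; the function name `inspect_wav` and the example file name are hypothetical and not part of the dataset.

```python
import wave

# Expected parameters, taken from the "Audio Parameters" field of the listing.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1  # mono


def inspect_wav(path):
    """Return (matches_spec, duration_seconds) for a PCM WAV file.

    Hypothetical helper: it only compares the WAV header fields
    against the parameters stated in the listing.
    """
    with wave.open(path, "rb") as wav:
        matches = (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
        # Duration is the number of frames divided by frames per second.
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration
```

For example, `inspect_wav("recording_01.wav")` (an assumed file name) would return a boolean and the clip length in seconds; summing the durations over all files gives a rough cross-check against the stated totals.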

The participants who recorded the audio were between 21 and 25 years old.

A total of 10 participants took part in the collection: 4 male and 6 female.

In total, 42 audio recordings were collected, each with a corresponding annotated transcript.

After our review and evaluation, the word accuracy rate of the transcriptions in this dataset is above 99%.

If you use this dataset for training, please review the terms of our usage agreement.
