MagicData
Total Size: 43.2MB

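Overview

Dataset type: N/A
Language: Chinese dialect
Speech type: Read speech
Content: N/A
Audio parameters: 48 kHz, 16 bits
File format: WAV (PCM)
Recording device: Microphone
Recording environment: Quiet indoor environment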
Tags: Open-source dataset, TTS dataset
Total duration: 0.2 hours

MagicData-Dialect-Northeastern Chinese-TTS-Lite

Dataset Introduction

MagicData-Dialect-Northeastern Chinese-TTS-Lite is an open-source Northeastern Chinese TTS subset of the MagicData-Dialect-TTS-Lite collection released by Magic Data. It focuses on authentic Northeastern Chinese speech and is designed for research scenarios such as dialect speech synthesis, acoustic analysis, and model evaluation.

The dataset contains approximately 10 minutes of speech recorded by one native speaker of Northeastern Chinese from Siping. The speaker was born and raised in the region, and the recordings preserve the authentic local accent, intonation, and habits of expression.

Overview

Dialect | City | Code | Duration | Sentences | Speaker
Northeastern Chinese | Siping | NED | 10 minutes | 75 sentences | 1 female, 30 years old

Dataset Features

1. Native speaker with authentic accent

  • The speaker was born and raised in the local region until adulthood.
  • The speaker’s family and main social environment use the local dialect.

2. Daily-life content coverage

  • Weather, food, family conversations, numbers, time, and dates
  • A small amount of emotional expression
  • No complex technical terms, news reading, or poetry recitation, in order to avoid style deviation

3. Clean recording environment

  • Quiet indoor environment
  • 48 kHz / 16-bit / mono WAV
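The audio format stated above can be checked programmatically before the files are fed to a model. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_format` and the constant names are illustrative and not part of the dataset.

```python
import wave

# Expected format taken from the dataset description: 48 kHz, 16-bit, mono.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits
EXPECTED_CHANNELS = 1


def check_wav_format(path):
    """Return a list of mismatches between a WAV file and the documented format.

    An empty list means the file matches on sample rate, bit depth, and channels.
    """
    problems = []
    with wave.open(str(path), "rb") as wav_file:
        if wav_file.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate {wav_file.getframerate()} Hz")
        if wav_file.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width {wav_file.getsampwidth() * 8} bits")
        if wav_file.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav_file.getnchannels()} channels")
    return problems
```

Calling `check_wav_format` on each file and collecting the non-empty results gives a quick list of files that would need resampling or conversion.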

4. Moderate sentence length, suitable for TTS modeling

  • Each sentence lasts about 5–20 seconds, with an average of about 10 seconds
  • Natural punctuation-based segmentation, with no forced truncation
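To compare a local copy against the sentence-length figures above, clip durations can be measured and summarized. This is a rough sketch, assuming standard PCM WAV files readable by Python's `wave` module; the helper names and the default 5–20 second bounds (taken from the description) are illustrative.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed as frame count / sample rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def summarize_durations(durations, low=5.0, high=20.0):
    """Summarize clip durations (seconds) and list those outside [low, high]."""
    if not durations:
        raise ValueError("durations must not be empty")
    total = sum(durations)
    return {
        "count": len(durations),
        "total_seconds": total,
        "mean_seconds": total / len(durations),
        "outside_range": [d for d in durations if not low <= d <= high],
    }
```

A summary whose mean is far from 10 seconds, or whose `outside_range` list is long, would suggest the local copy differs from the described release.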

Annotation Guidelines

  • Chinese character transcription: Standard Chinese characters are used, while dialect-specific words are preserved and restored, such as “咋整” in Northeastern Chinese.
  • Number annotation: Numbers are written in Chinese character form.
  • Standardization rule: The original dialect sentences are preserved and are not forcibly “translated” into Mandarin.

Example:

  • Original Northeastern Chinese sentence: 这事儿咋整啊?
  • Annotated text: 这事儿咋整啊?
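Because the guidelines state that numbers are written in Chinese characters, a transcript should not contain Arabic digits. The sketch below flags any that appear; the function name `find_arabic_digits` is hypothetical, and treating both ASCII and full-width digits as violations is an assumption about how the rule is applied.

```python
import re

# Matches ASCII digits (0-9) and full-width digits (０-９).
_DIGIT_PATTERN = re.compile(r"[0-9０-９]")


def find_arabic_digits(text):
    """Return every Arabic digit found in a transcript line, in order."""
    return _DIGIT_PATTERN.findall(text)
```

An empty result for every transcript line is consistent with the number annotation rule; a non-empty result marks a line worth inspecting by hand.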

Open-source File Structure

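dialect-tts-lite/
├── 东北                                # Northeast (Northeastern Chinese)
│   ├── ProsodyLabeling
│   │   └── txt
│   ├── wav
│   │   └── wav/                        # 75 audio files
│   └── 2026自研短语音转写规范——中文      # 2026 in-house short-speech transcription specification (Chinese)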
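For TTS experiments, each audio file usually needs to be matched with its transcript. The sketch below pairs files by base name. It rests on several assumptions that the listing above does not confirm: that `txt` is a directory of `.txt` files, that audio files end in `.wav`, and that a recording and its transcript share the same file stem. The function name `pair_audio_with_text` is illustrative.

```python
from pathlib import Path


def pair_audio_with_text(root):
    """Pair WAV files with transcripts by file stem (assumed naming scheme).

    Returns (pairs, unmatched), where pairs is a list of (wav_path, txt_path)
    and unmatched lists WAV files with no transcript of the same stem.
    """
    dialect_dir = Path(root) / "东北"
    wav_dir = dialect_dir / "wav" / "wav"
    txt_dir = dialect_dir / "ProsodyLabeling" / "txt"  # assumed to be a directory

    transcripts = {p.stem: p for p in txt_dir.glob("*.txt")}
    pairs, unmatched = [], []
    for wav_path in sorted(wav_dir.glob("*.wav")):
        txt_path = transcripts.get(wav_path.stem)
        if txt_path is None:
            unmatched.append(wav_path)
        else:
            pairs.append((wav_path, txt_path))
    return pairs, unmatched
```

If the assumptions hold, calling `pair_audio_with_text("dialect-tts-lite")` should yield 75 pairs and an empty `unmatched` list; any other outcome suggests the layout or naming differs from what is assumed here.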

Usage Recommendations

Suitable for:

  • Zero-shot / few-shot baseline testing for multi-dialect TTS models
  • Acoustic analysis of dialectal phonetic features
  • Comparative experiments in academic research

Not suitable for:

  • Directly training production-level dialect TTS products, as the dataset is licensed for non-commercial use only and is limited in scale
  • Evaluating extreme scenarios, such as noisy environments, far-field recording, or children’s voices

If you are interested in a larger-scale commercial version, please contact us.

Open-source License

This dataset is released under the CC BY-NC-ND 4.0 license for non-commercial use only. It is suitable for academic research, personal development, and model evaluation.

📧 For the full commercial version, please contact: business@magicdatatech.com
