New Open-Source Release | Chuan-Yu 12-City Sub-Dialect Speech Dataset: Helping Large Models Understand the Living Voices of Sichuan and Chongqing

Posted 10 hours ago

This dataset helps AI better capture the rich dialect diversity across different cities in Sichuan and Chongqing.

1. Dialects: A Critical Challenge AI Has Yet to Overcome

Sichuan and Chongqing are often discussed as one broad dialect region, but in real life, speech varies noticeably from city to city. The way people speak in Chengdu is not exactly the same as in Chongqing, Zigong, Leshan, Yibin, Luzhou, or other cities across the region. Differences in pronunciation, rhythm, intonation, and local usage all shape how these dialects sound in everyday conversations.

For speech AI, this creates a practical challenge. A model trained only on broadly labeled “Sichuanese” or “Chongqing dialect” data may not capture the finer differences between local city-level varieties. In real-world applications such as voice assistants, in-car voice interaction, smart home devices, and customer service, these differences can affect how well a system understands users and how natural the interaction feels.

That is why we are releasing the Chuan-Yu 12-City Sub-Dialect Speech Dataset. Covering 12 cities across Sichuan and Chongqing, this dataset is designed to provide more granular speech data for building, evaluating, and improving AI systems that need to work with real regional speech.

2. Our Dataset: Built to Help AI Understand Every City

Unlike existing large-scale Chuan-Yu dialect corpora, this dataset focuses on one core goal: enabling large models not only to process Chuan-Yu dialects, but also to accurately capture the subtle pronunciation differences between cities.

2.1 Coverage of 12 Cities with Precise Sub-Dialect Labeling

The dataset covers 12 cities across Sichuan and Chongqing, grouped by dialect area.

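| Dialect Area           | Representative City | Duration (hours) | Sentences |
|------------------------|---------------------|------------------|-----------|
| Chengdu-Chongqing Area | Chengdu             | 5.18             | 1,993     |
| Chengdu-Chongqing Area | Chongqing           | 4.99             | 2,034     |
| Minjiang Area          | Leshan              | 3.52             | 1,308     |
| Minjiang Area          | Yibin               | 3.05             | 1,190     |
| Minjiang Area          | Luzhou              | 3.26             | 1,330     |
| Renfu Sub-Area         | Zigong              | 2.27             | 885       |
| Renfu Sub-Area         | Neijiang            | 2.68             | 889       |
| Yagan Sub-Area         | Ya’an               | 1.69             | 727       |
| Yagan Sub-Area         | Xichang             | 3.28             | 1,222     |
| Other                  | Nanchong            | 1.19             | 476       |
| Other                  | Dazhou              | 1.30             | 478       |
| Other                  | Guang’an            | 1.38             | 536       |

Total: 33 hours / 13,068 sentences / 38 native speakers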
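
As a quick consistency check on the table, the Python sketch below sums the per-city figures. The sentence counts add up to the stated 13,068. The per-city durations add up to about 33.79 hours, which matches the headline figure of 33 hours only if that figure was rounded down. The mean clip length works out to roughly 9.3 seconds. The names `ROWS`, `summarize`, and `hours_by_area` are illustrative and not part of the dataset.

```python
# Per-city figures copied from the table: (dialect area, city, hours, sentences).
ROWS = [
    ("Chengdu-Chongqing Area", "Chengdu", 5.18, 1993),
    ("Chengdu-Chongqing Area", "Chongqing", 4.99, 2034),
    ("Minjiang Area", "Leshan", 3.52, 1308),
    ("Minjiang Area", "Yibin", 3.05, 1190),
    ("Minjiang Area", "Luzhou", 3.26, 1330),
    ("Renfu Sub-Area", "Zigong", 2.27, 885),
    ("Renfu Sub-Area", "Neijiang", 2.68, 889),
    ("Yagan Sub-Area", "Ya'an", 1.69, 727),
    ("Yagan Sub-Area", "Xichang", 3.28, 1222),
    ("Other", "Nanchong", 1.19, 476),
    ("Other", "Dazhou", 1.30, 478),
    ("Other", "Guang'an", 1.38, 536),
]


def summarize(rows):
    """Return (total hours, total sentences, mean seconds per sentence)."""
    total_hours = sum(row[2] for row in rows)
    total_sentences = sum(row[3] for row in rows)
    mean_seconds = total_hours * 3600 / total_sentences
    return total_hours, total_sentences, mean_seconds


def hours_by_area(rows):
    """Sum recording hours per dialect area."""
    totals = {}
    for area, _city, hours, _sentences in rows:
        totals[area] = totals.get(area, 0.0) + hours
    return totals


if __name__ == "__main__":
    hours, sentences, mean_s = summarize(ROWS)
    print(f"{hours:.2f} hours, {sentences} sentences, {mean_s:.1f} s per sentence")
    for area, area_hours in hours_by_area(ROWS).items():
        print(f"{area}: {area_hours:.2f} hours")
```

The per-area totals show how unevenly the hours are spread across dialect areas, which may matter when balancing training batches.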

Each city is organized as an independent subset. All speakers are native speakers who were born and raised locally, ensuring that every audio clip carries the authentic voice of its city.

We do not treat Chuan-Yu dialect as one broad, mixed category. Instead, Chengdu speakers record Chengdu dialect, Chongqing speakers record Chongqing dialect, and so on. Only local speakers can accurately preserve the tones, rhythms, and pronunciation habits of their own city.

2.2 Recorded by Native Speakers, Reviewed by Native Speakers

On the collection side, all speakers are local native speakers with local household registration and long-term residence in the corresponding city. Speakers range in age from 18 to 65 and represent a mix of genders and occupational backgrounds, which helps ensure both diversity and representativeness.

On the QA side, transcription and quality assessment were also completed by local native speakers. We established review teams for all 12 cities, with each team led by dialect specialists familiar with the local accent. Each utterance was carefully checked for transcription accuracy, pronunciation quality, and dialect-specific features.

This “local speakers record, local speakers review” mechanism helps maximize both authenticity and accuracy.

2.3 Annotation System: More Than Transcription

For each speech sample, we provide multi-dimensional annotation information, including:

  • Standard Mandarin text
  • Speaker gender
  • Speaker age group
  • Recording city
  • Sentence duration (each clip runs 5–45 seconds, with an average of around 10 seconds)
  • Natural punctuation and sentence segmentation, without forced cutting

This annotation system makes the dataset suitable not only for speech recognition training, but also for speech synthesis, dialect feature analysis, accent recognition, and other speech AI tasks.
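
One way to picture a single annotated sample is as a small record type with basic sanity checks. The Python sketch below is an assumption about how such a record could be modeled. The field names, the example values for gender and age group, and the `validate` helper are all hypothetical, so consult the released annotation files for the actual schema.

```python
from dataclasses import dataclass

# The 12 cities listed in the coverage table (ASCII apostrophes used here).
CITIES = {
    "Chengdu", "Chongqing", "Leshan", "Yibin", "Luzhou", "Zigong",
    "Neijiang", "Ya'an", "Xichang", "Nanchong", "Dazhou", "Guang'an",
}


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record for one clip; field names are illustrative only."""
    audio_path: str
    mandarin_text: str   # Standard Mandarin transcript
    gender: str          # the value format is an assumption
    age_group: str       # the value format is an assumption
    city: str            # recording city
    duration_s: float    # clip length in seconds


def validate(utt: Utterance) -> list[str]:
    """Return human-readable problems found in a record (empty if none)."""
    problems = []
    if utt.city not in CITIES:
        problems.append(f"unknown city: {utt.city}")
    # The stated clip length range is 5 to 45 seconds.
    if not 5.0 <= utt.duration_s <= 45.0:
        problems.append(f"duration out of range: {utt.duration_s}")
    if not utt.mandarin_text.strip():
        problems.append("empty transcript")
    return problems


if __name__ == "__main__":
    sample = Utterance("clip_0001.wav", "今天天气很好。", "F", "26-35", "Leshan", 9.4)
    print(validate(sample))
```

Running checks like these on load is a cheap way to catch mislabeled or truncated entries before training.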

2.4 Scenario-Based Design Close to Real Life

The collected content covers daily conversations, everyday scenarios, local cultural topics, and natural contexts. The goal is to help AI learn not “dictionary dialect,” but living language from real streets, homes, and communities.

3. Why This Matters

3.1 The Next Frontier for Large Models: Dialect Understanding

In scenarios such as intelligent customer service, in-car voice assistants, and smart home devices, users increasingly expect to interact with AI naturally in their own dialects. Dialect support is becoming a key differentiator for improving user experience and strengthening user engagement.

Whoever gains access to high-quality dialect data first will be better positioned to build localized AI services that truly work in real-world environments.

3.2 The Differentiated Value of Sub-Dialects

Existing Chuan-Yu dialect corpora are often labeled broadly as “Sichuan dialect” or “Chongqing dialect.” This dataset takes a more granular approach by focusing on sub-dialect differences across 12 cities.

This means:

  • Research institutions can use the dataset to train dialect recognition models capable of distinguishing city-level accents.
  • TTS systems can generate speech with specific city-level accent characteristics, rather than a generalized “Sichuan-flavored Mandarin.”
  • Voice assistants can automatically adapt their accent style based on the user’s city, enabling more authentic localization.
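
For accent recognition work, it can help to score predictions at two levels: the exact city, and the broader dialect area the city belongs to. The Python sketch below is a minimal illustration that takes its city-to-area mapping from the coverage table in section 2.1. The function name `accuracy_by_level` is hypothetical, and treating "Other" as a single bucket is a simplification, since the table does not present it as a dialect area.

```python
# City -> dialect area, following the coverage table in section 2.1.
CITY_TO_AREA = {
    "Chengdu": "Chengdu-Chongqing Area",
    "Chongqing": "Chengdu-Chongqing Area",
    "Leshan": "Minjiang Area",
    "Yibin": "Minjiang Area",
    "Luzhou": "Minjiang Area",
    "Zigong": "Renfu Sub-Area",
    "Neijiang": "Renfu Sub-Area",
    "Ya'an": "Yagan Sub-Area",
    "Xichang": "Yagan Sub-Area",
    "Nanchong": "Other",
    "Dazhou": "Other",
    "Guang'an": "Other",
}


def accuracy_by_level(true_cities, predicted_cities):
    """Return (city-level accuracy, area-level accuracy) for paired labels."""
    if len(true_cities) != len(predicted_cities):
        raise ValueError("label lists must have the same length")
    if not true_cities:
        raise ValueError("label lists must not be empty")
    n = len(true_cities)
    pairs = list(zip(true_cities, predicted_cities))
    city_hits = sum(t == p for t, p in pairs)
    # A prediction counts at area level if it lands in the same dialect area.
    area_hits = sum(CITY_TO_AREA[t] == CITY_TO_AREA[p] for t, p in pairs)
    return city_hits / n, area_hits / n


if __name__ == "__main__":
    truth = ["Chengdu", "Leshan", "Zigong", "Nanchong"]
    preds = ["Chongqing", "Leshan", "Neijiang", "Ya'an"]
    print(accuracy_by_level(truth, preds))
```

A large gap between the two scores would suggest a model has learned the broad area but not the city-level distinctions this dataset is meant to expose.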

4. Open-Source Plan and Usage Guide

4.1 Open-Source Scope

We are open-sourcing the following resources:

  • Speech data: 12-city sub-dialect speech clips in WAV format, 16 kHz, 16-bit
  • Annotation files: metadata including Standard Mandarin text, speaker attributes, city labels, and related information
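
After downloading, you may want to confirm that each clip matches the stated format of 16 kHz, 16-bit WAV. The Python sketch below uses only the standard-library `wave` module. The function name `check_wav` is illustrative. The channel count is reported but not checked, because the release notes do not state it.

```python
import wave

EXPECTED_RATE = 16000        # 16 kHz, as stated in the release notes
EXPECTED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16-bit


def check_wav(path):
    """Return (ok, info) where ok is True if the file is 16 kHz, 16-bit PCM."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        width = wf.getsampwidth()
        channels = wf.getnchannels()
        frames = wf.getnframes()
    info = {
        "sample_rate": rate,
        "bits": width * 8,
        "channels": channels,          # reported only; not part of the check
        "duration_s": frames / rate,
    }
    ok = rate == EXPECTED_RATE and width == EXPECTED_SAMPLE_WIDTH
    return ok, info
```

Calling `check_wav` on every file in a city subset and collecting the failures gives a quick integrity report. The `duration_s` value can also be compared against the 5–45 second range stated in section 2.3.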

4.2 Use Cases

This dataset is suitable for:

  • Dialect ASR model training and fine-tuning
  • Dialect TTS system development
  • Dialect accent recognition and classification research
  • Dialect-to-Standard Mandarin machine translation
  • Dialect preservation and digital archiving

4.3 Access

Dataset link: https://magichub.com/datasets/chuan-yu-12-city-sub-dialect-speech-dataset

4.4 Open-Source License

This dataset is released under the CC BY-NC-ND 4.0 license, which allows academic and other non-commercial use with attribution and does not allow distribution of modified versions. Please cite the dataset when using it.

Helping AI understand the living voices of Sichuan and Chongqing begins with every city.
