New Open-Source Release | Chuan-Yu 12-City Sub-Dialect Speech Dataset: Helping Large Models Understand the Living Voices of Sichuan and Chongqing

Posted 3 weeks ago

This dataset helps AI better capture the rich dialect diversity across different cities in Sichuan and Chongqing.

1. Dialects: A Critical Challenge AI Has Yet to Overcome

Sichuan and Chongqing are often discussed as one broad dialect region, but in real life, speech varies noticeably from city to city. The way people speak in Chengdu is not exactly the same as in Chongqing, Zigong, Leshan, Yibin, Luzhou, or other cities across the region. Differences in pronunciation, rhythm, intonation, and local usage all shape how these dialects sound in everyday conversations.

For speech AI, this creates a practical challenge. A model trained only on broadly labeled “Sichuanese” or “Chongqing dialect” data may not capture the finer differences between local city-level varieties. In real-world applications such as voice assistants, in-car voice interaction, smart home devices, and customer service, these differences can affect how well a system understands users and how natural the interaction feels.

That is why we are releasing the Chuan-Yu 12-City Sub-Dialect Speech Dataset. Covering 12 cities across Sichuan and Chongqing, this dataset is designed to provide more granular speech data for building, evaluating, and improving AI systems that need to work with real regional speech.

2. Our Dataset: Built to Help AI Understand Every City

Unlike existing large-scale Chuan-Yu dialect corpora, this dataset focuses on one core goal: enabling large models not only to process Chuan-Yu dialects, but also to accurately capture the subtle pronunciation differences between cities.

2.1 Coverage of 12 Cities with Precise Sub-Dialect Labeling

The dataset covers 12 cities across Sichuan and Chongqing.

Dialect Area	Representative City	Duration (hours)	Sentences
Chengdu-Chongqing Area	Chengdu	5.18	1,993
Chengdu-Chongqing Area	Chongqing	4.99	2,034
Minjiang Area	Leshan	3.52	1,308
Minjiang Area	Yibin	3.05	1,190
Minjiang Area	Luzhou	3.26	1,330
Renfu Sub-Area	Zigong	2.27	885
Renfu Sub-Area	Neijiang	2.68	889
Yagan Sub-Area	Ya’an	1.69	727
Yagan Sub-Area	Xichang	3.28	1,222
Other	Nanchong	1.19	476
Other	Dazhou	1.30	478
Other	Guang’an	1.38	536

Total: about 33.8 hours / 13,068 sentences / 38 native speakers
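
As a quick sanity check, the per-city rows can be summed and compared with the stated totals. The sketch below copies the durations and sentence counts from the table (the variable names are illustrative, and the apostrophes in city names are typed as plain ASCII) and also derives the mean sentence duration implied by those figures.

```python
# Per-city figures copied from the table above: (city, hours, sentences).
ROWS = [
    ("Chengdu", 5.18, 1993),
    ("Chongqing", 4.99, 2034),
    ("Leshan", 3.52, 1308),
    ("Yibin", 3.05, 1190),
    ("Luzhou", 3.26, 1330),
    ("Zigong", 2.27, 885),
    ("Neijiang", 2.68, 889),
    ("Ya'an", 1.69, 727),
    ("Xichang", 3.28, 1222),
    ("Nanchong", 1.19, 476),
    ("Dazhou", 1.30, 478),
    ("Guang'an", 1.38, 536),
]

total_hours = sum(hours for _, hours, _ in ROWS)
total_sentences = sum(count for _, _, count in ROWS)
# Mean duration implied by the table, in seconds per sentence.
mean_seconds = total_hours * 3600 / total_sentences

print(f"cities: {len(ROWS)}")
print(f"total hours: {total_hours:.2f}")
print(f"total sentences: {total_sentences}")
print(f"mean sentence duration: {mean_seconds:.1f} s")
```

With the table's values, the rows add up to 33.79 hours and 13,068 sentences, which works out to roughly 9.3 seconds per sentence.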

Each city is organized as an independent subset. All speakers are native speakers who were born and raised locally, ensuring that every audio clip carries the authentic voice of its city.

We do not treat Chuan-Yu dialect as one broad, mixed category. Instead, Chengdu speakers record Chengdu dialect, Chongqing speakers record Chongqing dialect, and so on. Only local speakers can accurately preserve the tones, rhythms, and pronunciation habits of their own city.

2.2 Recorded by Native Speakers, Reviewed by Native Speakers

On the collection side, all speakers are local native speakers with local household registration and long-term residence in the corresponding city. The speakers range in age from 18 to 65 and come from a mix of genders and occupational backgrounds, which helps ensure both diversity and representativeness.

On the QA side, transcription and quality assessment were also completed by local native speakers. We established review teams for all 12 cities, with each team led by dialect specialists familiar with the local accent. Each utterance was carefully checked for transcription accuracy, pronunciation quality, and dialect-specific features.

This “local speakers record, local speakers review” mechanism helps maximize both authenticity and accuracy.

2.3 Annotation System: More Than Transcription

For each speech sample, we provide multi-dimensional annotation information, including:

Standard Mandarin text
Speaker gender
Speaker age group
Recording city
Sentence duration (5–45 seconds per sentence, around 10 seconds on average)
Natural punctuation and sentence segmentation, without forced cutting

This annotation system makes the dataset suitable not only for speech recognition training, but also for speech synthesis, dialect feature analysis, accent recognition, and other speech AI tasks.
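
To make the annotation fields concrete, here is a minimal sketch of how one annotated sample might be represented and checked in Python. The class name, field names, value formats (such as the age-group string), and the example path are all assumptions made for illustration; the actual layout of the released annotation files may differ.

```python
from dataclasses import dataclass

# The 12 cities listed in the coverage table (ASCII apostrophes assumed).
CITIES = frozenset({
    "Chengdu", "Chongqing", "Leshan", "Yibin", "Luzhou", "Zigong",
    "Neijiang", "Ya'an", "Xichang", "Nanchong", "Dazhou", "Guang'an",
})


@dataclass(frozen=True)
class Utterance:
    """One annotated clip. Field names are hypothetical, not the dataset's schema."""
    audio_path: str
    mandarin_text: str   # Standard Mandarin text
    gender: str
    age_group: str
    city: str            # recording city
    duration_s: float


def find_problems(utt: Utterance) -> list[str]:
    """Return human-readable issues; an empty list means the record looks fine."""
    problems = []
    if utt.city not in CITIES:
        problems.append(f"unknown city: {utt.city}")
    # The article states that sentences last 5-45 seconds.
    if not 5.0 <= utt.duration_s <= 45.0:
        problems.append(f"duration outside 5-45 s: {utt.duration_s}")
    if not utt.mandarin_text.strip():
        problems.append("empty transcript")
    return problems


example = Utterance(
    audio_path="chengdu/0001.wav",   # hypothetical path
    mandarin_text="今天天气很好。",
    gender="female",
    age_group="26-35",               # hypothetical label format
    city="Chengdu",
    duration_s=9.3,
)
print(find_problems(example))
```

Keeping the city label on every record is what allows the same data to serve both recognition training and city-level accent classification.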

2.4 Scenario-Based Design Close to Real Life

The collected content covers daily conversations, everyday scenarios, local cultural topics, and natural contexts. The goal is to help AI learn not “dictionary dialect,” but living language from real streets, homes, and communities.

3. Why This Matters

3.1 The Next Frontier for Large Models: Dialect Understanding

In scenarios such as intelligent customer service, in-car voice assistants, and smart home devices, users increasingly expect to interact with AI naturally in their own dialects. Dialect support is becoming a key differentiator for improving user experience and strengthening user engagement.

Whoever gains access to high-quality dialect data first will be better positioned to build localized AI services that truly work in real-world environments.

3.2 The Differentiated Value of Sub-Dialects

Existing Chuan-Yu dialect corpora are often labeled broadly as “Sichuan dialect” or “Chongqing dialect.” This dataset takes a more granular approach by focusing on sub-dialect differences across 12 cities.

This means:

Research institutions can use the dataset to train dialect recognition models capable of distinguishing city-level accents.
TTS systems can generate speech with specific city-level accent characteristics, rather than a generalized “Sichuan-flavored Mandarin.”
Voice assistants can automatically adapt their accent style based on the user’s city, enabling more authentic localization.

4. Open-Source Plan and Usage Guide

4.1 Open-Source Scope

We are open-sourcing the following resources:

Speech data: 12-city sub-dialect speech clips in WAV format, 16 kHz, 16-bit
Annotation files: metadata including Standard Mandarin text, speaker attributes, city labels, and related information
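
Before training, it can be useful to confirm that downloaded clips match the stated audio format. The following is a sketch using only Python's standard `wave` module; the helper names are illustrative, and because no dataset file is bundled here, the demo runs against a synthetic silent clip built in memory. The channel count is reported but not checked, since the release notes above do not state it.

```python
import io
import wave

EXPECTED_RATE_HZ = 16_000    # "16 kHz" from the release notes
EXPECTED_SAMPLE_BYTES = 2    # "16-bit" means 2 bytes per sample


def describe_wav(source) -> dict:
    """Read WAV header fields from a path or a binary file-like object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        return {
            "sample_rate_hz": rate,
            "sample_width_bytes": width,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
            "matches_spec": rate == EXPECTED_RATE_HZ
            and width == EXPECTED_SAMPLE_BYTES,
        }


def make_silent_wav(seconds: float) -> io.BytesIO:
    """Build an in-memory stand-in clip (mono silence); not dataset audio."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(EXPECTED_SAMPLE_BYTES)
        wav.setframerate(EXPECTED_RATE_HZ)
        wav.writeframes(b"\x00\x00" * int(seconds * EXPECTED_RATE_HZ))
    buffer.seek(0)
    return buffer


info = describe_wav(make_silent_wav(2.0))
print(info)
```

For real data, `describe_wav` can be called with the path to a downloaded clip instead of the in-memory buffer.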

4.2 Use Cases

This dataset is suitable for:

Dialect ASR model training and fine-tuning
Dialect TTS system development
Dialect accent recognition and classification research
Dialect-to-Standard Mandarin machine translation
Dialect preservation and digital archiving

4.3 Access

Dataset link: https://magichub.com/datasets/chuan-yu-12-city-sub-dialect-speech-dataset

4.4 Open-Source License

This dataset is released under the CC BY-NC-ND 4.0 license. It is available for academic and other non-commercial use with attribution, and the license does not permit distributing modified versions. Please cite the dataset when using it.