The Chuan-Yu 12-City Sub-dialect Speech Dataset is an open-source Chinese dialect speech dataset focusing on city-level sub-dialect varieties in the Sichuan-Chongqing region. “Chuan-Yu” refers to Sichuan and Chongqing, a region where local dialects are widely used in daily communication and have distinctive pronunciation, intonation, and regional expressions.
The dataset is designed to help speech AI systems better understand fine-grained dialect differences within the Chuan-Yu region. Instead of treating Sichuanese or Chongqing dialects as broad categories, this dataset provides city-level coverage across 12 representative cities, making it suitable for research on sub-dialect variation, accent classification, dialect speech recognition, and localized speech technology.
Dataset Overview
| Dialect Area | Representative City | Duration (h) | Utterances |
|---|---|---:|---:|
| Cheng-Yu Area | Chengdu | 5.18 | 1,993 |
| Cheng-Yu Area | Chongqing | 4.99 | 2,034 |
| Minjiang Area | Leshan | 3.52 | 1,308 |
| Minjiang Area | Yibin | 3.05 | 1,190 |
| Minjiang Area | Luzhou | 3.26 | 1,330 |
| Renfu Sub-area | Zigong | 2.27 | 885 |
| Renfu Sub-area | Neijiang | 2.68 | 889 |
| Yagan Sub-area | Ya’an | 1.69 | 727 |
| Yagan Sub-area | Xichang | 3.28 | 1,222 |
| Others | Nanchong | 1.19 | 476 |
| Others | Dazhou | 1.3 | 478 |
| Others | Guang’an | 1.38 | 536 |
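As a quick consistency check, the per-city figures in the table can be aggregated by dialect area. The following Python sketch hard-codes the table rows; `ROWS` and `summarize` are illustrative names, not part of any official dataset tooling.

```python
from collections import defaultdict

# (dialect area, city, duration in hours, utterance count), copied from the table
ROWS = [
    ("Cheng-Yu Area", "Chengdu", 5.18, 1993),
    ("Cheng-Yu Area", "Chongqing", 4.99, 2034),
    ("Minjiang Area", "Leshan", 3.52, 1308),
    ("Minjiang Area", "Yibin", 3.05, 1190),
    ("Minjiang Area", "Luzhou", 3.26, 1330),
    ("Renfu Sub-area", "Zigong", 2.27, 885),
    ("Renfu Sub-area", "Neijiang", 2.68, 889),
    ("Yagan Sub-area", "Ya'an", 1.69, 727),
    ("Yagan Sub-area", "Xichang", 3.28, 1222),
    ("Others", "Nanchong", 1.19, 476),
    ("Others", "Dazhou", 1.3, 478),
    ("Others", "Guang'an", 1.38, 536),
]


def summarize(rows):
    """Return ({area: [hours, utterances]}, total_hours, total_utterances)."""
    per_area = defaultdict(lambda: [0.0, 0])
    for area, _city, hours, utterances in rows:
        per_area[area][0] += hours
        per_area[area][1] += utterances
    total_hours = sum(hours for _area, _city, hours, _n in rows)
    total_utts = sum(n for _area, _city, _hours, n in rows)
    return dict(per_area), total_hours, total_utts


if __name__ == "__main__":
    per_area, total_hours, total_utts = summarize(ROWS)
    for area, (hours, utterances) in per_area.items():
        print(f"{area:<15} {hours:6.2f} h {utterances:6d} utterances")
    print(f"{'Total':<15} {total_hours:6.2f} h {total_utts:6d} utterances")
    # Implied mean utterance length in seconds
    print(f"Mean utterance length: {total_hours * 3600 / total_utts:.2f} s")
```

Summed over all twelve cities, the table gives 33.79 hours and 13,068 utterances, an implied mean of about 9.3 seconds per utterance.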
City-level Sub-dialect Coverage
The dataset covers 12 cities in the Sichuan-Chongqing region, including Chengdu, Chongqing, Leshan, Yibin, Luzhou, Zigong, Neijiang, Ya’an, Xichang, Nanchong, Dazhou, and Guang’an.
Each city is organized as an independent subset. This structure makes it easier to study differences in pronunciation, rhythm, tone, and accent between local varieties. For example, speech from Chengdu, Chongqing, and Zigong is broadly associated with the Chuan-Yu dialect region, yet local pronunciation and speaking style can differ considerably from city to city.
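If the per-city subsets are distributed as one directory per city, indexing them takes only a few lines. This sketch assumes a hypothetical `<root>/<city>/*.wav` layout; the released dataset's actual structure may differ, and `index_city_subsets` is an illustrative helper.

```python
from pathlib import Path


def index_city_subsets(root):
    """Map each city subdirectory under `root` to its sorted list of WAV files.

    Assumes a hypothetical layout <root>/<city>/<utterance>.wav.
    The glob is case-sensitive on most filesystems, so ".WAV" files are skipped.
    Directories without any WAV files are omitted from the result.
    """
    subsets = {}
    for city_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        wavs = sorted(city_dir.glob("*.wav"))
        if wavs:
            subsets[city_dir.name] = wavs
    return subsets
```

Keeping each city under its own key makes it simple to, for example, hold out one city entirely when evaluating how well an accent classifier generalizes across sub-dialects.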
Native Speaker Recording and Review
All speech data was recorded by native speakers of the local dialects. Speakers were selected from the corresponding cities and span different age groups, genders, and occupational backgrounds, which improves the diversity and representativeness of the dataset.
Annotation and quality review were also carried out with the support of native speakers familiar with the local accents. This “local speaker recording + local speaker review” process helps ensure the authenticity and accuracy of the audio, the transcriptions, and the dialect-related features.
Annotation Information
Each speech segment includes the following annotations:
- Standard Mandarin transcription
- Speaker gender
- Speaker age group
- Recording city
- Audio duration
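The annotation fields above map naturally onto a small typed record. This is a minimal sketch assuming the metadata ships as a CSV file; the column names (`audio_path`, `text`, `gender`, `age_group`, `city`, `duration`) and the `Utterance` class are hypothetical, not the dataset's documented schema.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Utterance:
    """One annotated speech segment. Field names are hypothetical."""
    audio_path: str
    mandarin_text: str  # Standard Mandarin transcription
    gender: str         # speaker gender
    age_group: str      # speaker age group
    city: str           # recording city, e.g. "Chengdu"
    duration_s: float   # audio duration in seconds


def load_metadata(lines: Iterable[str]) -> List[Utterance]:
    """Parse CSV metadata (hypothetical column names) into Utterance records."""
    return [
        Utterance(
            audio_path=row["audio_path"],
            mandarin_text=row["text"],
            gender=row["gender"],
            age_group=row["age_group"],
            city=row["city"],
            duration_s=float(row["duration"]),
        )
        for row in csv.DictReader(lines)
    ]
```

Typical use would be `with open("metadata.csv", newline="", encoding="utf-8") as f: records = load_metadata(f)`, where the file name is again a placeholder.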
Utterances range from approximately 5 to 45 seconds, with an average duration of around 10 seconds. Segments are cut at natural sentence boundaries, as marked by punctuation, rather than at arbitrary points, so speech is not split mid-sentence.
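Because segment lengths vary roughly nine-fold (about 5 to 45 seconds), ASR training pipelines often sort utterances by length and cap each batch by total audio duration rather than by a fixed utterance count, which reduces padding. A minimal Python sketch of such duration-based batching follows; `duration_batches` and its `max_batch_seconds` budget are illustrative choices, not part of the dataset.

```python
def duration_batches(durations, max_batch_seconds=120.0):
    """Group utterance indices into batches whose total audio stays within a budget.

    Indices are visited in order of increasing duration, so each batch holds
    clips of similar length. An utterance longer than the budget on its own
    still gets a batch of its own rather than being dropped.
    """
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches, current, total = [], [], 0.0
    for i in order:
        d = durations[i]
        # Close the current batch if adding this clip would exceed the budget
        if current and total + d > max_batch_seconds:
            batches.append(current)
            current, total = [], 0.0
        current.append(i)
        total += d
    if current:
        batches.append(current)
    return batches
```

The budget trades memory for throughput: a larger `max_batch_seconds` packs more short clips per batch, while the longest (up to about 45 s) segments end up in small batches of their own.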
Speech Content
The recording content covers daily conversations, real-life communication scenarios, and local cultural topics. The dataset is designed to capture practical spoken language rather than isolated dictionary-style dialect words, making it more suitable for real-world speech AI research and application development.
Data Format
- Audio format: WAV
- Sampling rate: 16 kHz
- Bit depth: 16-bit
- Transcription: Standard Mandarin text
- Metadata: speaker gender, age group, recording city, and other related information
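The audio specification can be verified with Python's standard-library `wave` module, which reads uncompressed PCM WAV headers. This sketch checks only the sample rate and bit depth listed above; `check_wav` is an illustrative helper, and the channel count is not checked because it is not specified here.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # 16 kHz sampling rate
EXPECTED_SAMPLE_BYTES = 2  # 16-bit samples occupy 2 bytes


def check_wav(path):
    """Return (problems, duration_s) for one WAV file; problems is empty if the header matches."""
    problems = []
    with wave.open(str(path), "rb") as w:
        rate = w.getframerate()
        width = w.getsampwidth()
        if rate != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ}")
        if width != EXPECTED_SAMPLE_BYTES:
            problems.append(f"bit depth is {8 * width}, expected {8 * EXPECTED_SAMPLE_BYTES}")
        # Duration derived from the header: frames divided by frames per second
        duration_s = w.getnframes() / rate
    return problems, duration_s
```

The returned duration can also be compared against the audio duration recorded in the metadata to catch truncated files.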
Potential Applications
This dataset can be used for:
- Dialect speech recognition model training and fine-tuning
- Dialect-aware speech synthesis research
- Dialect-to-Standard Mandarin speech or text conversion
- Regional speech technology development
- Dialect culture preservation and digital archiving
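Because the transcriptions are Standard Mandarin text, a common way to score dialect ASR output against them is character error rate (CER): the Levenshtein edit distance between reference and hypothesis characters, divided by the reference length. The sketch below is a plain dynamic-programming implementation; how punctuation and whitespace are normalized before scoring is an assumption left to the caller, since the dataset does not specify it.

```python
def cer(reference, hypothesis):
    """Character error rate: character-level Levenshtein distance / len(reference)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] = edit distance between the reference prefix so far and hypothesis[:j]
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, h in enumerate(hypothesis, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(reference)
```

For example, a hypothesis that drops one character from a six-character reference scores a CER of 1/6.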
By providing city-level sub-dialect speech data from the Chuan-Yu region, this dataset supports the development of speech AI systems that can better understand real-world regional language variation and provide more localized speech interaction experiences.
