
Dataset Overview

Dataset Type


Speech Style


Audio Parameters

File Format

Recording Equipment

Recording Environment

Third Party

LEX-MSUSwibTrans: A Transcriptions & Lexicon of Switchboard Dataset

About this resource:

This resource mirrors the Switchboard transcriptions generated at Mississippi State University, together with the associated lexicon. Both were released without any license restrictions.

The Switchboard (SWB) corpus is one of the most important historical benchmarks for large vocabulary conversational speech recognition (LVCSR). It contains 2,430 conversations averaging 6 minutes in length, which amounts to over 240 hours of recorded speech and about 3 million words of text, spoken by over 500 speakers of both sexes from every major dialect of American English.
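The "over 240 hours" figure follows directly from the conversation count and the average length. The short sketch below simply redoes that arithmetic; the function name `total_hours` is a hypothetical helper introduced here, and the inputs are the rounded figures quoted above, not exact corpus statistics.

```python
def total_hours(num_conversations: int, avg_minutes: float) -> float:
    """Approximate total audio duration in hours (hypothetical helper)."""
    return num_conversations * avg_minutes / 60


# Rounded figures quoted in the description, not exact corpus statistics.
swb_hours = total_hours(2430, 6)
print(f"Approximate SWB duration: {swb_hours:.0f} hours")
```

With these rounded inputs the result is 243 hours, consistent with the stated "over 240 hours".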

The initial SWB transcriptions have error rates above 10%, which leads to poor recognition performance, particularly on hard-to-recognize items such as monosyllabic words. This release of the SWB transcriptions, developed by the Institute for Signal and Information Processing at Mississippi State University in the late 1990s, contains manually corrected transcriptions with error rates below 1%. It also includes manually adjusted segmentations and word alignments.
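The description does not say how the transcription error rates were measured. As an illustration only, the sketch below assumes a word-level error rate: the word edit distance (substitutions, deletions, insertions) between a reference and a candidate transcription, divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical and are not taken from the corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: whitespace tokenization and exact string match per word.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one dropped word out of four reference words.
rate = word_error_rate("i think so too", "i think too")
```

Under this assumed definition, a transcription error rate below 1% would correspond to fewer than one word edit per hundred reference words.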

