MagicData


Third Party
NLP Corpus

NLP-CantabTEDLIUM1.1: Cantab Research Language Models for the TEDLIUM Database

About this resource:

Cantab-TEDLIUM Release 1.1 (February 2015)

This is the README from the release http://cantabResearch.com/cantab-TEDLIUM.tar.bz2.

This release contains all the files required to reproduce the IWSLT baseline results quoted in Section 5.2 of "Scaling Recurrent Neural Network Language Models" (ICASSP 2015), which can be found at http://arxiv.org/abs/1502.00512.

Contents

  • cantab-TEDLIUM.txt contains 155,290,779 tokens, entropy-filtered from http://cantabResearch.com/cantab-1bn-norm.tar.bz2, which in turn was generated from https://code.google.com/p/1-billion-word-language-modeling-benchmark/.
  • cantab-TEDLIUM-unpruned.lm3 is the 3-gram built from cantab-TEDLIUM.txt with Witten-Bell smoothing.
  • cantab-TEDLIUM-pruned.lm3 is the pruned version of cantab-TEDLIUM-unpruned.lm3, suitable for use in a first-pass decode with Kaldi.
  • cantab-TEDLIUM-unpruned.lm4 is an unpruned Kneser-Ney smoothed 4-gram provided for rescoring lattices produced by the above decode step.
  • cantab-TEDLIUM.dct is the 150-thousand-word vocabulary for the above two LMs, including phonetic pronunciations.
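
As a quick sanity check on a download, the token count of cantab-TEDLIUM.txt and the size of the vocabulary in cantab-TEDLIUM.dct can be inspected with a few lines of Python. This is a minimal sketch, not part of the release: it assumes tokens in the text file are whitespace-separated, and it assumes each line of the .dct file holds a word followed by its phones, also whitespace-separated. The README does not specify either layout, so adjust the parsing if the files differ. The function names `count_tokens` and `load_lexicon` are illustrative only.

```python
from collections import defaultdict
from pathlib import Path


def count_tokens(path):
    """Count whitespace-separated tokens, streaming one line at a time."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += len(line.split())
    return total


def load_lexicon(path):
    """Map each word to a list of pronunciations (each a list of phones).

    ASSUMPTION: one entry per line, the word first and its phones after it,
    all separated by whitespace. A word may appear on several lines if it
    has more than one pronunciation.
    """
    lexicon = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            word, phones = fields[0], fields[1:]
            lexicon[word].append(phones)
    return dict(lexicon)


if __name__ == "__main__":
    corpus = Path("cantab-TEDLIUM.txt")
    lexicon_file = Path("cantab-TEDLIUM.dct")
    if corpus.exists():
        print("tokens:", count_tokens(corpus))
    if lexicon_file.exists():
        print("distinct words:", len(load_lexicon(lexicon_file)))
```

If the whitespace assumption holds, the first figure should be comparable to the 155,290,779 tokens quoted above, and the second to the roughly 150 thousand words in the vocabulary.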

Contact: tonyr _at_ cantabresearch.com
