MagicData

Third party
NLP corpus

NLP-CantabTEDLIUM1.1: Cantab Research Language Models for the TEDLIUM Database

About this resource:

Cantab-TEDLIUM Release 1.1 (February 2015)

This is the README from the release http://cantabResearch.com/cantab-TEDLIUM.tar.bz2.

This release contains all the files required to reproduce the IWSLT baseline results quoted in Section 5.2 of "Scaling Recurrent Neural Network Language Models" (ICASSP 2015), which can be found at http://arxiv.org/abs/1502.00512.

Contents

  • cantab-TEDLIUM.txt contains 155,290,779 tokens, entropy-filtered from http://cantabResearch.com/cantab-1bn-norm.tar.bz2, which in turn was generated from https://code.google.com/p/1-billion-word-language-modeling-benchmark/.
  • cantab-TEDLIUM-unpruned.lm3 is the 3-gram LM built from cantab-TEDLIUM.txt with Witten-Bell smoothing.
  • cantab-TEDLIUM-pruned.lm3 is the pruned version of cantab-TEDLIUM-unpruned.lm3, suitable for use in a first-pass decode with Kaldi.
  • cantab-TEDLIUM-unpruned.lm4 is an unpruned Kneser-Ney smoothed 4-gram LM provided for rescoring lattices produced by the above decode step.
  • cantab-TEDLIUM.dct is the 150,000-word vocabulary for the above two LMs, including phonetic pronunciations.
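
For readers unfamiliar with the Witten-Bell smoothing mentioned above, the following is a minimal sketch of the interpolated form on a toy bigram model. It is not the tooling used to build the released 3-gram LM; the function name `train_witten_bell_bigram`, the toy sentence, and the choice of an unsmoothed unigram as the lower-order distribution are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_witten_bell_bigram(tokens):
    """Return prob(w, h) for an interpolated Witten-Bell bigram model.

    Illustrative only: the lower-order model is a plain maximum-likelihood
    unigram, so words never seen in training still get zero probability.
    """
    unigrams = Counter(tokens)
    total = sum(unigrams.values())
    bigrams = Counter(zip(tokens, tokens[1:]))
    # c(h): how often each word occurs as a history (it has a successor).
    history_counts = Counter(tokens[:-1])
    # T(h): the set of distinct word types observed after each history.
    followers = defaultdict(set)
    for h, w in bigrams:
        followers[h].add(w)

    def unigram_prob(w):
        return unigrams[w] / total

    def prob(w, h):
        c_h = history_counts[h]
        if c_h == 0:
            # Unseen history: fall back entirely to the unigram model.
            return unigram_prob(w)
        t_h = len(followers[h])
        # P(w|h) = (c(h,w) + T(h) * P_uni(w)) / (c(h) + T(h))
        return (bigrams[(h, w)] + t_h * unigram_prob(w)) / (c_h + t_h)

    return prob


# Toy usage (hypothetical data, not taken from cantab-TEDLIUM.txt).
toy_tokens = "the cat sat on the mat the cat ran".split()
toy_prob = train_witten_bell_bigram(toy_tokens)
toy_vocab = sorted(set(toy_tokens))
toy_mass = sum(toy_prob(w, "the") for w in toy_vocab)
```

The weight given to the lower-order model grows with the number of distinct words seen after a history, so histories followed by many different words reserve more probability mass for unseen continuations. In the toy usage, `toy_mass` sums the conditional distribution after "the" over the training vocabulary, which should come out to 1 up to floating-point error.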

Contact: tonyr _at_ cantabresearch.com
