[Datasets] A roundup of open-source datasets for Chinese speech recognition


Dataset download links

OpenSLR: http://www.openslr.org/resources.php

1.SLR18-THCHS-30

THCHS30 is an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) at Tsinghua University. The original recording was conducted in 2002 by Dong Wang, supervised by Prof. Xiaoyan Zhu, at the Key State Lab of Intelligence and System, Department of Computer Science, Tsinghua University, and the original name was 'TCMSD', standing for 'Tsinghua Continuous Mandarin Speech Database'. The publication after 13 years has been initiated by Dr. Dong Wang and was supported by Prof. Xiaoyan Zhu. We hope to provide a toy database for new researchers in the field of speech recognition. Therefore, the database is totally free to academic users.

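THCHS30 is a classic Chinese speech dataset. It contains over 10,000 utterances, more than 30 hours of Mandarin speech in total; the content is mostly articles and verse, and all the voices are female. It is an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) at Tsinghua University. The original recordings were made in 2002 under the supervision of Prof. Xiaoyan Zhu at the Department of Computer Science, Tsinghua University, under the name 'TCMSD' (Tsinghua Continuous Mandarin Speech Database). The release 13 years later was initiated by Dr. Dong Wang with Prof. Zhu's support. The goal is to give newcomers to speech recognition a toy-scale database, so it is completely free for academic users.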

2.SLR33 Aishell

Aishell is an open-source Chinese Mandarin speech corpus published by Beijing Shell Shell Technology Co.,Ltd.
400 people from different accent areas in China are invited to participate in the recording, which is conducted in a quiet indoor environment using high-fidelity microphones and downsampled to 16kHz. The manual transcription accuracy is above 95%, through professional speech annotation and strict quality inspection. The data is free for academic use. We hope to provide a moderate amount of data for new researchers in the field of speech recognition.

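AISHELL is a Chinese speech dataset released by Beijing Shell Shell Technology Co., Ltd.; the open-source edition contains about 178 hours of data. It covers 400 speakers from different regions of China with different accents. Recordings were made in a quiet indoor environment with high-fidelity microphones and downsampled to 16kHz. With professional speech annotation and strict quality inspection, the manual transcription accuracy is above 95%. The data is free for academic use, and is meant to give new researchers in speech recognition a moderate amount of data to work with.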

3.SLR38 ST-CMDS

This corpus was recorded in a silent indoor environment using cellphones. It has 855 speakers. Each speaker has 120 utterances. All utterances were carefully transcribed and checked by humans. Transcription accuracy is guaranteed. If there is any problem, we agree to correct it for you. The corpus contains:
audio files;
transcriptions;
metadata;

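ST-CMDS is a Chinese speech dataset released by an AI data company. It contains over 100,000 utterances (855 speakers × 120 utterances = 102,600), roughly 100 hours of speech. The content is mostly everyday online voice chat and voice-control commands for smart devices. With 855 different speakers, both male and female, it suits a wide range of scenarios.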

4.SLR47 Primewords Chinese Corpus Set 1

This free Chinese Mandarin speech corpus set is released by Shanghai Primewords Information Technology Co., Ltd.
The corpus is recorded by smart mobile phones from 296 native Chinese speakers. The transcription accuracy is larger than 98%, at the confidence level of 95%. It is free for academic use.
The mapping between the transcript and utterance is given in JSON format.

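Primewords contains about 100 hours of Chinese speech. This free Mandarin corpus is released by Shanghai Primewords Information Technology Co., Ltd. It was recorded on smartphones by 296 native Chinese speakers. The transcription accuracy is above 98% at a 95% confidence level, and the corpus is free for academic use. The mapping between transcripts and utterances is given in JSON format.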
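Since the transcript-to-utterance mapping ships as JSON, a lookup table from audio file to transcript is usually the first thing to build. The following is only a minimal sketch: the source does not describe the JSON schema, so the record layout (a list of objects) and the field names `file` and `text` are assumptions, and `SAMPLE` and `build_index` are hypothetical names made up for illustration. Check the real file and adjust the keys before using it.

```python
import json

# Hypothetical sample record; the real file's layout and field names may differ.
SAMPLE = '[{"file": "utt_0001.wav", "text": "今天 天气 很 好"}]'


def build_index(json_text, audio_key="file", text_key="text"):
    """Map each audio file name to its transcript.

    Assumes the JSON is a list of objects, one per utterance, and that
    audio_key / text_key name the fields holding the file name and the
    transcript. Both key names are assumptions, not the documented schema.
    """
    records = json.loads(json_text)
    return {rec[audio_key]: rec[text_key] for rec in records}


index = build_index(SAMPLE)
print(index["utt_0001.wav"])
```

Passing the key names as parameters keeps the sketch usable if the real fields turn out to be named differently.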

5.SLR62 aidatatang_200zh

Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License.
The contents and the corresponding descriptions of the corpus include:
The corpus contains 200 hours of acoustic data, which is mostly mobile recorded data.
600 speakers from different accent areas in China are invited to participate in the recording.
The transcription accuracy for each sentence is larger than 98%.
Recordings are conducted in a quiet indoor environment.
The database is divided into training set, validation set, and testing set in a ratio of 7:1:2.
Detailed information such as speech data coding and speaker information is preserved in the metadata file.
Segmented transcripts are also provided.
The corpus aims to support researchers in speech recognition, machine translation, voiceprint recognition, and other speech-related fields. Therefore, the corpus is totally free for academic use.

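Aidatatang_200zh is an open Chinese Mandarin telephone speech corpus provided by Beijing DataTang Technology Co., Ltd.
The corpus runs to 200 hours, recorded on Android phones (16kHz, 16-bit) and iOS phones (16kHz, 16-bit). 600 speakers from different accent areas in China were invited to take part. Recordings were made either in a quiet indoor environment or in environments whose background noise does not affect speech recognition. The participants' gender and age are evenly distributed. The language material consists of spoken sentences designed to be phoneme-balanced. The manual transcription accuracy for each sentence is above 98%.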
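When mixing several of these corpora, it can help to confirm that files really are in the 16kHz, 16-bit format described above. Here is a minimal sketch using Python's standard `wave` module. It assumes the audio is uncompressed PCM WAV (some distributions may package audio differently), and `describe_wav` and `is_16k_16bit` are hypothetical helper names, not part of any corpus tooling.

```python
import wave


def describe_wav(path):
    """Read basic format information from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }


def is_16k_16bit(path):
    """True if the file matches the 16kHz, 16-bit format quoted for the corpus."""
    info = describe_wav(path)
    return info["sample_rate_hz"] == 16000 and info["bit_depth"] == 16
```

Summing `duration_s` over a directory also gives a quick sanity check against the advertised number of hours.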

6.SLR68 magicdata

MAGICDATA Mandarin Chinese Read Speech Corpus was developed by MAGIC DATA Technology Co., Ltd. and freely published for non-commercial use.
The contents and the corresponding descriptions of the corpus include:
The corpus contains 755 hours of speech data, which is mostly mobile recorded data.
1080 speakers from different accent areas in China are invited to participate in the recording.
The sentence transcription accuracy is higher than 98%.
Recordings are conducted in a quiet indoor environment.
The database is divided into training set, validation set, and testing set in a ratio of 51:1:2.
Detailed information such as speech data coding and speaker information is preserved in the metadata file.
The domain of recording texts is diversified, including interactive Q&A, music search, SNS messages, home command and control, etc.
Segmented transcripts are also provided.
The corpus aims to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields. Therefore, the corpus is totally free for academic use.

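This is the corpus from Magic Data Technology Co., Ltd. It contains 755 hours of speech, mostly recorded on mobile devices, from 1,080 speakers in different accent areas of China. The sentence transcription accuracy is above 98%, and recordings were made in a quiet indoor environment. The database is divided into training, validation, and test sets in a ratio of 51:1:2, and details such as speech data coding and speaker information are kept in the metadata file. The recording texts cover varied domains, including interactive Q&A, music search, SNS messages, and home command and control. Segmented transcripts are also provided. The corpus is meant to support research in speech recognition, machine translation, speaker recognition, and other speech-related fields, and is completely free for academic use.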
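To get a feel for what the stated split ratios mean, the arithmetic can be sketched as below. This is only an estimate: it assumes the ratio applies to hours of audio, whereas the descriptions do not say whether the split is by hours, utterances, or speakers. `split_hours` is a hypothetical helper name.

```python
def split_hours(total_hours, ratio):
    """Divide total_hours among train/validation/test according to ratio.

    Assumes the ratio is applied to hours of audio, which the corpus
    descriptions do not confirm.
    """
    parts = sum(ratio)
    return [round(total_hours * r / parts, 1) for r in ratio]


# aidatatang_200zh: 200 hours at 7:1:2
print(split_hours(200, (7, 1, 2)))    # [140.0, 20.0, 40.0]

# MAGICDATA: 755 hours at 51:1:2
print(split_hours(755, (51, 1, 2)))   # [713.1, 14.0, 28.0]
```

Under this assumption MAGICDATA puts about 94% of its audio into training, while aidatatang_200zh keeps a much larger share (30%) for validation and testing.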

