Dataocean AI Has Participated in Creating the Open-Source Dataset GigaSpeech 2: A Large-Scale and Multi-Domain ASR Corpus for Low-Resource Languages
Dataocean AI has collaborated with Shanghai Jiao Tong University, The Chinese University of Hong Kong, Tsinghua University, Pengcheng Lab, AISpeech, Birch AI, and Seasalt AI to successfully develop GigaSpeech 2. The development and test sets of GigaSpeech 2 are labeled by a professional team from Dataocean AI.
This press release features multimedia. View the full release here: https://www.businesswire.com/news/home/20240924609911/en/
GigaSpeech 2 Overview
GigaSpeech 2 is an ever-expanding, large-scale, multi-domain, multilingual speech recognition corpus designed to promote research and development in low-resource language speech recognition. GigaSpeech 2 raw contains about 30,000 hours of automatically transcribed audio covering Thai, Indonesian, and Vietnamese. After multiple rounds of refinement and iteration, GigaSpeech 2 refined offers about 10,000 hours of Thai, 6,000 hours of Indonesian, and 6,000 hours of Vietnamese. The test sets and development sets labeled by Dataocean AI for Thai and Indonesian are 10 hours each. The team has also open-sourced multilingual speech recognition models trained on the GigaSpeech 2 data, which achieve performance comparable to commercial speech recognition services.
Dataset Construction
The construction process of GigaSpeech 2 has also been open-sourced. It is an automated pipeline for building large-scale speech recognition datasets from the vast amounts of unlabeled audio available on the internet, and it consists of data crawling, transcription, alignment, and refinement. Whisper is first used for preliminary transcription, and TorchAudio is then used for forced alignment; multi-dimensional filtering of the aligned segments produces GigaSpeech 2 raw. The dataset is then refined with an improved Noisy Student Training (NST) method, which raises the quality of the pseudo-labels over repeated iterations and ultimately yields GigaSpeech 2 refined.
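The multi-dimensional filtering step described above can be pictured with a small sketch. The release does not list the actual filtering dimensions or thresholds, so the `Segment` fields, the `keep` function, and every threshold below are hypothetical and chosen only for illustration; the real rules are in the open-sourced pipeline.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str           # automatic transcript of the aligned segment
    duration_s: float   # segment length in seconds
    lang_prob: float    # language-ID confidence for the target language
    align_score: float  # forced-alignment confidence in [0, 1]


def keep(seg, allowed_chars, min_s=2.0, max_s=30.0,
         min_lang_prob=0.8, min_align=0.5):
    """Return True if the segment passes every (hypothetical) filter."""
    if not seg.text.strip():
        return False  # empty transcript
    if not (min_s <= seg.duration_s <= max_s):
        return False  # too short or too long
    if seg.lang_prob < min_lang_prob:
        return False  # probably not the target language
    if seg.align_score < min_align:
        return False  # audio and text are poorly aligned
    # Character-set check: reject text containing unexpected symbols.
    return all(ch in allowed_chars or ch.isspace() for ch in seg.text)


# Illustrative character set for lowercase Latin-script text (an assumption).
ALLOWED = set("abcdefghijklmnopqrstuvwxyz'-")

segments = [
    Segment("selamat pagi semuanya", 4.2, 0.97, 0.91),  # passes all filters
    Segment("harga naik 20 persen", 3.8, 0.95, 0.88),   # digits not allowed
    Segment("terima kasih", 45.0, 0.99, 0.93),          # too long
    Segment("apa kabar", 2.5, 0.40, 0.90),              # low language score
]

kept = [s for s in segments if keep(s, ALLOWED)]
print([s.text for s in kept])
```

Each filter is independent, so a dimension can be tightened or dropped without touching the others.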
GigaSpeech 2 encompasses a wide range of thematic domains, including agriculture, art, business, climate, culture, economics, education, entertainment, health, history, literature, music, politics, relationships, shopping, society, sports, technology, and travel. Additionally, it covers various content formats such as audiobooks, documentaries, lectures, monologues, movies and TV shows, news, interviews, and video blogs.
Training Set Details
GigaSpeech 2 offers a comprehensive and diverse training set, which is meticulously designed to support the development of robust and high-performing speech recognition models. The training set details are as follows:
- Thai: The raw version consists of 12,901.8 hours of speech data, while the refined version encompasses 10,262.0 hours.
- Indonesian: The raw data amounts to 8,112.9 hours, and the refined data comprises 5,714.0 hours.
- Vietnamese: The raw dataset includes 7,324.0 hours of speech recordings, with the refined dataset totaling 6,039.0 hours.
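As a quick check on the figures above, the per-language hours can be summed and the share of raw audio that survives refinement computed directly. The snippet uses only the numbers listed in this section; the variable names are illustrative. The totals come to roughly 28,339 raw hours and 22,015 refined hours, consistent with the rounded figures in the overview.

```python
# (raw hours, refined hours) per language, as listed above.
hours = {
    "Thai": (12901.8, 10262.0),
    "Indonesian": (8112.9, 5714.0),
    "Vietnamese": (7324.0, 6039.0),
}

raw_total = sum(raw for raw, _ in hours.values())
refined_total = sum(refined for _, refined in hours.values())

# Fraction of raw audio retained after refinement, per language.
retention = {lang: refined / raw for lang, (raw, refined) in hours.items()}

for lang, (raw, refined) in hours.items():
    print(f"{lang}: {raw:.1f} h raw, {refined:.1f} h refined, "
          f"{retention[lang]:.1%} retained")
print(f"Total: {raw_total:.1f} h raw, {refined_total:.1f} h refined")
```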
Development and Test Set Details
Ke Li, Dataocean AI’s COO and one of the paper's authors, led the GigaSpeech 2 test set project. Drawing on nearly 20 years of project experience, the team labeled the Thai and Indonesian sets with a word accuracy of over 97%. Beyond these two Southeast Asian languages, Dataocean AI’s team covers more than 200 languages and dialects around the world. The company offers 1,600+ high-quality off-the-shelf datasets applicable to scenarios such as generative AI, autonomous driving, smart home, and customer service, meeting the evolving needs of the AI industry.
Experimental Results
We compared speech recognition models trained on the GigaSpeech 2 dataset against industry-leading models, including OpenAI Whisper (large-v3, large-v2, base), Meta MMS L1107, Azure Speech CLI 1.37.0, and Google USM Chirp v2. The comparison covered Thai, Indonesian, and Vietnamese. Performance was evaluated on three test sets, GigaSpeech 2, Common Voice 17.0, and FLEURS, using Character Error Rate (CER) or Word Error Rate (WER) as the metric. The results indicate:
- Thai: Our model surpassed all competitors, including the commercial interfaces from Microsoft and Google, while having only one-tenth as many parameters as Whisper large-v3.
- Indonesian and Vietnamese: Our system performed competitively with the existing baseline models in both languages.
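For readers unfamiliar with the metrics named above, both CER and WER are an edit distance between the reference and the recognized text, divided by the reference length. The following is a minimal sketch, not the evaluation code used for the reported results: the function names are illustrative, no text normalization is applied, and removing spaces before computing CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate, with spaces removed (an assumption)."""
    ref_chars = reference.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)


# One dropped word out of five reference words.
print(wer("saya suka makan nasi goreng", "saya suka makan nasi"))
# One substituted character out of four reference characters.
print(cer("abcd", "abed"))
```

CER is the usual choice when a language is written without spaces between words, since word boundaries are then ambiguous; WER suits languages with whitespace-delimited words.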
Resource Links
The GigaSpeech 2 dataset is now available for download:
https://huggingface.co/datasets/speechcolab/gigaspeech2
The automated process for constructing large-scale speech recognition datasets is available at:
https://github.com/SpeechColab/GigaSpeech2
The preprint paper is available at:
https://arxiv.org/pdf/2406.11546
Dataocean AI website:
https://www.dataoceanai.com