Dataocean AI Has Participated in Creating the Open-Source Dataset GigaSpeech 2: A Large-Scale and Multi-Domain ASR Corpus for Low-Resource Languages

24.9.2024 21:00:00 CEST | Business Wire | Press release

Dataocean AI has collaborated with Shanghai Jiao Tong University, The Chinese University of Hong Kong, Tsinghua University, Pengcheng Lab, AISpeech, Birch AI, and Seasalt AI to successfully develop GigaSpeech 2. The development and test sets of GigaSpeech 2 are labeled by a professional team from Dataocean AI.

This press release features multimedia. View the full release here: https://www.businesswire.com/news/home/20240924609911/en/

GigaSpeech 2 Overview

GigaSpeech 2 is an ever-expanding, large-scale, multi-domain, and multilingual speech recognition corpus designed to promote research and development in low-resource language speech recognition. GigaSpeech 2 raw contains about 30,000 hours of automatically transcribed audio covering Thai, Indonesian, and Vietnamese. After multiple rounds of refinement and iteration, GigaSpeech 2 refined offers roughly 10,000 hours of Thai, 6,000 hours of Indonesian, and 6,000 hours of Vietnamese. The Thai and Indonesian test sets labeled by Dataocean AI each consist of 10 hours of audio, and the Thai and Indonesian development sets are likewise 10 hours each. The team has also open-sourced multilingual speech recognition models trained on the GigaSpeech 2 data, which achieve performance comparable to commercial speech recognition services.

Dataset Construction

The construction process of GigaSpeech 2 has also been open-sourced. It is an automated pipeline for building large-scale speech recognition datasets from the vast amounts of unlabeled audio available on the internet, and it consists of four stages: data crawling, transcription, alignment, and refinement. First, Whisper produces a preliminary transcription, which is then force-aligned to the audio with TorchAudio; multi-dimensional filtering of the aligned segments yields GigaSpeech 2 raw. The dataset is then refined iteratively using an improved Noisy Student Training (NST) method, which raises the quality of the pseudo-labels with each iteration and ultimately produces GigaSpeech 2 refined.
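The iterative refinement idea can be illustrated with a short sketch. This is a heavily simplified, generic pseudo-label refinement loop, not the improved NST procedure used for GigaSpeech 2. The function name `refine_pseudo_labels`, the `train` callback, and the `min_confidence` threshold are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

# (audio_id, transcript) pair; a stand-in for a real audio segment and its label.
Utterance = Tuple[str, str]
# A "model" maps an audio id to (predicted transcript, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def refine_pseudo_labels(
    labeled: List[Utterance],
    train: Callable[[List[Utterance]], Model],
    rounds: int = 3,
    min_confidence: float = 0.8,  # hypothetical threshold, not from the release
) -> List[Utterance]:
    """Repeatedly retrain on the current labels, relabel, and drop weak segments."""
    current = list(labeled)
    for _ in range(rounds):
        model = train(current)  # train a new model on the current pseudo-labels
        relabeled: List[Utterance] = []
        for audio_id, _old_text in current:
            text, confidence = model(audio_id)
            if confidence >= min_confidence:  # keep only confident relabels
                relabeled.append((audio_id, text))
        current = relabeled
    return current
```

In a real pipeline the `train` callback would fit a speech recognition model, and the filter would likely compare the new transcript against the previous one rather than rely on a single confidence score.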

GigaSpeech 2 encompasses a wide range of thematic domains, including agriculture, art, business, climate, culture, economics, education, entertainment, health, history, literature, music, politics, relationships, shopping, society, sports, technology, and travel. Additionally, it covers various content formats such as audiobooks, documentaries, lectures, monologues, movies and TV shows, news, interviews, and video blogs.

Training Set Details

GigaSpeech 2 offers a comprehensive and diverse training set, which is meticulously designed to support the development of robust and high-performing speech recognition models. The training set details are as follows:

- Thai: The raw version consists of 12,901.8 hours of speech data, while the refined version encompasses 10,262.0 hours.
- Indonesian: The raw data amounts to 8,112.9 hours, and the refined data comprises 5,714.0 hours.
- Vietnamese: The raw dataset includes 7,324.0 hours of speech recordings, with the refined dataset totaling 6,039.0 hours.
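
The figures above also show how much audio survives refinement in each language. The following sketch does that arithmetic; the hour counts are copied from the list, while the variable and function names are illustrative.

```python
# (raw hours, refined hours) per language, as listed in the training set details.
HOURS = {
    "Thai": (12901.8, 10262.0),
    "Indonesian": (8112.9, 5714.0),
    "Vietnamese": (7324.0, 6039.0),
}


def retention(raw_hours: float, refined_hours: float) -> float:
    """Fraction of raw audio hours that remain in the refined set."""
    return refined_hours / raw_hours


total_raw = sum(raw for raw, _ in HOURS.values())
total_refined = sum(refined for _, refined in HOURS.values())

for language, (raw, refined) in HOURS.items():
    print(f"{language}: {retention(raw, refined):.1%} of raw hours kept")
print(f"Total: {total_raw:,.1f} raw hours, {total_refined:,.1f} refined hours")
```

Summed across the three languages, the listed figures come to about 28,339 raw hours and 22,015 refined hours.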

Development and Test Set Details

Dataocean AI’s COO, Ke Li, who is also one of the paper's authors, led the GigaSpeech 2 test set project. Drawing on nearly 20 years of project experience, the team contributed Thai and Indonesian annotations with a word accuracy of over 97%. Beyond those two Southeast Asian languages, Dataocean AI’s team covers more than 200 languages and dialects around the world. The company offers 1,600+ high-quality off-the-shelf datasets for scenarios such as generative AI, autonomous driving, smart home, and customer service, meeting the evolving needs of the AI industry.

Experimental Results

We compared speech recognition models trained on the GigaSpeech 2 dataset against industry-leading models, including OpenAI Whisper (large-v3, large-v2, base), Meta MMS L1107, Azure Speech CLI 1.37.0, and Google USM Chirp v2. The comparison covered Thai, Indonesian, and Vietnamese. Performance was evaluated on three test sets, GigaSpeech 2, Common Voice 17.0, and FLEURS, using Character Error Rate (CER) or Word Error Rate (WER) as the metric. The results indicate:
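
For readers unfamiliar with the two metrics, both are the edit distance between a reference transcript and a model's hypothesis, divided by the reference length, counted in characters for CER and in words for WER. CER is commonly preferred for scripts written without spaces between words, such as Thai. The sketch below is a minimal illustration and not the evaluation code used for these results; it assumes whitespace tokenization for WER and ignores spaces for CER, and it omits the text normalization that a real evaluation would apply.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For example, `wer("saya suka makan nasi", "saya suka nasi")` is 0.25, because one of the four reference words is missing from the hypothesis.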

Thai: Our model outperformed all competitors, including the commercial interfaces from Microsoft and Google. Notably, it did so with only one-tenth as many parameters as Whisper large-v3.

Indonesian and Vietnamese: Our system performed competitively with the existing baseline models in both languages.

Resource Links

The GigaSpeech 2 dataset is now available for download:
https://huggingface.co/datasets/speechcolab/gigaspeech2

The automated process for constructing large-scale speech recognition datasets is available at:
https://github.com/SpeechColab/GigaSpeech2

The preprint paper is available at:
https://arxiv.org/pdf/2406.11546

Dataocean AI website:
https://www.dataoceanai.com
