Corpus-based research and digital preservation of endangered Lingnan dialects

Research article
DOI: https://doi.org/10.70517/ijhsa464219

Volume 46, Issue 4
Pages: 2614-2625
Open Access

By: ^¹

¹Xijiang River Valley Folk Literature Research Center, Wuzhou University, Wuzhou, Guangxi, 543002, China

Published: 10/08/2025

Abstract

To establish and preserve a corpus of endangered Lingnan dialects, this paper combines convolutional neural networks and gated recurrent units to build a CNN-CTC acoustic model, and proposes a Lingnan dialect recognition model that maps Lingnan dialect speech to Mandarin. Taking a raw audio corpus of 160 files (approximately 43 hours) as the research object, the paper conducts a topic-based analysis and describes how the corpus data are stored and presented. The results show that the most frequent word in the “My Hometown” corpus is “ge”, which occurs 95 times (frequency 0.0362), followed by “shi”, “ah”, and “di”, among others. Approximately half of the 0.1% class symbols in the Lingnan dialect spoken-language corpus correspond to character symbols. In a usability test of the Lingnan dialect corpus, the average System Usability Scale (SUS) score was 82.40; such testing can drive the continuous optimization of corpus design and user experience, thereby supporting the digital preservation of the dialects.

Keywords: CNN-CTC, dialect recognition, Lingnan dialect, corpus
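
The word-frequency figures in the abstract pair an absolute count with a frequency value. The following is a minimal sketch of how such a pair could be computed, assuming the reported frequency is the count divided by the total number of tokens (the abstract does not state its definition). The function name and the toy token list are hypothetical and are not data from the paper.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each token to (count, count / total number of tokens)."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {tok: (n, n / total) for tok, n in counts.items()}


# Hypothetical toy transcript, used only to illustrate the calculation.
tokens = ["ge", "shi", "ge", "ah", "di", "ge", "shi", "ah"]
freqs = relative_frequencies(tokens)

# Pick the token with the highest absolute count.
top = max(freqs.items(), key=lambda kv: kv[1][0])
print(top)
```

Under this assumed definition, a count of 95 with a frequency of 0.0362 would correspond to a text of roughly 95 / 0.0362 ≈ 2,624 tokens.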
