An Intelligent Performance Evaluation System Based on Computer Vision and Audio Data Fusion in Music Performance

By: Kai Wang 1
1 School of Art and Design, Huaqing College of Xi’an University of Architecture and Technology, Xi’an, Shaanxi, 710000, China

Abstract

Traditional music assessment relies on manual scoring, which is both time-consuming and highly subjective. Modern technology, particularly the fusion of computer vision and audio processing, offers a new solution. This paper proposes an intelligent performance evaluation system based on the fusion of computer vision and audio data. The system evaluates music performance with a deep learning model that combines improved visual feature extraction with acoustic feature extraction. For audio features, an improved Gammatone filter and the FFT algorithm optimize the extraction process; for visual features, lip features are extracted with a convolutional neural network (CNN) and processed sequentially by an LSTM network. To improve evaluation accuracy, the system also introduces a bimodal feature fusion technique that further enhances model performance through weighted fusion of audio and visual features. Experimental results show that the proposed model performs well on the OAVQAD dataset: the training loss converges after 21 epochs, and its robustness to noise is significantly higher than that of the comparison models. The model achieves a character error rate (CER) of 0.32% under high-intensity noise, far lower than that of traditional models. Its pitch and chord recognition are also superior, accurately capturing fine-grained features of the music and providing reliable technical support for intelligent performance evaluation.
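Two quantities named in the abstract can be made concrete with a short sketch: the weighted fusion of audio and visual feature vectors, and the character error rate (CER) used as the evaluation metric. The code below is illustrative only; the fusion weights and vector dimensions are assumptions, not the paper's learned values, and the paper's actual fusion operates inside a deep network rather than on raw lists.

```python
# Minimal sketch of two ideas from the abstract (illustrative, not the
# paper's implementation): weighted bimodal feature fusion, and the
# character error rate (CER).

def weighted_fusion(audio_feat, visual_feat, w_audio=0.6, w_visual=0.4):
    """Element-wise weighted sum of two equal-length feature vectors.
    The weights here are placeholders; the paper tunes its own."""
    assert len(audio_feat) == len(visual_feat)
    return [w_audio * a + w_visual * v
            for a, v in zip(audio_feat, visual_feat)]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance between the
    reference and hypothesis strings, divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))          # one-row dynamic-programming table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + cost)     # substitution / match
            prev = cur
    return dp[n] / max(m, 1)

fused = weighted_fusion([1.0, 0.0], [0.0, 1.0])
print(fused)                  # [0.6, 0.4]
print(cer("abcd", "abxd"))    # 0.25 (one substitution over four chars)
```

A lower CER means the recognized character sequence is closer to the reference transcription, which is why the 0.32% figure under high-intensity noise indicates strong noise robustness.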