Research on an expression recognition model based on multimodal hierarchical graph contrastive learning

By: Xiaoyao Mo1, Hairui Wang1, Guifu Zhu2
1Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan, 650500, China
2Information Technology Construction Management Center, Kunming University of Science and Technology, Kunming, Yunnan, 650500, China

Abstract

To address the limitations of existing methods in dynamically modeling complex expressions, optimizing multimodal data quality, and fusing hierarchical features, this paper proposes a hierarchical graph contrastive learning model based on local and global features. The model integrates graph neural networks with contrastive learning: it captures fine-grained expression details by constructing local graphs, models cross-modal semantic collaboration through a global graph, and introduces an automatic graph augmentation strategy to improve generalization. In the multimodal feature extraction stage, key features are extracted from the video, audio, and text modalities, and then integrated through intra-modal attention and a multimodal fusion mechanism. Experiments on the CMU-MOSI and CMU-MOSEI datasets show that, compared with multiple baseline models, the proposed model achieves better accuracy, recall, and F1 score, and its mean squared error is at a competitive level. The model effectively integrates multimodal information, performs well on the expression recognition task, and offers new ideas and methods for the development of this field.
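The abstract does not specify the exact training objective, but the local-versus-global contrastive setup it describes is commonly instantiated with an InfoNCE-style loss between paired embeddings. The sketch below is a minimal, NumPy-only illustration under that assumption: `global_readout` stands in for the global-graph summary built from per-modality local-graph node features, and `info_nce` pulls each local embedding toward its matching global embedding while pushing it away from the others. All function names, shapes, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def global_readout(node_feats_per_modality):
    # Hypothetical stand-in for the global graph: mean-pool each modality's
    # local-graph node features and concatenate into one summary vector.
    return np.concatenate([f.mean(axis=0) for f in node_feats_per_modality])

def info_nce(local_emb, global_emb, temperature=0.5):
    # InfoNCE-style contrastive loss over a batch of N paired embeddings:
    # row i of local_emb is the positive pair of row i of global_emb.
    # L2-normalise so the dot product is cosine similarity.
    l = local_emb / np.linalg.norm(local_emb, axis=1, keepdims=True)
    g = global_emb / np.linalg.norm(global_emb, axis=1, keepdims=True)
    logits = l @ g.T / temperature  # (N, N) similarity matrix
    # Log-softmax over each row; the positive pair sits on the diagonal.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

As a sanity check, the loss should be lower when the pairing is correct than when the global embeddings are shuffled against the local ones, which is the property the contrastive objective exploits.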