The article proposes a novel cross-modal adversarial learning framework for analyzing the emotional dynamics of non-English learners during classroom engagement and predicting their individualized behaviors. The framework combines multilevel feature extraction with an integrated Transformer-CNN-LSTM model to handle multimodal data more efficiently and capture the complex relationship between emotions and behaviors. Low-level and high-level features are first extracted from the raw multimodal data. The Transformer is then used to mine long-distance dependencies across the multimodal data, the CNN extracts local features, and the LSTM models dynamic changes in the time series. In addition, the framework introduces adversarial training to learn shared features across modalities. Within 50 training rounds, the CL-Transformer model's loss, emotion recognition accuracy, and behavior prediction accuracy all converge, showing the fastest training speed and the best training results among the compared models. The proposed algorithm achieves precision, recall, and F1 scores above 90% for both emotion recognition and behavior prediction, with recognition accuracy for individual emotions reaching up to 0.96. In the fifth stage of the case study, the classroom emotion conversion rate and arousal reach up to 0.66, and the model predicts that learners in an angry mood have the highest probability of cell phone playing behavior, at 64.7%. Learners' classroom emotional acceptance and behavioral integration both influence their classroom engagement.
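To make the fusion architecture concrete, the sketch below shows, in PyTorch, how a Transformer encoder, a 1D CNN, and an LSTM can be stacked over shared multimodal features, with a gradient-reversal modality discriminator supplying the adversarial signal for learning modality-shared representations. All layer sizes, class names (e.g., CLTransformerSketch, GradReverse), and the pooling choice are illustrative assumptions rather than the paper's reported configuration.

```python
# Minimal sketch of a Transformer-CNN-LSTM fusion block with a gradient-reversal
# adversarial branch. Dimensions, layer counts, and names are illustrative
# assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class CLTransformerSketch(nn.Module):
    def __init__(self, feat_dim=64, n_modalities=3, n_emotions=6, n_behaviors=5):
        super().__init__()
        # Transformer: long-distance dependencies across the fused sequence.
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # CNN: local feature patterns along the time axis.
        self.cnn = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        # LSTM: dynamic changes in the time series.
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        # Task heads: emotion recognition and behavior prediction.
        self.emotion_head = nn.Linear(feat_dim, n_emotions)
        self.behavior_head = nn.Linear(feat_dim, n_behaviors)
        # Adversarial modality discriminator behind a gradient-reversal layer,
        # pushing the shared representation to be modality-invariant.
        self.modality_disc = nn.Linear(feat_dim, n_modalities)

    def forward(self, x, lam=1.0):
        # x: (batch, time, feat_dim) multimodal features already projected
        # to a common dimension (the projection step is omitted here).
        h = self.transformer(x)
        h = self.cnn(h.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        shared = h.mean(dim=1)                      # pooled shared representation
        emotion = self.emotion_head(shared)
        behavior = self.behavior_head(shared)
        modality = self.modality_disc(GradReverse.apply(shared, lam))
        return emotion, behavior, modality


if __name__ == "__main__":
    model = CLTransformerSketch()
    dummy = torch.randn(8, 20, 64)                  # batch of 8, 20 time steps
    emo, beh, mod = model(dummy)
    print(emo.shape, beh.shape, mod.shape)          # (8, 6) (8, 5) (8, 3)
```

In this sketch the emotion and behavior heads would be trained with standard classification losses, while the modality discriminator's loss, propagated through the gradient-reversal layer, penalizes modality-specific information in the shared representation.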