The article proposes a novel cross-modal adversarial learning framework for analyzing the emotional dynamics of non-English learners during classroom engagement and predicting their individualized behaviors. The framework combines multilevel feature extraction with an integrated Transformer-CNN-LSTM model to handle multimodal data more efficiently and capture the complex relationship between emotions and behaviors. Low-level and high-level features are first extracted from the raw multimodal data. The Transformer is then used to mine long-distance dependencies across the multimodal data, the CNN extracts local features, and the LSTM models dynamic changes in the time series. In addition, the framework introduces adversarial training to learn shared features across modalities. Within 50 training rounds, the CL-Transformer model's loss, emotion recognition accuracy, and behavior prediction accuracy all converge, showing the fastest training speed and the best training results among the compared models. The proposed algorithm achieves precision, recall, and F1 scores above 90% for both emotion recognition and behavior prediction, with recognition accuracy for individual emotions reaching up to 0.96. In the fifth stage of the case study, the classroom emotion conversion rate and arousal reach up to 0.66, and the model predicts that learners in an angry mood have the highest probability of cell phone playing behavior, at 64.7%. Learners' classroom emotional acceptance and behavioral integration both influence their classroom engagement.
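To make the fusion architecture concrete, the sketch below shows, in PyTorch, how a Transformer encoder, a 1D CNN, and an LSTM can be stacked over shared multimodal features, with a gradient-reversal modality discriminator supplying the adversarial signal for learning modality-shared representations. All layer sizes, class names (e.g., CLTransformerSketch, GradReverse), and the pooling choice are illustrative assumptions rather than the paper's reported configuration.

```python
# Minimal sketch of a Transformer-CNN-LSTM fusion block with a gradient-reversal
# adversarial branch. Dimensions, layer counts, and names are illustrative
# assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class CLTransformerSketch(nn.Module):
    def __init__(self, feat_dim=64, n_modalities=3, n_emotions=6, n_behaviors=5):
        super().__init__()
        # Transformer: long-distance dependencies across the fused sequence.
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # CNN: local feature patterns along the time axis.
        self.cnn = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        # LSTM: dynamic changes in the time series.
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        # Task heads: emotion recognition and behavior prediction.
        self.emotion_head = nn.Linear(feat_dim, n_emotions)
        self.behavior_head = nn.Linear(feat_dim, n_behaviors)
        # Adversarial modality discriminator behind a gradient-reversal layer,
        # pushing the shared representation to be modality-invariant.
        self.modality_disc = nn.Linear(feat_dim, n_modalities)

    def forward(self, x, lam=1.0):
        # x: (batch, time, feat_dim) multimodal features already projected
        # to a common dimension (the projection step is omitted here).
        h = self.transformer(x)
        h = self.cnn(h.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        shared = h.mean(dim=1)                      # pooled shared representation
        emotion = self.emotion_head(shared)
        behavior = self.behavior_head(shared)
        modality = self.modality_disc(GradReverse.apply(shared, lam))
        return emotion, behavior, modality


if __name__ == "__main__":
    model = CLTransformerSketch()
    dummy = torch.randn(8, 20, 64)                  # batch of 8, 20 time steps
    emo, beh, mod = model(dummy)
    print(emo.shape, beh.shape, mod.shape)          # (8, 6) (8, 5) (8, 3)
```

In this sketch the emotion and behavior heads would be trained with standard classification losses, while the modality discriminator's loss, propagated through the gradient-reversal layer, penalizes modality-specific information in the shared representation.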