In the era of big data, the analysis of classroom teaching behavior in higher education institutions is becoming increasingly automated, digitized, and intelligent. This study investigates methods for analyzing teaching behavior data in higher education and proposes a classroom speech emotion recognition model based on multi-feature fusion. Residual networks and LSTM networks are used for deep feature extraction, and the encoder of the Transformer is employed for feature fusion. In experiments, the speech emotion recognition accuracy of the model was no lower than 85% on each of the tested datasets, demonstrating the accuracy of the proposed method. In addition, its recognition accuracy for each emotion was 6.63% to 17.17% and 16.50% to 20.44% higher than that of the comparison methods. Analysis of speech sentiment in real-world teaching interactions showed that pleasant emotion in classroom interactions first increases and then decreases; the sentiment values of the successive interaction segments fall within [-1, 1.25], [1.5, 2.0], [1.2, 2.0], [0.85, 1.3], [0.4, 1.25], and [0.8, 1.4], respectively, validating the rationality of the proposed method. The model can serve as an intelligent analysis method for teaching behavior data in higher education, helping teachers obtain classroom feedback and improve teaching quality.
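To make the described architecture concrete, the following is a minimal sketch of the multi-feature fusion idea: a residual-style convolutional branch and an LSTM branch extract deep features from speech inputs, and a Transformer encoder fuses the two token sequences before emotion classification. This is not the authors' implementation; all layer sizes, input shapes, and names (e.g. `SpeechEmotionFusionNet`, `n_mels`, `lstm_feat_dim`) are illustrative assumptions.

```python
# Hypothetical sketch of a multi-feature fusion model for speech emotion
# recognition, assuming a mel-spectrogram input for the residual CNN branch
# and frame-level acoustic features (e.g. MFCCs) for the LSTM branch.
import torch
import torch.nn as nn


class SpeechEmotionFusionNet(nn.Module):
    def __init__(self, n_mels=64, lstm_feat_dim=40, d_model=128, n_classes=4):
        super().__init__()
        # Residual-style convolutional branch over the spectrogram
        self.conv_in = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.res_block = nn.Sequential(
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32),
        )
        self.cnn_proj = nn.Linear(32 * n_mels, d_model)
        # LSTM branch over frame-level acoustic features
        self.lstm = nn.LSTM(lstm_feat_dim, d_model // 2,
                            batch_first=True, bidirectional=True)
        # Transformer encoder fuses the concatenated token sequences
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, spec, frames):
        # spec: (B, 1, n_mels, T); frames: (B, T, lstm_feat_dim)
        h = self.conv_in(spec)
        h = torch.relu(h + self.res_block(h))      # residual connection
        h = h.permute(0, 3, 1, 2).flatten(2)       # (B, T, 32 * n_mels)
        cnn_tokens = self.cnn_proj(h)               # (B, T, d_model)
        lstm_tokens, _ = self.lstm(frames)          # (B, T, d_model)
        fused = self.fusion(torch.cat([cnn_tokens, lstm_tokens], dim=1))
        return self.classifier(fused.mean(dim=1))   # emotion logits


if __name__ == "__main__":
    model = SpeechEmotionFusionNet()
    logits = model(torch.randn(2, 1, 64, 100), torch.randn(2, 100, 40))
    print(logits.shape)  # torch.Size([2, 4])
```

In this sketch the two branches are treated as parallel token streams and fused by self-attention, which is one common way to realize "Transformer encoder for feature fusion"; the actual emotion categories, feature types, and fusion details in the paper may differ.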