To address the semantic fragmentation of multimodal data and the insufficient dynamic modeling of emotion communication in cross-cultural human-computer communication, this paper proposes an improved TD-SIR emotion communication model that integrates multimodal alignment with a self-attention mechanism. Contrastive learning is adopted to achieve cross-modal semantic alignment, and a Transformer-based self-attention network is designed to perform character-level multimodal emotion inference. Three-degree influence theory is used to model emotion propagation and to optimize the propagation threshold parameters of the TD-SIR model. The effectiveness of the improved TD-SIR model is verified on 128,000 cross-cultural multimodal data samples from Sina Weibo. Compared with the baseline TD-SIR model, the improved model fits the real data more closely in the initial propagation stage. Across different experimental parameter settings, the improved TD-SIR model achieves its highest accuracy of 92.48% when the propagation probability threshold is 0.28 and the forgetting probability threshold is 0.035. Under these settings, the proposed model better simulates the sentiment evolution of public opinion events and outperforms the ESIS and EC models.
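To make the role of the two reported threshold parameters concrete, the following is a minimal, hypothetical sketch of a discrete-time SIR-style propagation update using the stated values (propagation probability 0.28, forgetting probability 0.035). The function and variable names are illustrative only; the sketch does not reproduce the actual TD-SIR formulation (e.g., its three-degree influence weighting), which is defined later in the paper.

```python
import numpy as np

def simulate_sir(n_steps=100, n_users=128_000,
                 p_spread=0.28,   # propagation probability threshold (reported value)
                 p_forget=0.035,  # forgetting probability threshold (reported value)
                 i0=100):
    """Discrete-time S -> I -> R update as an illustrative baseline,
    not the improved TD-SIR model itself."""
    s, i, r = n_users - i0, i0, 0
    history = []
    for _ in range(n_steps):
        new_spreaders = p_spread * s * i / n_users  # susceptible users who adopt and spread the emotion
        new_forgotten = p_forget * i                # spreaders who forget and stop propagating
        s -= new_spreaders
        i += new_spreaders - new_forgotten
        r += new_forgotten
        history.append((s, i, r))
    return np.array(history)

if __name__ == "__main__":
    traj = simulate_sir()
    print("peak number of spreaders:", int(traj[:, 1].max()))
```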