Aiming at the insufficient generalization ability of Neural Machine Translation (NMT) in cross-domain scenarios, this study proposes an LSTM-RNNs-attention model that integrates self-supervised multimodal features with an improved attention mechanism. Through multimodal self-supervised learning and optimized text preprocessing, the model constructs an image-text consistency classification and keyword annotation algorithm from image-text semantic correlations, which, combined with LSTM-CRF sequential word segmentation, significantly improves the accuracy of source-language semantic representation. Experimental results show that in the Chinese word segmentation task the model's F1 score reaches 99.13% when the character vector dimension is 125, with optimal performance at a Dropout ratio of 30%. For robustness to input noise, the UNK-Tag strategy in the random word dropout mechanism achieves a BLEU score of 47.63 at a sampling probability of 0.15, 3.81% higher than the baseline. In the multilingual translation task, the LSTM-RNNs-attention model attains BLEU scores of 45.82, 42.32, and 32.91 on English-Chinese (Eng-Ch), Japanese-Chinese (Jap-Ch), and German-Chinese (Ger-Ch), respectively, outperforming the mainstream baselines BERT-fused NMT and Multilingual NMT by an average of 2.6-18.0 points, while convergence time is shortened to 6.58 s (Eng-Ch), markedly more efficient than Transformer's 16.13 s and RNN-NMT's 11.79 s. Manual evaluation further validates the model's semantic coherence advantage, with the Eng-Ch task scoring 9.66 points (out of 10). Through self-supervised multimodal feature fusion, dynamic attention weight allocation, and word segmentation optimization, the study effectively addresses semantic bias and long-distance dependency problems in cross-domain translation.
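The random word-dropout mechanism evaluated above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `<UNK>` token string, and the seeding interface are hypothetical and not taken from the paper; only the idea of replacing source tokens with an UNK tag at sampling probability p (e.g. 0.15) reflects the text.

```python
import random

def unk_dropout(tokens, p=0.15, unk_token="<UNK>", seed=None):
    """UNK-Tag word dropout (illustrative sketch, not the paper's code).

    Each source token is independently replaced by an UNK tag with
    probability p, injecting input noise during training so the model
    learns representations that are robust to unseen or corrupted words.
    """
    rng = random.Random(seed)  # seeded for reproducible noise
    return [unk_token if rng.random() < p else t for t in tokens]

# Example: produce a noisy variant of a tokenized source sentence
src = "the model improves cross domain robustness".split()
noisy = unk_dropout(src, p=0.15, seed=1)
```

At p = 0 the input passes through unchanged, and at p = 1 every token is masked; the paper's reported sweet spot for translation robustness is p = 0.15.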