Improving the handling of complex syntax and semantics is key to making English translation systems more intelligent. In this paper, we incorporate a multimodal parallel fusion architecture into the design of the translation system, combining visual theme enhancement encoding with detail fusion decoding to construct a cross-lingual, cross-modal semantic space. A semantic pre-tuning order training strategy and a tree-model syntactic encoding method are introduced to improve translation quality from the source language into English. Experiments show that the proposed method achieves significantly higher BLEU scores than mainstream models on four datasets. For long sentences with word counts in the ranges (35, 45] and (45, 80], the BLEU gains reach up to 2.51 and 2.67, respectively. For complex sentences that require syntactic and semantic structural adjustments, the proposed method raises BLEU scores to the range of 40-43.