In the era of artificial intelligence, introducing multimodal information fusion into vocabulary teaching is an important breakthrough in the reform of college English teaching. In this paper, multimodal data are extracted from text, pictures, and other domains, and the information carried by the different modalities is combined through heterogeneous data fusion. In the information initialization stage, positional encoding is fused with word vector embedding; image features and text features are then extracted and sent to the lexical model for fusion coding. A Transformer decoder decodes the source utterance into the target utterance, and the GloVe word vector model is used to vectorize knowledge points in the knowledge point embedding layer. An empirical experiment is designed to study the effect of applying multimodal information fusion to English vocabulary teaching. The significance levels for the difference between the two classes of subjects in contextual discrimination and in picture-based word selection are 0.028 and 0.035, respectively, both below 0.05, which indicates that vocabulary learning with the multimodal information fusion algorithm is more effective for memorizing words than the traditional mode. A network security mechanism is also established, and the operational security of multimodal heterogeneous data is evaluated through simulation experiments. The proposed method sustains a data processing rate of 2 to 2.1 Mb/s and achieves high storage efficiency.
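The fusion of positional encoding with word vector embedding in the initialization stage can be illustrated with a minimal sketch. It assumes the standard sinusoidal positional encoding and fusion by element-wise addition; the paper does not state which encoding or fusion operator it uses. The function names `positional_encoding` and `fuse` and the toy vectors are hypothetical and stand in for real GloVe embeddings.

```python
import math


def positional_encoding(seq_len, dim):
    """Sinusoidal positional encoding (an assumed form, not confirmed by the paper)."""
    pe = [[0.0] * dim for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, dim, 2):
            # Even index i uses sine, the following odd index uses cosine,
            # both at the same frequency 1 / 10000^(i / dim).
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def fuse(word_vectors):
    """Fuse word embeddings with positional encodings by element-wise addition (assumed)."""
    seq_len = len(word_vectors)
    dim = len(word_vectors[0])
    pe = positional_encoding(seq_len, dim)
    return [
        [w + p for w, p in zip(vec, pos_vec)]
        for vec, pos_vec in zip(word_vectors, pe)
    ]


# Toy 4-dimensional vectors standing in for GloVe embeddings (hypothetical values).
toy_vectors = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
fused = fuse(toy_vectors)
```

Addition keeps the fused vector at the same dimensionality as the word embedding, so it can be passed to a Transformer layer unchanged; concatenation would be an alternative if position and content should occupy separate dimensions.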