The rapid development of artificial intelligence has created new opportunities in education. To address the limited scene variety and poor learning experience of traditional spoken English teaching, this paper proposes a method for designing immersive spoken English teaching scenes based on a cross-modal generative adversarial network (GAN). The proposed SPSceneGAN model combines an encoder-decoder structure with a spectral regularization technique to generate high-quality teaching scenes automatically. The model is trained and evaluated on a spoken English teaching dataset containing 7000 training images and 3000 test images. Experimental results show that SPSceneGAN significantly outperforms traditional methods in scene generation quality, achieving a PSNR of 38.729 dB and an SSIM of 0.984, with an image processing time of only 3.921 s at a batch size of 500. User testing verifies the effectiveness of the system: in a 50-minute comparative experiment with 500 college and university students, those who used the immersive teaching scenes showed significant gains in all three measured dimensions, namely prior knowledge level, intrinsic motivation, and self-efficacy. The method can effectively enhance students’ oral English learning experience and provides a new technological path for intelligent language teaching.
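For readers unfamiliar with the PSNR figure reported above, the following is a minimal sketch of the standard peak signal-to-noise ratio computation. It is not the paper's evaluation code: the function name `psnr`, the flat pixel-sequence input, and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Return the PSNR in dB between two images given as flat pixel sequences.

    Assumption: both inputs are equal-length sequences of pixel intensities
    and max_value is the largest possible intensity (255 for 8-bit images).
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("images must be non-empty and have the same size")

    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the generated image is closer to the reference. A value near 38.7 dB on 8-bit images would correspond to a mean squared error of roughly 9 under this definition.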