Tourism architecture texts serve as critical cultural interfaces, mediating the complex heritage and aesthetic values of built environments for international audiences. This study investigates how cross-cultural communication (CCC) in this domain can be optimized by combining multimodal translation analysis with immersive technology integration. Using a mixed-methods design, we compiled a specialized corpus of Chinese tourism architecture source texts (STs), their English translations (TTs), and comparable authentic English texts. Multimodal Discourse Analysis (MDA) frameworks, particularly Kress and van Leeuwen’s Visual Grammar and its extensions to spatial texts, were applied to examine the interplay of verbal, visual (images, diagrams), and spatial semiotic resources. Quantitative analysis revealed significant discrepancies between STs and TTs in information density, in the treatment of cultural terms (e.g., over-domestication of “ting” as a generic “pavilion”), and in visual-verbal cohesion. Qualitative analysis identified recurrent challenges: translating culturally embedded architectural concepts such as “sunmao” (mortise-and-tenon joinery), managing shifts in narrative perspective, and inadequate complementarity between modes. Building on this analysis, we propose and categorize targeted translation strategies, including Foreignization with Glossing, Multimodal Compensation, Cultural Schema Activation, and Spatial Recontextualization. Crucially, we also present a novel framework for integrating Extended Reality (XR) technologies, specifically Augmented Reality (AR) overlays and Virtual Reality (VR) reconstructions, as dynamic multimodal supplements.
A controlled user study (n = 120 international tourists) showed that translations combining these optimized strategies with AR annotations significantly improved comprehension (p < 0.01), cultural appreciation (p < 0.05), and engagement metrics compared with traditional text-only translations and non-optimized multimodal versions. This research offers empirically grounded strategies and a forward-looking framework for enhancing CCC in heritage tourism, and it advocates an approach that deeply integrates multimodal resources with technological augmentation.