Digital protection and display of ancient buildings is an important technical means of transmitting cultural heritage. “Gongshu Hall” is a typical ancient wooden building whose wall materials are complex and diverse, so precise identification of those materials and of surface damage is urgently needed. This study proposes an intelligent media fusion framework based on image recognition and computer vision algorithms, which achieves high-precision material recognition and damage detection by optimizing the network structure, the loss function, and the algorithm fusion strategy. First, the EfficientNet v2 network is improved for material recognition: a coordinate attention (CA) mechanism replaces the original SE module to strengthen the spatial location awareness of the feature maps. Second, to address the insufficient bounding-box regression accuracy of YOLOv7 in crack detection, the loss function is modified so that a normalization factor and a monotonic focusing coefficient balance the gradient contributions of samples, which improves convergence speed and localization accuracy. Third, a two-level detection-segmentation joint algorithm is constructed by combining the detector with the pixel-level segmentation capability of the UNet3+ network. The material recognition model achieves a test accuracy of 93.87% on a dataset containing eight types of materials. Recognition is highest for metal (97.04%), followed by stone (95.27%) and blue brick (95.13%), while rammed earth (89.72%) and glazed glass (89.43%) are misclassified more often because of the complexity of their surface features. Experiments show that the joint algorithm performs well across the detection of “spalling”, “phthalate” and “crack” damage, with an average F1 value of 97.21%; crack detection has the highest F1 value (97.64%), and the spalling accuracy (99.47%) and phthalate recall (95.96%) are also high.
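
As a rough illustration of the two-level detection-segmentation idea, the sketch below runs a box detector first and then applies a pixel-level segmenter only inside each detected box, pasting the results back into a full-size mask. The abstract does not state how the two stages are coupled, so this crop-then-segment arrangement is an assumption. The names `detect_then_segment`, `toy_detector` and `toy_segmenter` are hypothetical; the toy functions are trivial stand-ins for YOLOv7 and UNet3+, not the actual networks.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); x2 and y2 are exclusive
Image = List[List[float]]        # grayscale rows with values in [0, 1]
Mask = List[List[bool]]


def detect_then_segment(
    image: Image,
    detector: Callable[[Image], Sequence[Box]],
    segmenter: Callable[[Image], List[List[float]]],
    threshold: float = 0.5,
) -> Mask:
    """Assumed coupling: segment only inside boxes proposed by the detector."""
    height, width = len(image), len(image[0])
    full_mask = [[False] * width for _ in range(height)]
    for x1, y1, x2, y2 in detector(image):
        # Clip the box to the image bounds and skip empty boxes.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(width, x2), min(height, y2)
        if x2 <= x1 or y2 <= y1:
            continue
        crop = [row[x1:x2] for row in image[y1:y2]]
        probs = segmenter(crop)  # per-pixel damage probability for the crop
        for dy, prob_row in enumerate(probs):
            for dx, p in enumerate(prob_row):
                if p >= threshold:
                    full_mask[y1 + dy][x1 + dx] = True
    return full_mask


def toy_detector(image: Image) -> Sequence[Box]:
    # Hypothetical stand-in for a crack detector: one fixed box.
    return [(2, 1, 6, 5)]


def toy_segmenter(crop: Image) -> List[List[float]]:
    # Hypothetical stand-in for a segmentation network: dark pixels are "crack".
    return [[1.0 if value < 0.3 else 0.0 for value in row] for row in crop]


# Bright 8x8 wall patch with a dark horizontal line on row 3, columns 1..6.
image = [[1.0] * 8 for _ in range(8)]
for x in range(1, 7):
    image[3][x] = 0.0

mask = detect_then_segment(image, toy_detector, toy_segmenter)
crack_pixels = sum(sum(row) for row in mask)
```

In this arrangement, dark pixels outside every detected box are never marked, so the detector acts as a coarse filter and the segmenter only refines regions already flagged as damaged.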