Medical image detection plays a crucial role in disease diagnosis; however, traditional manual interpretation is often limited by subjectivity and low efficiency. Interpretable artificial intelligence (AI) techniques, grounded in image processing algorithms, show strong potential for enhancing objectivity and reliability in this domain. In this study, we propose an enhanced YOLOv12 algorithm that integrates attention mechanisms and a residual feedback structure. Combined with SCTNet, a segmentation network enhanced with a selective contextual transformer, these components form a unified, intelligent medical image processing system. Methodologically, the AGs-ECSA hybrid attention module strengthens feature extraction by incorporating Efficient Channel Attention (ECA) and spatial attention mechanisms. A loopback residual structure is introduced to preserve original feature information, and bounding box regression is improved with the Complete Intersection over Union (CIoU) loss function. In addition, the Selective Contextual Transformer (SCT) module captures both local and global semantic dependencies. Experimental results on the DeepLesion dataset show that the proposed method achieves an average sensitivity of 88.97%, surpassing the strong MULAN baseline by 0.46% and consistently achieving higher detection accuracy at lower false-positive rates. On the GLAS segmentation dataset, SCTNet achieves an mF1 score of 0.8285 and an mIoU of 0.7246, improvements of 4.98% and 8.64%, respectively, over existing mainstream methods. The system was further validated on cerebral hemorrhage CT scans, where it accurately localized lesions and estimated their size. These findings demonstrate the effectiveness of interpretable AI in medical image detection and offer a reliable framework to support clinical diagnosis.
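
For reference, the CIoU loss mentioned above is the standard formulation of Zheng et al., where $\mathbf{b}$ and $\mathbf{b}^{gt}$ denote the predicted and ground-truth box centers, $\rho(\cdot)$ the Euclidean distance between them, and $c$ the diagonal length of the smallest box enclosing both:

$$
\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}\!\left(\mathbf{b}, \mathbf{b}^{gt}\right)}{c^{2}} + \alpha v,
\qquad
v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2},
\qquad
\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}.
$$

The distance term penalizes center misalignment and the $\alpha v$ term penalizes aspect-ratio mismatch, which is what makes CIoU a tighter regression target than plain IoU.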
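To make the channel-plus-spatial attention pattern concrete, the following is a minimal sketch assuming PyTorch; the class names (ECA, SpatialAttention, HybridAttentionBlock) are illustrative and do not reproduce the paper's actual AGs-ECSA implementation. It combines ECA-style channel attention with a CBAM-style spatial map, plus a residual skip that preserves the original features, as the abstract describes.

```python
# Illustrative sketch only: combines ECA-style channel attention with a
# CBAM-style spatial attention map and a residual (loopback) skip.
# Names and layer choices are assumptions, not the paper's AGs-ECSA module.
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: a 1D conv over pooled channel descriptors."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Kernel size adapted to the channel count, per the ECA paper.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                     # global average pool -> (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # local cross-channel interaction
        return x * torch.sigmoid(y)[:, :, None, None]

class SpatialAttention(nn.Module):
    """Spatial attention from channel-wise mean and max maps (CBAM-style)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(s))

class HybridAttentionBlock(nn.Module):
    """Channel attention followed by spatial attention, with a residual skip."""
    def __init__(self, channels):
        super().__init__()
        self.eca = ECA(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        # The skip connection preserves the original feature information.
        return x + self.spatial(self.eca(x))

# Usage: feats = HybridAttentionBlock(256)(torch.randn(1, 256, 40, 40))
```

The residual skip mirrors the role the abstract assigns to the loopback structure: the attended features are added to, rather than replacing, the original activations, so no information is lost if the attention weights are poorly calibrated early in training.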