Accurate and efficient classification of mechanical defects is essential for ensuring the integrity and operational safety of industrial machinery. Traditional image classification techniques, although versatile, often perform poorly on small and complex surface defects. This paper introduces the local-global visual transformer (LGViT), a neural network architecture that improves defect detection by combining local and global attention mechanisms within a transformer-based framework. LGViT employs a hierarchical transformer architecture that processes image features at multiple scales. This multi-level design is inspired by the hierarchical feature extraction of convolutional neural networks, but it relies on transformer attention to model dependencies between features. The two attention mechanisms are applied in sequence: the local attention module generates detailed feature maps, which are then fed into the hierarchical global attention layers. These layers embed the detailed features within the broader image context. By attending to both the fine-grained and the holistic aspects of an image, the model is designed to identify more defects and to do so with higher accuracy and reliability. To validate LGViT, we conduct extensive experiments on several mechanical surface image datasets annotated with different defect types. The model's ability to recognize complex defect patterns is most evident in tests involving challenging defect scenarios.
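The local-then-global data flow described above can be illustrated with a minimal NumPy sketch. It is not the LGViT implementation: the function names, the single-head attention with identity projections, the non-overlapping windows, the average-pooling downsampling, and the default window size and stage count are all assumptions chosen for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head scaled dot-product self-attention over (..., n, d).
    # Assumption: identity query/key/value projections (no learned weights).
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def local_attention(feat, window):
    # Attention restricted to non-overlapping window x window patches.
    H, W, d = feat.shape
    assert H % window == 0 and W % window == 0
    hw, ww = H // window, W // window
    x = feat.reshape(hw, window, ww, window, d).transpose(0, 2, 1, 3, 4)
    x = self_attention(x.reshape(hw, ww, window * window, d))
    x = x.reshape(hw, ww, window, window, d).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, d)

def global_attention(feat):
    # Attention over all spatial positions of the feature map.
    H, W, d = feat.shape
    return self_attention(feat.reshape(H * W, d)).reshape(H, W, d)

def downsample(feat, factor=2):
    # Average pooling; stands in for whatever scale reduction the model uses.
    H, W, d = feat.shape
    return feat.reshape(H // factor, factor, W // factor, factor, d).mean(axis=(1, 3))

def lgvit_sketch(feat, window=4, stages=2):
    # Local module first: detailed feature maps (with a residual connection).
    x = feat + local_attention(feat, window)
    # Hierarchical global layers: place those details in wider context,
    # at progressively coarser scales.
    for _ in range(stages):
        x = x + global_attention(x)
        x = downsample(x)
    # Pooled descriptor that a classification head could consume.
    return x.mean(axis=(0, 1))
```

For example, calling `lgvit_sketch` on a random `(16, 16, 8)` feature map yields a length-8 descriptor. In this sketch, the local stage keeps each window independent of the others, and only the global stages mix information across the whole image.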