This paper introduces a hybrid recommendation algorithm that combines collaborative filtering with content-based filtering of vocal resources in the design of an interactive multimedia-assisted vocal teaching system. Based on historical learning data and related information, a sparse matrix between students’ attributes and resources is constructed, and similarity is calculated to recommend vocal music learning resources accurately. The speech emotion recognition module uses a convolutional neural network (CNN) model improved with a multilevel residual structure. This structure reduces the loss rate of vocal singing voice features and, at the same time, reduces the amount of model computation, ensuring that students’ voices are accurately recognized. The results show that the resource similarity of the proposed hybrid recommendation algorithm lies in the range [0.748, 0.894], the resource coverage is greater than 95%, and the AUC is greater than 0.9. The recognition rate of the improved CNN model stabilizes above 0.95 at about 45 iterations, with a loss value below 0.4. Introducing the RMSProp algorithm yields optimization effects of 0.03 and 0.15 on the recognition rate and the loss value, respectively. The mean values of the system’s effect on vocal music teaching all exceed 4.5, and the standard deviations are all less than 0.10.
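As an illustration of the similarity-based recommendation step, the following is a minimal sketch, not the paper's implementation. It assumes cosine similarity, a binary student-resource interaction matrix, a per-resource content feature vector, and a blending weight `alpha`. None of these are specified in the abstract, and the function names `cosine`, `hybrid_similarity`, and `recommend` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def hybrid_similarity(interactions, content, i, j, alpha=0.5):
    """Blend collaborative and content-based similarity of resources i and j.

    interactions: rows are students, columns are resources (sparse, mostly 0).
    content: one feature vector per resource.
    alpha: assumed weight of the collaborative part (hypothetical parameter).
    """
    col_i = [row[i] for row in interactions]
    col_j = [row[j] for row in interactions]
    collaborative = cosine(col_i, col_j)
    content_based = cosine(content[i], content[j])
    return alpha * collaborative + (1 - alpha) * content_based

def recommend(interactions, content, student, k=1, alpha=0.5):
    """Return the k unseen resources most similar to what the student has used."""
    row = interactions[student]
    seen = [r for r, used in enumerate(row) if used > 0]
    scores = {}
    for candidate, used in enumerate(row):
        if used > 0:
            continue  # skip resources the student has already used
        scores[candidate] = sum(
            hybrid_similarity(interactions, content, s, candidate, alpha)
            for s in seen
        )
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (made up for illustration): 3 students, 4 resources.
interactions = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
content = [[1, 0], [1, 0], [0, 1], [0, 1]]

top = recommend(interactions, content, student=0, k=1)
```

In this toy example, student 0 has only used resource 0, which shares both a co-user and its content features with resource 1, so resource 1 ranks first. A linear blend is only one way to combine the two signals; the weighting scheme actually used by the system is not stated in the abstract.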