Reducing latency of neural automatic piano transcription models

  • Dasol Lee (Department of Art & Technology, Sogang University) ;
  • Dasaem Jeong (Department of Art & Technology, Sogang University)
  • Received : 2023.03.03
  • Accepted : 2023.03.26
  • Published : 2023.03.31

Abstract

Automatic Music Transcription (AMT) is a task that detects and recognizes musical note events from a given audio recording. In this paper, we focus on reducing the latency of real-time AMT systems for piano music. Although neural AMT models have been adapted for real-time piano transcription, they suffer from high latency, which hinders their usefulness in interactive scenarios. To tackle this issue, we explore several techniques for reducing the intrinsic latency of a neural network for piano transcription: reducing the window and hop sizes of the Fast Fourier Transform (FFT), modifying the kernel sizes of the convolutional layers, and shifting the labels along the time axis so that the model is trained to predict onsets earlier. Our experiments demonstrate that combining these approaches can lower latency while maintaining high transcription accuracy. Specifically, our modified models achieved note F1 scores of 92.67 % and 90.51 % at latencies of 96 ms and 64 ms, respectively, compared to the baseline model's note F1 score of 93.43 % at a latency of 160 ms. This methodology has potential for training AMT models for various interactive scenarios, including providing real-time feedback for piano education.
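To make the latency-reduction techniques concrete, the sketch below is a minimal illustration rather than the authors' implementation: all parameter values (16 kHz audio, the specific window and hop sizes) and helper names are assumptions. It shows how the FFT window and hop sizes bound the intrinsic latency of a frame-synchronous model, and how onset labels can be shifted earlier along the time axis so the model learns to fire before the full acoustic evidence of a note has been observed.

```python
# Illustrative sketch only; parameter values and helper names are assumptions,
# not taken from the paper.
import numpy as np

SAMPLE_RATE = 16000  # Hz (assumed)


def frame_latency_ms(window_size: int, hop_size: int, future_frames: int = 0) -> float:
    """Rough lower bound on intrinsic latency for a frame-synchronous model:
    the audio needed to fill the current analysis window, plus any future
    frames required by non-causal convolutional context."""
    window_ms = 1000.0 * window_size / SAMPLE_RATE
    hop_ms = 1000.0 * hop_size / SAMPLE_RATE
    return window_ms + future_frames * hop_ms


def shift_onset_labels(onset_roll: np.ndarray, shift_frames: int) -> np.ndarray:
    """Move onset labels `shift_frames` frames earlier along the time axis
    (axis 0: time, axis 1: pitch), so the model is trained to predict an
    onset before it is fully observed in the input."""
    if shift_frames <= 0:
        return onset_roll.copy()
    shifted = np.zeros_like(onset_roll)
    shifted[:-shift_frames] = onset_roll[shift_frames:]
    return shifted


# Example: a smaller window and hop directly reduce the latency bound.
print(frame_latency_ms(window_size=2048, hop_size=512, future_frames=2))  # ~192 ms
print(frame_latency_ms(window_size=1024, hop_size=256, future_frames=2))  # ~96 ms
```

Shifting labels earlier trades some onset-timing precision for faster predictions, which is consistent with the modest drop in note F1 reported above as latency decreases.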

Acknowledgement

This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korea Government (MSIT) (NRF-2022R1F1A1074566).
