
Improving the speed of deep neural networks using multi-core and single instruction multiple data technology


  • Received : 2017.08.29
  • Accepted : 2017.11.29
  • Published : 2017.11.30

Abstract

In this paper, we propose optimization methods for speeding up the feedforward computation of deep neural networks using NEON SIMD (Single Instruction Multiple Data) parallel instructions and multi-core parallelization on a multi-core ARM processor. For the SIMD optimization, we report the speed improvement and arithmetic precision at each stage. On a single core, the implementation optimized with SIMD parallel instructions achieves a 2.6× speedup over the baseline implementation produced by a C compiler. Furthermore, by parallelizing the single-core implementation across multiple cores, we obtain a 5.7× to 7.7× speedup. These results demonstrate the feasibility of applying computation-intensive deep neural network technology to applications on mobile devices.
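As an illustration of the single-core SIMD stage described above (the paper's own kernels are not reproduced here), the following is a minimal sketch of a feedforward dot-product kernel written with ARM NEON intrinsics. The function and variable names are hypothetical, and the kernel assumes 32-bit floating-point weights and activations.

#include <arm_neon.h>
#include <stddef.h>

/* Hypothetical single-core NEON kernel: dot product of one neuron's
 * weight row with the input activations, processing four 32-bit
 * floats per instruction in a 128-bit register. */
float neon_dot(const float *weights, const float *input, size_t n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);        /* four partial sums */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t w = vld1q_f32(weights + i); /* load 4 weights */
        float32x4_t x = vld1q_f32(input + i);   /* load 4 inputs  */
        acc = vmlaq_f32(acc, w, x);             /* acc += w * x, per lane */
    }
    /* horizontal add of the four partial sums */
    float32x2_t sum2 = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    sum2 = vpadd_f32(sum2, sum2);
    float sum = vget_lane_f32(sum2, 0);
    for (; i < n; ++i)                          /* scalar tail when n % 4 != 0 */
        sum += weights[i] * input[i];
    return sum;
}

Processing four floats per multiply-accumulate is where the SIMD speedup comes from; a measured gain such as the reported 2.6×, rather than the ideal 4×, typically reflects memory loads and loop overhead.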

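The abstract does not name the threading mechanism used for the multi-core stage, so the sketch below assumes OpenMP; it distributes the output neurons of one fully connected layer across cores, reusing the single-core kernel sketched above. The layer and parameter names are again hypothetical.

#include <stddef.h>
#include <omp.h>

float neon_dot(const float *weights, const float *input, size_t n);

/* Hypothetical multi-core feedforward layer: each thread computes a
 * disjoint block of output neurons, so the loop needs no locking. */
void feedforward_layer(const float *W,    /* n_out x n_in weights, row-major */
                       const float *bias, /* n_out biases */
                       const float *in,   /* n_in input activations */
                       float *out,        /* n_out output activations */
                       size_t n_out, size_t n_in)
{
    #pragma omp parallel for schedule(static)
    for (size_t j = 0; j < n_out; ++j)
        out[j] = neon_dot(W + j * n_in, in, n_in) + bias[j];
}

Because every output neuron's dot product is independent, the work partitions cleanly across cores, which is the kind of structure that permits the near-linear multi-core scaling the abstract reports.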

