Implementation of FPGA-based Accelerator for GRU Inference with Structured Compression

  • Chae, Byeong-Cheol (Department of Electronic and Information Engineering, Korea University)
  • Received : 2022.05.10
  • Accepted : 2022.06.03
  • Published : 2022.06.30

Abstract

To deploy Gated Recurrent Units (GRUs) on resource-constrained embedded devices, this paper presents a reconfigurable FPGA-based GRU accelerator that supports structured compression. First, a dense GRU model is significantly reduced in size by hybrid quantization and structured top-k pruning. Second, the energy consumed by external memory accesses is greatly reduced by the proposed reuse computing pattern. Finally, the accelerator can process the resulting structured sparse model, benefiting from an algorithm-hardware co-design workflow. Moreover, inference can be performed flexibly for arbitrary feature dimensions, sequence lengths, and numbers of layers. Implemented on the Intel DE1-SoC FPGA, the proposed accelerator achieves 45.01 GOPS on a structured sparse GRU network without batching. Compared with CPU and GPU implementations, the low-cost FPGA accelerator achieves 57x and 30x improvements in latency and 300x and 23.44x improvements in energy efficiency, respectively. The proposed accelerator thus serves as an early study toward real-time embedded applications and demonstrates potential for further development.
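The abstract does not detail the compression step, so the following is only a minimal NumPy sketch of what structured top-k pruning of a GRU weight matrix could look like, not the authors' implementation. The function name `structured_topk_prune`, the block size `block=16`, and the per-block budget `k=4` are illustrative assumptions; the point is that keeping exactly k nonzeros per fixed-size block yields a regular sparsity pattern that hardware can exploit, unlike unstructured pruning.

```python
import numpy as np

def structured_topk_prune(W, block=16, k=4):
    """Keep the k largest-magnitude weights in each block of `block`
    consecutive entries along every row; zero out the rest.

    The result has exactly k nonzeros per block, a regular pattern
    that an accelerator can index with fixed-size metadata.
    Block size and k are illustrative, not from the paper.
    """
    rows, cols = W.shape
    assert cols % block == 0, "columns must divide evenly into blocks"
    W_blocked = W.reshape(rows, cols // block, block).copy()
    # Indices of the (block - k) smallest-magnitude entries per block.
    drop = np.argsort(np.abs(W_blocked), axis=-1)[..., : block - k]
    np.put_along_axis(W_blocked, drop, 0.0, axis=-1)
    return W_blocked.reshape(rows, cols)

# Example: prune a hypothetical GRU update-gate weight matrix to
# (block - k) / block = 75% sparsity.
rng = np.random.default_rng(0)
W_z = rng.standard_normal((128, 64)).astype(np.float32)
W_z_sparse = structured_topk_prune(W_z, block=16, k=4)
print("sparsity:", 1.0 - np.count_nonzero(W_z_sparse) / W_z_sparse.size)
```

Because every block carries the same number of nonzeros, the compressed weights can be stored densely alongside small per-block index fields, which is what makes such a model amenable to the fixed datapaths of an FPGA accelerator.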
