Analysis of Impact of Correlation Between Hardware Configuration and Branch Handling Methods Executing General Purpose Applications

GPU Performance Analysis According to Hardware Configuration and Branch Handling Technique When Executing General-Purpose Applications

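  • Hong Jun Choi (School of Electronics and Computer Engineering, Chonnam National University) ;
  • Cheol Hong Kim (School of Electronics and Computer Engineering, Chonnam National University)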
  • Received : 2013.01.07
  • Accepted : 2013.03.07
  • Published : 2013.03.28

Abstract

As GPUs have grown in computing power and flexibility, they now execute general-purpose parallel applications as well as graphics applications. Programmers can write GPGPU programs through the APIs that GPU vendors provide. Unfortunately, the computational resources of a GPU are not fully utilized when it executes general-purpose applications, because those applications contain frequent branch instructions. Several warp formation schemes have been proposed to handle this branch problem. Intuitively, one would expect a warp formation scheme that achieves higher computational resource utilization to deliver higher performance. Contrary to this expectation, our simulation results show that the scheme with better utilization performs worse than the scheme with lower utilization. The reason is that the high-utilization scheme issues more memory requests and thereby causes a serious memory bottleneck. A warp formation scheme that provides high computational utilization therefore cannot guarantee high performance without adequate hardware resources. For this reason, we analyze the correlation between hardware configuration and warp formation. Our simulation results provide guidelines for addressing the branch-induced underutilization problem when designing GPUs.
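
To make the underutilization problem concrete, the following minimal sketch models a single warp that serializes both sides of a divergent branch. It is an illustration only, not the simulator or any warp formation scheme evaluated in the paper: the function name `simd_utilization`, the warp width of 32, and the instruction counts per path are all assumptions chosen for the example.

```python
def simd_utilization(taken_mask, then_len, else_len):
    """Fraction of lane-slots doing useful work when one warp executes
    both sides of a branch one after the other (hypothetical model)."""
    width = len(taken_mask)
    taken = sum(taken_mask)          # threads that take the branch
    not_taken = width - taken        # threads that fall through
    # A path is issued only if at least one thread in the warp needs it.
    issued = (then_len if taken else 0) + (else_len if not_taken else 0)
    # Useful work: each thread executes only the path it actually chose.
    useful = taken * then_len + not_taken * else_len
    return useful / (issued * width) if issued else 1.0


WARP_SIZE = 32  # assumed warp width for this illustration

uniform = [True] * WARP_SIZE                     # every thread takes the branch
split = [i % 2 == 0 for i in range(WARP_SIZE)]   # threads alternate paths

print(simd_utilization(uniform, 10, 10))  # no divergence
print(simd_utilization(split, 10, 10))    # half the lanes idle on each path
```

Under these assumptions the uniform warp keeps every lane busy, while the evenly split warp uses only half of its lane-slots. Warp formation schemes aim to recover that lost utilization, and the abstract's point is that doing so can raise memory traffic enough to offset the gain.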

Keywords

GPU; GPGPU; General-purpose Application; Branch Instruction; Warp Formation

Acknowledgement

Supported by: National Research Foundation of Korea, National IT Industry Promotion Agency
