Performance Evaluation of the GPU Architecture Executing Parallel Applications

병렬 응용프로그램 실행 시 GPU 구조에 따른 성능 분석

  • Choi, Hong-Jun (Dept. of Electronics and Computer Engineering, Chonnam National University);
  • Kim, Cheol-Hong (Dept. of Electronics and Computer Engineering, Chonnam National University)
  • Received : 2012.02.14
  • Accepted : 2012.04.20
  • Published : 2012.05.28


With the development of the unified shader core architecture, the role of the GPU has evolved from graphics-specific processing to general-purpose processing. In particular, execution methods for general-purpose parallel applications on the GPU have been researched intensively, since such applications can exploit the parallel hardware efficiently. However, the current GPU architecture still has limitations in executing general-purpose parallel applications, because it is not yet specialized for general-purpose computing. To improve GPU performance for general-purpose parallel applications, the GPU architecture should continue to evolve. In this work, we analyze GPU performance while varying the number of cores and the clock frequency. Our simulation results show that GPU performance improves by up to 125.8% as the number of cores increases and by up to 16.2% as the clock frequency increases. Note, however, that the performance improvement saturates even as the number of cores and the clock frequency continue to increase, because limited memory bandwidth cannot supply data to the GPU fast enough. Consequently, to achieve high performance efficiency on the GPU, computational resources must be provisioned in balance with memory bandwidth.
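The saturation effect described above can be illustrated with a roofline-style calculation: attainable throughput is the lesser of the compute peak (which scales with cores and clock) and the rate at which memory bandwidth can deliver operands (which does not). The sketch below uses hypothetical hardware numbers, not figures from the paper's simulations.

```python
# Roofline-style sketch of why GPU speedup saturates when cores/clock scale
# but memory bandwidth stays fixed. All parameter values are hypothetical.

def attainable_gflops(cores, clock_ghz, flops_per_core_per_cycle=2,
                      bandwidth_gbs=141.7, intensity_flops_per_byte=4.0):
    """Attainable throughput (GFLOP/s) is the minimum of the compute peak
    and the memory-bandwidth ceiling for a given arithmetic intensity."""
    compute_peak = cores * clock_ghz * flops_per_core_per_cycle
    memory_bound = bandwidth_gbs * intensity_flops_per_byte
    return min(compute_peak, memory_bound)

base = attainable_gflops(cores=64, clock_ghz=1.3)
for cores in (64, 128, 256, 512):
    gflops = attainable_gflops(cores, clock_ghz=1.3)
    # Speedup grows with core count until the bandwidth ceiling is hit,
    # then flattens: adding more cores no longer helps.
    print(f"{cores:4d} cores: {gflops:7.1f} GFLOP/s "
          f"(speedup {gflops / base:.2f}x)")
```

With these assumed numbers, doubling cores from 64 to 128 doubles throughput, but from 256 cores onward the result is pinned at the bandwidth-imposed ceiling, mirroring the saturation observed in the paper's simulations.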


Unified Shader; GPU; CUDA; GPGPU; Parallel Application


Supported by : National Research Foundation of Korea (한국연구재단)



Cited by

  1. Analysis of Impact of Correlation Between Hardware Configuration and Branch Handling Methods Executing General Purpose Applications, vol.13, no.3, 2013.
  2. Analysis on the GPU Performance according to Hierarchical Memory Organization, vol.14, no.3, 2014.