The Journal of the Korea Contents Association (한국콘텐츠학회논문지)
- Volume 14, Issue 3
- Pages 22-32
- 2014
- 1598-4877 (pISSN)
- 2508-6723 (eISSN)
Analysis on the GPU Performance according to Hierarchical Memory Organization
계층적 메모리 구성에 따른 GPU 성능 분석
- Received: 2013.11.14
- Accepted: 2013.12.26
- Published: 2014.03.28
Abstract
Recently, GPGPU computing has become widely used for general-purpose processing as well as graphics processing, because GPUs provide hardware optimized for parallel processing. The memory system has a large effect on the performance of parallel processing units such as GPUs. GPUs implement a hierarchical memory architecture to provide high memory bandwidth, and both memory address coalescing and memory request merging are widely used. This paper analyzes GPU performance under various memory organizations. According to our simulation results, adding an 8KB, 16KB, 32KB, or 64KB L1 cache improves GPU performance by 15.5%, 21.5%, 25.5%, and 30.9%, respectively, compared to a configuration without an L1 cache. However, the experimental results also show that performance decreases for some benchmarks because data dependencies increase the number of memory transactions. Moreover, when cache misses are frequent, the average memory access latency increases as the cache hierarchy becomes deeper.
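The last observation in the abstract, that a deeper cache hierarchy can raise the average memory access latency when misses are frequent, can be illustrated with the standard recursive average-access-time model. The sketch below is not the simulator used in the paper. The function name `average_access_latency`, the latencies, and the miss rates are all hypothetical values chosen only to show the trend.

```python
def average_access_latency(cache_levels, dram_latency):
    """Return the average memory access latency in cycles.

    cache_levels: list of (hit_latency, miss_rate) tuples, ordered from the
    level closest to the core (L1) outward. An empty list means no cache.
    """
    latency = dram_latency
    # Fold from the outermost level inward: each level pays its own lookup
    # latency and, on a miss, the latency of everything behind it.
    for hit_latency, miss_rate in reversed(cache_levels):
        latency = hit_latency + miss_rate * latency
    return latency


# Hypothetical latencies (cycles) and miss rates, for illustration only.
DRAM = 400
L1_GOOD, L2_GOOD = (20, 0.40), (100, 0.30)   # cache-friendly workload
L1_POOR, L2_POOR = (20, 0.98), (100, 0.95)   # miss-dominated workload

no_cache = average_access_latency([], DRAM)
good_two_level = average_access_latency([L1_GOOD, L2_GOOD], DRAM)
poor_one_level = average_access_latency([L2_POOR], DRAM)
poor_two_level = average_access_latency([L1_POOR, L2_POOR], DRAM)

for name, value in [
    ("no cache", no_cache),
    ("L1+L2, low miss rates", good_two_level),
    ("L2 only, high miss rate", poor_one_level),
    ("L1+L2, high miss rates", poor_two_level),
]:
    print(f"{name}: {value:.1f} cycles")
```

Under these assumed numbers, the hierarchy lowers the average latency well below the DRAM latency when miss rates are low. When nearly every access misses, each additional level mostly adds its own lookup time to the miss path, so the average latency ends up above the no-cache case.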
Keywords
GPU; Memory System; Hierarchical Memory Architecture; Memory Request Merging
Acknowledgement
Supported by: National Research Foundation of Korea (NRF), National IT Industry Promotion Agency (NIPA)