Latency Hiding based Warp Scheduling Policy for High Performance GPUs

  • Kim, Gwang Bok (School of Electronics and Computer Engineering, Chonnam National University);
  • Kim, Jong Myon (IT Convergence Department, University of Ulsan);
  • Kim, Cheol Hong (School of Electronics and Computer Engineering, Chonnam National University)
  • Received: 2019.01.22
  • Accepted: 2019.04.11
  • Published: 2019.04.30

Abstract

The LRR (Loose Round Robin) warp scheduling policy for GPU architectures yields high warp-level parallelism and balanced loads across warps. However, the traditional LRR policy lets multiple warps execute long-latency operations at the same time. When no more warps can be issued under a long-latency stall, GPU throughput may degrade significantly. In this paper, we propose a new warp scheduling policy that exploits latency hiding, leading to better-utilized memory resources in high-performance GPUs. The proposed warp scheduler prioritizes memory instructions based on the GTO (Greedy Then Oldest) policy in order to reduce memory stalls. When no warp can issue a memory instruction, the scheduler selects a warp for a computation instruction in round-robin manner. Furthermore, the proposed technique achieves higher performance by using additional information about recently committed warps. According to our experimental results, the proposed technique improves GPU performance by 12.7% and 5.6% on average over LRR and GTO, respectively.
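The two-phase selection described in the abstract can be sketched as a simplified software model. This is an illustrative assumption of the policy's control flow only (the `Warp` class, `select_warp` function, and ready/round-robin structures are hypothetical names), not the paper's hardware implementation:

```python
from collections import deque

class Warp:
    def __init__(self, wid, next_is_memory):
        self.wid = wid
        # True if this warp's next instruction is a memory operation
        self.next_is_memory = next_is_memory

def select_warp(ready_warps, rr_queue):
    """Pick the next warp to issue in two phases.

    Phase 1 (GTO-style): prefer warps whose next instruction is a
    memory operation; ready_warps is assumed ordered oldest-first,
    so the first match is the oldest such warp.
    Phase 2 (round-robin): if no memory instruction can issue,
    rotate through warps and pick the next ready one for computation.
    """
    # Phase 1: oldest ready warp with a pending memory instruction
    for w in ready_warps:
        if w.next_is_memory:
            return w
    # Phase 2: round-robin fallback for computation instructions
    for _ in range(len(rr_queue)):
        w = rr_queue.popleft()
        rr_queue.append(w)  # rotate so the next call starts after w
        if w in ready_warps:
            return w
    return None  # nothing ready: the pipeline stalls this cycle
```

In this sketch, issuing memory instructions early from the oldest warps starts long-latency loads sooner, so their latency can be hidden behind the computation instructions issued round-robin in phase 2.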

Keywords


Fig. 1. Microarchitecture of SM


Fig. 2. Warp Level Parallelism


Fig. 3. Warp Scheduling Policy Comparison


Fig. 4. Microarchitecture of Proposed Unit


Fig. 5. L1 Data Cache Miss Rate

Fig. 6. Reservation Fails on L1 Data Cache

Fig. 7. Stall Cycles

Fig. 8. IPC Comparison with Different Warp Scheduling Policies

Table 1. Baseline Configurations


Table 2. Benchmarks


References

  1. NVIDIA, "CUDA C Programming Guide," 2012.
  2. Khronos OpenCL Group, "The OpenCL Specification," 2011.
  3. T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Wavefront Scheduling," Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 72-83, 2012.
  4. T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Divergence-aware Warp Scheduling," Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 99-110, 2013.
  5. G. B. Kim, J. M. Kim, and C. H. Kim, "Dynamic Selective Warp Scheduling for GPUs Using L1 Data Cache Locality Information," International Conference on Parallel and Distributed Computing: Applications and Technologies, Springer, Singapore, pp. 230-239, 2018.
  6. Zhang, Y., Xing, Z., Liu, C., Tang, C., & Wang, Q., "Locality based warp scheduling in GPGPUs," Future Generation Computer Systems, 82, pp. 520-527. 2018. https://doi.org/10.1016/j.future.2017.02.036
  7. ElTantawy, A., & Aamodt, T. M., "Warp scheduling for fine-grained synchronization," In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 375-388, 2018.
  8. Oh, Yunho, et al. "Adaptive Cooperation of Prefetching and Warp Scheduling on GPUs." IEEE Transactions on Computers 68.4 (2019): 609-616. https://doi.org/10.1109/TC.2018.2878671
  9. V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU performance via large warps and two-level warp scheduling," Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 308-317, 2011.
  10. S. Y. Lee, A. Arunkumar, and C. J. Wu, "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads," Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 515-527, 2015.
  11. M. Lee, G. Kim, J. Kim, W. Seo, Y. Cho, and S. Ryu, "iPAWS: Instruction-Issue Pattern-Based Adaptive Warp Scheduling for GPGPUs," Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 370-381, 2016.
  12. M. K. Yoon, Y. Oh, S. Lee, S. H. Kim, D. Kim, and W. W. Ro, "DRAW: Investigating Benefits of Adaptive Fetch Group Size on GPU," Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 183-192, 2015.
  13. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software, pp. 163-174, 2009.
  14. "NVIDIA CUDA SDK Code Samples," http://developer.nvidia.com/cuda-downloads, 2015.
  15. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," Proceedings of the International Symposium on Workload Characterization (IISWC), pp. 44-54, 2009.
  16. S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a high-level language targeted to gpu codes," Innovative Parallel Computing (InPar), pp. 1-10, 2012.
  17. J. A. Stratton, C. Rodrigues, J. I. Sung, et al. "Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing," Center for Reliable and High-Performance Computing, 2012.
  18. M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, "A hierarchical thread scheduler and register file for energy-efficient throughput processors," ACM Transactions on Computer Systems (TOCS), Vol. 30, No. 2, April 2012.