Fig. 1. Microarchitecture of SM
Fig. 2. Warp Level Parallelism
Fig. 3. Warp Scheduling Policy Comparison
Fig. 4. Microarchitecture of Proposed Unit
Fig. 5. L1 Data Cache Miss Rate
Fig. 6. Reservation Fails on L1 Data Cache
Fig. 7. Stall Cycles
Fig. 8. IPC Comparison with Different Warp Scheduling Policies
Table 1. Baseline Configurations
Table 2. Benchmarks
References
- NVIDIA, "CUDA C Programming Guide," 2012.
- Khronos OpenCL Group, "The OpenCL Specification," 2011.
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 72-83, 2012.
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Divergence-aware Warp Scheduling," Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 99-110, 2013.
- Kim, G. B., Kim, J. M., & Kim, C. H., "Dynamic Selective Warp Scheduling for GPUs Using L1 Data Cache Locality Information," International Conference on Parallel and Distributed Computing: Applications and Technologies. Springer, Singapore, pp. 230-239, 2018.
- Zhang, Y., Xing, Z., Liu, C., Tang, C., & Wang, Q., "Locality based warp scheduling in GPGPUs," Future Generation Computer Systems, 82, pp. 520-527. 2018. https://doi.org/10.1016/j.future.2017.02.036
- ElTantawy, A., & Aamodt, T. M., "Warp scheduling for fine-grained synchronization," In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 375-388, 2018.
- Oh, Yunho, et al. "Adaptive Cooperation of Prefetching and Warp Scheduling on GPUs." IEEE Transactions on Computers 68.4 (2019): 609-616. https://doi.org/10.1109/TC.2018.2878671
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU performance via large warps and two-level warp scheduling," Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 308-317, 2011.
- S. Y. Lee, A. Arunkumar, and C. J. Wu, "CAWA: Coordinated warp scheduling and Cache Prioritization for critical warp acceleration of GPGPU workloads," ACM SIGARCH Computer Architecture (ISCA), pp. 515-527, 2015.
- M. Lee, G. Kim, J. Kim, W. Seo, Y. Cho, and S. Ryu, "iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs," High Performance Computer Architecture (HPCA), IEEE International Symposium on. pp. 370-381, 2016.
- M. K. Yoon, Y. Oh, S. Lee, S. H. Kim, D. Kim, and W. W. Ro, "DRAW: investigating benefits of adaptive fetch group size on GPU," In Performance Analysis of Systems and Software (ISPASS), pp. 183-192, 2015.
- A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software, pp. 163-174, 2009.
- "NVIDIA CUDA SDK Code Samples," http://developer.nvidia.com/cuda-downloads, 2015.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," Proceedings of the International Symposium on Workload Characterization (IISWC), pp. 44-54, 2009.
- S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a high-level language targeted to GPU codes," Innovative Parallel Computing (InPar), pp. 1-10, 2012.
- J. A. Stratton, C. Rodrigues, J. I. Sung, et al. "Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing," Center for Reliable and High-Performance Computing, 2012.
- M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, "A hierarchical thread scheduler and register file for energy-efficient throughput processors," ACM Transactions on Computer Systems (TOCS), Vol. 30, No. 2, April 2012.