Research Trends for Improving MPI Collective Communication Performance


  • H.Y. Ahn (Supercomputing Technology Research Center);
  • Y.M. Park (Supercomputing Technology Research Center);
  • S.Y. Kim (Supercomputing Technology Research Center);
  • W.J. Han (Artificial Intelligence Research Laboratory)
  • Published: 2022.12.01

Abstract

Message Passing Interface (MPI) collective communication has been applied to various science and engineering areas such as physics, chemistry, biology, and astronomy. The parallel computing performance of data-intensive workloads in these fields depends on collective communication performance, which often becomes a scalability bottleneck. To overcome this limitation, MPI collective communication technology has been extensively researched over the last several decades to improve communication performance. In this paper, we provide a comprehensive survey of state-of-the-art research on MPI collective communication and examine the trends of recently developed technologies. We also discuss future research directions for providing high performance and scalability to large-scale MPI applications.
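As context for the abstract, the short C sketch below (not part of the original article) illustrates one representative MPI collective, MPI_Allreduce, which combines a value contributed by every rank and returns the result to all ranks; this is the kind of operation whose algorithms and performance the surveyed work addresses. It assumes an installed MPI implementation (e.g., MPICH or Open MPI) and an MPI compiler wrapper such as mpicc.

/* Illustrative sketch only: global sum across all ranks via MPI_Allreduce. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes one value; MPI_Allreduce combines them with
       MPI_SUM and delivers the global result to every rank. */
    double local = (double)(rank + 1);
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum over %d ranks = %.1f\n", size, global);

    MPI_Finalize();
    return 0;
}

Run with, for example, mpirun -np 4: the four ranks contribute 1 through 4, so every rank receives the global sum 10.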


Acknowledgement

This work was supported by the National Research Foundation of Korea's Supercomputer Development Leading Program, funded by the Korean government (Ministry of Science and ICT) [Project No. 2021M3H6A1017683].
