Design of Message Passing Engine Based on Processing Node Status for MPI Collective Communication

  • Won-young Chung (Processor Lab., Dept. of Electrical and Electronic Engineering, Yonsei University) ;
  • Yong-surk Lee (Processor Lab., Dept. of Electrical and Electronic Engineering, Yonsei University)
  • Received : 2012.02.26
  • Accepted : 2012.07.02
  • Published : 2012.08.31

Abstract

In this paper, on the assumption that an MPI collective communication function is converted into a group of point-to-point communication functions at the transaction level, an algorithm that optimizes the broadcast, scatter, and gather functions among the MPI collective communications is proposed. An MPI-dedicated hardware engine that runs the proposed algorithm was designed and named the OCC-MPE (Optimized Collective Communication Message Passing Engine). The OCC-MPE performs point-to-point communication using the standard send mode. For broadcast, scatter, and gather, the most frequently used collective communications, the transmission order is determined by the proposed algorithm before communication begins, which shortens the total communication completion time. To measure the performance of the proposed algorithms, a SystemC-based Bus Functional Model (BFM) of the OCC-MPE was built. After the performance was evaluated with the SystemC-based simulator, an MPSoC (Multi-Processor System on a Chip) containing the proposed OCC-MPE was designed in Verilog HDL. Synthesis with a TSMC 0.18 μm process shows that, with four processing nodes, each OCC-MPE occupies approximately 1978.95 gates, about 4.15% of the whole system, which is a relatively small footprint. Embedding the proposed OCC-MPE in an MPSoC therefore yields a substantial performance improvement for a relatively small addition of hardware resources.
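The abstract describes decomposing a collective operation into point-to-point sends whose order is chosen to shorten the total completion time. The paper's own status-based ordering is not reproduced in the abstract, so the sketch below instead shows the standard binomial-tree schedule for broadcast, a common way to order point-to-point sends so that a broadcast over n nodes finishes in ceil(log2(n)) rounds rather than n - 1 sequential sends from the root. The function name and the (src, dst) schedule representation are illustrative assumptions, not the OCC-MPE implementation.

```python
def binomial_broadcast_schedule(n, root=0):
    """Build a broadcast schedule over n processing nodes as a list of
    rounds, each round being a list of point-to-point (src, dst) sends.

    Every node that already holds the data forwards it in each round
    (a binomial tree), so the broadcast completes in ceil(log2(n))
    rounds instead of the n - 1 steps a purely sequential
    root-sends-to-all ordering would need.

    Illustrative only; the OCC-MPE's status-based ordering differs.
    """
    rounds = []
    dist = 1
    while dist < n:
        sends = []
        for rel in range(dist):   # ranks (relative to root) that hold the data
            peer = rel + dist     # partner reached in this round
            if peer < n:
                sends.append(((root + rel) % n, (root + peer) % n))
        rounds.append(sends)
        dist *= 2
    return rounds

# With four nodes the data reaches everyone in two rounds:
# round 0: node 0 -> node 1; round 1: 0 -> 2 and 1 -> 3 in parallel.
```

Because each round's sends use disjoint node pairs, they can proceed concurrently on a network-on-chip, which is what makes the logarithmic completion time possible.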
