Implementation of Hardware Data Prefetcher Adaptable for Various State-of-the-Art Workload

Kim, KangHee; Park, TaeShin; Song, KyungHwan; Yoon, DongSung; Choi, SangBang

doi:10.5573/ieie.2016.53.12.020

Journal of the Institute of Electronics and Information Engineers

Vol. 53, No. 12, pp. 20-35, 2016

pISSN 2287-5026, eISSN 2288-159X

The Institute of Electronics and Information Engineers

Kim, KangHee (Dept. of Electronic Engineering, Inha University) ;
Park, TaeShin (Dept. of Electronic Engineering, Inha University) ;
Song, KyungHwan (Dept. of Electronic Engineering, Inha University) ;
Yoon, DongSung (Dept. of Electronic Engineering, Inha University) ;
Choi, SangBang (Dept. of Electronic Engineering, Inha University)

Received: 2016.07.04
Reviewed: 2016.11.11
Published: 2016.12.25

Abstract

This paper proposes a tree architecture built from multi-operand decimal carry-save adders (CSAs) and an improved decimal carry-lookahead adder (CLA) to reduce the delay and area of the partial product accumulation (PPA) stage of a parallel decimal multiplier. The proposed tree uses multi-operand decimal CSAs to reduce the decimal partial products quickly. Because the input range of the recoder in each CSA is restricted, the recoder logic can be made as simple as possible. Each CSA also adds decimal numbers of limited range at a specific position in the tree, so the reduction is performed efficiently. In addition, the logic of the decimal CLA is improved so that the final BCD result is obtained faster. To evaluate the proposed PPA stage, the design was synthesized with Design Compiler using SMIC's 180 nm CMOS technology library. Compared with a PPA stage built with the general method, the proposed stage reduces delay by about 15.6% and area by about 16.2%. The total delay and total area are also reduced, even though the delay and area of the decimal CLA itself increase.
