Efficient Use of On-chip Memory through Profile-Driven Array Reorganization

  • Received : 2011.05.04
  • Accepted : 2011.07.07
  • Published : 2011.12.31

Abstract

In high-performance embedded systems, multiple on-chip memories are an essential architectural feature for exploiting the inherent parallelism of multimedia applications, because they allow multiple data accesses to be executed in parallel. However, it remains difficult to exploit multiple on-chip memories effectively. Successful use of this architecture depends strongly on how efficiently memory parallelism in the target applications can be detected and exploited. In this paper, we propose a technique based on a linear array access descriptor [1], generated from profiled data, to detect and exploit memory parallelism. The proposed technique addresses an array reorganization problem in order to maximize memory parallelism in multimedia applications. We present preliminary experiments that apply the proposed technique to a representative coarse-grained reconfigurable array (CGRA) processor running multimedia kernel codes. Our experimental results demonstrate that the technique optimizes data placement by putting independent data in separate storage. The results show 9.8% higher performance on average than the existing method.
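The general idea of placing data that is accessed together in separate memories can be illustrated with a small sketch. This is not the paper's algorithm: it does not use the linear array access descriptor, and it assumes a hypothetical profile format in which each entry is the set of arrays touched in one loop iteration. The function names `build_conflicts` and `assign_banks` are illustrative only. Arrays that are frequently accessed in the same iteration are greedily assigned to different banks so that their accesses could proceed in parallel.

```python
from collections import defaultdict
from itertools import combinations


def build_conflicts(trace):
    """Count how often each pair of arrays is accessed in the same iteration.

    trace: list of sets (hypothetical profile format); each set holds the
    names of the arrays touched in one profiled loop iteration.
    """
    weights = defaultdict(int)
    for accessed in trace:
        for a, b in combinations(sorted(accessed), 2):
            weights[(a, b)] += 1
    return weights


def assign_banks(arrays, weights, num_banks):
    """Greedy placement: arrays with the most co-accesses are placed first,
    each into the bank where it conflicts least with arrays already there."""
    def w(a, b):
        return weights.get((a, b), 0) + weights.get((b, a), 0)

    total = {a: sum(w(a, b) for b in arrays if b != a) for a in arrays}
    placement = {}
    for a in sorted(arrays, key=lambda x: (-total[x], x)):
        cost = [0] * num_banks
        for b, bank in placement.items():
            cost[bank] += w(a, b)
        placement[a] = cost.index(min(cost))
    return placement


# Illustrative profile: A and B are accessed together most often.
trace = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
weights = build_conflicts(trace)
placement = assign_banks(["A", "B", "C"], weights, num_banks=2)
print(placement)
```

In this toy profile, `A` and `B` land in different banks because they are co-accessed most often. A real implementation would also have to respect bank capacities and could split a single array across banks, which this sketch ignores.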

References

  1. Yunheung Paek, Jay Hoeflinger, and David Padua, "Simplification of array access patterns for compiler optimizations", In PLDI'98, 60-71.
  2. Jean-Francois Collard and Daniel Lavery, "Optimizations to prevent cache penalties for the Intel Itanium 2 processor", In Proceedings of the CGO'03, 105-114.
  3. P. Grun, N. Dutt, and A. Nicolau, "Access pattern based local memory customization for low power embedded systems", In Proceedings of the conference on DATE, 778-784.
  4. M. Gupta and P. Banerjee, "Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers", IEEE Trans. Parallel Distrib. Syst., 3(2):179-193, 1992. https://doi.org/10.1109/71.127259
  5. Hartej Singh, Guangming Lu, Eliseu Filho, Rafael Maestre, Ming-Hau Lee, Fadi Kurdahi, and Nader Bagherzadeh, "MorphoSys: case study of a reconfigurable computing system targeting multimedia applications", In Proceedings of DAC, 573-578, 2000.
  6. M. Wolfe, "More iteration space tiling", In Proceedings of the ACM/IEEE conference on Supercomputing'89, 655-664.
  7. Nainesh Agarwal and Nikitas Dimopoulos, "DSPstone benchmark of CoDeL's automated clock gating platform", In Proceedings of the IEEE VLSI, 508-509, 2007.
  8. M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite", In Proceedings of the WWC-4, 2001.
  9. ICD-C compiler framework, University of Dortmund, http://www.icd.de/es/icd-c/
  10. Yoonjin Kim, Mary Kiemb, Chulsoo Park, Jinyong Jung, and Kiyoung Choi, "Resource sharing and pipelining in coarse-grained reconfigurable architecture for domain-specific optimization", In Proceedings of DATE'05, 12-17.
  11. A. Hatanaka and N. Bagherzadeh, "A modulo scheduling algorithm for a coarse-grain reconfigurable array template", In Proceedings of the IPDPS'07, 1-8, 2007.
  12. Hyunchul Park, Kevin Fan, Manjunath Kudlur, and Scott Mahlke, "Modulo graph embedding: mapping applications onto coarse-grained reconfigurable architectures", In Proceedings of CASES'06, 136-146.
  13. Kathryn McKinley and Steve Carr, "Improving data locality with loop transformations", ACM Transactions on Programming Languages and Systems, 18: 424-453, 1996. https://doi.org/10.1145/233561.233564
  14. B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix", In Proceedings of Field Programmable Logic, FPL'03, 61-70.
  15. Michael Joseph Wolfe, "High Performance Compilers for Parallel Computing", Addison-Wesley Longman Publishing Co., USA, 1995.
  16. Wei Li, "Compiling for NUMA parallel machines", PhD thesis, Ithaca, NY, USA, 1993.
  17. Michael E. Wolf and Monica S. Lam, "A data locality optimizing algorithm", In Proceedings of the ACM SIGPLAN 1991, 30-44.
  18. Michael E. Wolf, Dror E. Maydan, and Ding-Kai Chen, "Combining loop transformations considering caches and scheduling", In MICRO29, 274-286, 1996.
  19. Daniel Edward Lenoski, "The design and analysis of DASH: a scalable directory-based multiprocessor", PhD thesis, Stanford, CA, USA, 1992.
  20. Kai Li, "Shared virtual memory on loosely coupled multiprocessors", PhD thesis, 1986.
  21. S. Lumetta, L. Murphy, X. Li, D. Culler, and I. Khalil, "Decentralized optimal power pricing: The development of a parallel program", In IEEE Parallel and Distributed Technology, 240-249, 1993.
  22. V. Balasundaram and K. Kennedy, "A technique for summarizing data access and its use in parallelism enhancing transformations", In Proceedings of the ACM SIGPLAN 1989, 41-53.
  23. Chau-Wen Tseng, "Compiler optimizations for eliminating barrier synchronization", ACM SIGPLAN, 144-155, 1995.