Color Media Instructions for Embedded Parallel Processors

Implementation of Color Media Instructions for Embedded Parallel Processors

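  • 김철홍 (Department of Electronics and Computer Engineering, Chonnam National University)
  • 김종면 (School of Computer Engineering and Information Technology, University of Ulsan)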
  • Published : 2008.08.15


As the mobile computing environment rapidly changes, growing user demand for multimedia-over-wireless capabilities on embedded processors places constraints on performance, power, and size. This paper proposes color media instructions (CMI) for single instruction, multiple data (SIMD) parallel processors to meet these computational requirements and cost goals. Whereas existing multimedia extensions store and process four 8-bit pixels in a 32-bit register, CMI exploits the fact that color components are perceptually less significant and supports parallel operations on two packed, compressed 16-bit YCbCr pixels (6 bits for Y and 5 bits each for Cb and Cr) in a 32-bit datapath processor. This provides greater concurrency and efficiency for YCbCr data processing. Moreover, the smaller data format reduces system cost, and the reduced data bandwidth simplifies system design. Experimental results on a representative SIMD parallel processor architecture show that CMI achieves an average speedup of 6.3x over the baseline SIMD parallel processor. In contrast, MMX (Intel's representative multimedia extension) achieves an average speedup of only 3.7x over the same baseline SIMD architecture. CMI also outperforms MMX in both area efficiency (a 52% increase versus a 13% increase) and energy efficiency (a 50% increase versus an 11% increase). CMI delivers these gains with only a 3% increase in system area and a 5% increase in system power, while MMX requires a 14% increase in system area and a 16% increase in system power.
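To make the 16-bit YCbCr (6:5:5) format concrete, the following is a minimal sketch of how an 8-bit-per-component pixel might be quantized and how two such pixels fit in one 32-bit word. The bit layout (Y in the top 6 bits, then Cb, then Cr), the truncating quantization, and all function names are assumptions for illustration; the abstract does not specify them.

```python
# Illustrative sketch only. Assumed layout of one 16-bit pixel
# (not stated in the abstract): Y in bits 15..10, Cb in bits 9..5,
# Cr in bits 4..0. Quantization here simply drops the low bits.

def pack_ycbcr655(y, cb, cr):
    """Quantize 8-bit Y, Cb, Cr to 6, 5, 5 bits and pack into 16 bits."""
    for c in (y, cb, cr):
        if not 0 <= c <= 255:
            raise ValueError("components must be 8-bit values")
    return ((y >> 2) << 10) | ((cb >> 3) << 5) | (cr >> 3)

def unpack_ycbcr655(p):
    """Expand a 16-bit pixel back to approximate 8-bit components."""
    y = (p >> 10) & 0x3F
    cb = (p >> 5) & 0x1F
    cr = p & 0x1F
    # The discarded low bits cannot be recovered, so they come back as 0.
    return (y << 2, cb << 3, cr << 3)

def pack_pair(p0, p1):
    """Place two 16-bit pixels in one 32-bit word (p0 in the low half)."""
    return ((p1 & 0xFFFF) << 16) | (p0 & 0xFFFF)

def unpack_pair(word):
    """Split a 32-bit word into its two 16-bit pixels."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF
```

Under these assumptions, a pixel (200, 100, 50) round-trips to (200, 96, 48): luma keeps more precision than the two chroma components, which reflects the perceptual argument made in the abstract.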

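Recent changes in the mobile computing environment have increased demand for high-performance, low-power processing of multimedia data, and dedicated multimedia acceleration has become an increasingly important processor feature. This paper therefore proposes color media instructions for SIMD parallel processors that target high-performance, low-power multimedia processing. Existing multimedia instructions for general-purpose microprocessors (e.g., MMX, VIS, AltiVec) store and process four 8-bit pixels in a 32-bit register. The proposed color media instructions, by contrast, take into account that human vision is less sensitive to color: on a 32-bit datapath architecture, they store two packed sets (six pixel components) of compressed 16-bit YCbCr data (6-bit Y, 5-bit Cb and Cr) in a 32-bit register and process them simultaneously, yielding high parallelism and efficiency in YCbCr data processing. By reducing the data format size, the color media instructions also lower overall system cost, and the reduced data bandwidth simplifies system design. In simulations on a SIMD parallel processor architecture, programs using the color media instructions run on average 6.3 times faster than programs using the baseline instructions. Programs using MMX, Intel's representative multimedia instruction set, run only 3.7 times faster than the baseline programs on the same SIMD parallel processor. The color media instructions also outperform MMX in system area efficiency (a 52% increase versus a 13% increase) and system power efficiency (a 50% increase versus an 11% increase). They achieve this performance and efficiency with only a 3% increase in system area and a 5% increase in system power, whereas MMX requires a 14% increase in system area and a 16% increase in system power.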
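The idea of processing both packed pixels at once can be sketched in software as a partitioned add, where all six fields of a 32-bit word are added in one pass without carries leaking between fields. This is a hypothetical illustration: the abstract does not list the actual CMI operations, and the wraparound semantics, the field layout (Y[15:10], Cb[9:5], Cr[4:0] per 16-bit pixel), and every name below are assumptions.

```python
# Illustrative sketch only; not the paper's instruction definition.
# Assumed layout per 16-bit pixel: Y[15:10], Cb[9:5], Cr[4:0].
# H has a 1 at the top bit of each of the six fields in a 32-bit word
# (bits 4, 9, 15 in each 16-bit half).
H = 0x82108210
MASK32 = 0xFFFFFFFF

def word_from_fields(y0, cb0, cr0, y1, cb1, cr1):
    """Build a 32-bit word from already-quantized 6/5/5-bit fields."""
    p0 = (y0 << 10) | (cb0 << 5) | cr0
    p1 = (y1 << 10) | (cb1 << 5) | cr1
    return (p1 << 16) | p0

def fields_from_word(word):
    """Return (y0, cb0, cr0, y1, cb1, cr1) from a 32-bit word."""
    out = []
    for p in (word & 0xFFFF, (word >> 16) & 0xFFFF):
        out += [(p >> 10) & 0x3F, (p >> 5) & 0x1F, p & 0x1F]
    return tuple(out)

def partitioned_add(a, b):
    """Add all six fields at once, each wrapping within its own width."""
    # Add with the top bit of every field cleared, so a carry can only
    # land in that field's own top bit and never in the next field.
    low = (a & ~H & MASK32) + (b & ~H & MASK32)
    # Fold the original top bits back in with XOR (addition modulo 2).
    return (low ^ ((a ^ b) & H)) & MASK32
```

For example, under these assumptions, adding the fields (10, 3, 4, 63, 31, 1) and (5, 2, 1, 1, 1, 1) gives (15, 5, 5, 0, 0, 2): the second pixel's Y and Cb wrap to 0 without disturbing their neighbors. A hardware implementation would more likely break the carry chain at field boundaries and might saturate instead of wrapping; the masking trick here only emulates that behavior on an ordinary integer.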


