Search | Korea Science

Scalable Application Mapping for SIMD Reconfigurable Architecture

Kim, Yongjoo;Lee, Jongeun;Lee, Jinyong;Paek, Yunheung
- JSTS:Journal of Semiconductor Technology and Science / v.15 no.6 / pp.634-646 / 2015
Coarse-Grained Reconfigurable Architecture (CGRA) is a very promising platform that provides fast turn-around-time as well as very high energy efficiency for multimedia applications. One of the problems with CGRAs, however, is application mapping, which currently does not scale well with geometrically increasing numbers of cores. To mitigate the scalability problem, this paper discusses how to use the SIMD (Single Instruction Multiple Data) paradigm for CGRAs. While the idea of SIMD is not new, SIMD can complicate the mapping problem by adding an additional dimension of iteration mapping to the already complex problem of operation and data mapping, which are all interdependent, and can thus significantly affect performance through memory bank conflicts. In this paper, based on a new architecture called SIMD reconfigurable architecture, which allows SIMD execution at multiple levels of granularity, we present how to minimize bank conflicts considering all three related sub-problems, for various RA organizations. We also present data tiling and evaluate a conflict-free scheduling algorithm as a way to eliminate bank conflicts for a certain class of mapping problem.
https://doi.org/10.5573/JSTS.2015.15.6.634

Low-latency SAO Architecture and its SIMD Optimization for HEVC Decoder

Kim, Yong-Hwan;Kim, Dong-Hyeok;Yi, Joo-Young;Kim, Je-Woo
- IEIE Transactions on Smart Processing and Computing / v.3 no.1 / pp.1-9 / 2014
This paper proposes a low-latency Sample Adaptive Offset filter (SAO) architecture and its Single Instruction Multiple Data (SIMD) optimization scheme to achieve fast High Efficiency Video Coding (HEVC) decoding in a multi-core environment. According to the HEVC standard and its Test Model (HM), SAO operation is performed only at the picture level. Most real-time decoders, however, execute their sub-modules on a Coding Tree Unit (CTU) basis to reduce latency and memory bandwidth. The proposed low-latency SAO architecture has the following advantages over picture-based SAO: 1) significantly lower memory requirements, and 2) a low-latency property that enables efficient pipelined multi-core decoding. In addition, SIMD optimization of SAO filtering can reduce the SAO filtering time significantly. The simulation results showed that the proposed low-latency SAO architecture, while using significantly less memory, achieves a decoding time similar to that of picture-based SAO in single-core decoding. Furthermore, the SIMD optimization scheme reduces the SAO filtering time by approximately 509% and increases the total decoding speed by approximately 7% compared to the existing look-up table approach of HM.
https://doi.org/10.5573/IEIESPC.2014.3.1.1

An Implementation of Efficient Quicksort Utilizing SIMD-Based VBP Technique (SIMD 기반의 VBP 기법을 적용한 효율적인 퀵정렬의 구현)

Hong, Gilseok;Kim, Hongyeon;Kang, Seonghyeon;Min, Jun-Ki
- KIISE Transactions on Computing Practices / v.23 no.8 / pp.498-503 / 2017
SIMD (Single Instruction Multiple Data) is a representative parallelization architecture that processes multiple data items loaded in a SIMD register with a single instruction. Quicksort is a sorting algorithm that picks an element of the array as a pivot and reorders the array so that all elements smaller than the pivot are located to its left and all elements greater than the pivot are located to its right; the algorithm then performs the same task recursively on both sublists. In this paper, we propose an efficient Quicksort algorithm using SIMD instructions that invokes conditional branches as little as possible, to avoid the performance degradation caused by branch misprediction in a pipelined architecture. In addition, we improve the performance of the Quicksort algorithm by fetching data into a SIMD register in byte units to apply the VBP (Vertical Bit Parallel) and early pruning techniques.
https://doi.org/10.5626/KTCP.2017.23.8.498
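
The branch-avoidance idea in the abstract above can be illustrated with a small scalar sketch. This is not the authors' SIMD or VBP implementation, and the names `partition_branch_free` and `quicksort` are hypothetical. It only assumes a Lomuto-style partition in which the result of the comparison is used as an integer (0 or 1) rather than steering a conditional branch. Python still interprets every statement, so the sketch shows the data flow, not the speedup.

```python
def partition_branch_free(a, lo, hi):
    """Partition a[lo..hi] around the pivot a[hi]; return the pivot's final index.

    The element comparison never drives an if-statement: its 0/1 result is
    added to the store index instead (an illustrative assumption, not the
    paper's method).
    """
    pivot = a[hi]
    store = lo
    for i in range(lo, hi):
        x = a[i]
        # Unconditionally swap a[i] and a[store].
        a[i] = a[store]
        a[store] = x
        # Advance the boundary only when x belongs on the left side.
        store += int(x < pivot)
    a[store], a[hi] = a[hi], a[store]
    return store


def quicksort(a, lo=0, hi=None):
    """Sort list a in place using the branch-free partition above."""
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = partition_branch_free(a, lo, hi)
        quicksort(a, lo, p - 1)
        quicksort(a, p + 1, hi)
```

Calling `quicksort(data)` on a Python list sorts it in place. A SIMD version would compute the comparison for several elements at once and use the resulting mask in the same way.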

The Design of low-cost SIMD MAC/MAS for Embedded Systems (임베디드 시스템을 위한 저비용 SIMD MAC/MAS 블록 설계)

Lee Yong Joo;Jung Jin Woo;Lee Yong Surk
- The Journal of Korean Institute of Communications and Information Sciences / v.29 no.10C / pp.1460-1468 / 2004
In this paper, we developed a low-area, low-cost SIMD MAC/MAS (Single Instruction Multiple Data Multiply and ACcumulate/Multiply And Subtract) block for multimedia applications, which are widely used in everyday life. We compare the result of this research with a previously developed larger, higher-performance SIMD MAC/MAS. This paper consists of five parts: an introduction, the design of the SIMD MAC/MAS hardware, its distinguishing features relative to previous work, the synthesis results, and a conclusion. The resulting design reduces the total hardware size by 32% compared with the 64-bit SIMD MAC/MAS block designed for high performance. We adapted the ISA (Instruction Set Architecture) to suit an embedded DSP (Digital Signal Processor) and narrowed the data width from 64 bits to 32 bits for a more optimized implementation.

Hardware Implementation of Rasterizer with SIMD Architecture Applicable to Mobile 3D Graphics System (모바일 3차원 그래픽스 시스템에 적용 가능한 SIMD 구조를 갖는 래스터라이저의 하드웨어 구현)

Ha, Chang-Soo;Sung, Kwang-Ju;Choi, Byeong-Yoon
- Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2010.05a / pp.313-315 / 2010
In this paper, we describe the results of developing a hardware rasterizer applicable to mobile 3D graphics systems, designed with a SIMD architecture and verified on an FPGA. The tile-based scan conversion unit is designed as a SIMD architecture that processes four tiles simultaneously, and each tile traverses pixels hierarchically in three levels so that the number of visits is minimized. According to the experimental results, $8{\times}8$ is the most efficient tile size, and the last step of tile traversal is performed on a $2{\times}2$ subtile. The rasterizer supports flat shading and Gouraud shading, and the texture mapper supports affine mapping and perspective-corrected mapping. The texture mapper also supports point sampling and bilinear interpolation sampling modes, two types of wrapping modes, and various blending modes. The rasterizer operates at 120 MHz on a Xilinx Virtex-4 LX100 device. To ease verification, the texture memory and frame buffer are generated as block ROM and block RAM.

An Analytical Evaluation of 2D Mesh-connected SIMD Architecture for Parallel Matrix Multiplication (2D Mesh SIMD 구조에서의 병렬 행렬 곱셈의 수치적 성능 분석)

Kim, Cheong-Ghil
- Journal of The Institute of Information and Telecommunication Facilities Engineering / v.10 no.1 / pp.7-13 / 2011
Matrix multiplication is a fundamental operation of linear algebra and arises in many areas of science and engineering. This paper introduces an efficient parallel matrix multiplication scheme on an $N{\times}N$ mesh-connected SIMD array processor, called multiple hierarchical SIMD architecture (HMSA). The architectural characteristic of HMSA is its hierarchically structured control units, which consist of a global control unit, N local control units configured diagonally, and $N^2$ processing elements (PEs) arranged in an $N{\times}N$ array. PEs communicate through local buses connecting four adjacent neighbor PEs in a mesh-torus network and through global buses running across the rows and columns, called horizontal buses and vertical buses, respectively. This architecture gives HMSA diagonally indexed concurrent broadcast and alternating access to either the rows (row control mode) or the columns (column control mode) of the 2D PE array. An algorithmic mapping method is used for performance evaluation by mapping matrix multiplication onto the proposed architecture. The asymptotic time complexity is evaluated, and the result shows that parallel matrix multiplication on HMSA can provide a significant performance improvement.

Implementation of Pixel Subword Parallel Processing Instructions for Embedded Parallel Processors (임베디드 병렬 프로세서를 위한 픽셀 서브워드 병렬처리 명령어 구현)

Jung, Yong-Bum;Kim, Jong-Myon
- The KIPS Transactions:PartA / v.18A no.3 / pp.99-108 / 2011
Processor technology is currently advancing through parallel processing techniques rather than solely by increasing the clock frequency of a single processor, owing to high technology cost and power consumption. In this paper, a SIMD (Single Instruction Multiple Data) based parallel processor is introduced that efficiently processes the massive data inherent in multimedia. In addition, this paper proposes pixel subword parallel processing instructions for the SIMD parallel processor architecture that operate efficiently on image and video pixels. The proposed pixel subword parallel processing instructions store and process four 8-bit pixels in four partitioned 12-bit registers in a 48-bit datapath architecture. This solves the overflow problem inherent in existing multimedia extensions and reduces the use of many packing/unpacking instructions. Experimental results using the same SIMD-based parallel processor architecture indicate that the proposed pixel subword parallel processing instructions achieve a speedup of $2.3{\times}$ over the baseline SIMD array performance. This is in contrast to MMX-type instructions (a representative Intel multimedia extension), which achieve a speedup of only $1.4{\times}$ over the same baseline SIMD array performance. In addition, the proposed instructions achieve $2.5{\times}$ better energy efficiency than the baseline program, while MMX-type instructions achieve only $1.8{\times}$ better energy efficiency than the baseline program.
https://doi.org/10.3745/KIPSTA.2011.18A.3.099
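
The overflow point in the abstract above can be made concrete with a small sketch. It is not the proposed instruction set. It only assumes four 8-bit pixels packed into 12-bit lanes of a 48-bit word, and the helper names `pack` and `unpack` are hypothetical. With four guard bits per lane, a lane-wise sum of two pixels (at most 510) cannot carry into the neighbouring lane, whereas with 8-bit lanes it can.

```python
def pack(values, lane_bits=12):
    """Pack small non-negative integers into consecutive lanes of one word."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * lane_bits)
    return word


def unpack(word, lane_bits=12, lanes=4):
    """Split a packed word back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(lanes)]


a = [200, 10, 255, 0]
b = [100, 20, 255, 1]

# 12-bit lanes: one integer addition adds all four pixels, and no carry
# crosses a lane boundary because each sum fits in 12 bits.
wide = unpack(pack(a) + pack(b))  # [300, 30, 510, 1]

# 8-bit lanes: sums above 255 overflow and corrupt the next lane.
narrow = unpack(pack(a, 8) + pack(b, 8), lane_bits=8)
```

Here `wide` holds the exact per-pixel sums, while `narrow` differs from them because the first and third lanes overflow.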

A Parallel Memory Suitable for SIMD Architecture Processing High-Definition Image Haze Removal in High-Speed (고화질 영상에서 고속 안개 제거를 위한 SIMD 구조에 적합한 병렬메모리)

Lee, Hyung
- Journal of the Korea Society of Computer and Information / v.19 no.7 / pp.9-16 / 2014
Since the haze removal algorithm using the dark channel prior was introduced, much research has addressed improving its processing speed, even though it produces impressive results. A notable approach uses the median dark channel prior. Although it is considered a very attractive method, its processing speed remains low. Therefore, this paper introduces a parallel memory model suitable for a SIMD architecture that performs haze removal on high-definition images at high speed. The proposed parallel memory model allows n pixels to be accessed simultaneously. It also supports strides of 3, 5, 7, and 11 in order to execute convolution mask operations such as a median filter. The proposed parallel memory model can therefore supply enough data bandwidth to process the median dark channel prior algorithm at high speed.
https://doi.org/10.9708/jksci.2014.19.7.009

Photon Mapping SIMD Processor Design using Reconfigurable Cell (재구성 Cell을 이용한 Photon mapping SIMD프로세서 설계)

Ryu, Hyun-Woo;Kim, Young-Jin;Lee, Hyon-Soo
- Proceedings of the IEEK Conference / 2005.11a / pp.719-722 / 2005
The synthesis of 3D images is the most important part of virtual reality. Photon mapping is the best method for realism in 3D graphics. This paper presents an architecture for photon mapping applications on SoC devices. The proposed architecture reduces the computation time for photon map search and radiance estimation. The architecture is implemented as a SIMD processor, which trades parallelism for frequency of operation.

A Speed-up Method of HOG Pedestrian Detector in Advanced SIMD Architecture (Advanced SIMD 아키텍처에서의 HOG 보행자 검출기 고속화 방법)

Kwon, Ki-Pyo;Lee, Jae-Heung
- Journal of IKEEE / v.18 no.1 / pp.106-113 / 2014
A pedestrian detector can be applied for various purposes, such as monitoring or counting the number of people in a place, or detecting people who run into the roadway. There has been a lot of related research, but detection is slow on embedded systems because of their limited computing power. This paper presents an algorithm for a fast pedestrian detector using HOG on the ARM SIMD architecture. It quickly removes the image background and improves the detection speed using the NEON parallel technique. When tested with the INRIA Person Dataset, the proposed pedestrian detector is 3.01 times faster than the previous one.
https://doi.org/10.7471/ikeee.2014.18.1.106

Search Result 60, Processing Time 0.03 seconds
