• Title/Summary/Keyword: SIMD

Search Result 176, Processing Time 0.263 seconds

A Novel Reconfigurable Processor Using Dynamically Partitioned SIMD for Multimedia Applications

  • Lyuh, Chun-Gi;Suk, Jung-Hee;Chun, Ik-Jae;Roh, Tae-Moon
    • ETRI Journal
    • /
    • v.31 no.6
    • /
    • pp.709-716
    • /
    • 2009
  • In this paper, we propose a novel reconfigurable processor using dynamically partitioned single-instruction multiple-data (DP-SIMD) which is able to process multimedia data. The SIMD processor and parallel SIMD (P-SIMD) processor, which is composed of a number of SIMD processors, are usually used these days. But these processors are inefficient because all processing units (PUs) should process the same operations all the time. Moreover, the PUs can process different operations only when every SIMD group operation is predefined. We propose a processor control method which can partition parallel processors into multiple SIMD-based processors dynamically to enhance efficiency. For performance evaluation of the proposed method, we carried out the inverse transform, inverse quantization, and motion compensation operations of H.264 using processors based on SIMD, P-SIMD, and DP-SIMD. Experimental results show that the DP-SIMD control method is more efficient than SIMD and P-SIMD control methods by about 15% and 14%, respectively.

Multi-Dimensional Record Scan with SIMD Vector Instructions (SIMD 벡터 명령어를 이용한 다차원 레코드 스캔)

  • Cho, Sung-Ryong;Han, Hwan-Soo;Lee, Sang-Won
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.6
    • /
    • pp.732-736
    • /
    • 2010
  • Processing a large amount of data becomes more important than ever. Particularly, the information queries which require multi-dimensional record scan can be efficiently implemented with SIMD instruction sets. In this article, we present a SIMD record scan technique which employs row-based scanning. Our technique is different from existing SIMD techniques for predicate processes and aggregate operations. Those techniques apply SIMD instructions to the attributes in the same column of the database, exploiting the column-based record organization of the in-memory database systems. Whereas, our SIMD technique is useful for multi-dimensional record scanning. As the sizes of registers and the memory become larger, our row-based SIMD scan can have bigger impact on the performance. Moreover, since our technique is orthogonal to the parallelization techniques for multi-core processors, it can be applied to both uni-processors and multi-core processors without too many changes in the software architectures.

An Implementation of Efficient Quicksort Utilizing SIMD-Based VBP Technique (SIMD 기반의 VBP 기법을 적용한 효율적인 퀵정렬의 구현)

  • Hong, Gilseok;Kim, Hongyeon;Kang, Seonghyeon;Min, Jun-Ki
    • KIISE Transactions on Computing Practices
    • /
    • v.23 no.8
    • /
    • pp.498-503
    • /
    • 2017
  • SIMD (Single Instruction Multiple Data) is a representative parallelization architecture that processes multiple data loaded in a SIMD register with a single instruction. Quicksort is a sorting algorithm that picks an element as a pivot from the array and reorders the array such that all elements having the values less than the pivot value are located in the left side on the pivot as well as all elements having the value greater than the pivot value are located in the right side on the pivot and then the algorithm performs the same task on both sublist recursively. In this paper, we propose an efficient Quicksort algorithm applying the SIMD instructions which minimally invokes conditional branches to avoid the performance degradation incurred by branch misprediction in a pipeline architecture. In addition, we improve the performance of the Quicksort algorithm by fetching data into a SIMD register as a byte unit to apply VBP (Vertical Bit Parallel) and the early pruning technique.

Acceleration of FFT on a SIMD Processor (SIMD 구조를 갖는 프로세서에서 FFT 연산 가속화)

  • Lee, Juyeong;Hong, Yong-Guen;Lee, Hyunseok
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.52 no.2
    • /
    • pp.97-105
    • /
    • 2015
  • This paper discusses the implementation of Bruun's FFT on a SIMD processor. FFT is an algorithm used in digital signal processing area and its effective processing is important in the enhancement of signal processing performance. Bruun's FFT algorithm is one of fast Fourier transform algorithms based on recursive factorization. Compared to popular Cooley-Tukey algorithm, it is advantageous in computations because most of its operations are based on real number multiplications instead of complex ones. However it shows more complicated data alignment patterns and requires a larger memory for storing coefficient data in its implementation on a SIMD processor. According to our experiment result, in the processing of the FFT with 1024 complex input data on a SIMD processor, The Bruun's algorithm shows approximately 1.2 times higher throughput but uses approximately 4 times more memory (20 Kbyte) than the Cooley-Tukey algorithm. Therefore, in the case with loose constraints on silicon area, the Bruun's algorithm is proper for the processing of FFT on a SIMD processor.

SIMD Instruction-based Fast HEVC RExt Decoder (SIMD 명령어 기반 HEVC RExt 복호화기 고속화)

  • Mok, Jung-Soo;Ahn, Yong-Jo;Ryu, Hochan;Sim, Donggyu
    • Journal of Broadcast Engineering
    • /
    • v.20 no.2
    • /
    • pp.224-237
    • /
    • 2015
  • In this paper, we introduce the fast decoding method with the SIMD (Single Instruction Multiple Data) instructions for HEVC RExt (High Efficiency Video Coding Range Extensions). Several tools of HEVC RExt such as intra prediction, interpolation, inverse-quantization, inverse-transform, and clipping modules can be classified as the proper modules for applying the SIMD instructions. In consideration of bit-depth increasement of RExt, intra prediction, interpolation, inverse-quantization, inverse-transform, and clipping modules are accelerated by SSE (Streaming SIMD Extension) instructions. In addition, we propose effective implementations for interpolation filter, inverse-quantization, and clipping modules by utilizing a set of AVX2 (Advanced Vector eXtension 2) instructions that can use 256 bits register. The evaluation of the proposed methods were performed on the private HEVC RExt decoder developed based on HM 16.0. The experimental results show that the developed RExt decoder reduces 12% average decoding time, compared with the conventional sequential method.

The Design of low-cost SIMD MAC/MAS for Embedded Systems (임베디드 시스템을 위한 저비용 SIMD MAC/MAS 블록 설계)

  • Lee Yong Joo;Jung Jin Woo;Lee Yong Surk
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.29 no.10C
    • /
    • pp.1460-1468
    • /
    • 2004
  • In this paper, we developed a low-area and low-cost SIMD MAC/MAS(Single Instruction Multiple Data Multiply and ACcumulate/Multiply And Subtract) for multimedia that is used much in real life. We compared the result of this research with a previously developed more large and high performance SIMD MAC/MAS. This paper is consist of 5 parts, which are an introduction, the contents of designing SIMD MAC/MAS hardware, a special qualities for previous works, the result of synthesis and conclusion. The design result reduced by size 32% of whole hardware than 64 bit SIMD MAC/MAS block of designed for high performance. This improved ISA (Instruction Set Architecture) to be suitable to embedded DSP(Digital Signal Processor), and shortened bit range of 64-bit data to 32-bit and implement more optimally.

Fast implementation of Hadamard transformation of HEVC with SIMD (SIMD를 이용한 HEVC 하다마드 트랜스폼의 고속 구현)

  • You, Jong-Hun;Jo, Hyun-Ho;Sim, Dong-Gyu
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2011.11a
    • /
    • pp.307-309
    • /
    • 2011
  • 본 논문에서는 SIMD(Single Instruction Multiple Data) 프로세서를 사용한 HEVC 부호화기의 하다마트 트랜스폼 고속화를 제안한다. 본 논문에서는 MMX와 SSE 레지스터를 사용하여 하다마드 트랜스폼을 SIMD 연산으로 대체함으로써 메모리 접근 횟수와 명령어의 수를 줄여 부호화기를 고속화하였다. 또한, HEVC의 10비트 입력에 따른 SIMD 구조의 비효율적인 구현을 해결하기 위하여 하다마드 트랜스폼의 입력 픽셀 비트수를 감소시키는 IBDD(Internal Bit Depth Decreasing)를 제안했다. HEVC 부호화기에 하다마드 트랜스폼을 SIMD 연산으로 대체한 결과 부호화 효율의 저하 없이, 부호화기의 수행 시간은 10% 감소되었다.

  • PDF

Parallel Factorization using Quadratic Sieve Algorithm on SIMD machines (SIMD상에서의 이차선별법을 사용한 병렬 소인수분해 알고리즘)

  • Kim, Yang-Hee
    • The KIPS Transactions:PartA
    • /
    • v.8A no.1
    • /
    • pp.36-41
    • /
    • 2001
  • In this paper, we first design an parallel quadratic sieve algorithm for factoring method. We then present parallel factoring algorithm for factoring a large odd integer by repeatedly using the parallel quadratic sieve algorithm based on the divide-and-conquer strategy on SIMD machines with DMM. We show that this algorithm is optimal in view of the product of time and processor numbers.

  • PDF

An Efficient 4$\times$4 Integer Transform Algorithm on SIMD (SIMD 기반의 효율적인 4$\times$4 정수변환 방법)

  • 유상준;오승준;안창범
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2004.10a
    • /
    • pp.55-57
    • /
    • 2004
  • DCT(Discrete Cosine Transform)는 현존하는 블록기반 영상 압축 코딩기법의 핵심이 되는 부분이다. 많은 고속 방법이 제안되었으며, 최근 들어 SIMD 병렬구조를 이용한 고속방법들이 제안되고 있다. 본 논문에서는 SIMD명령어를 가지는 프로세서에서 4$\times$4 정수변환의 속도를 최적화하기 위한 알고리즘을 제안한다. 본 논문에서 제안하는 알고리즘은 128비트 SIMD영령어로 확장이 가능하며 비슷한 구조를 가지는 Hadamard 변환에서 적용할 수 있다. 제안하는 방법을 펜티엄4 2.4G에서 구현할 경우 H.264 참조 부호화기의 4$\times$4 정수변환 방법보다 64비트 SIMD 명령어를 사용할 경우 4.34배 128-bit SIMD 명령어를 사용할 경우 6.77배의 성능을 얻을 수 있다.

  • PDF

SIMD Optimization for Improving the Performance of a CPU-Based Graph Engine (SIMD 최적화를 이용한 CPU 기반 그래프 엔진의 성능 개선)

  • Ikhyeon Jo;Myung-Hwan Jang;Sang-Wook Kim
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.05a
    • /
    • pp.383-385
    • /
    • 2023
  • Single-machine-based 그래프 엔진의 state-of-the-art 모델인 RealGraph 는 쓰레드를 이용한 병렬화로 성능을 향상하였으나 쓰레드 내부에서의 병렬성은 고려되지 않았다. 본 논문은 SIMD 명령어를 이용해 RealGraph 의 병렬성을 향상시켰다. 쓰레드 내부의 효율성을 높이기 위해 RealGraph 의 구조와 그래프 알고리즘의 분석을 통한 SIMD 명령어의 적용 가능한 영역을 탐색하였다. 실험으로 SIMD 명령어의 적용을 통해 쓰레드 내부에서 벡터 연산을 수행하여 평균 7.6%, 11.7%, 9.2%의 수행 시간 단축을 이끌어냈으며 SIMD 명령어의 적용이 그래프 엔진의 분석 성능에 얼마나 도움이 될 수 있는지 확인하였다.