• Title/Summary/Keyword: OpenCL


Automatic Optimization Methods for Image Processing Programs Using OpenCL (OpenCL을 이용한 이미지 처리 프로그램의 자동 최적화 방법)

  • Shin, Jaeho;Jo, Gangwon;Lee, Ilkoo;Lee, Jaejin
    • KIISE Transactions on Computing Practices / v.23 no.3 / pp.188-193 / 2017
  • In this paper, we propose automatic OpenCL optimization techniques that offer the best performance for image processing programs on any hardware system. To achieve the best performance, developers must find a suitable parallelization strategy and an appropriate work-group size for the architecture of the target compute device. However, testing every potential device to find them is both time-consuming and costly. Our techniques automatically set up hardware-optimized parallelization and find a suitable work-group size for the target device. Furthermore, using OpenCL does not always provide better performance in image processing. Hence, we also propose a way to automatically search for a threshold image size that lets image processing programs decide whether or not to use OpenCL. Our findings demonstrate that our techniques improve image processing performance significantly.

Parallelization of Feature Detection and Panorama Image Generation using OpenCL and Embedded GPU (OpenCL 및 Embedded GPU를 이용한 영상 특징 추출 및 파노라마 영상 생성의 병렬화)

  • Kang, Seung Heon;Lee, Seung-Jae;Lee, Man Hee;Park, In Kyu
    • Journal of Broadcast Engineering / v.19 no.3 / pp.316-328 / 2014
  • In this paper, we parallelize the popular feature detection algorithms SIFT and SURF, and their application to fast panoramic image generation, on the latest embedded GPU. The parallelized algorithms are implemented using the recently developed OpenCL as the embedded GPGPU software platform. We compare the implementation efficiency and speed of the conventional OpenGL Shading Language (GLSL) and OpenCL. Experimental results show that the OpenCL implementation has performance comparable to GLSL. Compared with the embedded CPU in the same application processor, the embedded GPU runs 3 to 4 times faster. As an example of using feature extraction, panorama image synthesis is performed on the embedded GPU by applying image matching using the detected features.

Performance Enhancement and Evaluation of AES Cryptography using OpenCL on Embedded GPGPU (OpenCL을 이용한 임베디드 GPGPU환경에서의 AES 암호화 성능 개선과 평가)

  • Lee, Minhak;Kang, Woochul
    • KIISE Transactions on Computing Practices / v.22 no.7 / pp.303-309 / 2016
  • Recently, an increasing number of embedded processors, such as ARM Mali, have begun to support GPGPU programming frameworks such as OpenCL. Thus, GPGPU technologies that have been used in PC and server environments are beginning to be applied to embedded systems. However, many embedded systems have different architectural characteristics compared to traditional PCs, and low power consumption and real-time performance are also important performance metrics in these systems. In this paper, we implement a parallel AES cryptographic algorithm for a modern embedded GPU using OpenCL, a standard parallel computing framework, and compare its performance against various baselines. Experimental results show that the parallel GPU AES implementation can reduce the response time by about 1/150 and the energy consumption by approximately 1/290 compared to the OpenMP implementation when 1000 KB of input data is applied. Furthermore, an additional 100% performance improvement of the parallel AES algorithm was achieved by exploiting the characteristics of embedded GPUs, such as eliminating data copies between GPU and host memory. Our results also demonstrate that a higher performance improvement can be achieved with larger input data.

Parallel LDPC Decoder for CMMB on CPU and GPU Using OpenCL (OpenCL을 활용한 CPU와 GPU 에서의 CMMB LDPC 복호기 병렬화)

  • Park, Joo-Yul;Hong, Jung-Hyun;Chung, Ki-Seok
    • IEMEK Journal of Embedded Systems and Applications / v.11 no.6 / pp.325-334 / 2016
  • Recently, Open Computing Language (OpenCL) has been proposed to provide a framework that supports heterogeneous computing platforms. By using an OpenCL framework, digital communication systems can support various protocols in a unified computing environment to achieve both high portability and high performance. This article introduces a parallel software decoder of Low Density Parity Check (LDPC) codes for China Multimedia Mobile Broadcasting (CMMB) on a heterogeneous platform. Each step of LDPC decoding has different parallelization characteristics. In the proposed decoder, steps suitable for task-level parallelization are executed on the CPU, and steps suitable for data-level parallelization are processed by the GPU. To improve the performance of the proposed OpenCL kernels for LDPC decoding operations, explicit thread scheduling, loop unrolling, and effective data transfer techniques are applied. The proposed LDPC decoder achieves high performance by using heterogeneous multi-core processors on a unified computing framework.

Mobile Advanced Driver Assistance System using OpenCL : Pedestrian Detection (OpenCL을 이용한 모바일 ADAS : 보행자 검출)

  • Kim, Jong-Hee;Lee, Chung-Su;Kim, Hakil
    • Journal of the Institute of Electronics and Information Engineers / v.51 no.10 / pp.190-196 / 2014
  • This paper proposes a mobile-optimized pedestrian detection method using a cascade of HOG (Histograms of Oriented Gradients) for ADAS (Advanced Driver Assistance Systems) on smartphones. To use the limited resources of mobile platforms efficiently, the method is implemented with the OpenCL (Open Computing Language) library, and its processing time is reduced in two ways. First, the host code sets specific program build options and adjusts the work-group size according to the kernel. Second, the kernel code uses local memory and a LUT (Look-Up Table) to accelerate the program. For performance evaluation, the developed algorithm is compared with the mobile CPU-based function of OpenCV (Open Source Computer Vision) for Android. The experimental results show that the processing speed is 25% faster than the OpenCV hogcascade.

Spark Framework Based on a Heterogenous Pipeline Computing with OpenCL (OpenCL을 활용한 이기종 파이프라인 컴퓨팅 기반 Spark 프레임워크)

  • Kim, Daehee;Park, Neungsoo
    • The Transactions of The Korean Institute of Electrical Engineers / v.67 no.2 / pp.270-276 / 2018
  • Apache Spark is one of the high-performance in-memory computing frameworks for big-data processing. Recently, to improve its performance, general-purpose computing on graphics processing units (GPGPU) has been applied to the Apache Spark framework. Previous Spark-GPGPU frameworks focus on overcoming the implementation difficulty that results from the difference between the computation environments of GPGPU and the Spark framework. In this paper, we propose a Spark framework based on heterogeneous pipeline computing with OpenCL to further improve the performance. The proposed framework overlaps the CPU's Java-to-native memory copies with CPU-GPU communications (DMA) and GPU kernel computations to hide the CPU idle time. In addition, the CPU-GPU communication buffers are implemented as switching dual buffers, which shrinks the mapped memory region and thereby reduces memory-mapping overhead. Experimental results show that the proposed Spark framework based on heterogeneous pipeline computing with OpenCL is up to 2.13 times faster than the previous Spark framework using OpenCL.

Parallel LDPC Decoding on a Heterogeneous Platform using OpenCL

  • Hong, Jung-Hyun;Park, Joo-Yul;Chung, Ki-Seok
    • KSII Transactions on Internet and Information Systems (TIIS) / v.10 no.6 / pp.2648-2668 / 2016
  • Modern mobile devices are equipped with various accelerated processing units to handle computationally intensive applications; therefore, Open Computing Language (OpenCL) has been proposed to fully take advantage of the computational power in heterogeneous systems. This article introduces a parallel software decoder of Low Density Parity Check (LDPC) codes on an embedded heterogeneous platform using an OpenCL framework. The LDPC code is one of the most popular and strongest error correcting codes for mobile communication systems. Each step of LDPC decoding has different parallelization characteristics. In the proposed LDPC decoder, steps suitable for task-level parallelization are executed on the multi-core central processing unit (CPU), and steps suitable for data-level parallelization are processed by the graphics processing unit (GPU). To improve the performance of OpenCL kernels for LDPC decoding operations, explicit thread scheduling, vectorization, and effective data transfer techniques are applied. The proposed LDPC decoder achieves high performance and high power efficiency by using heterogeneous multi-core processors on a unified computing framework.

Parallel String Matching and Optimization Using OpenCL on FPGA (FPGA 상에서 OpenCL을 이용한 병렬 문자열 매칭 구현과 최적화 방향)

  • Yoon, Jin Myung;Choi, Kang-Il;Kim, Hyun Jin
    • The Transactions of The Korean Institute of Electrical Engineers / v.66 no.1 / pp.100-106 / 2017
  • In this paper, we propose a parallel optimization method for the Aho-Corasick (AC) algorithm and the Parallel Failureless Aho-Corasick (PFAC) algorithm using Open Computing Language (OpenCL) on a Field Programmable Gate Array (FPGA). The low throughput of a string matching engine degrades the performance of network processing. Recently, many researchers have studied string matching engines that use parallel computing. FPGA vendors offer a parallel computing platform using OpenCL. In this paper, we apply the AC and PFAC algorithms on a DE1-SoC board with a Cyclone V FPGA, where optimizations that consider the FPGA architecture are performed. Experiments are performed considering global id, local id, local memory, and loop unrolling optimizations using the PFAC algorithm. The performance improvement using loop unrolling is 129 times greater than that of the AC algorithm that does not adopt loop unrolling. The performance improvements using loop unrolling are 1.1, 0.2, and 1.5 times greater than those using the global id, local id, and local memory optimizations, respectively.

Implementing Efficient Camera ISP Filters on GPGPUs Using OpenCL (GPGPU 기반의 효율적인 카메라 ISP 구현)

  • Park, Jongtae;Facchini, Beron;Hong, Jingun;Burgstaller, Bernd
    • Proceedings of the Korea Information Processing Society Conference / 2010.11a / pp.1784-1787 / 2010
  • General Purpose Graphic Processing Unit (GPGPU) computing is a technique that utilizes the high-performance many-core processors of high-end graphic cards for general-purpose computations such as 3D graphics, video/image processing, computer vision, scientific computing, HPC and many more. GPGPUs offer a vast amount of raw computing power, but programming is extremely challenging because of hardware idiosyncrasies. The open computing language (OpenCL) has been proposed as a vendor-independent GPGPU programming interface. OpenCL is very close to the hardware and thus does little to increase GPGPU programmability. In this paper, we present how a set of digital camera image signal processing (ISP) filters can be realized efficiently on GPGPUs using OpenCL. Although we found ISP filters to be memory-bound computations, our GPGPU implementations achieve speedups of up to a factor of 64.8 over their sequential counterparts. On GPGPUs, our proposed optimizations achieved speedups between 145% and 275% over their baseline GPGPU implementations. Our experiments were conducted on a GeForce GTX 275; because of OpenCL, we expect our optimizations to be applicable to other architectures as well.

Implementation of Neural Network Accelerator for Rendering Noise Reduction on OpenCL (OpenCL을 이용한 랜더링 노이즈 제거를 위한 뉴럴 네트워크 가속기 구현)

  • Nam, Kihun
    • The Journal of the Convergence on Culture Technology / v.4 no.4 / pp.373-377 / 2018
  • In this paper, we propose an implementation of a neural network accelerator for reducing rendering noise using OpenCL. Among rendering algorithms, we select ray tracing to ensure high-quality graphics. Ray tracing renders with rays: using fewer rays results in noise, while using more rays produces a higher-quality image but takes longer. To reduce the operation time while using fewer rays, a learning-based filtering algorithm using a neural network was applied, but it does not always produce an optimal result. In this paper, we present a new approach to matrix multiplication, based on General Matrix Multiplication, for improved performance. As the development environment, we used OpenCL, which specializes in high-speed parallel processing. The proposed architecture was verified on a Kintex UltraScale XKU6909T-2FDFG1157C FPGA board. The time taken to calculate the parameters is about 1.12 times faster than that of the Verilog-HDL structure.