• Title/Summary/Keyword: unified memory

Search Result 52, Processing Time 0.026 seconds

Implementation of Integrated CPU-GPU for Efficient Uniform Memory Access Method and Verification System (CPU-GPU간 긴밀성을 위한 효율적인 공유메모리 접근 방법과 검증 시스템 구현)

  • Park, Hyun-moon;Kwon, Jinsan;Hwang, Tae-ho;Kim, Dong-Sun
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.11 no.2
    • /
    • pp.57-65
    • /
    • 2016
  • In this paper, we propose a system for efficient use of shared memory between CPU and GPU. The system, called Fusion Architecture, assures consistency of the shared memory and minimizes cache misses that frequently occurs on Heterogeneous System Architecture or Unified Virtual Memory based systems. It also maximizes the performance for memory intensive jobs by efficient allocation of GPU cores. To test between architectures on various scenarios, we introduce the Fusion Architecture Analyzer, which compares OpenMP, OpenCL, CUDA, and the proposed architecture in terms of memory overhead and process time. As a result, Proposed fusion architectures show that the Fusion Architecture runs benchmarks 55% faster and reduces memory overheads by 220% in average.

Thermomechanical Behaviors of Shape Memory Alloy Using Finite Element Analysis (유한요소해석을 이용한 형상기억합금의 열적/기계적 거동 연구)

  • ;Scott R. White
    • Proceedings of the Korean Society of Precision Engineering Conference
    • /
    • 2001.04a
    • /
    • pp.833-836
    • /
    • 2001
  • The thermomechanical behaviors of the shape memory alloy were conducted through the finite element analysis of ABAQUS with UMAT user subroutine. The unified thermomechanical constitutive equation suggested by Lagoudas was adapted into the UMAT user subroutine to investigate the characteristics of the shape memory alloy. The three cases were solved to investigate the thermomechanical characteristics of the shape memory alloy. The material properties for the analysis were obtained by DSC and DMA techniques. According to the results, the thermomechanical characteristics, such as a shape memory effect and a pseudoelastic effect, could be obtained through the finite element analysis and the analysis results were revealed to agree well with the experimental results. Therefore, the finite element analysis using UMAT user subroutine is one of prominent analysis techniques to investigate the thermomechnical behaviors of the shape memory alloy quantitatively.

  • PDF

Unified Dual-Gate Phase Change RAM (PCRAM) with Phase Change Memory and Capacitor-Less DRAM (Phase Change Memory와 Capacitor-Less DRAM을 사용한 Unified Dual-Gate Phase Change RAM)

  • Kim, Jooyeon
    • Journal of the Korean Institute of Electrical and Electronic Material Engineers
    • /
    • v.27 no.2
    • /
    • pp.76-80
    • /
    • 2014
  • Dual-gate PCRAM which unify capacitor-less DRAM and NVM using a PCM instead of a typical SONOS flash memory is proposed as 1 transistor. $VO_2$ changes its phase between insulator and metal states by temperature and field. The front-gate and back-gate control NVM and DRAM, respectively. The feasibility of URAM is investigated through simulation using c-interpreter and finite element analysis. Threshold voltage of NVM is 0.5 V that is based on measured results from previous fabricated 1TPCM with $VO_2$. Current sensing margin of DRAM is 3 ${\mu}A$. PCM does not interfere with DRAM in the memory characteristics unlike SONOS NVM. This novel unified dual-gate PCRAM reported in this work has 1 transistor, a low RESET/SET voltage, a fast write/erase time and a small cell so that it could be suitable for future production of URAM.

Estimation of long memory parameter in nonparametric regression

  • Cho, Yeoyoung;Baek, Changryong
    • Communications for Statistical Applications and Methods
    • /
    • v.26 no.6
    • /
    • pp.611-622
    • /
    • 2019
  • This paper considers the estimation of the long memory parameter in nonparametric regression with strongly correlated errors. The key idea is to minimize a unified mean squared error of long memory parameter to select both kernel bandwidth and the number of frequencies used in exact local Whittle estimation. A unified mean squared error framework is more natural because it provides both goodness of fit and measure of strong dependence. The block bootstrap is applied to evaluate the mean squared error. Finite sample performance using Monte Carlo simulations shows the closest performance to the oracle. The proposed method outperforms existing methods especially when dependency and sample size increase. The proposed method is also illustreated to the volatility of exchange rate between Korean Won for US dollar.

A Performance Study on CPU-GPU Data Transfers of Unified Memory Device (통합메모리 장치에서 CPU-GPU 데이터 전송성능 연구)

  • Kwon, Oh-Kyoung;Gu, Gibeom
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.11 no.5
    • /
    • pp.133-138
    • /
    • 2022
  • Recently, as GPU performance has improved in HPC and artificial intelligence, its use is becoming more common, but GPU programming is still a big obstacle in terms of productivity. In particular, due to the difficulty of managing host memory and GPU memory separately, research is being actively conducted in terms of convenience and performance, and various CPU-GPU memory transfer programming methods are suggested. Meanwhile, recently many SoC (System on a Chip) products such as Apple M1 and NVIDIA Tegra that bundle CPU, GPU, and integrated memory into one large silicon package are emerging. In this study, data between CPU and GPU devices are used in such an integrated memory device and performance-related research is conducted during transmission. It shows different characteristics from the existing environment in which the host memory and GPU memory in the CPU are separated. Here, we want to compare performance by CPU-GPU data transmission method in NVIDIA SoC chips, which are integrated memory devices, and NVIDIA SMX-based V100 GPU devices. For the experimental workload for performance comparison, a two-dimensional matrix transposition example frequently used in HPC applications was used. We analyzed the following performance factors: the difference in GPU kernel performance according to the CPU-GPU memory transfer method for each GPU device, the transfer performance difference between page-locked memory and pageable memory, overall performance comparison, and performance comparison by workload size. Through this experiment, it was confirmed that the NVIDIA Xavier can maximize the benefits of integrated memory in the SoC chip by supporting I/O cache consistency.

Evaluation of the Data Migration between CPU Memory and GPU Memory for a NVIDIA Pascal GPU Using Unified Memory (통합 메모리를 사용하는 NVIDIA 파스칼 GPU에서의 CPU 메모리와 GPU 메모리 간 데이터 통신 분석)

  • Shin, Philkyue;Hong, Seongsoo
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2018.07a
    • /
    • pp.7-10
    • /
    • 2018
  • 통합 메모리는 CPU 메모리와 GPU 메모리 간의 데이터 통신을 개발자에게 투명하게 내재적으로 수행하는 소프트웨어 런타임 환경으로 개발자에게 CPU 메모리와 GPU 메모리가 통합된 하나의 메모리로 보이게 해준다. 통합 메모리는 장점에도 불구하고 아직 널리 사용되지 못하고 있는데 그 이유는 내재적으로 수행되는 데이터 통신의 오버헤드가 큰 것으로 알려져 있기 때문이다. 하지만 이 데이터 통신이 구체적으로 어떻게 이루어지고 오버헤드는 어떻게 발생하는지 분석한 연구는 아직 존재하지 않는다. 우리는 NVIDIA 사의 최신 GPU 마이크로아키텍처 중 하나인 파스칼을 사용하는 GPU를 대상으로 하여, 통합 메모리를 사용할 시 데이터 통신이 이루어지는 조건과 GPU 응용의 수행시간에 데이터 통신이 끼치는 영향을 실험을 통해 분석한다. 실험 결과 통합 메모리의 오버헤드는 두 가지 원인 때문에 발생한다. 첫째, 통합 메모리를 사용하면 CPU 또는 GPU가 데이터에 접근할 때마다 이 데이터는 CPU 또는 GPU 메모리로 옮겨지고 옮겨진 데이터는 제거된다. 따라서 재사용할 데이터도 제거되어 추가적인 데이터 통신이 발생하고, 이 데이터 통신의 지연시간은 GPU 응용의 수행시간에 더해진다. 둘째, 통합 메모리를 사용하면 데이터 통신과 커널들이 서로 다른 스트림에 할당되어도 동시에 수행되지 못한다. 따라서 GPU 응용의 수행시간은 동시에 수행되던 데이터 통신과 커널의 수행시간만큼 증가한다.

  • PDF

A Development of Fusion Processor Architecture for Efficient Main Memory Access in CPU-GPU Environment (CPU-GPU환경에서 효율적인 메인메모리 접근을 위한 융합 프로세서 구조 개발)

  • Park, Hyun-Moon;Kwon, Jin-San;Hwang, Tae-Ho;Kim, Dong-Sun
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.11 no.2
    • /
    • pp.151-158
    • /
    • 2016
  • The HSA resolves an old problem with existing CPU and GPU architectures by allowing both units to directly access each other's memory pools via unified virtual memory. In a physically realized system, however, frequent data exchanges between CPU and GPU for a virtual memory block result bottlenecks and coherence request overheads. In this paper, we propose Fusion Processor Architecture for efficient access of main memory from both CPU and GPU. It consists of Job Manager, Re-mapper, and Pre-fetcher to control, organize, and distribute work loads and working areas for GPU cores. These components help on reducing memory exchanges between the two processors and improving overall efficiency by eliminating faulty page table requests. To verify proposed algorithm architectures, we develop an emulator based on QEMU, and compare several architectures such as CUDA(Compute Unified Device Architecture), OpenMP, OpenCL. As a result, Proposed fusion processor architectures show 198% faster than others by removing unnecessary memory copies and cache-miss overheads.

Design and Implementation of Unified Index for Moving Objects Databases (이동체 데이타베이스를 위한 통합 색인의 설계 및 구현)

  • Park Jae-Kwan;An Kyung-Hwan;Jung Ji-Won;Hong Bong-Hee
    • Journal of KIISE:Databases
    • /
    • v.33 no.3
    • /
    • pp.271-281
    • /
    • 2006
  • Recently the need for Location-Based Service (LBS) has increased due to the development and widespread use of the mobile devices (e.g., PDAs, cellular phones, labtop computers, GPS, and RFID etc). The core technology of LBS is a moving-objects database that stores and manages the positions of moving objects. To search for information quickly, the database needs to contain an index that supports both real-time position tracking and management of large numbers of updates. As a result, the index requires a structure operating in the main memory for real-time processing and requires a technique to migrate part of the index from the main memory to disk storage (or from disk storage to the main memory) to manage large volumes of data. To satisfy these requirements, this paper suggests a unified index scheme unifying the main memory and the disk as well as migration policies for migrating part of the index from the memory to the disk during a restriction in memory space. Migration policy determines a group of nodes, called the migration subtree, and migrates the group as a unit to reduce disk I/O. This method takes advantage of bulk operations and dynamic clustering. The unified index is created by applying various migration policies. This paper measures and compares the performance of the migration policies using experimental evaluation.

A Phenomenological Constitutive Model for Pseudoelastic Shape Memory Alloy (의탄성 형상기억합금에 대한 현상학적 구성모델)

  • Ho, Kwang-Soo
    • Transactions of Materials Processing
    • /
    • v.19 no.8
    • /
    • pp.468-473
    • /
    • 2010
  • Shape memory alloys (SMAs) have the ability to recover their original shape upon thermo-mechanical loading even after large inelastic deformation. The unique feature is known as pseudoelasticity and shape memory effect caused by the crystalline structural transformation between two solid-state phases called austenite and martensite. To support the engineering application, a number of constitutive models, which can be formally classified into either micromechanics-based or phenomenological model, have been developed. Most of the constitutive models include a kinetic law governing the crystallographic transformation. The present work presents a one-dimensional, phenomenological constitutive model for SMAs in the context of the unified viscoplasticity theory. The proposed model does not incorporate the complex mechanisms of phase transformation. Instead, the effects induced by the transformation are depicted through the growth law for the back stress that is an internal state variable of the model.

The Construction and Viterbi Decoding of New (2k, k, l) Convolutional Codes

  • Peng, Wanquan;Zhang, Chengchang
    • Journal of Information Processing Systems
    • /
    • v.10 no.1
    • /
    • pp.69-80
    • /
    • 2014
  • The free distance of (n, k, l) convolutional codes has some connection with the memory length, which depends on not only l but also on k. To efficiently obtain a large memory length, we have constructed a new class of (2k, k, l) convolutional codes by (2k, k) block codes and (2, 1, l) convolutional codes, and its encoder and generation function are also given in this paper. With the help of some matrix modules, we designed a single structure Viterbi decoder with a parallel capability, obtained a unified and efficient decoding model for (2k, k, l) convolutional codes, and then give a description of the decoding process in detail. By observing the survivor path memory in a matrix viewer, and testing the role of the max module, we implemented a simulation with (2k, k, l) convolutional codes. The results show that many of them are better than conventional (2, 1, l) convolutional codes.