• Title/Summary/Keyword: prefetcher


Gated Recurrent Unit based Prefetching for Graph Processing (그래프 프로세싱을 위한 GRU 기반 프리페칭)

  • Shivani Jadhav;Farman Ullah;Jeong Eun Nah;Su-Kyung Yoon
    • Journal of the Semiconductor & Display Technology / v.22 no.2 / pp.6-10 / 2023
  • Data that is likely to be accessed can be predicted and stored in the cache ahead of time to prevent cache misses, reducing the processor's request and wait times; the processor can then work without stalling, hiding memory latency. Exploiting the temporal and spatial locality of memory accesses, a prefetcher predicts which memory address will be accessed next. We propose a prefetcher that applies the GRU model, which is well suited to handling time-series data. The currently accessed address is represented in binary and used as training data, and the Gated Recurrent Unit model is trained on the difference (delta) between consecutive memory accesses. Using the learned memory access patterns, the proposed data prefetcher then predicts the memory address to be accessed next. Compared against a multi-layer perceptron, our prefetcher showed better results.

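As a rough illustration of the approach in the abstract above (not the authors' code; the bit width, hidden size, and training loop are assumed), the following sketch trains a GRU on binary-encoded deltas between consecutive addresses and uses it to predict the next address:

```python
# Minimal sketch: learn the delta between consecutive memory addresses with
# a GRU, feeding each delta as a fixed-width binary vector. All sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

BITS = 16          # assumed width of the binary delta encoding
HIDDEN = 64        # assumed GRU hidden size
SEQ = 8            # assumed history window length

def to_bits(delta: int) -> torch.Tensor:
    """Encode a (wrapped) delta as a BITS-wide binary vector, LSB first."""
    delta &= (1 << BITS) - 1
    return torch.tensor([(delta >> i) & 1 for i in range(BITS)],
                        dtype=torch.float32)

class DeltaGRU(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(BITS, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, BITS)   # logits for the next delta's bits

    def forward(self, x):                     # x: (batch, seq, BITS)
        out, _ = self.gru(x)
        return self.head(out[:, -1])

# Toy training data: a strided access pattern with a constant 64-byte delta.
addrs = [0x1000 + 64 * i for i in range(128)]
deltas = [b - a for a, b in zip(addrs, addrs[1:])]
bits = torch.stack([to_bits(d) for d in deltas])   # (len, BITS)

model = DeltaGRU()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(50):
    for i in range(len(bits) - SEQ - 1):
        x = bits[i:i + SEQ].unsqueeze(0)      # history window
        y = bits[i + SEQ].unsqueeze(0)        # next delta to predict
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# Prefetch address = last address + predicted delta (bits thresholded at 0.5).
pred = (torch.sigmoid(model(bits[-SEQ:].unsqueeze(0))) > 0.5).int().squeeze()
delta_hat = sum(int(b) << i for i, b in enumerate(pred))
print(hex(addrs[-1] + delta_hat))
```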

A Cache Controller to Maximize Effectiveness of Hierarchical Memory Architecture (계층적 메모리 구조의 효과를 극대화하는 캐시 제어기)

  • Uh Bong Yong;Ju Young Kwan;Cheon Joong Nam;Kim Suk Il
    • Journal of KIISE:Computer Systems and Theory / v.32 no.11_12 / pp.608-616 / 2005
  • A cache architecture is proposed that invokes prefetching on a level-1 cache miss, whereas existing structures prefetch only on a level-2 cache miss. In the proposed architecture, a level-1 cache miss selects a demand-fetch block and a prefetch block from the level-2 cache and stores them in the level-1 cache and the prefetch cache, respectively. In an experimental analysis using 11 benchmark programs, the hierarchical cache architecture that employs both a level-1 and a level-2 cache prefetcher achieved up to 19% higher performance than an architecture that employs only a level-2 cache prefetcher.
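A minimal block-level sketch of the fetch policy described above (LRU caches and next-block prefetch are assumptions; this is not the paper's simulator):

```python
# On an L1 miss, fetch the demand block from L2 into L1 and the next
# sequential block into a small prefetch buffer beside L1.
from collections import OrderedDict

class Cache:
    def __init__(self, capacity):
        self.capacity, self.blocks = capacity, OrderedDict()
    def lookup(self, blk):
        if blk in self.blocks:
            self.blocks.move_to_end(blk)     # LRU update
            return True
        return False
    def insert(self, blk):
        self.blocks[blk] = True
        self.blocks.move_to_end(blk)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict LRU

l1, l2, pbuf = Cache(64), Cache(1024), Cache(8)

def access(blk):
    if l1.lookup(blk):
        return "L1 hit"
    if pbuf.lookup(blk):                     # an earlier prefetch paid off
        pbuf.blocks.pop(blk, None)           # promote the block into L1
        l1.insert(blk)
        pbuf.insert(blk + 1)                 # keep the stream running
        return "prefetch-buffer hit"
    # L1 miss: demand block goes to L1, next block to the prefetch buffer.
    l2.insert(blk)                           # model the L2 fill
    l1.insert(blk)
    pbuf.insert(blk + 1)
    return "L1 miss (prefetch issued)"

for b in range(10):
    print(b, access(b))                      # a sequential run mostly hits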

An L1 Cache Prefetching Scheme using Excessively Aggressive Prefetching and a Small Direct-mapped Filtering Cache (공격적인 선인출 및 직접 사상 필터링을 이용한 L1 캐시 선인출 기법)

  • Chon, Young-Suk
    • Journal of KIISE:Computer Systems and Theory / v.33 no.11 / pp.836-852 / 2006
  • This paper proposes an L1 cache prefetch scheme that uses an excessively aggressive hardware prefetcher together with a hardware prefetch filter built around a small direct-mapped filtering cache. A quantitative analysis method is introduced and applied to analyze the non-ideal effects of aggressive cache prefetching. From the analysis results, the structure and algorithm of the prefetch filter are derived and simulated, and overall system performance is measured with a cycle-by-cycle cache simulator. Experimental results show that the proposed scheme improves overall system performance by 18% on average across several benchmarks.
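A minimal sketch of the filtering idea (the prefetch degree, filter size, and suppression rule are assumptions, not the paper's exact design): an aggressive next-N prefetcher paired with a small direct-mapped filter that drops requests it has seen recently, cutting duplicate or useless prefetch traffic.

```python
DEGREE = 4            # assumed prefetch degree ("excessively aggressive")
FILTER_SIZE = 32      # assumed number of direct-mapped filter entries

filter_tags = [None] * FILTER_SIZE

def filtered(blk):
    """Direct-mapped filter: suppress a prefetch whose tag is still resident."""
    idx = blk % FILTER_SIZE
    if filter_tags[idx] == blk:
        return True                # recently issued: drop this request
    filter_tags[idx] = blk         # remember it for next time
    return False

def prefetch_candidates(miss_blk):
    """On a miss, aggressively request the next DEGREE blocks, then filter."""
    wanted = range(miss_blk + 1, miss_blk + 1 + DEGREE)
    return [b for b in wanted if not filtered(b)]

print(prefetch_candidates(100))    # [101, 102, 103, 104]
print(prefetch_candidates(100))    # [] -- all four suppressed by the filter
```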

A VLSI implementation of 32-bit RISC embedded controller (내장형 32비트 RISC 콘트롤러의 VLSI 구현)

  • 이문기;최병윤;이승호
    • Journal of the Korean Institute of Telematics and Electronics A / v.31A no.10 / pp.141-151 / 1994
  • This paper describes the design and implementation of a RISC processor for embedded control systems. The processor integrates a register file, a pipelined execution unit, an FPU interface, a memory interface, and an instruction prefetcher. It executes most instructions in a single cycle under a two-phase 20 MHz clock, and its vectored interrupt handling bounds the worst-case interrupt latency at 7 cycles, making it suitable for real-time processing systems. For efficient handling of multi-cycle instructions, a data-stationary hardwired control scheme equipped with a cycle counter is used. The chip integrates about 139K transistors and occupies 9.1 mm × 9.1 mm in a 1.0 um DLM CMOS technology, dissipating 0.8 W from a 5 V supply at 20 MHz.

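A behavioral sketch of the cycle-counter control mentioned above (per-op latencies and the stall rule are assumptions; the real design is hardwired logic, not software): the decoded control word travels with the instruction, and a counter holds issue until a multi-cycle operation retires.

```python
LATENCY = {"add": 1, "load": 2, "mul": 4}   # assumed per-op cycle counts

def run(program):
    cycle = 0
    for op in program:
        counter = LATENCY[op]               # load the cycle counter at decode
        while counter > 0:                  # stall issue until the op retires
            cycle += 1
            counter -= 1
        print(f"{op:>4} retired at cycle {cycle}")

run(["add", "load", "mul", "add"])
```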

Dynamical Polynomial Regression Prefetcher for DRAM-PCM Hybrid Main Memory (DRAM-PCM 하이브리드 메인 메모리에 대한 동적 다항식 회귀 프리페처)

  • Zhang, Mengzhao;Kim, Jung-Geun;Kim, Shin-Dug
    • Proceedings of the Korea Information Processing Society Conference / 2020.11a / pp.20-23 / 2020
  • This research designs an effective prefetching method for DRAM-PCM hybrid main memory systems, particularly those used for big-data applications and massive-scale computing environments. Conventional prefetchers perform well on regular memory access patterns, but workloads such as graph processing exhibit extremely irregular access characteristics and cannot be prefetched accurately. This research therefore proposes an efficient dynamic prefetching algorithm based on regression. We design an intelligent prefetch engine that identifies the characteristics of memory access sequences, performs linear or polynomial regression analysis depending on those characteristics, and dynamically determines the number of pages to prefetch. We also present a DRAM-PCM hybrid memory structure that reduces energy cost and mitigates the thermal problem of conventional DRAM memory systems. Experimental results show a 40% performance increase over the conventional DRAM memory structure.
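A minimal sketch of the escalating-regression idea (the residual tolerance, degree ladder, and lookahead are assumptions, not the paper's parameters): fit the recent address sequence with linear regression first, escalate to polynomial regression when the linear fit is poor, and predict the next addresses from the fit.

```python
import numpy as np

def predict_next(addrs, lookahead=4, tol=1.0):
    x = np.arange(len(addrs))
    y = np.asarray(addrs, dtype=float)
    for degree in (1, 2, 3):                       # escalate degree as needed
        coeffs = np.polyfit(x, y, degree)
        resid = np.abs(np.polyval(coeffs, x) - y).max()
        if resid <= tol:                           # fit is good enough
            break
    nxt = np.polyval(coeffs, np.arange(len(addrs), len(addrs) + lookahead))
    return degree, [int(round(a)) for a in nxt]

# A quadratic access pattern (e.g., a triangular traversal) defeats a linear
# stride predictor but is captured exactly by the degree-2 fit.
pattern = [i * i for i in range(10)]
print(predict_next(pattern))    # (2, [100, 121, 144, 169])
```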

Prefetching Framework for General Workloads Using Breakpoint (브레이크포인트를 이용한 범용 워크로드 프리페칭 프레임워크)

  • Ko, Kwangjin;Ryu, Junhee;Kang, Kyungtae;Shin, Heonshik
    • Journal of KIISE / v.41 no.10 / pp.832-837 / 2014
  • Application loading speed can be improved by prefetching, in a timely fashion, the disk blocks an application is likely to need. However, existing prefetchers, unless specialized to a particular application, incur high overheads and are poor at identifying the blocks that will actually be required. Blocks may be needed in many different orders, and even when two access sequences are identical, block traces and access timings can be affected significantly by the state of the buffer cache. We propose a new application-independent, software-based prefetching technique in which breakpoints are inserted at appropriate places in an application to collect information on correlations between blocks, and the correlated blocks are then prefetched ahead of schedule. Experiments on an HDD-based desktop PC demonstrated an average 30% reduction in application launch time and a 15% reduction in general I/O, while keeping wasted overhead low.
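A minimal sketch of the run-time half of such a framework (the block numbers, breakpoint identifiers, and file are hypothetical; Linux-only): a profiling pass would record which disk blocks follow each breakpoint, and the breakpoint handler then asks the kernel to read those blocks ahead via posix_fadvise(POSIX_FADV_WILLNEED).

```python
import os

BLOCK = 4096
correlations = {                     # breakpoint id -> blocks seen after it
    "bp_main_entry": [10, 11, 12, 57],
    "bp_plugin_load": [200, 201, 340],
}

def on_breakpoint(bp_id, fd):
    """Called when execution reaches an instrumented point."""
    for blk in correlations.get(bp_id, []):
        # A hint, not a blocking read: the kernel fetches the block into the
        # page cache while the application keeps running.
        os.posix_fadvise(fd, blk * BLOCK, BLOCK, os.POSIX_FADV_WILLNEED)

with open("app.dat", "rb") as f:     # hypothetical data file
    on_breakpoint("bp_main_entry", f.fileno())
```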

Instructions and Data Prefetch Mechanism using Displacement History Buffer (변위 히스토리 버퍼를 이용한 명령어 및 데이터 프리페치 기법)

  • Jeong, Yong Su;Kim, JinHyuk;Cho, Tae Hwan;Choi, SangBang
    • Journal of the Institute of Electronics and Information Engineers / v.52 no.10 / pp.82-94 / 2015
  • In this paper, we propose a hardware prefetch mechanism, paired with an efficient cache replacement policy, that gives priority to the trigger block of a spatial region and constructs the region using displacement fields. Because each history record is keyed by its trigger block, program ordering is taken into account; and because the history is stored as displacement values, instruction and data addresses can be prefetched quickly by adding the stored displacements to the trigger address. We also propose a replacement policy that, once trigger blocks have been prioritized, evicts at random from among the low-priority blocks when the cache is full. We assessed the prefetcher's performance using the gem5 memory simulator and the PARSEC benchmarks. Compared to an existing hardware prefetcher that generates spatial regions with a bit vector, the L1 data cache miss rate was reduced by about 44.5% on average and the L1 instruction cache miss rate by about 26.1%, while IPC (instructions per cycle) improved by about 23.7% on average.
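A minimal sketch of the displacement-history mechanism as described (table organization and field sizes are assumptions): the history buffer keys on a trigger block and stores the displacements of blocks touched after it; on the next trigger hit, prefetch addresses are formed by adding each stored displacement to the trigger address.

```python
history = {}                 # trigger block -> set of observed displacements

def record_region(trigger, accessed_blocks):
    """Training: remember each access as a displacement from the trigger."""
    history.setdefault(trigger, set()).update(
        b - trigger for b in accessed_blocks if b != trigger)

def prefetch_on_trigger(trigger):
    """Prediction: trigger address + stored displacement = prefetch address."""
    return sorted(trigger + d for d in history.get(trigger, ()))

record_region(0x40, [0x41, 0x43, 0x44])      # first visit to the region
print([hex(a) for a in prefetch_on_trigger(0x40)])
# ['0x41', '0x43', '0x44'] -- issued when the trigger block is touched again
```

Storing displacements rather than raw addresses is what lets one history entry cover every later visit to the region, wherever it lands in memory.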

Implementation of Hardware Data Prefetcher Adaptable for Various State-of-the-Art Workload (다양한 최신 워크로드에 적용 가능한 하드웨어 데이터 프리페처 구현)

  • Kim, KangHee;Park, TaeShin;Song, KyungHwan;Yoon, DongSung;Choi, SangBang
    • Journal of the Institute of Electronics and Information Engineers / v.53 no.12 / pp.20-35 / 2016
  • In this paper, to reduce the delay and area of the partial product accumulation (PPA) of a parallel decimal multiplier, a tree architecture composed of multi-operand decimal CSAs and an improved CLA is proposed. The proposed tree reduces the partial products quickly using multi-operand CSAs. Because the input range of each CSA's recoder is limited, the CSA logic can be kept as simple as possible. In addition, using multi-operand decimal CSAs to add decimal numbers whose ranges are limited at specific positions of the architecture reduces the partial products efficiently, and the final BCD result is obtained faster by improving the logic of the decimal CLA. To evaluate the proposed partial product accumulation, it was synthesized using Design Compiler with a 180 nm CMOS technology library. Synthesis results show that the delay of the proposed partial product accumulation is reduced by 15.6% and its area by 16.2% compared with the general method; the total delay and area are still reduced even though the delay and area of the CLA increase.
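As a rough numeric illustration of the decimal carry-save reduction the abstract describes (digit layout assumed; this models one reduction step, not the paper's hardware), three decimal digit vectors are reduced to a sum vector and a carry vector with no carry propagation across positions:

```python
def decimal_csa(a, b, c):
    """Reduce three decimal digit vectors (least significant digit first)
    to a carry-save pair: per-position sums plus shifted carries."""
    sums, carries = [], [0]            # each carry enters the next-higher digit
    for da, db, dc in zip(a, b, c):
        t = da + db + dc               # 0..27 per digit column
        sums.append(t % 10)
        carries.append(t // 10)
    return sums, carries

# 493 + 278 + 659: digit columns reduce independently, carries shift left.
s, c = decimal_csa([3, 9, 4], [8, 7, 2], [9, 5, 6])
total = sum(d * 10**i for i, d in enumerate(s)) + \
        sum(d * 10**i for i, d in enumerate(c))
print(s, c, total)                     # [0, 1, 2] [0, 2, 2, 1] 1430
```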