• Title/Summary/Keyword: Uniform Memory Access

Implementation of Integrated CPU-GPU for Efficient Uniform Memory Access Method and Verification System (CPU-GPU간 긴밀성을 위한 효율적인 공유메모리 접근 방법과 검증 시스템 구현)

  • Park, Hyun-moon;Kwon, Jinsan;Hwang, Tae-ho;Kim, Dong-Sun
    • IEMEK Journal of Embedded Systems and Applications / v.11 no.2 / pp.57-65 / 2016
  • In this paper, we propose a system for efficient use of shared memory between CPU and GPU. The system, called Fusion Architecture, assures consistency of the shared memory and minimizes the cache misses that frequently occur on Heterogeneous System Architecture or Unified Virtual Memory based systems. It also maximizes performance for memory-intensive jobs through efficient allocation of GPU cores. To compare architectures across various scenarios, we introduce the Fusion Architecture Analyzer, which compares OpenMP, OpenCL, CUDA, and the proposed architecture in terms of memory overhead and processing time. The results show that the Fusion Architecture runs benchmarks 55% faster and reduces memory overhead by 220% on average.

Efficient Processing of Grouped Aggregation on Non-Uniformed Memory Access Architecture (비균등 메모리 접근 구조에서의 효율적인 그룹화 집단 연산의 처리)

  • Choe, Seongjun;Min, Jun-Ki
    • Database Research / v.34 no.3 / pp.14-27 / 2018
  • Recently, Non-Uniform Memory Access (NUMA) architecture was proposed to alleviate the memory bottleneck problem that occurs in Symmetric Multiprocessing (SMP) architecture. Because the aggregation operator is an important operator that provides properties and summaries of data, its efficiency is crucial to the overall performance of a system. In this paper, we therefore propose an efficient aggregation processing technique for NUMA architecture. Our technique consists of a partition phase and a merge phase. In the partition phase, the target relation is partitioned into several partial relations according to the grouping attribute. Since each thread can then process the aggregation operator on its partial relation independently, we prevent remote memory access during the merge phase. Furthermore, in the merge phase, we improve aggregation performance by letting each thread compute the aggregation with a local hash table and by avoiding lock contention when merging the aggregation results generated by all threads into one.
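
The partition-then-aggregate flow described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`partition`, `aggregate_partition`, `grouped_sum`), the choice of SUM as the aggregate, and the use of Python threads are all assumptions, and Python threads are not pinned to NUMA nodes, so the sketch shows only the data flow (disjoint partitions, private hash tables, lock-free merge).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_parts):
    """Partition phase: route each (group_key, value) row by the hash of its grouping attribute."""
    parts = [[] for _ in range(num_parts)]
    for key, value in rows:
        parts[hash(key) % num_parts].append((key, value))
    return parts


def aggregate_partition(part):
    """Sum values per group using a hash table private to this worker (no locks needed)."""
    local = defaultdict(int)
    for key, value in part:
        local[key] += value
    return dict(local)


def grouped_sum(rows, num_threads=4):
    """Hypothetical driver: partition the rows, aggregate each partition, then merge."""
    parts = partition(rows, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(aggregate_partition, parts))
    # Every group key lands in exactly one partition, so the partial results
    # have disjoint keys and merging is a plain union without contention.
    result = {}
    for partial in partials:
        result.update(partial)
    return result
```

Calling `grouped_sum([("a", 1), ("b", 2), ("a", 3)])` returns one total per group key. Because the partitions are disjoint by construction, no worker ever touches another worker's hash table.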

MBS-LVM: A High-Performance Logical Volume Manager for Memory Bus-Connected Storages over NUMA Servers

  • Lee, Yongseob;Park, Sungyong
    • Journal of Information Processing Systems / v.15 no.1 / pp.151-158 / 2019
  • With the recent advances of memory technologies, high-performance non-volatile memories such as non-volatile dual in-line memory module (NVDIMM) have begun to be used as an addition or an alternative to server-side storages. When these memory bus-connected storages (MBSs) are installed over non-uniform memory access (NUMA) servers, the distance between NUMA nodes and MBSs is one of the crucial factors that influence file processing performance, because the access latency of a NUMA system varies depending on its distance from the NUMA nodes. This paper presents the design and implementation of a high-performance logical volume manager for MBSs, called MBS-LVM, when multiple MBSs are scattered over a NUMA server. The MBS-LVM consolidates the address space of each MBS into a single global address space and dynamically utilizes storage spaces such that each thread can access an MBS with the lowest latency possible. We implemented the MBS-LVM in the Linux kernel and evaluated its performance by porting it over the tmpfs, a memory-based file system widely used in Linux. The results of the benchmarking show that the write performance of the tmpfs using MBS-LVM has been improved by up to twenty times against the original tmpfs over a NUMA server with four nodes.
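
The idea of steering each thread toward the lowest-latency MBS can be illustrated with a small sketch. Everything here is hypothetical: the function `pick_device`, the device table, and the distance matrix (in the style of the node distances reported by NUMA tooling, where a smaller number means closer) are assumptions made for illustration, and the abstract does not describe MBS-LVM's actual allocation policy at this level of detail.

```python
def pick_device(cpu_node, devices, distance):
    """Return the id of the closest device that still has free space.

    cpu_node: NUMA node the calling thread runs on.
    devices:  dict mapping device id -> (home_node, free_blocks).
    distance: square matrix, distance[a][b] = relative cost from node a to node b.
    """
    candidates = [
        (distance[cpu_node][home_node], dev_id)
        for dev_id, (home_node, free_blocks) in devices.items()
        if free_blocks > 0
    ]
    if not candidates:
        raise RuntimeError("no free space on any device")
    # min() compares distance first, then device id as a deterministic tie-break.
    return min(candidates)[1]


# Example setup (assumed values): two nodes, one device attached to each.
DISTANCE = [[10, 20], [20, 10]]
DEVICES = {"mbs0": (0, 3), "mbs1": (1, 5)}
```

With this setup, a thread on node 0 is directed to `mbs0` and a thread on node 1 to `mbs1`. If the local device is full, the selection falls back to the next-closest device rather than failing.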

Concurrent Hash Table Optimized for NUMA System (NUMA 시스템에 최적화된 병렬 해시 테이블)

  • Choi, JaeYong;Jung, NaiHoon
    • Journal of Korea Game Society / v.20 no.5 / pp.89-98 / 2020
  • In MMO game servers, NUMA (Non-Uniform Memory Access) architecture is generally used to achieve high performance. Furthermore, such servers normally use hash tables as an internal data structure, since they offer constant time complexity for insert, delete, and search operations. In this study, we propose a concurrent hash table optimized for NUMA systems to improve the performance of MMO game servers. We tested our hash table on a 4-socket NUMA system, where it shows up to 100% speedup over another high-performance hash table.

Cost-effective multistage interconnection network for NUMA model system (NUMA(non-uniform memory access) 모델 시스템을 위한 cost-effective한 다단계 상호연결망)

  • 최창훈;김성천
    • Journal of the Korean Institute of Telematics and Electronics C / v.34C no.5 / pp.19-32 / 1997
  • So far, multiple-path MINs that provide redundant paths in traditional UPP MINs have been realized by adding hardware such as extra stages, duplicated data links, or multiple copies of the MIN. Traditional MINs also do not exploit locality: communication between any processor-memory pair takes the same amount of time, and there has been little progress in exploiting locality of reference in MINs. In this paper, we present a new MIN topology, the hybrid MIN, which is constructed with 2N-3 SEs, far fewer than traditional MINs. Despite this, the hybrid MIN satisfies full access capability (FAC) and has redundant paths (although it provides only a single path to the 2 memory modules of each processor). Moreover, the hybrid MIN provides a shortcut path between pairs that communicate frequently (locality of reference). Its performance under varying degrees of localized communication is analyzed.

Performance Comparison of Synchronization Methods for CC-NUMA Systems (CC-NUMA 시스템에서의 동기화 기법에 대한 성능 비교)

  • Moon, Eui-Sun;Jhang, Seong-Tae;Jhon, Chu-Shik
    • Journal of KIISE:Computer Systems and Theory / v.27 no.4 / pp.394-400 / 2000
  • The main goal of synchronization is to guarantee exclusive access to shared data and critical sections so that parallel programs work correctly and reliably. Because exclusive access restricts parallelism, efficient synchronization is essential to achieving high performance in shared-memory parallel programs. Many techniques that exploit features of systems and applications have been devised for efficient synchronization. This paper presents simulation results showing that existing synchronization methods are inefficient on CC-NUMA (Cache Coherent Non-Uniform Memory Access) systems, and compares them with Freeze&Melt synchronization, which removes these inefficiencies. The simulation results show that Test-and-Test&Set synchronization suffers from inefficiency caused by broadcast operations, and that Queue-On-Lock-Bit (QOLB) synchronization is inefficient because of its predefined order of executing critical sections. Freeze&Melt synchronization, which removes these inefficiencies, gains performance by decreasing the waiting time to enter a critical section and the execution time of a critical section, and by reducing traffic between clusters.

A Remote Cache Coherence Protocol for Single Shared Memory in Multiprocessor System (단일 공유 메모리를 가지는 다중 프로세서 시스템의 원격 캐시 일관성 유지 프로토콜)

  • Kim, Seong-Woon;Kim, Bo-Gwan
    • Journal of the Institute of Electronics Engineers of Korea CI / v.42 no.6 / pp.19-28 / 2005
  • The multiprocessor architecture is a good way to improve computer system performance. CC-NUMA, which provides a single shared space over physically distributed memories, is widely used in multiprocessor computer systems. A CC-NUMA system has a full-mapped directory for the shared memory and uses a remote cache memory for fast memory access. In this paper, we propose a processing node architecture for a CC-NUMA system and a cache coherency protocol for the physically distributed but logically shared system. We show an implementation result of a system that adopts the cache coherency protocol.

Page replication mechanism using adjustable DELAY counter in NUMA multiprocessors (NUMA 다중처리기에서 조정가능한 지연 카운터를 이용한 페이지 복사 기법)

  • 이종우;조유곤
    • Journal of the Korean Institute of Telematics and Electronics B / v.33B no.6 / pp.23-33 / 1996
  • The exploitation of locality of reference in shared-memory NUMA multiprocessors is one of the important problems in parallel processing today. In this paper, we propose a revised hardware reference counter to help the operating system manage locality. In contrast to the previous one, the value of the counter can be adjusted dynamically and periodically to adapt the page replication policy to the various memory reference patterns of processors. We use execution-driven simulation of real applications to evaluate the effectiveness of our adjustable DELAY counter. Our main conclusion is that, with the adjustable DELAY counter, the normalized average memory access costs and their variance become smaller for most applications than with the previous counter, and more robust memory management policies can be provided for operating systems.

A study of workload consolidation considering NUMA affinity (NUMA affinity를 고려한 Workload Consolidation 연구)

  • Seo, Dongyou;Kim, Shin-gye;Choi, Chanho;Eom, Hyeonsang;Yeom, Heon Y.
    • Proceedings of the Korea Information Processing Society Conference / 2012.11a / pp.204-206 / 2012
  • Symmetric Multi-Processing (SMP) has limited scalability because it uses a shared memory bus. Non-Uniform Memory Access (NUMA) was proposed to overcome this scalability limit. In NUMA, each CPU has its own local memory bus, so accessing its own memory region has lower latency than accessing other regions. Local memory regions improve scalability, but in a server virtualization environment, when VMs are scheduled dynamically and a VM's pages do not reside in the local memory of the core on which it runs, remote accesses make performance worse than with local access. In this paper, we study, on the recent AMD Bulldozer architecture in a server virtualization environment, the performance degradation that occurs when NUMA affinity is violated, and the situations in which violating NUMA affinity causes no performance degradation.

J-Tree: An Efficient Index using User Searching Patterns for Large Scale Data (J-tree : 사용자의 검색패턴을 이용한 대용량 데이타를 위한 효율적인 색인)

  • Jang, Su-Min;Seo, Kwang-Seok;Yoo, Jae-Soo
    • Journal of KIISE:Databases / v.36 no.1 / pp.44-49 / 2009
  • In recent years, with the development of portable terminals, various searching services on large data have been provided in portable terminals. In order to search large data, most applications for information retrieval use indexes such as B-trees or R-trees. However, only a small portion of the data set is accessed by users, and the access frequencies of each data are not uniform. The existing indexes such as B-trees or R-trees do not consider the properties of the skewed access patterns. And a cache stores the frequently accessed data for fast access in memory. But the size of memory used in the cache is restricted. In this paper, we propose a new index based on disk, called J-tree, which considers user's search patterns. The proposed index is a balanced tree which guarantees uniform searching time on all data. It also supports fast searching time on the frequently accessed data. Our experiments show the effectiveness of our proposed index under various settings.