Search | Korea Science

Parallel Multithreaded Processing for Data Set Summarization on Multicore CPUs

Ordonez, Carlos;Navas, Mario;Garcia-Alvarado, Carlos
- Journal of Computing Science and Engineering, v.5 no.2, pp.111-120, 2011
Data mining algorithms should exploit new hardware technologies to accelerate computations. Such a goal is difficult to achieve in a database management system (DBMS) because of its complex internal subsystems and because data mining numeric computations on large data sets are difficult to optimize. This paper explores taking advantage of the existing multithreaded capabilities of multicore CPUs, as well as caching in RAM, to efficiently compute summaries of a large data set, a fundamental data mining problem. We introduce parallel algorithms working on multiple threads, which overcome the row aggregation processing bottleneck of accessing secondary storage while maintaining linear time complexity with respect to data set size. Our proposal is based on a combination of table scans and parallel multithreaded processing among multiple cores in the CPU. We introduce several database-style and hardware-level optimizations: caching row blocks of the input table, managing available RAM, interleaving I/O and CPU processing, and tuning the number of working threads. We experimentally benchmark our algorithms with large data sets on a DBMS running on a computer with a multicore CPU. We show that our algorithms outperform existing DBMS mechanisms in computing aggregations of multidimensional data summaries, especially as dimensionality grows. Furthermore, we show that local memory allocation (RAM block size) does not have a significant impact when the thread management algorithm distributes the workload among a fixed number of threads. Our proposal is unique in the sense that we do not modify or require access to the DBMS source code; instead, we extend the DBMS with analytic functionality by developing User-Defined Functions.
https://doi.org/10.5626/JCSE.2011.5.2.111
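
As a rough illustration of the idea in the abstract above (threads each summarizing a block of rows, with partial results merged at the end), here is a minimal Python sketch. It is not the authors' implementation: the choice of summaries (row count, per-dimension sums, and per-dimension sums of squares), the function names, and the `block_size` and `threads` parameters are all assumptions made for this example, and Python threads only illustrate the structure rather than deliver true multicore speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_block(rows):
    """Summarize one non-empty block of numeric rows.

    Returns (n, sums, sums_of_squares); these three summaries are an
    assumption chosen for illustration, not taken from the abstract.
    """
    d = len(rows[0])
    n = 0
    sums = [0.0] * d
    squares = [0.0] * d
    for row in rows:
        n += 1
        for j, x in enumerate(row):
            sums[j] += x
            squares[j] += x * x
    return n, sums, squares


def merge(a, b):
    """Combine two partial summaries; the merge is a plain element-wise add."""
    return (
        a[0] + b[0],
        [x + y for x, y in zip(a[1], b[1])],
        [x + y for x, y in zip(a[2], b[2])],
    )


def summarize(rows, block_size=2, threads=4):
    """Split a non-empty table into row blocks and summarize them on a thread pool."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(summarize_block, blocks))
    total = partials[0]
    for partial in partials[1:]:
        total = merge(total, partial)
    return total


data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
n, sums, squares = summarize(data)
print(n, sums, squares)
```

Each row is visited once and the merge cost depends only on the number of blocks and dimensions, so the sketch stays linear in the number of rows, in line with the linear time complexity the abstract mentions.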

Odysseus/IR: a High-Performance ORDBMS Tightly-Coupled with IR Features (오디세우스/IR: 정보 검색 기능과 밀결합된 고성능 객체 관계형 DBMS)

Whang Kyu-Young;Lee Min-Jae;Lee Jae-Gil;Kim Min-Soo;Han Wook-Shin
- Journal of KIISE:Computing Practices and Letters, v.11 no.3, pp.209-215, 2005
Conventional ORDBMS vendors provide extension mechanisms for adding user-defined types and functions to their own DBMSs. Here, the extension mechanisms are implemented using a high-level interface. We call this technique loose-coupling. The advantage of loose-coupling is that it is easy to implement. However, it is not preferable for implementing new data types and operations in large databases when high performance is required. In this paper, we propose to use the notion of tight-coupling to satisfy this requirement. In tight-coupling, new data types and operations are integrated into the core of the DBMS engine. Thus, they are supported in a consistent manner with high performance. This tight-coupling architecture is being used to incorporate information retrieval (IR) features and spatial database features into the Odysseus/IR ORDBMS, which has been under development at KAIST/AITrc. In this paper, we introduce Odysseus/IR and explain its tightly-coupled IR features (U.S. patented). We then demonstrate a web search engine that is capable of managing 20 million web pages in a non-parallel configuration using Odysseus/IR.

Cache Performance Analysis of Multiprocessor Systems for OLTP Applications based on a Memory-Resident DBMS (메모리 상주 DBMS 기반의 OLTP 응용을 위한 다중프로세서 시스템 캐쉬 성능 분석)

Chung, Yong-Wha;Hahn, Woo-Jong;Yoon, Suk-Han;Park, Jin-Won;Lee, Kang-Woo;Kim, Yang-Woo
- Journal of KIISE:Computing Practices and Letters, v.6 no.4, pp.383-392, 2000
Currently, multiprocessors are evaluated almost exclusively with scientific applications. Commercial applications are rarely explored because it is difficult to obtain the source code of a commercial DBMS. Even when the source code is available, as it is for POSTGRES, understanding it well enough to perform detailed, meaningful performance evaluations is a daunting task for computer architects. To evaluate multiprocessors with commercial applications, we have developed our own DBMS, called EZDB. EZDB is a parallelized DBMS, loosely inspired by POSTGRES, that runs on top of a software architecture simulator. It is capable of executing parallel programs written in SQL. Unlike POSTGRES, EZDB is not intended as a prototype for a production-quality DBMS. Its purpose is to make it easy to run commercial applications on multiprocessor architectures and evaluate their performance. To illustrate the usefulness of EZDB, we present the cache performance data collected for the TPC-B benchmark on a shared-memory multiprocessor. The simulation results show that the data structures exhibit unique sharing characteristics and that their locality properties and working sets are very different from those in scientific applications.

Odysseus/Parallel-OOSQL: A Parallel Search Engine using the Odysseus DBMS Tightly-Coupled with IR Capability (오디세우스/Parallel-OOSQL: 오디세우스 정보검색용 밀결합 DBMS를 사용한 병렬 정보 검색 엔진)

Ryu, Jae-Joon;Whang, Kyu-Young;Lee, Jae-Gil;Kwon, Hyuk-Yoon;Kim, Yi-Reun;Heo, Jun-Suk;Lee, Ki-Hoon
- Journal of KIISE:Computing Practices and Letters, v.14 no.4, pp.412-429, 2008
As the number of electronic documents increases rapidly with the growth of the Internet, parallel search engines capable of handling a large number of documents are becoming ever more important. To implement a parallel search engine, we need to partition the inverted index and search through the partitioned index in parallel. There are two methods of partitioning the inverted index: 1) document-identifier based partitioning and 2) keyword-identifier based partitioning. However, each method alone has drawbacks. The former is convenient for inserting documents and has high throughput, but has poor performance for top-k query processing. The latter has good performance for top-k query processing, but is inconvenient for inserting documents and has low throughput. In this paper, we propose a hybrid partitioning method to compensate for the drawbacks of each method. We design and implement a parallel search engine that supports the hybrid partitioning method using the Odysseus DBMS tightly coupled with information retrieval capability. We first introduce the architecture of the parallel search engine, Odysseus/Parallel-OOSQL. We then show the effectiveness of the proposed system through systematic experiments. The experimental results show that the query processing time of the document-identifier based partitioning method is approximately inversely proportional to the number of blocks in the partition of the inverted index. The results also show that the keyword-identifier based partitioning method has good performance in top-k query processing. The proposed parallel search engine can be optimized for performance by customizing the methods of partitioning the inverted index according to the application environment. The Odysseus/Parallel-OOSQL parallel search engine is capable of indexing, storing, and querying 100 million web documents per node, or tens of billions of web documents for the entire system.
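
The two partitioning methods named in the abstract above can be sketched on a toy inverted index as follows. This is only an illustrative Python sketch under stated assumptions, not the Odysseus implementation: the function names, the modulo-based document placement, and the byte-sum keyword placement in `shard_of` are all hypothetical choices made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index: keyword -> sorted list of document ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in sorted(set(docs[doc_id].split())):
            index[word].append(doc_id)
    return dict(index)


def shard_of(word, k):
    """Deterministic keyword-to-partition mapping (a hypothetical choice)."""
    return sum(word.encode()) % k


def partition_by_document(docs, k):
    """Document-identifier based: each partition indexes its own documents."""
    groups = [dict() for _ in range(k)]
    for doc_id, text in docs.items():
        groups[doc_id % k][doc_id] = text
    return [build_index(group) for group in groups]


def partition_by_keyword(docs, k):
    """Keyword-identifier based: each partition holds whole posting lists."""
    parts = [dict() for _ in range(k)]
    for word, postings in build_index(docs).items():
        parts[shard_of(word, k)][word] = postings
    return parts


def search_by_document(parts, word):
    """Every partition must be asked, and the partial results merged."""
    return sorted(d for part in parts for d in part.get(word, []))


def search_by_keyword(parts, word):
    """Only the partition that owns the keyword is asked."""
    return parts[shard_of(word, len(parts))].get(word, [])


docs = {0: "parallel search", 1: "search engine",
        2: "parallel engine", 3: "index search"}
doc_parts = partition_by_document(docs, 2)
kw_parts = partition_by_keyword(docs, 2)
print(search_by_document(doc_parts, "search"))
print(search_by_keyword(kw_parts, "search"))
```

In this sketch, inserting a document touches a single partition under document-identifier based partitioning but may touch several under keyword-identifier based partitioning, while a single-keyword lookup needs every partition in the first scheme and only one in the second. That is one way to read the trade-off the abstract describes.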

Design and Implementation of Distributed In-Memory DBMS-based Parallel K-Means as In-database Analytics Function (분산 인 메모리 DBMS 기반 병렬 K-Means의 In-database 분석 함수로의 설계와 구현)

Kou, Heymo;Nam, Changmin;Lee, Woohyun;Lee, Yongjae;Kim, HyoungJoo
- KIISE Transactions on Computing Practices, v.24 no.3, pp.105-112, 2018
As data size increases, a single database is not enough to serve the current volume of tasks. Since data is partitioned and stored across multiple databases, analysis should also support parallelism in order to increase efficiency. However, traditional analysis requires data to be transferred out of the database to the nodes where the analytic service runs, and the user is required to know both the database and the analytic framework. In this paper, we propose an efficient way to perform the K-means clustering algorithm inside a distributed column-based database and a relational database. We also suggest an efficient way to optimize the K-means algorithm within a relational database.
https://doi.org/10.5626/KTCP.2018.24.3.105

Cost Model for Parallel Spatial Joins using Fixed Grids (고정 그리드를 이용한 병렬 공간 조인을 위한 비용 모델)

Kim, Jin-Deog;Hong, Bong-Hee
- Journal of KIISE:Databases, v.28 no.4, pp.665-676, 2001
The most expensive spatial operation in a spatial database is the spatial join, which computes a combined table in which each tuple consists of two tuples from the two tables that satisfy a spatial predicate. Although the execution time of sequential spatial join processing has been considerably improved so far, the response time is still not tolerable because it does not meet the requirements of interactive users. It is usually appropriate to use parallel processing to improve the performance of spatial join processing. In a spatial database, fixed grids, which consist of regularly partitioned cells, can be employed; however, previous work on spatial joins has not studied the parallel processing of spatial joins using fixed grids. This paper presents an analytical cost model that estimates the comparative performance of a parallel spatial join algorithm based on fixed grids in terms of the number of MBR comparisons, disk accesses, and message passing. Several experiments on synthetic and real data sets show that the proposed analytical model is very accurate. This cost model is also expected to be used for implementing a very important DBMS component, the query processing optimizer.

Query Reorganization Scheme supporting Parallel Query Processing of Theta Join and Nested SQL on Distributed CUBRID (분산 CUBIRD 상에서 세타 조인 및 중첩 SQL 병렬 질의처리를 지원하는 질의 재구성 기법)

Yang, Hyeon-Sik;Kim, Hyeong-Jin;Chang, Jae-Woo
- Proceedings of the Korea Contents Association Conference, 2014.11a, pp.37-38, 2014
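With the recent growth of social networking services (SNS), the volume of data has increased sharply, and research on distributed DBMS-based query processing for big data is accordingly being actively pursued. To this end, CUBRID provides the CUBRID Shard service, which horizontally partitions a database into shards and stores the data across different physical nodes. However, because CUBRID Shard manages the data of each shard independently, it cannot process queries that need to reference tables on multiple servers, such as theta joins and nested queries. This paper therefore proposes a query reorganization scheme that supports theta joins and nested SQL on distributed CUBRID.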

A Method to Process Spatial Information in Parallel Spatial DBMS (병렬 공간데이터베이스 시스템에서 공간 정보 처리 방안)

Kim, JinDeog
- Proceedings of the Korean Institute of Information and Communication Sciences Conference, 2016.05a, pp.811-812, 2016
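Spatial information has recently become difficult to process in conventional spatial database systems because of the volume produced, the frequency with which data is generated, and its diversity. Attempts to link spatial information with big data are therefore being actively made. However, there is little research on efficient spatial operations based on single-assignment and multiple-assignment indexes. This paper presents factors to consider when processing the spatial join, one of the most expensive spatial operations, in a big data system. Specifically, it describes a single-assignment spatial indexing scheme for task allocation in a MapReduce system and presents factors to consider for load balancing given the highly skewed distribution of spatial information. It also describes the relationship between two problems in parallel spatial database systems such as MapReduce: the skewed data distribution problem and the boundary-overlap indexing problem.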

A Cache Consistency Control for B-Tree Indices in a Database Sharing System (데이타베이스 공유 시스템에서 B-트리 인덱스를 위한 캐쉬 일관성 제어)

On, Gyeong-O;Jo, Haeng-Rae
- The KIPS Transactions:PartD, v.8D no.5, pp.593-604, 2001
A database sharing system (DSS) refers to a system for high-performance transaction processing. In the DSS, the processing nodes are coupled via a high-speed network and share a common database at the disk level. Each node has a local memory and a separate copy of the operating system. To reduce the number of disk accesses, the node caches data pages and index pages in its memory buffer. In general, B-tree index pages are accessed more often, and thus cached at more processing nodes, than their corresponding data pages. There are also complicated operations in the B-tree such as Fetch, Fetch Next, Insertion, and Deletion. Therefore, an efficient cache consistency scheme supporting a high level of concurrency is required. In this paper, we propose cache consistency schemes using identifiers of index pages and the page_LSN of the leaf page. The proposed schemes can improve the system throughput by reducing the required message traffic between nodes and index re-traversal.

Design of an OMNeT++ based Parallel Simulator for a Bio-Inspired System and Its Performance on PC-Clusters (생태계 모방 시스템을 위한 OMNeT++ 기반 병렬 시뮬레이터의 설계 및 PC 클러스터 상에서의 성능 분석)

Moon, Joo-Sun;Nang, Jong-Ho
- Journal of KIISE:Computer Systems and Theory, v.34 no.9, pp.416-424, 2007
The Bio-Inspired system is a computing model that emulates objects in an ecosystem that evolve and cooperate with each other to perform tasks. Since it can be used to solve complex problems that have been very difficult to resolve with previous algorithms, there has been a lot of research on developing applications based on the Bio-Inspired system. However, since this computing model requires a process of evolution and cooperation among many objects, and this process takes a lot of time, it has been very hard to develop applications based on it. This paper presents a parallel simulator for a Bio-Inspired system that is designed and implemented with OMNeT++ on PC clusters, and demonstrates its usefulness by showing its simulation performance for a couple of applications. In the proposed parallel simulator, the functions required in the ERS platform for evolution and cooperation between objects (called Ecogent) are mapped onto the functions of OMNeT++, and they are simulated on PC clusters simultaneously to reduce the total simulation time. The simulation results can be monitored with a GUI in real time, and they are also recorded in a DBMS for systematic analysis afterward. This paper shows the usefulness of the proposed system by analyzing its performance in simulating various applications based on the Bio-Inspired system on a PC cluster with 4 PCs.

