• Title, Summary, Keyword: Hadoop

Search Result 325, Processing Time 0.038 seconds

Design and Implementation of a Monitor for Hadoop Cluster (Hadoop 클러스터를 위한 모니터의 설계 및 구현)

  • Keum, Tae-Hoon;Lee, Won-Joo;Jeon, Chang-Ho
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.49 no.1
    • /
    • pp.8-15
    • /
    • 2012
  • In this paper, we propose a new monitor for collecting job information from Hadoop clusters in real time. This monitor is made of two programs called Collector and Agent. Agent collects Hadoop cluster's node information and job information, and Collector analyzes the collected information and saves it in a database. Also, Collector was placed in a new node outside the Hadoop cluster so that it does not affect Hadoop's work and will not cause overload. When the proposed monitor was implemented and applied, the testbed cluster was able to detect the occurrence of dead nodes immediately. In addition, we were able to find Hadoop jobs which were inefficient and when we modified such jobs to further enhance the performance of Hadoop.

An Empirical Performance Analysis on Hadoop via Optimizing the Network Heartbeat Period

  • Lee, Jaehwan;Choi, June;Roh, Hongchan;Shin, Ji Sun
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.12 no.11
    • /
    • pp.5252-5268
    • /
    • 2018
  • To support a large-scale Hadoop cluster, Hadoop heartbeat messages are designed to deliver the significant messages, including task scheduling and completion messages, via piggybacking to reduce the number of messages received by the NameNode. Although Hadoop is designed and optimized for high-throughput computing via batch processing, the real-time processing of large amounts of data in Hadoop is increasingly important. This paper evaluates Hadoop's performance and costs when the heartbeat period is controlled to support latency sensitive applications. Through an empirical study based on Hadoop 2.0 (YARN) architecture, we improve Hadoop's I/O performance as well as application performance by up to 13 percent compared to the default configuration. We offer a guideline that predicts the performance, costs and limitations of the total system by controlling the heartbeat period using simple equations. We show that Hive performance can be improved by tuning Hadoop's heartbeat periods through extensive experiments.

An Analytical Approach to Evaluation of SSD Effects under MapReduce Workloads

  • Ahn, Sungyong;Park, Sangkyu
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.15 no.5
    • /
    • pp.511-518
    • /
    • 2015
  • As the cost-per-byte of SSDs dramatically decreases, the introduction of SSDs to Hadoop becomes an attractive choice for high performance data processing. In this paper the cost-per-performance of SSD-based Hadoop cluster (SSD-Hadoop) and HDD-based Hadoop cluster (HDD-Hadoop) are evaluated. For this, we propose a MapReduce performance model using queuing network to simulate the execution time of MapReduce job with varying cluster size. To achieve an accurate model, the execution time distribution of MapReduce job is carefully profiled. The developed model can precisely predict the execution time of MapReduce jobs with less than 7% difference for most cases. It is also found that SSD-Hadoop is 20% more cost efficient than HDD-Hadoop because SSD-Hadoop needs a smaller number of nodes than HDD-Hadoop to achieve a comparable performance, according to the results of simulation with varying the number of cluster nodes.

Performance Comparison of DW System Tajo Based on Hadoop and Relational DBMS (하둡 기반 DW시스템 타조와 관계형 DBMS의 성능 비교)

  • Liu, Chen;Ko, Junghyun;Yeo, Jeongmo
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.3 no.9
    • /
    • pp.349-354
    • /
    • 2014
  • Since Hadoop which is the Big-data processing platform was announced, SQL-on-Hadoop is the spotlight as the technique to analyze data using SQL on Hadoop. Tajo created by Korean programmers has recently been promoted to Top-Level-Project status by the Apache in April and has been paid attention all around world. Despite a sensible change caused by Hadoop's appearance in DW market, researches of those performance is insufficient. Thus, this study has been conducted to help choose a DW solution based on SQL-on-Hadoop as progressing the test on comparison analysis of RDBMS and Tajo. It has shown that Tajo based on Hadoop is more superior than RDBMS if it is used with accurate strategy. In addition, open-source project Tajo is expected not only to achieve improvements in technique due to active participation of many developers but also to be in charge of an important role of DW in the filed of data analysis.

Performance Evaluation of MapReduce Application running on Hadoop (Hadoop 상에서 MapReduce 응용프로그램 평가)

  • Kim, Junsu;Kang, Yunhee;Park, Youngbom
    • Journal of Software Engineering Society
    • /
    • v.25 no.4
    • /
    • pp.63-67
    • /
    • 2012
  • According to the growth of data being generated in man fields, a distributed programming model MapReduce has been introduced to handle it. In this paper, we build two cluster system with Solaris and Linux environment on SUN Blade150 respectively and then to evaluate the performance of a MapReduce application running on MapReduce middleware Hadoop in terms of its average elapse time and standard deviation. As a result of this experiment, we show that the overall performance of the MapReduce application based on Hadoop is affected by the configuration of the cluster system.

  • PDF

Hadoop Security Technologies and Vulnerability Analysis (하둡 보안 기술과 취약점 분석)

  • Kim, A-Yong;He, Yilun;Kim, Han-Kil;Park, Man-Seub;Jung, Hoe-Kyung
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • /
    • pp.681-683
    • /
    • 2013
  • And were the prevalence of smartphones is the Big Data era, such as Facebook or Twitter, SNS (Social Network Service) routine is used in the real world. Take advantage of the analysis, and to extract and utilize developed in the Apache Foundation Hadoop (Hadoop) without abandoning the SNS unstructured data here. Hadoop is an open source framework that can handle large amounts of data. Hadoop has been introduced in the domestic corporate and commercial development and Compared to the technology development Hadoop has been pointed out that the lack of security sector. In this paper, we propose a method to enhance the security and vulnerability analysis of security technologies and Hadoop.

  • PDF

Implementaion of Video Processing Framework using Hadoop-based cloud computing (Hadoop 기반 클라우드 컴퓨팅을 이용한 영상 처리 프레임워크 구현)

  • Ryu, Chungmo;Lee, Daecheol;Jang, Minwook;Kim, Cheolgi
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • /
    • pp.139-142
    • /
    • 2013
  • 최근 대용량 영상데이터로부터 정보 수집, 영상 처리를 위한 클라우드 관련 연구들이 활발하다. 그러나 공개 소프트웨어를 이용한 클라우드 연구의 대부분은 라이브러리 수준이 아닌 단순히 프로그램 수준의 조합으로 작동한다. 이런 이유로 단순 조합에 따른 비효율성에 의한 성능문제는 크게 다루어지지 않는다. 본 논문에서는 이 비효율성을 해결하는데 중점을 두고 FFmpeg과 Hadoop을 라이브러리 수준으로 결합하여 기존보다 더 나은 성능의 영상클라우드 환경을 구축하였다. C기반의 영상처리 라이브러리인 FFmpeg와 JAVA기반의 클라우드 환경 Hadoop의 결합을 위해 JNI(Java Native Interface)를 이용하였다. 상세구현으로는 HDFS(Hadoop Distributed File System)을 확장하여 Hadoop MapReduce가 직접 FFmpeg을 통한 영상파일 접근이 가능하게 하였다. 이로써 FFmpeg과 Hadoop간 상이한 파일 접근 방식에서 발생하는 불필요한 작업에 의한 시스템의 성능저하를 막았다. 또한 응용의 확장성을 위해 영상작업시 작업영상을 영상처리의 최소단위인 GOP(Group of Pictures)단위로 잘라 클라우드의 노드들에게 분산시켰다. 결과적으로 기존에 존재하는 Hadoop과 FFmpeg을 프로그램적으로 결합한 영상처리 클라우드보다 총 처리시간을 앞당겼고, GOP 단위의 영상 처리는 영상기반 작업에 안정성과 응용의 확장성을 보장해주었다.

  • PDF

Analysis of big data using Rhipe (Rhipe를 활용한 빅데이터 처리 및 분석)

  • Ko, Youngjun;Kim, Jinseog
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.5
    • /
    • pp.975-987
    • /
    • 2013
  • The Hadoop system was developed by the Apache foundation based on GFS and MapReduce technologies of Google. Many modern systems for managing and processing the big data have been developing based on the Hadoop because the Hadoop was designed for scalability and distributed computing. The R software has been considered as a well-suited analytic tool in the Hadoop based systems because the R is flexible to other languages and has many libraries for complex analyses. We introduced Rhipe which is a R package supporting MapReduce programming easily under the Hadoop system, and implemented a MapReduce program using Rhipe for multiple regression especially. In addition, we compared the computing speeds of our program with the other packages (ff and bigmemory) for processing the large data. The simulation results showed that our program was more fast than ff and bigmemory as the size of data increases.

Processing Method of Mass Small File Using Hadoop Platform (하둡 플랫폼을 이용한 대량의 스몰파일 처리방법)

  • Kim, Chang-Bok;Chung, Jae-Pil
    • The Journal of Advanced Navigation Technology
    • /
    • v.18 no.4
    • /
    • pp.401-408
    • /
    • 2014
  • Hadoop is composed with MapReduce programming model for distributed processing and HDFS distributed file system. Hadoop is suitable framework for big data processing, but processing of mass small files have many problems. The processing of mass small file in hadoop have problems to created one mapper per one file, and it have problems to needed many memory for store of meta information of file. This paper have comparison evaluation processing method of mass small file with various method in hadoop platform. The processing of general compression format is inadequate because of processing by one mapper regardless of data size. The processing of sequence and hadoop archive file is removed memory problem of namenode by compress and combine of small file. Hadoop archive file is faster then sequence file about combine time of small file. The processing using CombineFileInputFormat class is needed not combine of small file, and it have similar speed big data processing method.

Performance Analysis of Distributed Hadoop Systems (분산 하둡 시스템의 성능 비교 분석)

  • Bae, Byoung-Jin;Kim, Young-Joo;Kim, Young-Kuk
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • /
    • pp.479-482
    • /
    • 2014
  • Nowadays open-source hadoop systems have been using widely to efficiently manage a fast-growing big data. Hadoop systems consist of distributed file processing system called HDFS (Hadoop Distributed File System) and distributed parallel processing system called MapReduce. The MapReduce reads and processes big data from HDFS and then processed results are written in HDFS again by the MapReduce. Such a processing method has different system structure respectively according to hadoop version. Therefore, this paper shows analysis results for performance of hadoop systems. For this, we devise a way which monitors hadoop systems and measure occurrence frequency of processes, threads, and variables generated in hadoop system itself using the devised way. So, by using the measured results as analysis indicator, we help the indicator predict inner performance of hadoop systems.

  • PDF