Hadoop and MapReduce

  • Park, Jeong-Hyeok (School of Industrial Management Engineering, Korea University) ;
  • Lee, Sang-Yeol (School of Industrial Management Engineering, Korea University) ;
  • Kang, Da Hyun (School of Industrial Management Engineering, Korea University) ;
  • Won, Joong-Ho (School of Industrial Management Engineering, Korea University)
  • Received: 2013.07.07
  • Accepted: 2013.08.12
  • Published: 2013.09.30

Abstract

As the need for large-scale data analysis grows rapidly, Hadoop, a platform for large-scale data processing, and MapReduce, its underlying computational model, are receiving considerable attention. This paper reviews the basic concepts of Hadoop and MapReduce that data analysts familiar with statistical programming need to know, using examples that combine the R programming language with Hadoop.

Cited by

  1. Enhancing the performance of taxi application based on in-memory data grid technology vol.26, pp.5, 2015, https://doi.org/10.7465/jkdi.2015.26.5.1035
  2. A Block Relocation Algorithm for Reducing Network Consumption in Hadoop Cluster vol.19, pp.11, 2014, https://doi.org/10.9708/jksci.2014.19.11.009
  3. Big data distributed processing system using RHadoop vol.26, pp.5, 2015, https://doi.org/10.7465/jkdi.2015.26.5.1155
  4. Current trends in high dimensional massive data analysis vol.29, pp.6, 2016, https://doi.org/10.5351/KJAS.2016.29.6.999
  5. Structuring of unstructured big data and visual interpretation vol.25, pp.6, 2014, https://doi.org/10.7465/jkdi.2014.25.6.1431
  6. RHadoop platform for K-Means clustering of big data vol.27, pp.3, 2016, https://doi.org/10.7465/jkdi.2016.27.3.609
  7. An elastic distributed parallel Hadoop system for bigdata platform and distributed inference engines vol.26, pp.5, 2015, https://doi.org/10.7465/jkdi.2015.26.5.1129
  8. Comparative analysis of big data integration models vol.28, pp.4, 2017, https://doi.org/10.7465/jkdi.2017.28.4.755
  9. Design of a distributed Hadoop full-stack platform for big data collection and processing vol.12, pp.7, 2021, https://doi.org/10.15207/jkcs.2021.12.7.045