External Merge Sorting in Tajo with Variable Server Configuration
  • Journal title : Journal of KIISE
  • Volume 43, Issue 7,  2016, pp.820-826
  • Publisher : Korean Institute of Information Scientists and Engineers
  • DOI : 10.5626/JOK.2016.43.7.820
 Title & Authors
Lee, Jongbaeg; Kang, Woon-hak; Lee, Sang-won;
Demand is growing for big data processing, which extracts valuable information from large volumes of data. The Hadoop system employs the MapReduce framework to process big data, but MapReduce has limitations such as inflexible and slow data processing. To overcome these drawbacks, SQL query processing techniques known as SQL-on-Hadoop were developed. Apache Tajo, one of the SQL-on-Hadoop systems, was developed by a Korean development group. External merge sort is one of the most heavily used algorithms in Tajo's query processing, and its performance is influenced by two parameters: sort buffer size and fanout. In this paper, we analyze the performance of external merge sort in Tajo under various sort buffer sizes and fanouts. We find two major causes of the performance differences: CPU cache misses, which increase as the sort buffer size grows, and the number of merge passes, which is determined by the fanout.
SQL-on-Hadoop; Apache Tajo; external merge sort; sort buffer size; fanout
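
To make the roles of the two parameters concrete, the following is a minimal Python sketch of a generic external merge sort. It is not Tajo's implementation: the function names, the `buffer_records` and `fanout` parameters, and the use of in-memory lists in place of on-disk runs are all assumptions made for illustration. It shows how the sort buffer size fixes the number of initial runs and how the fanout fixes the number of merge passes.

```python
import heapq
import math


def expected_merge_passes(num_records, buffer_records, fanout):
    """Count the passes needed to merge the initial runs, `fanout` at a time."""
    runs = math.ceil(num_records / buffer_records)
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / fanout)
        passes += 1
    return passes


def external_merge_sort(records, buffer_records, fanout):
    """Sort `records` and return (sorted_list, merge_passes).

    Hypothetical helper: in-memory lists stand in for the on-disk runs
    that a real system would write and read back.
    """
    if buffer_records < 1 or fanout < 2:
        raise ValueError("buffer_records must be >= 1 and fanout >= 2")

    # Run generation: sort one buffer-sized chunk at a time.
    runs = [sorted(records[i:i + buffer_records])
            for i in range(0, len(records), buffer_records)]

    # Merge phase: each pass merges groups of up to `fanout` runs.
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fanout]))
                for i in range(0, len(runs), fanout)]
        passes += 1

    return (runs[0] if runs else []), passes


if __name__ == "__main__":
    data = list(range(1000, 0, -1))
    for fanout in (4, 16, 100):
        _, passes = external_merge_sort(data, buffer_records=10, fanout=fanout)
        print(f"fanout={fanout}: {passes} merge pass(es)")
```

In this sketch a larger fanout reduces the number of merge passes, and a larger buffer produces fewer initial runs. The cache-miss effect of large sort buffers described in the abstract is a hardware behavior that this toy model does not capture.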