Capturing Data from Untapped Sources using Apache Spark for Big Data Analytics
Nichie, Aaron; Koo, Heung-Seo;
The term "Big Data" encapsulates a broad spectrum of data sources and data formats. It is often described as unstructured data because of the variety of its formats. Even though the traditional method of structuring data in rows and columns has been reinvented as column families and key-value stores, or replaced entirely by JSON documents in document-based databases, the fact remains that data must be reshaped to conform to some structure before it can be stored persistently on disk. ETL processes are key to this restructuring. However, ETL processes incur additional processing overhead and require that data sources be maintained in predefined formats. Consequently, data in certain formats are ignored altogether, because designing ETL processes that cater for every possible data format is almost impossible. These unconsidered data sources could provide useful insights if incorporated into big data analytics. In this project, using the big data solution Apache Spark, we tapped into other sources of data stored in their raw formats, such as various text files and compressed files, and combined them with enterprise data stored persistently in MongoDB for overall analytics using the MongoDB Aggregation Framework and MapReduce. This differs significantly from traditional ETL systems in that it works regardless of the data format at the source.
Apache Spark; MapReduce; Big Data; MongoDB; Analytics
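The idea in the abstract, reading raw text and compressed files as they are and combining them with already stored records for aggregation, can be illustrated with a small sketch. This is not the authors' implementation and it uses neither Spark nor MongoDB: it is a standard-library Python analogue. The `customer,amount` line format, the function names, and the in-memory list standing in for stored documents are all assumptions made for illustration.

```python
import gzip
import os
import tempfile
from collections import defaultdict


def read_raw_lines(path):
    """Yield non-empty text lines from a plain or gzip-compressed file."""
    # Detect gzip by its two-byte magic number instead of trusting the extension.
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield line


def parse_record(line):
    """Parse a hypothetical 'customer,amount' line (assumed format)."""
    customer, amount = line.split(",")
    return {"customer": customer.strip(), "amount": float(amount)}


def total_by_customer(stored_docs, raw_paths):
    """Group by customer and sum amounts over stored and raw records."""
    totals = defaultdict(float)
    # stored_docs stands in for documents already persisted in a database.
    for doc in stored_docs:
        totals[doc["customer"]] += doc["amount"]
    # Raw files are parsed on read; no prior restructuring step is required.
    for path in raw_paths:
        for line in read_raw_lines(path):
            record = parse_record(line)
            totals[record["customer"]] += record["amount"]
    return dict(totals)


def demo():
    """Build one plain and one gzip file, then aggregate with stored records."""
    stored = [
        {"customer": "alice", "amount": 10.0},
        {"customer": "bob", "amount": 5.0},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "sales.txt")
        packed = os.path.join(tmp, "sales.txt.gz")
        with open(plain, "w", encoding="utf-8") as fh:
            fh.write("alice,2.5\ncarol,4.0\n")
        with gzip.open(packed, "wt", encoding="utf-8") as fh:
            fh.write("bob,1.5\n")
        return total_by_customer(stored, [plain, packed])
```

Calling `demo()` returns per-customer totals that mix the stored records with those read from both files. In a Spark-based setup the file reading and the group-by step would run in a distributed fashion; the sketch only shows the shape of the data flow.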