Journal of the Korean Data and Information Science Society
Journal Basic Information
Publisher : Korean Data and Information Science Society
Volume & Issues
Volume 24, Issue 6 - Nov 2013
Volume 24, Issue 5 - Sep 2013
Volume 24, Issue 4 - Jul 2013
Volume 24, Issue 3 - May 2013
Volume 24, Issue 2 - Mar 2013
Volume 24, Issue 1 - Jan 2013
Big data and statistics
Kim, Yongdai ; Cho, Kwang Hyun ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 959~974
DOI : 10.7465/jkdi.2013.24.5.959
We investigate the roles of statistics and statisticians in the big data era. The definition and application areas of big data are reviewed, and the statistical characteristics of big data and their implications are discussed. Various statistical methodologies applicable to big data analysis are illustrated, and two real big data projects are explained.
Analysis of big data using Rhipe
Ko, Youngjun ; Kim, Jinseog ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 975~987
DOI : 10.7465/jkdi.2013.24.5.975
The Hadoop system was developed by the Apache Foundation based on Google's GFS and MapReduce technologies. Because Hadoop was designed for scalability and distributed computing, many modern systems for managing and processing big data have been built on top of it. R is considered a well-suited analytic tool for Hadoop-based systems because it interfaces easily with other languages and provides many libraries for complex analyses. We introduce Rhipe, an R package that makes MapReduce programming under Hadoop straightforward, and implement a MapReduce program for multiple regression using Rhipe. In addition, we compare the computing speed of our program with that of other packages for processing large data (ff and bigmemory). The simulation results show that our program becomes faster than ff and bigmemory as the size of the data increases.
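The map/reduce decomposition of least squares that the paper implements with Rhipe can be sketched in plain R, with in-memory blocks standing in for HDFS splits; this is an illustrative sketch under those assumptions, not the authors' Rhipe code.

    # Map/reduce decomposition of multiple regression, sketched in plain R
    # (no Rhipe/Hadoop calls; the blocks stand in for HDFS splits).
    set.seed(1)
    n <- 10000; p <- 3
    X <- cbind(1, matrix(rnorm(n * p), n, p))               # design matrix with intercept
    y <- X %*% c(1, 2, -1, 0.5) + rnorm(n)

    blocks <- split(seq_len(n), rep(1:10, length.out = n))  # pretend data splits

    # map: each block emits its sufficient statistics (X'X, X'y)
    mapped <- lapply(blocks, function(idx) {
      Xb <- X[idx, , drop = FALSE]; yb <- y[idx]
      list(XtX = crossprod(Xb), Xty = crossprod(Xb, yb))
    })

    # reduce: sum the per-block statistics and solve the normal equations
    XtX <- Reduce(`+`, lapply(mapped, `[[`, "XtX"))
    Xty <- Reduce(`+`, lapply(mapped, `[[`, "Xty"))
    beta_hat <- solve(XtX, Xty)
    beta_hat   # agrees with lm.fit(X, y)$coefficients up to rounding

Because only the small matrices X'X and X'y travel between the map and reduce stages, the same decomposition scales to data far larger than the memory of a single machine.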
Support vector machines for big data analysis
Choi, Hosik ; Park, Hye Won ; Park, Changyi ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 989~998
DOI : 10.7465/jkdi.2013.24.5.989
Big data, which has recently attracted attention in industry and academia, cannot be analyzed by the batch processing algorithms developed in data mining because, by definition, big data cannot be loaded and processed in the memory of a single system. An imminent issue is therefore to develop learning algorithms that can be applied to big data. In this paper, we review algorithms for support vector machines in the literature. In particular, we introduce online and parallel processing algorithms that are expected to be useful for big data classification, and we compare the strengths, weaknesses, and performance of those algorithms through simulations for linear classification.
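An online linear SVM processes one observation at a time, so the full data set never has to sit in memory. The sketch below is a minimal stochastic subgradient update in the spirit of Pegasos, written in R; it is an illustration of the online idea, not any specific algorithm compared in the paper.

    # Online (stochastic subgradient) linear SVM, Pegasos-style; illustrative only.
    set.seed(1)
    n <- 2000; p <- 2
    X <- matrix(rnorm(n * p), n, p)
    y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)        # linearly separable labels

    lambda <- 0.01
    w <- rep(0, p); b <- 0
    for (t in 1:n) {                               # one pass over the "stream"
      i <- sample(n, 1)
      eta <- 1 / (lambda * t)                      # decreasing step size
      margin <- y[i] * (sum(w * X[i, ]) + b)
      if (margin < 1) {                            # hinge loss is active
        w <- (1 - eta * lambda) * w + eta * y[i] * X[i, ]
        b <- b + eta * y[i]
      } else {
        w <- (1 - eta * lambda) * w
      }
    }
    mean(sign(X %*% w + b) == y)                   # training accuracy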
Documents recommendation using large citation data
Chae, Minwoo ; Kang, Minsoo ; Kim, Yongdai ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 999~1011
DOI : 10.7465/jkdi.2013.24.5.999
In this research, we propose a document recommendation method that finds documents that are relatively important to a specific document based on citation information. The key idea is tuning the parameter of the Neumann kernel, which interpolates between a measure of importance (HITS) and a measure of relatedness (co-citation). Our method selects the tuning parameter of the Neumann kernel by minimizing the prediction error for future citations. We also discuss some computational issues that arise in analyzing large citation data. Finally, results of analyzing patent data from the US Patent Office are given.
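For reference, the Neumann kernel in its usual formulation (the notation here need not match the paper's): with citation matrix A and co-citation matrix M = A^T A,

    K_\gamma \;=\; \sum_{n \ge 1} \gamma^{\,n-1} M^{\,n} \;=\; M (I - \gamma M)^{-1},
    \qquad M = A^\top A, \quad 0 \le \gamma < 1/\lambda_{\max}(M).

At gamma = 0 the kernel reduces to the co-citation matrix M (pure relatedness), while as gamma approaches 1/lambda_max(M) the induced rankings approach HITS-type authority scores (pure importance), so tuning gamma interpolates between the two measures described in the abstract.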
Hadoop and MapReduce
Park, Jeong-Hyeok ; Lee, Sang-Yeol ; Kang, Da Hyun ; Won, Joong-Ho ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 1013~1027
DOI : 10.7465/jkdi.2013.24.5.1013
As the need for large-scale data analysis is rapidly increasing, Hadoop, a platform for large-scale data processing, and MapReduce, Hadoop's internal computational model, are receiving great attention. This paper reviews the basic concepts of Hadoop and MapReduce necessary for data analysts who are familiar with statistical programming, through examples that combine the R programming language and Hadoop.
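The computational model itself can be seen in the canonical word-count example, sketched here in plain R so that the map, shuffle, and reduce stages are visible; this is an illustrative sketch, not code from the paper.

    # The canonical MapReduce word count, mirrored in plain R (illustrative only).
    docs <- c("big data needs new tools",
              "hadoop and mapreduce handle big data")

    # map: emit one (word, 1) pair per word in each input record
    words <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))

    # shuffle: group the emitted values by key (identical words together)
    grouped <- split(rep(1, length(words)), words)

    # reduce: sum the counts for each key
    counts <- vapply(grouped, sum, numeric(1))
    sort(counts, decreasing = TRUE)

In a real Hadoop job the map and reduce functions are the user-supplied pieces, while the framework handles the shuffle, the distribution of splits, and fault tolerance.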
Review on statistical methods for protecting privacy and measuring risk of disclosure when releasing information for public use
Lee, Yonghee ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 1029~1041
DOI : 10.7465/jkdi.2013.24.5.1029
Recently, along with the emergence of big data, there have been increasing demands to release information and microdata for public use, so protecting privacy and measuring the risk of disclosure for released databases have become important issues in the government and business sectors as well as in the academic community. This paper reviews statistical methods for protecting privacy and measuring the risk of disclosure when microdata or a data analysis server is released for public use.
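One simple disclosure-risk summary of the kind such reviews cover is the count of records that are unique on a set of quasi-identifiers. The R sketch below illustrates the idea on a toy microdata set; the variables and thresholds are hypothetical.

    # Sample uniqueness on quasi-identifiers as a crude disclosure-risk summary
    # (toy data; illustrative sketch only).
    micro <- data.frame(
      age    = c(34, 34, 51, 51, 29, 29, 29, 67),
      sex    = c("F", "F", "M", "M", "F", "F", "M", "M"),
      region = c("A", "A", "B", "B", "C", "C", "C", "D")
    )
    cell_size <- with(micro, ave(rep(1, nrow(micro)),
                                 age, sex, region, FUN = sum))
    sum(cell_size == 1)     # records unique on the quasi-identifiers
    mean(cell_size < 3)     # share of records violating 3-anonymity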
Introduction to general purpose GPU computing
Yu, Donghyeon ; Lim, Johan ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 1043~1061
DOI : 10.7465/jkdi.2013.24.5.1043
Recent advances in computer technology have produced massive data, and their analysis has become important. High performance computing is one of the most essential parts of massive data analysis. In this paper, we review general purpose computing on the graphics processing unit (GPU) and its application to parallel computing, which has been of great interest in the statistics community.
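The pattern GPGPU accelerates is data parallelism: the same computation applied independently to many pieces of data. As a CPU-side stand-in for that pattern (no GPU code, purely illustrative), R's parallel package can spread independent Monte Carlo replicates over cores.

    # Data-parallel Monte Carlo on CPU cores; a stand-in for the pattern GPGPU exploits.
    library(parallel)
    n_sim <- 10000
    one_rep <- function(i) {
      x <- rnorm(1000)
      max(abs(cumsum(x)) / sqrt(1000))   # some per-replicate statistic
    }
    res <- mclapply(seq_len(n_sim), one_rep, mc.cores = 2)
    quantile(unlist(res), 0.95)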
Multiple testing and its applications in high-dimension
Jang, Woncheol ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 1063~1076
DOI : 10.7465/jkdi.2013.24.5.1063
The power of modern technology is opening a new era of big data. The size of these datasets affords us the opportunity to answer many open scientific questions but also presents some interesting challenges. High-dimensional data such as microarrays are common in big data. In this paper, we give an overview of recent developments in multiple testing, including global and simultaneous testing, and its applications to high-dimensional data.
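Simultaneous testing procedures of the kind reviewed here control error rates over thousands of hypotheses at once. A minimal R illustration (simulated p-values, not data from the paper) contrasts Benjamini-Hochberg FDR control with a Bonferroni correction.

    # FDR (BH) versus FWER (Bonferroni) adjustment on simulated p-values; illustrative only.
    set.seed(1)
    p_null <- runif(4500)                                        # true nulls
    p_alt  <- pnorm(rnorm(500, mean = 3), lower.tail = FALSE)    # shifted alternatives
    p <- c(p_null, p_alt)
    p_bh   <- p.adjust(p, method = "BH")
    p_bonf <- p.adjust(p, method = "bonferroni")
    c(BH = sum(p_bh < 0.05), Bonferroni = sum(p_bonf < 0.05))    # rejections at level 0.05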
A small review and further studies on the LASSO
Kwon, Sunghoon ; Han, Sangmi ; Lee, Sangin ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 1077~1088
DOI : 10.7465/jkdi.2013.24.5.1077
High-dimensional data analysis arises in almost all scientific areas, evolving with the development of computing technology, and has encouraged penalized estimation methods that play important roles in statistical learning. Over the past years, various penalized estimators have been developed, and the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani (1996) has shown outstanding ability, taking first place in the development of penalized estimation. In this paper, we first introduce a number of recent advances in high-dimensional data analysis using the LASSO. The topics include statistical problems such as variable selection and grouped or structured variable selection under sparse high-dimensional linear regression models. Several unsupervised learning methods, including inverse covariance matrix estimation, are presented. In addition, we address further studies on new applications, which may establish a guideline on how to use the LASSO for the statistical challenges of high-dimensional data analysis.
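The LASSO estimator itself can be computed by coordinate descent with soft-thresholding. The sketch below assumes standardized predictors and a centered response; it is a minimal illustration of the estimator, not the algorithm of any particular package or of the paper.

    # Coordinate descent with soft-thresholding for the LASSO (illustrative sketch).
    set.seed(1)
    n <- 200; p <- 50
    X <- scale(matrix(rnorm(n * p), n, p))               # standardized predictors
    beta_true <- c(3, -2, 1.5, rep(0, p - 3))
    y <- X %*% beta_true + rnorm(n); y <- y - mean(y)     # centered response

    soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0) # soft-thresholding operator
    lambda <- 0.1
    beta <- rep(0, p)
    for (iter in 1:100) {
      for (j in 1:p) {
        r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]     # partial residual
        beta[j] <- soft(crossprod(X[, j], r_j) / n, lambda)
      }
    }
    which(beta != 0)    # indices of the selected variables

Larger values of lambda shrink more coefficients exactly to zero, which is what makes the LASSO perform variable selection as well as estimation.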
Revisiting the Bradley-Terry model and its application to information retrieval
Jeon, Jong-June ; Kim, Yongdai ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 1089~1099
DOI : 10.7465/jkdi.2013.24.5.1089
The Bradley-Terry model is widely used for the analysis of pairwise preference data. We explain that the popularity of the Bradley-Terry model is due not only to easy computation but also to some nice asymptotic properties that hold even when the model is misspecified. For information retrieval, where big ranking data must be analyzed, we propose using a pseudo-likelihood based on the Bradley-Terry model even when the true model differs from the Bradley-Terry model. We justify this approach by proving that the estimated ranking based on the proposed pseudo-likelihood is consistent when the true model belongs to the class of Thurstone models, which is much larger than the Bradley-Terry model.
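For reference, the Bradley-Terry model in standard notation (not necessarily the paper's), together with the pairwise pseudo-likelihood idea: with item scores theta_i,

    P(i \succ j) \;=\; \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}},
    \qquad
    \ell_{\mathrm{pseudo}}(\theta) \;=\; \sum_{(i,j):\, i \succ j \text{ observed}}
    \left\{ \theta_i - \log\!\left(e^{\theta_i} + e^{\theta_j}\right) \right\}.

That is, each observed ranking is broken into pairwise comparisons, and those pairs are treated as if they were independent Bradley-Terry observations; the ranking induced by the maximizer of this pseudo-likelihood is what the consistency result concerns.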
Erratum to "Prediction of extreme rainfall with a generalized extreme value distribution"
Sung, Yong Kyu ; Sohn, Joong K. ;
Journal of the Korean Data and Information Science Society, volume 24, issue 5, 2013, Pages 1101~1101
DOI : 10.7465/jkdi.2013.24.5.1101