Graphical method for evaluating the impact of influential observations in high-dimensional data

  • Ahn, Sojin (Department of Statistics, Pukyong National University)
  • Lee, Jae Eun (Department of Statistics, Pukyong National University)
  • Jang, Dae-Heung (Department of Statistics, Pukyong National University)
  • Received: 2017.11.03
  • Reviewed: 2017.11.23
  • Published: 2017.11.30

Abstract

High-dimensional data are characterized by a number of variables that far exceeds the number of observations. As a result, individual observations can have a very large impact on the regression coefficient estimates. Jang and Anderson-Cook (2017) proposed the LASSO influence plot for evaluating the impact of influential observations when the LASSO estimator is used. In this paper, we propose the LASSO variable selection ranking plot and the three-dimensional LASSO influence plot, in addition to the LASSO influence plot, as graphical methods for evaluating influential observations in high-dimensional data. Applying these three graphical methods as regression diagnostic tools to two real high-dimensional data sets, we found that they identify influential observations effectively.
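The general idea of case-deletion influence for the LASSO can be illustrated with a small sketch. It is not the authors' construction of the LASSO influence plot, whose exact definition is not given in this abstract. It only assumes that influence is measured by refitting the LASSO with each observation removed and comparing the coefficients and the selected variables with the full-data fit. The function names (`lasso_cd`, `deletion_influence`), the penalty value, and the toy data are all hypothetical, and the solver is a plain coordinate-descent implementation written in pure Python to keep the example self-contained.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/(2n))*||y - X b||^2 + lam*||b||_1 by coordinate descent.

    X is a list of rows, y a list of responses; no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; beta starts at zero
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (excluding j).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new_bj
    return beta


def deletion_influence(X, y, lam):
    """Refit the LASSO with each observation left out in turn.

    Returns the full-data coefficients and, for each deleted observation,
    a pair: the L1 distance between the full-data and leave-one-out
    coefficients, and the set of selected (nonzero) variables.
    """
    full = lasso_cd(X, y, lam)
    results = []
    for i in range(len(X)):
        beta_i = lasso_cd(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        change = sum(abs(a - b) for a, b in zip(full, beta_i))
        selected = {j for j, b in enumerate(beta_i) if b != 0.0}
        results.append((change, selected))
    return full, results


# Hypothetical toy data: n = 10 observations, p = 20 variables (p > n),
# with observation 0 given a deliberately shifted response.
rng = random.Random(0)
n, p = 10, 20
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.1) for row in X]
y[0] += 10.0

full_beta, results = deletion_influence(X, y, lam=0.3)
for i, (change, selected) in enumerate(results):
    print(f"drop obs {i}: L1 change = {change:.3f}, selected = {sorted(selected)}")
```

Plotting the per-observation changes, or the selected sets, against the observation index gives a simple visual diagnostic in the same spirit as the plots described above. In practice the penalty would normally be chosen by cross-validation rather than fixed as it is here.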

Project information

Supervising institution of the research project: Pukyong National University

References

  1. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences USA, 96, 6745-6750. https://doi.org/10.1073/pnas.96.12.6745
  2. Fan, J., Feng, Y., Saldana, D. F., Samworth, R. and Wu, Y. (2017). http://www.stat.columbia.edu/-yangfeng/pubs/jss1375.pdf, Package 'SIS'.
  3. Fan, J. and Lv, J. (2008). Sure independence screening for ultra-high dimensional feature space. Journal of the Royal Statistical Society Series B, 70, 849-911. https://doi.org/10.1111/j.1467-9868.2008.00674.x
  4. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537. https://doi.org/10.1126/science.286.5439.531
  5. Hwang, E. J. and Na, J. H. (2015). Influenza prediction models by using meteorological and social media informations. Journal of the Korean Data & Information Science Society, 26, 1087-1095. https://doi.org/10.7465/jkdi.2015.26.5.1087
  6. Jang, D. H. and Anderson-Cook, C. M. (2017). Influence plots for LASSO. Quality and Reliability Engineering International, 33, 1317-1326. https://doi.org/10.1002/qre.2106
  7. Jung, B. H. and Lim, D. H. (2016). Learning algorithms for big data logistic regression on RHIPE platform. Journal of the Korean Data & Information Science Society, 27, 911-923. https://doi.org/10.7465/jkdi.2016.27.4.911
  8. Lee, S., Cho, J., Kang, C. and Choi, S. (2015). Study on prediction for a film success using text mining. Journal of the Korean Data & Information Science Society, 26, 1259-1269. https://doi.org/10.7465/jkdi.2015.26.6.1259
  9. Lee, W. and Chun, H. (2016). A deep learning analysis of the Chinese Yuan’s volatility in the onshore and offshore markets. Journal of the Korean Data & Information Science Society, 27, 327-335. https://doi.org/10.7465/jkdi.2016.27.2.327
  10. Shin, J. E., Oh, Y. S. and Lim, D. H. (2016). RHadoop platform for K-Means clustering of big data. Journal of the Korean Data & Information Science Society, 27, 609-619. https://doi.org/10.7465/jkdi.2016.27.3.609
  11. Zeng, Y. and Breheny, P. (2017). https://arxiv.org/abs/1701.05936, Package 'biglasso'.