Data Quality Measurement on a De-identified Data Set Based on Statistical Modeling

통계모형의 정확도에 기반한 비식별화 데이터의 품질 측정

  • Received : 2019.03.04
  • Accepted : 2019.04.18
  • Published : 2019.05.28

Abstract

This study examines how to measure the statistical usefulness of de-identified data in terms of the prediction accuracy of a statistical model. In the era of the 4th industrial revolution, effective use of big data is essential to innovation through information and communication technology, but concerns over personal information constrain its active use. To address this problem, de-identification guidelines have been established, and the use of various de-identification methods has made actual re-identification of personal information very unlikely. On the other hand, strong de-identification can have the side effect of degrading the usefulness of the data. We studied the statistical usefulness of data de-identified with the KLT model, a representative de-identification method, and conducted a case study to see how much the accuracy of statistical prediction is degraded by de-identification. We also propose a new measure of the usefulness of de-identified data that quantifies how much data must be added to the de-identified data to restore the accuracy of the predictive model.
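
The idea of scoring de-identified data by the accuracy of a model fitted on it can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy records, the 20-year age bands used as a stand-in for generalization, the one-variable threshold classifier, and the helper names `generalize_age`, `k_of`, `fit_threshold`, and `accuracy`. The sketch also assumes both models are scored on the same raw records.

```python
from collections import Counter

def generalize_age(age, width=20):
    """Replace an exact age with the midpoint of its width-year band."""
    low = (age // width) * width
    return low + width // 2

def k_of(values):
    """Size of the smallest group of records sharing one quasi-identifier value."""
    return min(Counter(values).values())

def accuracy(t, xs, ys):
    """Share of records classified correctly by the rule: predict 1 when x >= t."""
    hits = sum((x >= t) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)

def fit_threshold(xs, ys):
    """Pick the cut-off t with the best training accuracy (first best wins)."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = accuracy(t, xs, ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Hypothetical toy data: one record per age 20..79, label 1 when age >= 45.
ages = list(range(20, 80))
labels = [int(a >= 45) for a in ages]

# De-identify the quasi-identifier by generalizing it into 20-year bands.
deid_ages = [generalize_age(a) for a in ages]

# Fit one model on the original data and one on the de-identified data.
t_orig = fit_threshold(ages, labels)
t_deid = fit_threshold(deid_ages, labels)

# Score both models on the same raw records (an assumption of this sketch).
acc_orig = accuracy(t_orig, ages, labels)
acc_deid = accuracy(t_deid, ages, labels)
utility_loss = acc_orig - acc_deid

print(f"k before: {k_of(ages)}, k after: {k_of(deid_ages)}")
print(f"accuracy original: {acc_orig:.3f}, de-identified: {acc_deid:.3f}")
print(f"utility loss: {utility_loss:.3f}")
```

On this toy data the smallest group grows from 1 record to 20, while the accuracy of the fitted rule falls from 1.0 to 55/60 (about 0.917), because the cut-off learned from the banded ages moves from 45 to 50. The gap between the two accuracies is one simple way to express the loss of usefulness; it is not the measure proposed in the paper.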

Keywords

Personal Information;Data Quality;De-identification;Predictive Model;KLT-Model

Table 1. Variables in the original DB

Table 2. Estimated regression coefficients before and after de-identification

Table 3. Classification performance on the validation data

Table 4. Classification performance of the two models
