A Study on the Use of Stopword Corpus for Cleansing Unstructured Text Data

Lee, Won-Jo

doi:10.17703/JCCT.2022.8.6.891

The Journal of the Convergence on Culture Technology (문화기술의 융합)

Volume 8 Issue 6 / Pages 891-897 / 2022 / 2384-0358 (pISSN) / 2384-0366 (eISSN)

The International Promotion Agency of Culture Technology (국제문화기술진흥원)

비정형 텍스트 데이터 정제를 위한 불용어 코퍼스의 활용에 관한 연구

Lee, Won-Jo (Dept. of Industrial Management Eng., Ulsan College)

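이원조 (Dept. of Industrial Management Engineering, Ulsan College; Dept. of Computer Science, University of Ulsan; School of Computer and IT, Ulsan College)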

Received : 2022.10.21
Accepted : 2022.11.08
Published : 2022.11.30

https://doi.org/10.17703/JCCT.2022.8.6.891

Abstract

In big data analysis, raw text data mostly exists in unstructured form, so it can be analyzed only after it has been converted into structured form through heuristic pre-processing followed by computer-based post-processing. In this study, in order to apply the word cloud technique of the R program, one of several text data analysis techniques, unnecessary elements are first removed from the collected raw data during pre-processing, and stopwords are then removed during post-processing. A case study of word cloud analysis is then conducted: the frequency of occurrence of each word is calculated, and high-frequency words are presented as key issues. To address the problems of the existing stopword processing method used with R word clouds, the "nested stopword source code" method, we propose the use of a "general stopword corpus" and a "user-defined stopword corpus". Through case analysis, the advantages and disadvantages of the proposed "unstructured data cleansing process model" are compared and verified, and the practical usefulness of word cloud visualization analysis using the "proposed external corpus cleansing technique" is presented.

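The post-processing idea described in the abstract can be illustrated with a minimal sketch. It is written in Python rather than R and is not the author's code: it assumes that the "general stopword corpus" and the "user-defined stopword corpus" are plain text files with one stopword per line, kept outside the analysis script. The file names, the sample text, and the helper functions `load_stopwords`, `tokenize`, and `word_frequencies` are hypothetical and exist only for illustration.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line from an external corpus file, skipping blanks."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def tokenize(text):
    """Stand-in for pre-processing: lowercase and keep only runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(text, stopwords):
    """Post-processing: drop stopwords, then count the remaining words."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)


# Hypothetical corpus files and sample text, created here only for the demo.
workdir = Path(tempfile.mkdtemp())
general_path = workdir / "general_stopwords.txt"
user_path = workdir / "user_stopwords.txt"
general_path.write_text("the\nof\nand\nis\nin\n", encoding="utf-8")
user_path.write_text("study\npaper\n", encoding="utf-8")

# The two external corpora are merged into one stopword set.
stopwords = load_stopwords(general_path) | load_stopwords(user_path)

raw_text = "The study of text data: cleansing of text data is the key step in the paper."
frequencies = word_frequencies(raw_text, stopwords)

# The highest-frequency words are the ones a word cloud would draw largest.
for word, count in frequencies.most_common(3):
    print(word, count)
```

Because the stopword lists live in separate files, a domain-specific word can be added to the user-defined file without editing the analysis code, which is one plausible reading of the advantage the abstract claims over stopwords embedded in the source code.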
