Keyword Analysis Based Document Compression System

  • Cao, Kerang (Department of Computer Science and Engineering, Shenyang University of Chemical Technology)
  • Lee, Jongwon (Department of Computer Engineering, Pai Chai University)
  • Jung, Hoekyung (Department of Computer Engineering, Pai Chai University)
  • Received : 2018.01.04
  • Accepted : 2018.03.19
  • Published : 2018.03.31

Abstract

Traditional document analysis has centered on word-based systems implemented with a morpheme analyzer. Such systems can classify the words used in a document, but they do little to help the user understand or analyze it. To solve this problem, a system needs to extract the most valuable paragraphs, those that help the user understand the document. In this paper, we propose a system that extracts paragraphs from a normalized XML document. The user enters the filename of the XML document to be analyzed. The system then searches the document for keywords and displays the keywords it found. When the user selects and enters the desired keywords, the system extracts the paragraphs that contain them. After extraction, the system preserves the original paragraph order and checks for duplicates; any duplicate paragraph is deleted. Finally, the system reports the frequency and weight of each keyword to the user, together with the sorted paragraphs.
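The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `<p>` paragraph tag, the sample document, the frequency-based keyword candidates (the abstract does not say how keywords are found), and the definition of weight as a keyword's share of all selected-keyword occurrences are assumptions made only for illustration, as are all function names.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical normalized XML input; the <p> tag is an assumption.
SAMPLE_XML = """<document>
  <p>Keyword analysis helps readers find the important parts of a document.</p>
  <p>The weather was pleasant that day.</p>
  <p>Compression keeps only the paragraphs that mention a chosen keyword.</p>
  <p>Keyword analysis helps readers find the important parts of a document.</p>
</document>"""


def load_paragraphs(xml_text, tag="p"):
    """Return the whitespace-normalized text of every <tag> element, in document order."""
    root = ET.fromstring(xml_text)
    return [" ".join("".join(el.itertext()).split()) for el in root.iter(tag)]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_keywords(paragraphs, min_length=4, top_n=10):
    """Suggest keywords by raw word frequency (an assumed stand-in for the paper's keyword search)."""
    counts = Counter(
        w for p in paragraphs for w in tokenize(p) if len(w) >= min_length
    )
    return [word for word, _ in counts.most_common(top_n)]


def compress(paragraphs, chosen):
    """Keep paragraphs containing any chosen keyword, in original order, without duplicates."""
    chosen = [k.lower() for k in chosen]
    kept, seen = [], set()
    for p in paragraphs:  # iterating in document order preserves the sequence
        words = set(tokenize(p))
        if p not in seen and any(k in words for k in chosen):
            seen.add(p)
            kept.append(p)
    return kept


def keyword_stats(paragraphs, chosen):
    """Map each keyword to (frequency, weight); weight is its share of all chosen-keyword hits (assumed)."""
    counts = Counter(w for p in paragraphs for w in tokenize(p))
    freq = {k: counts[k.lower()] for k in chosen}
    total = sum(freq.values())
    return {k: (n, n / total if total else 0.0) for k, n in freq.items()}
```

A typical call sequence would be `paragraphs = load_paragraphs(SAMPLE_XML)`, then `kept = compress(paragraphs, ["keyword", "compression"])`, then `keyword_stats(kept, ["keyword", "compression"])`. Deduplication here compares whole paragraph strings after whitespace normalization; a real system might need a looser notion of duplication.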

Fig. 1. System architecture.

Fig. 2. System flowchart.

Fig. 3. Open XML document flowchart.

Fig. 4. Extraction keyword flowchart.

Fig. 5. Extraction paragraph including keyword flowchart.

Fig. 6. Check duplication flowchart.

Fig. 7. Test result.
