A Machine-Learning Based Approach for Extracting Logical Structure of a Styled Document

  • Kim, Tae-young (Dept. of Software Engineering, CAIIT, Chonbuk National University) ;
  • Kim, Suntae (Dept. of Software Engineering, CAIIT, Chonbuk National University) ;
  • Choi, Sangchul (Dept. of Software Engineering, CAIIT, Chonbuk National University) ;
  • Kim, Jeong-Ah (Department of Computer Education, Catholic Kwandong University) ;
  • Choi, Jae-Young (College of Information and Communication Engineering, SungKyunKwan University) ;
  • Ko, Jong-Won (College of Information and Communication Engineering, SungKyunKwan University) ;
  • Lee, Jee-Huong (College of Information and Communication Engineering, SungKyunKwan University) ;
  • Cho, Youngwha (College of Information and Communication Engineering, SungKyunKwan University)
  • Received : 2016.10.10
  • Accepted : 2017.01.16
  • Published : 2017.02.28

Abstract

A styled document is a document that uses diverse decorating functions such as different fonts, colors, tables, and images, and is generally authored in a word processor (e.g., MS-WORD, Open Office). Compared to a plain-text document, a styled document lets a human reader easily recognize its logical structure, such as sections, subsections, and contents. However, it is difficult for a computer to recognize that structure if the writer does not explicitly specify the type of each element by using the styling functions of the word processor. This is one of the obstacles to improving document version management systems, because they currently manage a document with the file as the unit of management rather than the individual document elements. This paper proposes a machine-learning-based approach to analyzing the logical structure of a styled document composed of sections, subsections, and contents. We first suggest a feature vector for characterizing the document elements of a styled document; it is composed of eight features, such as font size, indentation, and period, each of which is frequently found in styled documents. We then train machine learning classifiers such as Random Forest and Support Vector Machine using the suggested feature vector. The trained classifiers are used to automatically identify the logical structure of a styled document. In our experiment, the approach achieved 92.78% precision and 94.02% recall in analyzing the logical structure of 50 styled documents.
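As a rough illustration of the idea in the abstract, the following Python sketch maps a document element to a numeric feature vector and computes per-label precision and recall. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Element` schema and the function names are hypothetical, and only font size, indentation, and period are named in the abstract. The other features shown (bold, leading digit, word count) are illustrative guesses, and the paper's full set of eight features is not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Element:
    """One paragraph-level element of a styled document (hypothetical schema)."""
    text: str
    font_size: float   # font size in points
    indent: float      # left indentation in points
    bold: bool = False


def feature_vector(el: Element) -> list[float]:
    """Map an element to numeric features for a classifier.

    Font size, indentation, and period are named in the abstract; the
    remaining entries are assumed for illustration only.
    """
    text = el.text.strip()
    return [
        el.font_size,
        el.indent,
        1.0 if text.endswith(".") else 0.0,   # element ends with a period
        1.0 if el.bold else 0.0,              # assumed feature
        1.0 if text[:1].isdigit() else 0.0,   # assumed: starts with a digit
        float(len(text.split())),             # assumed: word count
    ]


def precision_recall(gold: list[str], predicted: list[str], label: str) -> tuple[float, float]:
    """Compute precision and recall for one label (e.g. "section")."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Vectors of this kind could then be passed to any off-the-shelf classifier, such as a Random Forest or Support Vector Machine, with each element labeled as a section, subsection, or content.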
