A Study on Information Resource Evaluation for Text Categorization

문서범주화 효율성 제고를 위한 정보원 평가에 관한 연구

  • 정은경 (Library and Information Science, College of Social Sciences, Ewha Womans University)
  • Published : 2007.12.31

Abstract

The purpose of this study is to examine whether the information resources that human indexers consult during the indexing process are effective for text categorization. More specifically, information resources drawn from bibliographic information as well as from the full text were explored using a data set of typical scientific journal articles. The experimental results showed that information resources such as citation, source title, and title were not significantly different from full text, whereas keyword was significantly different from full text. These findings suggest that the information resources referenced by human indexers are good candidates for text categorization in automatic subject term assignment.

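This study treats the document components that indexers consult during subject indexing as information sources for text categorization, and examines their effect on categorization performance. Whereas previous text categorization research has concentrated on full text, this study evaluates document components as information sources to determine whether they can be used efficiently for text categorization. With typical science and technology journal and conference proceedings papers as the data set, the information sources fall into two groups: full-text-centered and document-component-centered. The full-text-centered sources comprise the body itself and the introduction and conclusion; the document-component-centered sources are the title, citations, source, abstract, and keywords. In the experiments, the citation, source, and title sources showed no significant difference from the full-text source, whereas the keyword source differed significantly from it. These results indicate that the document components consulted by indexers can be used efficiently as information sources for text categorization in place of the full text.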
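
As a minimal sketch of the idea in the abstract, each article can be split into one text string per information source, so that a categorizer could be trained and evaluated on each source separately. The record layout, the field names, and the `source_views` helper below are hypothetical illustrations, not the data format or procedure used in the study; pairing the introduction with the conclusion as a single source is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Article:
    """Hypothetical article record; field names are illustrative assumptions."""
    title: str = ""
    abstract: str = ""
    keywords: List[str] = field(default_factory=list)
    source_title: str = ""  # name of the journal or proceedings
    citations: List[str] = field(default_factory=list)  # cited reference strings
    introduction: str = ""
    body: str = ""
    conclusion: str = ""


def source_views(article: Article) -> Dict[str, str]:
    """Return one text string per information source for a single article."""
    intro_conclusion = " ".join(
        part for part in (article.introduction, article.conclusion) if part
    )
    return {
        # Full-text-centered sources
        "body": article.body,
        "intro_conclusion": intro_conclusion,
        # Document-component-centered sources
        "title": article.title,
        "citation": " ".join(article.citations),
        "source_title": article.source_title,
        "abstract": article.abstract,
        "keyword": " ".join(article.keywords),
    }
```

Each value in the returned dictionary could then be fed to the same feature extraction and classification pipeline, which keeps the comparison between sources limited to the input text alone.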
