Semantic Word Categorization using Feature Similarity based K Nearest Neighbor

  • Jo, Taeho (School of Game, Hongik University)
  • Received : 2018.02.27
  • Accepted : 2018.05.14
  • Published : 2018.06.30

Abstract

This article proposes a modified KNN (K Nearest Neighbor) algorithm that takes feature similarity into account and applies it to word categorization. The texts used as features for encoding words into numerical vectors are semantically related entities rather than independent ones, and combining word categorization with text categorization is expected to produce a synergy effect. In this research, we define a similarity metric between two vectors that includes the feature similarity, modify the KNN algorithm by replacing the existing similarity metric with the proposed one, and apply the modified algorithm to word categorization. The proposed KNN is empirically validated as a better approach to categorizing words in news articles and opinions. The significance of this research is that it improves classification performance by utilizing feature similarities.
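The abstract does not give the exact form of the proposed metric, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a bilinear form x^T S y, normalized like a cosine, where S is a hypothetical feature-to-feature similarity matrix (here, similarities between the texts used as features). The function names, the matrix `feature_sim`, and the sample labels are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def feature_aware_similarity(x, y, feature_sim):
    """Similarity between two vectors that also credits distinct but similar features.

    Assumed form (not the paper's formula): x^T S y / sqrt((x^T S x) * (y^T S y)).
    With S equal to the identity matrix this reduces to plain cosine similarity.
    """
    def bilinear(a, b):
        # Sum over every pair of features, weighted by their similarity.
        return sum(a[i] * feature_sim[i][j] * b[j]
                   for i in range(len(a))
                   for j in range(len(b)))

    denom = sqrt(bilinear(x, x) * bilinear(y, y))
    return bilinear(x, y) / denom if denom > 0 else 0.0


def knn_classify(query, examples, feature_sim, k=3):
    """Label a query vector by majority vote among its k most similar examples.

    `examples` is a list of (vector, label) pairs.
    """
    ranked = sorted(
        examples,
        key=lambda ex: feature_aware_similarity(query, ex[0], feature_sim),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical data: three texts serve as features; texts 0 and 1 are similar.
feature_sim = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Each word is encoded as a vector of its weights in the three texts.
examples = [
    ([1.0, 0.0, 0.0], "business"),
    ([0.9, 0.1, 0.0], "business"),
    ([0.0, 0.0, 1.0], "sports"),
    ([0.0, 0.1, 0.9], "sports"),
]
query = [0.0, 1.0, 0.0]  # a word that occurs only in text 1
predicted = knn_classify(query, examples, feature_sim, k=3)
```

In this toy setup the query word shares no feature with the first example, so plain cosine similarity would score that pair as zero. The feature-aware version still gives the pair a nonzero score because texts 0 and 1 are marked as similar, which is the kind of effect the abstract attributes to considering feature similarities.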
