Text Summarization on Large-scale Vietnamese Datasets

  • Received : 2022.04.23
  • Accepted : 2022.11.16
  • Published : 2022.12.31

Abstract

This study addresses automatic text summarization on large-scale Vietnamese datasets. Vietnamese articles were collected from newspaper websites, and their plain text was extracted to build a dataset of 1,101,101 documents. A new single-document extractive text summarization model was then proposed and used to evaluate this dataset. The model uses the k-means algorithm to cluster the sentences of the input document under different text representations: BoW (bag-of-words), TF-IDF (term frequency-inverse document frequency), Word2Vec (word-to-vector), GloVe, and FastText. The summarization algorithm then uses the trained k-means model to rank the candidate sentences and builds the summary from the highest-ranked ones. In the experiments, the proposed model achieved F1-scores of 51.91% ROUGE-1, 18.77% ROUGE-2, and 29.72% ROUGE-L, compared with 52.33% ROUGE-1, 16.17% ROUGE-2, and 33.09% ROUGE-L for a competitive abstractive model. An advantage of the proposed model is its time complexity of O(n,k,p) = O(n(k+2/p)) + O(n log_2 n) + O(np) + O(nk^2) + O(k).
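The clustering-and-ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a bag-of-words representation only, a plain Lloyd-style k-means with deterministic initial centroids, and a ranking rule that keeps the sentence closest to each centroid (the abstract does not state the exact ranking rule). The function names `bow_vectors`, `kmeans`, and `summarize` and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vectors(sentences):
    """Map each sentence to a bag-of-words count vector over a shared vocabulary."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            vec[index[tok]] = float(count)
        vectors.append(vec)
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(vectors, centroids):
    """Label every vector with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: distance(v, centroids[c]))
            for v in vectors]


def kmeans(vectors, k, iterations=20):
    """Lloyd-style k-means; initial centroids are evenly spaced input vectors
    (an assumption made here so the sketch is deterministic)."""
    n = len(vectors)
    centroids = [list(vectors[(i * n) // k]) for i in range(k)]
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(vectors, centroids)


def summarize(sentences, k):
    """Keep, for each cluster, the sentence nearest its centroid (assumed
    ranking rule), and return the kept sentences in document order."""
    k = min(k, len(sentences))
    vectors = bow_vectors(sentences)
    centroids, labels = kmeans(vectors, k)
    chosen = set()
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            chosen.add(min(members, key=lambda i: distance(vectors[i], centroids[c])))
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "The river flooded the village.",
    "Flood water from the river covered the village roads.",
    "The football team won the final match.",
    "Fans celebrated after the team won the match.",
]
summary = summarize(sentences, k=2)
print(summary)
```

With two topics in the sample input, the sketch returns one sentence per cluster. Swapping `bow_vectors` for a TF-IDF or embedding-based representation would leave the clustering and selection steps unchanged.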
