Web Page Similarity based on Size and Frequency of Tokens

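  • 이은주 (School of Computer Science, College of IT, Kyungpook National University) ;
  • 정우성 (Department of Computer Engineering, College of Electronics and Information, Chungbuk National University)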
  • Received : 2012.07.27
  • Accepted : 2012.10.03
  • Published : 2012.12.31

Abstract

Web applications are becoming hard to maintain because of the high complexity and duplication of their web pages. However, most research on code clones focuses on code hunks and targets a specific language. We therefore propose GSIM, a language-independent statistical approach that detects similar pages based on the scarcity and frequency of customized tokens. The tokens, obtained by splitting pages with a given set of separators, are the atomic elements used to calculate the similarity between two pages. This paper gives the domain definition for web applications and the algorithms for collecting tokens, building matrices, and calculating similarity. We also evaluated the approach by running our GSIM tool on open source code. The results show the applicability of the proposed method and the effects of parameters such as the threshold, toughness, and token length on its quality and performance.
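
The general idea of separator-based tokens can be illustrated with a minimal sketch. It is not the GSIM algorithm, whose formulas are not given in this abstract. The separator set, the `min_length` filter, the length-based token weight, and the weighted-overlap score below are all assumptions chosen for illustration, as are the function names `tokenize`, `similarity`, and `is_similar`.

```python
from collections import Counter

# Hypothetical separator set; the abstract only says separators are "given".
SEPARATORS = " \t\n<>=\"'/"


def tokenize(page, separators=SEPARATORS, min_length=3):
    """Split a page on the separators and count tokens of at least min_length."""
    table = str.maketrans({c: " " for c in separators})
    tokens = page.translate(table).split()
    return Counter(t for t in tokens if len(t) >= min_length)


def similarity(a, b):
    """Weighted overlap of two token counters (assumed formula, not GSIM's).

    Each token is weighted by its length, so long shared tokens count more
    than short ones; frequency enters through the min/max of the counts.
    """
    shared = sum(len(t) * min(a[t], b[t]) for t in a.keys() & b.keys())
    total = sum(len(t) * max(a[t], b[t]) for t in a.keys() | b.keys())
    return shared / total if total else 0.0


def is_similar(page_a, page_b, threshold=0.6):
    """Report whether two pages score at or above an assumed threshold."""
    return similarity(tokenize(page_a), tokenize(page_b)) >= threshold


page_a = '<a href="index.php">Home</a>'
page_b = '<a href="index.php">Main</a>'
print(similarity(tokenize(page_a), tokenize(page_b)))
```

Because the sketch never parses the page, it works the same way on HTML, PHP, or any other text, which is the sense in which a token-based measure is language-independent. Raising `min_length` or `threshold` in this sketch makes the comparison stricter.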
