Predicting numeric ratings for Google apps using text features and ensemble learning

  • Umer, Muhammad (Department of Computer Science, Khawaja Freed University) ;
  • Ashraf, Imran (Department of Information and Communication Engineering, Yeungnam University) ;
  • Mehmood, Arif (Department of Computer Science and Information Technology, The Islamia University of Bahawalpur) ;
  • Ullah, Saleem (Department of Computer Science, Khawaja Freed University) ;
  • Choi, Gyu Sang (Department of Information and Communication Engineering, Yeungnam University)
  • Received : 2019.09.26
  • Accepted : 2020.03.02
  • Published : 2021.02.01

Abstract

Application (app) ratings are feedback provided voluntarily by users and serve as important evaluation criteria for apps. However, these ratings can often be biased owing to insufficient or missing votes. Additionally, significant differences have been observed between numeric ratings and user reviews. This study aims to predict the numeric ratings of Google apps using machine learning classifiers. It exploits numeric app ratings provided by users as training data and returns authentic mobile app ratings by analyzing user reviews. An ensemble learning model is proposed for this purpose that considers term frequency/inverse document frequency (TF/IDF) features. Three TF/IDF features, including unigrams, bigrams, and trigrams, were used. The dataset was scraped from the Google Play store, extracting data from 14 different app categories. Biased and unbiased user ratings were discriminated using TextBlob analysis to formulate the ground truth, from which the classifier prediction accuracy was then evaluated. The results demonstrate the high potential for machine learning-based classifiers to predict authentic numeric ratings based on actual user reviews.
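As a rough illustration of the n-gram TF/IDF features mentioned in the abstract, the following Python sketch computes unigram, bigram, and trigram weights for a few reviews. It is not the authors' implementation: the reviews are invented, the helper names `ngrams` and `tfidf` are hypothetical, and the weighting (raw term frequency times log(N / df), with no smoothing or normalization) is an assumption chosen for simplicity.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n):
    """Compute TF-IDF weights of the n-grams in each document.

    Assumes raw term frequency and idf = log(N / df). Libraries such as
    scikit-learn smooth and normalize by default, so their values differ.
    """
    grams_per_doc = [Counter(ngrams(doc.lower().split(), n)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for counts in grams_per_doc:
        df.update(counts.keys())
    total = len(docs)
    return [
        {g: tf * math.log(total / df[g]) for g, tf in counts.items()}
        for counts in grams_per_doc
    ]


# Hypothetical reviews, invented for illustration only.
reviews = [
    "great app works well",
    "great app but crashes often",
    "crashes often after update",
]
unigram_features = tfidf(reviews, 1)
bigram_features = tfidf(reviews, 2)
trigram_features = tfidf(reviews, 3)
```

Each returned dictionary maps an n-gram to its weight in one review. In a pipeline of the kind the abstract describes, such feature vectors would be passed to the classifiers, with the user-supplied numeric ratings as labels.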

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2019R1A2C1006159), and MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2019-2016-0-00313) supervised by the IITP (Institute for Information & Communications Technology Promotion).

