A Two-Phase On-Device Analysis for Gender Prediction of Mobile Users Using Discriminative and Popular Wordsets

모바일 사용자의 성별 예측을 위한 식별 및 인기 단어 집합 기반 2단계 기기 내 분석

Choi, Yerim; Park, Kyuyon; Kim, Solee; Park, Jonghun

  • Received : 2016.01.12
  • Accepted : 2016.02.15
  • Published : 2016.02.28


As privacy becomes an important concern in mobile device data analysis, on-device analysis, in which data are analyzed inside the mobile device without being sent outside it, is attracting attention. One possible application is gender prediction using text data stored on mobile devices, such as text messages, search keywords, website bookmarks, and contacts, all of which are highly private. The limited computing power of mobile devices can be addressed with the word comparison method, in which words are selected beforehand and delivered to a user's device, and the user's gender is determined by matching the mobile text data against the selected words. Moreover, performing prediction after filtering instances with definite evidence is known to increase accuracy and reduce computational complexity. In this regard, we propose a two-phase approach to on-device gender prediction in which the discriminability and popularity of a word are considered sequentially. The proposed method first makes predictions for all instances using a small set of highly discriminative words, and then uses popular words to classify the instances left unclassified by the first phase. In experiments conducted on a real-world dataset, the proposed method outperformed the compared methods.
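The two-phase flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the decision rule, so the sketch assumes a simple majority of word matches, and the function names, wordsets, and placeholder tokens are all hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

Wordsets = Tuple[Set[str], Set[str]]  # (male-associated words, female-associated words)


def vote(tokens: Iterable[str], wordsets: Wordsets) -> Optional[str]:
    """Count matches against each wordset (assumed rule: majority wins).

    Returns None when the counts tie, including when nothing matches.
    """
    male_words, female_words = wordsets
    tokens = list(tokens)
    male_hits = sum(1 for t in tokens if t in male_words)
    female_hits = sum(1 for t in tokens if t in female_words)
    if male_hits > female_hits:
        return "male"
    if female_hits > male_hits:
        return "female"
    return None


def predict_gender(
    tokens: Iterable[str], discriminative: Wordsets, popular: Wordsets
) -> Tuple[Optional[str], int]:
    """Return (label, phase) where phase is the phase that produced the result."""
    tokens = list(tokens)
    # Phase 1: a few highly discriminative words, applied to every instance.
    label = vote(tokens, discriminative)
    if label is not None:
        return label, 1
    # Phase 2: popular words, applied only to instances phase 1 left unclassified.
    return vote(tokens, popular), 2


# Hypothetical wordsets with placeholder tokens, delivered to the device beforehand.
discriminative = ({"d_m1"}, {"d_f1"})
popular = ({"p_m1", "p_m2"}, {"p_f1", "p_f2"})

print(predict_gender(["d_f1", "p_m1"], discriminative, popular))
print(predict_gender(["p_m1", "other"], discriminative, popular))
```

Only set membership tests and counters run on the device in this sketch, which is the kind of lightweight computation the word comparison method relies on; how the wordsets are selected is outside its scope.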


On-Device Analysis; Gender Prediction; Mobile Text; Two-Phase Approach; Discriminative Wordset; Popular Wordset




Supported by : National Research Foundation of Korea