Language-Independent Word Acquisition Method Using a State-Transition Model

  • Xu, Bin (Graduate School of Shonan Institute of Technology) ;
  • Yamagishi, Naohide (Graduate School of Shonan Institute of Technology) ;
  • Suzuki, Makoto (Department of Information Science, Shonan Institute of Technology) ;
  • Goto, Masayuki (Department of Industrial and Management Systems, School of Creative Science and Engineering, Waseda University)
  • Received : 2016.02.18
  • Accepted : 2016.08.13
  • Published : 2016.09.30

Abstract

New words, spoken-language expressions, and abbreviations are used extensively on the Internet, which makes it very difficult to acquire words automatically for the purpose of analyzing Internet content. In a previous study, we proposed a method for Japanese word segmentation using character N-grams. That method is based on a simple state-transition model, which assumes that the input document can be described in terms of four states (denoted A, B, C, and D) specified beforehand: state A represents words (nouns, verbs, etc.); state B represents statement separators (punctuation marks, conjunctions, etc.); state C represents postpositions (namely, words that follow nouns); and state D represents prepositions (namely, words that precede nouns). Under this state-transition model, a state is assigned to each pseudo-word, and the document is searched from beginning to end for an accessible pattern of state transitions. Words are detected in the course of this search. In the present paper, we perform experiments with the proposed word acquisition algorithm using Japanese and Chinese newspaper articles, obtained from Japan's Kyoto University and the Chinese People's Daily, respectively. The proposed method does not depend on the language structure. If the text documents are encoded in Unicode, the same algorithm can obtain words in both Japanese and Chinese, neither of which places spaces between words. Hence, we demonstrate that the proposed method is language independent.
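The following Python sketch illustrates the two ingredients named in the abstract: character N-grams counted over Unicode text, and a four-state transition model that a labelled sequence of pseudo-words is checked against. It is a minimal illustration, not the authors' algorithm. The abstract does not give the transition rules or the way states are assigned, so the `ALLOWED` table, the function names, and the sample state labels are all hypothetical assumptions.

```python
from collections import Counter

# The four states named in the abstract.
STATES = {
    "A": "word (noun, verb, etc.)",
    "B": "statement separator (punctuation mark, conjunction, etc.)",
    "C": "postposition (follows a noun)",
    "D": "preposition (precedes a noun)",
}

# ASSUMPTION: this transition table is invented for illustration only;
# the abstract does not specify which transitions the model permits.
ALLOWED = {
    "START": {"A", "B", "D"},
    "A": {"A", "B", "C"},
    "B": {"A", "B", "D"},
    "C": {"A", "B", "D"},
    "D": {"A"},
}


def char_ngrams(text, n):
    """Count character N-grams; Python strings index by Unicode code point,
    so the same code handles Japanese and Chinese text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def is_accepted(states):
    """Return True if every step in the state sequence is in ALLOWED."""
    current = "START"
    for state in states:
        if state not in ALLOWED.get(current, set()):
            return False
        current = state
    return True


def words_on_path(pseudo_words, states):
    """Return the pseudo-words labelled A if the path is accepted, else []."""
    if len(pseudo_words) != len(states) or not is_accepted(states):
        return []
    return [w for w, s in zip(pseudo_words, states) if s == "A"]


# Hypothetical usage: the state labels are supplied by hand here, whereas
# the paper derives them from the document itself.
print(len(char_ngrams("東京都に住む", 2)))
print(words_on_path(["東京都", "に", "住む", "。"], ["A", "C", "A", "B"]))
```

Because the N-gram counter works on code points and never on spaces or dictionaries, the same call works for a Chinese string such as `char_ngrams("我爱北京", 2)`, which is the property the abstract relies on for language independence.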

Keywords

Word Segmentation; Character N-gram; Language Independent; State Transition

Acknowledgement

Supported by : JSPS
