• Title/Summary/Keyword: Word tokenization


Word-Level Embedding to Improve Performance of Representative Spatio-temporal Document Classification

  • Byoungwook Kim; Hong-Jun Jang
    • Journal of Information Processing Systems, v.19 no.6, pp.830-841, 2023
  • Tokenization is the process of segmenting input text into smaller units, and it is a preprocessing task performed mainly to improve the efficiency of machine learning. Various tokenization methods have been proposed in natural language processing, but studies have primarily focused on segmenting text efficiently. Few studies have explored which tokenization methods are suitable for document classification tasks in Korean. This paper presents an exploratory study to find the most suitable tokenization method for improving the performance of a representative spatio-temporal document classifier in Korean. A convolutional neural network model was used for the experiment, and for the final performance comparison, document classification tasks were selected in which performance depends largely on the tokenization method. The commonly used Jamo, Character, and Word units were adopted as the tokenization methods for the comparative experiments. The experiment confirmed that word-unit tokenization showed excellent performance on the representative spatio-temporal document classification task, where the semantic embedding ability of the token itself is important.
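The three granularities compared in this abstract (Jamo, Character, Word) can be illustrated with a minimal sketch. It is not the authors' preprocessing pipeline: the function names are hypothetical, word units are approximated by whitespace splitting, and Jamo units are obtained from the standard Unicode arithmetic for precomposed Hangul syllables.

```python
# Illustrative only: three tokenization granularities for Korean text.
# Function names are hypothetical and not taken from the paper.

# Unicode conjoining jamo ranges used to decompose precomposed syllables.
CHOSEONG = [chr(0x1100 + i) for i in range(19)]          # initial consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]         # medial vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]  # optional final consonants

HANGUL_BASE = 0xAC00       # first precomposed syllable
HANGUL_COUNT = 19 * 21 * 28  # number of precomposed syllables


def word_tokens(text):
    """Word units, approximated here by whitespace splitting."""
    return text.split()


def char_tokens(text):
    """Character (syllable) units, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def jamo_tokens(text):
    """Jamo units: each Hangul syllable is split into its component jamo."""
    out = []
    for ch in char_tokens(text):
        index = ord(ch) - HANGUL_BASE
        if 0 <= index < HANGUL_COUNT:
            lead, rest = divmod(index, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(CHOSEONG[lead])
            out.append(JUNGSEONG[vowel])
            if tail:  # index 0 means the syllable has no final consonant
                out.append(JONGSEONG[tail])
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return out
```

The same sentence yields progressively longer token sequences as the unit shrinks from word to character to Jamo, which is the trade-off the comparison above examines.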

A Multi-Bible Application on an Android Platform Using a Word Tokenization and Recognition Algorithm (단어 구분 및 인식 알고리즘을 이용한 안드로이드 플랫폼 기반의 멀티 성경 애플리케이션)

  • Kang, Sung-Mo; Kang, Myeong-Su; Kim, Jong-Myon
    • IEMEK Journal of Embedded Systems and Applications, v.6 no.4, pp.215-221, 2011
  • Mobile phones, once used simply for calling and sending text messages, have recently evolved into application-oriented digital devices such as smartphones and tablet phones. The rapid increase in smartphones and tablet phones, which offer advanced capabilities and run a variety of Java-based applications, calls for a variety of digital multimedia content. There are more than 2.2 billion Christians around the world. More than 300 million of them live in Asia, and all of them have and read the Bible. An application that translates the Bible from English into their own languages could be very helpful. For this reason, this paper proposes a multi-Bible application that supports various languages. To this end, we implemented an algorithm that recognizes sentences in the Bible word by word. The algorithm consists essentially of three functions: tokenizing sentences in the Bible into words (word tokenization), recognizing words through touch events (word recognition), and translating the selected words into the desired language. Consequently, the proposed multi-Bible application supports language translation efficiently when the user touches words in Bible sentences.

Korean Head-Tail Tokenization and Part-of-Speech Tagging by using Deep Learning (딥러닝을 이용한 한국어 Head-Tail 토큰화 기법과 품사 태깅)

  • Kim, Jungmin; Kang, Seungshik; Kim, Hyeokman
    • IEMEK Journal of Embedded Systems and Applications, v.17 no.4, pp.199-208, 2022
  • Korean is an agglutinative language in which one or more morphemes are combined to form a single word. Part-of-speech tagging separates each morpheme from a word and attaches a part-of-speech tag. In this study, we propose a new Korean part-of-speech tagging method based on the Head-Tail tokenization technique, which divides a word into a lexical morpheme part and a grammatical morpheme part without decomposing compound words. In this method, the Head and Tail are divided at the syllable boundary without restoring irregular conjugations or abbreviated syllables. A Korean part-of-speech tagger was implemented using Head-Tail tokenization and deep learning. To address the problem that segmented tags generate a large number of complex tags and lower the tagging accuracy, we reduced the tag set to complex tags composed of coarse-grained classification tags, which improved the tagging accuracy. The Head-Tail part-of-speech tagger was evaluated using BERT, syllable bigram, and subword bigram embeddings, and both the syllable bigram and subword bigram embeddings improved performance compared to general BERT. Part-of-speech tagging was performed by integrating the Head-Tail tokenization model and the simplified part-of-speech tagging model, achieving 98.99% word-unit accuracy and 99.08% token-unit accuracy. The experiment also showed that part-of-speech tagging performance improved when the maximum token length was limited to twice the number of words.

A Methodology for Urdu Word Segmentation using Ligature and Word Probabilities

  • Khan, Yunus; Nagar, Chetan; Kaushal, Devendra S.
    • International Journal of Ocean System Engineering, v.2 no.1, pp.24-31, 2012
  • This paper introduces a word segmentation technique for handwriting recognition of Urdu script. Word segmentation, or word tokenization, is a primary technique for understanding sentences written in Urdu. Several word segmentation techniques are available for other languages, but little work has been done on word segmentation for Urdu Optical Character Recognition (OCR) systems. This paper proposes a word segmentation method that finds the boundaries of words in a sequence of ligatures using probabilistic formulas, utilizing knowledge of the collocation of ligatures and words in the corpus. The word identification rate using this technique is 97.10%, with a 66.63% identification rate for unknown words.
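The idea of finding word boundaries in a sequence of units from corpus probabilities can be sketched as a dynamic program. This is a simplified stand-in, not the paper's formulas: it uses only word unigram probabilities (no ligature collocation statistics), the `segment` function and its parameters are hypothetical, and the example uses Latin letters in place of Urdu ligatures.

```python
import math


def segment(units, word_prob, max_len=4, unknown_prob=1e-6):
    """Return the most probable split of `units` into words.

    units        -- list of atomic units (ligatures in the Urdu setting)
    word_prob    -- dict mapping a known word to its unigram probability
    max_len      -- assumed maximum number of units per word
    unknown_prob -- assumed per-unit probability for unseen candidates
    """
    n = len(units)
    best = [0.0] + [-math.inf] * n  # best log-probability of units[:i]
    back = [0] * (n + 1)            # start index of the last word in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = "".join(units[start:end])
            # Unseen candidates get a small probability that shrinks with length.
            p = word_prob.get(word, unknown_prob ** (end - start))
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        start = back[end]
        words.append("".join(units[start:end]))
        end = start
    return words[::-1]


# Toy usage with made-up probabilities.
toy_probs = {"the": 0.5, "cat": 0.3, "sat": 0.2}
toy_result = segment(list("thecatsat"), toy_probs)
```

A fuller model along the lines described in the abstract would also score adjacent ligatures and adjacent words jointly instead of treating each word independently.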

Comparison of Word Extraction Methods Based on Unsupervised Learning for Analyzing East Asian Traditional Medicine Texts (한의학 고문헌 텍스트 분석을 위한 비지도학습 기반 단어 추출 방법 비교)

  • Oh, Junho
    • Journal of Korean Medical classics, v.32 no.3, pp.47-57, 2019
  • Objectives: We aim to assist in choosing an appropriate word extraction method based on unsupervised learning for analyzing East Asian traditional medical texts. Methods: To rank substrings, we tested one method for the exterior boundary value (BE: branching entropy), three methods for the interior boundary value (CS: cohesion score, TS: t-score, SL: simple-ll), and six methods that combine them (BExSL, BExTS, BExCS, CSxTS, CSxSL, TSxSL). Results: When Miss Rate (MR) was used as the criterion, the error was smallest when TS and SL were used together and largest when CS was used alone. When the number of segmented texts was applied as a weight, the results were best with SL and worst with BE alone. Conclusions: Unsupervised-learning-based word extraction can be used to analyze texts without a prepared vocabulary. When using this approach, SL or the combination of SL and TS should be considered first.
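As a rough illustration of the exterior boundary measure named above, the sketch below computes a right branching entropy: the entropy of the distribution of characters that follow a substring in a corpus. It follows the generic textbook definition and may differ from the formulation used in the paper; the function name and the toy corpus are assumptions.

```python
import math
from collections import Counter


def right_branching_entropy(corpus, substring):
    """Entropy (in bits) of the characters that follow `substring` in `corpus`."""
    followers = Counter()
    start = corpus.find(substring)
    while start != -1:
        nxt = start + len(substring)
        if nxt < len(corpus):
            followers[corpus[nxt]] += 1
        start = corpus.find(substring, start + 1)  # allow overlapping matches
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


# Toy corpus: "ab" is followed by three different characters, "a" always by "b".
toy_corpus = "abcabdabe"
entropy_ab = right_branching_entropy(toy_corpus, "ab")
entropy_a = right_branching_entropy(toy_corpus, "a")
```

A high value suggests that many different characters can follow the substring, which is the usual signal that a word boundary sits at its right edge.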

Categorization of Korean News Articles Based on Convolutional Neural Network Using Doc2Vec and Word2Vec (Doc2Vec과 Word2Vec을 활용한 Convolutional Neural Network 기반 한국어 신문 기사 분류)

  • Kim, Dowoo; Koo, Myoung-Wan
    • Journal of KIISE, v.44 no.7, pp.742-747, 2017
  • In this paper, we propose a novel approach that improves the performance of a Convolutional Neural Network (CNN) word embedding model built on word2vec by additionally using doc2vec output in a document classification task. The Word Piece Model (WPM) is shown empirically to outperform other tokenization methods, such as phrase units and a part-of-speech tagger (classification rate: 79.5%). Further, we conducted an experiment classifying news articles written in Korean into ten categories by feeding word and document vectors generated with WPM to the baseline and the proposed model. In the experiment, the proposed model showed a higher classification rate (89.88%) than its counterpart model (86.89%), achieving a 22.80% improvement. This research demonstrates that applying doc2vec to the document classification task yields more effective results because doc2vec generates similar document vector representations for documents belonging to the same category.
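The 22.80% figure does not match the absolute gap between the two classification rates (2.99 percentage points). It is, however, close to the relative reduction in error rate, so the following sketch checks that reading. This interpretation is an assumption, not something the abstract states, and the function name is hypothetical.

```python
def relative_error_reduction(baseline_acc, proposed_acc):
    """Percentage by which the error rate shrinks, given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    proposed_err = 100.0 - proposed_acc
    return (baseline_err - proposed_err) / baseline_err * 100.0


# Classification rates quoted in the abstract.
reduction = relative_error_reduction(86.89, 89.88)
```

With the quoted rates, the error falls from 13.11% to 10.12%, a relative reduction of roughly 22.8%.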

Korean Part-Of-Speech Tagging by using Head-Tail Tokenization (Head-Tail 토큰화 기법을 이용한 한국어 품사 태깅)

  • Suh, Hyun-Jae; Kim, Jung-Min; Kang, Seung-Shik
    • Smart Media Journal, v.11 no.5, pp.17-25, 2022
  • Korean part-of-speech taggers decompose a compound morpheme into unit morphemes and attach part-of-speech tags. A disadvantage is that the parts of speech of morphemes are over-classified in detail and complex word types are generated depending on the purpose of the tagger. When a part-of-speech tagger is used for keyword extraction in deep-learning-based language processing, compound particles and verb endings do not need to be decomposed. In this study, the part-of-speech tagging problem is simplified by using a Head-Tail tokenization technique that divides a word into only two types of tokens, a lexical morpheme part and a grammatical morpheme part, which solves the problem of excessively decomposed morphemes. Part-of-speech tagging was attempted with a statistical technique and a deep learning model on the Head-Tail tokenized corpus, and the accuracy of each model was evaluated. Part-of-speech tagging was implemented with the TnT tagger, a statistical part-of-speech tagger, and a Bi-LSTM tagger, a deep-learning-based part-of-speech tagger. Both taggers were trained on the Head-Tail tokenized corpus to measure part-of-speech tagging accuracy. The results showed that the Bi-LSTM tagger achieved a high accuracy of 99.52%, compared to 97.00% for the TnT tagger.

A Comparative study on the Effectiveness of Segmentation Strategies for Korean Word and Sentence Classification tasks (한국어 단어 및 문장 분류 태스크를 위한 분절 전략의 효과성 연구)

  • Kim, Jin-Sung; Kim, Gyeong-min; Son, Jun-young; Park, Jeongbae; Lim, Heui-seok
    • Journal of the Korea Convergence Society, v.12 no.12, pp.39-47, 2021
  • Constructing high-quality input features through effective segmentation is essential for increasing the sentence comprehension of a language model. Improving their quality directly affects performance on the downstream task. This paper comparatively studies segmentation strategies that effectively reflect the linguistic characteristics of Korean for word and sentence classification. The segmentation types are defined in four categories (eojeol, morpheme, syllable, and subchar), and pre-training is carried out using the RoBERTa model structure. By dividing tasks into a sentence group and a word group, we analyze the tendency within each group and the difference between the groups. The model with subchar-level segmentation outperformed the other strategies in sentence classification by up to +0.62% on NSMC, +2.38% on KorNLI, and +2.41% on KorSTS, and the model with syllable-level segmentation outperformed them in word classification by up to +0.7% on NER and +0.61% on SRL. These experimental results confirm the effectiveness of those schemes.

Syntactic and Semantic Disambiguation for Interpretation of Numerals in the Information Retrieval (정보 검색을 위한 숫자의 해석에 관한 구문적.의미적 판별 기법)

  • Moon, Yoo-Jin
    • Journal of the Korea Society of Computer and Information, v.14 no.8, pp.65-71, 2009
  • Natural language processing is necessary to efficiently filter the tremendous amount of information produced in information retrieval on the World Wide Web. This paper proposes an algorithm for interpreting the meaning of numerals in text. The algorithm uses context-free grammars with the chart parsing technique, interprets affixes attached to the numerals, and is designed to disambiguate their meanings systematically with support from n-gram-based words. It is also designed to use POS (part-of-speech) taggers, to automatically recognize restriction conditions of trigram words, and to gradually disambiguate the meaning of the numerals. An experiment was performed on the proposed numeral interpretation system. The results showed that the frequency-proportional method recognized the numerals with 86.3% accuracy and the condition-proportional method with 82.8% accuracy.

Morphology Representation using STT API in Rasbian OS (Rasbian OS에서 STT API를 활용한 형태소 표현에 대한 연구)

  • Woo, Park-jin; Im, Je-Sun; Lee, Sung-jin; Moon, Sang-ho
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference, 2021.10a, pp.373-375, 2021
  • For Korean, tagging based on English-style word tokenization offers less potential for development than it does for English. Although a corpus tokenized into morpheme units via KoNLPy can be represented as a graph database, full separation of voice files and verification of practicality are required when converting the module from graph database to corpus. In this paper, morphology representation using an STT API is demonstrated on a Raspberry Pi. The voice file converted into a corpus is analyzed with KoNLPy and tagged. The analyzed results are represented in a graph database and can be divided into morpheme-level tokens, and it is judged that purpose-specific data mining is possible by determining practicality and degree of separation.
