• Title/Summary/Keyword: WordCount


A Performance Analysis Based on Hadoop Application's Characteristics in Cloud Computing (클라우드 컴퓨팅에서 Hadoop 애플리케이션 특성에 따른 성능 분석)

  • Keum, Tae-Hoon;Lee, Won-Joo;Jeon, Chang-Ho
    • Journal of the Korea Society of Computer and Information / v.15 no.5 / pp.49-56 / 2010
  • In this paper, we implement a Hadoop-based cluster for cloud computing and evaluate its performance with respect to application characteristics by running the RandomTextWriter, WordCount, and PI applications. RandomTextWriter generates a given amount of random words and stores them in HDFS (Hadoop Distributed File System). WordCount reads an input file and determines word frequencies per block. The PI application estimates the value of pi using the Monte Carlo method. In the simulation, we investigate the effect of data block size and the number of replications on application execution time. The results confirm that the execution time of RandomTextWriter is proportional to the number of replications, whereas the execution times of WordCount and PI are not affected by it. Moreover, the execution time of WordCount is optimal when the block size is 64~256MB. These results show that the performance of a cloud computing system can be enhanced by a scheduling scheme that takes application characteristics into account.
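
The WordCount and PI workloads named in this abstract can be illustrated with a minimal single-process sketch. It is not the Hadoop implementation the authors ran: the function names, the sample `blocks`, and the fixed seed are assumptions made for illustration, and each string merely stands in for one HDFS block.

```python
import random
from collections import Counter


def map_block(block):
    """Map step: count the words in one block of text."""
    return Counter(block.split())


def reduce_counts(partials):
    """Reduce step: merge the per-block counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def estimate_pi(samples, seed=0):
    """Monte Carlo estimate of pi.

    The fraction of random points in the unit square that fall inside
    the quarter circle approaches pi / 4 as the sample count grows.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


# Hypothetical input: each string stands in for one HDFS block.
blocks = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = reduce_counts(map_block(b) for b in blocks)
pi_estimate = estimate_pi(100_000)
```

The word count touches every input block, while the pi estimate needs almost no input data, which is the kind of difference in application characteristics the abstract evaluates.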

Comparison between Word Embedding Techniques in Traditional Korean Medicine for Data Analysis: Implementation of a Natural Language Processing Method (한의학 고문헌 데이터 분석을 위한 단어 임베딩 기법 비교: 자연어처리 방법을 적용하여)

  • Oh, Junho
    • Journal of Korean Medical classics / v.32 no.1 / pp.61-74 / 2019
  • Objectives: The purpose of this study is to help select an appropriate word embedding method when analyzing East Asian traditional medicine texts as data. Methods: Based on prescription data that imply traditional methods in traditional East Asian medicine, we examined four count-based and two prediction-based word embedding methods. To compare these word embedding methods intuitively, we proposed a "prescription generating game" and compared its results with those obtained by applying the six methods. Results: When adjacent vectors are extracted, the count-based word embedding methods derive the main herbs that are frequently used in conjunction with each other. The prediction-based word embedding methods, on the other hand, derive synonyms of the herbs. Conclusions: Count-based word embedding methods seem to be more effective than prediction-based word embedding methods in analyzing the use of domesticated herbs. Among count-based word embedding methods, the TF-vector method tends to exaggerate the frequency effect, and hence the TF-IDF vector or co-word vector may be a more reasonable choice. The t-score vector may also be recommended when searching for unusual information that cannot be found from frequency alone. Prediction-based embedding, on the other hand, seems to be effective for deriving the bases of similar meanings in context.
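
The contrast between the TF vector and the TF-IDF vector mentioned in the conclusions can be shown with a small sketch. The abstract does not give the exact weighting used, so the `idf = log(N / df)` variant, the function name `herb_vectors`, and the toy prescriptions below are all assumptions for illustration.

```python
import math
from collections import Counter


def herb_vectors(prescriptions, use_idf=False):
    """Return {herb: vector over prescriptions}.

    Entry j is the herb's count in prescription j, optionally scaled
    by idf = log(N / df), where df is the number of prescriptions
    that contain the herb.
    """
    n = len(prescriptions)
    counts = [Counter(p) for p in prescriptions]
    herbs = sorted({h for p in prescriptions for h in p})
    vectors = {}
    for herb in herbs:
        df = sum(1 for c in counts if herb in c)
        weight = math.log(n / df) if use_idf else 1.0
        vectors[herb] = [c[herb] * weight for c in counts]
    return vectors


# Hypothetical prescriptions, each a list of herb names (illustrative only).
prescriptions = [
    ["licorice", "ginseng", "ginger"],
    ["licorice", "cinnamon", "ginger"],
    ["licorice", "peony"],
]
tf = herb_vectors(prescriptions)
tfidf = herb_vectors(prescriptions, use_idf=True)
```

In this toy data the herb that appears in every prescription has the largest TF vector but an all-zero TF-IDF vector, which is one way to see how raw frequency can dominate a TF-based comparison.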

A Case Study on the Using of Ryang, a Word of Wooden Structure in Joseon Dynasty (조선시대 목조가구 용어 량의 사용 사례 연구)

  • Lee, Yeon-Ro
    • Journal of architectural history / v.25 no.4 / pp.7-18 / 2016
  • This thesis mainly deals with how the 'count of Ryang' was used in the Joseon dynasty. The count of Ryang means how many purlins were used in a building, as seen in longitudinal section. The study finds that the notion of Ryang in the Joseon dynasty does not differ from the present one, but its usage is slightly different from that of the present day. In the Joseon dynasty, the count of Ryang mainly appeared together with another term, the count of Kan. The count of Kan has two meanings: one is length, and the other is the area of a building. When the count of Ryang was used in combination with Kan, the count of Kan meant length. Used this way, the count of Ryang indicates the size of the flank and the count of Kan indicates the length of the front. In the 19th century, the count of Ryang looks similar to the past, but the count of Kan shows another aspect: it indicated not the length but the area of the building. This study shows that although the usage of Ryang in the Joseon dynasty differed from the present, the concept of Ryang was similar.

A Case Study on the Using of Ryang, a Word of Wooden Structure in the Daehan Empire (대한제국기 목조가구 용어 량(樑)의 사용 사례 연구)

  • Lee, Yeon-Ro
    • Journal of architectural history / v.25 no.5 / pp.41-50 / 2016
  • This thesis mainly deals with how the 'count of Ryang' was used in the Daehan Empire. The count of Ryang means how many purlins were used in a building, as seen in longitudinal section. The study finds that the notion of Ryang in the Daehan Empire does not differ from the present one, but its usage differs both from the Joseon Dynasty and from the present. In the Daehan Empire, the count of Ryang mainly appeared together with another term, the count of Kan. In the Joseon Dynasty, the count of Ryang was used in combination with Kan, and the count of Kan meant the purlin-directional length. Used this way, the count of Ryang indicates the size of the flank and the count of Kan indicates the length of the front. In the Daehan Empire, however, the count of Kan, especially the beam-directional length, was considered first, and then the count of Ryang. Separately, another count of Kan was used to mean the area of the building. By combining the count of Kan and the count of Ryang in the beam direction, builders focused more on the frame of the wooden structure than before.

The Review about the Development of Korean Linguistic Inquiry and Word Count (언어적 특성을 이용한 '심리학적 한국어 글분석 프로그램(KLIWC)' 개발 과정에 대한 고찰)

  • Lee Chang H.;Sim Jung-Mi;Yoon Aesun
    • Korean Journal of Cognitive Science / v.16 no.2 / pp.93-121 / 2005
  • A substantial body of research has accumulated from attempts to use linguistic style as the dependent measure in psychological research. This research was conducted to develop a Korean text analysis program (KLIWC) based on the English text analysis program LIWC (Linguistic Inquiry and Word Count); the program reflects Korean linguistic characteristics and the culture related to the language. We made it possible to analyze agglutinative phrases composed of many morphemes by linguistic tagging, and built a base-form dictionary and inflection rules. In addition, face-saving words and emotional words were included as analysis variables. The development process and the characteristics of Korean text analysis are reviewed, and future directions for improving the program are discussed.


Design of a Sentiment Analysis System to Prevent School Violence and Student's Suicide (학교폭력과 자살사고를 예방하기 위한 감성분석 시스템의 설계)

  • Kim, YoungTaek
    • The Journal of Korean Association of Computer Education / v.17 no.6 / pp.115-122 / 2014
  • One of the problems facing the current generation of young people is the increasing rate of violence and suicide in their school lives, and this study aims to design a sentiment analysis system that helps prevent suicide by using big data processing. The main design goals are economical implementation and easy, fast processing for users, so the open source Hadoop system with the MapReduce algorithm is used on HDFS (Hadoop Distributed File System) for the experiments. This study applies the word count method to sentiment analysis of informal data from SNS communications, counting kinds of violent words as a form of text mining, in order to avoid expensive and complex statistical analysis methods.
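
The word count approach to sentiment analysis described in this abstract amounts to counting occurrences of words from a lexicon. A minimal sketch follows, with the caveat that the lexicon, the messages, and the function name `count_flagged_words` are hypothetical; the abstract does not list the paper's actual word list or its Hadoop job.

```python
from collections import Counter

# Hypothetical lexicon; the paper's actual word list is not given in the abstract.
VIOLENT_WORDS = {"hit", "kill", "hate"}


def count_flagged_words(messages, lexicon=VIOLENT_WORDS):
    """Count how often each lexicon word appears across the messages."""
    counts = Counter()
    for message in messages:
        for token in message.lower().split():
            # Strip surrounding punctuation so "hate!" matches "hate".
            word = token.strip(".,!?\"'")
            if word in lexicon:
                counts[word] += 1
    return counts


# Hypothetical messages standing in for informal SNS data.
messages = ["I hate this. I hate it!", "He said he would hit me"]
flagged = count_flagged_words(messages)
```

Because each message is processed independently, the same counting step maps naturally onto a MapReduce job, which is the processing model the abstract names.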


A Study on Phon Call Big Data Analytics (전화통화 빅데이터 분석에 관한 연구)

  • Kim, Jeongrae;Jeong, Chanki
    • Journal of Information Technology and Architecture / v.10 no.3 / pp.387-397 / 2013
  • This paper proposes an approach to big data analytics for phone call data. The analytical model for phone call data is composed of the PVPF (Parallel Variable-length Phrase Finding) algorithm, which identifies verbal phrases of natural language, and the word count algorithm, which measures the usage frequency of keywords. In the proposed model, we identify words using the PVPF algorithm and measure the usage frequency of the identified words using the word count algorithm in MapReduce. The results can be interpreted from various viewpoints. We design and implement the model on HDFS (Hadoop Distributed File System) and verify the proposed approach through a case study of phone call data. As a result, we extract useful findings through analysis of keyword correlation and usage frequency.

The Comparison of Linguistic and Psychological Characteristics in the Writing of Korean and Korean-Chinese Adolescents (한국 및 중국 조선족 청소년의 글에 나타난 언어학적, 심리학적 특성 비교)

  • Park, Min-Jung;Park, Hyewon
    • Korean Journal of Child Studies / v.29 no.3 / pp.357-373 / 2008
  • This study compared the writing of Korean and Korean-Chinese adolescents using K-LIWC (Korean-Linguistic Inquiry Word Count; Lee & Yoon, 2005). Three hundred ten middle school students (70 from Ulsan, Korea; 90 from Yanji and 150 from Shenyang, China) wrote a self-introductory essay for unknown friends. K-LIWC yielded counts and percentages of word categories using the parts of speech of the Korean language and psychological criteria (emotional, cognitive, sensory/perceptual, social, physical/functional, and metaphysical processes). Results showed that the use of pre-nouns and the present tense correlated with negative mood in the subjects. The writings of Korean-Chinese adolescents in Shenyang showed the most negative emotions among the three groups. This was interpreted as a reflection of better protective factors for Korean-Chinese adolescents in Yanji compared with Shenyang.


Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence

  • Mao, Makara;Peng, Sony;Yang, Yixuan;Park, Doo-Soon
    • Journal of Information Processing Systems / v.18 no.4 / pp.549-561 / 2022
  • Khmer script, the official script of Cambodia, is written from left to right without a space separator; it is complicated and requires more analytical study. Without clear standard guidelines, spaces are used inconsistently and informally to separate words in Khmer sentences. Therefore, a segmentation method should be discussed in combination with future Khmer natural language processing (NLP) to define appropriate rules for Khmer sentences. NLP processes capable of large-scale language data analysis need to be applied in this scenario. One of the essential components of Khmer language processing is how to split sentences into a series of words and count the words used in them. Currently, Microsoft Word cannot count Khmer words correctly. This study therefore presents a systematic library that segments Khmer phrases using the bi-directional maximal matching (BiMM) method to address these constraints. In the BiMM algorithm, the paper focuses on the bidirectional implementation of forward maximal matching (FMM) and backward maximal matching (BMM) to improve word segmentation accuracy. A digital tree or prefix tree data structure, also known as a trie, enhances segmentation accuracy by finding the children of each word's parent node. The accuracy of BiMM is higher than that of FMM or BMM used independently; moreover, the proposed approach improves dictionary structures and reduces the number of errors. The study reduces the error by 8.57% compared to the FMM and BMM algorithms on 94,807 Khmer words.
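
The idea of combining forward and backward maximal matching can be sketched as follows. This is not the paper's library: it uses a plain Python set where the paper uses a trie, an English toy string in place of Khmer text, and a common tie-breaking heuristic (fewer tokens, then fewer single-character tokens, then the backward result), since the abstract does not state the paper's selection rule.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def count_singles(tokens):
    """Number of single-character tokens, a rough sign of failed matches."""
    return sum(1 for t in tokens if len(t) == 1)


def bimm(text, dictionary):
    """Pick between the forward and backward segmentations (assumed heuristic)."""
    max_len = max(map(len, dictionary))
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    return fwd if count_singles(fwd) < count_singles(bwd) else bwd


# Hypothetical toy dictionary and unsegmented input.
words = {"the", "them", "menu"}
fwd_only = forward_max_match("themenu", words, 4)
segmented = bimm("themenu", words)
```

Here the forward pass greedily takes "them" and is left with unmatched single characters, while the backward pass recovers "the" and "menu", so the combined method returns the backward segmentation.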

A Performance Analysis Based on Spark Application (Spark 애플리케이션 기반의 성능 분석)

  • Jung, Young Gyo;Lee, Byung-Jun;Cho, Young-Joo;Youn, Hee Yong
    • Proceedings of the Korean Society of Computer Information Conference / 2016.01a / pp.79-80 / 2016
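  • Apache Spark is an open-source distributed data processing platform that uses a distributed memory abstraction to process large volumes of data efficiently. However, the performance of a given job on the Apache Spark platform has the problem that memory usage and I/O cost can vary widely depending on the type and size of the input data, the design and implementation of the algorithm, and the available computing power. To address this problem, this paper compares and evaluates WordCount simulations as the number of CPU cores increases, so that job performance on the Apache Spark platform can be predicted with high precision.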
