• Title/Summary/Keyword: TF-IDF

Hot Topic Prediction Scheme Using Modified TF-IDF in Social Network Environments (소셜 네트워크 환경에서 변형된 TF-IDF를 이용한 핫 토픽 예측 기법)

  • Noh, Yeonwoo;Lim, Jongtae;Bok, Kyoungsoo;Yoo, Jaesoo
    • KIISE Transactions on Computing Practices / v.23 no.4 / pp.217-225 / 2017
  • Interest in predicting hot topics has grown significantly, as it has become more important to find and analyze meaningful information in the large volume of data flowing through social networking services. Existing hot topic detection schemes do not consider temporal properties, so they are not suitable for predicting hot topics that emerge rapidly in a changing society. This paper proposes a hot topic prediction scheme that uses a modified TF-IDF in social network environments. The modified TF-IDF extracts a candidate set of keywords that surge momentarily. The proposed scheme then calculates hot topic prediction scores for the extracted candidate keywords by assigning weights that consider user influence and expertise. The superiority of the proposed scheme is shown by comparing it to an existing detection scheme. In addition, to verify whether it predicts hot topics correctly, we evaluate its quality against Korean news articles from Naver.

A study on Korean language processing using TF-IDF (TF-IDF를 활용한 한글 자연어 처리 연구)

  • Lee, Jong-Hwa;Lee, MoonBong;Kim, Jong-Weon
    • The Journal of Information Systems / v.28 no.3 / pp.105-121 / 2019
  • Purpose: One of the reasons for the expansion of information systems in the enterprise is the increased efficiency of data analysis. In particular, complex and unstructured data types such as video, voice, images, and conversations inside and outside social networks are increasing rapidly. The purpose of this study is to analyze customer needs from the voice of the customer, i.e., text data, in the web environment. Design/methodology/approach: Previous studies indicate that extracting the words that best characterize a sentence is more effective than simple frequency analysis. In this study of Korean text, we applied the TF-IDF method, which extracts the important keywords in real sentences, rather than the TF method, a word extraction technique that represents sentences by simple frequency alone. We visualized the results of the two techniques using cluster analysis and describe the differences. Findings: Applying the TF and TF-IDF techniques to Korean natural language processing showed the value of moving from frequency analysis to semantic analysis, and the results are expected to change the techniques used by Korean language processing researchers.
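
The contrast between the TF and TF-IDF methods described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the text has already been tokenized (for Korean, typically by a morphological analyzer), uses the common tf × log(N/df) weighting, and the toy corpus and function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency(doc_tokens):
    """TF method: raw term counts for one tokenized document."""
    return Counter(doc_tokens)


def tf_idf(docs_tokens):
    """TF-IDF method: one {term: tf * log(N / df)} dict per document."""
    n_docs = len(docs_tokens)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    weights = []
    for tokens in docs_tokens:
        tf = term_frequency(tokens)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights


# Hypothetical toy corpus of customer comments, already tokenized.
docs = [
    ["delivery", "slow", "product", "good"],
    ["product", "good", "price", "good"],
    ["product", "broken", "refund"],
]
weights = tf_idf(docs)
```

In this toy corpus "product" occurs in every document, so its TF-IDF weight is zero even though its raw frequency is as high as that of any other term, while a term such as "refund" that occurs in only one document keeps a positive weight.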

Impact of Word Embedding Methods on Performance of Sentiment Analysis with Machine Learning Techniques

  • Park, Hoyeon;Kim, Kyoung-jae
    • Journal of the Korea Society of Computer and Information / v.25 no.8 / pp.181-188 / 2020
  • In this study, we conduct a comparative study to confirm the impact of various word embedding techniques on the performance of sentiment analysis. Sentiment analysis is an opinion mining technique that uses natural language processing to identify and extract subjective information from text, and it can be used to classify the sentiment of product reviews or comments. Since sentiment can be classified as either positive or negative, it can be treated as a general classification problem. For sentiment analysis, the text must be converted into a form that a computer can process. In natural language processing, text such as a word or document is therefore transformed into a vector, a process called word embedding. Various techniques, such as Bag of Words, TF-IDF, and Word2Vec, are used as word embedding techniques. Until now, there have not been many studies on which word embedding techniques are suitable for sentiment analysis. In this study, Bag of Words, TF-IDF, and Word2Vec are used to compare and analyze the performance of movie review sentiment analysis. The data set for this study is the IMDB data set, which is widely used in text mining. The results show that TF-IDF and Bag of Words outperformed Word2Vec, and that TF-IDF performed better than Bag of Words, although the difference was not very significant.

Comparison of term weighting schemes for document classification (문서 분류를 위한 용어 가중치 기법 비교)

  • Jeong, Ho Young;Shin, Sang Min;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics / v.32 no.2 / pp.265-276 / 2019
  • The document-term frequency matrix is a common data object in text mining. In this study, we introduce the traditional term weighting scheme TF-IDF (term frequency-inverse document frequency), which is applied to the document-term frequency matrix and used for text classification. In addition, we introduce and compare the TF-IDF-ICSDF and TF-IGM schemes, which have recently become well known. This study also provides a keyword extraction method that enhances the quality of text classification. Based on the extracted keywords, we applied a support vector machine for text classification. To compare the performance of the term weighting schemes, we used performance metrics such as precision, recall, and F1-score. The results show that the TF-IGM scheme achieved the highest performance metrics and was the best suited to text classification.

A Study on Patent Data Analysis and Competitive Advantage Strategy using TF-IDF and Network Analysis (TF-IDF와 네트워크분석을 이용한 특허 데이터 분석과 경쟁우위 전략수립에 관한 연구)

  • Yun, Seok-Yong;Han, Kyeong-Seok
    • Journal of Digital Contents Society / v.19 no.3 / pp.529-535 / 2018
  • Data is growing explosively, but many companies still use data analysis only for descriptive or diagnostic analysis, and not appropriately for predictive analysis or enterprise technology strategy analysis. In this study, we analyze structured and unstructured patent data, such as IPC code, inventor, and filing date, using big data analysis techniques such as network analysis and TF-IDF. Through this analysis, we propose an analysis process for understanding the core technologies and technology distribution of competitors, and we validate it through data analysis.

Performance Evaluations of Text Ranking Algorithms

  • Kim, Myung-Hwi;Jang, Beakcheol
    • Journal of the Korea Society of Computer and Information / v.25 no.2 / pp.123-131 / 2020
  • The text ranking algorithm is a representative method for keyword extraction, and its importance is widely emphasized. In this paper, we review recent research and experimentally compare the performance of the TF-IDF, SMART, INQUERY, and CCA algorithms, which are used for text ranking. After explaining each algorithm, we compare their performance on data collected from news and Twitter. Experimental results show that all four algorithms extract specific words from news data equally well. However, in the case of Twitter, CCA performs best at extracting specific words, and INQUERY performs worst. We also analyze the accuracy of the algorithms using six comparison metrics. The experimental results show that CCA has the best accuracy on the news data. On the Twitter data, TF-IDF and CCA show similar, good performance.

RTFIDF·VT: a New TF-IDF Algorithm considered Variety of Tweets (RTFIDF·VT: 트윗의 다양성을 고려한 새로운 TF-IDF 알고리즘)

  • Oh, Pyeonghwa;Kim, Seokjung;Yoon, Jinyoung;Yim, Junyeob;Hwang, Byung-Yeon
    • Proceedings of the Korea Information Processing Society Conference / 2013.11a / pp.1241-1244 / 2013
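  • As the spread of smartphones has improved web accessibility, social network services that grew up on mobile platforms have seen explosive growth in users. Among them, Twitter, with its open user-to-user network connections and strong propagation power, has taken on the form of social journalism in which individual users produce and consume information, and its influence continues to grow. Accordingly, research on detecting events using Twitter is being actively conducted. However, applying the conventional TF-IDF algorithm to event detection does not adequately reflect the characteristics of Twitter. In this paper, we propose a new TF-IDF algorithm that modifies the weights of the conventional TF-IDF algorithm to reflect the characteristics of Twitter and then applies a correction factor, and we demonstrate the improved performance of the new algorithm through experiments on two events.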

Document Clustering with Relational Graph Of Common Phrase and Suffix Tree Document Model (공통 Phrase의 관계 그래프와 Suffix Tree 문서 모델을 이용한 문서 군집화 기법)

  • Cho, Yoon-Ho;Lee, Sang-Keun
    • The Journal of the Korea Contents Association / v.9 no.2 / pp.142-151 / 2009
  • A previous document clustering method, NSTC, measures the similarity between pairs of documents using TF-IDF during web document clustering. In this paper, we propose a new similarity measure that uses a common phrase-based relational graph instead of TF-IDF. The method weights common phrases using a relational graph that represents the relationships among common phrases in the document collection. Experimental results indicate that the proposed method clusters document collections more effectively than NSTC.

Style-Specific Language Model Adaptation using TF*IDF Similarity for Korean Conversational Speech Recognition

  • Park, Young-Hee;Chung, Min-Hwa
    • The Journal of the Acoustical Society of Korea / v.23 no.2E / pp.51-55 / 2004
  • In this paper, we propose a style-specific language model adaptation scheme using n-gram based tf*idf similarity for Korean spontaneous speech recognition. Korean spontaneous speech shows distinctive style-specific characteristics such as filled pauses, word omission, and contraction, which are related to function words and depend on the preceding or following words. To reflect these style-specific characteristics and to overcome the lack of data for training the language model, we estimate an in-domain dependent n-gram model by relevance weighting of out-of-domain text data according to their n-gram based tf*idf similarity; the in-domain language model includes a disfluency model. Recognition results show that n-gram based tf*idf similarity weighting effectively reflects style differences.
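
The idea of scoring out-of-domain text by its n-gram based tf*idf similarity to in-domain text can be sketched as below. This is only an illustration under stated assumptions: the abstract does not give the exact weighting or similarity formula, so the sketch assumes bigrams, a smoothed idf (one common variant), and cosine similarity; all names and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_idf(docs, n=2):
    """Smoothed idf per n-gram (assumed variant): log((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    df = Counter(g for d in docs for g in set(ngrams(d, n)))
    return {g: math.log((1 + n_docs) / (1 + df[g])) + 1.0 for g in df}


def tfidf_vector(tokens, idf, n=2):
    """Sparse tf*idf vector over the n-grams of one text."""
    counts = Counter(ngrams(tokens, n))
    return {g: c * idf.get(g, 0.0) for g, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical data: one in-domain (conversational) sample with filled
# pauses, and two out-of-domain texts to be relevance-weighted.
in_domain = ["uh", "i", "want", "to", "go", "uh", "home"]
out_docs = [
    ["i", "want", "to", "go", "home"],
    ["the", "market", "closed", "higher", "today"],
]
idf = build_idf(out_docs + [in_domain])
target = tfidf_vector(in_domain, idf)
relevance = [cosine(tfidf_vector(d, idf), target) for d in out_docs]
```

Here the first out-of-domain text shares several bigrams with the in-domain sample and receives a higher relevance weight than the second, which shares none. In an adaptation scheme such weights would scale each text's contribution to the n-gram counts.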

Research of Term-Weighting Method in an Usenet Information Retrieval System (유즈넷 정보검색시스템에서 단어 가중치 적용방법에 관한연구)

  • 최재덕;최진석;박민식
    • Proceedings of the Korean Information Science Society Conference / 1998.10b / pp.339-341 / 1998
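  • Usenet, one of many means of exchanging information, holds a vast amount of information. Because users cannot easily find the information they need on Usenet, they recognize the need for information retrieval across all newsgroups and article bodies. This paper aims to improve retrieval efficiency by improving the term-weighting method used when an information retrieval system is extended to Usenet. Using category frequency (cf), a factor other than tf and idf that affects the importance of a term in information retrieval, we propose and validate a similarity calculation method that adds inverted category frequency (icf) to the tf*idf method. In the experimental results, the average number of relevant documents within the top 30 results improved by 4.6% with the tf*$\sqrt{idf^2+icf^2}$ method compared with the tf*idf method.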
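
The two weightings compared above can be sketched as follows. The abstract gives only the combined form tf*sqrt(idf² + icf²). The definition of icf used here, log(C / cf) by analogy with idf, is an assumption, as are the function names and the example numbers.

```python
import math


def idf(n_docs, df):
    """Inverse document frequency: log(N / df)."""
    return math.log(n_docs / df)


def icf(n_categories, cf):
    """Inverted category frequency.

    Assumption: defined by analogy with idf as log(C / cf), where cf is
    the number of categories (newsgroups) in which the term occurs.
    """
    return math.log(n_categories / cf)


def weight_tf_idf(tf, n_docs, df):
    """Baseline tf*idf weight."""
    return tf * idf(n_docs, df)


def weight_tf_idf_icf(tf, n_docs, df, n_categories, cf):
    """Combined weight tf * sqrt(idf^2 + icf^2)."""
    return tf * math.sqrt(idf(n_docs, df) ** 2 + icf(n_categories, cf) ** 2)


# Hypothetical example: two terms with the same tf and df, one confined
# to a single newsgroup and one spread over all 10 newsgroups.
focused = weight_tf_idf_icf(tf=3, n_docs=1000, df=10, n_categories=10, cf=1)
spread = weight_tf_idf_icf(tf=3, n_docs=1000, df=10, n_categories=10, cf=10)
baseline = weight_tf_idf(tf=3, n_docs=1000, df=10)
```

Under this assumed definition, a term spread over every category has icf = 0 and falls back to the plain tf*idf weight, while a term concentrated in few categories is boosted.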