• Title/Abstract/Keywords: Term frequency

1,578 search results (processing time: 0.024 seconds)

Normalized Term Frequency Weighting Method in Automatic Text Categorization

  • 김수진;박혁로
    • 대한전자공학회:학술대회논문집 / 대한전자공학회 2003년도 컴퓨터소사이어티 추계학술대회논문집 / pp.255-258 / 2003
  • This paper defines a normalized term frequency weighting method for automatic text categorization based on the Box-Cox transformation and applies it to automatic text categorization. The Box-Cox transformation is a statistical transformation that brings data closer to a normal distribution. This paper applies that transformation to propose a new term frequency weighting method. Because the normalized term frequency is transformed differently for each term, unlike existing term frequency weighting methods, it is more general than fixed weighting methods such as the logarithm or the root. The validity of the normalized term frequency weighting method was demonstrated through experiments on 8,000 newspaper articles divided into 4 groups, which yielded high categorization accuracy in all cases.
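As a rough illustration of why a Box-Cox-based weight generalizes fixed log or root weighting, the sketch below applies the standard Box-Cox formula to raw term counts. The function names, the `+ 1` shift (so that a count of 0 maps to 0), and the hand-picked `lam` values are assumptions made for illustration; the abstract does not say how the paper chooses the parameter for each term.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Standard Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if abs(lam) < 1e-12:
        # The limit of (x**lam - 1) / lam as lam -> 0 is log(x).
        return math.log(x)
    return (x ** lam - 1.0) / lam


def normalized_tf(tf: int, lam: float) -> float:
    """Hypothetical weight: Box-Cox applied to tf + 1 (assumed shift)."""
    return box_cox(tf + 1, lam)


raw_counts = [0, 1, 3, 8]
# lam = 0 reproduces log weighting; lam = 0.5 behaves like a root weighting.
log_like = [normalized_tf(tf, 0.0) for tf in raw_counts]
root_like = [normalized_tf(tf, 0.5) for tf in raw_counts]
print(log_like)
print(root_like)
```

With `lam = 0.5`, counts of 3 and 8 map to 2.0 and 4.0, while `lam = 0` gives log(4) and log(9). Letting `lam` vary per term is what makes the scheme more flexible than a single fixed function.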


A Term Importance-based Approach to Identifying Core Citations in Computational Linguistics Articles

  • Kang, In-Su
    • 한국컴퓨터정보학회논문지 / Vol. 22, No. 9 / pp.17-24 / 2017
  • Core citation recognition identifies the influential ones among the prior articles that a scholarly article cites. Previous approaches have employed citing-text occurrence information, textual similarities between citing and cited articles, etc. This study proposes a term-based approach to core citation recognition, which exploits the importance of individual terms appearing in in-text citations to calculate the influence strength of each cited article. Term importance is computed using various kinds of frequency information, such as term frequency (tf) in the in-text citation, tf in the citing article, inverse sentence frequency in the citing article, and inverse document frequency in a collection of articles. Experiments using a previous test set consisting of computational linguistics articles show that the term-based approach performs comparably with the previous approaches. The proposed technique could be easily extended by employing other term units such as n-grams and phrases, or by using new term-importance formulae.

Automatic Classification of Blog Posts using Various Term Weighting

  • 김수아;조희선;이현아
    • Journal of Advanced Marine Engineering and Technology / Vol. 39, No. 1 / pp.58-62 / 2015
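  • Most blog sites provide a content-based classification environment built on a predefined category scheme, but because selecting a category for each post by hand is cumbersome, most bloggers do not assign one. To classify blog posts automatically, this paper collects documents for each category from a blog site and combines various term weights, such as term frequency, document frequency, and per-category frequency, to find a weighting scheme suited to the characteristics of blog posts. In the experiments, the classification model that used the proposed TF-CTF-IECDF as its term weight achieved a classification accuracy of 77.02%.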

Term Frequency-Inverse Document Frequency (TF-IDF) Technique Using Principal Component Analysis (PCA) with Naive Bayes Classification

  • J.Uma;K.Prabha
    • International Journal of Computer Science & Network Security / Vol. 24, No. 4 / pp.113-118 / 2024
  • Performing sentiment analysis on Twitter is difficult. This is because tweets are extremely short and mostly contain slang, emoticons, and hashtags alongside other words. Feature extraction is a technique for building feature vectors from individual tweets. Each component of a feature vector is an integer that contributes to assigning a sentiment class to a tweet. The purpose of feature extraction is to extract the right features in order to improve the accuracy of the classification models. In this manuscript, we propose a Term Frequency-Inverse Document Frequency (TF-IDF) method combined with Principal Component Analysis (PCA) and Naïve Bayes classifiers. In the classification process, the proposed work can produce different features from the widely valued features of a Twitter dataset.
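The TF-IDF step named in this abstract can be sketched in plain Python as follows. This is a minimal illustration under assumptions: it uses raw term counts and `idf = log(N / df)`, the example tweets are invented, and the PCA and Naïve Bayes stages are not shown. The paper's exact variant is not given in the abstract.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, vectors) using raw tf and idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        # Count each term once per document for the document frequency.
        df.update(set(tokens))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors


# Invented toy tweets, already tokenized by whitespace.
tweets = [
    "great phone love it".split(),
    "bad phone hate it".split(),
    "love love this camera".split(),
]
vocab, vectors = tf_idf_vectors(tweets)
for tokens, vec in zip(tweets, vectors):
    print(tokens, [round(v, 3) for v in vec])
```

Each tweet becomes one row of a document-term matrix; a dimensionality reduction such as PCA would then operate on these rows before classification.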

Estimation of Occurrence Frequency of Short Term Air Pollution Concentration Using Texas Climatological Model

  • 이종범
    • 한국대기환경학회지 / Vol. 4, No. 2 / pp.67-71 / 1988
  • To estimate the probability of short-term air pollution concentrations from the long-term arithmetic average concentration, a procedure was developed and added to the Texas Climatological Model version 2. The procedure applies the statistical property that the frequency distribution of short-term concentrations can be approximated by a lognormal distribution. It can estimate not only the highest concentration for a variety of averaging times but also the concentration for an arbitrary occurrence frequency. An evaluation against short-term concentrations calculated by the Texas Episodic Model version 8, using meteorological and emission data for Seoul, shows that the procedure estimates concentrations fairly well over a wide range of percentiles.
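The lognormal assumption can be made concrete with a short sketch: given a long-term arithmetic mean and a spread parameter, it returns the concentration at a chosen percentile. The function name, the example numbers, and the use of `sigma_log` (the standard deviation of the log concentration) as a direct input are assumptions for illustration; the abstract does not describe how the model derives the spread for each averaging time.

```python
import math
from statistics import NormalDist


def lognormal_percentile(arith_mean: float, sigma_log: float, percentile: float) -> float:
    """Concentration at the given percentile of a lognormal distribution.

    For a lognormal variable, arithmetic mean = exp(mu_log + sigma_log**2 / 2),
    so mu_log can be recovered from the long-term arithmetic mean.
    """
    mu_log = math.log(arith_mean) - 0.5 * sigma_log ** 2
    z = NormalDist().inv_cdf(percentile / 100.0)
    return math.exp(mu_log + sigma_log * z)


# Hypothetical inputs: annual mean of 0.05 (arbitrary units), sigma_log of 0.8.
for p in (50, 90, 99):
    print(p, lognormal_percentile(0.05, 0.8, p))
```

The median falls below the arithmetic mean by the factor exp(-sigma_log²/2), which is why the long-term average alone cannot give short-term extremes without an estimate of the spread.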


Analysis of Drought Characteristics in Gyeongbuk Based on the Duration of Standard Precipitation Index

  • Ahn, Seung Seop;Park, Ki bum;Yim, Dong Hee
    • 한국환경과학회지 / Vol. 28, No. 10 / pp.863-872 / 2019
  • Using the Standard Precipitation Index (SPI), this study analyzed the drought characteristics of ten weather stations in Gyeongbuk, South Korea, based on precipitation data over a period of 30 years. For the months with an SPI of -1.0 or less, the drought occurrence index was calculated, and the maximum shortage months, resilience, and vulnerability of each weather station were analyzed. According to the analysis, in terms of vulnerability, the weather stations with acute short-term drought were Andong, Bonghwa, Moongyeong, and Gumi. The weather stations with acute medium-term drought were Daegu and Uljin. Finally, the weather stations with acute long-term drought were Pohang, Youngdeok, and Youngju. In terms of severe drought frequency, the stations with a relatively high frequency of mid-term droughts were Andong, Bonghwa, Daegu, Uiseong, Uljin, and Youngju. Gumi station had a high frequency of short-term droughts. Pohang station had severe short-term and long-term droughts. Youngdeok had severe droughts across all terms. Based on the analysis results, it is inferred that the magnitude of a drought should be evaluated according to how serious its vulnerability, resilience, and drought index are. Through proper evaluation of drought, it is possible to take systematic measures for the duration of the drought.
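The duration-based part of this analysis can be sketched as counting runs of consecutive months whose SPI is at or below -1.0. The SPI values below are invented, and the helper name is hypothetical; the resilience and vulnerability measures are not reproduced because the abstract does not define them.

```python
def drought_runs(spi, threshold=-1.0):
    """Lengths of consecutive runs of months with SPI <= threshold."""
    runs = []
    current = 0
    for value in spi:
        if value <= threshold:
            current += 1
        elif current:
            # A wetter month ends the current drought run.
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# Invented monthly SPI series for one hypothetical station.
spi = [0.3, -1.2, -1.5, 0.1, -0.4, -1.0, -2.1, -1.3, 0.8]
runs = drought_runs(spi)
drought_months = sum(runs)           # total months at or below the threshold
longest_run = max(runs, default=0)   # longest uninterrupted drought, in months
print(runs, drought_months, longest_run)
```

For this series the runs are `[2, 3]`, giving 5 drought months and a longest run of 3 months.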

Time-Frequency Analysis of Electrohysterogram for Classification of Term and Preterm Birth

  • Ryu, Jiwoo;Park, Cheolsoo
    • IEIE Transactions on Smart Processing and Computing / Vol. 4, No. 2 / pp.103-109 / 2015
  • In this paper, a novel method for the classification of term and preterm birth is proposed based on time-frequency analysis of the electrohysterogram (EHG) using multivariate empirical mode decomposition (MEMD). EHG is promising for preterm birth prediction because it is low-cost and accurate compared to other preterm birth prediction methods, such as tocodynamometry (TOCO). Previous studies on preterm birth prediction applied prefiltering based on Fourier analysis of an EHG, followed by feature extraction and classification, even though Fourier analysis is suboptimal for biomedical signals such as EHG because of their nonlinearity and nonstationarity. Therefore, the proposed method applies prefiltering based on MEMD instead of Fourier-based prefilters before extracting the sample entropy feature and classifying the term and preterm birth groups. For the evaluation, the Physionet term-preterm EHG database was used, and the proposed method and a Fourier prefiltering-based method were compared. The results showed that the area under the curve (AUC) of the receiver operating characteristic (ROC) increased by 0.0351 when MEMD was used instead of the Fourier-based prefilter.
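The sample entropy feature mentioned here can be sketched with its usual definition: the negative log of the ratio between the number of matching template pairs of length m + 1 and of length m. The defaults `m = 2` and an absolute tolerance `r = 0.2` are assumptions (in practice r is often set relative to the signal's standard deviation), and the MEMD prefiltering step is not shown.

```python
import math


def sample_entropy(signal, m=2, r=0.2):
    """Sample entropy with template length m and absolute tolerance r."""
    n = len(signal)

    def count_matches(length):
        # Use n - m templates for both lengths so the two counts are comparable.
        templates = [signal[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                # Chebyshev distance between two templates.
                dist = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if dist <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        # No matches: the ratio is undefined, reported here as infinity.
        return float("inf")
    return -math.log(a / b)


periodic = [0.0, 1.0] * 10
print(sample_entropy(periodic, m=2, r=0.1))
```

A perfectly periodic or constant signal gives a value of zero because every length-m match extends to a length-(m + 1) match; less regular signals give larger values.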

Comparison of term weighting schemes for document classification

  • 정호영;신상민;최용석
    • 응용통계연구 / Vol. 32, No. 2 / pp.265-276 / 2019
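  • A document-term frequency matrix is a common data format in text mining that holds the information on the objects to be analyzed. This study introduces TF-IDF, the conventional term weighting applied to the document-term frequency matrix for document classification. In addition, it introduces and compares the definitions, advantages, and disadvantages of the more recently proposed term weightings TF-IDF-ICSDF and TF-IGM. It also presents a method for extracting keywords to improve the quality of document classification. Based on the extracted keywords, a support vector machine, one of the most widely used machine learning algorithms for document classification, was applied. To compare the performance of the term weightings introduced in this study, performance measures such as precision, recall, and F1-score were used. As a result, the TF-IGM method showed high values on all performance measures and proved to be the best-suited method for classifying text.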
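For orientation, the sketch below follows the commonly cited definition of TF-IGM, in which the inverse gravity moment of a term is its highest class frequency divided by the rank-weighted sum of its class frequencies sorted in descending order. The coefficient `lam = 7.0` and the function names are assumptions, and the paper surveyed in this abstract may define or tune the weight differently.

```python
def igm(class_freqs):
    """Inverse gravity moment of a term from its per-class frequencies."""
    ranked = sorted(class_freqs, reverse=True)
    # Rank 1 goes to the class in which the term is most frequent.
    denom = sum(freq * rank for rank, freq in enumerate(ranked, start=1))
    return ranked[0] / denom if denom else 0.0


def tf_igm(tf, class_freqs, lam=7.0):
    """TF-IGM weight: tf * (1 + lam * igm), with lam an assumed coefficient."""
    return tf * (1.0 + lam * igm(class_freqs))


# A term concentrated in one class versus a term spread evenly over three.
print(igm([10, 0, 0]), tf_igm(2, [10, 0, 0]))
print(igm([5, 5, 5]), tf_igm(2, [5, 5, 5]))
```

A term that occurs in only one class reaches the maximum IGM of 1.0, whereas a term spread evenly over three classes gets 1/6, so class-discriminating terms receive larger weights than under a class-blind scheme such as TF-IDF.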

A Study on the Pivoted Inverse Document Frequency Weighting Method

  • 이재윤
    • 정보관리학회지 / Vol. 20, No. 4 (Serial No. 50) / pp.233-248 / 2003
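  • The inverse document frequency weighting method is based on the assumption that the lower an index term's frequency of occurrence in a document collection, the more important the term is. However, this is inconsistent with other theories that regard medium-frequency terms as important. Based on the assumption that medium-frequency terms are more important than low-frequency terms, this study proposes a pivoted inverse document frequency weighting method that modifies the inverse document frequency formula. To verify the proposed method, retrieval experiments were conducted on three test collections. The results showed that the pivoted inverse document frequency weighting method improves performance over the inverse document frequency weighting method, especially at the top of the ranked results.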

Sensor Fusion of GPS/INS/Baroaltimeter Using Wavelet Analysis

  • 김성필;김응태;성기정
    • 제어로봇시스템학회논문지 / Vol. 14, No. 12 / pp.1232-1237 / 2008
  • This paper introduces an application of wavelet analysis to the sensor fusion of GPS/INS/baroaltimeter. Using wavelet analysis, the baro-inertial altitude is decomposed into low-frequency content and high-frequency content. The high-frequency components, 'details', represent the perturbed altitude change relative to the long-time trend. The GPS altitude is also broken down by a wavelet decomposition. The low-frequency components, 'approximations', of the decomposed signal capture the long-term trend of altitude. It is proposed that the final altitude be determined as the sum of the details of the baro-inertial altitude and the approximations of the GPS altitude. The final altitude then excludes long-term baro-inertial errors and short-term GPS errors. Finally, the test results show that the proposed method successfully produces a continuous and sensitive altitude.
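The fusion rule described above (GPS approximations plus baro-inertial details) can be sketched with a single-level Haar decomposition. The choice of the Haar wavelet, the single decomposition level, the even-length input, and the sample altitudes are all assumptions for illustration; the abstract does not state which wavelet or how many levels the authors used.

```python
def haar_decompose(x):
    """Single-level Haar split of an even-length sequence into pair means and half-differences."""
    approx = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    return approx, detail


def haar_reconstruct(approx, detail):
    """Inverse of haar_decompose: rebuild each sample pair from its mean and half-difference."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out


def fuse_altitude(gps_alt, baro_alt):
    """Combine the GPS low-frequency trend with the baro-inertial high-frequency detail."""
    gps_approx, _ = haar_decompose(gps_alt)
    _, baro_detail = haar_decompose(baro_alt)
    return haar_reconstruct(gps_approx, baro_detail)


# Invented altitudes in metres: the baro-inertial signal carries a long-term bias.
gps = [100.0, 100.0, 102.0, 102.0]
baro = [110.0, 112.0, 111.0, 115.0]
print(fuse_altitude(gps, baro))
```

For these inputs the fused altitude is `[99.0, 101.0, 100.0, 104.0]`: each pair keeps the GPS mean, so the baro-inertial bias is dropped, while the sample-to-sample change follows the baro-inertial signal.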