• Title/Summary/Keyword: Over-sampling

Heterogeneous Ensemble of Classifiers from Under-Sampled and Over-Sampled Data for Imbalanced Data

  • Kang, Dae-Ki;Han, Min-gyu
    • International journal of advanced smart convergence / v.8 no.1 / pp.75-81 / 2019
  • Data imbalance is a common problem that seriously degrades machine learning. Sampling is one of the effective methods for addressing it. Over-sampling increases the number of instances, so on imbalanced data it is applied to the minority instances. Under-sampling reduces the number of instances and is usually performed on the majority data. We apply under-sampling and over-sampling to imbalanced data to generate sampled data sets. From these sampled data sets and the original data set, we construct a heterogeneous ensemble of classifiers. We apply five different algorithms to the heterogeneous ensemble. Experimental results on an intrusion detection dataset, used as an imbalanced dataset, show that our approach is effective.
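The two sampling operations described in this abstract can be sketched in a few lines. This is a minimal illustration of plain random over-sampling and under-sampling for a binary problem, not the authors' pipeline; the function names, the `minority` argument, and the toy data are assumptions made for the example, and the minority class is assumed to be the smaller one.

```python
import random
from collections import Counter

def _split_indices(labels, minority):
    """Return (minority indices, majority indices) for a binary label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx

def random_over_sample(rows, labels, minority, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = majority_idx + minority_idx + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]

def random_under_sample(rows, labels, minority, seed=0):
    """Drop randomly chosen majority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    kept_majority = rng.sample(majority_idx, len(minority_idx))
    keep = kept_majority + minority_idx
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data (hypothetical): 2 minority rows (label 1) and 8 majority rows (label 0).
rows = list(range(10))
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

_, over_labels = random_over_sample(rows, labels, minority=1)
_, under_labels = random_under_sample(rows, labels, minority=1)
print(Counter(over_labels), Counter(under_labels))
```

Each resampled set, together with the original set, could then be used to train a separate classifier for an ensemble.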

Regression Estimators with Unequal Selection Probabilities on Two Successive Occasions

  • Kim, Kyu-Seong
    • Journal of the Korean Statistical Society / v.25 no.1 / pp.25-37 / 1996
  • In this paper, we propose regression estimators based on a partial replacement sampling scheme over two successive occasions and derive their minimum variances. PPSWR, RHC, $\pi$PS and PPSWOR schemes are considered for selecting unequal probability samples on the two occasions. Simulation results over four populations are given to compare composite estimators and regression estimators.

Atmospheric Bioaerosol, Bacillus sp., at an Altitude of 3,500 m over the Noto Peninsula: Direct Sampling via Aircraft

  • Kobayashi, Fumihisa;Morosawa, Shinji;Maki, Teruya;Kakikawa, Makiko;Yamada, Maromu;Tobo, Yutaka;Hon, Chun-Sang;Matsuki, Atsushi;Iwasaka, Yasunobu
    • Asian Journal of Atmospheric Environment / v.5 no.3 / pp.164-171 / 2011
  • This work focuses on the analysis of bioaerosols in the atmosphere at higher altitudes over the Noto Peninsula, Japan. We carried out direct sampling via aircraft, separated cultures, and identified the isolates present. Atmospheric bioaerosols at higher altitudes were collected using a Cessna 404 aircraft for an hour at an altitude of 3,500 m over the Noto Peninsula. The aircraft-based direct sampling system was devised to improve upon the system of balloon-based sampling. In order to examine pre-existing microorganism contamination on the surface of the aircraft body, bioaerosol sampling was carried out just before takeoff using the same method as atmospheric sampling. Identification was carried out by a homology search for 16S or 18S rDNA isolate sequences in DNA databases (GenBank). Isolate sampling just before takeoff revealed Streptomyces sp., Micrococcus sp., and Cladosporium sp. One additional strain, Bacillus sp., was isolated from the sample after bioaerosol collection at high altitude. As the microorganism contamination on the aircraft body before takeoff differed from that while in the air, the presence of additional, higher atmosphere-based microorganisms was confirmed. It was found that Bacillus sp. was floating at an altitude of 3,500 m over the Noto Peninsula.

On the sampling unit (표본점단위(標本點單位)에 대(對)하여)

  • Kim, Kap Duk
    • Journal of Korean Society of Forest Science / v.4 no.1 / pp.26-29 / 1965
  • 1. The purpose of this study was to find the best sampling form and sampling unit for forest surveys. 2. The value from small sampling units was overestimated in comparison with that from large sampling units. 3. The value from the circular form was overestimated in comparison with that from the other forms. 4. The smallest units for estimation in area sampling were as follows: a) 0.06 ha in the rectangular plot; b) 0.08 ha in the square plot; c) 0.10 ha in the circular plot. 5. In conclusion, the best sampling unit was 0.06 hectare in the rectangular plot, which was the most economical and gave preferable results in the forest survey.

Comparison of resampling methods for dealing with imbalanced data in binary classification problem (이분형 자료의 분류문제에서 불균형을 다루기 위한 표본재추출 방법 비교)

  • Park, Geun U;Jung, Inkyung
    • The Korean Journal of Applied Statistics / v.32 no.3 / pp.349-374 / 2019
  • A class imbalance problem arises when one class outnumbers the other class by a large proportion in binary data. Approaches such as transforming the training data have been studied to solve this imbalance problem. In this study, we compared resampling methods, one family of methods for dealing with imbalance in classification problems. We sought a way to detect the minority class in the data more effectively. Through simulation, a total of 20 methods covering over-sampling, under-sampling, and combinations of over- and under-sampling were compared. Logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, were used as classifiers. The simulation results showed that the random under-sampling (RUS) method had the highest sensitivity with an accuracy over 0.5. The next most sensitive method was the adaptive synthetic sampling approach, an over-sampling method. This revealed that the RUS method was suitable for finding minority class values. The results of applying the methods to several real data sets were similar to those of the simulation.
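The abstract ranks methods by sensitivity subject to an accuracy threshold. As a reminder of why both numbers are reported, the sketch below computes sensitivity (the fraction of true minority cases that are detected) and accuracy from label lists. The function name and the toy predictions are hypothetical and are not taken from the paper.

```python
def sensitivity_and_accuracy(y_true, y_pred, positive=1):
    """Return (sensitivity, accuracy), treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    actual_pos = true_pos + false_neg
    sensitivity = true_pos / actual_pos if actual_pos else 0.0
    accuracy = correct / len(pairs)
    return sensitivity, accuracy

# Toy labels (hypothetical): 2 minority cases out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10                      # never predicts the minority class
some_detection = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]  # finds one minority case, one false alarm

print(sensitivity_and_accuracy(y_true, always_majority))  # (0.0, 0.8)
print(sensitivity_and_accuracy(y_true, some_detection))   # (0.5, 0.8)
```

Both toy classifiers reach the same accuracy, but only the second detects any minority case, which is the distinction sensitivity captures.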

Classification of Class-Imbalanced Data: Effect of Over-sampling and Under-sampling of Training Data (계급불균형자료의 분류: 훈련표본 구성방법에 따른 효과)

  • 김지현;정종빈
    • The Korean Journal of Applied Statistics / v.17 no.3 / pp.445-457 / 2004
  • Given class-imbalanced data in a two-class classification problem, we often over-sample and/or under-sample the training data to make it balanced. We investigate the validity of this practice. We also study the effect of such sampling on boosting of classification trees. Experiments on twelve real datasets show that keeping the natural distribution of the training data is the best approach when boosting methods are applied to class-imbalanced data.

A study on unequal probability sampling over two successive occasions in time series (시계열 계속 표본조사에서 불균등확률 추출법 연구)

  • 박홍래;이계오
    • The Korean Journal of Applied Statistics / v.6 no.1 / pp.145-162 / 1993
  • We review sampling schemes on successive occasions with partial replacement of units and propose a Rao-Hartley-Cochran (RHC) type sampling scheme over two successive occasions with probability proportionate to the observations on the previous occasion. To compare the reviewed and proposed sampling schemes, the optimal estimator of the population mean on the second occasion and its variance are derived. The relative efficiency of the proposed sampling scheme is compared with that of other equal and unequal probability sampling schemes through theoretical and numerical simulation studies. For the simulation study, three artificial populations are generated by a time series model. It is observed that the RHC type sampling scheme generally has small variance and deviation.
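For readers unfamiliar with the RHC scheme named here, the sketch below shows the basic single-occasion version only: the population is randomly split into n groups, one unit is drawn from each group with probability proportional to its size measure, and the total is estimated by weighting each drawn value by (group size share) / (unit size share). It does not implement the two-occasion, partial-replacement scheme the paper proposes. The function names and toy numbers are assumptions; sizes are assumed positive and n at most the population size.

```python
import random

def rhc_sample(sizes, n, seed=0):
    """Draw one unit per random group with probability proportional to size.

    Returns a list of (unit index, weight) pairs, where
    weight = (sum of sizes in the unit's group) / (size of the unit).
    """
    rng = random.Random(seed)
    units = list(range(len(sizes)))
    rng.shuffle(units)                            # random partition of the population
    groups = [units[g::n] for g in range(n)]      # n groups of near-equal size
    draws = []
    for group in groups:
        weights = [sizes[i] for i in group]
        chosen = rng.choices(group, weights=weights, k=1)[0]
        draws.append((chosen, sum(weights) / sizes[chosen]))
    return draws

def rhc_total(values, draws):
    """RHC estimate of the population total from the drawn units."""
    return sum(values[i] * w for i, w in draws)

# Toy population (hypothetical): size measures could be previous-occasion observations.
sizes = [2.0, 5.0, 3.0, 8.0, 4.0, 6.0]
values = [3.0 * s for s in sizes]                 # exactly proportional to size
draws = rhc_sample(sizes, n=3)
print(rhc_total(values, draws), sum(values))
```

When the study variable is exactly proportional to the size measure, as in this toy population, the estimate reproduces the true total for any draw, which is the motivation for sampling with probability proportional to a well-correlated earlier observation.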

Evaluation of the Utility of Self Produced MRI Radiofrequency Shielding Material (자체 제작한 자기공명영상 고주파 차폐체의 유용성 평가)

  • Lee, Jin-Hoe;Lee, Bo-Woo
    • Journal of the Korea Convergence Society / v.11 no.11 / pp.89-94 / 2020
  • This paper proposes a shielding method that improves on the over-sampling technique. The new method uses aluminum foil for RF shielding. In the phantom test, applying the over-sampling technique reduced the aliasing artifact by about 94% compared with before its application, and applying the aluminum shielding band reduced it by about 92%. However, the scan time increased by more than 3 times with the over-sampling technique, whereas it did not change with the aluminum shielding band. Therefore, it was confirmed that the shielding band using aluminum foil can effectively remove aliasing artifacts without increasing the scan time.

A Deep Learning Based Over-Sampling Scheme for Imbalanced Data Classification (불균형 데이터 분류를 위한 딥러닝 기반 오버샘플링 기법)

  • Son, Min Jae;Jung, Seung Won;Hwang, Een Jun
    • KIPS Transactions on Software and Data Engineering / v.8 no.7 / pp.311-316 / 2019
  • A classification problem is to predict the class to which input data belongs. One of the most popular ways to do this is to train a machine learning algorithm on a given dataset. In this case, the dataset should have a well-balanced class distribution for the best performance. When the dataset has an imbalanced class distribution, however, classification performance can be very poor. To overcome this problem, we propose an over-sampling scheme that balances the amount of data per class by using Conditional Generative Adversarial Networks (CGAN). CGAN is a generative model developed from Generative Adversarial Networks (GAN), which can learn data characteristics and generate data similar to real data. CGAN can therefore generate data for a class that has few samples, so that the problem induced by the imbalanced class distribution is mitigated and classification performance is improved. Experiments on real collected data show that the over-sampling technique using CGAN is effective and that it is superior to existing over-sampling techniques.

A Comparison of Ensemble Methods Combining Resampling Techniques for Class Imbalanced Data (데이터 전처리와 앙상블 기법을 통한 불균형 데이터의 분류모형 비교 연구)

  • Lee, Hee-Jae;Lee, Sungim
    • The Korean Journal of Applied Statistics / v.27 no.3 / pp.357-371 / 2014
  • There are many studies on imbalanced data, in which the class distribution is highly skewed. To address the problem, previous studies use resampling techniques that correct the skewness of the class distribution in each sampled subset through under-sampling, over-sampling, or hybrid sampling such as SMOTE. Ensemble methods have also alleviated the class imbalance problem. In this paper, we compare around a dozen algorithms that combine ensemble methods and resampling techniques, based on simulated data sets generated by the Backbone model, which can handle the imbalance rate. Results on various real imbalanced data sets are also presented to compare the effectiveness of the algorithms. As a result, we strongly recommend combining resampling techniques with ensemble methods for imbalanced data in which the proportion of the minority class is less than 10%. We also find that each ensemble method has a well-matched sampling technique. The algorithms that combine bagging or random forest ensembles with random under-sampling tend to perform well, whereas the boosting ensemble appears to perform better with over-sampling. All ensemble methods combined with SMOTE outperform the others in most situations.
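SMOTE, mentioned in this abstract, creates synthetic minority points by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that interpolation step only, not the reference SMOTE implementation or the configuration used in the paper; the function name, the parameter `k`, and the toy points are assumptions, and at least two minority points are required.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Indices of the k nearest other minority points (Euclidean distance).
        others = [i for i in range(len(minority)) if i != base_idx]
        others.sort(key=lambda i: math.dist(base, minority[i]))
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random position on the segment from base towards the neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, neighbour)))
    return synthetic

# Toy minority class (hypothetical): the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority points, the new points stay inside the region the minority class already occupies rather than being exact duplicates.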