• Title/Summary/Keyword: ADASYN

Experimental Analysis of Equilibrization in Binary Classification for Non-Image Imbalanced Data Using Wasserstein GAN

  • Wang, Zhi-Yong;Kang, Dae-Ki
    • International Journal of Internet, Broadcasting and Communication / v.11 no.4 / pp.37-42 / 2019
  • In this paper, we explore the details of three classic data augmentation methods and two generative-model-based oversampling methods. The three classic methods are random sampling (RANDOM), Synthetic Minority Over-sampling Technique (SMOTE), and Adaptive Synthetic Sampling (ADASYN). The two generative-model-based methods are Conditional Generative Adversarial Network (CGAN) and Wasserstein Generative Adversarial Network (WGAN). In imbalanced data, the instances are divided into a majority class, which accounts for most of the instances in the training set, and a minority class, which includes only a few. Generative models have an advantage in that they can generate more plausible samples that follow the distribution of the minority class. We also adopt CGAN to compare its data augmentation performance with the other methods. The experimental results show that the WGAN-based oversampling technique is more stable than the other approaches (RANDOM, SMOTE, ADASYN, and CGAN), even with very limited training data. However, when the imbalance ratio is too small, the generative-model-based approaches cannot outperform the conventional data augmentation techniques. These results suggest a direction for future research.
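For orientation on the keyword these results share, the following is a minimal sketch of the ADASYN idea: minority points surrounded by more majority-class neighbours receive more synthetic samples, each created by interpolating toward a nearby minority point. It is not taken from any paper listed here; the function name `adasyn`, the parameters `k`, `beta`, and `seed`, and the uniform fallback when no minority point has a majority neighbour are all assumptions made for illustration.

```python
import math
import random


def adasyn(minority, majority, k=5, beta=1.0, seed=0):
    """Return synthetic minority points (tuples) following the ADASYN idea.

    Illustrative sketch only: names and defaults are assumptions.
    """
    rng = random.Random(seed)
    # Number of synthetic points needed to reach the desired balance level.
    total = int((len(majority) - len(minority)) * beta)
    if total <= 0 or len(minority) < 2:
        return []

    # ratios[i]: share of majority points among the k nearest neighbours
    # of minority[i], searched over both classes (excluding the point itself).
    ratios = []
    for i, x in enumerate(minority):
        others = [(p, 0) for p in majority]
        others += [(p, 1) for j, p in enumerate(minority) if j != i]
        nearest = sorted(others, key=lambda q: math.dist(x, q[0]))[:k]
        ratios.append(sum(1 for _, label in nearest if label == 0) / len(nearest))

    # Normalise the ratios into a distribution over minority points.
    norm = sum(ratios)
    if norm == 0:
        # Assumption: fall back to uniform weights if no point is "hard".
        weights = [1.0 / len(minority)] * len(minority)
    else:
        weights = [r / norm for r in ratios]

    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x, used as interpolation targets.
        mates = sorted((p for j, p in enumerate(minority) if j != i),
                       key=lambda p: math.dist(x, p))[:k]
        # Rounding means the overall count is only approximately `total`.
        for _ in range(round(weights[i] * total)):
            z = rng.choice(mates)
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, z)))
    return synthetic


# Toy data: one minority point sits inside a majority cluster.
minority = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
majority = [(5.1, 5.0), (4.9, 5.0), (5.0, 5.1), (5.0, 4.9),
            (9.0, 9.0), (9.1, 9.0), (9.0, 9.1)]
new_points = adasyn(minority, majority, k=3)
```

Plain SMOTE would instead give every minority point the same number of synthetic samples; the per-point weighting is what makes the sampling adaptive.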

Development of Prediction Models for Fatal Accidents using Proactive Information in Construction Sites (건설현장의 공사사전정보를 활용한 사망재해 예측 모델 개발)

  • Choi, Seung Ju;Kim, Jin Hyun;Jung, Kihyo
    • Journal of the Korean Society of Safety / v.36 no.3 / pp.31-39 / 2021
  • In Korea, more than half of work-related fatalities have occurred on construction sites. To reduce such occupational accidents, safety inspection by government agencies is essential at construction sites that present a high risk of serious accidents. To address this issue, this study developed risk prediction models of serious accidents at construction sites using five machine learning methods: support vector machine, random forest, XGBoost, LightGBM, and AutoML. To this end, 15 items of proactive information (e.g., number of stories and period of construction) that are usually available prior to construction were considered, and two over-sampling techniques (SMOTE and ADASYN) were used to address the problem of class-imbalanced data. The results showed that all machine learning methods achieved F1-scores of 0.876 to 0.941 with the adoption of over-sampling techniques. LightGBM with ADASYN yielded the best prediction performance in both the F1-score (0.941) and the area under the ROC curve (0.941). The prediction models revealed four major features: number of stories, period of construction, excavation depth, and height. The prediction models developed in this study can be useful both for government agencies in prioritizing construction sites for safety inspection and for construction companies in establishing pre-construction preventive measures.

A Data Sampling Technique for Secure Dataset Using Weight VAE Oversampling(W-VAE) (가중치 VAE 오버샘플링(W-VAE)을 이용한 보안데이터셋 샘플링 기법 연구)

  • Kang, Hanbada;Lee, Jaewoo
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.12 / pp.1872-1879 / 2022
  • With the recent development of artificial intelligence technology, research on using artificial intelligence to detect hacking attacks is being actively conducted. However, security data is typically imbalanced, and this is recognized as a major obstacle in composing the training data that is key to developing artificial intelligence models. In this paper, we therefore propose a W-VAE oversampling technique that applies VAE, a deep learning generative model, to data extraction for oversampling, and sets the number of oversampled instances for each class through weight calculation using K-NN. A total of five oversampling techniques, including ROS, SMOTE, and ADASYN, were applied to NSL-KDD, an open network security dataset. Measured by the F1-score, the proposed oversampling method was the most effective sampling method among the existing oversampling methods compared.

Prediction of CDOM absorption coefficient using Oversampling technique and Machine Learning in upstream reach of Baekje weir (백제보 상류하천구간의 Oversampling technique과 Machine Learning을 활용한 CDOM 흡수계수 예측)

  • Kim, Jinuk;Jang, Wonjin;Kim, Jinhwi;Park, Yongeun;Kim, Seongjoon
    • Proceedings of the Korea Water Resources Association Conference / 2022.05a / pp.46-46 / 2022
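  • CDOM (Colored or Chromophoric Dissolved Organic Matter), a complex mixture of organic matter, is closely related to BOD (Biological Oxygen Demand), COD (Chemical Oxygen Demand), and organic pollutants in rivers. CDOM absorbs light in the visible range, and recent studies have sought to monitor it with remote sensing techniques. This study aims to develop a machine-learning-based CDOM estimation algorithm using hyperspectral images from 13 days over three years (2016 to 2018) covering a 23 km reach upstream of Baekje weir. The hyperspectral images were collected with an airborne AsiaFENIX hyperspectral sensor with a spectral resolution of 127 bands at 4 nm intervals over 400 to 970 nm and a spatial resolution of 2 m. CDOM values were obtained from samples filtered through a Millipore polycarbonate filter (Φ47, 0.2 μm) and extracted as absorption coefficient spectra over 200 to 800 nm. CDOM values ranged from 2.0 to 11.0 m⁻¹ over the whole period, and the data are imbalanced: only 21 of the 153 samples fall in the high-concentration range of 5 m⁻¹ or above. Therefore, synthetic data generated by the ADASYN (Adaptive Synthetic Sampling Approach) oversampling method were used to address the minority-class imbalance in the original data and to improve model prediction performance. A CDOM prediction algorithm based on an ANN (Artificial Neural Network) was built with the generated synthetic data as input. Synthetic data from ADASYN can resolve the imbalance in the observed data and thereby improve the CDOM detection performance of machine learning models, and are expected to be useful in supporting the design of organic pollutant management in reservoirs.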

Development of Type 2 Prediction Prediction Based on Big Data (빅데이터 기반 2형 당뇨 예측 알고리즘 개발)

  • Hyun Sim;HyunWook Kim
    • The Journal of the Korea institute of electronic communication sciences / v.18 no.5 / pp.999-1008 / 2023
  • Early prediction of chronic diseases such as diabetes is an important issue, and improving the accuracy of diabetes prediction is especially important. Various machine learning and deep learning-based methodologies are being introduced for diabetes prediction, but these technologies require large amounts of data to outperform other methodologies, and the learning cost is high due to complex data models. In this study, we aim to verify the claim that a DNN using the Pima dataset and k-fold cross-validation reduces the efficiency of diabetes diagnosis models. Machine learning classification methods such as decision trees, SVM, random forests, logistic regression, KNN, and various ensemble techniques were used to determine which algorithm produces the best prediction results. After training and testing all classification models, the proposed system gave the best results with the XGBoost classifier combined with ADASYN, with an accuracy of 81%, an F1 score of 0.81, and an AUC of 0.84. Additionally, a domain adaptation method was implemented to demonstrate the versatility of the proposed system. An explainable AI approach using the LIME and SHAP frameworks was implemented to understand how the model predicts the final outcome.

Bankruptcy Prediction with Explainable Artificial Intelligence for Early-Stage Business Models

  • Tuguldur Enkhtuya;Dae-Ki Kang
    • International Journal of Internet, Broadcasting and Communication / v.15 no.3 / pp.58-65 / 2023
  • Bankruptcy is a significant risk for start-up companies, but with the help of cutting-edge artificial intelligence technology, we can now predict bankruptcy with detailed explanations. In this paper, we implemented the Category Boosting algorithm following data cleaning and editing using OpenRefine. We further explained our model using the Shapash library, incorporating domain knowledge. By leveraging the 5C's credit domain knowledge, financial analysts in banks or investors can utilize the detailed results provided by our model to enhance their decision-making processes, even without extensive knowledge about AI. This empowers investors to identify potential bankruptcy risks in their business models, enabling them to make necessary improvements or reconsider their ventures before proceeding. As a result, our model serves as a "glass-box" model, allowing end-users to understand which specific financial indicators contribute to the prediction of bankruptcy. This transparency enhances trust and provides valuable insights for decision-makers in mitigating bankruptcy risks.

A Classification Model for Customs Clearance Inspection Results of Imported Aquatic Products Using Machine Learning Techniques (머신러닝 기법을 활용한 수입 수산물 통관검사결과 분류 모델)

  • Ji Seong Eom;Lee Kyung Hee;Wan-Sup Cho
    • The Journal of Bigdata / v.8 no.1 / pp.157-165 / 2023
  • Seafood is a major source of protein in many countries and its consumption is increasing. In Korea, seafood consumption is increasing while the self-sufficiency rate is decreasing, so safety management becomes more important as the volume of imported seafood grows. Hundreds of species of aquatic products are imported into Korea from over 110 countries, and relying only on the experience of inspectors for safety management of imported aquatic products has its limits. Based on the data, this study develops a model that predicts the customs inspection results of imported aquatic products: a machine learning classification model that determines the non-conformity of aquatic products when an import declaration is submitted. In customs inspections of imported aquatic products, the non-conformity rate is less than 1%, so the data are highly imbalanced. Therefore, sampling methods that can compensate for this characteristic were compared, and a preprocessing method that allows the classification result to be interpreted was applied. Among various machine learning-based classification models, Random Forest and XGBoost showed good performance. The model that predicts both conformity and non-conformity well is the basic random forest model with ADASYN and one-hot encoding applied, which has an accuracy of 99.88%, precision of 99.87%, recall of 99.89%, and AUC of 99.88%. XGBoost is the most stable model, with all indicators exceeding 90% regardless of oversampling and encoding type.
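Because a non-conformity rate below 1% makes accuracy alone uninformative, abstracts like the one above also report precision and recall. The sketch below computes these metrics from scratch on made-up labels; the helper name `binary_metrics`, the convention of returning 0.0 for an undefined ratio, and the toy data are assumptions for illustration, not values from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: 1 non-conforming item in 100, and a model that
# always predicts "conforming" (label 0).
y_true = [1] + [0] * 99
y_pred = [0] * 100
scores = binary_metrics(y_true, y_pred)
# Accuracy is high here even though the single positive case is missed,
# which is why recall and F1 are reported alongside it.
```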

A Hybrid Oversampling Technique for Imbalanced Structured Data based on SMOTE and Adapted CycleGAN (불균형 정형 데이터를 위한 SMOTE와 변형 CycleGAN 기반 하이브리드 오버샘플링 기법)

  • Jung-Dam Noh;Byounggu Choi
    • Information Systems Review / v.24 no.4 / pp.97-118 / 2022
  • As generative adversarial network (GAN) based oversampling techniques have achieved impressive results on class imbalance in unstructured datasets such as images, many studies have begun to apply them to the imbalance problem in structured datasets. However, these studies have failed to reflect the characteristics of structured data because they convert the data into an unstructured format. To overcome this limitation, this study adapted CycleGAN to reflect the characteristics of structured data and proposed a hybridization of the synthetic minority oversampling technique (SMOTE) and the adapted CycleGAN. In particular, unlike previous studies that used a two-dimensional convolutional neural network, this study used a one-dimensional convolutional neural network. The proposed oversampling method was tested on various datasets, and its performance was compared with existing oversampling methods such as SMOTE and adaptive synthetic sampling (ADASYN). The results indicated that the proposed hybrid oversampling method outperformed the existing methods when the data have more dimensions or a higher degree of imbalance. This study implies that classification performance on oversampled structured data can be improved by the proposed hybrid oversampling method, which considers the characteristics of structured data.

Mitigiating Data Imbalance via Ensembled Data Augmentation: An Explainable Credit Scoring Models (데이터 증강 기법의 앙상블을 통한 레이블 불균형 해 소: 설명 가능한 신용평가 모델을 중심으로)

  • Ji-Young Chung;So-Yeon Lee;Ye-Lin Yong;Min-Jun Kim
    • Proceedings of the Korea Information Processing Society Conference / 2023.11a / pp.483-486 / 2023
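  • In the financial sector, interest has recently grown in the black-box problem caused by the complexity of prediction models and in financial regulation. Accordingly, the financial industry emphasizes reliability and transparency, and research on explainable models is especially active in credit scoring. The data imbalance problem, in which a model fails to learn the minority class sufficiently and may overfit the majority class, is also emphasized in this field. This matters even more when Type 2 errors must be minimized, and it has become a key issue in personal credit scoring, where customers with low loan repayment ability must be identified as fully as possible. In this paper, we use an attention mechanism to improve model explainability and to help interpret the analysis results. Furthermore, we experimented with a total of five data augmentation techniques, including SMOTE, GAN, and ADASYN, and confirmed that ensembling them can greatly improve classification accuracy on minority-class labels.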

Simulated Annealing for Overcoming Data Imbalance in Mold Injection Process (사출성형공정에서 데이터의 불균형 해소를 위한 담금질모사)

  • Dongju Lee
    • Journal of Korean Society of Industrial and Systems Engineering / v.45 no.4 / pp.233-239 / 2022
  • Injection molding is a process in which thermoplastic resin is heated to a fluid state, injected under pressure into the cavity of a mold, and then cooled in the mold to produce a product identical in shape to the cavity. It enables mass production and complex shapes, and factors such as resin temperature, mold temperature, injection speed, and pressure affect product quality. Data collected at the manufacturing site contain many records for good products but few for defective products, resulting in serious data imbalance. To resolve this imbalance efficiently, undersampling, oversampling, and composite sampling are usually applied. In this study, oversampling techniques that increase the minority-class data relative to the majority class, such as random oversampling (ROS), synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN), are applied, along with composite sampling, which uses both undersampling and oversampling. For composite sampling, SMOTE+ENN and SMOTE+Tomek were used. Artificial neural network techniques are used to predict product quality; specifically, MLP and RNN are applied, and both require optimization of various parameters. We propose an SA technique that optimizes the choice of sampling method, the minority-class ratio for the sampling method, and the batch size and number of hidden-layer units for MLP and RNN. The existing sampling methods and the proposed SA method were compared using accuracy, precision, recall, and F1 score to demonstrate the superiority of the proposed method.