• Title/Summary/Keyword: Distance-based Outlier Detection

Search Result 10, Processing Time 0.029 seconds

A Distance-based Outlier Detection Method using Landmarks in High Dimensional Data (고차원 데이터에서 랜드마크를 이용한 거리 기반 이상치 탐지 방법)

  • Park, Cheong Hee
    • Journal of Korea Multimedia Society
    • /
    • v.24 no.9
    • /
    • pp.1242-1250
    • /
    • 2021
  • Detection of outliers deviating normal data distribution in high dimensional data is an important technique in many application areas. In this paper, a distance-based outlier detection method using landmarks in high dimensional data is proposed. Given normal training data, the k-means clustering method is applied for the training data in order to extract the centers of the clusters as landmarks which represent normal data distribution. For a test data sample, the distance to the nearest landmark gives the outlier score. In the experiments using high dimensional data such as images and documents, it was shown that the proposed method based on the landmarks of one-tenth of training data can give the comparable outlier detection performance while reducing the time complexity greatly in the testing stage.

Variable Selection and Outlier Detection for Automated K-means Clustering

  • Kim, Sung-Soo
    • Communications for Statistical Applications and Methods
    • /
    • v.22 no.1
    • /
    • pp.55-67
    • /
    • 2015
  • An important problem in cluster analysis is the selection of variables that define cluster structure that also eliminate noisy variables that mask cluster structure; in addition, outlier detection is a fundamental task for cluster analysis. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The Automated K-means clustering procedure consists of three processes: (i) automatically calculating the cluster number and initial cluster center whenever a new variable is added, (ii) identifying outliers for each cluster depending on used variables, (iii) selecting variables defining cluster structure in a forward manner. To select variables, we applied VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach combining a clustering based approach and distance based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective to select variables and identify outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.

An Outlier Cluster Detection Technique for Real-time Network Intrusion Detection Systems (실시간 네트워크 침입탐지 시스템을 위한 아웃라이어 클러스터 검출 기법)

  • Chang, Jae-Young;Park, Jong-Myoung;Kim, Han-Joon
    • Journal of Internet Computing and Services
    • /
    • v.8 no.6
    • /
    • pp.43-53
    • /
    • 2007
  • Intrusion detection system(IDS) has recently evolved while combining signature-based detection approach with anomaly detection approach. Although signature-based IDS tools have been commonly used by utilizing machine learning algorithms, they only detect network intrusions with already known patterns, Ideal IDS tools should always keep the signature database of your detection system up-to-date. The system needs to generate the signatures to detect new possible attacks while monitoring and analyzing incoming network data. In this paper, we propose a new outlier cluster detection algorithm with density (or influence) function, Our method assumes that an outlier is a kind of cluster with similar instances instead of a single object in the context of network intrusion, Through extensive experiments using KDD 1999 Cup Intrusion Detection dataset. we show that the proposed method outperform the conventional outlier detection method using Euclidean distance function, specially when attacks occurs frequently.

  • PDF

An Outlier Detection Algorithm and Data Integration Technique for Prediction of Hypertension (고혈압 예측을 위한 이상치 탐지 알고리즘 및 데이터 통합 기법)

  • Khongorzul Dashdondov;Mi-Hye Kim;Mi-Hwa Song
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.05a
    • /
    • pp.417-419
    • /
    • 2023
  • Hypertension is one of the leading causes of mortality worldwide. In recent years, the incidence of hypertension has increased dramatically, not only among the elderly but also among young people. In this regard, the use of machine-learning methods to diagnose the causes of hypertension has increased in recent years. In this study, we improved the prediction of hypertension detection using Mahalanobis distance-based multivariate outlier removal using the KNHANES database from the Korean national health data and the COVID-19 dataset from Kaggle. This study was divided into two modules. Initially, the data preprocessing step used merged datasets and decision-tree classifier-based feature selection. The next module applies a predictive analysis step to remove multivariate outliers using the Mahalanobis distance from the experimental dataset and makes a prediction of hypertension. In this study, we compared the accuracy of each classification model. The best results showed that the proposed MAH_RF algorithm had an accuracy of 82.66%. The proposed method can be used not only for hypertension but also for the detection of various diseases such as stroke and cardiovascular disease.

An Outlier Detection Using Autoencoder for Ocean Observation Data (해양 이상 자료 탐지를 위한 오토인코더 활용 기법 최적화 연구)

  • Kim, Hyeon-Jae;Kim, Dong-Hoon;Lim, Chaewook;Shin, Yongtak;Lee, Sang-Chul;Choi, Youngjin;Woo, Seung-Buhm
    • Journal of Korean Society of Coastal and Ocean Engineers
    • /
    • v.33 no.6
    • /
    • pp.265-274
    • /
    • 2021
  • Outlier detection research in ocean data has traditionally been performed using statistical and distance-based machine learning algorithms. Recently, AI-based methods have received a lot of attention and so-called supervised learning methods that require classification information for data are mainly used. This supervised learning method requires a lot of time and costs because classification information (label) must be manually designated for all data required for learning. In this study, an autoencoder based on unsupervised learning was applied as an outlier detection to overcome this problem. For the experiment, two experiments were designed: one is univariate learning, in which only SST data was used among the observation data of Deokjeok Island and the other is multivariate learning, in which SST, air temperature, wind direction, wind speed, air pressure, and humidity were used. Period of data is 25 years from 1996 to 2020, and a pre-processing considering the characteristics of ocean data was applied to the data. An outlier detection of actual SST data was tried with a learned univariate and multivariate autoencoder. We tried to detect outliers in real SST data using trained univariate and multivariate autoencoders. To compare model performance, various outlier detection methods were applied to synthetic data with artificially inserted errors. As a result of quantitatively evaluating the performance of these methods, the multivariate/univariate accuracy was about 96%/91%, respectively, indicating that the multivariate autoencoder had better outlier detection performance. Outlier detection using an unsupervised learning-based autoencoder is expected to be used in various ways in that it can reduce subjective classification errors and cost and time required for data labeling.

A RSS-Based Localization Method Utilizing Robust Statistics for Wireless Sensor Networks under Non-Gaussian Noise (비 가우시안 잡음이 존재하는 무선 센서 네트워크에서 Robust Statistics를 활용하는 수신신호세기기반의 위치 추정 기법)

  • Ahn, Tae-Joon;Koo, In-Soo
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.11 no.3
    • /
    • pp.23-30
    • /
    • 2011
  • In the wireless sensor network(WSN), the detection of precise location of sensor nodes is essential for efficiently utilizing the sensing data acquired from sensor nodes. Among various location methods, the received signal strength (RSS) based localization scheme is mostly preferable in many applications since it can be easily implemented without any additional hardware cost. Since the RSS localization method is mainly effected by radio channel between two nodes, outlier data can be included in the received signal strength measurement specially when some obstacles move around the link between nodes. The outlier data can have bad effect on estimating the distance between two nodes such that it can cause location errors. In this paper, we propose a RSS-based localization method using Robust Statistic and Gaussian filter algorithm for enhancing the accuracy of RSS-based localization. In the proposed algorithm, the outlier data can be eliminated from samples by using the Robust Statistics as well as the Gaussian filter such that the accuracy of localization can be achieved. Through simulation, it is shown that the proposed algorithm can increase the accuracy of localization and is more robust to non gaussian noise channels.

Robust Multithreaded Object Tracker through Occlusions for Spatial Augmented Reality

  • Lee, Ahyun;Jang, Insung
    • ETRI Journal
    • /
    • v.40 no.2
    • /
    • pp.246-256
    • /
    • 2018
  • A spatial augmented reality (SAR) system enables a virtual image to be projected onto the surface of a real-world object and the user to intuitively control the image using a tangible interface. However, occlusions frequently occur, such as a sudden change in the lighting environment or the generation of obstacles. We propose a robust object tracker based on a multithreaded system, which can track an object robustly through occlusions. Our multithreaded tracker is divided into two threads: the detection thread detects distinctive features in a frame-to-frame manner, and the tracking thread tracks features periodically using an optical-flow-based tracking method. Consequently, although the speed of the detection thread is considerably slow, we achieve real-time performance owing to the multithreaded configuration. Moreover, the proposed outlier filtering automatically updates a random sample consensus distance threshold for eliminating outliers according to environmental changes. Experimental results show that our approach tracks an object robustly in real-time in an SAR environment where there are frequent occlusions occurring from augmented projection images.

GNSS/Multiple IMUs Based Navigation Strategy Using the Mahalanobis Distance in Partially GNSS-denied Environments (GNSS 부분 음영 지역에서 마할라노비스 거리를 이용한 GNSS/다중 IMU 센서 기반 측위 알고리즘)

  • Kim, Jiyeon;Song, Moogeun;Kim, Jaehoon;Lee, Dongik
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.17 no.4
    • /
    • pp.239-247
    • /
    • 2022
  • The existing studies on the localization in the GNSS (Global Navigation Satellite System) denied environment usually exploit low-cost MEMS IMU (Micro Electro Mechanical Systems Inertial Measurement Unit) sensors to replace the GNSS signals. However, the navigation system still requires GNSS signals for the normal environment. This paper presents an integrated GNSS/INS (Inertial Navigation System) navigation system which combines GNSS and multiple IMU sensors using extended Kalman filter in partially GNSS-denied environments. The position and velocity of the INS and GNSS are used as the inputs to the integrated navigation system. The Mahalanobis distance is used for novelty detection to detect the outlier of GNSS measurements. When the abnormality is detected in GNSS signals, GNSS data is excluded from the fusion process. The performance of the proposed method is evaluated using MATLAB/Simulink. The simulation results show that the proposed algorithm can achieve a higher degree of positioning accuracy in the partially GNSS-denied environment.

Improvement of Network Intrusion Detection Rate by Using LBG Algorithm Based Data Mining (LBG 알고리즘 기반 데이터마이닝을 이용한 네트워크 침입 탐지율 향상)

  • Park, Seong-Chul;Kim, Jun-Tae
    • Journal of Intelligence and Information Systems
    • /
    • v.15 no.4
    • /
    • pp.23-36
    • /
    • 2009
  • Network intrusion detection have been continuously improved by using data mining techniques. There are two kinds of methods in intrusion detection using data mining-supervised learning with class label and unsupervised learning without class label. In this paper we have studied the way of improving network intrusion detection accuracy by using LBG clustering algorithm which is one of unsupervised learning methods. The K-means method, that starts with random initial centroids and performs clustering based on the Euclidean distance, is vulnerable to noisy data and outliers. The nonuniform binary split algorithm uses binary decomposition without assigning initial values, and it is relatively fast. In this paper we applied the EM(Expectation Maximization) based LBG algorithm that incorporates the strength of two algorithms to intrusion detection. The experimental results using the KDD cup dataset showed that the accuracy of detection can be improved by using the LBG algorithm.

  • PDF

A Study of Anomaly Detection for ICT Infrastructure using Conditional Multimodal Autoencoder (ICT 인프라 이상탐지를 위한 조건부 멀티모달 오토인코더에 관한 연구)

  • Shin, Byungjin;Lee, Jonghoon;Han, Sangjin;Park, Choong-Shik
    • Journal of Intelligence and Information Systems
    • /
    • v.27 no.3
    • /
    • pp.57-73
    • /
    • 2021
  • Maintenance and prevention of failure through anomaly detection of ICT infrastructure is becoming important. System monitoring data is multidimensional time series data. When we deal with multidimensional time series data, we have difficulty in considering both characteristics of multidimensional data and characteristics of time series data. When dealing with multidimensional data, correlation between variables should be considered. Existing methods such as probability and linear base, distance base, etc. are degraded due to limitations called the curse of dimensions. In addition, time series data is preprocessed by applying sliding window technique and time series decomposition for self-correlation analysis. These techniques are the cause of increasing the dimension of data, so it is necessary to supplement them. The anomaly detection field is an old research field, and statistical methods and regression analysis were used in the early days. Currently, there are active studies to apply machine learning and artificial neural network technology to this field. Statistically based methods are difficult to apply when data is non-homogeneous, and do not detect local outliers well. The regression analysis method compares the predictive value and the actual value after learning the regression formula based on the parametric statistics and it detects abnormality. Anomaly detection using regression analysis has the disadvantage that the performance is lowered when the model is not solid and the noise or outliers of the data are included. There is a restriction that learning data with noise or outliers should be used. The autoencoder using artificial neural networks is learned to output as similar as possible to input data. It has many advantages compared to existing probability and linear model, cluster analysis, and map learning. It can be applied to data that does not satisfy probability distribution or linear assumption. In addition, it is possible to learn non-mapping without label data for teaching. However, there is a limitation of local outlier identification of multidimensional data in anomaly detection, and there is a problem that the dimension of data is greatly increased due to the characteristics of time series data. In this study, we propose a CMAE (Conditional Multimodal Autoencoder) that enhances the performance of anomaly detection by considering local outliers and time series characteristics. First, we applied Multimodal Autoencoder (MAE) to improve the limitations of local outlier identification of multidimensional data. Multimodals are commonly used to learn different types of inputs, such as voice and image. The different modal shares the bottleneck effect of Autoencoder and it learns correlation. In addition, CAE (Conditional Autoencoder) was used to learn the characteristics of time series data effectively without increasing the dimension of data. In general, conditional input mainly uses category variables, but in this study, time was used as a condition to learn periodicity. The CMAE model proposed in this paper was verified by comparing with the Unimodal Autoencoder (UAE) and Multi-modal Autoencoder (MAE). The restoration performance of Autoencoder for 41 variables was confirmed in the proposed model and the comparison model. The restoration performance is different by variables, and the restoration is normally well operated because the loss value is small for Memory, Disk, and Network modals in all three Autoencoder models. The process modal did not show a significant difference in all three models, and the CPU modal showed excellent performance in CMAE. ROC curve was prepared for the evaluation of anomaly detection performance in the proposed model and the comparison model, and AUC, accuracy, precision, recall, and F1-score were compared. In all indicators, the performance was shown in the order of CMAE, MAE, and AE. Especially, the reproduction rate was 0.9828 for CMAE, which can be confirmed to detect almost most of the abnormalities. The accuracy of the model was also improved and 87.12%, and the F1-score was 0.8883, which is considered to be suitable for anomaly detection. In practical aspect, the proposed model has an additional advantage in addition to performance improvement. The use of techniques such as time series decomposition and sliding windows has the disadvantage of managing unnecessary procedures; and their dimensional increase can cause a decrease in the computational speed in inference.The proposed model has characteristics that are easy to apply to practical tasks such as inference speed and model management.