• Title/Summary/Keyword: Unlabeled

Search Result 151, Processing Time 0.027 seconds

A Clustering-based Semi-Supervised Learning through Initial Prediction of Unlabeled Data (미분류 데이터의 초기예측을 통한 군집기반의 부분지도 학습방법)

  • Kim, Eung-Ku;Jun, Chi-Hyuck
    • Journal of the Korean Operations Research and Management Science Society
    • /
    • v.33 no.3
    • /
    • pp.93-105
    • /
    • 2008
  • Semi-supervised learning uses a small amount of labeled data to predict labels of unlabeled data as well as to improve clustering performance, whereas unsupervised learning analyzes only unlabeled data for clustering purpose. We propose a new clustering-based semi-supervised learning method by reflecting the initial predicted labels of unlabeled data on the objective function. The initial prediction should be done in terms of a discrete probability distribution through a classification method using labeled data. As a result, clusters are formed and labels of unlabeled data are predicted according to the Information of labeled data in the same cluster. We evaluate and compare the performance of the proposed method in terms of classification errors through numerical experiments with blinded labeled data.

Improving the Classification Accuracy Using Unlabeled Data: A Naive Bayesian Case (나이브 베이지안 환경에서 미분류 데이터를 이용한 성능향상)

  • Lee Chang-Hwan
    • The KIPS Transactions:PartB
    • /
    • v.13B no.4 s.107
    • /
    • pp.457-462
    • /
    • 2006
  • In many applications, an enormous amount of unlabeled data is available with little cost. Therefore, it is natural to ask whether we can take advantage of these unlabeled data in classification learning. In this paper, we analyzed the role of unlabeled data in the context of naive Bayesian learning. Experimental results show that including unlabeled data as part of training data can significantly improve the performance of classification accuracy. The effect of using unlabeled data is especially important in case labeled data are sparse.

Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities (문헌간 유사도를 이용한 자동분류에서 미분류 문헌의 활용에 관한 연구)

  • Kim, Pan-Jun;Lee, Jae-Yun
    • Journal of the Korean Society for information Management
    • /
    • v.24 no.1 s.63
    • /
    • pp.251-271
    • /
    • 2007
  • This paper studies the problem of classifying documents with labeled and unlabeled learning data, especially with regards to using document similarity features. The problem of using unlabeled data is practically important because in many information systems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. There are two steps In general semi-supervised learning algorithm. First, it trains a classifier using the available labeled documents, and classifies the unlabeled documents. Then, it trains a new classifier using all the training documents which were labeled either manually or automatically. We suggested two types of semi-supervised learning algorithm with regards to using document similarity features. The one is one step semi-supervised learning which is using unlabeled documents only to generate document similarity features. And the other is two step semi-supervised learning which is using unlabeled documents as learning examples as well as similarity features. Experimental results, obtained using support vector machines and naive Bayes classifier, show that we can get improved performance with small labeled and large unlabeled documents then the performance of supervised learning which uses labeled-only data. When considering the efficiency of a classifier system, the one step semi-supervised learning algorithm which is suggested in this study could be a good solution for improving classification performance with unlabeled documents.

A Co-training Method based on Classification Using Unlabeled Data (비분류표시 데이타를 이용하는 분류 기반 Co-training 방법)

  • 윤혜성;이상호;박승수;용환승;김주한
    • Journal of KIISE:Software and Applications
    • /
    • v.31 no.8
    • /
    • pp.991-998
    • /
    • 2004
  • In many practical teaming problems including bioinformatics area, there is a small amount of labeled data along with a large pool of unlabeled data. Labeled examples are fairly expensive to obtain because they require human efforts. In contrast, unlabeled examples can be inexpensively gathered without an expert. A common method with unlabeled data for data classification and analysis is co-training. This method uses a small set of labeled examples to learn a classifier in two views. Then each classifier is applied to all unlabeled examples, and co-training detects the examples on which each classifier makes the most confident predictions. After some iterations, new classifiers are learned in training data and the number of labeled examples is increased. In this paper, we propose a new co-training strategy using unlabeled data. And we evaluate our method with two classifiers and two experimental data: WebKB and BIND XML data. Our experimentation shows that the proposed co-training technique effectively improves the classification accuracy when the number of labeled examples are very small.

Case-Related News Filtering via Topic-Enhanced Positive-Unlabeled Learning

  • Wang, Guanwen;Yu, Zhengtao;Xian, Yantuan;Zhang, Yu
    • Journal of Information Processing Systems
    • /
    • v.17 no.6
    • /
    • pp.1057-1070
    • /
    • 2021
  • Case-related news filtering is crucial in legal text mining and divides news into case-related and case-unrelated categories. Because case-related news originates from various fields and has different writing styles, it is difficult to establish complete filtering rules or keywords for data collection. In addition, the labeled corpus for case-related news is sparse; therefore, to train a high-performance classification model, it is necessary to annotate the corpus. To address this challenge, we propose topic-enhanced positive-unlabeled learning, which selects positive and negative samples guided by topics. Specifically, a topic model based on a variational autoencoder (VAE) is trained to extract topics from unlabeled samples. By using these topics in the iterative process of positive-unlabeled (PU) learning, the accuracy of identifying case-related news can be improved. From the experimental results, it can be observed that the F1 value of our method on the test set is 1.8% higher than that of the PU learning baseline model. In addition, our method is more robust with low initial samples and high iterations, and compared with advanced PU learning baselines such as nnPU and I-PU, we obtain a 1.1% higher F1 value, which indicates that our method can effectively identify case-related news.

Incremental Superised Learning based on SVM with Unlabeled Documents (레이블이 없는 문서를 이용한 SVM 기반의 점증적 지도학습)

  • 김수영;조성배
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2002.04b
    • /
    • pp.301-303
    • /
    • 2002
  • 컴퓨터가 널리 보급되고 인터넷이 발전함에 따라 수없이 많은 정보가 디지털 형태로 생산되고 있다. 이러한 정보를 사람이 일일이 가공하고 분류하기에는 한계가 있으므로 자동으로 문서를 분류하고자 하는 연구가 대두되었다. 문서를 자동으로 분류하기 위해 기계학습 방법이 많이 이용되고 있다. 기계학습방법을 이용한 문서분류가 좋은 성능을 내기 위해서는 충분한 양의 학습데이터가 필요하다. 학습데이터를 만들기 위해서는 사람이 일일이 분류해야 하므로, 비용이 많이 든다. 본 논문에서는 적은양의 labeled 데이터로부터 시작하여, 점증적으로 unlabeled 데이터를 학습에 참여시킴으로써, 문서분류의 성능을 높이고자 한다. 실험을 통해 Unlabeled 문서데이터를 사용한 것이 좋은 성능을 보였음을 알 수 있다.

  • PDF

A Constraint-based Semi-supervised Clustering Through Initial Prediction of Unlabeled Data (비분류표시 데이터의 초기예측을 통한 제약기반 부분-지도 군집분석)

  • Kim, Eung-Gu;Jeon, Chi-Hyeok
    • Proceedings of the Korean Operations and Management Science Society Conference
    • /
    • 2007.11a
    • /
    • pp.383-387
    • /
    • 2007
  • Traditional clustering is regarded as an unsupervised teaming to analyze unlabeled data. Semi-supervised clustering uses a small amount of labeled data to predict labels of unlabeled data as well as to improve clustering performance. Previous methods use constraints generated from available labeled data in clustering process. We propose a new constraint-based semi-supervised clustering method by reflecting initial predicted labels of unlabeled data. We evaluate and compare the performance of the proposed method in terms of classification errors through numerical experiments with blinded labeled data.

  • PDF

Labeling Big Spatial Data: A Case Study of New York Taxi Limousine Dataset

  • AlBatati, Fawaz;Alarabi, Louai
    • International Journal of Computer Science & Network Security
    • /
    • v.21 no.6
    • /
    • pp.207-212
    • /
    • 2021
  • Clustering Unlabeled Spatial-datasets to convert them to Labeled Spatial-datasets is a challenging task specially for geographical information systems. In this research study we investigated the NYC Taxi Limousine Commission dataset and discover that all of the spatial-temporal trajectory are unlabeled Spatial-datasets, which is in this case it is not suitable for any data mining tasks, such as classification and regression. Therefore, it is necessary to convert unlabeled Spatial-datasets into labeled Spatial-datasets. In this research study we are going to use the Clustering Technique to do this task for all the Trajectory datasets. A key difficulty for applying machine learning classification algorithms for many applications is that they require a lot of labeled datasets. Labeling a Big-data in many cases is a costly process. In this paper, we show the effectiveness of utilizing a Clustering Technique for labeling spatial data that leads to a high-accuracy classifier.

Variational Auto-Encoder Based Semi-supervised Learning Scheme for Learner Classification in Intelligent Tutoring System (지능형 교육 시스템의 학습자 분류를 위한 Variational Auto-Encoder 기반 준지도학습 기법)

  • Jung, Seungwon;Son, Minjae;Hwang, Eenjun
    • Journal of Korea Multimedia Society
    • /
    • v.22 no.11
    • /
    • pp.1251-1258
    • /
    • 2019
  • Intelligent tutoring system enables users to effectively learn by utilizing various artificial intelligence techniques. For instance, it can recommend a proper curriculum or learning method to individual users based on their learning history. To do this effectively, user's characteristics need to be analyzed and classified based on various aspects such as interest, learning ability, and personality. Even though data labeled by the characteristics are required for more accurate classification, it is not easy to acquire enough amount of labeled data due to the labeling cost. On the other hand, unlabeled data should not need labeling process to make a large number of unlabeled data be collected and utilized. In this paper, we propose a semi-supervised learning method based on feedback variational auto-encoder(FVAE), which uses both labeled data and unlabeled data. FVAE is a variation of variational auto-encoder(VAE), where a multi-layer perceptron is added for giving feedback. Using unlabeled data, we train FVAE and fetch the encoder of FVAE. And then, we extract features from labeled data by using the encoder and train classifiers with the extracted features. In the experiments, we proved that FVAE-based semi-supervised learning was superior to VAE-based method in terms with accuracy and F1 score.

Study on semi-supervised local constant regression estimation

  • Seok, Kyung-Ha
    • Journal of the Korean Data and Information Science Society
    • /
    • v.23 no.3
    • /
    • pp.579-585
    • /
    • 2012
  • Many different semi-supervised learning algorithms have been proposed for use wit unlabeled data. However, most of them focus on classification problems. In this paper we propose a semi-supervised regression algorithm called the semi-supervised local constant estimator (SSLCE), based on the local constant estimator (LCE), and reveal the asymptotic properties of SSLCE. We also show that the SSLCE has a faster convergence rate than that of the LCE when a well chosen weighting factor is employed. Our experiment with synthetic data shows that the SSLCE can improve performance with unlabeled data, and we recommend its use with the proper size of unlabeled data.