Search | Korea Science

Combining Distributed Word Representation and Document Distance for Short Text Document Clustering

Kongwudhikunakorn, Supavit;Waiyamai, Kitsana
- Journal of Information Processing Systems
- /
- v.16 no.2
- /
- pp.277-300
- /
- 2020
This paper presents a method for clustering short text documents, such as news headlines, social media statuses, or instant messages. Due to the characteristics of these documents, which are usually short and sparse, an appropriate technique is required to discover hidden knowledge. The objective of this paper is to identify the combination of document representation, document distance, and document clustering that yields the best clustering quality. Document representations are expanded by external knowledge sources represented by a Distributed Representation. To cluster documents, a K-means partitioning-based clustering technique is applied, where the similarities of documents are measured by word mover's distance. To validate the effectiveness of the proposed method, experiments were conducted to compare the clustering quality against several leading methods. The proposed method produced clusters of documents that resulted in higher precision, recall, F1-score, and adjusted Rand index for both real-world and standard data sets. Furthermore, manual inspection of the clustering results was conducted to observe the efficacy of the proposed method. The topics of each document cluster are undoubtedly reflected by members in the cluster.
https://doi.org/10.3745/JIPS.04.0164 인용 PDF KSCI

Document Clustering Method using Coherence of Cluster and Non-negative Matrix Factorization (비음수 행렬 분해와 군집의 응집도를 이용한 문서군집)

Kim, Chul-Won;Park, Sun
- Journal of the Korea Institute of Information and Communication Engineering
- /
- v.13 no.12
- /
- pp.2603-2608
- /
- 2009
Document clustering is an important method for document analysis and is used in many different information retrieval applications. This paper proposes a new document clustering model using the clustering method based NMF(non-negative matrix factorization) and refinement of documents in cluster by using coherence of cluster. The proposed method can improve the quality of document clustering because the re-assigned documents in cluster by using coherence of cluster based similarity between documents, the semantic feature matrix and the semantic variable matrix, which is used in document clustering, can represent an inherent structure of document set more well. The experimental results demonstrate appling the proposed method to document clustering methods achieves better performance than documents clustering methods.
https://doi.org/10.6109/JKIICE.2009.13.12.2603 인용 PDF KSCI

Sparse Document Data Clustering Using Factor Score and Self Organizing Maps (인자점수와 자기조직화지도를 이용한 희소한 문서데이터의 군집화)

Jun, Sung-Hae
- Journal of the Korean Institute of Intelligent Systems
- /
- v.22 no.2
- /
- pp.205-211
- /
- 2012
The retrieved documents have to be transformed into proper data structure for the clustering algorithms of statistics and machine learning. A popular data structure for document clustering is document-term matrix. This matrix has the occurred frequency value of a term in each document. There is a sparsity problem in this matrix because most frequencies of the matrix are 0 values. This problem affects the clustering performance. The sparseness of document-term matrix decreases the performance of clustering result. So, this research uses the factor score by factor analysis to solve the sparsity problem in document clustering. The document-term matrix is transformed to document-factor score matrix using factor scores in this paper. Also, the document-factor score matrix is used as input data for document clustering. To compare the clustering performances between document-term matrix and document-factor score matrix, this research applies two typed matrices to self organizing map (SOM) clustering.
https://doi.org/10.5391/JKIIS.2012.22.2.205 인용 PDF KSCI

Enhancing Text Document Clustering Using Non-negative Matrix Factorization and WordNet

Kim, Chul-Won;Park, Sun
- Journal of information and communication convergence engineering
- /
- v.11 no.4
- /
- pp.241-246
- /
- 2013
A classic document clustering technique may incorrectly classify documents into different clusters when documents that should belong to the same cluster do not have any shared terms. Recently, to overcome this problem, internal and external knowledge-based approaches have been used for text document clustering. However, the clustering results of these approaches are influenced by the inherent structure and the topical composition of the documents. Further, the organization of knowledge into an ontology is expensive. In this paper, we propose a new enhanced text document clustering method using non-negative matrix factorization (NMF) and WordNet. The semantic terms extracted as cluster labels by NMF can represent the inherent structure of a document cluster well. The proposed method can also improve the quality of document clustering that uses cluster labels and term weights based on term mutual information of WordNet. The experimental results demonstrate that the proposed method achieves better performance than the other text clustering methods.
https://doi.org/10.6109/jicce.2013.11.4.241 인용 PDF KSCI

Development of Similarity-Based Document Clustering System (유사성 계수에 의한 문서 클러스터링 시스템 개발)

Woo Hoon-Shik;Yim Dong-Soon
- Proceedings of the Society of Korea Industrial and System Engineering Conference
- /
- 2002.05a
- /
- pp.119-124
- /
- 2002
Clustering of data is of a great interest in many data mining applications. In the field of document clustering, a document is represented as a data in a high dimensional space. Therefore, the document clustering can be accomplished with a general data clustering techniques. In this paper, we introduce a document clustering system based on similarity among documents. The developed system consists of three functions: 1) gatherings documents utilizing a search agent; 2) determining similarity coefficients between any two documents from term frequencies; 3) clustering documents with similarity coefficients. Especially, the document clustering is accomplished by a hybrid algorithm utilizing genetic and K-Means methods.
PDF

Document Clustering using Non-negative Matrix Factorization and Fuzzy Relationship (비음수 행렬 분해와 퍼지 관계를 이용한 문서군집)

Park, Sun;Kim, Kyung-Jun
- Journal of Advanced Navigation Technology
- /
- v.14 no.2
- /
- pp.239-246
- /
- 2010
This paper proposes a new document clustering method using NMF and fuzzy relationship. The proposed method can improve the quality of document clustering because the clustered documents by using fuzzy relation values between semantic features and terms to distinguish well dissimilar documents in clusters, the selected cluster label terms by using semantic features with NMF, which is used in document clustering, can represent an inherent structure of document set better. The experimental results demonstrate that the proposed method achieves better performance than other document clustering methods.
PDF KSCI

Refinement of Document Clustering by Using NMF

Shinnou, Hiroyuki;Sasaki, Minoru
- Proceedings of the Korean Society for Language and Information Conference
- /
- 2007.11a
- /
- pp.430-439
- /
- 2007
In this paper, we use non-negative matrix factorization (NMF) to refine the document clustering results. NMF is a dimensional reduction method and effective for document clustering, because a term-document matrix is high-dimensional and sparse. The initial matrix of the NMF algorithm is regarded as a clustering result, therefore we can use NMF as a refinement method. First we perform min-max cut (Mcut), which is a powerful spectral clustering method, and then refine the result via NMF. Finally we should obtain an accurate clustering result. However, NMF often fails to improve the given clustering result. To overcome this problem, we use the Mcut object function to stop the iteration of NMF.
PDF

Document Clustering Using Reference Titles (인용문헌 표제를 이용한 문헌 클러스터링에 관한 연구)

Choi, Sang-Hee
- Journal of the Korean Society for information Management
- /
- v.27 no.2
- /
- pp.241-252
- /
- 2010
Titles have been regarded as having effective clustering features, but they sometimes fail to represent the topic of a document and result in poorly generated document clusters. This study aims to improve the performance of document clustering with titles by suggesting titles in the citation bibliography as a clustering feature. Titles of original literature, titles in the citation bibliography, and an aggregation of both titles were adapted to measure the performance of clustering. Each feature was combined with three hierarchical clustering methods, within group average linkage, complete linkage, and Ward's method in the clustering experiment. The best practice case of this experiment was clustering document with features from both titles by within-groups average method.
https://doi.org/10.3743/KOSIM.2010.27.2.241 인용 PDF

Document Clustering with Relational Graph Of Common Phrase and Suffix Tree Document Model (공통 Phrase의 관계 그래프와 Suffix Tree 문서 모델을 이용한 문서 군집화 기법)

Cho, Yoon-Ho;Lee, Sang-Keun
- The Journal of the Korea Contents Association
- /
- v.9 no.2
- /
- pp.142-151
- /
- 2009
Previous document clustering method, NSTC measures similarities between two document pairs using TF-IDF during web document clustering. In this paper, we propose new similarity measure using common phrase-based relational graph, not TF-IDF. This method suggests that weighting common phrases by relational graph presenting relationship among common phrases in document collection. And experimental results indicate that proposed method is more effective in clustering document collection than NSTC.
https://doi.org/10.5392/JKCA.2009.9.2.142 인용 PDF

Fine-Grained Mobile Application Clustering Model Using Retrofitted Document Embedding

Yoon, Yeo-Chan;Lee, Junwoo;Park, So-Young;Lee, Changki
- ETRI Journal
- /
- v.39 no.4
- /
- pp.443-454
- /
- 2017
In this paper, we propose a fine-grained mobile application clustering model using retrofitted document embedding. To automatically determine the clusters and their numbers with no predefined categories, the proposed model initializes the clusters based on title keywords and then merges similar clusters. For improved clustering performance, the proposed model distinguishes between an accurate clustering step with titles and an expansive clustering step with descriptions. During the accurate clustering step, an automatically tagged set is constructed as a result. This set is utilized to learn a high-performance document vector. During the expansive clustering step, more applications are then classified using this document vector. Experimental results showed that the purity of the proposed model increased by 0.19, and the entropy decreased by 1.18, compared with the K-means algorithm. In addition, the mean average precision improved by more than 0.09 in a comparison with a support vector machine classifier.
https://doi.org/10.4218/etrij.17.0116.0936 인용 PDF KSCI

Search Result 223, Processing Time 0.038 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)