SEMAPRO 2010, The Fourth International Conference on Advances in Semantic Processing
Document Clustering Using Semantic Relationship Between Target Documents and Related Documents
Authors:
Minoru Sasaki
Hiroyuki Shinnou
Keywords: document clustering, semi-supervised clustering, semantic feature expansion
Abstract:
Document clustering is one of the major techniques for grouping documents automatically: it divides a given set of documents into a specified number of clusters. The first step is feature extraction from the documents. Conventional methods frequently use a set of words, typically nouns and verbs, as features. Although words are the usual features in a generic clustering framework, some previous research proposes clustering methods that use other features based on the vector space model, such as kernel methods and adaptive sprinkling. However, no previous work on document clustering has reported a method that appends new feature vectors obtained from the relationship between the existing documents and other documents. We therefore propose a new document clustering method that uses this relationship to obtain clusters that are more useful to users. As in semi-supervised clustering, our method expands the features with document similarities, treated as semantic relationships, computed against relevant documents that the user is interested in. To evaluate the effectiveness of this method, we conducted experiments on clustering newsgroup documents with our method and with a dimension reduction method based on singular value decomposition. The experiments show that (i) combining the similarity matrix with the original matrix is effective for document clustering, and (ii) low similarity values adversely affect clustering performance when all similarity values are used. Moreover, the proposed method is more effective for document clustering than clustering through dimensionality reduction.
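The feature expansion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that documents are rows of a term-frequency matrix, that the similarity is cosine similarity, that "combining" means appending the similarity columns to the original matrix, and that low similarities are zeroed out below a cutoff. The function names, the `threshold` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between every row of a and every row of b."""
    # Guard against division by zero for empty documents.
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a_unit @ b_unit.T

def expand_features(targets, related, threshold=0.0):
    """Append target-to-related similarities as extra feature columns.

    Assumption: similarities below `threshold` are set to zero, reflecting
    the abstract's finding that low similarity values hurt clustering.
    """
    sims = cosine_similarity(targets, related)
    sims = np.where(sims >= threshold, sims, 0.0)
    return np.hstack([targets, sims])

# Toy term-frequency vectors (invented for illustration).
targets = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])
# Related documents the user is assumed to be interested in.
related = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

expanded = expand_features(targets, related, threshold=0.3)
print(expanded.shape)  # (4, 6): four original term columns plus two similarity columns
```

The expanded matrix can then be passed to any standard clustering algorithm in place of the original matrix. The abstract does not state which algorithm the authors used.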
Pages: 91 to 95
Copyright: Copyright (c) IARIA, 2010
Publication date: October 25, 2010
Published in: conference
ISSN: 2308-4510
ISBN: 978-1-61208-104-5
Location: Florence, Italy
Dates: from October 25, 2010 to October 30, 2010