Thesis: Hidden Topic Discovery toward Classification and Clustering in Vietnamese Web Documents

Table of Contents
Introduction
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis
1.1. Introduction
1.2. The Early Methods
1.2.1. Latent Semantic Analysis
1.2.2. Probabilistic Latent Semantic Analysis
1.3. Latent Dirichlet Allocation
1.3.1. Generative Model in LDA
1.3.2. Likelihood
1.3.3. Parameter Estimation and Inference via Gibbs Sampling
1.3.4. Applications
1.4. Summary
Chapter 2. Frameworks of Learning with Hidden Topics
2.1. Learning with External Resources: Related Works
2.2. General Learning Frameworks
2.2.1. Frameworks for Learning with Hidden Topics
2.2.2. Large-Scale Web Collections as Universal Dataset
2.3. Advantages of the Frameworks
2.4. Summary
Chapter 3. Topic Analysis of Large-Scale Web Dataset
3.1. Some Characteristics of Vietnamese
3.1.1. Sound
3.1.2. Syllable Structure
3.1.3. Vietnamese Word
3.2. Preprocessing and Transformation
3.2.1. Sentence Segmentation
3.2.2. Sentence Tokenization
3.2.3. Word Segmentation
3.2.4. Filters
3.2.5. Removing Non-Topic-Oriented Words
3.3. Topic Analysis for VnExpress Dataset
3.4. Topic Analysis for Vietnamese Wikipedia Dataset
3.5. Discussion
3.6. Summary
Chapter 4. Deployments of General Frameworks
4.1. Classification with Hidden Topics
4.1.1. Classification Method
4.1.2. Experiments
4.2. Clustering with Hidden Topics
4.2.1. Clustering Method
4.2.2. Experiments
4.3. Summary
Conclusion
Achievements throughout the Thesis
Future Work
References
Vietnamese References
English References
Appendix: Some Clustering Results




Summary of content:

Depending on the type of external resources they use, these methods can be roughly classified into two categories: those that make use of unlabeled data, and those that exploit structured or semi-structured data.
The first category is commonly referred to as semi-supervised learning. The key argument is that unlabeled examples are significantly easier to collect than labeled ones. One example of this is web-page classification. Suppose that we want a program to electronically visit some web site and download all the web pages of interest to us, such as all the Computer Science faculty pages or all the course home pages at some university. To train such a system to automatically classify web pages, one would typically rely on hand-labeled web pages. Unfortunately, these labeled examples are fairly expensive to obtain because they require human effort. In contrast, the web has hundreds of millions of unlabeled web pages that can be inexpensively gathered using a web crawler. Therefore, we would like the learning algorithms to take as much advantage of the unlabeled data as possible.
Semi-supervised learning has received a lot of attention in the last decade. Yarowsky (1995) uses self-training for word sense disambiguation, e.g. deciding whether the word "plant" means a living organism or a factory in a given context. Rosenberg et al. (2005) apply it to object detection systems from images, and show that the semi-supervised technique compares favorably with a state-of-the-art detector. In 2000, Nigam and Ghani [30] performed extensive empirical experiments to compare co-training with generative mixture models and Expectation Maximization (EM). Jones (2005) used co-training, co-EM and other related methods for information extraction from text. In addition, many works have applied Transductive Support Vector Machines (TSVMs), which use unlabeled data to determine the optimal decision boundary.
The second category covers works that exploit resources like Wikipedia to support the learning process. Gabrilovich et al. (2007) [16] demonstrated the value of using Wikipedia as an additional source of features for text classification and for determining the semantic relatedness between texts. Banerjee et al. (2007) [3] also extract titles of Wikipedia articles and use them as features for clustering short texts. Unfortunately, this approach is not very flexible in the sense that it depends heavily on the external resource or the application.
This chapter describes frameworks for learning with the support of a topic model estimated from a large universal dataset. This topic model can be considered background knowledge for the domain of application. It helps the learning process capture the hidden topics of the domain, the relationships between topics and words as well as between words and words, and thus partially overcomes the limitation posed by different word choices in text.
2.2. General Learning Frameworks
This section presents general frameworks for learning with the support of hidden topics. The main motivation is to exploit huge sources of online data in order to enhance the quality of text/Web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the point of view of text/Web data analysis, building on recently successful latent topic analysis models such as LSA, pLSA, and LDA. The underlying idea of the frameworks is that, for each learning task, we collect a very large external data collection called the "universal dataset", and then build a learner on both the learning data and a rich set of hidden topics discovered from that data collection.
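As a minimal, hedged illustration of this idea (not the thesis's actual toolchain): the sketch below estimates hidden topics from a toy stand-in for the universal dataset. It uses scikit-learn's variational LDA, whereas the thesis estimates LDA via Gibbs sampling (Chapter 1); both produce topic-word distributions. The tiny corpus and all names are illustrative assumptions.

```python
# Minimal sketch of estimating a topic model on a "universal dataset".
# `universal_docs` is a toy placeholder for a large crawled collection.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

universal_docs = [
    "the stock market fell as investors worried about inflation",
    "the central bank raised interest rates to curb inflation",
    "the football team won the national championship final",
    "the striker scored twice in the second half of the match",
    "new smartphone models feature faster chips and better cameras",
    "researchers released an open source machine learning library",
]

vectorizer = CountVectorizer(stop_words="english")
X_universal = vectorizer.fit_transform(universal_docs)

# Estimate the hidden topics on the universal dataset only.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(X_universal)

# Show the most probable words of each hidden topic.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {' '.join(top_words)}")
```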
2.2.1. Frameworks for Learning with Hidden Topics
Corresponding to the two typical learning problems, i.e. classification and clustering, we describe two frameworks with some differences in their architectures.
a. Framework for Classification
Figure 2.1. Classification with Hidden Topics
Nowadays, the continuous development of the Internet has created a huge number of documents which are difficult to manage, organize and navigate. As a result, the task of automatic classification, which is to categorize textual documents into two or more predefined classes, has received a lot of attention.
Several machine-learning methods have been applied to text classification, including decision trees, neural networks, support vector machines, etc. In typical applications of machine-learning methods, the training data is passed to a learning phase. The result of the learning step is an appropriate classifier capable of categorizing new documents. However, in cases where the training data is scarcer than expected or the data to be classified is very rare [52], learning with only the training data cannot provide a satisfactory classifier. Inspired by this fact, we propose a framework that enables us to enrich both the training data and newly coming data with hidden topics from an available large dataset so as to enhance the performance of text classification.
Classification with hidden topics is described in Figure 2.1. We first collect a very large external data collection called the "universal dataset". Next, a topic analysis technique such as pLSA or LDA is applied to this dataset. The result of this step is an estimated topic model which consists of hidden topics and the probability distributions of words over these topics. Upon this model, we can do topic inference for the training dataset and for new data. For each document, the output of topic inference is a probability distribution over the hidden topics – the topics analyzed in the estimation phase – given the document. The topic distributions of the training dataset are then combined with the training dataset itself for learning the classifier. In a similar way, new documents to be classified are combined with their topic distributions to create the so-called "new data with hidden topics" before being passed to the learned classifier.
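A minimal sketch of this whole pipeline, assuming Python with scikit-learn; the function and variable names are illustrative, and LinearSVC merely stands in for whichever classifier the learning step actually uses.

```python
# Sketch of "classification with hidden topics". All names are illustrative
# stand-ins; the thesis does not prescribe scikit-learn or this classifier.
from scipy.sparse import hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def classify_with_hidden_topics(universal_docs, train_docs, train_labels,
                                new_docs, n_topics=20):
    # 1. Estimate the topic model on the universal dataset only.
    vec = CountVectorizer()
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(vec.fit_transform(universal_docs))

    # 2. Topic inference: a distribution over hidden topics per document.
    train_topics = lda.transform(vec.transform(train_docs))
    new_topics = lda.transform(vec.transform(new_docs))

    # 3. Enrich: combine each document's word features with its topics.
    train_x = hstack([vec.transform(train_docs), train_topics])
    new_x = hstack([vec.transform(new_docs), new_topics])

    # 4. Learn the classifier on the enriched data and label the new data.
    clf = LinearSVC().fit(train_x, train_labels)
    return clf.predict(new_x)
```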
b. Framework for Clustering
Figure 2.2. Clustering with Hidden Topics
Text clustering is the task of automatically generating groups (clusters) of documents based on the similarity or distance among them. Unlike classification, the clusters are not known in advance. The user can optionally specify the desired number of clusters. The documents are then organized into clusters, each of which contains "close" documents.
Clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger ones; divisive algorithms begin with the whole set and divide it into smaller ones. The two families are contrasted in the sketch below.
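For illustration only (this example is not from the thesis), the contrast can be made concrete in a few lines, assuming documents are already represented as vectors:

```python
# Contrasting hierarchical (agglomerative) and partitional clustering on
# stand-in document vectors; purely illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
doc_vectors = rng.random((8, 5))  # placeholder for vectorized documents

# Agglomerative ("bottom-up"): each document starts as its own cluster,
# and the closest clusters are merged successively.
bottom_up = AgglomerativeClustering(n_clusters=3).fit_predict(doc_vectors)

# Partitional: all clusters are determined at once, here by k-means.
all_at_once = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(doc_vectors)

print(bottom_up)
print(all_at_once)
```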
The distance measure, which determines how the similarity of two documents is calculated, is key to the success of any text clustering algorithm. Some documents may be close to one another according to one distance and farther away according to another. Common distance functions are the Euclidean distance, the Manhattan distance (also called the taxicab norm or 1-norm) and the maximum norm, to name but a few; they are written out below.
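For documents represented as vectors x, y in R^n, these three distances are:

```latex
\begin{aligned}
d_2(x, y)      &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}          && \text{(Euclidean distance)} \\
d_1(x, y)      &= \sum_{i=1}^{n} \lvert x_i - y_i \rvert       && \text{(Manhattan distance, 1-norm)} \\
d_\infty(x, y) &= \max_{1 \le i \le n} \lvert x_i - y_i \rvert && \text{(maximum norm)}
\end{aligned}
```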
Web clustering, a type of text clustering specific to web pages, can be offline or online. Offline clustering clusters the whole storage of available web documents and has no constraint on response time. In online clustering, the algorithms need to meet a "real-time condition", i.e. the system needs to perform clustering as fast as possible. For example, the algorithm should take document snippets instead of whole documents as input, since downloading the original documents is time-consuming. The question here is how to enhance the quality of clustering for such document snippets in online web clustering. Inspired by the fact that snippets are only small pieces of text (and thus poor in content), we propose a framework that enriches them with hidden topics for clustering (Figure 2.2). This framework and its topic analysis step are similar to those for classification; the differences are only due to the essential differences between classification and clustering. A sketch of the enrichment step follows.
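A minimal sketch of that enrichment step, again with illustrative scikit-learn stand-ins; the thesis does not fix the clustering algorithm here, so k-means appears only as an example.

```python
# Sketch of "clustering with hidden topics" for short snippets.
# Names and estimators are illustrative stand-ins, not the thesis's code.
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def cluster_snippets_with_hidden_topics(universal_docs, snippets,
                                        n_clusters, n_topics=20):
    # Topic model estimated on the universal dataset, as in classification.
    vec = CountVectorizer()
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(vec.fit_transform(universal_docs))

    # Snippets are short and poor in content; their inferred topic
    # distributions supply the missing context before clustering.
    topics = lda.transform(vec.transform(snippets))
    enriched = hstack([vec.transform(snippets), topics]).tocsr()

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(enriched)
```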
2.2.2. Large-Scale Web Collections as Universal Dataset
Despite the obvious differences between the two learning frameworks, they share a key phase – the phase of analyzing topics of the previously collected dataset. Here are some important considerations for this phase:
- The degree of coverage of the dataset: the universal dataset should be l...
 
