Thesis: Real data analysis and discussion

FOREWORD
CHAPTER 1. INTRODUCTION TO GENE EXPRESSION DATA
1.1. GENE EXPRESSION
1.2. DNA MICROARRAY EXPERIMENTS
1.3. HIGH-THROUGHPUT MICROARRAY TECHNOLOGY
1.4. MICROARRAY DATA ANALYSIS
1.4.1. Pre-processing step on raw data
1.4.1.1. Processing missing values
1.4.1.2. Data transformation and discretization
1.4.1.3. Data reduction
1.4.1.4. Normalization
1.4.2. Data analysis tasks
1.4.2.1. Classification on gene expression data
1.4.2.2. Feature selection
1.4.2.3. Performance assessment
1.5. RESEARCH TOPICS ON cDNA MICROARRAY DATA
CHAPTER 2. GRAPH-BASED RANKING ALGORITHMS WITH GENE NETWORKS
2.1. GRAPH-BASED RANKING ALGORITHMS
2.2. INTRODUCTION TO GENE NETWORKS
2.2.1. The Boolean network model
2.2.2. Probabilistic Boolean networks
2.2.3. Bayesian networks
2.2.4. Additive regulation models
CHAPTER 3. REAL DATA ANALYSIS AND DISCUSSION
3.1. THE PROPOSED SCHEME FOR GENE SELECTION IN THE SAMPLE CLASSIFICATION PROBLEM
3.2. DEVELOPMENT ENVIRONMENT
3.3. ANALYSIS RESULTS
REFERENCES



Content summary:

The term variable refers to a particular gene's expression levels over all samples, while an observation is the vector of expression levels of one sample across all studied genes. The following are three common data reduction strategies:
i. Variable selection: select a good subset of all variables and retain only those for further analysis.
ii. Observation selection: similar to variable selection, except that observations play the role of variables here.
iii. Variable combination: find a suitable combination of existing variables into a kind of "super" or composite variable. The composite variables are then used for further analysis, while the variables used to create them are discarded.
Variable selection is one of the most important issues in microarray analysis, because microarray analysis encounters the so-called "large p, small n" problem: the number of studied genes is usually much larger than the number of samples. Moreover, most genes (variables) are uninformative. One idea is to exhaustively consider and evaluate all possible subsets and then choose the best one. However, this is infeasible in practice, since there are 2^n − 1 possible non-empty subsets of the given n genes.
Combining relevant biological knowledge with heuristics is a simple way to select a subset of suitable variables. Instead of considering all subsets, genes can be examined one by one and either kept in or eliminated from the final subset according to whether they satisfy some predefined criterion, such as an information-gain or entropy-based measure, a statistical test, or an interdependence analysis. In most situations, the set of variables obtained by such selection methods may still contain correlated genes. Moreover, some genes are filtered out that expose their meaningfulness only in conjunction with other genes (variables).
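As a concrete illustration of such one-gene-at-a-time filtering, the sketch below scores each gene independently with a simple two-sample t-like statistic and keeps the top-ranked genes. The data, the score, and the cut-off are purely illustrative assumptions, not the thesis's actual criterion:

```python
import math

def t_score(class_a, class_b):
    """Absolute two-sample t-like statistic for one gene's expression values."""
    ma = sum(class_a) / len(class_a)
    mb = sum(class_b) / len(class_b)
    va = sum((x - ma) ** 2 for x in class_a) / (len(class_a) - 1)
    vb = sum((x - mb) ** 2 for x in class_b) / (len(class_b) - 1)
    se = math.sqrt(va / len(class_a) + vb / len(class_b))
    return abs(ma - mb) / se if se > 0 else 0.0

def filter_genes(expr, labels, keep=2):
    """expr: {gene: [values per sample]}; labels: 0/1 class per sample.
    Score each gene independently, then retain the `keep` top-ranked genes."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = t_score(a, b)
    return sorted(scores, key=scores.get, reverse=True)[:keep]

expr = {
    "g1": [1.0, 1.1, 0.9, 5.0, 5.2, 4.8],   # strongly differential
    "g2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],   # uninformative
    "g3": [0.5, 0.4, 0.6, 1.5, 1.4, 1.6],   # moderately differential
}
labels = [0, 0, 0, 1, 1, 1]
print(filter_genes(expr, labels))  # ['g1', 'g3']
```

Note that exactly this univariate scoring is what misses genes whose relevance only shows up jointly with other genes, as the paragraph above points out.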
Taking more than one gene (variable) into account at once, multivariate feature selection methods such as cluster analysis techniques and multivariate decision trees compute a correlation or covariance matrix to detect redundant and correlated variables. In the covariance matrix, variables that tend to take large values also tend to have large covariance scores. The correlation matrix is calculated in the same fashion, but its elements are normalized into the interval [-1, 1] to eliminate this effect of large-valued variables [35].
The original set of genes (variables) can be reduced by a procedure that merges each subset of highly correlated genes (variables) into one variable, so that the derived set contains mutually largely uncorrelated variables but still preserves the original information content. For example, we can replace a set of highly correlated gene or array profiles by an average profile that conveys most of the profiles' information.
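A minimal sketch of this merging idea, assuming Pearson correlation as the similarity measure and a greedy grouping strategy (both choices are ours, not prescribed by the text; profiles are assumed non-constant so the correlation is defined):

```python
def pearson(u, v):
    """Pearson correlation of two non-constant profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def merge_correlated(expr, threshold=0.95):
    """Greedily group genes whose profiles correlate above `threshold`,
    then replace each group by its element-wise average profile."""
    merged, used = {}, set()
    genes = list(expr)
    for i, g in enumerate(genes):
        if g in used:
            continue
        group = [g]
        for h in genes[i + 1:]:
            if h not in used and pearson(expr[g], expr[h]) > threshold:
                group.append(h)
                used.add(h)
        name = "+".join(group)
        merged[name] = [sum(col) / len(group)
                        for col in zip(*(expr[x] for x in group))]
    return merged

expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 3.1, 4.1],   # nearly identical profile to g1
    "g3": [4.0, 3.0, 2.0, 1.0],   # anti-correlated with g1
}
print(merge_correlated(expr))  # g1 and g2 collapse into one averaged variable
```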
Besides, Principal Component Analysis (PCA), which summarizes patterns of correlation and provides the basis for predictive models, is a feature-merging method commonly used to reduce microarray data [26].
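As a rough illustration of what PCA computes, the sketch below extracts just the first principal component by power iteration on the covariance matrix of mean-centred data. This is a didactic toy, not the method used in the thesis; a real analysis would call a library routine and keep several components:

```python
def leading_component(rows, iters=200):
    """Power iteration for the first principal component.
    rows: one list per sample, each holding that sample's expression values."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    x = [[r[j] - means[j] for j in range(p)] for r in rows]   # centre the data
    # p x p covariance matrix of the centred data
    cov = [[sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    v = [1.0] * p                       # initial direction guess
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]       # re-normalize each step
    scores = [sum(x[i][j] * v[j] for j in range(p)) for i in range(n)]
    return v, scores                    # direction, per-sample projections

# two "genes": nearly all variation lies along the first one
rows = [[1.0, 0.1], [2.0, 0.2], [3.0, 0.3], [4.0, 0.4]]
v, scores = leading_component(rows)
# v points mostly along the first gene; scores are the merged composite variable
```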
1.4.1.4. Normalization
Ideally, the expression matrix contains the true level of transcript abundance for each measured gene-sample combination. However, because of naturally biased measurement conditions, the measured values usually deviate from the true expression level by some amount. So we have measured level = true level + error, where the error comes both from a systematic tendency of the measurement instrument to detect values that are too low or too high [35] and from random measurement error. The former is called bias and the latter variance, so the error is the sum of bias and variance. The variance is often normally distributed, meaning that wrong measurements in both directions are equally frequent, and that small deviations are more frequent than large ones.
Normalization is a numerical method designed to deal with such measurement errors and with biological variation as follows. After the raw data is pre-processed with a transformation procedure, e.g., the base-2 logarithm, the resulting matrix can be normalized by multiplying each element of an array by an array-specific factor such that the mean value is the same for all arrays. A stronger requirement is that each array be rescaled so that its mean equals 0 and its standard deviation equals 1.
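The procedure just described can be sketched as follows, assuming each row of the matrix holds one array's raw intensities; the log2 transform and the per-array mean-0, standard-deviation-1 rescaling follow the text, while the numbers themselves are invented:

```python
import math

def normalize_arrays(matrix):
    """matrix[i] holds the raw intensities of array (sample) i across all genes.
    Log2-transform, then rescale each array to mean 0 and standard deviation 1.
    Assumes positive intensities and non-constant arrays."""
    out = []
    for row in matrix:
        logged = [math.log2(v) for v in row]
        m = sum(logged) / len(logged)
        sd = (sum((v - m) ** 2 for v in logged) / (len(logged) - 1)) ** 0.5
        out.append([(v - m) / sd for v in logged])
    return out

raw = [[120.0, 480.0, 1920.0],
       [ 60.0, 240.0,  960.0]]   # second array is globally dimmer by half
norm = normalize_arrays(raw)
# after log2 + standardization the two arrays have identical profiles
```

This removes the global intensity difference between the two arrays while preserving the relative pattern across genes, which is exactly the point of array-wise normalization.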
1.4.2. Data analysis tasks
Once the data pre-processing step has been carried out, a numerical analysis method is deployed corresponding to the scientific analysis task. The elementary tasks can be divided into two categories: prediction and pattern-detection (Figure 1.10). Within the scope of this thesis, only two of these topics, classification and gene regulatory networks, are discussed in the following sections.
[Figure 1.10: Two classes of data analysis tasks for microarray data. Prediction tasks: classification; regression or estimation; time-series prediction. Pattern-detection tasks: clustering; correlation analysis; association analysis; deviation detection; visualization.]
1.4.2.1. Classification on gene expression data
Classification is a prediction or supervised learning problem in which data objects are assigned to one of k predefined classes {c1, c2, ..., ck}. Each data object is characterized by a set of g measurements, which form the feature vector (vector of predictor variables) X = (x1, ..., xg), and is associated with a dependent variable (class label) Y in {1, 2, ..., k}. The classification is called binary if k = 2 and multi-class otherwise. Informally, a classifier C can be thought of as a partition of the feature space X into k disjoint and exhaustive subsets A1, ..., Ak, containing the data objects whose assigned classes are c1, ..., ck respectively.
Classifiers are derived from a training set L = {(x1, y1), ..., (xn, yn)} in which each data object is known to belong to a certain class. The notation C(.; L) is used to denote a classifier built from a learning set L [24]. For gene expression data, the data objects are the biological samples to be classified, the features correspond to the expression measures of the different genes over all samples studied, and the classes correspond to different types of tumors (e.g., nodal-positive vs. negative breast tumors, or tumors with good vs. bad prognosis). The process of classifying tumor samples is thus closely tied to the gene selection mentioned above, i.e., the identification of marker genes that characterize the different tumor classes.
For the classification problem on microarray data, one has to classify a sample profile into predefined tumor types. Each gene corresponds to a feature variable whose value domain contains all possible gene expression levels. The expression levels may be either absolute (e.g., Affymetrix oligonucleotide arrays) or relative to the expression levels of a well-defined common reference sample (e.g., 2-color cDNA microarrays). The main obstacle encountered when classifying microarray data is the very large number of genes (variables) relative to the number of tumor samples, the so-called "large p, small n" problem. Typical expression data contain from 5,000 to 10,000 genes for fewer than 100 tumor samples.
The problem of classifying biological samples using gene expression data has become a key issue in cancer research. Successful diagnosis and treatment of cancer requires a reliable and precise classification of tumors. Recently, many researchers have published work on statistical aspects of classification in the context of microarray experiments [14,17], mainly focusing on existing methods or variants derived from them. Studies to date suggest that simple methods such as k nearest neighbors [17] or naïve Bayes classification [13,3] perform as well as more complex approaches such as Support Vector Machines (SVMs) [14]. This section discusses the naïve Bayes and k nearest neighbors methods; finally, we describe the issue of performance assessment.
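As a preview of the k nearest neighbors method just mentioned, here is a minimal sketch using Euclidean distance and a majority vote over the k closest training samples. The two-gene expression profiles and tumor labels are invented for illustration:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label) pairs.
    Predict by majority vote over the k nearest training samples."""
    dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([0.9, 1.0], "tumorA"), ([1.1, 0.8], "tumorA"),
         ([3.0, 3.1], "tumorB"), ([2.9, 3.2], "tumorB"), ([3.1, 2.8], "tumorB")]
print(knn_classify(train, [1.0, 0.9]))  # tumorA
```

With thousands of genes as features, the distance computation is dominated by uninformative variables, which is precisely why the gene selection discussed earlier matters for this classifier.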
The naïve Bayes classification
Suppose that the likelihood pk(x) = p(x | Y = k) and the class priors πk are known for all possible class values k. Bayes' theorem can be used to compute the posterior probability p(k | x) of class k given feature vector x as

p(k | x) = πk pk(x) / Σ_{l=1..K} πl pl(x)
The naïve Bayes classification predicts the class CB(x) of an object x by maximizing the posterior probability:

CB(x) = argmax_k p(k | x)
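The rule above can be sketched in code, assuming (as one common concrete choice, not the only one the text allows) that each class-conditional density pk(x) factorizes over features, with per-feature Gaussian densities estimated from the training set; the toy one-gene profiles are invented:

```python
import math

def gauss(x, mean, var):
    """Gaussian density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(train):
    """train: list of (features, label). Estimate per-class priors pi_k and,
    naively assuming feature independence, per-feature Gaussian parameters."""
    by_class = {}
    for x, y in train:
        by_class.setdefault(y, []).append(x)
    model, n = {}, len(train)
    for y, rows in by_class.items():
        prior = len(rows) / n
        params = []
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            m = sum(col) / len(col)
            v = sum((c - m) ** 2 for c in col) / (len(col) - 1)
            params.append((m, max(v, 1e-9)))   # guard against zero variance
        model[y] = (prior, params)
    return model

def predict(model, x):
    """Return argmax_k pi_k * prod_j p(x_j | k), i.e. the Bayes rule above
    (the shared denominator does not affect the argmax)."""
    def score(y):
        prior, params = model[y]
        p = prior
        for xj, (m, v) in zip(x, params):
            p *= gauss(xj, m, v)
        return p
    return max(model, key=score)

train = [([1.0], 0), ([1.2], 0), ([0.8], 0),
         ([5.0], 1), ([5.2], 1), ([4.8], 1)]
model = fit(train)
print(predict(model, [1.1]))  # 0
```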
Depending on whether parametric or non-parametric estimation is used, there are two general schemes to estimate the class posterior probabilities p(k | x): density estimation and direct function estimation. In the density estimation approach, the class-conditional densities pk(x) = p(x | Y = k...
 