2012-02-29 10:40:42 | Category: R&Bioconductor

**Clustering and Data Mining in R**

**Introduction** (Slide Show)
- R contains many functions and libraries for clustering large data sets. A very useful overview of clustering utilities in R is available on the Cluster Task Page; machine learning algorithms are covered on the Machine Learning Task Page and in the MLInterfaces package.

**Data Preprocessing**
- Generate a sample data set
- Data centering and scaling
- Obtain a distance matrix

**Hierarchical Clustering (HC)**
- The basic hierarchical clustering functions in R are hclust, flashClust, agnes and diana. hclust and agnes perform agglomerative hierarchical clustering, while diana performs divisive hierarchical clustering. flashClust is a much faster (50-100 times) version of hclust. The pvclust package can be used to assess the uncertainty in hierarchical cluster analyses. It provides approximately unbiased p-values as well as bootstrap p-values. As an introduction to R's standard hierarchical clustering utilities, one should read the help pages of the following functions: hclust, dendrogram, as.dendrogram, cutree and heatmap. An example of sub-clustering (subsetting) heatmaps based on selected tree nodes is given in the last part of this section (see zoom into heatmaps).
- Clustering with hclust
- Tree subsetting (see also Dynamic Tree Cut package)
- Tree coloring and zooming into branches
- Plot heatmaps with the heatmap() function
- Plot heatmaps with the image() or heatmap.2() functions
- Zooming into heatmaps by sub-clustering selected tree nodes
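
The preprocessing and hclust steps listed above can be sketched roughly as follows. This is only a minimal illustration on made-up random data; the matrix `y`, its dimensions, the correlation-based distance and the choice of three clusters are all assumptions, not values taken from the original manual.

```r
set.seed(1)
# Hypothetical sample data: 10 items (rows) measured in 5 samples (columns)
y <- matrix(rnorm(50), nrow = 10, ncol = 5,
            dimnames = list(paste0("g", 1:10), paste0("t", 1:5)))

# Center and scale each row; scale() works column-wise, hence the transposes
yscaled <- t(scale(t(y)))

# Correlation-based distance matrix between rows
d <- as.dist(1 - cor(t(yscaled), method = "pearson"))

# Agglomerative hierarchical clustering with complete linkage
hr <- hclust(d, method = "complete")

# Cut the tree into 3 clusters and show the cluster label of each row
clusters <- cutree(hr, k = 3)
print(clusters)

# plot(hr) would draw the dendrogram
```

Passing `h` instead of `k` to cutree() cuts the tree at a given height rather than into a fixed number of clusters.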

**Bootstrap Analysis in Hierarchical Clustering**

**QT Clustering**

**K-Means & PAM**
- K-means
- PAM (partitioning around medoids)
- Clara (clustering large applications: PAM method for large data sets)

**Fuzzy Clustering**
- Fuzzy clustering with the cluster library
- Fuzzy clustering with the e1071 library

**Self-Organizing Map (SOM)**

**Principal Component Analysis (PCA)**

**Multidimensional Scaling (MDS)**

**Bicluster Analysis**
- Plaid model biclustering
- XMotifs biclustering
- Bimax biclustering
- Spectral biclustering
- CC biclustering

**Network Analysis**
- WGCNA: an R package for weighted correlation network analysis
- BioNet: routines for the functional analysis of biological networks

**Support Vector Machines (SVM)**

**Similarity Measures for Clustering Results**

**Clustering Exercises**
- Slide Show (R Code)
- Install required libraries
- The following exercises demonstrate several useful clustering and data mining utilities in R.
- Import a sample data set
- Download from GEO the Arabidopsis IAA treatment series "GSE1110" in TXT format. The direct link to the download is:
- ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE1110/
- Uncompress the downloaded file.
- Import the data set into R
- Filtering
- Hierarchical clustering
- Obtain significant clusters by pvclust bootstrap analysis
- Compare PAM (K-means) with hierarchical clustering
- Compare SOM with hierarchical clustering
- Compare PCA with SOM
- Compare MDS with HC, SOM and K-means
- Fuzzy clustering
- Biclustering

**Administration**

**Installation of R**

**Installation of BioConductor Packages**

**Installation of CRAN Packages (Task View)**

The pvclust package assesses the uncertainty in hierarchical cluster analysis by calculating p-values for each cluster via multiscale bootstrap resampling. The method provides two types of p-values. The approximately unbiased (AU) p-value is computed by multiscale bootstrap resampling. It is less biased than the second one, the bootstrap probability (BP), which is computed by normal bootstrap resampling.
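
To make the idea of a bootstrap probability concrete, here is a hand-rolled sketch in base R. It is not the pvclust implementation and does not compute AU values. It only resamples rows with replacement and counts how often a chosen cluster reappears in the tree. The data, the helper `tree_clusters()`, the target cluster and the number of replicates are assumptions made for illustration.

```r
set.seed(42)
# Hypothetical data: 100 rows (e.g. genes) and 6 columns (samples)
x <- matrix(rnorm(600), nrow = 100, ncol = 6,
            dimnames = list(NULL, paste0("s", 1:6)))
# Add a shared row-wise signal so that s4, s5 and s6 correlate strongly
x[, 4:6] <- x[, 4:6] + rnorm(100, sd = 3)

# Hypothetical helper: member sets of all clusters found in the column tree
tree_clusters <- function(mat) {
  hc <- hclust(as.dist(1 - cor(mat)), method = "average")
  groups <- lapply(seq_len(ncol(mat) - 1), function(k) {
    split(colnames(mat), cutree(hc, k = k))
  })
  lapply(unlist(groups, recursive = FALSE), sort)
}

target <- c("s4", "s5", "s6")  # cluster whose support is estimated
nboot <- 200                   # number of bootstrap replicates

hits <- replicate(nboot, {
  idx <- sample(nrow(x), replace = TRUE)  # resample rows with replacement
  any(vapply(tree_clusters(x[idx, ]), identical, logical(1), target))
})

# Fraction of replicates in which the target cluster was recovered
bp <- mean(hits)
print(bp)
```

With the pvclust package installed, its pvclust() function performs the multiscale version of this resampling and reports AU and BP values for every cluster in the tree.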

QT (quality threshold) clustering is a partitioning method that forms clusters based on a maximum cluster diameter. It iteratively identifies the largest cluster below the threshold and removes its items from the data set until all items are assigned. The method was developed by Heyer et al. (1999) for the clustering of gene expression data.
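
The procedure described above can be sketched in base R as follows. This is a simplified illustration rather than a reference implementation: the function name `qt_cluster()`, the greedy candidate growth and the toy data are assumptions.

```r
# Hypothetical helper: QT clustering on a distance object or matrix.
# Returns one integer cluster label per item.
qt_cluster <- function(d, max_diameter) {
  d <- as.matrix(d)
  remaining <- seq_len(nrow(d))
  assignment <- integer(nrow(d))
  cluster_id <- 0L
  while (length(remaining) > 0) {
    best <- integer(0)
    for (seed in remaining) {
      # Grow a candidate cluster around each remaining item
      members <- seed
      candidates <- setdiff(remaining, seed)
      while (length(candidates) > 0) {
        # Largest distance of each candidate to the current members
        far <- apply(d[candidates, members, drop = FALSE], 1, max)
        pick <- which.min(far)
        if (far[pick] > max_diameter) break  # diameter would exceed threshold
        members <- c(members, candidates[pick])
        candidates <- candidates[-pick]
      }
      if (length(members) > length(best)) best <- members
    }
    # Keep the largest candidate cluster and remove its items
    cluster_id <- cluster_id + 1L
    assignment[best] <- cluster_id
    remaining <- setdiff(remaining, best)
  }
  assignment
}

# Toy data: three tight groups on a line
x <- c(0, 0.1, 0.2, 0.3, 5, 5.1, 5.2, 10, 10.1)
clusters <- qt_cluster(dist(x), max_diameter = 1)
print(clusters)
```

Unlike K-means, the number of clusters is not chosen in advance; it follows from the diameter threshold.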

K-means, PAM (partitioning around medoids) and clara are related partition clustering algorithms that cluster data points into a predefined number of K clusters. They do this by associating each data point with its nearest centroid and then recomputing the cluster centroids. In the next step the data points are associated with the nearest adjusted centroid. This procedure continues until the cluster assignments are stable. K-means uses the average of all the points in a cluster for centering, while PAM uses the most centrally located point. Commonly used R functions for K-means clustering are: kmeans() of the stats package, kcca() of the flexclust package and trimkmeans() of the trimcluster package. PAM clustering is available in the pam() function from the cluster package. The clara() function of the same package is a PAM wrapper for clustering very large data sets.
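
The difference between mean-based and medoid-based centers can be seen in a small sketch like the one below. The simulated two-group data set and the choice of K = 2 are assumptions for illustration only.

```r
library(cluster)  # provides pam() and clara()

set.seed(7)
# Hypothetical data: two well-separated groups of 20 points in 2 dimensions
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)  # centers are cluster means
pm <- pam(x, k = 2)                        # centers are actual data points

print(km$centers)  # averaged coordinates
print(pm$medoids)  # rows of x chosen as cluster representatives

# Cross-tabulate the two partitions to see how well they agree
agreement <- table(kmeans = km$cluster, pam = pm$clustering)
print(agreement)
```

For much larger data sets, clara() takes the same `x` and `k` arguments and applies PAM to subsamples of the data.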

In contrast to strict/hard clustering approaches, fuzzy clustering allows multiple cluster memberships of the clustered items. This is commonly achieved by partitioning the membership assignments among clusters by positive weights that sum to one for each item. Several R libraries contain implementations of fuzzy clustering algorithms. The library e1071 contains the cmeans (fuzzy C-means) and cshell (fuzzy C-shell) clustering functions. The cluster library provides the fanny function, which is a fuzzy implementation of the k-medoids method described above.
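
A minimal sketch of fuzzy clustering with fanny() from the cluster library might look like this. The simulated data and the membership exponent of 1.5 are assumptions chosen for illustration.

```r
library(cluster)  # provides fanny()

set.seed(3)
# Hypothetical data: two groups of 15 points in 2 dimensions
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 4), ncol = 2))

# Fuzzy clustering into 2 clusters; memb.exp controls the degree of fuzziness
fz <- fanny(x, k = 2, memb.exp = 1.5)

# One membership weight per item and cluster
print(head(round(fz$membership, 2)))

# The weights of each item sum to one
member_sums <- rowSums(fz$membership)

# Hard assignment: cluster with the highest membership weight
print(fz$clustering)
```

Values of `memb.exp` closer to 1 give crisper memberships, while larger values make them more uniform.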

The self-organizing map (SOM), also known as Kohonen network, is a popular artificial neural network algorithm in the unsupervised learning area. The approach iteratively assigns all items in a data matrix to a specified number of representatives and then updates each representative by the mean of its assigned data points. Widely used R packages for SOM clustering and visualization are: class (part of R), SOM and kohonen. The SOM package, which is introduced here, provides utilities similar to those of the GeneCluster software from the Broad Institute.
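
As a rough sketch, the class package mentioned above can fit a small map as shown below. The random data, the 3 x 3 grid and the use of knn1() to look up the closest unit for each item are assumptions for illustration; this example does not use the SOM package that the section introduces.

```r
library(class)  # provides SOM(), somgrid() and knn1()

set.seed(11)
# Hypothetical data: 50 items with 4 scaled variables
x <- scale(matrix(rnorm(200), nrow = 50, ncol = 4))

# A 3 x 3 rectangular grid gives 9 representatives (map units)
grid <- somgrid(xdim = 3, ydim = 3, topo = "rectangular")
sm <- SOM(x, grid = grid)

# Codebook vectors: one row per map unit, one column per variable
print(dim(sm$codes))

# Assign every item to its closest map unit
unit <- knn1(sm$codes, x, factor(seq_len(nrow(sm$codes))))
print(table(unit))
```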

Principal components analysis (PCA) is a data reduction technique that simplifies multidimensional data sets to 2 or 3 dimensions for plotting purposes and visual variance analysis. The following commands introduce the basic usage of the prcomp() function. A closely related function is princomp(). The BioConductor library pcaMethods provides many additional PCA functions. For viewing PCA plots in 3D, one can use the scatterplot3d library or the made4 library.
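
Since the commands are missing from this copy, the following is only a minimal sketch of basic prcomp() usage on simulated data. The matrix `x` and the correlated second column are assumptions for illustration.

```r
set.seed(5)
# Hypothetical data: 100 items with 5 variables
x <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
x[, 2] <- x[, 1] + rnorm(100, sd = 0.1)  # make two variables strongly correlated

# PCA on centered and scaled variables
pca <- prcomp(x, center = TRUE, scale. = TRUE)
print(summary(pca))

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Coordinates of the items on the first two components
scores <- pca$x[, 1:2]
# plot(scores) would draw the 2D PCA plot
```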

Multidimensional scaling (MDS) algorithms start with a matrix of item-item distances and then assign coordinates for each item in a low-dimensional space to represent the distances graphically. cmdscale() is the base function for MDS in R. Additional MDS functions are sammon() and isoMDS() of the MASS library.
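
A small sketch with cmdscale() illustrates the idea. The four points forming a 2 x 1 rectangle are an assumed toy example; because they are truly two-dimensional, the recovered coordinates should reproduce the input distances up to rotation and reflection.

```r
# Hypothetical items: the four corners of a 2 x 1 rectangle
pts <- rbind(a = c(0, 0), b = c(2, 0), c = c(2, 1), d = c(0, 1))
d <- dist(pts)

# Classical MDS into 2 dimensions, using only the distance matrix
coords <- cmdscale(d, k = 2)
print(coords)

# Largest deviation between original and reconstructed distances
max_error <- max(abs(as.vector(dist(coords)) - as.vector(d)))
print(max_error)

# plot(coords) would show the items in the reduced space
```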

Biclustering (also co-clustering or two-mode clustering) is an unsupervised clustering technique which allows simultaneous clustering of the rows and columns of a matrix. The goal of biclustering is to find subgroups of rows and columns which are as similar as possible to each other and as different as possible from the remaining data points. The biclust package, which is introduced here, contains a collection of bicluster algorithms, data preprocessing and visualization methods (Detailed User Manual). An algorithm that allows the integration of different data types is implemented in the cMonkey R program (BMC Bioinformatics 2006, 7, 280). A comparison of several bicluster algorithms for clustering gene expression data has been published by Prelic et al. (2006). Since most biclustering algorithms expect the input data matrix to be properly preprocessed, it is especially important to carefully read the manual pages for the different functions.

Support vector machines (SVMs) are supervised machine learning classification methods. SVM implementations are available in the R packages kernlab and e1071. The e1071 package contains an interface to the C++ libsvm implementation from Chih-Chung Chang and Chih-Jen Lin. In addition to SVMs, the e1071 package includes a comprehensive set of machine learning utilities, such as functions for latent class analysis, bootstrapping, short time Fourier transform, fuzzy clustering, shortest path computation, bagged clustering, naive Bayes classifier, etc. The kernlab package contains additional functions for spectral clustering, kernel PCA, etc.

- An excellent introduction to the usage of SVMs in R is available in David Meyer's SVM article.

To measure the similarities among clustering results, one can compare the numbers of identical and unique item pairs appearing in their clusters. This can be achieved by counting the number of item pairs found in both clustering sets (a) as well as the pairs appearing only in the first (b) or the second (c) set. With this information it is possible to calculate a similarity coefficient, such as the Jaccard Index. The latter is defined as the size of the intersection divided by the size of the union of two sample sets: a/(a+b+c). In the case of partitioning results, the Jaccard Index measures how frequently pairs of items are joined together in two clustering data sets and how often pairs are observed only in one set. Related coefficients are the Rand Index and the Adjusted Rand Index. These indices also consider the number of pairs (d) that are not joined together in any of the clusters in both sets. A variety of alternative similarity coefficients can be considered for comparing clustering results. An overview of available methods is given on this cluster validity page. In addition, the Consense library contains a variety of functions for comparing cluster sets, and the mclust02 library contains an implementation of the variation of information criterion described by M. Meila (J Mult Anal 98, 873-895).
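
The pair counting described above can be sketched in base R as follows. The helper name `pair_counts()` and the two toy label vectors are assumptions for illustration. The Rand Index is computed here as (a+d)/(a+b+c+d), and the Adjusted Rand Index is not covered.

```r
# Hypothetical helper: count item pairs by how two clusterings treat them
pair_counts <- function(c1, c2) {
  stopifnot(length(c1) == length(c2))
  idx <- combn(length(c1), 2)             # all item pairs
  same1 <- c1[idx[1, ]] == c1[idx[2, ]]   # pair joined in the first clustering
  same2 <- c2[idx[1, ]] == c2[idx[2, ]]   # pair joined in the second clustering
  c(a = sum(same1 & same2),    # joined in both
    b = sum(same1 & !same2),   # joined only in the first
    c = sum(!same1 & same2),   # joined only in the second
    d = sum(!same1 & !same2))  # joined in neither
}

# Two toy clusterings of the same six items
cl1 <- c(1, 1, 1, 2, 2, 2)
cl2 <- c(1, 1, 2, 2, 3, 3)

n <- pair_counts(cl1, cl2)
print(n)

jaccard <- n[["a"]] / (n[["a"]] + n[["b"]] + n[["c"]])
rand <- (n[["a"]] + n[["d"]]) / sum(n)
print(c(jaccard = jaccard, rand = rand))
```

For these two label vectors the counts are a=2, b=4, c=1 and d=8, which gives a Jaccard Index of 2/7 and a Rand Index of 2/3.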

- Follow the examples given in the Bicluster Analysis section.

- Download and install R for your operating system from CRAN.

R interfaces: RStudio (additional options)

- The latest instructions for installing BioConductor packages are available on the BioC Installation page. Only the essential steps are given here. To install BioConductor packages, execute from the R console the following commands:

- To install CRAN packages, execute from the R console the following command:

Alternatively, a package can be downloaded and then installed in compressed form like this:

List all packages installed on a system:
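
The commands for the installation steps above are missing from this copy, so the following is only a sketch of how they are commonly written today. It assumes the current BiocManager-based Bioconductor workflow, which may differ from the commands the original page showed. The package names and the archive file name `mypackage_1.0.tar.gz` are placeholders. The installation calls are wrapped in `if (FALSE)` so that nothing is downloaded when the block is sourced as a whole.

```r
if (FALSE) {
  # BioConductor packages (assumes the BiocManager workflow)
  install.packages("BiocManager")
  BiocManager::install("limma")

  # CRAN packages
  install.packages(c("cluster", "pvclust"))

  # A downloaded package archive (hypothetical file name)
  install.packages("mypackage_1.0.tar.gz", repos = NULL, type = "source")
}

# List all packages installed on the system
pkgs <- rownames(installed.packages())
print(head(pkgs))
```

Calling library() without arguments gives a similar overview in an interactive session.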
