R & Bioconductor Manual (Part 5)

2012-02-29 10:40:42 | Category: R&Bioconductor

  1. Clustering and Data Mining in R
    1. Introduction (Slide Show)
    2. R contains many functions and libraries for clustering large data sets. A very useful overview of the clustering utilities in R is available on the CRAN Cluster Task View, and machine learning algorithms are covered by the Machine Learning Task View and the MLInterfaces package.


    3. Data Preprocessing
    4. Generate a sample data set
        y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data matrix.
      Data centering and scaling
        scale(t(y)); yscaled <- t(scale(t(y))); apply(yscaled, 1, sd) # The function scale() centers and/or scales numeric matrices column-wise. When used with its default settings, the function returns columns that have a mean close to zero and a standard deviation of one. For row-wise scaling the data set needs to be transposed first using the t() function. Another transposing step is necessary to return the data set in its original orientation (same row/column assignment). Centering and scaling are common data transformation steps for many clustering techniques.
      Obtain a distance matrix
        dist(y, method = "euclidean") # Computes and returns a distance matrix for the rows in the data matrix 'y' using the specified distance measure, here 'euclidean'.
        c <- cor(t(y), method="spearman"); d <- as.dist(1-c); d # To obtain correlation-based distances, one first has to compute a correlation matrix with the cor() function. Since this step operates across matrix columns, a transpose step with t() is required to obtain row distances. In a second step, the distance matrix object of class "dist" is obtained with the as.dist() function.

    5. Hierarchical Clustering (HC)
    6. The basic hierarchical clustering functions in R are hclust, flashClust, agnes and diana. The hclust and agnes functions perform agglomerative hierarchical clustering, while diana performs divisive hierarchical clustering (a short agnes/diana sketch is included at the end of the hclust examples below); flashClust is a greatly speed-improved (50-100 times faster) version of hclust. The pvclust package can be used for assessing the uncertainty in hierarchical cluster analyses; it provides approximately unbiased p-values as well as bootstrap p-values. As an introduction to R's standard hierarchical clustering utilities, one should read the help pages of the following functions: hclust, dendrogram, as.dendrogram, cutree and heatmap. An example for sub-clustering (subsetting) heatmaps based on selected tree nodes is given in the last part of this section (see zooming into heatmaps).


      Clustering with hclust
        y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
        c <- cor(t(y), method="spearman"); d <- as.dist(1-c); hr <- hclust(d, method = "complete", members=NULL) # This is a short example of clustering the rows of the generated sample matrix 'y' with hclust. Seven different clustering methods can be selected with the 'method' argument: ward, single, complete, average, mcquitty, median and centroid. The object returned by hclust is a list of class hclust which describes the tree generated by the clustering process with the following components: merge, height, order, labels, method, call and dist.method.
        par(mfrow = c(2, 2)); plot(hr, hang = 0.1); plot(hr, hang = -1) # The generated tree can be plotted with the plot() function. When the hang argument is set to '-1', all leaves end on one line and their labels hang down from 0. More details on the plotting behavior are provided in the hclust help document (?hclust).
        plot(as.dendrogram(hr), edgePar=list(col=3, lwd=4), horiz=T) # To plot trees horizontally, the hclust tree has to be transformed into a dendrogram object.
        unclass(hr) # Prints the full content of the hclust object.
        str(as.dendrogram(hr)) # Prints dendrogram structure as text.
        hr$labels[hr$order] # Prints the row labels in the order they appear in the tree.
        par(mfrow=c(2,1)); hrd1 <- as.dendrogram(hr); plot(hrd1); hrd2 <- reorder(hrd1, sample(1:10)); plot(hrd2); labels(hrd1); labels(hrd2) # Example to reorder a dendrogram and print out its labels.
        library(ctc); hc2Newick(hr) # The 'hc2Newick' function of the BioC ctc library can convert a hclust object into the Newick tree file format for export into external programs.
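      The agnes and diana functions mentioned above can be used in a very similar way; the following two lines are only a minimal sketch, and the method="complete" setting is illustrative.
        library(cluster); hra <- agnes(d, diss=TRUE, method="complete"); hrd <- diana(d, diss=TRUE) # Performs agglomerative clustering with agnes() and divisive clustering with diana() on the distance matrix 'd' from above.
        par(mfrow=c(1,2)); plot(as.dendrogram(as.hclust(hra)), main="agnes"); plot(as.dendrogram(as.hclust(hrd)), main="diana") # Both results can be converted with as.hclust() and then plotted and processed like regular hclust trees.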
      Tree subsetting (see also Dynamic Tree Cut package)
        mycl <- cutree(hr, h=max(hr$height)/2); mycl[hr$labels[hr$order]] # Cuts the tree into discrete cluster groups (bins) at the height specified by the 'h' argument. Alternatively, one can cut based on a desired number of clusters specified with the 'k' argument.
        rect.hclust(hr, k=5, border="red") # Cuts dendrogram at specified level and draws rectangles around the resulting clusters.
        subcl <- names(mycl[mycl==3]); subd <- as.dist(as.matrix(d)[subcl,subcl]); subhr <- hclust(subd, method = "complete"); par(mfrow = c(1, 2)); plot(hr); plot(subhr) # This example shows how one can subset a tree by cutting it at a given branch point provided by the cutree() function. The identified row IDs are then used to subset the distance matrix and re-cluster it with hclust.
        lapply(as.dendrogram(hr), unlist); lapply(as.dendrogram(hr)[[1]], unlist) # Returns the indices of the cluster members of the two main branches. The second step returns the members of the next two descendent clusters of the first main branch.
        example(hclust) # The last step in this example shows how the 'members' argument can be used to reconstruct a hierarchical tree above the cut points using the cluster centers. More details on this can be found in the hclust help document.
        source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/selectNodes.R"); Uniclusters <- nodes2Cluster(myhclust=hr, rootdist=2); clusterQCselect <- clusterQC(clusterlist=Uniclusters, simMatrix=as.matrix(c), method="min", cutoff=0.7, output="df"); clusterQCselect # The function 'nodes2Cluster' cuts the tree at all height levels as defined in 'hr$height' and the second function 'clusterQC' selects clusters based on a given minimum cutoff in an all-against-all comparison within each cluster. The 'method' argument allows to calculate the cutoff value with one of the following functions: "min", "qu1", "median", "mean", "qu3" and "max". The end result can be returned as list (output="list") or as data frame (output="df"). The argument 'rootdist' defines the starting distance from the root of the tree to avoid testing of very large clusters.
      Tree coloring and zooming into branches
        source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function..
        dend_colored <- dendrapply(as.dendrogram(hr), dendroCol, keys=subcl, xPar="edgePar", bgr="red", fgr="blue", lwd=2, pch=20) # In this example the dendrogram for the above 'hr' object is colored with the imported 'dendroCol()' function based on the identifiers provided in its 'keys' argument. If 'xPar' is set to 'nodePar' then the labels are colored instead of the leaves.
        par(mfrow = c(1, 3)); plot(dend_colored, horiz=T); plot(dend_colored, horiz=T, type="tr"); plot(dend_colored, horiz=T, edgePar=list(lwd=2), xlim=c(3,0), ylim=c(1,3)) # Plots the colored tree in different formats. The last command shows how one can zoom into the tree with the 'xlim and ylim' arguments, which is possible since R 2.8.
        z <- as.dendrogram(hr); attr(z[[2]][[2]],"edgePar") <- list(col="blue", lwd=4, pch=NA); plot(z, horiz=T) # This example shows how one can manually color tree elements.
      Plot heatmaps with the heatmap() function
        y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
        source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/my.colorFct.R") # Imports color function to obtain heat maps in red and green. Use this color function in heatmap color argument like this: 'col=my.colorFct()'.
        hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete") # Clusters rows by Pearson correlation as distance method.
        hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete") # Clusters columns by Spearman correlation as distance method.
        heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row") # The previous two hclust objects are used to obtain a heatmap where the rows are clusterd by Pearson and the columns by Spearman correlation distances. The 'scale' argument allows to scale the input matrix by rows or columns. The default is by rows and "none" turns the scaling off.
        mycl <- cutree(hr, h=max(hr$height)/1.5); mycol <- sample(rainbow(256)); mycol <- mycol[as.vector(mycl)]; heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol) # The arguments 'ColSideColors' and 'RowSideColors' allow to annotate a heatmap with a color bar. In this example the row colors correspond to cluster bins that were obtained by the cutree() function.
        dend_colored <- dendrapply(as.dendrogram(hr), dendroCol, keys=c("g1", "g2"), xPar="edgePar", bgr="black", fgr="red", pch=20); heatmap(y, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol) # In addition, one can color the tree leaves or labels by providing a colored dendrogram object, here 'dend_colored' from above.
        heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", labRow = "", labCol="") # The arguments 'labRow' and 'labCol' allow to provide custom labels or to omit the printing of the labels as shown here.
        heatmap(y, Rowv=as.dendrogram(hr), Colv=NA, col=my.colorFct()) # If 'Rowv' or 'Colv' is set to 'NA', then the row and/or column clustering is turned off.
        ysort <- y[hr$labels[hr$order], hc$labels[hc$order]]; heatmap(ysort, Rowv=NA, Colv=NA, col=my.colorFct()) # This example shows how the rows and columns of a heatmap can be sorted by the hclust trees without printing them.
        x11(height=4, width=4); heatmap(matrix(rep(seq(min(as.vector(y)), max(as.vector(y)), length.out=10),2), 2, byrow=T, dimnames=list(1:2, round(seq(min(as.vector(y)), max(as.vector(y)), length.out=10),1))), col=my.colorFct(), Colv=NA, Rowv=NA, labRow="", main="Color Legend") # Prints color legend in a separate window.
      Plot heatmaps with the image() or heatmap.2() functions
        image(scale(t(ysort)), col=my.colorFct()) # A very similar plot as before can be obtained by the image() function which plots the values of a given matrix as a grid of colored rectangles.
        x11(height=12); par(mfrow=c(1,2)); plot(as.dendrogram(hr), horiz=T, yaxt="n", edgePar=list(col="grey", lwd=4)); image(scale(t(ysort)), col=my.colorFct(), xaxt="n", yaxt="n") # To create very large heatmaps that are still readable, one can plot the dendrogram and the heatmap of the image() function next to it.
        library("gplots") # Loads the gplots library that contains the heatmap.2() function.
        heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=redgreen(75), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol, trace="none", key=T, cellnote=round(t(scale(t(y))),1)) # The heatmap.2() function contains many additional plotting features that are not available in the basic heatmap() function. For instance, the 'key' argument allows to add a color key within the same plot. Numeric values can be added to each cell with the 'cellnote' argument. In addition, heatmap.2() scales to very large formats for viewing complex heatmaps.
        library(RColorBrewer); mypalette <- rev(brewer.pal(9,"Blues")[-1]) # Select an alternative color scheme that is more appropriate for representing continuous values (e.g. chip intensities).
        heatmap.2(y, Rowv=as.dendrogram(hr), dendrogram="row", Colv=F, col=mypalette, scale="row", trace="none", key=T, cellnote=round(t(scale(t(y))),1)) # Example to illustrate how to turn off the row or column reordering in heatmap.2. The settings "dendrogram="row", Colv=F" turn off the column reordering and no column tree is shown. To turn off the re-ordering of both dimensions, use the following settings: "dendrogram="none", Rowv=F, Colv=F".

      Zooming into heatmaps by sub-clustering selected tree nodes
        y <- matrix(rnorm(500), 100, 5, dimnames=list(paste("g", 1:100, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
        hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete"); hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete") # Generates row and column dendrograms.
        mycl <- cutree(hr, h=max(hr$height)/1.5); mycolhc <- rainbow(length(unique(mycl)), start=0.1, end=0.9); mycolhc <- mycolhc[as.vector(mycl)] # Cuts the tree and creates color vector for clusters.
        library(gplots); myheatcol <- redgreen(75) # Assign your favorite heatmap color scheme. Some useful examples: colorpanel(40, "darkblue", "yellow", "white"); heat.colors(75); cm.colors(75); rainbow(75); redgreen(75); library(RColorBrewer); rev(brewer.pal(9,"Blues")[-1]). Type demo.col(20) to see more color schemes.
        heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=myheatcol, scale="row", density.info="none", trace="none", RowSideColors=mycolhc) # Creates heatmap for entire data set where the obtained clusters are indicated in the color bar.
        x11(height=6, width=2); names(mycolhc) <- names(mycl); barplot(rep(10, max(mycl)), col=unique(mycolhc[hr$labels[hr$order]]), horiz=T, names=unique(mycl[hr$order])) # Prints color key for cluster assignments. The numbers next to the color boxes correspond to the cluster numbers in 'mycl'.
        clid <- c(1,2); ysub <- y[names(mycl[mycl%in%clid]),]; hrsub <- hclust(as.dist(1-cor(t(ysub), method="pearson")), method="complete") # Select sub-cluster number (here: clid=c(1,2)) and generate corresponding dendrogram.
        x11(); heatmap.2(ysub, Rowv=as.dendrogram(hrsub), Colv=as.dendrogram(hc), col=myheatcol, scale="row", density.info="none", trace="none", RowSideColors=mycolhc[mycl%in%clid]) # Create heatmap for chosen sub-cluster.
        data.frame(Labels=rev(hrsub$labels[hrsub$order])) # Print out row labels in same order as shown in the heatmap.

    7. Bootstrap Analysis in Hierarchical Clustering
    8. The pvclust package allows assessing the uncertainty in hierarchical cluster analysis by calculating for each cluster p-values via multiscale bootstrap resampling. The method provides two types of p-values. The approximately unbiased p-value (AU) is computed by multiscale bootstrap resampling and is less biased than the second one, the bootstrap probability (BP), which is computed by normal bootstrap resampling.

        library(pvclust) # Loads the required pvclust package.
        y <- matrix(rnorm(500), 50, 10, dimnames=list(paste("g", 1:50, sep=""), paste("t", 1:10, sep=""))) # Creates a sample data set.
        pv <- pvclust(scale(t(y)), method.dist="correlation", method.hclust="complete", nboot=10) # This step performs the hierarchical cluster analysis with multiscale bootstrap with 10 repetitions, complete linkage for cluster joining and a correlation-based dissimilarity matrix. Usually, one should set 'nboot' to at least 1000 bootstrap repetitions.
        plot(pv, hang=-1) # Plots a dendrogram where the red numbers represent the AU p-values and the green ones the BP values.
        pvrect(pv, alpha=0.95) # This command highlights with red rectangles all clusters in the dendrogram that have an AU value above 95%.
        clsig <- unlist(pvpick(pv, alpha=0.95, pv="au", type="geq", max.only=TRUE)$clusters) # The function 'pvpick' allows to pick significant clusters.
        source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function. The details on this function are provided in the hclust section of this manual.
        dend_colored <- dendrapply(as.dendrogram(pv$hclust), dendroCol, keys=clsig, xPar="edgePar", bgr="black", fgr="red", pch=20) # Creates dendrogram object where the significant clusters are labeled in red.
        plot(dend_colored, horiz=T) # Plots a dendrogram where the significant clusters are highlighted in red.

    9. QT Clustering
    10. QT (quality threshold) clustering is a partitioning method that forms clusters based on a maximum cluster diameter. It iteratively identifies the largest cluster below the threshold and removes its items from the data set until all items are assigned. The method was developed by Heyer et al. (1999) for the clustering of gene expression data.

        library(flexclust) # Loads flexclust library.
        x <- matrix(10*runif(1000), ncol=2) # Creates a sample data set.
        cl1 <- qtclust(x, radius=3) # Uses 3 as the maximum distance of the points to the cluster centers.
        cl2 <- qtclust(x, radius=1) # Uses 1 as the maximum distance of the points to the cluster centers.
        par(mfrow=c(2,1)); plot(x, col=predict(cl1), xlab="", ylab=""); plot(x, col=predict(cl2), xlab="", ylab="") # Plots the two cluster results.
        data.frame(name_index=1:nrow(x), cl=attributes(cl1)$cluster) # Provides the row-to-cluster assignment information.

    11. K-Means & PAM
    12. K-means, PAM (partitioning around medoids) and clara are related partitioning clustering algorithms that cluster data points into a predefined number of K clusters. They do this by associating each data point with its nearest centroid and then recomputing the cluster centroids. In the next step the data points are associated with the nearest adjusted centroid. This procedure continues until the cluster assignments are stable. K-means uses the average of all points in a cluster for centering, while PAM uses the most centrally located point. Commonly used R functions for K-means clustering are: kmeans() of the stats package, kcca() of the flexclust package and trimkmeans() of the trimcluster package (a short kcca sketch is included below). PAM clustering is available in the pam() function from the cluster package. The clara() function of the same package is a PAM wrapper for clustering very large data sets.

      K-means
        km <- kmeans(t(scale(t(y))), 3); km$cluster # K-means clustering is a partitioning method where the number of clusters is pre-defined. The basic R implementation requires the data matrix as input and uses Euclidean distance. In contrast to pam(), this implementation does not accept a distance matrix. In the given example the row-wise centered and scaled data matrix is provided.
        mycol <- sample(rainbow(256)); mycol <- mycol[as.vector(km$cluster)]; heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol) # This example shows how the obtained K-means clusters can be compared with the hierarchical clustering results by highlighting them in the heatmap color bar.
      PAM (partitioning around medoids)
        library(cluster) # Loads the required cluster library.
        y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
        pamy <- pam(y, 4); pamy$clustering # Clusters data into 4 clusters using default Euclidean as distance method.
        mydist <- as.dist(1-cor(t(y), method="pearson")) # Generates distance matrix using Pearson correlation as distance method.
        pamy <- pam(mydist, 4); pamy$clustering # Same as above, but uses the provided distance matrix.
        pam(mydist, 4, cluster.only=TRUE) # Same as before, but with faster computation.
        plot(pamy) # Generates cluster plot.
      Clara (clustering large applications: PAM method for large data sets)
        clarax <- clara(t(scale(t(y))), 4); clarax$clustering # Clusters the above data into 4 clusters using Euclidean distance (the default). If the data set consists of gene expression values, then row-wise scaling of y is recommended in combination with Euclidean distances.
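      The kcca() function of the flexclust package, mentioned above, provides a unified K-centroids interface; the following lines are only a minimal sketch, and the chosen settings (k=3, "kmeans" family) are illustrative.
        library(flexclust); fcl <- kcca(t(scale(t(y))), k=3, family=kccaFamily("kmeans")) # Clusters the row-wise scaled sample matrix into 3 clusters; other distance families such as "kmedians" or "angle" can be selected with kccaFamily().
        clusters(fcl); table(clusters(fcl)) # Returns the cluster assignment of each row and tabulates the cluster sizes.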

    13. Fuzzy Clustering
    14. In contrast to strict/hard clustering approaches, fuzzy clustering allows multiple cluster memberships for the clustered items. This is commonly achieved by partitioning the membership assignments among clusters with positive weights that sum to one for each item. Several R libraries contain implementations of fuzzy clustering algorithms. The e1071 library contains the cmeans (fuzzy C-means) and cshell (fuzzy C-shell) clustering functions, and the cluster library provides the fanny function, which is a fuzzy implementation of the k-medoids method described above.

      Fuzzy clustering with the cluster library
        library(cluster) # Loads the cluster library.
        y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
        fannyy <- fanny(y, k=4, metric = "euclidean", memb.exp = 1.5); round(fannyy$membership, 2); fannyy$clustering # Computes membership coefficients (weights) for 4 clusters and stores them in the 'membership' component of the resulting fanny object. The corresponding hard clustering result is provided in the 'clustering' slot, which is formed by assigning each point to the cluster with the highest coefficient. If the distance method is set to metric="SqEuclidean", then the function performs fuzzy C-means clustering. The membership exponent can be controlled with the argument 'memb.exp' (default is 2). Values close to 1 give 'crispier' clustering, whereas larger values increase the fuzziness of the results. If a clustering results in complete fuzziness, then the function returns a warning.
        mydist <- as.dist(1-cor(t(y), method="pearson")) # Generates distance matrix using Pearson correlation as distance method.
        fannyy <- fanny(mydist, k=4, memb.exp = 1.5); round(fannyy$membership, 2); fannyy$clustering # Example using a correlation-based distance matrix.
        fannyyMA <- round(fannyy$membership, 2) > 0.10; apply(fannyyMA, 1, which) # Returns multiple cluster memberships for coefficients above a certain value (here >0.1).
      Fuzzy clustering with the e1071 library
        library(e1071) # Loads the e1071 library.
        y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
        cmeansy <- cmeans(y, 4, method="cmeans", m=1.5); round(cmeansy$membership, 2); cmeansy$cluster # Uses the c-means clustering method to compute membership coefficients (weights) for 4 clusters and stores them in the 'membership' component of the resulting object. The result of this example should be identical to the above clustering with the fanny function when its metric argument is set to "SqEuclidean". The cmeans function does not accept a distance matrix as input, but the fanny function does.
        cmeansyMA <- round(cmeansy$membership, 2) > 0.10; apply(cmeansyMA, 1, which) # Returns multiple cluster memberships for coefficients above a certain value (here >0.1).
        cshelly <- cshell(y, 4, method="cshell", m=1.5); round(cshelly$membership, 2); cshelly$cluster # Uses the c-shell clustering method to compute membership coefficients (weights) for 4 clusters and stores them in the 'membership' component of the resulting object.

    15. Self-Organizing Map (SOM)
    16. The self-organizing map (SOM), also known as a Kohonen network, is a popular artificial neural network algorithm in the unsupervised learning area. The approach iteratively assigns all items in a data matrix to a specified number of representatives and then updates each representative by the mean of its assigned data points. Widely used R packages for SOM clustering and visualization are class (part of R), SOM and kohonen (a short kohonen sketch is included below). The SOM package, which is introduced here, provides utilities similar to the GeneCluster software from the Broad Institute.

        library(som) # Loads the required SOM library.
        y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
        y <- t(scale(t(y))) # Normalizes the data so that each row has a mean close to zero and a standard deviation of one.
        y.som <- som(y, xdim = 5, ydim = 6, topol = "hexa", neigh = "gaussian") # Performs SOM clustering.
        plot(y.som) # Plots SOM clustering result.
        y.som$visual # Provides the assignment of row items to the SOM clusters. Remember that the coordinate counting starts here at zero!
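      The kohonen package mentioned above offers an alternative SOM implementation; the following lines are only a minimal sketch (grid size and topology are illustrative, and argument names may differ between package versions).
        library(kohonen); y.kso <- som(y, grid=somgrid(2, 3, "hexagonal")) # Performs SOM clustering of the scaled matrix 'y' from above on a 2x3 hexagonal grid; note that this som() masks the som() of the SOM package loaded above.
        plot(y.kso, type="mapping"); y.kso$unit.classif # Plots the mapping of the rows onto the grid and returns the winning unit for each row.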

    17. Principal Component Analysis (PCA)
    18. Principal component analysis (PCA) is a data reduction technique that simplifies multidimensional data sets to 2 or 3 dimensions for plotting purposes and visual variance analysis. The following commands introduce the basic usage of the prcomp() function. A closely related function is princomp() (a short sketch is included below). The BioConductor library pcaMethods provides many additional PCA functions. For viewing PCA plots in 3D, one can use the scatterplot3d library or the made4 library.

        z1 <- rnorm(10000, mean=1, sd=1); z2 <- rnorm(10000, mean=3, sd=3); z3 <- rnorm(10000, mean=5, sd=5); z4 <- rnorm(10000, mean=7, sd=7); z5 <- rnorm(10000, mean=9, sd=9); mydata <- matrix(c(z1, z2, z3, z4, z5), 2500, 20, byrow=T, dimnames=list(paste("R", 1:2500, sep=""), paste("C", 1:20, sep=""))) # Generates sample matrix of five discrete clusters that have very different mean and standard deviation values.
        pca <- prcomp(mydata, scale=T) # Performs principal component analysis after scaling the data. It returns a list with class "prcomp" that contains five components: (1) the standard deviations (sdev) of the principal components, (2) the matrix of eigenvectors (rotation), (3) the principal component data (x), (4) the centering (center) and (5) scaling (scale) used.
        summary(pca) # Prints variance summary for all principal components.
        x11(height=6, width=12, pointsize=12); par(mfrow=c(1,2)) # Set plotting parameters.
        mycolors <- c("red", "green", "blue", "magenta", "black") # Define plotting colors.
        plot(pca$x, pch=20, col=mycolors[sort(rep(1:5, 500))]) # Plots scatter plot for the first two principal components that are stored in pca$x[,1:2].
        plot(pca$x, type="n"); text(pca$x, rownames(pca$x), cex=0.8, col=mycolors[sort(rep(1:5, 500))]) # Same as above, but prints labels.
        library(geneplotter); smoothScatter(pca$x) # Same as above, but generates a smooth scatter plot that shows the density of the data points.
        pairs(pca$x[,1:4], pch=20, col=mycolors[sort(rep(1:5, 500))]) # Plots scatter plots for all combinations between the first four principal components.
        biplot(pca) # Plots a scatter plot for the first two principal components plus the corresponding eigenvectors that are stored in pca$rotation.
        library(scatterplot3d) # Loads library scatterplot3d.
        scatterplot3d(pca$x[,1:3], pch=20, color=mycolors[sort(rep(1:5, 500))]) # Same as above, but plots the first three principal components in 3D scatter plot.
        library(rgl); rgl.open(); offset <- 50; par3d(windowRect=c(offset, offset, 640+offset, 640+offset)); rm(offset); rgl.clear(); rgl.viewpoint(theta=45, phi=30, fov=60, zoom=1); spheres3d(pca$x[,1], pca$x[,2], pca$x[,3], radius=0.3, color=mycolors, alpha=1, shininess=20); aspect3d(1, 1, 1); axes3d(col='black'); title3d("", "", "PC1", "PC2", "PC3", col='black'); bg3d("white") # Creates an interactive 3D scatter plot with Open GL. The rgl library needs to be installed for this. To save a snapshot of the graph, one can use the command rgl.snapshot("test.png").
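      A minimal sketch of the closely related princomp() function mentioned above; with 'cor=TRUE' the analysis closely corresponds to the scaled prcomp() call.
        pca2 <- princomp(mydata, cor=TRUE) # Performs PCA via eigen decomposition of the correlation matrix; prcomp() uses singular value decomposition instead, which is numerically preferable for most applications.
        summary(pca2); head(pca2$scores[,1:3]) # Prints the variance summary and the scores of the first three components for the first few rows.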

    19. Multidimensional Scaling (MDS)
    20. Multidimensional scaling (MDS) algorithms start with a matrix of item-item distances and then assign coordinates for each item in a low-dimensional space to represent the distances graphically. The cmdscale() function is the base function for MDS in R. Additional MDS functions are sammon() and isoMDS() of the MASS library (a short isoMDS sketch is included below).

        loc <- cmdscale(eurodist) # Performs MDS analysis on the geographic distances between European cities.
        plot(loc[,1], -loc[,2], type="n", xlab="", ylab="", main="cmdscale(eurodist)"); text(loc[,1], -loc[,2], rownames(loc), cex=0.8) # Plots the MDS results in 2D plot. The minus is required in this example to flip the plotting orientation.
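      The isoMDS() function of the MASS library mentioned above performs Kruskal's non-metric MDS; the following lines are a minimal sketch using the same distance data.
        library(MASS); loc2 <- isoMDS(eurodist, k=2) # Performs non-metric MDS; the coordinates are returned in the 'points' component together with the final 'stress' value.
        plot(loc2$points[,1], -loc2$points[,2], type="n", xlab="", ylab="", main="isoMDS(eurodist)"); text(loc2$points[,1], -loc2$points[,2], labels(eurodist), cex=0.8) # Plots the result analogously to the cmdscale plot above.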

    21. Bicluster Analysis
    22. Biclustering (also co-clustering or two-mode clustering) is an unsupervised clustering technique that allows simultaneous clustering of the rows and columns of a matrix. The goal of biclustering is to find subgroups of rows and columns that are as similar as possible to each other and as different as possible from the remaining data points. The biclust package, which is introduced here, contains a collection of bicluster algorithms, data preprocessing and visualization methods (Detailed User Manual). An algorithm that allows the integration of different data types is implemented in the cMonkey R program (BMC Bioinformatics 2006, 7, 280). A comparison of several bicluster algorithms for clustering gene expression data has been published by Prelic et al. (2006). Since most biclustering algorithms expect the input data matrix to be properly preprocessed, it is especially important to carefully read the manual pages for the different functions.

      Plaid model biclustering
        library(biclust) # Loads the biclust library.
        data(BicatYeast) # Loads sample data from Barkow et al. (Bioinformatics, 22, 1282-1283).
        res1 <- biclust(BicatYeast, method=BCPlaid()) # Performs plaid model biclustering as described in Turner et al., 2003. This algorithm models the data matrix as a sum of layers; the model is fitted to the data by minimizing the error.
        summary(res1); show(res1) # The summary() function returns the sizes of all clusters found, and the show() function provides an overview of the applied method.
        names(attributes(res1)) # Converts res1 object of class Biclust into a list and returns the names of its components (slots). The results for the row clustering are stored in the "RowxNumber" slot and the results for the column clustering are stored in the "NumberxCol" slot.
        BicatYeast[res1@RowxNumber[,2], res1@NumberxCol[2,]] # Extracts values for bicluster number 2 from original data set.
        myrows <- attributes(res1)$RowxNumber; mycols <- attributes(res1)$NumberxCol; myclusters <- lapply(1:length(myrows[1,]), function(x) list(rownames(BicatYeast[myrows[,x], ]), colnames(BicatYeast[, mycols[x,]]))); names(myclusters) <- paste("CL", 1:length(myclusters), sep="_"); myclusters[2] # Organizes all identified row and column clusters in a list object, here 'myclusters'.
        writeBiclusterResults("results.txt", res1, "My Result", dimnames(BicatYeast)[1][[1]], dimnames(BicatYeast)[2][[1]]) # Writes bicluster results to a file.
        bubbleplot(BicatYeast, res1, showLabels=T) # Draws a bubble plot where each bicluster is represented as a circle (bubble). Up to three biclustering result sets can be compared in one plot (here one). The sets are then labeled with different colors. The color brightness represents the bicluster homogeneity (darker, less homogeneous). The bubble sizes represent the size of the biclusters, as (number of rows)x(number of columns). The bubble location is a 2D projection of row and column profiles.
        parallelCoordinates(x=BicatYeast, bicResult=res1, number=4) # Plots the profiles for the 4th cluster. The cluster number is specified under the number argument.
        drawHeatmap(BicatYeast, res1, number=4) # Draws a heatmap for the BicatYeast data matrix with the rows and columns of the selected 4th bicluster at the top left of the plot.
      XMotifs biclustering
        y <- discretize(BicatYeast, nof=10) # Converts input matrix with continuous data into matrix with discrete values of 10 levels (nof).
        res2 <- biclust(y, method=BCXmotifs(), alpha=0.05, number=50) # Performs XMotifs biclustering based on the framework by Murali and Kasif (2003). It searches for rows with constant values over a set of columns, which results in a submatrix where all rows in a discretized matrix have the same value profile. Usually the algorithm needs a discrete matrix to perform well.
        bubbleplot(BicatYeast, res1, res2, showLabels=F) # Compares the two bicluster results in form of a bubble plot.
        jaccardind(res1, res2) # Calculates the similarity for two clustering results using an adaptation of the Jaccard index (values from 0-1). An overview of related methods is available on this cluster validity algorithm page.
      Bimax biclustering
        y <- BicatYeast; y[y >= 1 | y <= -1] <- 1; y[y != 1] <- 0 # Transforms input data matrix into binary values. Here the log2 ratios with at least a two-fold change are set to 1 and the rest is set to 0.
        res3 <- biclust(y, method=BCBimax(), minr=15, minc=10, number=100) # Performs Bimax biclustering based on an algorithm described by Prelic et al (2006). This method searches in a binary matrix for sub-matrices containing only ones.
      Spectral biclustering
        res4 <- biclust(BicatYeast, method=BCSpectral(), numberOfEigenvalues=2) # Performs spectral biclustering as described in Kluger et al., 2003. Spectral biclustering assumes that normalized microarray data matrices have a checkerboard structure that can be discovered by applying singular value decomposition (SVD).
      CC biclustering
        res5 <- biclust(y, method=BCCC(), delta=1.5, alpha=1, number=50) # Performs CC biclustering based on the framework by Cheng and Church (2000). It searches for submatrices with a score lower than a specific threshold in a standardized data matrix.

    23. Network Analysis
      • WGCNA: an R package for weighted correlation network analysis
      • BioNet: routines for the functional analysis of biological networks

    24. Support Vector Machines (SVM)
    25. Support vector machines (SVMs) are supervised machine learning classification methods. SVM implementations are available in the R packages kernlab and e1071. The e1071 package contains an interface to the C++ libsvm implementation from Chih-Chung Chang and Chih-Jen Lin. In addition to SVMs, the e1071 package includes a comprehensive set of machine learning utilities, such as functions for latent class analysis, bootstrapping, short time Fourier transform, fuzzy clustering, shortest path computation, bagged clustering, naive Bayes classifier, etc. The kernlab package contains additional functions for spectral clustering, kernel PCA, etc.
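
      The following lines are a minimal sketch of a supervised classification run with the svm() function of the e1071 package, shown on the built-in iris data set; the kernel choice and the random training/test split are only illustrative.
        library(e1071) # Loads the e1071 library that provides the svm() interface to libsvm.
        df <- iris; train <- sample(1:nrow(df), 100) # Uses the built-in iris data set and randomly selects 100 rows as training set.
        model <- svm(Species ~ ., data=df[train,], kernel="radial") # Trains an SVM classifier with a radial basis kernel on the training rows.
        pred <- predict(model, df[-train, -5]) # Predicts the species of the remaining 50 rows from their four numeric measurements.
        table(pred, df[-train, 5]) # Returns a confusion matrix comparing predicted and true class labels.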

    26. Similarity Measures for Clustering Results
    27. To measure the similarities among clustering results, one can compare the numbers of identical and unique item pairs appearing in their clusters. This can be achieved by counting the number of item pairs found in both clustering sets (a), as well as the pairs appearing only in the first (b) or only in the second (c) set. With this information it is possible to calculate a similarity coefficient, such as the Jaccard Index. The latter is defined as the size of the intersection divided by the size of the union of two sample sets: a/(a+b+c). In the case of partitioning results, the Jaccard Index measures how frequently pairs of items are joined together in two clustering data sets and how often pairs are observed only in one set. Related coefficients are the Rand Index and the Adjusted Rand Index. These indices also consider the number of pairs (d) that are not joined together in any of the clusters in both sets. A variety of alternative similarity coefficients can be considered for comparing clustering results. An overview of available methods is given on this cluster validity page. In addition, the Consense library contains a variety of functions for comparing cluster sets, and the mclust02 library contains an implementation of the variation of information criterion described by M. Meila (J Mult Anal 98, 873-895).
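
      To make the pair counting explicit, the following lines are a small base R sketch (for illustration only, using two toy clustering vectors) that computes a, b, c and the resulting Jaccard Index as defined above; the cindex() function imported below performs this more efficiently.
        cl1 <- c(1,1,2,2,3); cl2 <- c(1,1,1,2,2) # Two toy clustering vectors assigning the same five items to clusters.
        m1 <- outer(cl1, cl1, "=="); m2 <- outer(cl2, cl2, "=="); ut <- upper.tri(m1) # Logical matrices indicating for every item pair whether it is joined in the same cluster; 'ut' selects each pair once.
        a <- sum(m1[ut] & m2[ut]); b <- sum(m1[ut] & !m2[ut]); c <- sum(!m1[ut] & m2[ut]) # Pairs joined in both sets (a), only in the first set (b) and only in the second set (c).
        a/(a+b+c) # The Jaccard Index for the two toy clusterings.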

        source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/clusterIndex.R") # Imports the cindex() function for computing the Jaccard Index or the Rand Index.
        library(cluster); y <- matrix(rnorm(5000), 1000, 5, dimnames=list(paste("g", 1:1000, sep=""), paste("t", 1:5, sep=""))); clarax <- clara(y, 49); clV1 <- clarax$clustering; clarax <- clara(y, 50); clV2 <- clarax$clustering # Creates two sample cluster data sets stored in the named vectors clV1 and clV2.
        cindex(clV1=clV1, clV2=clV2, self=FALSE, minSZ=1, method="jaccard") # Computes the Jaccard Index for clV1 and clV2, where values close to 0 indicate low similarities and values close to 1 high similarities among the evaluated cluster sets. The Rand and Adjusted Rand Indices can be returned by assigning to the method argument the values rand or arand, respectively. By setting the argument self=TRUE, one can allow self comparisons, where pairs with different and identical labels are counted. This way clusters with single items will be considered as well. If it is set to FALSE, only pairs with different labels are considered. The implemented cindex() function is relatively fast and memory efficient. Two clustering results with 1000 items each partitioned into 50 clusters will be processed in about 0.05 seconds. To focus the analysis on clusters above a minimum size limit, one can use the minSZ argument. For instance, the setting minSZ=3 will only consider clusters with 3 or more items.
        clVlist <- lapply(3:12, function(x) clara(y[1:30, ], k=x)$clustering); names(clVlist) <- paste("k", "=", 3:12); d <- sapply(names(clVlist), function(x) sapply(names(clVlist), function(y) cindex(clV1=clVlist[[y]], clV2=clVlist[[x]], method="jaccard")[[3]])); hv <- hclust(as.dist(1-d)); plot(as.dendrogram(hv), edgePar=list(col=3, lwd=4), horiz=T, main="Similarities of 10 Clara Clustering Results for k: 3-12") # Example for clustering entire cluster result sets. First, 10 sample cluster results are created with Clara using k-values from 3 to 12. The results are stored as named clustering vectors in a list object. Then a nested sapply loop is used to generate a similarity matrix of Jaccard Indices for the clustering results. After converting the result into a distance matrix, hierarchical clustering is performed with hclust.

    28. Clustering Exercises
    29. Slide Show (R Code)
      Install required libraries
      The following exercises demonstrate several useful clustering and data mining utilities in R.
      Import a sample data set
      1. Download from GEO the Arabidopsis IAA treatment series "GSE1110" in TXT format. The direct link to the download is:
      2. Import the data set into R
        • temp <- readLines("GSE1110_series_matrix.txt"); cat(temp[-grep("^!|^\"$", temp)], file="GSE1110clean.txt", sep="\n"); mydata <- read.delim("GSE1110clean.txt", header=T, sep="\t") # These import commands include a cleanup step to get rid of annotation lines and corrupted return signs.
          rownames(mydata) <- mydata[,1]; mydata <- as.matrix(mydata[,-1]) # Assigns row indices and converts the data into a matrix object.
      Filtering
          mydata <- mydata[apply(mydata>100, 1, sum)/length(mydata[1,])>0.5 & apply(log2(mydata), 1, IQR)>1.5,] # Retrieves all rows with high intensities (50% > 100) and high variability (IQR>1.5).
      Hierarchical clustering
          source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/my.colorFct.R") # Import an alternative color scheme for the heatmap function.
          mydatascale <- t(scale(t(mydata))) # Centers and scales data.
          hr <- hclust(as.dist(1-cor(t(mydatascale), method="pearson")), method="complete") # Cluster rows by Pearson correlation.
          hc <- hclust(as.dist(1-cor(mydatascale, method="spearman")), method="complete") # Clusters columns by Spearman correlation.
          heatmap(mydata, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row") # Plot the data table as heatmap and the cluster results as dendrograms.
          mycl <- cutree(hr, h=max(hr$height)/1.5); mycolhc <- sample(rainbow(256)); mycolhc <- mycolhc[as.vector(mycl)]; heatmap(mydata, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolhc) # Cut the tree at specific height and color the corresponding clusters in the heatmap color bar.
      Obtain significant clusters by pvclust bootstrap analysis
          library(pvclust) # Loads the required pvclust package.
          pv <- pvclust(scale(t(mydata)), method.dist="correlation", method.hclust="complete", nboot=10) # Performs the hierarchical cluster analysis. Due to time restrictions, only 10 bootstrap repetitions are used here. Usually, one should use at least 1000 repetitions.
          plot(pv, hang=-1); pvrect(pv, alpha=0.95) # Plots result as a dendrogram where the significant clusters are highlighted with red rectangles.
          clsig <- unlist(pvpick(pv, alpha=0.95, pv="au", type="geq", max.only=TRUE)$clusters) # Retrieve members of significant clusters.
          source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function.
          dend_colored <- dendrapply(as.dendrogram(pv$hclust), dendroCol, keys=clsig, xPar="edgePar", bgr="black", fgr="red", pch=20) # Create dendrogram object where the significant clusters are labeled in red.
          heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolhc) # Plot the heatmap from above, but with the significant clusters in red and the cluster bins from the tree cutting step in the color bar.
          x11(height=12); heatmap.2(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", trace="none", RowSideColors=mycolhc) # Plot heatmap with heatmap.2() function which scales better for many entries.
          mydatasort <- mydata[pv$hclust$labels[pv$hclust$order], hc$labels[hc$order]] # Sorts the rows in the data table by 'dend_colored' and its columns by 'hc'.
          x11(height=16, width=12); par(mfrow=c(1,2)); plot(dend_colored, horiz=T, yaxt="n"); image(scale(t(mydatasort)), col=my.colorFct(), xaxt="n",yaxt="n") # Plots the heatmap with the bootstrap tree in a larger format, using the image function instead of heatmap.
          pdf("pvclust.pdf", height=21, width=6); plot(dend_colored, horiz=T, yaxt="n"); dev.off(); pdf("heatmap.pdf", height=20, width=6); image(scale(t(mydatasort)), col=my.colorFct(), xaxt="n",yaxt="n"); dev.off() # Saves the graphical results to two PDF files: 'pvclust.pdf' and 'heatmap.pdf'.
      Compare PAM (K-means) with hierarchical clustering
          library(cluster) # Loads required library.
          mydist <- t(scale(t(mydata))) # Center and scale data.
          mydist <- as.dist(1-cor(t(mydist), method="pearson")) # Generates distance matrix using Pearson correlation as distance method.
          pamy <- pam(mydist, max(mycl)) # Clusters distance matrix into as many clusters as obtained by tree cutting step (6).
          mycolkm <- sample(rainbow(256)); mycolkm <- mycolkm[as.vector(pamy$clustering)]; heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolkm) # Compare PAM clustering results with hierarchical clustering by labeling it in heatmap color bar.
          pdf("pam.pdf", height=20, width=20); heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolkm); dev.off() # Save graphical results to PDF file: 'pvclust.pdf'.
      Compare SOM with hierarchical clustering
          library(som) # Loads required library.
          y <- t(scale(t(mydata))) # Center and scale data.
          y.som <- som(y, xdim = 2, ydim = 3, topol = "hexa", neigh = "gaussian") # Performs SOM clustering.
          plot(y.som) # Plots results.
          pdf("som.pdf"); plot(y.som); dev.off() # Save plot to PDF: 'som.pdf'.
          somclid <- as.numeric(paste(y.som$visual[,1], y.som$visual[,2], sep=""))+1 # Returns SOM cluster assignment in order of input data.
          mycolsom <- sample(rainbow(256)); mycolsom <- mycolsom[somclid]; heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolsom) # Compares the SOM clustering results with hierarchical clustering by labeling them in the heatmap color bar.
          pdf("somhc.pdf", height=20, width=20); heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolsom); dev.off() # Save graphical results to PDF file: 'somhc.pdf'.
      Compare PCA with SOM
          pca <- prcomp(mydata, scale=T) # Performs principal component analysis after scaling the data.
          summary(pca) # Prints variance summary for all principal components.
          library(scatterplot3d) # Loads 3D library.
          scatterplot3d(pca$x[,1:3], pch=20, color=mycolsom) # Plots PCA result in 3D. The SOM clusters are highlighted in their color.
          pdf("pcasom.pdf"); scatterplot3d(pca$x[,1:3], pch=20, color=mycolsom); dev.off() # Saves PCA plot in PDF format: 'pcasom.pdf'.
      Compare MDS with HC, SOM and K-means
          loc <- cmdscale(mydist, k = 3) # Performs MDS analysis and returns results for three dimensions.
          x11(height=8, width=8, pointsize=12); par(mfrow=c(2,2)) # Sets plotting parameters.
          plot(loc[,1:2], pch=20, col=mycolsom, main="MDS vs SOM 2D") # Plots MDS-SOM comparison in 2D. The SOM clusters are highlighted in their color.
          scatterplot3d(loc, pch=20, color=mycolsom, main="MDS vs SOM 3D") # Plots MDS-SOM comparison in 3D.
          scatterplot3d(loc, pch=20, color=mycolhc, main="MDS vs HC 3D") # Plots MDS-HC comparison.
          scatterplot3d(loc, pch=20, color=mycolkm, main="MDS vs KM 3D") # Plots MDS-KM comparison.
      Fuzzy clustering
          library(cluster) # Loads cluster library.
          fannyy <- fanny(mydist, k= max(mycl), memb.exp = 1.5); round(fannyy$membership, 2); fannyy$clustering # Performs fuzzy clustering with as many clusters as were obtained by the tree cutting step in HC. The hard clustering results are provided in the 'clustering' slot.
          fannyyMA <- round(fannyy$membership, 2) > 0.3; apply(fannyyMA, 1, which) # Returns multiple cluster memberships for coefficients above a certain value (here >0.3).
      Biclustering
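          The lines below are a minimal sketch (for illustration only) that applies the plaid model from the biclust section above to the filtered data matrix; the default parameters may need adjustment for this data set.
          library(biclust) # Loads the biclust library.
          resex <- biclust(mydata, method=BCPlaid()) # Performs plaid model biclustering on the filtered expression matrix.
          summary(resex) # Returns the number and sizes of the identified biclusters.
          if(resex@Number > 0) drawHeatmap(mydata, resex, number=1) # Draws a heatmap for the first bicluster, if any were found.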

  2. Administration
    1. Installation of R
      1. Download and install R for your operating system from CRAN.
        R interfaces: RStudio (additional options)  
    2. Installation of BioConductor Packages
      1. The latest instructions for installing BioConductor packages are available on the BioC Installation page. Only the essential steps are given here. To install BioConductor packages, execute from the R console the following commands:
          source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script.
          biocLite() # Installs the biocLite.R default set of BioConductor packages.
          biocLite(c("pkg1", "pkg2")) # Command to install additional packages from BioC.
          update.packages(repos=biocinstallRepos(), ask=FALSE) # Bioconductor packages, especially those in the development branch, are updated fairly regularly. To update all of your installed packages, run this command after sourcing "biocLite.R".
    3. Installation of CRAN Packages (Task View)
      1. To install CRAN packages, execute from the R console the following command:
          install.packages(c("pkg1", "pkg2")) # The selected package name(s) need to be provided within the quotation marks.

        Alternatively, a package can be downloaded and then installed in compressed form like this:
          install.packages("pkg.zip", repos=NULL)
          row.names(installed.packages()) # Lists all packages installed on a system.
          $ R CMD INSTALL pkg_0.1.tar.gz # On UNIX-based systems one can also install packages with this shell command (executed from the shell, not from R). By default it installs packages into a system-wide directory. To install packages into a user account directory (not system-wide), one can specify a custom directory with the argument '-l my_lib_dir'. A locally installed library can then be loaded in R like this: 'library(my_lib, lib.loc="/home/user/my_lib_dir/")'. To manage libraries by default in a custom directory, one can create a '.Renviron' file containing the line: R_LIBS=/your_home_dir/Rlib