- HT Sequence Analysis with R and Bioconductor
- This section of the manual is available on the HT-Seq site.
- Programming in R
- This section of the manual is available on the Programming in R site.
- BioConductor is an open source and open development software project for the analysis of genome data (e.g. sequence, microarray, annotation and many other data types). This section of the manual provides a brief introduction into the usage and utilities of a subset of packages from the BioConductor project. The included packages are a 'personal selection' of the author of this manual that does not reflect the full utility specturm of the BioConductor project. The introduced packages were chosen, because the author uses them often for his own teaching and research. To obtain a broad overview of available BioConductor packages, it is strongly recommended to consult its official project site. Due to the rapid development of many packages, it is also important to be aware that this manual will often not be fully up-to-date. Because of this and many other reasons, it is absolutley critical to use the original documentation of each package (PDF manual or vignette) as primary source of documentation. Users are welcome to send suggestions for improving this manual directly to its author.
- Finding Help
- The instructions for installing BioConductor packages are available in the administrative section of this manual. Documentation for BioConductor packages can be found in the vignette of each package. A listing of the available packages is available on the BiocViews page. Annotation libraries can be found on the meta data page. The BioC Mail GMANE Archive can be searched find to answers to questions that have been discussed on the mailing list. Another valuable information resource is the BioConductor Book. The basic R help functions provide additional information about packages and their functions:
library(affy) # Loads a particular package (here affy package).
library(help=affy) # Lists all functions/objects of a package (here affy package).
library() # Lists all libraries/packages that are available on a system.
openVignette() # Provides documentation on packages.
sessionInfo() # Prints version information about R and all loaded packages. The generated output should be provided when sending questions or bug reports to the R and BioC mailing lists.
?my_topic # Opens documentation on a function.
help.search("topic") # Searches help system for documentation.
search() # Shows loaded packages and attached data frame in current search path.
system.file(package="affy") # Shows location of an installed package.
library("tools"); vigSrc <- list.files(pattern="Rnw$", system.file("doc",package="GOstats"), full.names=TRUE); vigSrc; for (v in vigSrc) Stangle(v) # Extracts R code from a vignette and saves it to file(s) in your current working directory.
- Affy Packages
- BioConductor provides extensive resources for the analysis of Affymetrix data. The most important ones are introduced here.
- The Affy package provides basic methods for analyzing affymetrix oligonucleotide arrays. The following steps outline the usage of the Affy package and associated packages.
- Obtaining scaled expression values with 5 different methods (MAS5, RMA, GCRMA, Plier & dChip). A comparison of three of these methods including references is available in this slide show.
Move your CEL files into your working directory and make sure that the corresponding CDF library for your chips is available on your system.
library(affy) # Loads the affy package.
mydata <- ReadAffy() # Reads all *.CEL (*.cel) files in your current working directory and stores them into the AffyBatch object 'mydata'.
mydata <- ReadAffy(widget=TRUE) # Opens file browser to select specific CEL files.
eset <- rma(mydata) # Creates normalized and background corrected expression values using the RMA method. The generated data are stored as ExpressionSet class in the 'eset' object. For large data sets use the more memory efficient justRMA() function.
eset <- mas5(mydata) # Uses expresso (MAS 5.0 method) module instead of RMA, which is much slower than RMA!
eset_gcrma <- gcrma(mydata) # Use this command to aquire gcrma data. The 'library(gcrma)' needs to be loaded first.
eset_plier <- justPlier(mydata) # Use this command to aquire plier data. The 'library(plier)' needs to be loaded first.
eset_PMA <- mas5calls(mydata) # Generates MAS 5.0 P/M/A calls. The command creates ExpressionSet with P/M/A calls in the 'exprs' slot and the wilcoxon p-values in the 'se.exprs' slot. To access them see below.
eset <- expresso(mydata, normalize.method="invariantset", bg.correct=FALSE, pmcorrect.method="pmonly", summary.method="liwong") # Generates expression calls similar to dChip (MBEI) method from Li and Wong.
library(affycoretools); affystart(plot=T, express="mas5") # Handy function to normalize all cel files in current working directory, perform qc plots and export normalized data to file. Works for mas5, rma and gcrma.
- Exporting data from 'ExpressionSet' objects
write.exprs(eset, file="mydata.txt") # Writes expression values to text file in working directory.Obtaining single probe-level data from 'affybatch' objects (see also documentation '?ReadAffy')
x <- data.frame(exprs(eset), exprs(eset_PMA), assayDataElement(eset_PMA, "se.exprs")); x <- x[,sort(names(x))]; write.table(x, file="mydata_PMA.xls", quote=F, col.names = NA, sep="\t") # Writes expression values, PMA values and wilcoxon p-values to text file in working directory. Remember: the old command for accessing the wilcoxon p-values in BioC versions <2.0 was 'se.exprs(eset_PMA))'.
mypm <- pm(mydata) # retrieves PM intensity values for single probesWorking with 'ExpressionSet' objects (see Biobase Manual)
mymm <- mm(mydata) # retrieves MM intensity values for single probes
myaffyids <- probeNames(mydata) # retrieves Affy IDs
result <- data.frame(myaffyids, mypm, mymm) # combines the above information in data frame
eset; pData(eset) # Provides summary information of ExpressionSet object 'eset' and lists the analyzed file names.Retrieving annotation data for Affy IDs (see Annotation Package Manual)
exprs(eset)[1:2,1:4]; exprs(eset)[c("244901_at","244902_at"),1:4] # Retrieves specific rows and fields of ExpressionSet object. To learn more about this format class, consult ExpressionSet manual with command '?ExpressionSet'.
test <- as.data.frame(exprs(eset)); eset2 <-new("ExpressionSet", exprs = as.matrix(test), annotation="ath1121501"); eset2 # Example for creating an ExpressionSet object from a data frame. To create the object from an external file, use the read.delim() function first and then convert it accordingly.
data.frame(eset) # Prints content of 'eset' as data frame to STDOUT.
exprs(eset_PMA)[1:2,1:2]; assayDataElement(eset_PMA, "se.exprs")[1:2,1:2] # Prints from ExpressionSet of 'mas5calls(mydata)' the PMA values from its 'exprs' slot and the p-values from its 'se.exprs' slot.
library(ath1121501.db) # Opens library with annotation data.Accessing probe sequence data
library(help=ath1121501.db) # Shows availability and syntax for annotation data.
ath1121501() # Provides a summary about the available annotation data sets of an anntation library.
library(ath1121501cdf); ls(ath1121501cdf) # Retrieves all Affy IDs for a chip in vector format.
x <- c("245265_at", "260744_at", "259561_at", "254759_at", "267181_at") # Generates sample data set of Affy ID numbers.
contents(ath1121501ACCNUM)[1:40] # Retrieves locus ID numbers for Affy IDs.
myAnnot <- data.frame(ACCNUM=sapply(contents(ath1121501ACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(ath1121501SYMBOL), paste, collapse=", "), DESC=sapply(contents(ath1121501GENENAME), paste, collapse=", ")); myAnnot[x, ] # Organizes an entire set of annotation features in a data frame, here: ACCNUM, SYMBOL and GENENAME. The annotations from probe sets with several mappings are collapsed into single strings.
mget(x, ath1121501GO, ifnotfound=NA) # Retrieves GO information for Affy IDs.
library(ath1121501probe) # Opens library with probe sequence data.
print.data.frame(ath1121501probe[1:22,]) # Prints probe sequences and their positions for first two Affy IDs.
library(Biostrings); pm <- DNAStringSet(ath1121501probe$sequence); names(pm) <- ath1121501probe$Probe.Set.Name # Stores probe sequences as DNAStringSet object. See HT-Seq manual for more details.
Visualization and quality controlsAdditional information on this topic can be found in the affyPLM QC Manual, the arrayQualityMetrics package and on the WEHI Affy Lab site.
library(affyQCReport); QCReport(mydata, file="ExampleQC.pdf") # Generates a comprehensive QC report for the AffyBatch object 'mydata' in PDF format. See affyQCReport for details.
deg <- AffyRNAdeg(mydata); summaryAffyRNAdeg(deg); plotAffyRNAdeg(deg) # Performs RNA degradation analysis. It averages on each chip the probes relative to the 5'/3' position on the target genes. A summary list and a plot are returned.
image(mydata[ ,1]) # Reconstructs image with log intensities of first chip.
hist(mydata[ ,1:2]) # Plots histogram of PM intensities for 1st and 2nd array.
hist(log2(pm(mydata[,1])), breaks=100, col="blue") # Plos bar histogram of the PM ('pm') or MM ('mm') log intensities of 1st array.
boxplot(mydata,col="red") # Generates a box plot of un-normalized log intensity values.
boxplot(data.frame(exprs(eset)),col="blue", main="Normalized Data") # Generates a box plot of normalized log intensity values.
mva.pairs(pm(mydata)[,c(1,4)]) # Creates MA-plot for un-normalized data. A MA-plot is a plot of log-intensity ratios (M-values) versus log-intensity averages (A-values) between selected chips (here '[1,4]').
mva.pairs(exprs(eset)[,c(1,4)]) # Creates MA-plot for normalized data.
SimpleaffyThe simpleaffy package automates many repetitive high-level analysis steps.
- Creating expression values from CEL files
Move CEL files into working directory.Quality control steps
Create white-space delimited table file "covdesc.txt": first column no title then CEL file names; second column title "treatment", then treatment names.
library(simpleaffy) # Loads "affy" and "simpleaffy" packages.
raw.data <- read.affy("covdesc.txt") # Reads data in working directory that are specified in "covdesc.txt" file; for help on this function type "?read.affy".
x.rma <- call.exprs(raw.data, "rma") # Creates expression values using RMA method. The function 'call.exprs' returns always log2 expression values; for help on this function type "?call.exprs". One can also use here the "gcrma" method after loading it with the command 'library(gcrma)'.
x.mas <- call.exprs(raw.data, "mas5") # Computes expression values using MAS 5.0 method.
write.exprs(x.rma, file="mydata.txt") # Writes expression values to file "mydata.txt".
The "qc" function provides several qc stats for chips. To use it, one follows the following steps:Filtering by expression values
x <- read.affy("covdesc.txt") # Reads data in working directory.
x.mas5 <- call.exprs(x,"mas5") # Calculates expression values with MAS 5.0 method which is required for the next step!
qc <- qc(x,x.mas5) # Calls "qc" function which generates object containing scale factors, GAPDH-Actin_3'/5' ratios, target intensity, % present, average, min, max, mean background int and more; for more details type "?qc".
x.mas5@description@preprocessing # Creates QC output like this.
getGapdh3(cleancdfname(cdfName(x))) # To find out which GAPDH/Actin probe sets are used on a given chip; for others use getGapdhM, getGapdh5, getBioB, getBioC...
plot(qc) # Creates summary plot of QC data. See description on page 5 of vignette "simpleaffy".
get.array.subset(x.rma, "treatment",c("CMP1","Ctrl")) # When R loaded *.CEL files it also reads experiment layout from covdesc.txt file (Ctrl and CMP1). The "get.array.subset" function makes it easy to select subsets of arrays.Plotting results
results <- pairwise.comparison(x.rma, "treatment", c("1", "2"), raw.data) # computes mean intensities of replicates from two treatments (means), calculates log2-fold changes between means (fc), performs t-test (tt) and writes out PMA calls (calls). Type "?pairwise.comparison" for more help on this function.
write.table(data.frame(means(results), fc(results), tt(results), calls(results)), file="my_comp.txt", sep="\t") # Command to export all 'pairwise.comparison' data into one table.
sort(abs(fc(results)),decreasing=TRUE)[1:100] # Prints out 100 highest fold-changes (fc), for means of intensities (means), for ttest (tt), for PMA (calls).
significant <- pairwise.filter(results, min.exp=log2(10), min.exp.no=2, fc=log2(8), min.present.no=4, tt= 0.001, present.by.group=FALSE) # Function 'pairwise.filter' takes the output from 'pairwise.comparison' and filters for significant changes. Filter Arguments:
Type "?pairwise.filter" for more info on the different arguments of this function.
- min.exp: minimum expression cut off
- min.exp.no: occurence of 'min.exp' in at least this number of chips
- min.present.no: present calls on at least this number of chips
- present.by.group: If true, then present count is restricted to replicate groups and not all chips of an experiment!
- fc: A gene must show a log2 fold change greater than this to be called significant. A 2.5-fold change can be specified like this: 'fc=log2(2.5)'
- tt: A gene must be changing with a p-score less than this to be called significant
write.table(data.frame(means(significant), fc(significant), tt(significant), calls(significant)), file="my_comp.txt", col.names = NA, sep="\t") # Exports all 'pairwise.filter' data into one table.
plot(significant, type="scatter") # Plots significant changes from above as scatter plot. Alternative plotting types: type="ma" or type="volcano".Exercise simpleaffy (Command Summary)Analysis of Differentially Expressed Genes
plot(results,significant, type="scatter") # Plots means of the two replicate groups as scatter plot and high-lights all genes that meet criteria of 'significant' filter. Meaning of colors: red - all present, orange - all present in one group or the other, yellow - all that remain.
plot(results,type="scatter") # Plots means of the two replicate groups as scatter plot.
png(file="figure1.png"); plot(significant,type="scatter"); dev.off() # Writes scatter plot to image file.
- Various packages are available for analyzing pre-processed data from dual-color and Affymetrix arrays.
- LIMMA, limmaGUI & affylmGUI
- Limma is a software package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression. The package includes pre-processing capabilities for two-color spotted arrays. The differential expression methods apply to all array platforms and treat Affymetrix, single channel and two channel experiments in a unified way. The methods are described in Smyth 2004 and in the limma manual. An illustrated introduction for the GUI packages can be found at WEHI.
- Tcl/Tk Requirements
library(limma) # Loads command-line limma package.Data objects in limma
limmaUsersGuide() # Opens pdf manual for limma.
The cDNA and affy sample data sets of the manual can be downloaded from the limmaGUI and affylmGUI pages.
library(affylmGUI); affylmGUI() # Opens affylmGUI for affy data. For a quick start, follow the instructions for the Estrogen data set.
library(limmaGUI); limmaGUI() # Opens limmaGUI for cDNA array data. For a quick start, follow the instructions for the Swirl Zebrafish data set.
There are four main data objects created and used by limma:
- RGList: for cDNA data created by function 'read.maimages()'
- MAList: for cDNA data created by functions MA.RG() or 'normalizeWithinArrays()'
- MArrayLM: created by function 'lmFit()'
- TestResults: created by function 'decideTests()'
More details on this can be found in paragraph 13 of the limma PDF manual ('limmaUsersGuide()').
Preprocessing of cDNA array data
- Reading cDNA array data
To make the following commands work, save and extract the SWIRL cDNA microarray sample data into your R working directory. For a quick demonstration of the analysis of this data set, one can copy&paste or source the following command-line summary into the R terminal: my_swirl_commands.txt.
Requirements Quality plots
Reading intensity data
- Have all intensity data files in one directory. Supported formats are: ArrayVision, ImaGene, GenePix, QuantArray, SMD (QuantArray) or SPOT. If an intensity data file format is not supported then one can specify the corresponding column names during the data import into R (see below). Type '?read.maimages' for more information about supported formats.
- Targets file: Format of targets file. This file defines which RNA sample was hybridized to each channel of each array.
- Only for SPOT: separate gene list file
- Optional: spot type file for identifying special probes such as controls.
targets <- readTargets("Targets.txt") # Reads targets information from file 'Targets.txt' and assigns it to targets frame.Spot quality weights
RG <- read.maimages(targets$FileName, source="spot", sep="\t", path="./") # Writes intensity data into 'RGList' list object. The argument 'targets$FileName' specifies the intensity files; 'source="spot"' specifies the image analysis program (e.g. spot); 'path="./"' provides the path to the directory where the intensity files are located and 'sep="\t"' specifies the field separator.
RG <- read.maimages(targets$FileName, annotation="My_spot_labels", columns=list(Rf="Rmean",Gf="Gmean",Rb="morphR",Gb="morphG")) # If an intensity data file format is not supported then one can specify the corresponding column names as shown here. The provided example is equivalent to the previous command with the exception of the additional argument: annotation="My_spot_labels". This argument allows the import of a spot ID or annotation column. This can be useful for analyzing arrays with unavailable printing layout or incomplete intensity data files. In those cases, all print-tip specific functions of the following steps will not work (e.g. use 'loess' normalizaten instead of 'printtiploess').
RG_combined <- cbind(RG1, RG2) # One can combine different 'RGList' objects (e.g. RG1, RG2) into one large object. The command 'RG[,1]' shows only the first array, while 'RG[1:100,]' gives the data of the first 100 genes.
RG <- read.maimages(targets$FileName, source="spot", wt.fun=wtarea(100)) # Assigns spot quality weights from 0 to 1 to each spot and writes information as 'weight' component into 'RGList' object. Provided example with 'wt.fun=wtarea(100)' gives full weight of 1 to spots with 100 pixels. Spots with zero pixels or twice the ideal size are given weight of 0. All spots with weight of 0 are considered as flagged and limma will ignore them in the subsequent analysis. The appropriate way of computing quality weights depends on the image analysis software. The command '?QualityWeight' provides more information on available weight functions.Gene lists and print layout
RG$genes <- readGAL() # The output of some image analysis programs contains intensity, gene name and print layout information all in one file. In the cases of the image analysis programs SPOT, GenePix, ArrayVision and others, this information is located in a separate file from where it needs to be written into the 'RGList' using the 'readGal' and 'getLayout' functions. The function 'readGal' reads this information by default from a *.gal file in the working directory. This file contains the columns: 'Block', 'Column', 'Row', 'ID' and 'Name' (optional). If such a file is not provided then one can create it in R using the printing layout information of a given array. Example of an array with 4 by 8 sub-grids (pins) each with 20 by 21 spots: x <- 1:32; c1 <- c(); for(i in x) c1 <- c(c1, rep(i, times=420)); x <- 1:21; c2 <- c(); for(i in x) c2 <- c(c2, rep(i, times=20)); c3 <- rep(1:21, times=20); test.gal <- data.frame(Block=c1, Row=rep(c2, times=32), Column=rep(c3, times=32), ID=rep("my_ID", times=13440), Name=rep("my_Description", times=13440)); test.gal[1:20,];Spot type file
RG$printer <- getLayout(RG$genes) # Extracts print layout (number of pins or subgrids) after the gene list is available and adds this information as $printer component to RGList.
Control spot information can be added to the RGList object with the spot type file (SpotTypes.txt). Regular expressions can be used here to associate this information with the RG$genes component of the RGList. The incorporation of the spot type information has the advantage that controls can be highlighted in plots or separated in certain analysis steps. The format of this file is specified in the limma pdf manual.
spottypes <- readSpotTypes("SpotTypes.txt") # Reads in the file SpotTypes.txt.
RG$genes$Status <- controlStatus(spottypes, RG) # Adds a 'Status' column to the RG$genes component in RGList.
Recommended quality exploration steps are the imageplot() function of the raw log-ratios and an MA-plot (plotMA) of the raw data for each array. Normalization
spottypes <- readSpotTypes("SpotTypes.txt") # Same as above.
RG$genes$Status <- controlStatus(spottypes, RG) # Same as above.
plotMA(RG, array=1) # Creates MA-plot which is a plot of the log-intensity ratios (M-values) versus the log-intensity averages of both channels (A-values). To display plots for all arrays next to each other, one can use the layout function 'par(mfrow=c(2,2))' and plot the different arrays in the same plotting device by changing the array numbers in the argument 'array=1'.
imageplot(log2(RG$R[,1]), RG$printer, low="white", high="red") # Creates an image of color shades that represent the values of a statistic in the microarray printing format. This function can be used to explore spatial effects across the microarray.
plotPrintTipLoess(RG) # Co-plots unnormalized RG data in form of MA-plots with the loess curves for each print-tip.
Limma integrates several within- and between-array normalization methods. The commands '?normalizeWithinArrays' and '?normalizeBetweenArrays' provide information about the different methods. The method 'printtiploess' is the default. Use 'loess' instead if your arrays have no print-tip groups (e.g. Agilent) or have less than 150 spots per print-tip. Background correction
MA <- normalizeWithinArrays(RG, method="printtiploess") # Background corrects and normalizes the expression log-ratios of the RGList and assigns the results to a MAList object. The default background correction method is bc.method="subtract". Use bc.method="none" to ignore the background in the analysis. If the quality weights should be ignored in this step, add to the command the 'weights=NULL' argument. For analyzing arrays with unavailable printing layout or incomplete intensity data files, one should use here a print-tip unspecific normalization method such as 'loess'.
plotPrintTipLoess(MA) # Coplots normalized MA data in form of MA-plots with loess curves for each print-tip. Compare results with unnormalized RG data from above.
boxplot(as.data.frame(MA$M)) # Creates box-and-whisker plot of MA values. This is a graphical summary of the MA distribution between the arrays. The two hinges in the middle represent the first and third quartile, the lines (whiskers) represent the largest and smallest observation in a distance of 1.5 times the box height and everything beyond that is printed as extreme values in form of dots. Inconsistent spreads of hinges and whiskers between arrays can indicate normalization issues. Between-array normalization may improve this situaltion (see limma MAnual).
MA <- normalizeBetweenArrays(MA) # If box-and-whisker plot indicates different spreads of M-values, as in the above example, then one can use this secondary between-array normalization step.
The above 'normalizeWithinArrays' function performs by default a background correction by subtracting the background values from the expression intensities. Linear models and differential expression of single and dual color data
RGb <- backgroundCorrect(RG, method="subtract"); MA <- normalizeWithinArrays(RGb) # These two commands perform the same operation as the above 'normalizeWithinArrays' command.
RG <- backgroundCorrect(RG, method="normexp", offset=50) # This more advanced background correction method 'normexp' adjusts the expression values adaptively for background values and results in strictly positive expression values. This method is particularly effective for obtaining more robust ratios for low expressed genes.
Limma provides advanced statistical methods for linear modelling of microarray data and for identifying differentially expressed genes. The approach fits a linear model to the data and uses an empirical Bayes method for assessing differential expression. It is described in Smyth 2004
. One or two experiment definition matrices need to be specified during the analysis: a design matrix
defining the RNA samples and a contrast matrix
(optional for simple experiments) defining the comparisons to be performed. How to set up these matrices is described in the limma manual and on this page from Natalie Thorne
. Another useful manual is Advanced Linear Modeling and Time-Series Analysis
- cDNA common reference design (one sample test)
The swirl experiment is a classical one sample test where a mutant is compared to a WT sample with four replicates that include 2 dye swaps. The following commands will work with the above objects obtained under steps '4.1-4.4'.
design <- c(-1,1,-1,1) # Creates appropriate design matrix.
fit <- lmFit(MA, design) # Fits a linear model for each gene based on the given series of arrays. The returned list object 'fit' contains the average M-value ($coefficients) for each gene and their standard deviations ($sigma). The ordinary t-statistics can be computed with this command: ordinary.t <- fit$coef / fit$stdev.unscaled / fit$sigma. However, the empirical Bayes moderated t-statistics should be the preferred method for limma users (see below).
plotMA(fit); abline(0,0,col="blue") # Plots MA-plot with baseline.
fit <- eBayes(fit) # Computes empirical Bayes statistics for differential expression. This moderated t-statistics uses standard deviations that have been shrunk towards a pooled standard deviation value.
qqt(fit$t,df=fit$df.prior+fit$df.residual,pch=16,cex=0.2); abline(0,1) # Plots the quantiles of a data sample against the theoretical quantiles of a Student's t distribution.
options(digits=3) # Sets the number of digits to print in numeric output to 3 digits.
topTable(fit, adjust="fdr", sort.by="B", number=10) # Generates list of top 10 ('number=10') differentially expressed genes sorted by B-values ('sort.by=B'). The summary table contains the following information: logFC is the log2-fold change, AveExpr is the average expression value accross all arrays and channels, the moderated t-statistic (t) is the logFC to its standard error, the P.Value is the associated p-value, the adj.P.Value is the p-value adjusted for multiple testing and the B-value (B) is the empirical Bayes log-odds of differential expression (the-higher-the-better). Usually one wants to base gene selection on the adj.P.Value rather than the t- or B-values. More details on this can be found in the limma PDF manual (type 'limmaUsersGuide()') or on this FAQ page. If the swirl data set was normalized with the 'printtiploess' method and the between-array method, then the top downregulated gene is BMP2 which is the mutated gene of this sample.
x <- topTable(fit, adjust="fdr", sort.by="P", number=100); x[x$adj.P.Val < 0.01,] # Filters out candidates that have P-values < 0.01.
plotMA(fit); ord <- order(fit$lods, decreasing=TRUE); top30 <- ord[1:30]; text(fit$Amean[top30], fit$coef[top30], labels=fit$genes[top30, "Name"], cex=0.8, col="blue") # Plots MA-plot and highlights 30 top changes.
Affymetrix data analysis
To enable the following analysis steps, users need to save and extract this archive of 6 cel files into their R working directory. These sample files are from the Cold Stress Time Course of the AtGenExpress site (ftp batch download). This affy_targets.txt file is required to select specific cel files during the analysis. For a quick demonstration of the analysis of this data set, one can copy&paste the following commands into the R terminal. RankProdThe RankProd package contains functions for the identification of differentially expressed genes using the rank product non-parametric method from Breitling et al., 2004. It generates a list of up- or down-regulated genes based on the estimated percentage of false positive predictions (pfp), which is also known as false discovery rate (FDR). The attraction of this method is its ability to analyze data sets from different origins (e.g. laboratories) or variable environments.Required data objects
library(limma) # Loads limma library.For visualizing the analyzed affy data, use the same visualization and quality control steps as described in the affy package section.
targets <- readTargets("affy_targets.txt") # Reads targets information from file 'affy_targets.txt' and assigns it to targets frame.
library(affy); data <- ReadAffy(filenames=targets$FileName) # Reads CEL files (specified in 'targets') into AffyBatch object.
eset <- rma(data) # Normalizes data with 'rma' function and assigns them to ExpressionSet object (see above).
# exprs(eset) <- log2(exprs(eset)) # If eset contains absolute intensity values like MAS5 results, then they should be transformed to log2 (or loge) values for limma. RMA/GCRMA generate log2 values and MAS5 produces absolute values.
pData(eset) # Lists the analyzed file names.
write.exprs(eset, file="affy_all.txt") # Exports all affy expression values to tab delimited text file. The MAS 5.0 P/M/A calls can be retrieved with the simpleaffy package or with the affy package like this: 'eset <- mas5calls(data); write.exprs(eset, file="my_PMA.txt")'.
design <- model.matrix(~ -1+factor(c(1,1,2,2,3,3))) # Creates appropriate design matrix. Alternatively, such a design matrix can be created in any spreadsheet program and then imported into R.
colnames(design) <- c("group1", "group2", "group3") # Assigns column names.
fit <- lmFit(eset, design) # Fits a linear model for each gene based on the given series of arrays.
contrast.matrix <- makeContrasts(group2-group1, group3-group2, group3-group1, levels=design) # Creates appropriate contrast matrix to perform all pairwise comparisons. Alternatively, such a contrast matrix can be created in any spreadsheet program and then imported into R. For complex experiments one can also use this function to compute a contrast matrix with all possible pairwise comparisons.
fit2 <- contrasts.fit(fit, contrast.matrix) # Computes estimated coefficients and standard errors for a given set of contrasts.
fit2 <- eBayes(fit2) # Computes moderated t-statistics and log-odds of differential expression by empirical Bayes shrinkage of the standard errors towards a common value.
topTable(fit2, coef=1, adjust="fdr", sort.by="B", number=10) # Generates list of top 10 ('number=10') differentially expressed genes sorted by B-values ('sort.by=B') for each of the three comparison groups ('coef=1') in this sample set. The summary table contains the following information: logFC is the log2-fold change, the AveExpr is the average expression value accross all arrays and channels, the moderated t-statistic (t) is the logFC to its standard error, the P.Value is the associated p-value, the adj.P.Value is the p-value adjusted for multiple testing and the B-value (B) is the log-odds that a gene is differentially expressed (the-higher-the-better). Usually one wants to base gene selection on the adjusted P-value rather than the t- or B-values. More details on this can be found in the limma PDF manual (type 'limmaUsersGuide()') or on this FAQ page.
write.table(topTable(fit2, coef=1, adjust="fdr", sort.by="B", number=50000), file="limma_complete.xls", row.names=F, sep="\t") # Exports complete limma statistics table for first comparison group ('coef=1') to tab delimited text file.
results <- decideTests(fit2, p.value=0.05); vennDiagram(results) # Creates venn diagram of all changed genes with p-value equal or less than 0.05.
x <- topTable(fit2, coef=1, adjust="fdr", sort.by="P", number=50000); y <- x[x$adj.P.Val < 0.05,]; y; print("Number of genes in this list:"); length(y$ID) # Filters out candidates that have P-values < 0.05 in each group ('coef=1') and provides the number of candidates for each list. These numbers should be identical with the sum of the values in each circle of the above venn diagram.
x <- topTable(fit2, coef=1, adjust="fdr", sort.by="P", number=50000); y <- x[x$adj.P.Val < 0.01 & (x$logFC > 1 | x$logFC < -1) & x$AveExpr > 10,]; y; print("Number of genes in this list:"); length(y$ID) # Same as above but with complex filter: P-value < 0.01 AND at least 2-fold change AND expression value A > 10.
results <- decideTests(fit2, p.value=0.000005); heatDiagram(results, fit2$coef, primary=1) # This function plots heat diagram gene expression profiles for genes which are significantly differentially expressed in the primary condition (this is not a cluster analysis heat map). Genes are sorted by differential expression under the primary condition. The argument 'primary=1' selects the first contrast column in the 'results' matrix as primary condition. The plotted genes can be extracted like this 'results[results[,1]==1,]'. More information on this function can be found in the limma manual.
- 'data' object of type matrix or data frame containing expression values in log2 scale.
- 'cl' vector of length ncol(data) with class lables of samples/treatments.
- 'origin' vector of length ncol(data) with origin labels of samples. The origin vector is only required for analyzing data from multiple origins!
- Differential expression analysis with Affymetrix data from single origin
library(RankProd) # Loads the required library.Differential expression analysis with Affymetrix data from multiple origins
data(arab) # Loads sample data matrix or data frame 'arab' that contains log2 intensities of 500 genes (rows) for 10 samples (columns).
my_expr_matrix <- exprs(my_eset) # When the analysis is started from Affy Cel files, one can create the required expression matrix or data frame for input into RankProd from an ExpressionSet object with the 'exprs' function.
arab.sub <- arab[, which(arab.origin == 1)] # Generates sub-data set for single origin analysis.
cl <- arab.cl[which(arab.origin == 1)] # Generates 'cl' vector with classes (contrast groups) labeled with '0' and '1'.
RP.out <- RP(arab.sub, cl, num.perm = 100, logged = TRUE, na.rm = FALSE, plot = FALSE, gene.names = arab.gnames, rand = 123) # Performs rank product analysis for single-origin data. If the input data are not log transformed use 'logged=FALSE'.
topGene(RP.out, cutoff = 0.05, logged = TRUE, logbase = 2, gene.names = arab.gnames) # The function 'topGene' outputs identified genes with user-specified filters: 'cutoff': pfp threshold; 'num.gene': number of top genes identified; 'gene.names': vector with Affy IDs in order of input data set. For log2 scaled data use the arguments 'logged=TRUE' and 'logbase=2'. The generated 'Table 1' lists the up-regulated genes and the 'Table 2' the down-regulated genes. The four columns in each table contain the following information: (1) 'gene index' = position in original data set, (2) 'RP/Rsum' = rank product, 'FC' = fold change of averaged expression levels, 'pfp' = estimated pfp or FDR.
plotRP(RP.out, cutoff = 0.05) # Plots pfp (FDR) versus number of identified up- and down-regulated genes. Genes within the specified cutoff range are plotted in red.
cl <- arab.cl # Assigns 'cl' vector with classes (contrast groups) labeled with '0' and '1'. Object 'arab.cl' is part of sample data set 'arab'.Differential expression analysis with cDNA arrays
origin <- arab.origin # Assigns 'origin' vector with origin labels of samples. Object 'arab.origin' is part of sample data set 'arab'.
RP.adv.out <- RPadvance(arab, cl, origin, num.perm = 100, logged = TRUE, gene.names = arab.gnames, rand = 123) # The function 'RPadvance' performs the rank product analysis for multiple-origin data that requires the above 'origin' vector.
topGene(RP.adv.out, cutoff = 0.05, logged = TRUE, logbase = 2, gene.names = arab.gnames) # Generates the 'topGene' list (see above).
SAMSAM is part of the 'siggenes' package.
library(siggenes) # Loads siggenes package.
?sam # Opens help page for SAM.
sam(data,cl....) # Basic syntax for running SAM.
Digital Gene Expression (DGE)
Dual Color (cDNA) Array Packages
- BioConductor provides various packages for the analysis of dual-color cDNA arrays.
- Exploratory analysis for two-color spotted microarray data.
- The following gives a very brief summary on how to use this package:
library(marray) # Loads marray library.
... # Continue...
- Several BioConductor packages are available for displaying genomic information graphically. The following commands illustrate some of the chromosome plotting utilities from the geneplotter package.
library(annotate); library(geneplotter); library(hgu95av2); newChrom <- buildChromLocation("hgu95av2"); newChrom; cPlot(newChrom) # This displays all genes on the chromosomes of an organisms. Genes encoded by the antisense strands are represented by lines below the chromosomes.
data(sample.ExpressionSet); myeset <- sample.ExpressionSet; cColor(featureNames(sample.ExpressionSet), "red", newChrom) # This highlights in the above plot a set of genes of interest in red color (e.g. expressed genes of an experiment).
cPlot(newChrom,c("1","2"), fg="yellow", scale="relative"); cColor(featureNames(myeset), "red", newChrom) # Plots just a specific set of chromosomes.
KEGG Pathway Analysis
- Several packages are available for Gene Ontology (GO) analysis. The following examples of the different packages use the GO annotations from Arabidopsis.
- Basic GO usage with GOstats
library(GOstats); library(GO.db); library(ath1121501.db); library(annotate) # Loads the required libraries.
goann <- as.list(GOTERM) # Retrieves full set of GO annotations.
zz <- eapply(GOTERM, function(x) x@Ontology); table(unlist(zz)) # Calculates the number of annotations for each ontology category.
?GOTERM # To find out, how to access the different GO components.
GOTERM$"GO:0003700"; GOMFPARENTS$"GO:0003700"; GOMFCHILDREN$"GO:0003700" # Shows how to print out the GO annotations for one entry and how to retrieve its direct parents and children.
GOMFANCESTOR$"GO:0003700"; GOMFOFFSPRING$"GO:0003700" # Prints out complete lineages of parents and children for a GO ID.
goterms <- unlist(eapply(GOTERM, function(x) x@Term)); goterms[grep("molecular_function", goterms)] # Retrieves all GO terms and prints out only those matching a search string given in the grep function. The same can be done for the definition field with 'x@Definition'. A set of GO IDs can be provided as well: goterms[GOMFANCESTOR$"GO:0005507"]
go_df <- data.frame(GOID=unlist(eapply(GOTERM, function(x) x@GOID)), Term=unlist(eapply(GOTERM, function(x) x@Term)), Ont=unlist(eapply(GOTERM, function(x) x@Ontology))) # Generates data frame of the commonly used GO components: GOID, GO Term and Ontology Type.
affyGO <- eapply(ath1121501GO, getOntology, "MF"); table(sapply(affyGO, length)) # Retrieves MF GO terms for all probe IDs of a chosen Affy chip and calculates how many probes have multiple GO terms associated. Use "BP" and "CC" arguments to retrieve BP/CC GO terms.
affyGOdf <- data.frame(unlist(affyGO)); affyGOdf <- data.frame(AffyID=row.names(affyGOdf), GOID=affyGOdf[,1]); affyGOdf <- merge(affyGOdf, go_df, by.x="GOID", by.y="GOID", all.x=T) # Converts above MF list object into a data frame. The AffyID occurence counts are appended to AffyIDs. The last step merges the two data frames: 'affyGOdf' and 'go_df'.
unique(lookUp("GO:0004713", "ath1121501", "GO2ALLPROBES")) # Retrieves all Affy IDs that are associated with a GO node.
z <- affyGO[c("254759_at", "260744_at")]; as.list(GOTERM)[z[]] # Retrieves GO IDs for set of Affy IDs and then the corresponding GO term for first Affy ID.
a <- data.frame(unlist(z)); a <- data.frame(ID=row.names(a), a); b <- data.frame(goterms[as.vector(unlist(z))]); b <- data.frame(ID=row.names(b), b); merge(b,a, by.x = "ID", by.y="unlist.z.") # Merges Affy ID, GO ID and GO annotation information.
affyEv <- eapply(ath1121501GO, getEvidence); table(unlist(affyEv, use.names = FALSE)) # Provides evidence code information for each gene and summarizes the result.
test1 <- eapply(ath1121501GO, dropECode, c("IEA", "NR")); table(unlist(sapply(test1, getEvidence), use.names = FALSE)) # This example shows how one can remove certain evidence codes (e.g. IEA, IEP) from the analysis.
library(ath1121501.db); affySample <- c("266592_at", "266703_at", "266199_at", "246949_at", "267370_at", "267115_s_at", "266489_at", "259845_at", "266295_at", "262632_at"); geneSample <- as.vector(unlist(mget(affySample, ath1121501ACCNUM, ifnotfound=NA))); library(ath1121501cdf); affyUniverse <- ls(ath1121501cdf); geneUniverse <- as.vector(unlist(mget(affyUniverse, ath1121501ACCNUM, ifnotfound=NA))); params <- new("GOHyperGParams", geneIds = geneSample, universeGeneIds = geneUniverse, annotation="ath1121501", ontology = "MF", pvalueCutoff = 0.5, conditional = FALSE, testDirection = "over"); hgOver <- hyperGTest(params); summary(hgOver); htmlReport(hgOver, file = "MyhyperGresult.html") # Example of how to test a sample set of probe set keys for overrepresentation of GO terms using a hypergeometric distribution test with the function hyperGTest(). For more information, read the GOstatsHyperG manual.
- GOHyperGAll function: To test a sample population of genes for over-representation of GO terms, the function 'GOHyperGAll' computes for all GO nodes a hypergeometric distribution test and returns the corresponding raw and Bonferroni corrected p-values (notes about implementation). A subsequent filter function performs a GO Slim analysis using default or custom GO Slim categories. The method has been published in Plant Physiol (2008) 147, 41-57). GOHyperGAll provides similar utilities as the hyperGTest function in the GOstats package from BioConductor. The main difference is that GOHyperGAll simplifies the usage of custom chip-to-gene and gene-to-GO mappings.
To demo the utility of the GOHyperGAll function, run it as shown below. The initial sample data generation step takes some time (~10 min), since it needs to generate the required data objects for all three ontologies. This needs to be done only once for every custom gene-to-GO annotation.
(1.1) Import all required functions with the following source() command
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt")(2.1) Using gene-to-GO mappings from geneontology.org
readGOorg(myfile = "gene_association.tair", colno = c(5,11,9), org = "Arabidopsis"); gene2GOlist(rootUK=T) # Download the required annotation table from geneontology.org and unzip it. Then point the 'readGOorg()' function to this file name. The two functions generate 4 data frames with the assigned gene-to-GO mappings and 3 lists containing the gene-to-GO-OFFSPRING associations. When the processes are completed, 6 files will be saved in your working directory! They can be reloaded in future R sessions with the 'load' command below. If the argument 'rootUK' is set to TRUE, then the root nodes are treated as terminal nodes to account for the new assignment of unknown genes to the root nodes.(2.2) Using gene-to-GO mappings from BioC - Note: users should execute either step (2.1) or (2.2), but not both!
sampleDFgene2GO(lib="ath1121501.db"); gene2GOlist(rootUK=T) # Similar as above, but the gene-to-GO mappings are obtained from BioC. The generated 4 sample data frame and 3 list objects can be reloaded in future R sessions with the 'load' command below.(2.3) Obtain AffyID-to-GeneID mappings when working with AffyIDs
AffyID2GeneID(map = "ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements-2010-12-20.txt") # When working with AffyIDs, this function creates a AffyID-to-GeneID mapping data frame using by default the TAIR mappings for the Arabidopsis ATH1 chip. To use the function for the mappings of other chips, one needs to create the corresponding decoding data frame 'affy2locusDF'.(3.1) Reloading required data objects from local files
loadData(); load(file="MF_node_affy_list"); load(file="BP_node_affy_list"); load(file="CC_node_affy_list") # This step makes future sessions much faster, since it allows to skip the previous data generation steps (2.1-2.3). A sample data set is available here: ArabSampleGOHyperGAll.zip (Jan 2010).(3.2) Obtain a sample set of GeneIDs
test_sample <- unique(as.vector(GO_MF_DF[1:40,2])) # When working with GeneIDs.(4.1) Perform phyper test, goSlim subsetting and plotting of results
test_sample <- AffyID2GeneID(affyIDs=affy_sample, probe2gene=1) # When working with AffyIDs, one can use the function 'AffyID2GeneID' to obtain for a set of AffyIDs their corresponding GeneIDs from the data frame 'affy2locusDF' (see above). For probe sets that match several loci, only the first locus ID will be used if the argument 'probe2gene' is set to 1. To demo the function, one can use the following sample AffyIDs: affy_sample <- c("266592_at", "266703_at", "266199_at", "246949_at", "267370_at", "267115_s_at", "266489_at", "259845_at", "266295_at", "262632_at").
GOHyperGAll_result <- GOHyperGAll(gocat="MF", sample=test_sample, Nannot=2); GOHyperGAll_result[1:10,-8] # The function 'GOHyperGAll()' performs the phyper test against all nodes in the GO network. It returns raw and adjusted p-values. The Bonferroni correction is used as p-values adjustment method according to Boyle et al, 2004 (online). The argument 'Nannot' defines the minimum number of direct annotations per GO node from the sample set to determine the number of tested hypotheses for the p-value adjustment. The argument 'gocat' can be assigned the values "MF", "BP" and "CC". Omitting the '-8' delimiter will provide the sample keys matching at every GO node.(4.2) Reduce GO Term Redundancy in 'GOHyperGAll_results'
subset <- GOHyperGAll_Subset(GOHyperGAll_result, sample=test_sample, type="goSlim"); subset[,-8] # The function 'GOHyperGAll_Subset()' subsets the GOHyperGAll results by assigned GO nodes or custom goSlim categories. The argument 'type' can be assigned the values "goSlim" or "assigned". The optional argument 'myslimv' can be used to provide a custom goSlim vector. Omitting the '-8' delimiter will show the sample keys matching at every GO node.
pie(subset[subset$SampleMatch>0 ,3], labels=as.vector(subset[subset$SampleMatch>0 ,1]), main=unique(as.vector(subset[subset$SampleMatch>0, 7]))) # Plots pie chart of subsetted results.
simplifyDF <- GOHyperGAll_Simplify(GOHyperGAll_result, gocat="MF", cutoff=0.001, correct=T) # The result data frame 'GOHyperGAll_result' often contains several connected GO terms with significant scores which can complicate the interpretation of large sample sets. To reduce the redundancy, the function 'GOHyperGAll_Simplify' subsets the data frame 'GOHyperGAll_result' by a user specified p-value cutoff and removes from it all GO nodes with overlapping children sets (OFFSPRING), while the best scoring nodes remain in the data frame.(4.3) Batch Analysis of Many Gene Clusters
data.frame(GOHyperGAll_result[GOHyperGAll_result[,1] %in% simplifyDF[,1], -8], GO_OL_Match=simplifyDF[,2]) # This command returns the redundancy reduced data set. The column 'GO_OL_Match' provides the number of accessions that match the connected nodes.
BatchResult <- GOCluster_Report(CL_DF=CL_DF, method="all", id_type="gene", CLSZ=10, cutoff=0.001, gocats=c("MF", "BP", "CC"), recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575")) # The function 'GOCluster_Report' performs the three GO analyses in batch mode: 'GOHyperGAll', 'GOHyperGAll_Subset' or 'GOHyperGAll_Simplify'. It processes many groups of genes (e.g. gene expression clusters) and returns the results conveniently organized in a single data frame. The gene sets need to be provided in a data frame of the format specified at the end of the GOHyperGAll script. CLSZ: minimum cluster size to consider; method: "all", "slim" or "simplify"; gocat: "MF", "BP" or "CC"; cutoff: adjusted p-value cutoff; recordSpecGO: argument to include one specific GOID in each of the 3 ontologies, e.g: recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575").
- Gene Set Enrichment Analysis (GSEA) The following function converts the objects, generated by the above GOHyperGAll script, into the gene set format (*.gmt) for the GSEA tool from the Broad Institute. This way the custom GO files of the GOHyperGAll method can be used as gene sets for the Java or R releases of the GSEA program.
(A) Import the required file format conversion function with the following source() command
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt") # Imports the GOhyper2GSEA() function.
GOhyper2GSEA(myfile=c("MF_node_affy_list", "BP_node_affy_list", "CC_node_affy_list"), type="all") # Converts XX_node_affy_list or GO_XX_DF files into GO_XX_ALL.gmt and GO_XX_TERM.gmt files, respecitively. Use the argument "all" for XX_node_affy_list files and "terminal" for GO_XX_DF files. The former contain all GO assignments, whereas the latter contain only the direct GO assignments.
(B) Import the generated gene set files (*.gmt) along with the gene expression intensity data (*.txt) or pre-ranked gene lists (*.rnk) into the Java or R GSEA program, and follow the instructions on the GSEA User Guide. A detailed description of the data formats is provided on the Data Formats page.
- Getting started with goTools
my_list <- list(L1=c("245265_at", "260744_at", "259561_at", "254759_at", "267181_at"), L2=c("256497_at", "250942_at", "265611_at", "247488_at", "255148_at")) # Creates a working list with two components each consisting of Arabidopsis Affy IDs.
library("goTools", verbose = FALSE) # Loads the required library.
ontoCompare(my_list, probeType = "ath1121501", goType="MF") # Provides compositional differences of the molecular function GOs between the Affy ID vectors contained in my_list.
ontoCompare(my_list, probeType = "ath1121501", goType="MF", plot=TRUE) # Plots the above MF GO compositions in bar diagram.
par(mar = c(5, 8, 5, 8)); res2 <- ontoCompare(my_list["L1"], probeType = "ath1121501", method="TIDS", goType="MF", plot=FALSE); ontoPlot(res2, cex = 0.7) # Plots pie chart for "L1" gene list component in my_list.
- goCluster demo
library("goCluster", verbose = FALSE) # Loads the goCluster package.
testdata <- eset[c(1:400),c(1:4)] # Selects 400 genes from previously generated ExpressionSet object (e.g. simpleaffy) with four chips.
test <- new("goCluster") # Creates goCluster object.
testConfigured <- config(test) # The config-method starts the configuration process and stores the results in object 'testConfigured'. The required configuration information needs to be provided in the interactive mode (use 'ath1121501').
setup(testConfigured)[["algo"]]; print(testConfigured) # Retrieves the generated configuration information.
testExecuted <- execute(testConfigured) # Analyzes the configured object.
print(testExecuted) # Prints the above analysis object to give a result summary.
display(testExecuted, selection = 3) # Displays result in form of heat map.
testExecuted2 <- testExecuted; testExecuted2@visu <- new("clusterVisualHeatmap"); testExecuted2@visu <- execute(testExecuted2@visu, testExecuted2); display(testExecuted2, selection = "GO:0009535") # Displays result for chosen GO identifier.
Motif Identification in Promoter Regions
- KEGG Pathway Script: Script to map Arabidopsis locus IDs to KEGG pathways. It can be easily adjusted to any organism.
- COSMO from Oliver Bembom et al.: allows to discover motifs in unaligned DNA sequences
- rGADEM: de novo motif discovery
ChemmineR: Searching and Clustering Drug-like Compounds in R
Protein Structure Analysis
- ChemmineR is an R package for mining drug-like compound and screening data sets.
MS Data Analysis
- Bio3D in R: utilities for the analysis of protein structure and sequence data.
Genome-Wide Association Studies (GWAS)
- SimHap: A comprehensive modelling framework and a multiple-imputation approach to haplotypic analysis of population-based data.
- GenABEL: R library for Genome-wide association analysis.
- crlmm: Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays (Carvalho et al., 2009).
- GGtools: Software and data for genetical genomics.
- GeneticsBase: Classes and functions for handling genetic data.
- PLINK: Free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
The following exercises demonstrate several very useful BioConductor tools for single-color and dual-color array analysis.
- Slide Show: Technology Overview (R Code)
- Install R, BioC and required libraries
- Download a sample set of Affymetrix cel files
- Right-click this link and save its content to your computer: Workshop.zip. These sample CEL files are from the GEO data set: GSE5621
- After unpacking this file archive one should see six *.cel files.
- Generate RMA expression data, MAS5 P/M/A calls and export results to Excel
library(affy); mydata <- ReadAffy() # Reads cel files in current working directory into affybatch object 'mydata'.
library(ath1121501probe); print.data.frame(ath1121501probe[1:22,]) # Prints probe sequences and their positions for first two Affy IDs.
eset_rma <- rma(mydata) # Generates RMA expression values and stores them as ExpressionSet.
exprs(eset_rma)[1:4,] # Prints first 4 rows in data frame structure.
# exprs(eset_rma) <- 2^(exprs(eset_rma)) # If needed unlog RMA expression values.
mydf <- 2^exprs(eset_rma); myList <- tapply(colnames(mydf), c(1,1,2,2,3,3), list); names(myList) <- sapply(myList, paste, collapse="_"); mymean <- sapply(myList, function(x) rowMeans(mydf[,x])) # Generic approach for calculating mean values for any sample combination.
eset_pma <- mas5calls(mydata) # Generates MAS 5.0 P/M/A calls.
my_frame <- data.frame(exprs(eset_rma), exprs(eset_pma), assayDataElement(eset_pma, "se.exprs")) # Combine RMA intensities, P/M/A calls plus their wilcoxon p-values in one data frame.
my_frame <- my_frame[, sort(names(my_frame))] # Sorts columns by cel file name.
write.table(my_frame, file="my_file.xls", sep="\t", col.names = NA) # Exports data to text file that can be imported into Excel.
- Add annotation information
library("ath1121501.db") # Loads required annotation package.
Annot <- data.frame(ACCNUM=sapply(contents(ath1121501ACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(ath1121501SYMBOL), paste, collapse=", "), DESC=sapply(contents(ath1121501GENENAME), paste, collapse=", ")) # Constructs a data frame containing the gene IDs, gene symbols and descriptions for all probe sets on the chip.
all <- merge(Annot, my_frame, by.x=0, by.y=0, all=T) # Merges everything with above expression data.
write.table(all, file="my_annot_file.xls", sep="\t", col.names = NA) # Exports data to text file that can be imported into Excel.
- Visualization and quality control
d <- cor(2^exprs(eset_rma), method="pearson"); plot(hclust(dist(1-d))) # Generates a correlation matrix for all-against-all chip comparisons.
library(affyQCReport); QCReport(mydata, file="ExampleQC.pdf") # Generates a comprehensive QC report for the AffyBatch object 'mydata' in PDF format. See affyQCReport for details.
eset <- eset_rma # To work with following commands.
- Continue with commands in section "Visualization and quality controls".
- Simple Affy (command summary)
- Save this covdesc.txt to your R working directory.
- Generate expression data with RMA and MAS5.
- Filter each of the three data sets with the following parameters: 2-fold changes, present in all 4 chips and p-score less than 0.001.
- Write the results into separate files.
- Create scatter plots for the filtered data sets and save them to external image files.
- Identify the overlap of the significant changes between the RMA and MAS5 data.
- Perform simpleaffy QC checks: scaling factor, percent present calls, etc.
- Identify differtially expressed genes (DEGs) with the limma package
- Follow the instructions in the Limma Section for Affy Data.
- cDNA microarray users can save and extract the SWIRL cDNA microarray sample data. For a quick demonstration of the analysis of this data set, one can copy&paste or source the following command-line summary into the R terminal: my_swirl_commands.txt.
- Test gene sets for overrepresented GO terms using the GOHyperGAll script
- Download and unzip the annotation objects for Arabidopsis (March 2010) into your working directory.
- Then continue with the following commands:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt") # Imports the GOHyperGAll functions.
loadData(); load(file="MF_node_affy_list"); load(file="BP_node_affy_list"); load(file="CC_node_affy_list") # Loads the downloaded annotation objects for Arabidopsis.
GOHyperGAll_result <- GOHyperGAll(gocat="MF", sample=unique(as.vector(GO_MF_DF[1:40,2])), Nannot=2); GOHyperGAll_result[1:10,-8] # Performs the enrichment test for provided set of gene identifiers.
CL_DF <- data.frame(geneID=GO_MF_DF[1:400,2], ClusterID=sort(rep(1:4,100)), ClusterSize=rep(100,400)) # Create sample data set for batch processing.
BatchResult <- GOCluster_Report(CL_DF=CL_DF, method="all", id_type="gene", CLSZ=10, cutoff=0.001, gocats=c("MF", "BP", "CC"), recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575")); BatchResult[1:4,-10] # Performs all three GO analyses for many sample set at once. When the method argument is set to "slim" then the goSlim method is used.
write.table(BatchResult, file="GO_Batch.xls", sep="\t", col.names = NA) # Exports batch result to Excel file.
- Clustering of differentially expressed genes
my_fct <- function(x) hclust(x, method="complete") # Creates function to perform hierarchical clustering (complete linkage).
heatmap(as.matrix(2^exprs(eset)[1:40,]), col = cm.colors(256), hclustfun=my_fct) # Plots heatmap with dendrograms.
- Replace in last step 'exprs(eset)[1:40,]' by matrix of differentially expressed genes from limma analysis.
- Convert the above commands into an R script and execute it with the source() function
source("myscript.R") # Executes all of the above commands and generates the corresponding output files in the current working directory.