R & Bioconductor Manual (一)
2012-02-29 10:31:28 Category: R&Bioconductor
http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#biocon_dgeIndex
 R Basics
 Introduction
 Finding Help
 Basics on Functions and Packages
 System commands under Linux
 Reading and Writing External Data
 R Objects
 Some Great R Functions
 Graphical Procedures
 Missing Values
 Writing Your Own Functions
 R Web Applications
 R Exercises
 HT Sequence Analysis with R and Bioconductor
 Programming in R
 BioConductor
 Introduction
 Finding Help
 Affy Packages
 Analysis of Differentially Expressed Genes
 Dual Color Array Packages
 Chromosome maps
 Gene Ontologies
 KEGG Pathway Analysis
 Motif Identification in Promoter Regions
 Phylogenetic Analysis
 Mining Drug-Like Compounds
 Protein Structure Analysis
 MS Data Analysis
 Genome-Wide Association Studies (GWAS)
 BioC Exercises
 Clustering and Data Mining in R
 Introduction
 Data Preprocessing
 Hierarchical Clustering (HC)
 Bootstrap Analysis in Hierarchical Clustering
 QT Clustering
 K-Means & PAM
 Fuzzy Clustering
 Self-Organizing Map (SOM)
 Principal Component Analysis (PCA)
 Multi-Dimensional Scaling (MDS)
 Bicluster Analysis
 Network Analysis
 Support Vector Machines (SVM)
 Similarity Measures for Clustering Results
 Clustering Exercises
 Administration
 R Basics
 Introduction
 General Overview
R (http://cran.at.r-project.org) is a comprehensive statistical environment and programming language for professional data analysis and graphical display. The associated BioConductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, sequence and genome analysis.
 Scope of this Manual
This R tutorial provides a condensed introduction to the usage of the R environment and its utilities for general data analysis and clustering. It also introduces a subset of packages from the BioConductor project. The included packages are a 'personal selection' of the author of this manual that does not reflect the full utility spectrum of the R/BioConductor projects. Many packages were chosen because the author uses them often for his own teaching and research. To obtain a broad overview of available R packages, it is strongly recommended to consult the official BioConductor and R project sites. Due to the rapid development of most packages, it is also important to be aware that this manual will often not be fully up-to-date. For this and many other reasons, it is absolutely critical to use the original documentation of each package (PDF manual or vignette) as the primary source of documentation. Users are welcome to send suggestions for improving this manual directly to its author.
 Format of this Manual
A practical copy & paste format, though not always easy to read, has been chosen throughout this manual. In this format all commands are represented in bold font followed by a short description of their usage in non-bold font. To save space, several commands are often concatenated on one line and separated with a semicolon ';'. All explanations start with the standard comment sign '#' to prevent them from being interpreted by R as commands. This way several commands can be pasted with their comment text into the R console to demo the different functions and analysis steps. Commands starting with a '$' sign need to be executed from a Unix or Linux shell. Windows users can simply ignore them. Information in red color is considered essential knowledge, while commands in green color are important for someone interested in a quick start with R and BioConductor. The remaining commands in black color often provide more in-depth information about a certain topic.
 Installation of the R Software and R Packages
 Basic R Usage
$ R # Starts the R console under Unix/Linux. The R GUI versions under Windows and Mac OS X can be started by double-clicking their icons.
object <- function(arguments) or object = function(arguments) # This general R command syntax uses the assignment operator '<-' (or '=') to assign the data generated by the command on its right to the object on its left. The '=' operator was introduced more recently. At the top level both have the same effect. The arrow can also point in the other direction ('value -> object'), whereas '=' only assigns to its left. For consistency, one should use only one of them.
assign("x", function(arguments)) # Has the same effect as above, but uses the assignment function instead of the assignment operator.
source("my_script") # Command to execute an R script, here 'my_script'. For example, generate a text file 'my_script' with the command 'print(1:100)', then execute it with the source function.
x <- edit(data.frame()) # Starts empty GUI spreadsheet editor for manual data entry.
x <- edit(x) # Opens existing data frame (table) 'x' in GUI spreadsheet editor.
x <- scan(what="character") # Lets you enter values from the keyboard or by copy&paste and assigns them to vector 'x'.
q() # Quits R console.
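As a minimal sketch of the assignment forms described above (the object names are made up for illustration), the following lines bind the same value in four ways and check that the results are identical:

```r
a <- 1:5              # arrow assignment
b = 1:5               # equals-sign assignment
assign("d", 1:5)      # assignment function; the object name is passed as a string
1:5 -> e              # the arrow can also point to the right
all_same <- identical(a, b) && identical(a, d) && identical(a, e)
all_same              # TRUE if all four objects hold the same value
```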
 R Startup Behavior
The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional)
 Finding Help
 Various online manuals are available on the R project site. Very useful manuals for beginners are: R Stats Tutorial, An Introduction to R, Quick-R, the manual simpleR - Using R for Introductory Statistics, R for Beginners, Kelly Black's R Tutorial, Kim Seefeld's R-introduction for Biostatistics, Peter Dalgaard's book Introductory Statistics with R and Applied Statistics for Bioinformatics using R by Wim Krijnen. To find basic functions and syntax structures quickly, one can consult The R Reference Card. Paul Murrell's book is a complete reference on R graphics. More manuals and documentation can be found on this R search site or R Seek. References on R programming are listed in the 'Programming in R' chapter of this manual. Documentation within R can be found with the following commands.

?function # or 'help(function)' opens documentation on a function
apropos(function) # Finds all functions containing a given term.
example(heatmap) # executes examples for function 'heatmap'
help.search("topic") # searches help system for documentation
RSiteSearch('regression', restrict='functions', matchesPerPage=100) # Searches for key words or phrases in the R-help mailing list archives, help pages, vignettes or task views, using the search engine at http://search.r-project.org, and shows the results in a web browser.
library() # shows available libraries on a system.
library(help=mypackage) or help(package=mypackage) # lists all functions/objects of a library.
help.start() # Starts local HTML interface. The link 'Packages' provides a list of all installed packages. After initiating 'help.start()' in a session the '?function' commands will open as HTML pages!
sessionInfo() # Prints version information about R and all loaded packages. The generated output should be provided when sending questions or bug reports to the R and BioC mailing lists.
$ R -h # or 'R --help'; provides help on R environment, more detailed information on page 90 of 'An Introduction to R'
 Basics on Functions and Packages
 R contains most arithmetic functions like mean, median, sum, prod, sqrt, length, log, etc. An extensive list of R functions can be found on the function and variable index page. Many R functions and datasets are stored in separate packages, which are only available after loading them into an R session. Information about installing new packages can be found in the administrative section of this manual.

library(my_package) # Loads a particular package.
library(help=mypackage) # Lists all functions/objects of a library.
search() # Lists which packages are currently loaded.
 Information and management of objects

ls() or objects() # Lists R objects created during session, they are stored in file '.RData' when exiting R and the workspace is saved.
rm(my_object1, my_object2, ...) # Removes objects.
rm(list = ls()) # Removes all objects without warning!
str(object) # Displays object types and structure of an R object.
ls.str(pattern="") # Lists object type info on all objects in a session.
print(ls.str(), max.level=0) # If a session contains very long list objects then one can simplify the output with this command.
lsf.str(pattern="") # Lists object type info on all functions in a session.
class(object) # Prints the object type.
mode(object) # Prints the storage mode of an object.
summary(object) # Generic summary info for all kinds of objects.
attributes(object) # Returns an object's attribute list.
gc() # Causes the garbage collection to take place. This is sometimes useful to clean up memory allocations after deleting large objects.
length(object) # Provides length of object.
.Last.value # Prints the value of the last evaluated expression.
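The inspection functions listed above can be tried on two small throwaway objects. This is only an illustrative sketch, and the names 'tmp_vec' and 'tmp_fac' are hypothetical:

```r
tmp_vec <- c(1.5, 2.5, 3.5)           # hypothetical numeric example vector
tmp_fac <- factor(c("a", "b", "a"))   # hypothetical example factor
vec_info <- c(class = class(tmp_vec), mode = mode(tmp_vec))
fac_info <- c(class = class(tmp_fac), mode = mode(tmp_fac))
vec_info; fac_info                    # a factor has class "factor" but mode "numeric"
str(tmp_fac)                          # compact display of the object structure
rm(tmp_vec)                           # removes the vector again
still_there <- exists("tmp_vec")      # FALSE once the object is gone
```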
 Reading and changing directories

dir() # Reads content of current working directory.
getwd() # Returns current working directory.
setwd("/home/user") # Changes current working directory to the specified directory.
 System commands under Linux
 R IN/OUTPUT & BATCH Mode
 One can redirect R input and output with '|', '>' and '<' from the Shell command line.
$ R --slave < my_infile > my_outfile # The argument '--slave' makes R run as 'quietly' as possible. This option is intended to support programs which use R to compute results for them. For example, if my_infile contains 'x <- c(1:100); x;' the result for this R expression will be written to 'my_outfile' (or STDOUT).
$ R CMD BATCH [options] my_script.R [outfile] # Syntax for running R programs in BATCH mode from the command-line. The output file lists the commands from the script file and their outputs. If no outfile is specified, the name of the input script is used with '.Rout' appended (here 'my_script.Rout'). To stop all the usual R command line information from being written to the outfile, add this as first line to my_script.R file: 'options(echo=FALSE)'. If the command is run like this 'R CMD BATCH --no-save my_script.R', then nothing will be saved in the .RData file, which can often get very large. More on this can be found on the help pages: '$ R CMD BATCH --help' or '> ?BATCH'.
$ echo 'sqrt(3)' | R --slave # Calculates sqrt of 3 in R and prints it to STDOUT.
 Executing Shell & Perl commands from R with 'system("")' function.
 Remember, single escapes (e.g. '\n') need to be double escaped in R (e.g. '\\n')
system("ls -al") # Prints content of current working directory.
system("perl -ne 'print if (/my_pattern1/ ? ($c=1) : (--$c > 0)); print if (/my_pattern2/ ? ($d = 1) : (--$d > 0));' my_infile.txt > my_outfile.txt") # Runs Perl one-liner that extracts two patterns from external text file and writes result into new file.
 Reading and Writing Data from/to Files
 Import

read.delim("clipboard", header=T) # Command to copy&paste tables from Excel or other programs into R. If the 'header' argument is set to FALSE, then the first line of the data set will not be used as column titles.
read.delim(pipe("pbpaste")) # Command to copy&paste on Mac OS X systems.
file.show("my_file") # prints content of file to screen, allows scrolling.
scan("my_file") # reads vector/array into vector from file or keyboard.
my_frame <- read.table(file="my_table") # Reads in table and assigns it to data frame.
my_frame <- read.table(file="my_table", header=TRUE, sep="\t") # Same as above, but with info on column headers and field separators. If you want to import the data in character mode, then include this argument: colClasses = "character".
my_frame <- read.delim("ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements2009729.txt", na.strings = "", fill=TRUE, header=T, sep="\t") # The function read.delim() is often more flexible for importing tables with empty fields and long character strings (e.g. gene descriptions).
cat(month.name, file="zzz.txt", sep="\n"); x <- readLines("zzz.txt"); x <- x[c(grep("^J", as.character(x), perl = TRUE))]; t(as.data.frame(strsplit(x,"u"))) # A somewhat more advanced example for retrieving specific lines from an external file with a regular expression. In this example an external file is created with the 'cat' function, all lines of this file are imported into a vector with 'readLines', the specific elements (lines) are then retrieved with the 'grep' function, and the resulting lines are split into subfields with 'strsplit'.
 Export

write.table(iris, "clipboard", sep="\t", col.names=NA, quote=F) # Command to copy&paste from R into Excel or other programs. It writes the data of an R data frame object into the clipboard from where it can be pasted into other applications.
zz <- pipe('pbcopy', 'w'); write.table(iris, zz, sep="\t", col.names=NA, quote=F); close(zz) # Command to copy&paste from R into Excel or other programs on Mac OS X systems.
write.table(my_frame, file="my_file", sep="\t", col.names = NA) # Writes data frame to a tab-delimited text file. The argument 'col.names = NA' makes sure that the titles align with columns when row/index names are exported (default).
save(x, file="my_file.txt"); load(file="my_file.txt") # Commands to save R object to an external file and to read it in again from this file.
files <- list.files(pattern="\\.txt$"); for(i in files) { x <- read.table(i, header=TRUE, row.names=1, comment.char = "A", sep="\t"); assign(print(i, quote=FALSE), x); write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) } # Batch import and export of many files. First, the *.txt file names in the current directory are assigned to a vector (the '$' sign anchors the pattern '.txt' to the end of the names). Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. Read ?read.table to understand arguments 'row.names=1' and 'comment.char = "A"'. Third, the data frames are exported using their names for file naming and appending '*.out'.
HTML(my_frame, file = "my_table.html") # Writes data frame to HTML table. Subsequent exports to the same file will arrange several tables in one HTML document. In order to access this function one needs to load the library 'R2HTML' first. This library is usually not installed by default.
write(x, file="my_file") # Writes matrix data to a file.
sink("My_R_Output") # redirects all subsequent R output to a file 'My_R_Output' without showing it in the R console anymore.
sink() # restores normal R output behavior.
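The export and import commands above can be combined into a round trip. The sketch below writes a small made-up data frame to a temporary tab-delimited file and reads it back. The object names and the use of 'tempfile()' are assumptions for illustration only:

```r
tmp <- tempfile(fileext = ".txt")                 # temporary file, removed again below
df_out <- data.frame(id = c("g1", "g2", "g3"), value = c(1.5, 2.5, 3.5),
                     row.names = c("r1", "r2", "r3"), stringsAsFactors = FALSE)
# 'col.names = NA' writes an empty header field above the row name column
write.table(df_out, file = tmp, sep = "\t", col.names = NA, quote = FALSE)
# 'row.names = 1' turns the first column back into row names
df_in <- read.table(tmp, header = TRUE, sep = "\t", row.names = 1,
                    stringsAsFactors = FALSE)
unlink(tmp)                                       # deletes the temporary file
df_in
```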
 Interfacing with Google Docs

 R Objects
 Data and Object Types
 Data Types
 Numeric data: 1, 2, 3
x <- c(1, 2, 3); x; is.numeric(x); as.character(x) # Creates a numeric vector, checks for the data type and converts it into a character vector.
 Character data: "a", "b" , "c"
x <- c("1", "2", "3"); x; is.character(x); as.numeric(x) # Creates a character vector, checks for the data type and converts it into a numeric vector.
 Complex data: 1, b, 3
 Logical data: TRUE, FALSE, TRUE
1:10 < 5 # Returns TRUE where x is < 5.
 Object Types in R
 vectors: ordered collection of numeric, character, complex and logical values.
 factors: special type vectors with grouping information of its components
 data frames: two dimensional structures with different data types
 matrices: two dimensional structures with data of same type
 arrays: multidimensional arrays of vectors
 lists: general form of vectors with different types of elements
 functions: piece of code
 Naming Rules
 Object, row and column names should not start with a number.
 Avoid spaces in object, row and column names.
 Avoid special characters like '#'.
 General Subsetting Rules
 Subsetting syntax:
my_object[row] # Subsetting of one dimensional objects, like vectors and factors.
my_object[row, col] # Subsetting of two dimensional objects, like matrices and data frames.
my_object[row, col, dim] # Subsetting of three dimensional objects, like arrays.
 There are three possibilities to subset data objects.
 Subsetting by positive or negative index/position numbers
my_object <- 1:26; names(my_object) <- LETTERS # Creates a sample vector with named elements.
my_object[1:4] # Returns the elements 1-4.
my_object[-c(1:4)] # Excludes elements 1-4.
 Subsetting by same length logical vectors
my_logical <- my_object > 10 # Generates a logical vector as example.
my_object[my_logical] # Returns the elements where my_logical contains TRUE values.
 Subsetting by field names
my_object[c("B", "K", "M")] # Returns the elements with element titles: B, K and M.
 Calling a single column or list component by its name with the '$' sign.
iris$Species # Returns the 'Species' column in the sample data frame 'iris'.
iris[,c("Species")] # Has the same effect as the previous step.
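The three subsetting possibilities described above can be compared side by side on one named vector. This is a small sketch with made-up object names:

```r
v <- 1:26; names(v) <- LETTERS     # named sample vector
by_pos   <- v[1:4]                 # positive index: keeps elements 1-4
by_neg   <- v[-(1:4)]              # negative index: drops elements 1-4
by_logic <- v[v > 23]              # logical vector of the same length as 'v'
by_name  <- v[c("B", "K", "M")]    # element names
by_pos; length(by_neg); by_logic; by_name
```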
 Basic Operators and Calculations
 Comparison operators
 equal: ==
 not equal: !=
 greater/less than: > <
 greater/less than or equal: >= <=
Example:
 Logical operators
 AND: &
x <- 1:10; y <- 10:1 # Creates the sample vectors 'x' and 'y'.
x > y & x > 5 # Returns TRUE where both comparisons return TRUE.
 OR: |
x == y | x != y # Returns TRUE where at least one comparison returns TRUE.
 NOT: !
!x > y # The '!' sign returns the negation (opposite) of a logical vector.
 Calculations
 Four basic arithmetic functions: addition, subtraction, multiplication and division
1 + 1; 1 - 1; 1 * 1; 1 / 1 # Returns the results of these calculations.
 Calculations on vectors
x <- 1:10; sum(x); mean(x); sd(x); sqrt(x) # Calculates for the vector x its sum, mean, standard deviation and square root. A list of the basic R functions can be found on the function and variable index page.
x <- 1:10; y <- 1:10; x + y # Calculates the sum for each element in the vectors x and y.
 Iterative calculations
apply(iris[,1:3], 1, mean) # Calculates for each row the mean of the columns 1-3 in the sample data frame 'iris'. With the argument setting '1', row-wise iterations are performed and with '2' column-wise iterations.
tapply(iris[,4], iris$Species, mean) # Calculates the mean values for the 4th column based on the grouping information in the 'Species' column in the 'iris' data frame.
sapply(x, sqrt) # Calculates the square root for each element in the vector x. Generates the same result as 'sqrt(x)'.
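To make the difference between the margin settings '1' and '2' concrete, the following sketch runs 'apply' in both directions on the built-in 'iris' data and adds a grouped mean with 'tapply'. The result names are chosen here for illustration:

```r
row_means   <- apply(iris[, 1:3], 1, mean)          # one mean per row (150 values)
col_means   <- apply(iris[, 1:3], 2, mean)          # one mean per column (3 values)
group_means <- tapply(iris[, 4], iris$Species, mean) # one mean per species
length(row_means); col_means; group_means
```

For plain row or column means, 'rowMeans' and 'colMeans' return the same numbers without an explicit iteration.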
 Assigning values to object components
zzz <- iris[,1:3]; zzz <- 0 # Reassignment syntax to create/replace an entire object.
zzz <- iris[,1:3]; zzz[,] <- 0 # Populates all fields in an object with zeros.
zzz <- iris[,1:3]; zzz[zzz < 4] <- 0 # Populates only specified fields with zeros.
 Vectors
 General information
 Vectors are ordered collections of 'atomic' (same data type) components or modes of the following four types: numeric, character, complex and logical. Missing values are indicated by 'NA'. R inserts them automatically in blank fields.

x <- c(2.3, 3.3, 4.4) # Example for creating a numeric vector with ordered collection of numbers using 'c' (combine values in vector) function.
z <- scan(file="my_file") # Reads data from file and assigns them to the vector 'z'.
vector[rows] # Syntax to access vector sections.
z <- 1:10; z; as.character(z) # The function 'as.character' changes the data mode from numeric to character.
as.numeric(character) # The function 'as.numeric' changes the data mode from character to numeric.
d <- as.integer(x); d # Transforms numeric data into integers.
x <- 1:100; sample(x, 5) # Selects a random sample of size of 5 from a vector.
x <- as.integer(runif(100, min=1, max=5)); sort(x); rev(sort(x)); order(x); x[order(x)] # Generates random integers from 1 to 4. The sort() function sorts the items by size. The rev() function reverses the order. The order() function returns the corresponding indices for a sorted object. The order() function is usually the one that needs to be used for sorting complex objects, such as data frames or lists.
x <- rep(1:10, times=2); x; unique(x) # The unique() function removes the duplicated entries in a vector.
sample(1:10, 5, replace=TRUE) # Returns randomly selected elements from a vector (here 5 numbers from 1 to 10). With 'replace=TRUE' the sampling is done with replacement, with 'replace=FALSE' (default) without.
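The relation between 'sort' and 'order' is easiest to see on a tiny example. The sketch below uses a made-up three element vector and a hypothetical data frame 'scores':

```r
x <- c(3, 1, 2)
sorted  <- sort(x)          # the values themselves in increasing order
idx     <- order(x)         # the positions that would sort 'x'
same    <- identical(sorted, x[idx])
# 'order' is what allows sorting a whole data frame by one of its columns
scores  <- data.frame(id = c("a", "b", "c"), score = x, stringsAsFactors = FALSE)
ranked  <- scores[order(scores$score), ]
sorted; idx; same; ranked
```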
 Sequences
 R has several facilities to create sequences of numbers:

1:30 # Generates a sequence of integers from 1 to 30.
letters; LETTERS; month.name; month.abb # Generates lower case letters, capital letters, month names and abbreviated month names, respectively.
2*1:10 # Creates a sequence of even numbers.
seq(1, 30, by=0.5) # Like 1:30, but with 0.5 increments.
seq(length=100, from=20, by=0.5) # Creates number sequence with specified start and length.
rep(LETTERS[1:8], times=5) # Replicates given sequence or vector x times.
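A quick way to check what these sequence commands produce is to look at their lengths and end points. The following lines are a sketch with arbitrary object names:

```r
s1 <- seq(1, 30, by = 0.5)                        # 1, 1.5, 2, ..., 30
s2 <- seq(from = 20, by = 0.5, length.out = 100)  # 100 values starting at 20
ev <- 2 * 1:10                                    # even numbers 2, 4, ..., 20
r  <- rep(LETTERS[1:3], times = 2)                # A B C A B C
length(s1); range(s2); ev; r
```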
 Character Vectors
paste(LETTERS[1:8], 1:12, sep="") # The command 'paste' merges vectors after converting to characters.
x <- paste(rep("A", times=12), 1:12, sep=""); y <- paste(rep("B", times=12), 1:12, sep=""); append(x,y) # Possibility to build plate location vector in R (better example under 'arrays').
 Subsetting Vectors
x <- 1:100; x[2:23] # Values in square brackets select vector range.
x <- 1:100; x[-(2:23)] # Prints everything except the values in square brackets.
x[5] <- 99 # Replaces value at position 5 with '99'.
x <- 1:10; y <- c(x[1:5],99,x[6:10]); y # Inserts new value at defined position of vector.
letters=="c" # Returns logical vector of FALSE and TRUE values.
which(rep(letters,2)=="c") # Returns index numbers where "c" occurs in the duplicated 'letters' vector. For retrieving indices of several strings provided by query vector, use the following 'match' function.
match(c("c","g"), letters) # Returns index numbers for "c" and "g" in the 'letters' vector. If the query vector (here 'c("c","g")') contains entries that are duplicated in the target vector, then this syntax returns only the first occurrence(s) for each duplicate. To retrieve the indices for all duplicates, use the following '%in%' function.
x <- rep(1:10, 2); y <- c(2,4,6); x %in% y # The operator '%in%' returns a logical vector. This syntax allows the subsetting of vectors and data frames with a query vector ('y') containing entries that are duplicated in the target vector ('x'). The resulting logical vector can be used for the actual subsetting step of vectors and data frames.
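The difference between 'match' and '%in%' for duplicated entries can be sketched as follows. The vectors 'target' and 'query' are made up for this example:

```r
target <- rep(letters[1:5], 2)            # a b c d e a b c d e
query  <- c("c", "e")
first_hits <- match(query, target)        # only the first position of each query entry
all_hits   <- which(target %in% query)    # every position holding a query entry
first_hits; all_hits
target[target %in% query]                 # logical subsetting returns all matches
```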
 Finding Identical and Non-Identical Entries between Vectors
intersect(month.name[1:4], month.name[3:7]) # Returns identical entries of two vectors.
month.name[1:4][month.name[1:4] %in% month.name[3:7]] # Returns the same result as in the previous step. The vector comparison with %in% returns first a logical vector of identical items that is then used to subset the first vector.
setdiff(x=month.name[1:4], y=month.name[3:7]); setdiff(month.name[3:7], month.name[1:4]) # Returns the unique entries occurring only in the first vector. Note: if the argument names are not used, as in the second example, then the order of the arguments is important.
union(month.name[1:4], month.name[3:7]) # Joins two vectors without duplicating identical entries.
x <- c(month.name[1:4], month.name[3:7]); x[duplicated(x)] # Returns duplicated entries.
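For reference, the set functions above give the following results on the two month ranges. This is a small sketch with hypothetical object names 'a' and 'b':

```r
a <- month.name[1:4]                   # January to April
b <- month.name[3:7]                   # March to July
common <- intersect(a, b)              # entries present in both vectors
only_a <- setdiff(a, b)                # entries found only in 'a'
only_b <- setdiff(b, a)                # entries found only in 'b'
joined <- union(a, b)                  # all entries, each listed once
common; only_a; only_b; joined
```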
 Other Vector Types
 see page 10 of R Introduction
 Factors
 Factors are vector objects that contain grouping (classification) information of its components.
animalf <- factor(animal <- c("dog", "cat", "mouse", "dog", "dog", "cat")) # Creates factor 'animalf' from vector 'animal'.
animalf # Prints out factor 'animalf', this lists first all components and then the different levels (unique entries); alternatively one can print only levels with 'levels(animalf)'.
animalfr <- table(animalf); animalfr # Creates frequency table for levels.
 Function 'tapply' applies calculation on all members (replicates) of a level.

weight <- c(102, 50, 5, 101, 103, 52) # Creates new vector with weight values for 'animalf' (both need to have same length).
weight_mean <- tapply(weight, animalf, mean) # Applies function (length, mean, median, sum, a user-defined standard error function, etc.) to all level values; 'length' provides the number of entries (replicates) in each level. The result is stored under a new name to avoid masking the function 'mean'.
 Function 'cut' divides a numeric vector into size intervals.

y <- 1:200; interval <- cut(y, right=F, breaks=c(1, 2, 6, 11, 21, 51, 101, length(y)+1), labels=c("1","2-5","6-10", "11-20", "21-50", "51-100", ">=101")); table(interval) # Prints the counts for the specified size intervals (breaks) in the numeric vector: 1:200.
plot(interval, ylim=c(0,110), xlab="Intervals", ylab="Count", col="green"); text(labels=as.character(table(interval)), x=seq(0.7, 8, by=1.2), y=as.vector(table(interval))+2) # Plots the size interval counts as bar diagram.
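The factor functions from this section can be combined in one short sketch. It reuses the animal and weight values from above, while the size classes and their break points are assumptions made only for this example:

```r
animal <- factor(c("dog", "cat", "mouse", "dog", "dog", "cat"))
weight <- c(102, 50, 5, 101, 103, 52)
counts <- table(animal)                      # replicates per level
avg    <- tapply(weight, animal, mean)       # mean weight per level
# hypothetical size classes: [0,10), [10,100) and [100,Inf)
size   <- cut(weight, breaks = c(0, 10, 100, Inf), right = FALSE,
              labels = c("small", "medium", "large"))
counts; avg; table(size)
```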
 Matrices and Arrays
 Matrices are two dimensional data objects consisting of rows and columns. Arrays are similar, but they can have one, two or more dimensions. In contrast to data frames (see below), one can store only a single data type in the same object (e.g. numeric or character).
x <- matrix(1:30, 3, 10, byrow = T) # Lays out vector (1:30) in 3 by 10 matrix. The argument 'byrow' defines whether the matrix is filled by row or columns.
dim(x) <- c(3,5,2) # Transforms above matrix into multidimensional array.
x <- array(1:25, dim=c(5,5)) # Creates 5 by 5 array ('x') and fills it with values 1-25.
y <- c(x) # Writes array into vector.
x[c(1:5),3] # writes values from 3rd column into vector structure.
mean(x[c(1:5),3]) # calculates mean of 3rd column.
 Subsetting matrices and arrays

array[rows, columns] # Syntax to access columns and rows of matrices and twodimensional arrays.
as.matrix(iris) # Many functions in R require matrices as input. If something doesn't work then try to convert the object into a matrix with the as.matrix() function.
i <- array(c(1:5,5:1),dim=c(5,2)) # Creates 5 by 2 index array ('i') and fills it with the values 1-5, 5-1.
x[i] # Extracts the corresponding elements from 'x' that are indexed by 'i'.
x[i] <- 0 # Replaces those elements by zeros.
array1 <- array(scan(file="my_array_file", sep="\t"), c(4,3)) # Reads data from 'my_array_file' and writes it into 4 by 3 array.
 Subsetting arrays with more than two dimensions

array[rows, columns, dimensions] # Syntax to access columns, rows and dimensions in arrays with more than two dimensions.
x <- array(1:250, dim=c(10,5,5)); x[2:5,3,] # Example to generate 10x5x5 array and how to retrieve slices from all subarrays.
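Indexing with a two-column index array and slicing a three dimensional array can be sketched like this. The object names are chosen for the example, and the index array is built with 'cbind' as one possible way:

```r
m <- array(1:25, dim = c(5, 5))          # 5 by 5 array filled column by column
i <- cbind(row = 1:5, col = 5:1)         # each row of 'i' addresses one cell of 'm'
anti_diag <- m[i]                        # cells m[1,5], m[2,4], ..., m[5,1]
m[i] <- 0                                # replaces exactly those cells
a3 <- array(1:250, dim = c(10, 5, 5))
slice <- a3[2:5, 3, ]                    # rows 2-5 of column 3 from every subarray
anti_diag; sum(m); dim(slice)
```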
 Merging arrays: example for building location tables for microtiter plates:

Z <- array(1:12, dim=c(12,8)); X <- array(c("A","B","C","D","E","F","G","H"), dim=c(8,12)); Y <- paste(t(X), Z, sep=""); Y # Creates a vector with the 96 well labels A1-A12, B1-B12, ..., H1-H12.
M <- array(Y, c(96,1)) # Writes the 96 well labels into one column.
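As an alternative sketch, not taken from the original scripts, the same 96 well labels can be generated with 'outer', which also keeps the 8 by 12 plate layout as a matrix. The names 'plate' and 'wells' are hypothetical:

```r
rows  <- LETTERS[1:8]                          # plate rows A-H
cols  <- 1:12                                  # plate columns 1-12
plate <- outer(rows, cols, paste, sep = "")    # 8 by 12 matrix of well labels
dimnames(plate) <- list(rows, cols)
wells <- as.vector(t(plate))                   # row-wise order: A1-A12, B1-B12, ...
plate[1:2, 1:3]; wells[c(1, 12, 13, 96)]
```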
 Script for mapping 24/48/96 to 384 well plates

source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/384_96_48_24conversion.txt") # Prints plate mappings for 24/48/96 well plates to 384 well plates.
 Script for mapping 384 to 1536 well plates

source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/1536_384conversion.txt") # Prints plate mappings for 384 well plates to 1536 well plates.
 Appending arrays and matrices

cbind(matrix1, matrix2) # Appends columns of matrices with same number of rows.
rbind(matrix1, matrix2) # Appends rows of matrices with same number of columns.
c(array1, array2) # Clears dimension attributes from arrays and concatenates them in vector format. The function 'as.vector(my_array1)' achieves similar result.
 Calculations between arrays

Z <- array(1:12, dim=c(12,8)); X <- array(12:1, dim=c(12,8)) # Creates arrays Z and X with same dimension.
calarray <- Z/X # Divides Z/X elements and assigns result to array 'calarray'.
t(my_array) # Transposes 'my_array'; a more flexible transpose function is 'aperm(my_array, perm)'.
 Information about arrays

dim(X); nrow(X); ncol(X) # returns number of rows and columns of array 'X'.
 Data Frames
 Data frames are two dimensional data objects that are composed of rows and columns. They are very similar to matrices. The main difference is that data frames can store different data types, whereas matrices allow only one data type (e.g. numeric or character).

 Constructing data frames
my_frame <- data.frame(y1=rnorm(12),y2=rnorm(12),y3=rnorm(12),y4=rnorm(12)) # Creates data frame with four columns of 12 random numbers each.
rownames(my_frame) <- month.name[1:12] # Assigns row (index) names. These indices need to be unique.
names(my_frame) <- c("y4", "y3", "y2", "y1") # Assigns new column titles.
names(my_frame)[c(1,2)] <- c("y3", "y4") # Changes titles of specific columns.
my_frame <- data.frame(IND=row.names(my_frame), my_frame) # Generates new column with title "IND" containing the row names.
my_frame[,2:5]; my_frame[,-1] # Different possibilities to remove column(s) from a data frame.
 Accessing and slicing data frame sections
my_frame[rows, columns] # Generic syntax to access columns and rows in data frames.
dim(my_frame) # Gives dimensions of data frame.
length(my_frame); length(my_frame$y1) # Provides number of columns or rows of data frame, respectively.
colnames(my_frame); rownames(my_frame) # Gives column and row names of data frame.
row.names(my_frame) # Prints row names or indexing column of data frame.
my_frame[order(my_frame$y2, decreasing=TRUE), ] # Sorts the rows of a data frame by the specified columns, here 'y2'; for increasing order use 'decreasing=FALSE'. In addition to the 'order()' function, there are: 'sort(x)' for vectors, 'rev(x)' for vector in decreasing order and 'sort.list(x)' for vector sequences.
my_frame[order(my_frame[,4], -my_frame[,3]),] # Subsequent subsorts can be performed by adding further columns to 'order()'. The sort direction of a numeric column is reversed by changing the sign of the argument.
my_frame$y1 # Notation to print entire column of a data frame as vector or factor.
my_frame$y1[2:4] # Notation to access column element(s) of a data frame.
v <- my_frame[,4]; v[3] # Notation for returning the value of an individual cell. In this example the corresponding column is first assigned to a vector and then the desired field is accessed by its index number.
my_frame[1:5,] # Notation to view only the first five rows of all columns.
my_frame[,1:2] # Notation to view all rows of the first two columns.
my_frame[,c(1,3)] # Notation to view all rows of the specified columns.
my_frame[1:5,1:2] # Notation to view only the first five rows of the columns 1-2.
my_frame["August",] # Notation to retrieve row values by their index name (here "August").
x <- data.frame(row.names=LETTERS[1:10], letter=letters[1:10],Month=month.name[1:10]); x; match(c("c","g"), x[,1]) # Returns matching index numbers of data frame or vector using 'match' function. This syntax returns for duplicates only the index of their first occurrence. To return all, use the following syntax.
x[x[,2] %in% month.name[3:7],] # Subsets a data frame with a query vector using the '%in%' function. This returns all occurrences of duplicates.
as.vector(as.matrix(my_frame[1,])) # Returns one row entry of a data frame as vector.
as.data.frame(my_list) # Returns a list object as data frame if the list components can be converted into a two dimensional object.
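The sorting and slicing notations above can be tested on a small deterministic data frame. The following sketch uses made-up values and the hypothetical name 'small_frame':

```r
small_frame <- data.frame(y1 = c(3, 1, 2, 1), y2 = c(10, 40, 30, 20),
                          row.names = c("Jan", "Feb", "Mar", "Apr"))
by_y2_desc  <- small_frame[order(small_frame$y2, decreasing = TRUE), ]  # sort by y2, largest first
by_y1_y2    <- small_frame[order(small_frame$y1, -small_frame$y2), ]    # y1 increasing, ties by y2 decreasing
one_row     <- small_frame["Mar", ]                                     # row selected by its index name
hits        <- small_frame[small_frame$y1 %in% c(1), ]                  # all rows where y1 equals 1
by_y2_desc; by_y1_y2; one_row; hits
```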
 Calculations on data frames
summary(my_frame) # Prints a handy summary for a data frame.
mean(my_frame) # Calculates the mean for all columns.
data.frame(my_frame, mean=apply(my_frame[,2:5], 1, mean), ratio=(my_frame[,2]/my_frame[,3])) # The apply function performs rowwise or columnwise calculations on data frames or matrices (here mean and ratios). The results are returned as vectors. In this example, they are appended to the original data frame with the data.frame function. The argument '1' in the apply function specifies rowwise calculations. If '2' is selected, then the calculations are performed columnwise.
aggregate(my_frame, by=list(c("G1","G1","G1","G1","G2","G2","G2","G2","G3","G3","G3","G4")), FUN=mean) # The aggregate function can be used to compute the mean (or any other stats) for data groups specified under the argument 'by'.
cor(my_frame[,2:4]); cor(t(my_frame[,2:4])) # Syntax to calculate a correlation matrix based on allagainstall rows and allagainstall columns.
x < matrix(rnorm(48), 12, 4, dimnames=list(month.name, paste("t", 1:4, sep=""))); corV < cor(x["August",], t(x), method="pearson"); y < cbind(x, correl=corV[1,]); y[order(y[,5]), ] # Commands to perform a correlation search and ranking across all rows (matrix object required here). First, an example matrix 'x' is created. Second, the correlation values for the "August" row against all other rows are calculated. Finally, the resulting vector with the correlation values is merged with the original matrix 'x' and sorted by decreasing correlation values.
merge(frame1, frame2, by.x = "frame1col_name", by.y = "frame2col_name", all = TRUE) # Merges two data frames (tables) by common field entries. To obtain only the common rows, change 'all = TRUE' to 'all = FALSE'. To merge on row names (indices), refer to it with "row.names" or '0', e.g.: 'by.x = "row.names", by.y = "row.names"'.
my_frame1 <- data.frame(title1=month.name[1:8], title2=1:8); my_frame2 <- data.frame(title1=month.name[4:12], title2=4:12); merge(my_frame1, my_frame2, by.x = "title1", by.y = "title1", all = TRUE) # Example for merging two data frames by common field.
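The effect of the 'all' argument of 'merge' can be seen in a minimal sketch. The frame names 'f1' and 'f2' and the column names 'value1' and 'value2' are hypothetical; the value columns are named differently here so that 'merge' does not have to add '.x' and '.y' suffixes.

```r
# Two hypothetical example frames that share the months April to August.
f1 <- data.frame(title1 = month.name[1:8],  value1 = 1:8)
f2 <- data.frame(title1 = month.name[4:12], value2 = 4:12)

# all = TRUE keeps every row of both frames and fills the gaps with NA.
outer_join <- merge(f1, f2, by = "title1", all = TRUE)

# all = FALSE keeps only the rows whose key occurs in both frames.
inner_join <- merge(f1, f2, by = "title1", all = FALSE)

nrow(outer_join)
nrow(inner_join)
```

The 'NA' entries in 'outer_join' mark the months that are present in only one of the two frames.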
 Fast computations on large data frames and matrices
myDF <- as.data.frame(matrix(rnorm(100000), 10000, 10)) # Creates an example data frame.
myCol <- c(1,1,1,2,2,2,3,3,4,4); myDFmean <- t(aggregate(t(myDF), by=list(myCol), FUN=mean, na.rm=TRUE)[,-1]); colnames(myDFmean) <- tapply(names(myDF), myCol, paste, collapse="_"); myDFmean[1:4,] # The aggregate function can be used to perform calculations (here mean) across rows for any combination of column selections (here myCol) after transposing the data frame. The '[,-1]' index removes the grouping column added by aggregate. However, this will be very slow for data frames with millions of rows.
myList <- tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list); names(myList) <- sapply(myList, paste, collapse="_"); myDFmean <- sapply(myList, function(x) sapply(as.data.frame(t(myDF[,x])), mean)); myDFmean[1:4,] # Faster alternative for performing the aggregate computation of the previous step. However, this step will still be very slow for very large data sets, due to the sapply loop over the row elements. Note: the inner 'sapply(..., mean)' replaces the older 'mean(as.data.frame(...))' syntax, which no longer works in current R versions.
myList <- tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list); myDFmean <- sapply(myList, function(x) rowSums(myDF[,x])/length(x)); colnames(myDFmean) <- sapply(myList, paste, collapse="_"); myDFmean[1:4,] # By using the rowSums or rowMeans functions one can perform the above aggregate computations 100-1000 times faster by avoiding a loop over the rows altogether.
myDFsd <- sqrt((rowSums((myDF - rowMeans(myDF))^2)) / (length(myDF) - 1)); myDFsd[1:4] # Similarly, one can compute the standard deviation for large data frames by avoiding loops. This approach is about 100 times faster than loop-based alternatives such as apply(myDF, 1, sd). Note: 'sd(t(myDF))' computed column-wise values only in older R versions; in current R it returns a single value.
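To check that the loop-free computations give the same values as the loop-based ones, the following sketch compares both on a small example. It assumes a reduced data frame of 100 rows and uses 'split' instead of 'tapply' to build the column groups; the object names are hypothetical.

```r
set.seed(1)
# Small example data frame (100 rows, 10 columns) so that the check runs quickly.
myDF <- as.data.frame(matrix(rnorm(1000), 100, 10))

# Column names grouped by a grouping vector; split() returns a named list.
groups <- split(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4))

# Group-wise row means: vectorized version versus a loop over the rows with apply().
fast_means <- sapply(groups, function(cols) rowMeans(myDF[, cols]))
loop_means <- sapply(groups, function(cols) apply(myDF[, cols], 1, mean))

# Row-wise standard deviation: vectorized formula versus apply().
fast_sd <- sqrt(rowSums((myDF - rowMeans(myDF))^2) / (length(myDF) - 1))
loop_sd <- apply(myDF, 1, sd)

# Both comparisons should report TRUE (equality up to numerical tolerance).
all.equal(fast_means, loop_means, check.attributes = FALSE)
all.equal(unname(fast_sd), unname(loop_sd))
```

Wrapping each variant in 'system.time()' on a larger data frame gives a rough idea of the speed difference on a given machine.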
 Regular expressions

gsub('(i.*a)', '\\1_xxx', iris$Species, perl = TRUE) # Example of using a regular expression with a back reference to substitute one pattern with another. Remember: a single backslash '\' needs to be written as a double backslash '\\' in R strings.
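How the back reference works can be followed on a short character vector. The vector 'species' and the string "gene_id" below are hypothetical inputs used only for illustration.

```r
# Only strings containing an 'i' that is followed later by an 'a' match '(i.*a)'.
species <- c("setosa", "versicolor", "virginica")
tagged <- gsub("(i.*a)", "\\1_xxx", species, perl = TRUE)
tagged

# Two groups can be referenced as \\1 and \\2, here to swap the parts around '_'.
swapped <- sub("([a-z]+)_([a-z]+)", "\\2_\\1", "gene_id", perl = TRUE)
swapped
```

In the first call, '\\1' re-inserts the text captured by the parentheses, so the matched part is kept and '_xxx' is appended to it.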
 Conditional selections

my_frame[!duplicated(my_frame[,2]),] # Removes rows with duplicated values in selected column.
my_frame[my_frame$y2 > my_frame$y3,] # Prints all rows of data frame where the values of column 'y2' are greater than those of column 'y3'. Comparison operators are: == (equal), != (not equal), >= (greater than or equal), etc. Logical operators are & (and), | (or) and ! (not).
x <- 0.5:10; x[x < 1.0] <- 1/x[x < 1.0] # Replaces all values in vector or data frame that are below 1 with their reciprocal value.
x <- data.frame(month=month.abb[1:12], AB=LETTERS[1:2], no1=1:48, no2=1:24); x[x$month == "Apr" & (x$no1 == x$no2 | x$no1 > x$no2),] # Prints all records of frame 'x' that contain 'Apr' AND have equal values in columns 'no1' and 'no2' OR have greater values in column 'no1'.
x[x[,1] %in% c("Jun", "Aug"),] # Retrieves rows with column matches specified in a query vector.
x[c(grep("\\d{2}", as.character(x$no1), perl = TRUE)),] # Possibility to print out all rows of a data frame where a regular expression matches (here all double digit values in col 'no1').
x[rowSums(sapply(x, function(col) grepl("\\d{2}", as.character(col), perl = TRUE))) > 0,] # Same as above, but searches all columns (1-4) by applying the pattern to each column with 'sapply' and keeping the rows with at least one match.
z <- data.frame(chip1=letters[1:25], chip2=letters[25:1], chip3=letters[1:25]); z; y <- apply(z, 1, function(x) sum(x == "m") > 2); z[y,] # Identifies in a data frame ('z') all those rows that contain a certain number of identical fields (here more than two 'm' entries).
z <- data.frame(chip1=1:25, chip2=25:1, chip3=1:25); c <- data.frame(z, count=apply(z[,1:3], 1, FUN = function(x) sum(x >= 5))); c # Counts in each row of a data frame the number of fields that are above or below a specified value and appends this information to the data frame. By default, the count for rows with "NA" values is returned as NA. To work around this limitation, one can replace the NA fields with a value that doesn't affect the result, e.g.: x[is.na(x)] <- 1.
x <- data.frame(matrix(rep(c("P","A","M"), length.out=50),10,5)); x; index <- x == "P"; cbind(x, Pcount=rowSums(index)); x[rowSums(index)>=2,] # Example of how one can count occurrences of strings across rows. In this example the occurrences of "P" in a data frame of "PMA" values are counted by converting to a data frame of logical values and then counting the 'TRUE' occurrences with the 'rowSums' function, which does in this example the same as: 'cbind(x, count=apply(x=="P",1,sum))'.
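The handling of missing values in such row-wise counts can be sketched as follows. The data frame 'z' is a hypothetical three-row example, and the 'na.rm' argument of 'rowSums' is used here as an alternative to replacing the NA fields.

```r
# Hypothetical example frame with missing values in rows 2 and 3.
z <- data.frame(chip1 = c(1, 6, NA), chip2 = c(7, 8, 2), chip3 = c(9, NA, 3))

# Without na.rm, any row containing an NA yields NA as its count.
count_default <- rowSums(z >= 5)

# With na.rm = TRUE, NA fields are skipped and only the remaining fields are counted.
count_narm <- rowSums(z >= 5, na.rm = TRUE)

# Keep the rows in which at least two fields are >= 5.
selected <- z[count_narm >= 2, ]

count_default
count_narm
selected
```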
 Lists
 Lists are ordered collections of objects that can be of different modes (e.g. numeric vector, array, etc.). A name can be assigned to each list component.

my_list <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9)) # Example of how to create a list.
attributes(my_list); names(my_list) # Prints attributes and names of all list components.
my_list[2]; my_list[[2]] # Returns component 2 of the list. The '[[..]]' operator returns the object stored in this component, while '[..]' returns a single-component list.
my_list$wife # Named components in lists can also be called with their name. Command returns same component as 'my_list[[2]]'.
my_list[[4]][2] # Returns from the fourth list component the second entry.
length(my_list[[4]]) # Prints the length of the fourth list component.
my_list$wife <- 1:12 # Replaces the content of the existing component with new content.
my_list$wife <- NULL # Deletes list component 'wife'.
my_list <- c(my_list, list(my_title2=month.name[1:12])) # Appends new list component to existing list.
my_list <- c(my_name1=my_list1, my_name2=my_list2, my_name3=my_list3) # Concatenates lists.
my_list <- c(my_title1=my_list[[1]], list(my_title2=month.name[1:12])) # Concatenates existing and new components into a new list.
unlist(my_list); data.frame(unlist(my_list)); matrix(unlist(my_list)); data.frame(my_list) # Useful commands to convert lists into other objects, such as vectors, data frames or matrices.
my_frame <- data.frame(y1=rnorm(12), y2=rnorm(12), y3=rnorm(12), y4=rnorm(12)); my_list <- apply(my_frame, 1, list); my_list <- lapply(my_list, unlist); my_list # Stores the rows of a data frame as separate vectors in a list container.
mylist <- list(a=letters[1:10], b=letters[10:1], c=letters[1:3]); lapply(names(mylist), function(x) c(x, mylist[[x]])) # Syntax to access list component names with the looping function 'lapply'. In this example the list component names are prepended to the corresponding vectors.
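The difference between the '[..]' and '[[..]]' operators, and the effect of deleting a component, can be verified with a short sketch. It reuses the example list from above; the object names 'sub_list', 'ages' and 'sizes' are hypothetical.

```r
my_list <- list(name = "Fred", wife = "Mary", no.children = 3, child.ages = c(4, 7, 9))

# '[..]' keeps the list container, '[[..]]' extracts the stored object itself.
sub_list <- my_list[4]
ages <- my_list[[4]]
class(sub_list)
class(ages)

# Assigning NULL removes the component; the remaining components keep their names.
my_list$wife <- NULL
names(my_list)

# sapply() loops over the components, here to report the length of each one.
sizes <- sapply(my_list, length)
sizes
```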