
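云之南 (South of the Clouds)

The sound of wind, of rain, of reading aloud: every sound reaches the ear; affairs of family, of state, of the world: every affair deserves concern.

About the blogger

Background: computer science. Research interests: JavaEE web software development, bioinformatics, data mining and machine learning, intelligent information systems. Current work: genomes, transcriptomes, NGS high-throughput data analysis, biological data mining, plant phylogenetics and comparative evolutionary genomics.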


R & Bioconductor Manual (一)  

2012-02-29 10:31:28 | Category: R&Bioconductor

http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#biocon_dge

Index

  1. R Basics
    1. Introduction
    2. Finding Help
    3. Basics on Functions and Packages
    4. System commands under Linux
    5. Reading and Writing External Data
    6. R Objects
    7. Some Great R Functions
    8. Graphical Procedures
    9. Missing Values
    10. Writing Your Own Functions
    11. R Web Applications
    12. R Exercises
  2. HT Sequence Analysis with R and Bioconductor
  3. Programming in R
  4. BioConductor
    1. Introduction
    2. Finding Help
    3. Affy Packages
    4. Analysis of Differentially Expressed Genes
    5. Dual Color Array Packages
    6. Chromosome maps
    7. Gene Ontologies
    8. KEGG Pathway Analysis
    9. Motif Identification in Promoter Regions
    10. Phylogenetic Analysis
    11. Mining Drug-Like Compounds
    12. Protein Structure Analysis
    13. MS Data Analysis
    14. Genome-Wide Association Studies (GWAS)
    15. BioC Exercises
  5. Clustering and Data Mining in R
    1. Introduction
    2. Data Preprocessing
    3. Hierarchical Clustering (HC)
    4. Bootstrap Analysis in Hierarchical Clustering
    5. QT Clustering
    6. K-Means & PAM
    7. Fuzzy Clustering
    8. Self-Organizing Map (SOM)
    9. Principal Component Analysis (PCA)
    10. Multi-Dimensional Scaling (MDS)
    11. Bicluster Analysis
    12. Network Analysis
    13. Support Vector Machines (SVM)
    14. Similarity Measures for Clustering Results
    15. Clustering Exercises
  6. Administration

  1. R Basics
    1. Introduction
      General Overview
        R (http://cran.at.r-project.org) is a comprehensive statistical environment and programming language for professional data analysis and graphical display. The associated BioConductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, sequence and genome analysis.
      Scope of this Manual
        This R tutorial provides a condensed introduction to the usage of the R environment and its utilities for general data analysis and clustering. It also introduces a subset of packages from the BioConductor project. The included packages are a 'personal selection' of the author of this manual that does not reflect the full utility spectrum of the R/BioConductor projects. Many packages were chosen because the author uses them often for his own teaching and research. To obtain a broad overview of available R packages, it is strongly recommended to consult the official BioConductor and R project sites. Due to the rapid development of most packages, it is also important to be aware that this manual will often not be fully up-to-date. For this and many other reasons, it is absolutely critical to use the original documentation of each package (PDF manual or vignette) as the primary source of documentation. Users are welcome to send suggestions for improving this manual directly to its author.
      Format of this Manual
        A not always very easy to read, but practical copy & paste format has been chosen throughout this manual. In this format all commands are represented in bold font followed by a short description of their usage in non-bold font. To save space, often several commands are concatenated on one line and separated with a semicolon ';'. All explanations start with the standard comment sign '#' to prevent them from being interpreted by R as commands. This way several commands can be pasted with their comment text into the R console to demo the different functions and analysis steps. Commands starting with a '$' sign need to be executed from a Unix or Linux shell. Windows users can simply ignore them. Information in red color is considered essential knowledge, while commands in green color are important for someone interested in a quick start with R and BioConductor. The remaining commands in black color provide often more in-depth information about a certain topic.
      Installation of the R Software and R Packages
      Basic R Usage
        $ R # Starts the R console under Unix/Linux. The R GUI versions under Windows and Mac OS X can be started by double-clicking their icons.
        object <- function(arguments) or object = function(arguments) # This general R command syntax uses the assignment operator '<-' (or the more recently introduced '=') to assign the data generated by the command on its right to the object on its left. At the top level both work the same way. The arrow can also be reversed ('value -> object'), whereas '=' only assigns from right to left. For consistency reasons one should use only one of them.
        assign("x", function(arguments)) # Has the same effect as above, but uses the assignment function instead of the assignment operator.
        source("my_script") # Command to execute an R script, here 'my_script'. For example, generate a text file 'my_script' with the command 'print(1:100)', then execute it with the source function.
        x <- edit(data.frame()) # Starts empty GUI spreadsheet editor for manual data entry.
        x <- edit(x) # Opens existing data frame (table) 'x' in GUI spreadsheet editor.
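        x <- scan(what="character") # Lets you enter values from the keyboard or by copy&paste and assigns them to vector 'x' as character strings ('w="c"' is a partially matched short form of the same argument). Omit 'what' to read numeric values.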
        q() # Quits R console.
      R Startup Behavior
        The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional)

    2. Finding Help
        Various online manuals are available on the R project site. Very useful manuals for beginners are: R Stats Tutorial, An Introduction to R, Quick-R, the manual simpleR - Using R for Introductory Statistics, R for Beginners, Kelly Black's R Tutorial, Kim Seefeld's R-introduction for Biostatistics, Peter Dalgaard's book Introductory Statistics with R and Applied Statistics for Bioinformatics using R by Wim Krijnen. To find basic functions and syntax structures quickly, one can consult The R Reference Card. Paul Murrell's book is a complete reference on R graphics. More manuals and documentation can be found on this R search site or R Seek. References on R programming are listed in the 'Programming in R' chapter of this manual. Documentation within R can be found with the following commands.
        ?function # or 'help(function)' opens documentation on a function
        apropos("term") # Finds all objects (including functions) whose names contain the given term, e.g. apropos("mean").
        example(heatmap) # executes examples for function 'heatmap'
        help.search("topic") # searches help system for documentation
        RSiteSearch('regression', restrict='functions', matchesPerPage=100) # Search for key words or phrases in the R-help mailing list archives, help pages, vignettes or task views, using the search engine at http://search.r-project.org and view them in a web browser.
        library() # shows available libraries on a system.
        library(help=mypackage) or help(package=mypackage) # lists all functions/objects of a library.
        help.start() # Starts local HTML interface. The link 'Packages' provides a list of all installed packages. After initiating 'help.start()' in a session the '?function' commands will open as HTML pages!
        sessionInfo() # Prints version information about R and all loaded packages. The generated output should be provided when sending questions or bug reports to the R and BioC mailing lists.
        $ R -h # or 'R --help'; provides help on R environment, more detailed information on page 90 of 'An Introduction to R'

    3. Basics on Functions and Packages
        R contains most arithmetic functions like mean, median, sum, prod, sqrt, length, log, etc. An extensive list of R functions can be found on the function and variable index page. Many R functions and datasets are stored in separate packages, which are only available after loading them into an R session. Information about installing new packages can be found in the administrative section of this manual.
        library(my_package) # Loads a particular package.
        library(help=mypackage) # Lists all functions/objects of a library.
        search() # Lists which packages are currently loaded.
      Information and management of objects
        ls() or objects() # Lists R objects created during session, they are stored in file '.RData' when exiting R and the workspace is saved.
        rm(my_object1, my_object2, ...) # Removes objects.
        rm(list = ls()) # Removes all objects without warning!
        str(object) # Displays object types and structure of an R object.
        ls.str(pattern="") # Lists object type info on all objects in a session.
        print(ls.str(), max.level=0) # If a session contains very long list objects then one can simplify the output with this command.
        lsf.str(pattern="") # Lists object type info on all functions in a session.
        class(object) # Prints the object type.
        mode(object) # Prints the mode (basic data type) of an object; 'storage.mode(object)' returns its storage mode.
        summary(object) # Generic summary info for all kinds of objects.
        attributes(object) # Returns an object's attribute list.
        gc() # Causes the garbage collection to take place. This is sometimes useful to clean up memory allocations after deleting large objects.
        length(object) # Provides length of object.
        .Last.value # Prints the value of the last evaluated expression.
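
The object-management commands above can be tried without touching the current workspace. The following is a minimal sketch that works in a scratch environment; the object names 'counts' and 'labels' are made up for illustration.

```r
# Work inside a scratch environment so the global workspace is untouched.
env <- new.env()
assign("counts", c(5L, 3L, 8L), envir = env)      # integer vector
assign("labels", c("a", "b", "c"), envir = env)   # character vector

ls(env)                      # names of the objects stored in 'env'
class(get("counts", env))    # "integer"
mode(get("counts", env))     # "numeric"
length(get("labels", env))   # 3

rm("labels", envir = env)    # remove one object by name
ls(env)                      # only "counts" is left
```

The same calls without the 'envir' argument act on the global workspace, which is what 'ls()' and 'rm()' do in the command list above.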
      Reading and changing directories
        dir() # Reads content of current working directory.
        getwd() # Returns current working directory.
        setwd("/home/user") # Changes current working directory to the specified directory.

    4. System commands under Linux
      R IN/OUTPUT & BATCH Mode

      One can redirect R input and output with '|', '>' and '<' from the Shell command line.
        $ R --slave < my_infile > my_outfile # The argument '--slave' makes R run as 'quietly' as possible (from R 4.0.0 on the same option is also called '--no-echo'). This option is intended to support programs which use R to compute results for them. For example, if my_infile contains 'x <- c(1:100); x;' the result for this R expression will be written to 'my_outfile' (or STDOUT).
        $ R CMD BATCH [options] my_script.R [outfile] # Syntax for running R programs in BATCH mode from the command-line. The output file lists the commands from the script file and their outputs. If no outfile is specified, the name of the input file is used, with a possible '.R' extension stripped and '.Rout' appended (here 'my_script.Rout'). To stop all the usual R command line information from being written to the outfile, add this as first line to my_script.R file: 'options(echo=FALSE)'. If the command is run like this 'R CMD BATCH --no-save my_script.R', then nothing will be saved in the .RData file, which can often get very large. More on this can be found on the help pages: '$ R CMD BATCH --help' or '> ?BATCH'.
        $ echo 'sqrt(3)' | R --slave # calculates sqrt of 3 in R and prints it to STDOUT.
      Executing Shell & Perl commands from R with 'system("")' function.

      Remember, single escapes (e.g. '\n') need to be double escaped in R (e.g. '\\n')
        system("ls -al") # Prints content of current working directory.
        system("perl -ne 'print if (/my_pattern1/ ? ($c=1) : (--$c > 0)); print if (/my_pattern2/ ? ($d = 1) : (--$d > 0));' my_infile.txt > my_outfile.txt") # Runs Perl one-liner that extracts two patterns from external text file and writes result into new file.

    5. Reading and Writing Data from/to Files
      Import
        read.delim("clipboard", header=T) # Command to copy&paste tables from Excel or other programs into R. If the 'header' argument is set to FALSE, then the first line of the data set will not be used as column titles.
        read.delim(pipe("pbpaste")) # Command to copy&paste on Mac OS X systems.
        file.show("my_file") # prints content of file to screen, allows scrolling.
        scan("my_file") # reads vector/array into vector from file or keyboard.
        my_frame <- read.table(file="my_table") # reads in table and assigns it to data frame.
        my_frame <- read.table(file="my_table", header=TRUE, sep="\t") # Same as above, but with info on column headers and field separators. If you want to import the data in character mode, then include this argument: colClasses = "character".
        my_frame <- read.delim("ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements-2009-7-29.txt", na.strings = "", fill=TRUE, header=T, sep="\t") # The function read.delim() is often more flexible for importing tables with empty fields and long character strings (e.g. gene descriptions).
        cat(month.name, file="zzz.txt", sep="\n"); x <- readLines("zzz.txt"); x <- x[c(grep("^J", as.character(x), perl = TRUE))]; t(as.data.frame(strsplit(x,"u"))) # A somewhat more advanced example for retrieving specific lines from an external file with a regular expression. In this example an external file is created with the 'cat' function, all lines of this file are imported into a vector with 'readLines', the specific elements (lines) are then retrieved with the 'grep' function, and the resulting lines are split into sub-fields with 'strsplit'.
      Export
        write.table(iris, "clipboard", sep="\t", col.names=NA, quote=F) # Command to copy&paste from R into Excel or other programs. It writes the data of an R data frame object into the clipboard from where it can be pasted into other applications.
        zz <- pipe('pbcopy', 'w'); write.table(iris, zz, sep="\t", col.names=NA, quote=F); close(zz) # Command to copy&paste from R into Excel or other programs on Mac OS X systems.
        write.table(my_frame, file="my_file", sep="\t", col.names = NA) # Writes data frame to a tab-delimited text file. The argument 'col.names = NA' makes sure that the titles align with columns when row/index names are exported (default).
        save(x, file="my_file.txt"); load(file="my_file.txt") # Commands to save R object to an external file and to read it in again from this file.
        files <- list.files(pattern="\\.txt$"); for(i in files) { x <- read.table(i, header=TRUE, row.names=1, comment.char = "A", sep="\t"); assign(print(i, quote=FALSE), x); write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) } # Batch import and export of many files. First, the *.txt file names in the current directory are assigned to the vector 'files' (the '$' sign anchors the pattern '.txt' to the end of the names and '\\.' matches a literal dot). Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. Read ?read.table to understand arguments 'row.names=1' and 'comment.char = "A"'. Third, the data frames are exported using their names for file naming and appending '.out'.
        HTML(my_frame, file = "my_table.html") # Writes data frame to HTML table. Subsequent exports to the same file will arrange several tables in one HTML document. In order to access this function one needs to load the library 'R2HTML' first. This library is usually not installed by default.
        write(x, file="my_file") # Writes matrix data to a file.
        sink("My_R_Output") # redirects all subsequent R output to a file 'My_R_Output' without showing it in the R console anymore.
        sink() # restores normal R output behavior.
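
As an illustration of how the export and import commands fit together, here is a small round-trip sketch. It assumes nothing beyond base R; the temporary file name is generated by R and 'iris_back' is just an illustrative object name.

```r
# Round trip: export a data frame as tab-delimited text, then re-import it.
out_file <- tempfile(fileext = ".txt")   # temporary file, name chosen by R

# 'col.names = NA' writes an empty header field above the row-name column,
# so that the titles stay aligned with their columns.
write.table(iris[1:5, ], file = out_file, sep = "\t",
            col.names = NA, quote = FALSE)

# 'row.names = 1' tells read.table to use the first column as row names.
iris_back <- read.table(out_file, header = TRUE, sep = "\t", row.names = 1)

dim(iris_back)                                              # 5 5
all.equal(iris_back$Sepal.Length, iris$Sepal.Length[1:5])   # TRUE
unlink(out_file)                                            # clean up
```

Without 'row.names = 1' the exported row names would come back as an ordinary first data column.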
      Interfacing with Google Docs

    6. R Objects
      Data and Object Types
      Data Types
      • Numeric data: 1, 2, 3
        • x <- c(1, 2, 3); x; is.numeric(x); as.character(x) # Creates a numeric vector, checks for the data type and converts it into a character vector.
      • Character data: "a", "b" , "c"
        • x <- c("1", "2", "3"); x; is.character(x); as.numeric(x) # Creates a character vector, checks for the data type and converts it into a numeric vector.
      • Complex data (complex numbers): 1+2i, 3-1i
      • Logical data: TRUE, FALSE, TRUE
        • 1:10 < 5 # Returns TRUE for each element that is < 5.
      Object Types in R
      • vectors: ordered collection of numeric, character, complex and logical values.
      • factors: special type vectors with grouping information of its components
      • data frames: two dimensional structures with different data types
      • matrices: two dimensional structures with data of same type
      • arrays: multidimensional arrays of vectors
      • lists: general form of vectors with different types of elements
      • functions: piece of code
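
To make the object types listed above concrete, the following sketch creates one small object of each kind and asks R for its class. All object names are arbitrary.

```r
v  <- c(1.5, 2, 3)                                    # numeric vector
f  <- factor(c("low", "high", "low"))                 # factor with two levels
m  <- matrix(1:6, nrow = 2)                           # 2 x 3 integer matrix
a  <- array(1:24, dim = c(2, 3, 4))                   # three-dimensional array
df <- data.frame(id = 1:3, name = c("a", "b", "c"))   # columns of different types
l  <- list(v, f, "text")                              # list with mixed elements

# Ask R what each object is (first class entry only).
sapply(list(v = v, f = f, m = m, a = a, df = df, l = l),
       function(obj) class(obj)[1])

# Mixing types in one vector forces a common type (here: character).
c(1, "a", TRUE)   # "1" "a" "TRUE"
```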
      Naming Rules
      • Object, row and column names should not start with a number.
      • Avoid spaces in object, row and column names.
      • Avoid special characters like '#'.

      General Subsetting Rules
      Subsetting syntax:
          my_object[row] # Subsetting of one dimensional objects, like vectors and factors.
          my_object[row, col] # Subsetting of two dimensional objects, like matrices and data frames.
          my_object[row, col, dim] # Subsetting of three dimensional objects, like arrays.
      There are three possibilities to subset data objects.
      1. Subsetting by positive or negative index/position numbers
        • my_object <- 1:26; names(my_object) <- LETTERS # Creates a sample vector with named elements.
          my_object[1:4] # Returns the elements 1-4.
          my_object[-c(1:4)] # Excludes elements 1-4.
      2. Subsetting by same length logical vectors
        • my_logical <- my_object > 10 # Generates a logical vector as example.
          my_object[my_logical] # Returns the elements where my_logical contains TRUE values.
      3. Subsetting by field names
        • my_object[c("B", "K", "M")] # Returns the elements with element titles: B, K and M.
      Calling a single column or list component by its name with the '$' sign.
          iris$Species # Returns the 'Species' column in the sample data frame 'iris'.
          iris[,c("Species")] # Has the same effect as the previous step.
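
The three subsetting styles described above are interchangeable when they address the same elements. A short sketch using the same named sample vector:

```r
my_object <- 1:26
names(my_object) <- LETTERS

by_index   <- my_object[c(2, 11, 13)]                   # 1. position numbers
by_logical <- my_object[my_object %in% c(2, 11, 13)]    # 2. logical vector of same length
by_name    <- my_object[c("B", "K", "M")]               # 3. element names

identical(by_index, by_name)     # TRUE
identical(by_logical, by_name)   # TRUE

# Negative indices drop elements instead of selecting them.
length(my_object[-(1:4)])        # 22
```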

      Basic Operators and Calculations
      Comparison operators
      • equal: ==
      • not equal: !=
      • greater/less than: > <
      • greater/less than or equal: >= <=
      • Example:
          1 == 1 # Returns TRUE.
      Logical operators
      • AND: &
        • x <- 1:10; y <- 10:1 # Creates the sample vectors 'x' and 'y'.
          x > y & x > 5 # Returns TRUE where both comparisons return TRUE.
      • OR: |
        • x == y | x != y # Returns TRUE where at least one comparison returns TRUE.
      • NOT: !
        • !x > y # The '!' sign returns the negation (opposite) of a logical vector.
      Calculations
      • Four basic arithmetic functions: addition, subtraction, multiplication and division
        • 1 + 1; 1 - 1; 1 * 1; 1 / 1 # Returns the results of these calculations.
      • Calculations on vectors
        • x <- 1:10; sum(x); mean(x); sd(x); sqrt(x) # Calculates for the vector x its sum, mean and standard deviation, and the square root of each element. A list of the basic R functions can be found on the function and variable index page.
          x <- 1:10; y <- 1:10; x + y # Calculates the sum for each element in the vectors x and y.
      • Iterative calculations
        • apply(iris[,1:3], 1, mean) # Calculates for each row the mean across the columns 1-3 of the sample data frame 'iris'. With the argument setting '1', row-wise iterations are performed and with '2' column-wise iterations.
          tapply(iris[,4], iris$Species, mean) # Calculates the mean values for the 4th column based on the grouping information in the 'Species' column in the 'iris' data frame.
          sapply(x, sqrt) # Calculates the square root for each element in the vector x. Generates the same result as 'sqrt(x)'.
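
The difference between the row-wise and column-wise settings of 'apply', and the relation of 'tapply' and 'sapply' to them, can be checked with a few lines. This is only a sketch on the built-in 'iris' data; 'num', 'row_means' and 'col_means' are illustrative names.

```r
num <- iris[, 1:4]                      # the four numeric columns of 'iris'

row_means <- apply(num, 1, mean)        # MARGIN = 1: one value per row
col_means <- apply(num, 2, mean)        # MARGIN = 2: one value per column

length(row_means)                       # 150, one mean per flower
all.equal(col_means, colMeans(num))     # TRUE

# tapply: mean petal width within each species
tapply(iris$Petal.Width, iris$Species, mean)

# sapply over a vector gives the same result as the vectorized call
identical(sapply(1:10, sqrt), sqrt(1:10))   # TRUE
```

For plain means and sums, 'rowMeans', 'colMeans', 'rowSums' and 'colSums' do the same job as 'apply' and are usually faster.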
      Assigning values to object components
          zzz <- iris[,1:3]; zzz <- 0 # Reassignment syntax to create/replace an entire object.
          zzz <- iris[,1:3]; zzz[,] <- 0 # Populates all fields in an object with zeros.
          zzz <- iris[,1:3]; zzz[zzz < 4] <- 0 # Populates only specified fields with zeros.

      Vectors
      General information
      Vectors are ordered collections of 'atomic' (same data type) components or modes of the following four types: numeric, character, complex and logical. Missing values are indicated by 'NA'. R inserts them automatically in blank fields.
        x <- c(2.3, 3.3, 4.4) # Example for creating a numeric vector with ordered collection of numbers using 'c' (combine values in vector) function.
        z <- scan(file="my_file") # Reads data from file and assigns them to the vector 'z'.
        vector[rows] # Syntax to access vector sections.
        z <- 1:10; z; as.character(z) # The function 'as.character' changes the data mode from numeric to character.
        as.numeric(character) # The function 'as.numeric' changes the data mode from character to numeric.
        d <- as.integer(x); d # Transforms numeric data into integers.
        x <- 1:100; sample(x, 5) # Selects a random sample of size of 5 from a vector.
        x <- as.integer(runif(100, min=1, max=5)); sort(x); rev(sort(x)); order(x); x[order(x)] # Generates random integers from 1 to 4. The sort() function sorts the items by size. The rev() function reverses the order. The order() function returns the corresponding indices for a sorted object. The order() function is usually the one that needs to be used for sorting complex objects, such as data frames or lists.
        x <- rep(1:10, times=2); x; unique(x) # The unique() function removes the duplicated entries in a vector.
        sample(1:10, 5, replace=TRUE) # Returns a set of randomly selected elements from a vector (here 5 numbers from 1 to 10), in this case with replacement; set 'replace=FALSE' to sample without replacement.
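
Because 'sort' and 'order' are easy to confuse, here is a small sketch of how they relate. The vector and the data frame 'df' with its 'gene' and 'score' columns are invented for the example.

```r
x <- c(30, 10, 20)
names(x) <- c("a", "b", "c")

sort(x)        # values in increasing order: b=10, c=20, a=30
order(x)       # 2 3 1: positions of the smallest, second smallest, ... element
x[order(x)]    # same result as sort(x)
rev(sort(x))   # decreasing order

# order() can also sort one object by another, which sort() cannot do.
df <- data.frame(gene = c("g1", "g2", "g3"), score = c(0.2, 0.9, 0.5))
df_sorted <- df[order(df$score, decreasing = TRUE), ]
df_sorted$gene   # "g2" "g3" "g1"
```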
      Sequences
      R has several facilities to create sequences of numbers:
        1:30 # Generates a sequence of integers from 1 to 30.
        letters; LETTERS; month.name; month.abb # Generates lower case letters, capital letters, month names and abbreviated month names, respectively.
        2*1:10 # Creates a sequence of even numbers.
        seq(1, 30, by=0.5) # Generates a sequence from 1 to 30 like '1:30', but with 0.5 increments.
        seq(length=100, from=20, by=0.5) # Creates number sequence with specified start and length.
        rep(LETTERS[1:8], times=5) # Replicates the given sequence or vector the specified number of times (here 5).
      Character Vectors
          paste(LETTERS[1:8], 1:12, sep="") # The command 'paste' merges vectors after converting to characters.
          x <- paste(rep("A", times=12), 1:12, sep=""); y <- paste(rep("B", times=12), 1:12, sep=""); append(x,y) # possibility to build plate location vector in R (better example under 'arrays').
      Subsetting Vectors
          x <- 1:100; x[2:23] # Values in square brackets select vector range.
          x <- 1:100; x[-(2:23)] # Prints everything except the values in square brackets.
          x[5] <- 99 # Replaces value at position 5 with '99'.
          x <- 1:10; y <- c(x[1:5],99,x[6:10]); y # Inserts new value at defined position of vector.
          letters=="c" # Returns a logical vector that is TRUE at the position of "c" and FALSE elsewhere.
          which(rep(letters,2)=="c") # Returns the index numbers where "c" occurs in the duplicated 'letters' vector (here 3 and 29). For retrieving indices of several strings provided by query vector, use the following 'match' function.
          match(c("c","g"), letters) # Returns index numbers for "c" and "g" in the 'letters' vector. If the query vector (here 'c("c","g")') contains entries that are duplicated in the target vector, then this syntax returns only the first occurrence for each duplicate. To retrieve the indices for all duplicates, use the following '%in%' function.
          x <- rep(1:10, 2); y <- c(2,4,6); x %in% y # The function '%in%' returns a logical vector. This syntax allows the subsetting of vectors and data frames with a query vector ('y') containing entries that are duplicated in the target vector ('x'). The resulting logical vector can be used for the actual subsetting step of vectors and data frames.
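
The difference between 'match' and '%in%' for duplicated entries is easiest to see side by side. A minimal sketch with an invented target vector:

```r
target <- rep(letters[1:5], 2)   # "a".."e" twice, so every entry is duplicated
query  <- c("c", "e")

match(query, target)             # 3 5: first hit per query entry only
which(target %in% query)         # 3 5 8 10: every hit
target[target %in% query]        # "c" "e" "c" "e"

# Entries missing from the target give NA with match() and FALSE with %in%.
match("z", target)               # NA
"z" %in% target                  # FALSE
```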
      Finding Identical and Non-Identical Entries between Vectors
          intersect(month.name[1:4], month.name[3:7]) # Returns identical entries of two vectors.
          month.name[month.name %in% month.name[3:7]] # Returns the same result as in the previous step. The vector comparison with %in% returns first a logical vector of identical items that is then used to subset the first vector.
          setdiff(x=month.name[1:4], y=month.name[3:7]); setdiff(month.name[3:7], month.name[1:4]) # Returns the unique entries occurring only in the first vector. Note: if the argument names are not used, as in the second example, then the order of the arguments is important.
          union(month.name[1:4], month.name[3:7]) # Joins two vectors without duplicating identical entries.
          x <- c(month.name[1:4], month.name[3:7]); x[duplicated(x)] # Returns duplicated entries.
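
The set functions above can be checked against each other. The sketch below uses the same two month ranges, stored under the illustrative names 'a' and 'b'.

```r
a <- month.name[1:4]   # January .. April
b <- month.name[3:7]   # March .. July

intersect(a, b)        # "March" "April"
setdiff(a, b)          # "January" "February"
setdiff(b, a)          # "May" "June" "July"
union(a, b)            # January .. July, 7 entries

# The two differences plus the intersection make up the union.
setequal(union(a, b), c(setdiff(a, b), intersect(a, b), setdiff(b, a)))   # TRUE
```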
      Other Vector Types
      see page 10 of R Introduction

      Factors
      Factors are vector objects that contain grouping (classification) information for their components.
          animalf <- factor(animal <- c("dog", "cat", "mouse", "dog", "dog", "cat")) # Creates factor 'animalf' from vector 'animal'.
          animalf # Prints out factor 'animalf', this lists first all components and then the different levels (unique entries); alternatively one can print only levels with 'levels(animalf)'.
          animalfr <- table(animalf); animalfr # Creates frequency table for levels.
      Function 'tapply' applies calculation on all members (replicates) of a level.
          weight <- c(102, 50, 5, 101, 103, 52) # Creates new vector with weight values for 'animalf' (both need to have same length).
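          mean_weight <- tapply(weight, animalf, mean) # Applies a function (length, mean, median, sum, sd, etc.) to the values of each level; 'length' provides the number of entries (replicates) in each level. The result is stored under a name other than 'mean' to avoid masking the function of that name.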
      Function 'cut' divides a numeric vector into size intervals.
          y <- 1:200; interval <- cut(y, right=F, breaks=c(1, 2, 6, 11, 21, 51, 101, length(y)+1), labels=c("1","2-5","6-10", "11-20", "21-50", "51-100", ">=101")); table(interval) # Prints the counts for the specified size intervals (breaks) in the numeric vector: 1:200.
          plot(interval, ylim=c(0,110), xlab="Intervals", ylab="Count", col="green"); text(labels=as.character(table(interval)), x=seq(0.7, 8, by=1.2), y=as.vector(table(interval))+2) # Plots the size interval counts as bar diagram.
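
The factor commands above can be condensed into a small worked example with known results. The 'size_class' object and its three intervals are chosen only to keep the counts easy to verify by hand.

```r
animal  <- c("dog", "cat", "mouse", "dog", "dog", "cat")
animalf <- factor(animal)
weight  <- c(102, 50, 5, 101, 103, 52)

levels(animalf)                   # "cat" "dog" "mouse" (sorted alphabetically)
table(animalf)                    # cat 2, dog 3, mouse 1
tapply(weight, animalf, mean)     # cat 51, dog 102, mouse 5

# cut() with right = FALSE builds intervals closed on the left:
# [1,2), [2,6) and [6,11)
size_class <- cut(1:10, breaks = c(1, 2, 6, 11), right = FALSE,
                  labels = c("1", "2-5", "6-10"))
table(size_class)                 # 1, 4 and 5 values per class
```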

      Matrices and Arrays
      Matrices are two dimensional data objects consisting of rows and columns. Arrays are similar, but they can have one, two or more dimensions. In contrast to data frames (see below), one can store only a single data type in the same object (e.g. numeric or character).
          x <- matrix(1:30, 3, 10, byrow = T) # Lays out vector (1:30) in 3 by 10 matrix. The argument 'byrow' defines whether the matrix is filled by row or columns.
          dim(x) <- c(3,5,2) # transforms above matrix into multidimensional array.
          x <- array(1:25, dim=c(5,5)) # creates 5 by 5 array ('x') and fills it with values 1-25.
          y <- c(x) # writes array into vector.
          x[c(1:5),3] # returns the values of the 3rd column as a vector.
          mean(x[c(1:5),3]) # calculates mean of 3rd column.
      Subsetting matrices and arrays
          array[rows, columns] # Syntax to access columns and rows of matrices and two-dimensional arrays.
          as.matrix(iris) # Many functions in R require matrices as input. If something doesn't work then try to convert the object into a matrix with the as.matrix() function.
          i <- array(c(1:5,5:1),dim=c(5,2)) # Creates 5 by 2 index array ('i') and fills it with the values 1-5, 5-1. Each row of 'i' gives the row and column position of one element.
          x[i] # Extracts the corresponding elements from 'x' that are indexed by 'i'.
          x[i] <- 0 # replaces those elements by zeros.
          array1 <- array(scan(file="my_array_file", sep="\t"), c(4,3)) # Reads data from 'my_array_file' and writes it into 4 by 3 array.
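
Subsetting with an index matrix treats every row of the index as one (row, column) coordinate. A minimal sketch, assuming a 5 by 5 array and an index that walks along its anti-diagonal; 'picked' is an illustrative name:

```r
x <- array(1:25, dim = c(5, 5))   # 5 x 5, filled column by column

# Each row of 'i' is one (row, column) pair: (1,5), (2,4), (3,3), (4,2), (5,1).
i <- cbind(1:5, 5:1)

picked <- x[i]
picked          # 21 17 13 9 5

x[i] <- 0       # overwrite exactly those five cells
sum(x == 0)     # 5
```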
      Subsetting arrays with more than two dimensions
          array[rows, columns, dimensions] # Syntax to access columns, rows and dimensions in arrays with more than two dimensions.
          x <- array(1:250, dim=c(10,5,5)); x[2:5,3,] # Example to generate 10x5x5 array and how to retrieve slices from all sub-arrays.
      Merging arrays: example for building location tables for microtiter plates:
          Z <- array(1:12, dim=c(12,8)); X <- array(c("A","B","C","D","E","F","G","H"), dim=c(8,12)); Y <- paste(t(X), Z, sep=""); Y # Creates a character vector with the 96 well labels in the order A1-A12, B1-B12, ..., H1-H12.
          M <- array(Y, c(96,1)) # Writes the 96 labels into a single column.
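
The same 96 well labels can also be built with 'rep', which some readers may find easier to follow than the transposed arrays. This is an alternative sketch, not the manual's own approach; 'wells' and 'plate' are illustrative names.

```r
rows <- LETTERS[1:8]
cols <- 1:12

# Repeat each row letter 12 times and cycle the column numbers 8 times.
wells <- paste(rep(rows, each = 12), rep(cols, times = 8), sep = "")
length(wells)      # 96
head(wells, 3)     # "A1" "A2" "A3"
wells[13]          # "B1"

# Arrange the labels in plate layout: 8 rows (A-H) by 12 columns (1-12).
plate <- matrix(wells, nrow = 8, ncol = 12, byrow = TRUE,
                dimnames = list(rows, as.character(cols)))
plate["C", "7"]    # "C7"
```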
      Script for mapping 24/48/96 to 384 well plates
          source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/384_96_48_24conversion.txt") # Prints plate mappings for 24/48/96 well plates to 384 well plates.
      Script for mapping 384 to 1536 well plates
          source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/1536_384conversion.txt") # Prints plate mappings for 384 well plates to 1536 well plates.
      Appending arrays and matrices
          cbind(matrix1, matrix2) # Appends columns of matrices with same number of rows.
          rbind(matrix1, matrix2) # Appends rows of matrices with same number of columns.
          c(array1, array2) # Clears dimension attributes from arrays and concatenates them in vector format. The function 'as.vector(my_array1)' achieves similar result.
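
A short sketch of the three ways of combining matrices described above, using two made-up 2 by 3 matrices:

```r
m1 <- matrix(1:6, nrow = 2)     # 2 x 3
m2 <- matrix(7:12, nrow = 2)    # 2 x 3

dim(cbind(m1, m2))              # 2 6: columns appended, row count must match
dim(rbind(m1, m2))              # 4 3: rows appended, column count must match

combined <- c(m1, m2)           # dimensions are dropped
combined                        # plain vector 1, 2, ..., 12
is.null(dim(combined))          # TRUE
```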
      Calculations between arrays
          Z <- array(1:12, dim=c(12,8)); X <- array(12:1, dim=c(12,8)) # Creates arrays Z and X with same dimension.
          calarray <- Z/X # Divides Z/X elements and assigns result to array 'calarray'.
          t(my_array) # Transposes 'my_array'; a more flexible transpose function is 'aperm(my_array, perm)'.
      Information about arrays
          dim(X); nrow(X); ncol(X) # returns number of rows and columns of array 'X'.

      Data Frames
      Data frames are two dimensional data objects that are composed of rows and columns. They are very similar to matrices. The main difference is that data frames can store different data types, whereas matrices allow only one data type (e.g. numeric or character).

      Constructing data frames
          my_frame <- data.frame(y1=rnorm(12),y2=rnorm(12),y3=rnorm(12),y4=rnorm(12)) # Creates a data frame with four columns (y1-y4), each holding 12 normally distributed random numbers.
          rownames(my_frame) <- month.name[1:12] # Assigns row (index) names. These indices need to be unique.
          names(my_frame) <- c("y4", "y3", "y2", "y1") # Assigns new column titles.
          names(my_frame)[c(1,2)] <- c("y3", "y4") # Changes titles of specific columns.
          my_frame <- data.frame(IND=row.names(my_frame), my_frame) # Generates new column with title "IND" containing the row names.
          my_frame[,2:5]; my_frame[,-1] # Different possibilities to remove column(s) from a data frame.
      Accessing and slicing data frame sections
          my_frame[rows, columns] # Generic syntax to access columns and rows in data frames.
          dim(my_frame) # Gives dimensions of data frame.
          length(my_frame); length(my_frame$y1) # Provides number of columns or rows of data frame, respectively.
          colnames(my_frame); rownames(my_frame) # Gives column and row names of data frame.
          row.names(my_frame) # Prints row names or indexing column of data frame.
          my_frame[order(my_frame$y2, decreasing=TRUE), ] # Sorts the rows of a data frame by the specified columns, here 'y2'; for increasing order use 'decreasing=FALSE'. In addition to the 'order()' function, there are: 'sort(x)' for sorting vectors, 'rev(x)' for reversing the order of a vector and 'sort.list(x)', which returns the ordering indices of a vector like 'order()'.
          my_frame[order(my_frame[,4], -my_frame[,3]),] # Subsequent sub-sorts can be performed by changing sign of the argument.
          my_frame$y1 # Notation to print entire column of a data frame as vector or factor.
          my_frame$y1[2:4] # Notation to access column element(s) of a data frame.
          v <-my_frame[,4]; v[3] # Notation for returning the value of an individual cell. In this example the corresponding column is first assigned to a vector and then the desired field is accessed by its index number.
          my_frame[1:5,] # Notation to view only the first five rows of all columns.
          my_frame[,1:2] # Notation to view all rows of the first two columns.
          my_frame[,c(1,3)] # Notation to view all rows of the specified columns.
          my_frame[1:5,1:2] # Notation to view only the first five rows of the columns 1-2.
          my_frame["August",] # Notation to retrieve row values by their index name (here "August").
          x <- data.frame(row.names=LETTERS[1:10], letter=letters[1:10],Month=month.name[1:10]); x; match(c("c","g"), x[,1]) # Returns matching index numbers of data frame or vector using 'match' function. For duplicates, this syntax returns only the index of their first occurrence. To return all, use the following syntax.
          x[x[,2] %in% month.name[3:7],] # Subsets a data frame with a query vector using the '%in%' function. This returns all occurrences of duplicates.
          as.vector(as.matrix(my_frame[1,])) # Returns one row entry of a data frame as vector.
          as.data.frame(my_list) # Returns a list object as data frame if the list components can be converted into a two dimensional object.
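
Two of the points above are easy to verify on small inputs: 'match' reports only the first hit while '%in%' finds all of them, and a minus sign inside 'order' reverses the sort direction of that key. This is a small sketch with made-up example data ('ids', 'grp' and 'val' are hypothetical names).

```r
# Hypothetical vector containing a duplicated entry.
ids <- c("a", "b", "c", "b", "d", "b")
first_hit <- match("b", ids)        # index of the first occurrence only
all_hits  <- which(ids %in% "b")    # indices of every occurrence

# Hypothetical data frame for a two-key sort.
df <- data.frame(grp = c(2, 1, 2, 1), val = c(10, 30, 40, 20))
# 'grp' ascending, then 'val' descending within each group.
sorted <- df[order(df$grp, -df$val), ]

print(first_hit)
print(all_hits)
print(sorted)
```

The sign trick only works for numeric sort keys; for character columns, 'order' accepts a 'decreasing' argument instead.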

      Calculations on data frames
          summary(my_frame) # Prints a handy summary for a data frame.
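          colMeans(my_frame[,2:5]) # Calculates the mean for all numeric columns. Note: 'mean(my_frame)' no longer works on data frames in current R versions and returns NA with a warning.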
          data.frame(my_frame, mean=apply(my_frame[,2:5], 1, mean), ratio=(my_frame[,2]/my_frame[,3])) # The apply function performs row-wise or column-wise calculations on data frames or matrices (here mean and ratios). The results are returned as vectors. In this example, they are appended to the original data frame with the data.frame function. The argument '1' in the apply function specifies row-wise calculations. If '2' is selected, then the calculations are performed column-wise.
          aggregate(my_frame[,2:5], by=list(c("G1","G1","G1","G1","G2","G2","G2","G2","G3","G3","G3","G4")), FUN=mean) # The aggregate function can be used to compute the mean (or any other stats) for data groups specified under the argument 'by'. Only the numeric columns are selected here, because the mean of the non-numeric "IND" column is not defined.
          cor(my_frame[,2:4]); cor(t(my_frame[,2:4])) # Syntax to calculate a correlation matrix based on all-against-all columns and all-against-all rows, respectively.
          x <- matrix(rnorm(48), 12, 4, dimnames=list(month.name, paste("t", 1:4, sep=""))); corV <- cor(x["August",], t(x), method="pearson"); y <- cbind(x, correl=corV[1,]); y[order(-y[,5]), ] # Commands to perform a correlation search and ranking across all rows (matrix object required here). First, an example matrix 'x' is created. Second, the correlation values for the "August" row against all other rows are calculated. Finally, the resulting vector with the correlation values is merged with the original matrix 'x' and sorted by decreasing correlation values.
          merge(frame1, frame2, by.x = "frame1col_name", by.y = "frame2col_name", all = TRUE) # Merges two data frames (tables) by common field entries. To obtain only the common rows, change 'all = TRUE' to 'all = FALSE'. To merge on row names (indices), refer to it with "row.names" or '0', e.g.: 'by.x = "row.names", by.y = "row.names"'.
          my_frame1 <- data.frame(title1=month.name[1:8], title2=1:8); my_frame2 <- data.frame(title1=month.name[4:12], title2=4:12); merge(my_frame1, my_frame2, by.x = "title1", by.y = "title1", all = TRUE) # Example for merging two data frames by common field.
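
The effect of the 'all' argument in 'merge' and the grouping done by 'aggregate' can be seen on tiny tables. The sketch below uses invented data frames ('left', 'right', 'scores'); all names and values are assumptions for illustration.

```r
# Two hypothetical tables sharing the key column 'id'.
left  <- data.frame(id = c("A", "B", "C"), x = 1:3)
right <- data.frame(id = c("B", "C", "D"), y = 4:6)

# all = TRUE keeps every id from both tables; missing fields become NA.
outer_join <- merge(left, right, by = "id", all = TRUE)
# all = FALSE keeps only the ids present in both tables.
inner_join <- merge(left, right, by = "id", all = FALSE)

# Group means: one grouping label per row of 'scores'.
scores <- data.frame(value = c(1, 3, 10, 20))
group_means <- aggregate(scores,
                         by = list(group = c("G1", "G1", "G2", "G2")),
                         FUN = mean)

print(outer_join)
print(inner_join)
print(group_means)
```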
      Fast computations on large data frames and matrices
          myDF <- as.data.frame(matrix(rnorm(100000), 10000, 10)) # Creates an example data frame.
          myCol <- c(1,1,1,2,2,2,3,3,4,4); myDFmean <- t(aggregate(t(myDF), by=list(myCol), FUN=mean, na.rm=TRUE)[,-1]); colnames(myDFmean) <- tapply(names(myDF), myCol, paste, collapse="_"); myDFmean[1:4,] # The aggregate function can be used to perform calculations (here mean) across rows for any combination of column selections (here myCol) after transposing the data frame. However, this will be very slow for data frames with millions of rows.
          myList <- tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list); names(myList) <- sapply(myList, paste, collapse="_"); myDFmean <- sapply(myList, function(x) sapply(as.data.frame(t(myDF[,x])), mean)); myDFmean[1:4,] # Alternative to the aggregate computation of the previous step. The inner 'sapply' is needed because 'mean()' no longer accepts data frames in current R versions. This step will still be very slow for very large data sets, due to the sapply loop over the row elements.
          myList <- tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list); myDFmean <- sapply(myList, function(x) rowSums(myDF[,x])/length(x)); colnames(myDFmean) <- sapply(myList, paste, collapse="_"); myDFmean[1:4,] # By using the rowSums or rowMeans functions one can perform the above aggregate computations 100-1000 times faster by avoiding a loop over the rows altogether.
          myDFsd <- sqrt((rowSums((myDF-rowMeans(myDF))^2)) / (ncol(myDF)-1)); myDFsd[1:4] # Similarly, one can compute the standard deviation for large data frames by avoiding loops. This approach is about 100 times faster than the loop-based alternative apply(myDF, 1, sd). Note: in current R versions 'sd(t(myDF))' returns a single value for the entire matrix rather than one value per row.
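
As a sanity check, the loop-free standard deviation can be compared against the row-wise 'apply' version. This is a minimal sketch on a smaller random data frame; the seed and the dimensions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the example is reproducible
myDF <- as.data.frame(matrix(rnorm(1000), 100, 10))

# Loop-based version: calls sd() once per row.
sd_loop <- apply(myDF, 1, sd)

# Vectorized version: sample standard deviation from row sums of squared deviations.
sd_vec <- sqrt(rowSums((myDF - rowMeans(myDF))^2) / (ncol(myDF) - 1))

# Both approaches should agree up to floating point tolerance.
same <- all.equal(unname(sd_loop), unname(sd_vec))
print(same)
```

The subtraction 'myDF - rowMeans(myDF)' relies on the vector of row means being recycled down each column, which works because its length equals the number of rows.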
      Regular expressions
          gsub('(i.*a)', '\\1_xxx', iris$Species, perl = TRUE) # Example for using regular expressions to substitute a pattern by another one using a back reference. Remember: single escapes '\' need to be double escaped '\\' in R.
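
To see what the back reference in the command above does, the same substitution can be run on the three species names that occur in 'iris$Species'. This sketch types the names in by hand so the result is easy to inspect.

```r
# The three distinct values of iris$Species, entered manually.
species <- c("setosa", "versicolor", "virginica")

# '(i.*a)' captures from the first 'i' to the last 'a';
# '\\1' re-inserts the captured text before the suffix '_xxx'.
tagged <- gsub("(i.*a)", "\\1_xxx", species, perl = TRUE)

print(tagged)
```

Only "virginica" contains an 'i' followed later by an 'a', so it is the only name that changes.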
      Conditional selections
          my_frame[!duplicated(my_frame[,2]),] # Removes rows with duplicated values in selected column.
          my_frame[my_frame$y2 > my_frame$y3,] # Prints all rows of data frame where the values of column 'y2' are greater than those of column 'y3'. Comparison operators are: == (equal), != (not equal), >= (greater than or equal), etc. Logical operators are & (and), | (or) and ! (not).
          x <- 0.5:10; x[x<1.0] <- -1/x[x<1.0] # Replaces all values in vector or data frame that are below 1 with their negative reciprocal value.
          x <-data.frame(month=month.abb[1:12], AB=LETTERS[1:2], no1=1:48, no2=1:24); x[x$month == "Apr" & (x$no1 == x$no2 | x$no1 > x$no2),] # Prints all records of frame 'x' that contain 'Apr' AND have equal values in columns 'no1' and 'no2' OR have greater values in column 'no1'.
          x[x[,1] %in% c("Jun", "Aug"),] # Retrieves rows with column matches specified in a query vector.
          x[c(grep("\\d{2}", as.character(x$no1), perl = TRUE)),] # Possibility to print out all rows of a data frame where a regular expression matches (here all double digit values in col 'no1').
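          x[apply(x, 1, function(r) any(grepl("\\d{2}", r, perl = TRUE))),] # Same as above, but searches all columns (1-4) by testing every field of each row with 'apply' and 'grepl'. A 'for' loop cannot be used inside 'as.character()' here, because a loop returns NULL.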
          z <- data.frame(chip1=letters[1:25], chip2=letters[25:1], chip3=letters[1:25]); z; y <- apply(z, 1, function(x) sum(x == "m") > 2); z[y,] # Identifies in a data frame ('z') all those rows that contain a certain number of identical fields (here 'm' > 2).
          z <- data.frame(chip1=1:25, chip2=25:1, chip3=1:25); zc <- data.frame(z, count=apply(z[,1:3], 1, function(x) sum(x >= 5))); zc # Counts in each row of a data frame the number of fields that are above or below a specified value (here >= 5) and appends this information to the data frame. For rows with "NA" values the count is returned as NA. To work around this limitation, one can use 'sum(x >= 5, na.rm=TRUE)' or replace the NA fields with a value that doesn't affect the result, e.g.: x[is.na(x)] <- 1.
          x <- data.frame(matrix(rep(c("P","A","M"),20),10,5)); x; index <- x == "P"; cbind(x, Pcount=rowSums(index)); x[rowSums(index)>=2,] # Example of how one can count occurrences of strings across rows. In this example the occurrences of "P" in a data frame of "PMA" values are counted by converting to a matrix of logical values and then counting the 'TRUE' occurrences with the 'rowSums' function, which does in this example the same as: 'cbind(x, count=apply(x=="P",1,sum))'.
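
The row-wise counting approach and its behavior with missing values can be illustrated on a three-row table. The following sketch uses invented values; 'count_default' and 'count_na_rm' are hypothetical variable names.

```r
# Hypothetical data with one missing value in the third row.
z <- data.frame(chip1 = c(1, 6, NA),
                chip2 = c(7, 8, 9),
                chip3 = c(2, 5, 10))

# Comparing a data frame with a number yields a logical matrix.
count_default <- rowSums(z >= 5)                # a row with NA gets the count NA
count_na_rm   <- rowSums(z >= 5, na.rm = TRUE)  # NA fields are skipped

# Keep the rows in which at least two fields pass the threshold.
selected <- z[count_na_rm >= 2, ]

print(count_default)
print(count_na_rm)
print(selected)
```

Using 'na.rm = TRUE' avoids having to overwrite the NA fields with a placeholder value first.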
      Lists
      Lists are ordered collections of objects that can be of different modes (e.g. numeric vector, array, etc.). A name can be assigned to each list component.
          my_list <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9)) # Example of how to create a list.
          attributes(my_list); names(my_list) # Prints attributes and names of all list components.
          my_list[2]; my_list[[2]] # Both return component 2 of the list. The '[[..]]' operator returns the object stored in this component, while '[..]' returns a list containing that single component.
          my_list$wife # Named components in lists can also be called with their name. Command returns same component as 'my_list[[2]]'.
          my_list[[4]][2] # Returns from the fourth list component the second entry.
          length(my_list[[4]]) # Prints the length of the fourth list component.
          my_list$wife <- 1:12 # Replaces the content of the existing component with new content.
          my_list$wife <- NULL # Deletes list component 'wife'.
          my_list <- c(my_list, list(my_title2=month.name[1:12])) # Appends new list component to existing list.
          my_list <- c(my_name1=my_list1, my_name2=my_list2, my_name3=my_list3) # Concatenates lists.
          my_list <- c(my_title1=my_list[[1]], list(my_title2=month.name[1:12])) # Concatenates existing and new components into a new list.
          unlist(my_list); data.frame(unlist(my_list)); matrix(unlist(my_list)); data.frame(my_list) # Useful commands to convert lists into other objects, such as vectors, data frames or matrices.
          my_frame <- data.frame(y1=rnorm(12),y2=rnorm(12), y3=rnorm(12), y4=rnorm(12)); my_list <- apply(my_frame, 1, list); my_list <- lapply(my_list, unlist); my_list # Stores the rows of a data frame as separate vectors in a list container.
          mylist <- list(a=letters[1:10], b=letters[10:1], c=letters[1:3]); lapply(names(mylist), function(x) c(x, mylist[[x]])) # Syntax to access list component names with the looping function 'lapply'. In this example the list component names are prepended to the corresponding vectors.
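
The difference between the '[..]' and '[[..]]' operators, and the removal of a component by assigning NULL, can be confirmed with a short test. This is a small sketch using a shortened version of the example list from above.

```r
my_list <- list(name = "Fred", child.ages = c(4, 7, 9))

# Single brackets return a list that contains the selected component.
sub_list <- my_list["child.ages"]
# Double brackets return the stored object itself, here a numeric vector.
ages <- my_list[["child.ages"]]

# Assigning NULL deletes the component from the list.
my_list$name <- NULL

print(class(sub_list))
print(class(ages))
print(names(my_list))
```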