注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

云之南

风声,雨声,读书声,声声入耳;家事,国事,天下事,事事关心

 
 
 

日志

 
 
关于我

专业背景:计算机科学 研究方向与兴趣: JavaEE-Web软件开发, 生物信息学, 数据挖掘与机器学习, 智能信息系统 目前工作: 基因组, 转录组, NGS高通量数据分析, 生物数据挖掘, 植物系统发育和比较进化基因组学

网易考拉推荐

R & Bioconductor Manual(三)  

2012-02-29 10:38:57|  分类: R&Bioconductor |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |
    1. Missing Values
    2. Missing values are represented in R data objects by the missing value place holder 'NA'. The default behavior for many R functions on data objects with missing values is 'na.fail' which returns the value 'NA'. Several 'na.action' options are available to change this behavior.
          x <- 1:10; x <- x[1:12]; z <- data.frame(x,y=12:1) # creates sample data.frame with 'NA' values
          is.na(x) # Finds out which elements of x contain "NA" fields.
          na.omit(z) # Removes rows with 'NA' fields in numeric data objects.
          x <- letters[1:10]; print(x); x <- x[1:12]; print(x); x[!is.na(x)] # A similar result can be achieved on character data objects with the function 'is.na()'.
          apply(z, 1, mean, na.rm=T) # uses available values to perform calculation while ignoring the 'NA' values; the default 'na.rm=F' returns 'NA' for rows with missing values
          z[is.na(z)] <- 0 # replaces 'NA' values with zeros. To replace the special <NA> signs in character columns of data frames, convert the data frame with as.matrix() to a matrix, perform the substitution on this level and then convert it back to a data frame with as.data.frame().
    3. Writing Your Own Functions
    4. The R language allows users to create their own function objects. The basic syntax for defining new functions is:
          name <- function(arg_1, arg_2, ...) expression # Basic syntax for functions.
      A much more detailed introduction into writing functions in R is available in the Programming in R section of this manual.

    5. R Web Applications
    6. Various web-based R services and R web development kits are available:
    7. R Exercises
    8. Slide Show (R Code)
      The following exercises introduce a variety of useful data analysis utilities available in R.
      Import from spreadsheet programs (e.g. Excel)
      1. Download the following molecular weight and subcelluar targeting tables from the TAIR site, import the files into Excel and save them as tab delimited text files.
      2. Import the tables into R
        • my_mw <- read.delim(file="MolecularWeight_tair7.xls", header=T, sep="\t") # Imports molecular weight table.
          my_target <- read.delim(file="TargetP_analysis_tair7.xls", header=T, sep="\t") # Imports subcelluar targeting table.
      Online import
          my_mw <- read.delim(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/MolecularWeight_tair7.xls", header=T, sep="\t") # Online import of molecular weight table from TAIR.
          my_target <- read.delim(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/TargetP_analysis_tair7.xls", header=T, sep="\t") # Online import of subcelluar targeting table from TAIR.
      Check tables and rename gene ID columns
          my_mw[1:4,]; my_target[1:4,]; colnames(my_target)[1] <- "ID"; colnames(my_mw)[1] <- "ID" # Prints out the first four rows of the two data frames and renames their gene ID columns.
      Merge the two tables based on common ID field
          my_mw_target <- merge(my_mw, my_target, by.x="ID", by.y="ID", all.x=T)
      Shorten one table before the merge and then remove the non-matching rows (NAs) in the merged file
          my_mw_target2 <- merge(my_mw, my_target[1:40,], by.x="ID", by.y="ID", all.x=T) # To remove non-matching rows, use the argument setting 'all=F'.
          na.omit(my_mw_target2) # Removes rows containing "NAs" (non-matching rows).

          Problem 1:
          How can the merge function in the previous step be executed so that only the common rows among the two data frames are returned? Prove that both methods - the two step version with 'na.omit()' and your method - return identical results.

          Problem 2:
          Replace all NAs in the data frame 'my_mw_target2' with zeros.
      Apply several combinatorial queries (filters) on the data set. For example:
          my_mw_target[my_mw_target[,2]>100000 & my_mw_target[,4]=="C",] # Retrieves all records with MW greater than 100000 and targeting to chloroplasts.
          dim(my_mw_target[my_mw_target[,2]>100000 & my_mw_target[,4]=="C", ]) # Determines the number of matches for the above query.

          Problem 3:
          How many protein entries in the 'my_mw_target' data frame have a MW of greater then 4000 and less then 5000. Subset the data frame accordingly and sort it by MW to check that your result is correct.
      Use a regular expression in a substitute function to generate a separate ID column that lacks the gene model extensions
          my_mw_target3 <- data.frame(loci=gsub("\\..*", "", as.character(my_mw_target[,1]), perl = TRUE), my_mw_target)
          my_mw_target3 <- data.frame(loci=gsub("(^.*)\\.\\d{1,}", "\\1", as.character(my_mw_target[,1]), perl = TRUE), my_mw_target) # Same as above but with backreference.

          Problem 4:
          Retrieve those rows in 'my_mw_target3' where the second column contains the following identifiers: c("AT5G52930.1", "AT4G18950.1", "AT1G15385.1", "AT4G36500.1", "AT1G67530.1"). Use the '%in%'  function for this query. As an alternative approach, assign the second column to the row index of the data frame and then perform the same query again using the row index. Explain the difference of the two methods.
      Calculate the number of gene models per loci with the table() function
          my_mw_target4 <- cbind(my_mw_target3, Freq=table(my_mw_target3[,1])[my_mw_target3[,1]]) # The table() function counts the number of duplicates for each loci and cbind() appends the information to the data frame.
          my_mw_target4[order(-my_mw_target4$Freq),][1:30,] # Sorts by count information.
          length(unique(my_mw_target4[,1])) # Removes duplicates and prints the number of unique loci.
          sum(!duplicated(my_mw_target4[,1])) # Gives the same result as previous command; remember: the function 'duplicated' returns a logical vector.
          length(my_mw_target4[,2])-length(unique(my_mw_target4[,1])) # Calculates the number of duplicated entries.
          sum(duplicated(my_mw_target4[,1])) # Generates the same result as the previous command.
      Perform a variety calculations across columns
          data.frame(my_mw_target4, avg_AA_WT=(my_mw_target4[,3]/my_mw_target4[,4]))[1:40,] # Calculates the average AA weight per protein.
          data.frame(my_mw_target4, mean=apply(my_mw_target4[,6:9], 1, mean))[1:40,] # Calculates the mean across several fields of each row.
          data.frame(my_mw_target4, mean=apply(my_mw_target4[,6:9], 1, mean), stdev=apply(my_mw_target4[,6:9], 1, sd, na.rm=T))[1:40,] # Calculates the mean and standard deviation across several fields of each row.
          mean(my_mw_target4[,-c(1,2,5)], na.rm=T) # Calculates the mean for all numeric columns.
          data.frame(my_mw_target4[1:40,], ttest=apply(my_mw_target4[1:40,6:9], 1, function(x) {t.test(x[1:2], x[3:4])$p.value} )) # Calculates a two-sample t-test for two slices in each row of the data frame and records the corresponding p-values. To calculate a paired t-test, add the argument 'paired=T'.
      Export data frame to Excel
          write.table(my_mw_target4, file="my_file.xls", quote=F, sep="\t", col.names = NA) # Writes the data frame 'my_mw_target4' into a tab-delimited text file that can be opened with Excel.
      Plotting samples
          boxplot(my_mw_target4[,3], my_mw_target4[,4], names=c("MW","CC"), col=c("blue","red"), main="Box Plot") # Generates box plots for MW and AA counts.
          plot(density(my_mw_target4[,3]), col="blue", main="Density Plot") # Plots density distribution plots for MW.
          barplot(my_mw_target4[1:5,3], beside = F, col = c("red", "green", "blue", "magenta", "black"), names.arg=as.vector(my_mw_target4[1:5,1]), main="MW Bar Plot", ylim=c(0,230000)) # Bar plot of MW for first five loci.
          source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") # Imports the venn diagram function.
          setlist <- list(A=sample(my_mw_target4[,1], 1000), B=sample(my_mw_target4[,1], 1100), C=sample(my_mw_target4[,1], 1200), D=sample(my_mw_target4[,1], 1300), E=sample(my_mw_target4[,1], 1400)) # Generates 5 sets of random IDs.
          OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets"); counts <- sapply(OLlist$Venn_List, length); vennPlot(counts=counts, ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5)) # Plots a 5-way venn diagram for the random ID sets.
          OLexport <- as.matrix(unlist(sapply(OLlist[[4]], paste, collapse=" "))); write.table(OLexport, file="OLexport.xls", col.names=F, quote=F, sep="\t") # Exports intersect data in tabular format to a file.
          pdf("venn_plot.pdf"); vennPlot(counts=counts); dev.off() # Saves the plot image to a PDF file.

          Problem 5:
          Generate two key lists each with 4 random sample sets. Compute their overlap counts and plot the results for both lists in one venn diagram.

          Problem 6:
          Write all commands from the previous exercises into an R script (exerciseRbasics.R) and execute it with the source() function like this: source("exerciseRbasics.R") # Executes all of the above commands and generates the corresponding output files in the current working directory.

 

  评论这张
 
阅读(1789)| 评论(0)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2016