
云之南


Microarray Data Analysis with R and BioConductor (2): Missing Value Imputation

2009-11-26 22:53:01 | Category: Bioinformatics

http://i.azpala.com/2008/05/02/microarray-data-analysis-using-r-and-bioconductor-step2-missing-value-imputation/

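The data used in the analysis below was offered as a download in the original post. It comes from a study of genes involved in butterfly migration. The samples are 20 individual butterflies: 10 are long-established local individuals (old) and 10 are newly immigrated individuals (new). Old and new individuals were randomly paired, labeled with different dyes (wavelengths of 555 and 647 nm), and hybridized on the same microarray. In addition, every gene is spotted 3 times on each array, so this is a two-channel dataset with 3 replicates and 10 arrays. The values are raw spot signal intensities and have not been normalized.

When you open the data you will see many "NA" entries. I replaced the missing (blank) values with NA so that R can impute them.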
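Before imputing, it can help to see how many values are missing and where. The following is a minimal sketch in base R; the small table and its column names are made up for illustration and are not the butterfly data.

```r
# Toy stand-in for the raw intensity table (hypothetical values and column names)
raw_text <- "ID old_1 new_1 old_2
g1 120 NA 98
g2 NA NA 310
g3 75 80 77"

# read.table() parses the string "NA" as a missing value by default
toy <- read.table(text = raw_text, header = TRUE)
expr <- as.matrix(toy[, -1])          # drop the ID column
rownames(expr) <- as.character(toy$ID)

na_per_gene  <- rowSums(is.na(expr))  # missing spots per gene (row)
na_per_array <- colSums(is.na(expr))  # missing spots per column
na_fraction  <- mean(is.na(expr))     # overall fraction of missing values
```

Genes or columns with a very high share of missing values deserve a closer look before any imputation method is applied.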

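There are three common approaches to missing value imputation:

A. Fill in with the gene's mean expression value. If there are several replicate arrays, average across the arrays; for time-series arrays, the gap can be filled by interpolation. This method is simple and widely used, but it performs worse than the two methods below.

B. SVD-based imputation (SVD is closely related to principal component analysis). In short, this method fills in missing values using a few basic patterns that describe the gene expression profiles.

C. KNN-based (k-nearest-neighbor) imputation. This method looks for other genes whose expression profiles are similar to that of the gene with missing values, and fills in the missing values from the expression values of those genes (weighted by profile similarity). KNN generally gives the best results of the three, so it is the method used for this dataset.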
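As a point of comparison, method A can be sketched in a few lines of base R. This is only an illustrative baseline; the function name impute_row_mean and the toy matrix are my own and do not come from any package.

```r
# Method A sketch: replace each missing value with the mean of its own row (gene).
# impute_row_mean is a hypothetical helper written for this illustration.
impute_row_mean <- function(m) {
  row_means <- rowMeans(m, na.rm = TRUE)          # per-gene mean over observed values
  missing <- which(is.na(m), arr.ind = TRUE)      # (row, col) positions of the NAs
  m[missing] <- row_means[missing[, "row"]]       # fill each NA with its row mean
  m
}

# Toy matrix: 2 genes x 3 arrays
expr <- matrix(c(1, NA, 3,
                 4, 6, NA), nrow = 2, byrow = TRUE)
filled <- impute_row_mean(expr)
```

A row that consists only of NA has no observed mean, so it stays missing under this sketch.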

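For a comparison of these 3 methods, this paper gives a clear account: Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001), Missing value estimation methods for DNA microarrays, Bioinformatics 17(6):520-525. It is recommended reading.

With that background covered, let's start analyzing the data.

First, install R 2.7.0, the latest version at the time of writing; the download link is in the previous post. I had been using version 2.5.1, and installing the package below failed with errors, so I strongly recommend the latest version.

Then download and install the package called impute from: http://bioconductor.fhcrc.org/packages/1.9/bioc/html/impute.html

impute is an R package dedicated to missing value imputation with the KNN method. Install it as described in the previous post:

On Linux, run the following in a shell: sudo R CMD INSTALL impute_1.6.0.tar.gz

Set the current working directory (on Windows, use the working directory entry in the R menu bar; on Linux, use the setwd() function).

Then enter the following code in the R console: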


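# Load the impute package
library(impute)
# Read the raw data; "NA" entries are parsed as missing values
raw <- read.table("raw_data_3_replicates.txt", header = TRUE)
# Drop the first column (the ID column)
rawexpr <- raw[, -1]
# Required in my setup: without this line the call fails. I do not know why; pointers are welcome.
if (exists(".Random.seed")) rm(.Random.seed)
# impute.knn() takes a matrix as its first argument; the other arguments are set to their default values here
imputed <- impute.knn(as.matrix(rawexpr), k = 10, rowmax = 0.5, colmax = 0.8, maxp = 1500, rng.seed = 362436069)
# Optional: write.table() saves the data to a file in the current working directory, named by file = "..."
write.table(imputed$data, file = "imputed_data.txt")
# imputed$data is the matrix that holds the imputed data in R
imputeddata <- imputed$data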

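Now type imputeddata (that is, imputed$data) in R to see the imputed data matrix. Are all the NA values gone?

OK, that is all there is to this step.

The original post linked to the detailed documentation of the impute package. In case the link goes away, the documentation is reproduced below; skip it if you do not need the details.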

impute.knn {impute} R Documentation

A function to impute missing expression data

Description

A function to impute missing expression data, using nearest neighbor averaging.

Usage

impute.knn(data ,k = 10, rowmax = 0.5, colmax = 0.8, maxp = 1500, rng.seed=362436069)

Arguments

data: An expression matrix with genes in the rows, samples in the columns

Details

impute.knn uses k-nearest neighbors in the space of genes to impute missing expression values.

For each gene with missing values, we find the k nearest neighbors using a Euclidean metric, confined to the columns for which that gene is NOT missing. Each candidate neighbor might be missing some of the coordinates used to calculate the distance. In this case we average the distance from the non-missing coordinates. Having found the k nearest neighbors for a gene, we impute the missing elements by averaging those (non-missing) elements of its neighbors. This can fail if ALL the neighbors are missing in a particular element. In this case we use the overall column mean for that block of genes.
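The procedure described above can be sketched in base R. This is a simplified illustration, not the implementation inside the impute package: the function name knn_impute_sketch is hypothetical, neighbors are averaged without weighting, and the mean squared difference is used as the distance (it ranks neighbors the same way as the averaged Euclidean distance).

```r
# Simplified KNN imputation sketch (hypothetical helper, not impute.knn itself)
knn_impute_sketch <- function(m, k = 2) {
  out <- m
  col_means <- colMeans(m, na.rm = TRUE)           # fallback values per column
  for (i in which(rowSums(is.na(m)) > 0)) {        # every gene with missing values
    observed <- !is.na(m[i, ])                     # columns where gene i is observed
    dists <- vapply(seq_len(nrow(m)), function(j) {
      if (j == i) return(Inf)                      # a gene is not its own neighbor
      # average squared difference over coordinates observed in both genes
      d <- mean((m[j, observed] - m[i, observed])^2, na.rm = TRUE)
      if (is.nan(d)) Inf else d                    # no shared coordinates
    }, numeric(1))
    neighbours <- head(order(dists), k)
    neighbours <- neighbours[is.finite(dists[neighbours])]
    for (j in which(!observed)) {
      vals <- m[neighbours, j]
      vals <- vals[!is.na(vals)]                   # neighbors may be missing here too
      # average the neighbors; fall back to the column mean if none is observed
      out[i, j] <- if (length(vals) > 0) mean(vals) else col_means[j]
    }
  }
  out
}

# Toy matrix: g1 is close to g2 and g3, far from g4
m <- rbind(g1 = c(1.0, 2.0, 3.0, NA),
           g2 = c(1.1, 2.1, 3.1, 4.0),
           g3 = c(0.9, 1.9, 2.9, 6.0),
           g4 = c(9.0, 9.0, 9.0, 20.0))
filled <- knn_impute_sketch(m, k = 2)
```

With k = 2 the missing value of g1 is filled from g2 and g3 only, so the distant gene g4 has no influence on it.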

Since nearest neighbor imputation costs O(p*log(p)) operations per gene, where p is the number of rows, the computational time can be excessive for large p and a large number of missing rows. Our strategy is to break blocks with more than maxp genes into two smaller blocks using two-mean clustering. This is done recursively till all blocks have less than maxp genes. For each block, k-nearest neighbor imputation is done separately. We have set the default value of maxp to 1500. Depending on the speed of the machine, and number of samples, this number might be increased. Making it too small is counter-productive, because the number of two-mean clustering algorithms will increase.
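The recursive splitting idea can be illustrated with kmeans() from base R. This is a rough sketch under simplifying assumptions: the helper split_blocks is hypothetical, the toy matrix has no missing values (kmeans() does not accept NA, whereas the package has to cluster data that contains them), and maxp is tiny so that the recursion is visible.

```r
# Recursively split a gene matrix into blocks of at most maxp rows
# using two-means clustering. split_blocks is a hypothetical helper.
split_blocks <- function(m, maxp) {
  if (nrow(m) <= maxp) return(list(rownames(m)))       # block is small enough
  cl <- kmeans(m, centers = 2)$cluster                 # two-means clustering
  if (length(unique(cl)) < 2) return(list(rownames(m)))  # guard against a degenerate split
  c(split_blocks(m[cl == 1, , drop = FALSE], maxp),
    split_blocks(m[cl == 2, , drop = FALSE], maxp))
}

set.seed(42)                                           # kmeans() starts from random centers
m <- matrix(rnorm(20 * 4), nrow = 20,
            dimnames = list(paste0("g", 1:20), NULL))
blocks <- split_blocks(m, maxp = 6)                    # list of row-name vectors
```

Each element of blocks is one group of genes within which a nearest-neighbor search would then be run separately.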

For reproducibility, this function reseeds the random number generator using the seed provided or the default seed (362436069).

Value

data: the imputed data matrix

rng.seed: the random number seed used in imputation

rng.state: the state of the random number generator, if available, prior to the call to set.seed. Otherwise, it is NULL. If necessary, this can be used in the calling code to undo the side-effect of changing the random number generator sequence.

Author(s)

Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu

References

Hastie, T., Tibshirani, R., Sherlock, G., Eisen, M., Brown, P. and Botstein, D., Imputing Missing Data for Gene Expression Arrays, Stanford University Statistics Department Technical report (1999), http://www-stat.stanford.edu/~hastie/Papers/missing.pdf

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays BIOINFORMATICS Vol. 17 no. 6, 2001 Pages 520-525

See Also

set.seed, save

Examples

data(khanmiss)
khan.expr <- khanmiss[-1, -(1:2)]
##
## First example
##
if(exists(".Random.seed")) rm(.Random.seed)
khan.imputed <- impute.knn(as.matrix(khan.expr))
##
## khan.imputed$data should now contain the imputed data matrix
## khan.imputed$rng.seed should contain the random number seed used
## in imputation. In the above invocation, it is the default seed.
##
khan.imputed$rng.seed # should be 362436069
khan.imputed$rng.state # should be NULL
##
## Second example
##
set.seed(12345)
saved.state <- .Random.seed
khan.imputed <- impute.knn(as.matrix(khan.expr))
# Assuming all goes well with no guarantees in case of error...
.Random.seed <- khan.imputed$rng.state
sum(saved.state - khan.imputed$rng.state) # should be zero!
save(khan.imputed, file="khanimputation.Rda")