云之南 (South of the Clouds)

The sound of wind, of rain, of reading: every sound enters the ear; affairs of the family, the state, and the world: every one concerns me

 
 
 


 
 
About me

Background: Computer Science. Research interests: JavaEE web development, bioinformatics, data mining and machine learning, intelligent information systems. Current work: genomics, transcriptomics, NGS high-throughput data analysis, biological data mining, plant phylogenetics and comparative evolutionary genomics.


Downloading SRA files from NCBI and extracting paired-end reads

2014-04-01 23:34:03 | Category: Bioinformatics analysis software


1. Downloading SRA data using command line utilities


Overview

When to use a command line utility rather than the SRA website.

For multiple simultaneous downloads of SRA data, or for high-volume downloads, we recommend using command line utilities such as wget, FTP, or Aspera’s ‘ascp’ utility. As with web-based downloads, the best speed is achieved with Aspera’s FASP implementation. ascp is bundled with the Aspera Connect plugin.

Downloading SRA data using the SRA Toolkit.

The SRA Toolkit has the capacity to download data files directly (when properly configured) simply by calling a Toolkit command and specifying the accession of interest. For example:

$ fastq-dump -X 5 -Z SRR390728

This example will retrieve data for SRR390728 (a small dataset: 193 MB) and print the first five spots (-X 5) to standard out (-Z). The above operation does not require a prior download of SRR390728; fastq-dump (and all other Toolkit ‘dump’ utilities) will identify the accession and contact NCBI to download it.

Alternatively, the Toolkit utility ‘prefetch’ can be used in conjunction with HTTP transfer (default) or ascp. This can be a simpler way for many users to utilize ascp to download SRA data.

Accessing the ‘ascp’ utility.

As of version 3.3x of Aspera Connect, the default install location for ascp is:

Microsoft Windows: ‘C:\Program Files\Aspera\Aspera Connect\bin\ascp.exe’

Mac OS X: ‘/Applications/Aspera Connect.app/Contents/Resources/ascp’ (Administrator-installed Aspera Connect) or ‘/Users/[username]/Applications/Aspera\ Connect.app/Contents/Resources/ascp’ (Non-administrator install)

Linux: ‘/opt/aspera/bin/ascp’ or ‘/home/[username]/.aspera/connect/bin/ascp’

What key file should be used to download SRA data by ascp?

Downloading SRA data does not require an NCBI-generated private key file. Aspera Connect / ascp is packaged with Asperasoft-generated key files:

asperaweb_id_dsa.openssh (openssh formatted key for newer ascp installations)

asperaweb_id_dsa.putty (putty formatted key for older ascp installations)

You may need to browse the directories produced during the Aspera Connect / ascp installation to find the above key files. They are stored in slightly different locations depending on your operating system.

Determining the location of SRA data files for automated or scripted downloads.

Beginning with a list of desired SRA data sets (e.g., a list of SRA Run accessions, “SRRs”), the exact download location for that data file can be determined as follows:

wget/FTP root: ftp://ftp-trace.ncbi.nih.gov

ascp root: anonftp@ftp.ncbi.nlm.nih.gov:

Remainder of path:

/sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>/<accession>/<accession>.sra

Where

{SRR|ERR|DRR} should be either ‘SRR’, ‘ERR’, or ‘DRR’ and should match the prefix of the target .sra file
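
The path rule above is easy to script. A minimal sketch, assuming only a POSIX shell (`sra_url` is a hypothetical helper name of mine, not an NCBI tool):

```shell
# Build the wget/FTP download URL for a run accession, following the
# /sra/sra-instant/reads/ByRun layout described above.
sra_url() {
  acc="$1"
  type=$(printf '%s' "$acc" | cut -c1-3)    # SRR, ERR, or DRR
  prefix=$(printf '%s' "$acc" | cut -c1-6)  # first 6 characters of the accession
  printf 'ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
    "$type" "$prefix" "$acc" "$acc"
}

sra_url SRR304976
```

Prepend `wget ` to the output (or pipe a list of URLs into `xargs -n1 wget`) to start the downloads.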

Examples:

Downloading SRR304976 by wget or FTP:

wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra

Downloading the same file by ascp:

[path_to_ascp_binary]/ascp -i [path_to_Aspera_key]/asperaweb_id_dsa.openssh -k 1 -T -l200m anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra [local_target_directory]

It is recommended that you always use the “-l” (target transfer rate) option when configuring ascp. A more detailed discussion of ascp options and configuration can be found in our Aspera Transfer Guide. Aspera’s documentation on ascp also provides usage and examples.
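
For bulk transfers, the same path rule can drive ascp in a loop. This sketch is a dry run: it only prints the commands so they can be inspected before execution (`ASCP` and `KEY` are placeholders for your local install paths, and the accession list is illustrative):

```shell
# Print (but do not run) one ascp command per accession.
# Pipe the output to `sh` once you are satisfied the paths are correct.
ASCP="[path_to_ascp_binary]/ascp"
KEY="[path_to_Aspera_key]/asperaweb_id_dsa.openssh"

for acc in SRR304976 SRR304975; do
  type=$(printf '%s' "$acc" | cut -c1-3)
  prefix=$(printf '%s' "$acc" | cut -c1-6)
  echo "$ASCP -i $KEY -k 1 -T -l200m anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/$type/$prefix/$acc/$acc.sra ."
done
```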

Performance comparison of FTP and ascp downloads

Consider using ascp if you plan to download more than 1 gigabyte of data, or if your location is distant from NCBI (located on the eastern coast of North America), since:

  • The amount of data for a given SRA project can exceed 10 gigabytes, and traditional FTP may be too slow to download your data efficiently.
  • FTP performance degrades proportionally with the number of hops or switches the data must take to get to you. Aspera performance does not degrade with distance.
  • Aspera is typically 10 times faster than FTP and reduces the chance of drops or time-outs in the middle of a transfer. Best-case transfer rates for ascp are ~ 600 Mbps, while typical rates are closer to 100-200 Mbps.

If you are located in Europe or Asia and wish to download via FTP, you may have better luck transferring the data from our INSDC partners: either the European Nucleotide Archive (EMBL-EBI) or the DDBJ Sequence Read Archive (DNA Data Bank of Japan).



2. How to extract paired-end reads from an SRA file
http://pgfe.umassmed.edu/ou/archives/3003

We often need to extract paired-end sequencing data from an NCBI SRA archive. However, when we run the SRA Toolkit's fastq-dump, we frequently get a single file instead of two. Can this file be split into two or more files? Not necessarily. First, try fastq-dump's --split-3 option, which the documentation describes as follows:
Legacy 3-file splitting for mate-pairs: first biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq If only one biological read is present it is placed in *.fastq. Biological reads and above are ignored

In other words, if each spot in the SRA file contains only one read, this option is effectively ignored. If each spot contains two reads, the pairs are split into *_1.fastq and *_2.fastq. If a third file appears, it holds the reads that are no longer paired: most likely some reads were filtered out before submission, leaving their mates orphaned.
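
Since the set of output files tells you how the run was structured, it helps to check which files exist after the dump. A minimal sketch (`check_split` is a hypothetical helper of mine, not part of the SRA Toolkit; the file names follow the --split-3 convention quoted above):

```shell
# After `fastq-dump --split-3 <ACC>`, report whether the run was paired,
# single-ended, or paired with leftover unpaired reads.
check_split() {
  acc="$1"
  if [ -f "${acc}_1.fastq" ] && [ -f "${acc}_2.fastq" ]; then
    if [ -f "${acc}.fastq" ]; then
      echo "paired, plus unpaired reads in ${acc}.fastq"
    else
      echo "paired"
    fi
  elif [ -f "${acc}.fastq" ]; then
    echo "single-ended"
  else
    echo "no fastq-dump output found for ${acc}"
  fi
}
```

Run it in the directory where fastq-dump wrote its output, e.g. `check_split ERR364396`.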

http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump
Other format conversions:
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
Documentation:
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=std

Tool: fastq-dump

Usage:
fastq-dump [options] <path/file> [<path/file> ...]
fastq-dump [options] <accession>
Frequently Used Options:
General:
-h | --help Displays ALL options, general usage, and version information.
-V | --version Display the version of the program.
Data formatting:


--split-files Dump each read into separate file. Files will receive suffix corresponding to read number.


--split-spot Split spots into individual reads.


--fasta <[line width]> FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
-I | --readids Append read id after spot id as 'accession.spot.readid' on defline.
-F | --origfmt Defline contains only original sequence name.
-C | --dumpcs <[cskey]> Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
-B | --dumpbase Formats sequence using base space (default for other than SOLiD).
-Q | --offset <integer> Offset to use for ASCII quality scores. Default is 33 ("!").
Filtering:
-N | --minSpotId <rowid> Minimum spot id to be dumped. Use with "X" to dump a range.
-X | --maxSpotId <rowid> Maximum spot id to be dumped. Use with "N" to dump a range.
-M | --minReadLen <len> Filter by sequence length >= <len>


--skip-technical Dump only biological reads.


--aligned Dump only aligned sequences. Aligned datasets only; see sra-stat.


--unaligned Dump only unaligned sequences. Will dump all for unaligned datasets.
Workflow and piping:
-O | --outdir <path> Output directory, default is current working directory ('.').
-Z | --stdout Output to stdout, all split data become joined into single stream.


--gzip Compress output using gzip.


--bzip2 Compress output using bzip2.
Use examples:
fastq-dump -X 5 -Z SRR390728
Prints the first five spots (-X 5) to standard out (-Z). This is a useful starting point for verifying other formatting options before dumping a whole file.
fastq-dump -I --split-files SRR390728
Produces two fastq files (--split-files) containing ".1" and ".2" read suffixes (-I) for paired-end data.
# paired-end data
fastq-dump --split-3 ERR364396 (sometimes this fails)
In that case, try giving the full path to a local .sra file:
fastq-dump --split-3 /home/yourpath/to/ERR364399.sra
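
Whichever invocation works, it is worth sanity-checking the result: properly paired *_1.fastq and *_2.fastq files must contain the same number of reads. A small sketch (`count_reads` is a hypothetical helper; it assumes plain, uncompressed FASTQ with exactly 4 lines per record):

```shell
# Count the reads in a FASTQ file (one record = 4 lines).
count_reads() {
  echo $(( $(wc -l < "$1") / 4 ))
}
```

For example: `[ "$(count_reads ERR364399_1.fastq)" -eq "$(count_reads ERR364399_2.fastq)" ] && echo "read counts match"`.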

fastq-dump --split-files --fasta 60 SRR390728
Produces two (--split-files) fasta files (--fasta) with 60 bases per line ("60" included after --fasta).
fastq-dump --split-files --aligned -Q 64 SRR390728
Produces two fastq files (--split-files) that contain only aligned reads (--aligned; note: only for files submitted as aligned data), with a quality offset of 64 (-Q 64). Please see the documentation on vdb-dump if you wish to produce fasta/qual data.
Possible errors and their solution:
fastq-dump.2.x err: item not found while constructing within virtual database module - the path '<path/SRR*.sra>' cannot be opened as database or table
This error indicates that the .sra file cannot be found. Confirm that the path to the file is correct.
fastq-dump.2.x err: name not found while resolving tree within virtual file system module - failed SRR*.sra
The data are likely reference-compressed and the toolkit is unable to acquire the reference sequence(s) needed to extract the .sra file. Please confirm that you have tested and validated the configuration of the toolkit. If you have elected to prevent the toolkit from contacting NCBI, you will need to acquire the reference(s) manually.
failed with curl-error 'CURLE_COULDNT_RESOLVE_HOST'
The toolkit is attempting to contact or download data from NCBI, but is unable to connect. Please confirm that your computer or server has Internet connectivity.

 fastq-dump -h

Usage:
  fastq-dump [options] [ -A ] <accession>
  fastq-dump [options] <path[ path...]>

INPUT
  -A|--accession <accession>       Replaces accession derived from <path> in
                                   filename(s) and deflines (only for single
                                   table dump)
  --table <table-name>             [NEW] Table name within cSRA object,
                                   default is "SEQUENCE"

PROCESSING

Read Splitting                     Sequence data may be used in raw form or
                                     split into individual reads
  --split-spot                     Split spots into individual reads

Full Spot Filters                  Applied to the full spot independently
                                     of --split-spot
  -N|--minSpotId <rowid>           Minimum spot id
  -X|--maxSpotId <rowid>           Maximum spot id
  --spot-groups <[list]>           Filter by SPOT_GROUP (member): name[,...]
  -W|--clip                        Apply left and right clips

Common Filters                     Applied to spots when --split-spot is not
                                     set, otherwise - to individual reads
  -M|--minReadLen <len>            Filter by sequence length >= <len>
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value
                                   optionally filter by value:
                                   pass|reject|criteria|redacted
  -E|--qual-filter                 Filter used in early 1000 Genomes data: no
                                   sequences starting or ending with >= 10N

Filters based on alignments        Filters are active when alignment
                                     data are present
  --aligned                        Dump only aligned sequences
  --unaligned                      Dump only unaligned sequences
  --aligned-region <name[:from-to]>  Filter by position on genome. Name can
                                   either be accession.version (ex:
                                   NC_000001.10) or file specific name (ex:
                                   "chr1" or "1"). "from" and "to" are 1-based
                                   coordinates
  --matepair-distance <from-to|unknown>  Filter by distance between matepairs.
                                   Use "unknown" to find matepairs split
                                   between the references. Use from-to to limit
                                   matepair distance on the same reference

Filters for individual reads       Applied only with --split-spot set
  --skip-technical                 Dump only biological reads

OUTPUT
  -O|--outdir <path>               Output directory, default is working
                                   directory ('.')
  -Z|--stdout                      Output to stdout, all split data become
                                   joined into single stream

Multiple File Options              Setting these options will produce more
                                     than 1 file, each of which will be suffixed
                                     according to splitting criteria.
  --split-files                    Dump each read into separate file. Files
                                   will receive suffix corresponding to read
                                   number
  --split-3                        Legacy 3-file splitting for mate-pairs:
                                   First biological reads satisfying dumping
                                   conditions are placed in files *_1.fastq and
                                   *_2.fastq If only one biological read is
                                   present it is placed in *.fastq Biological
                                   reads and above are ignored.
  -G|--spot-group                  Split into files by SPOT_GROUP (member name)
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value
                                   optionally filter by value:
                                   pass|reject|criteria|redacted
  -T|--group-in-dirs               Split into subdirectories instead of files
  -K|--keep-empty-files            Do not delete empty files

FORMATTING

Sequence
  -C|--dumpcs <[cskey]>            Formats sequence using color space (default
                                   for SOLiD),"cskey" may be specified for
                                   translation
  -B|--dumpbase                    Formats sequence using base space (default
                                   for other than SOLiD).

Quality
  -Q|--offset <integer>            Offset to use for quality conversion,
                                   default is 33
  --fasta                          FASTA only, no qualities

Defline
  -F|--origfmt                     Defline contains only original sequence name
  -I|--readids                     Append read id after spot id as
                                   'accession.spot.readid' on defline
  --helicos                        Helicos style defline
  --defline-seq <fmt>              Defline format specification for sequence.
  --defline-qual <fmt>             Defline format specification for quality.
                                   <fmt> is string of characters and/or
                                   variables. The variables can be one of: $ac
                                   - accession, $si spot id, $sn spot
                                   name, $sg spot group (barcode), $sl spot
                                   length in bases, $ri read number, $rn
                                   read name, $rl read length in bases. '[]'
                                   could be used for an optional output: if
                                   all vars in [] yield empty values whole
                                   group is not printed. Empty value is empty
                                   string or for numeric variables. Ex:
                                   @$sn[_$rn]/$ri '_$rn' is omitted if name
                                   is empty
 
OTHER:
  -h|--help                        Output brief explanation of program usage
  -V|--version                     Display the version of the program
  -L|--log-level <level>           Logging level as number or enum string One
                                   of (fatal|sys|int|err|warn|info) or (0-5)
                                   Current/default is warn
  -v|--verbose                     Increase the verbosity level of the program
                                   Use multiple times for more verbosity


In addition, there is another method: http://seqanswers.com/forums/showthread.php?t=12550
fastq-dump -SL