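Uploading data to NCBI:
1. Register and log in to NCBI, then click the Submit option on the home page.
2. Create a BioProject and the BioSample accessions.
(BioProject: a BioProject is the accession for a whole project. Once the paper is published, readers use this accession to find where the data behind the paper are stored, so the data have to be uploaded and the accession obtained before the manuscript is submitted. Many different data sets (reads, sequences and so on) can sit under one accession. My project, for example, has genome reads, the genome assembly and resequencing reads; I can put all of them under a single BioProject, or apply for two BioProjects, one for the genome and one for the resequencing data.
BioSample: similar to a BioProject, a BioSample is the ID of a sample you used. A BioSample holding the sample information has to be created before anything is uploaded for that sample.)
Create the BioProject first and then the BioSamples, filling in the information step by step as the pages direct.
2.1 Uploading the resequencing reads
2.1.1 Creating the BioSamples
Example attributes file (note: the required fields, marked with *, must be filled in; the others are optional)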
*sample_name sample_title bioproject_accession *organism isolate cultivar ecotype age dev_stage *geo_loc_name *tissue biomaterial_provider cell_line cell_type collected_by collection_date culture_collection disease disease_stage genotype growth_protocol height_or_length isolation_source lat_lon phenotype population sample_type sex specimen_voucher temp treatment description
Pil01 Populus ilicifolia pil-01 not collected Kenya leaf
Pil02 Populus ilicifolia pil-02 not collected Kenya leaf
Pil03 Populus ilicifolia pil-03 not collected Kenya leaf
Pil04 Populus ilicifolia pil-04 not collected Kenya leaf
Pil05 Populus ilicifolia pil-05 not collected Kenya leaf
Pil06 Populus ilicifolia pil-06 not collected Kenya leaf
Pil07 Populus ilicifolia pil-07 not collected Kenya leaf
Pil08 Populus ilicifolia pil-08 not collected Kenya leaf
Pil09 Populus ilicifolia pil-09 not collected Kenya leaf
Pil10 Populus ilicifolia pil-10 not collected Kenya leaf
Pil11 Populus ilicifolia pil-11 not collected Kenya leaf
Pil12 Populus ilicifolia pil-12 not collected Kenya leaf
Pil13 Populus ilicifolia pil-13 not collected Kenya leaf
Pil14 Populus ilicifolia pil-14 not collected Kenya leaf
Pil15 Populus ilicifolia pil-15 not collected Kenya leaf
Pil16 Populus ilicifolia pil-16 not collected Kenya leaf
Pil17 Populus ilicifolia pil-17 not collected Kenya leaf
Pil18 Populus ilicifolia pil-18 not collected Kenya leaf
Pil19 Populus ilicifolia pil-19 not collected Kenya leaf
Once the submission has been approved, every individual gets its own BioSample accession.
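With many samples, typing the attributes table by hand is error-prone, mainly because empty columns still need their tab. The following is a minimal sketch of generating the table shown above. The function names are hypothetical, and placing "not collected" in the age column is an assumption, since the flattened example does not show which column it belongs to.

```python
import csv
import io

# Header copied from the attributes file above; "*" marks required fields.
HEADER = (
    "*sample_name sample_title bioproject_accession *organism isolate cultivar "
    "ecotype age dev_stage *geo_loc_name *tissue biomaterial_provider cell_line "
    "cell_type collected_by collection_date culture_collection disease "
    "disease_stage genotype growth_protocol height_or_length isolation_source "
    "lat_lon phenotype population sample_type sex specimen_voucher temp "
    "treatment description"
).split()


def attribute_rows(n_samples):
    """Yield one dict per sample, keyed by column name."""
    for i in range(1, n_samples + 1):
        yield {
            "*sample_name": f"Pil{i:02d}",
            "*organism": "Populus ilicifolia",
            "isolate": f"pil-{i:02d}",
            # Assumption: the example does not show which column holds
            # "not collected"; age is a guess.
            "age": "not collected",
            "*geo_loc_name": "Kenya",
            "*tissue": "leaf",
        }


def write_attributes(handle, n_samples=19):
    """Write a tab-separated table; columns that are not set stay empty."""
    writer = csv.DictWriter(handle, fieldnames=HEADER, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(attribute_rows(n_samples))


if __name__ == "__main__":
    buf = io.StringIO()
    write_attributes(buf)
    print(buf.getvalue(), end="")
```

Redirecting the output to a file gives a table with one header line and 19 sample rows, each with the same number of tab-separated columns as the header.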
2.1.2 Uploading the reads
Choose the Sequence Read Archive (SRA) to upload the reads.
Example SRA metadata file:
bioproject_accession biosample_accession library_ID title library_strategy library_source library_selection library_layout platform instrument_model design_description filetype filename filename2 filename3 filename4 assembly
PRJNA471950 SAMN09228062 Pil01 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil01.1.fq.gz Pil01.2.fq.gz
PRJNA471950 SAMN09228063 Pil02 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil02.1.fq.gz Pil02.2.fq.gz
PRJNA471950 SAMN09228064 Pil03 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil03.1.fq.gz Pil03.2.fq.gz
PRJNA471950 SAMN09228065 Pil04 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil04.1.fq.gz Pil04.2.fq.gz
PRJNA471950 SAMN09228066 Pil05 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil05.1.fq.gz Pil05.2.fq.gz
PRJNA471950 SAMN09228067 Pil06 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil06.1.fq.gz Pil06.2.fq.gz
PRJNA471950 SAMN09228068 Pil07 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil07.1.fq.gz Pil07.2.fq.gz
PRJNA471950 SAMN09228069 Pil08 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil08.1.fq.gz Pil08.2.fq.gz
PRJNA471950 SAMN09228070 Pil09 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil09.1.fq.gz Pil09.2.fq.gz
PRJNA471950 SAMN09228071 Pil10 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil10.1.fq.gz Pil10.2.fq.gz
PRJNA471950 SAMN09228072 Pil11 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil11.1.fq.gz Pil11.2.fq.gz
PRJNA471950 SAMN09228073 Pil12 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil12.1.fq.gz Pil12.2.fq.gz
PRJNA471950 SAMN09228074 Pil13 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil13.1.fq.gz Pil13.2.fq.gz
PRJNA471950 SAMN09228075 Pil14 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil14.1.fq.gz Pil14.2.fq.gz
PRJNA471950 SAMN09228076 Pil15 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil15.1.fq.gz Pil15.2.fq.gz
PRJNA471950 SAMN09228077 Pil16 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil16.1.fq.gz Pil16.2.fq.gz
PRJNA471950 SAMN09228078 Pil17 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil17.1.fq.gz Pil17.2.fq.gz
PRJNA471950 SAMN09228079 Pil18 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil18.1.fq.gz Pil18.2.fq.gz
PRJNA471950 SAMN09228080 Pil19 WGS of P. ilicifolia WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil19.1.fq.gz Pil19.2.fq.gz
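The SRA metadata rows above differ only in the BioSample accession, the library ID and the file names, so they can be generated rather than typed. This is a rough sketch under two assumptions: the BioSample accessions are consecutive (as they happen to be in this example), and the read files follow the PilNN.1.fq.gz / PilNN.2.fq.gz naming. The function names are hypothetical.

```python
import csv
import io

# Column names copied from the SRA metadata header above.
HEADER = (
    "bioproject_accession biosample_accession library_ID title library_strategy "
    "library_source library_selection library_layout platform instrument_model "
    "design_description filetype filename filename2 filename3 filename4 assembly"
).split()


def sra_rows(bioproject, first_biosample, n_samples):
    """Yield one row per sample, assuming consecutive BioSample numbers."""
    for i in range(n_samples):
        sample = f"Pil{i + 1:02d}"
        yield {
            "bioproject_accession": bioproject,
            # SAMN accessions in the example carry an 8-digit, zero-padded number.
            "biosample_accession": f"SAMN{first_biosample + i:08d}",
            "library_ID": sample,
            "title": "WGS of P. ilicifolia",
            "library_strategy": "WGS",
            "library_source": "GENOMIC",
            "library_selection": "RANDOM",
            "library_layout": "paired",
            "platform": "ILLUMINA",
            "instrument_model": "HiSeq X Ten",
            "design_description":
                "leaves used to extract genomic DNA and WGS Illumina protocol",
            "filetype": "fastq",
            "filename": f"{sample}.1.fq.gz",
            "filename2": f"{sample}.2.fq.gz",
        }


def write_sra_metadata(handle, bioproject, first_biosample, n_samples):
    """Write the tab-separated SRA metadata table; unused columns stay empty."""
    writer = csv.DictWriter(handle, fieldnames=HEADER, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(sra_rows(bioproject, first_biosample, n_samples))


if __name__ == "__main__":
    buf = io.StringIO()
    write_sra_metadata(buf, "PRJNA471950", 9228062, 19)
    print(buf.getvalue(), end="")
```

If the accessions assigned to your samples are not consecutive, read them from the BioSample table NCBI sends back instead of computing them.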
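Once the metadata has been accepted, the reads can be uploaded. There are two ways:
- Upload directly from the server over FTP:
For example:
The P. ilicifolia reads are in the directory /share/work/01.all.reads/00.細葉楊populus_ilifolia, so the steps are: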
cd /share/work/01.all.reads/00.細葉楊populus_ilifolia
lftp subftp:w4pYB9VQ@ftp-private.ncbi.nlm.nih.gov
cd uploads/zeyuan.chen@qq.com_xxxx
mkdir PilWGS
cd PilWGS
put Pil02.1.fq.gz Pil02.2.fq.gz ...
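When the upload has finished, submit on the web page.
- Upload from your own computer with Aspera:
Download and install Aspera, and remember the directory it is installed in.
Get the key file and save it.
Open cmd on the computer and change to Aspera's bin directory.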
cd C:\Users\82002\AppData\Local\Programs\Aspera\Aspera Connect\bin
ascp -i G:\aspera.openssh -QT -l300m -k1 -d G:\PILWGS\ subasp@upload.ncbi.nlm.nih.gov:/uploads/zeyuan.chen@qq.com_xxxx/
Then choose the second option and submit.
Done! Now just wait for the review.
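Whichever upload route is used, the file names in the SRA metadata table have to match the uploaded files exactly, including upper and lower case. A small check before uploading can catch mismatches early. The sketch below assumes the metadata table has been saved as a tab-separated file; the function name and the example paths are hypothetical.

```python
import csv
from pathlib import Path


def missing_files(metadata_path, reads_dir):
    """Return file names listed in the metadata table but absent from reads_dir."""
    reads_dir = Path(reads_dir)
    missing = []
    with open(metadata_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for column, name in row.items():
                # filename, filename2, ... hold the files of one library;
                # empty cells and surplus unnamed columns are skipped.
                if column and column.startswith("filename") and name:
                    if not (reads_dir / name).is_file():
                        missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical paths; adjust to where the table and reads actually live.
    for name in missing_files("sra_metadata.tsv", "."):
        print("missing:", name)
```

On a case-sensitive file system, a file stored as pil02.1.fq.gz would be reported as missing when the table lists Pil02.1.fq.gz.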
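2.2 Uploading the genome sequence
As before, create a BioSample, then choose Genome.
The remaining steps are omitted here. After submission the uploaded genome sequence is reviewed, and NCBI emails you to say whether it passed or has problems that need to be fixed. In my case the assembly contained mitochondrial sequences and adaptors.
Here is the script I used to clean it up. Sequences listed under Exclude in NCBI's contamination report are dropped, and spans listed under Trim are replaced with Ns.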
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $ip   = "Pil.Genome.NCBI.fa";      # assembly as submitted to NCBI
my $op   = "Pil.Genome.fa";           # cleaned assembly
my $file = "02.Contamination.txt";    # contamination report from NCBI

open(my $report, '<', $file) or die "Cannot open $file: $!\n";
open(my $out,    '>', $op)   or die "Cannot open $op: $!\n";

my %exclude;    # sequences to drop entirely
my %trim;       # sequence => index => {start, end}, 1-based and inclusive

while (my $eve = <$report>) {
    chomp $eve;
    # Data rows under the "Exclude:" and "Trim:" headings start with a
    # scaffold name; every other line of the report is skipped.
    next if $eve !~ /^scaffold/;
    my @a   = split /\s+/, $eve;
    my $chr = $a[0];
    if ($eve !~ /\.\./) {
        # No span on the line: the whole sequence is to be excluded.
        $exclude{$chr}++;
    }
    else {
        # The third column holds one or more spans: start..end[,start..end]
        my $i = 0;
        foreach my $element (split /,/, $a[2]) {
            next unless $element =~ /(\d+)\.\.(\d+)/;
            $trim{$chr}{$i}{start} = $1;
            $trim{$chr}{$i}{end}   = $2;
            $i++;
        }
    }
}
close $report;

my $in = Bio::SeqIO->new(-file => $ip, -format => 'Fasta');
while (my $s = $in->next_seq()) {
    my $id  = $s->id;
    my $seq = $s->seq;
    next if exists $exclude{$id};
    if (exists $trim{$id}) {
        # Mask each flagged span with Ns; the sequence length and the
        # coordinates of the remaining spans stay unchanged.
        foreach my $n (keys %{ $trim{$id} }) {
            my $start = $trim{$id}{$n}{start};
            my $end   = $trim{$id}{$n}{end};
            my $num   = $end - $start + 1;
            substr($seq, $start - 1, $num, "N" x $num);
        }
        # Drop masked sequences shorter than 200 bp.
        next if length($seq) < 200;
    }
    print $out ">$id\n$seq\n";
}
close $out;
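To make the masking step concrete, here is a small worked example of what happens to one sequence when its flagged spans are replaced with Ns. It is an illustrative sketch rather than a port of the script: the function names are hypothetical, and unlike the Perl code it clamps a span that runs past the end of the sequence.

```python
import re


def parse_spans(field):
    """Parse "start..end[,start..end...]" into a list of (start, end) tuples."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", field)]


def mask_spans(seq, spans):
    """Replace each 1-based, inclusive span with Ns, keeping the length unchanged."""
    chars = list(seq)
    for start, end in spans:
        # Clamp at the sequence end so the length never changes.
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


if __name__ == "__main__":
    spans = parse_spans("1..3,8..10")
    print(mask_spans("ACGTACGTACGT", spans))  # NNNTACGNNNGT
```

Because masking keeps every base at its original position, the spans can be applied in any order without adjusting coordinates, which would not hold if the spans were cut out instead.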
2.3 Uploading the genome reads works the same way as uploading the resequencing reads, so only the completed tables are shown here.
Attributes file
accession sample_name sample_title bioproject_accession organism isolate cultivar ecotype age dev_stage geo_loc_name tissue biomaterial_provider cell_line cell_type collected_by collection_date culture_collection disease disease_stage genotype growth_protocol height_or_length isolation_source lat_lon phenotype population sample_type sex specimen_voucher temp treatment description
SAMN09388354 Pil500 Populus ilicifolia Pil-500 not collected Kenya leaf
SAMN09388355 Pil800 Populus ilicifolia Pil-800 not collected Kenya leaf
SAMN09388356 Pil2k Populus ilicifolia Pil-2k not collected Kenya leaf
SAMN09388357 Pil5k Populus ilicifolia Pil-5k not collected Kenya leaf
SAMN09388358 Pil10k Populus ilicifolia Pil-10k not collected Kenya leaf
SRA meta file
bioproject_accession sample_name library_ID title library_strategy library_source library_selection library_layout platform instrument_model design_description filetype assembly filename filename2 filename3 filename4 filename5 filename6 filename7 filename8
PRJNA471950 Pil500 Pil-500 WGS of P. ilicifolia for assembly WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil.genome.500bp.1.fq.gz Pil.genome.500bp.2.fq.gz
PRJNA471950 Pil800 Pil-800 WGS of P. ilicifolia for assembly WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil.genome.800bp.1.fq.gz Pil.genome.800bp.2.fq.gz
PRJNA471950 Pil2k Pil-2k WGS of P. ilicifolia for assembly WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil.genome.2kbp.1.fq.gz Pil.genome.2kbp.2.fq.gz
PRJNA471950 Pil5k Pil5k WGS of P. ilicifolia for assembly WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil.genome.5kbp.1.fq.gz Pil.genome.5kbp.2.fq.gz
PRJNA471950 Pil10k Pil-10k WGS of P. ilicifolia for assembly WGS GENOMIC RANDOM paired ILLUMINA HiSeq X Ten leaves used to extract genomic DNA and WGS Illumina protocol fastq Pil.genome.10kbp.1.fq.gz Pil.genome.10kbp.2.fq.gz