[Data Usage] Using the ready-made SNPs in the 3K rice database



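We often dream of publishing a high-impact paper from data that already exists. Such fairy tales really do happen every day, but many of us do not know how to take the first step. So let's start with how to use the rice SNP database.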

 

http://snp-seek.irri.org/

 

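This is the 3K rice variant database, which hosts ready-made SNP calls. Because the dataset is so large, the site's maintainers store the SNP information online in Plink format. That is understandable: genome-wide SNPs for 3,025 rice accessions are not a small amount of data by any measure.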

The Plink format consists of the following three files:

base_filtered_v0.7.bed.gz
base_filtered_v0.7.bim.gz
base_filtered_v0.7.fam.gz
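
The three files are gzip-compressed, and as far as we know Plink 1.9 expects an uncompressed .bed/.bim/.fam fileset for "--bfile", so they most likely need to be unpacked first. The following is a minimal sketch, assuming the three .gz files sit in the current directory; the helper name decompress_fileset is our own and not part of Plink.

```shell
# Hypothetical helper (not part of Plink): unpack a gzipped binary fileset.
# Assumption: <prefix>.bed.gz, <prefix>.bim.gz and <prefix>.fam.gz are in
# the current directory.
decompress_fileset() {
    prefix="$1"
    for ext in bed bim fam; do
        # -c writes to stdout, so the original .gz file is left untouched
        gunzip -c "${prefix}.${ext}.gz" > "${prefix}.${ext}" || return 1
    done
}

# Usage, with the prefix taken from the file names listed above:
# decompress_fileset base_filtered_v0.7
```

After this step the prefix base_filtered_v0.7 can be passed to Plink as the input fileset.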

 

Plink's "--recode" option can convert these three files into VCF format:

--recode [output format] <01 | 12> <tab | tabx | spacex | bgz | gen-gz>
         <include-alt> <omit-nonmale-y>
  Create a new text fileset with all filters applied.  The following output
  formats are supported:
  * '23': 23andMe 4-column format.  This can only be used on a single
    sample's data (--keep may be handy), and does not support multicharacter
    allele codes.
  * 'A': Sample-major additive (0/1/2) coding, suitable for loading from R.
    If you need uncounted alleles to be named in the header line, add the
    'include-alt' modifier.
  * 'AD': Sample-major additive (0/1/2) + dominant (het=1/hom=0) coding.
    Also supports 'include-alt'.
  * 'A-transpose': Variant-major 0/1/2.
  * 'beagle': Unphased per-autosome .dat and .map files, readable by early
    BEAGLE versions.
  * 'beagle-nomap': Single .beagle.dat file.
  * 'bimbam': Regular BIMBAM format.
  * 'bimbam-1chr': BIMBAM format, with a two-column .pos.txt file.  Does not
    support multiple chromosomes.
  * 'fastphase': Per-chromosome fastPHASE files, with
    .chr-[chr #].recode.phase.inp filename extensions.
  * 'fastphase-1chr': Single .recode.phase.inp file.  Does not support
    multiple chromosomes.
  * 'HV': Per-chromosome Haploview files, with .chr-[chr #][.ped + .info]
    filename extensions.
  * 'HV-1chr': Single Haploview .ped + .info file pair.  Does not support
    multiple chromosomes.
  * 'lgen': PLINK 1 long-format (.lgen + .fam + .map), loadable with --lfile.
  * 'lgen-ref': .lgen + .fam + .map + .ref, loadable with --lfile +
     --reference.
  * 'list': Single genotype-based list, up to 4 lines per variant.  To omit
    nonmale genotypes on the Y chromosome, add the 'omit-nonmale-y' modifier.
  * 'rlist': .rlist + .fam + .map fileset, where the .rlist file is a
    genotype-based list which omits the most common genotype for each
    variant.  Also supports 'omit-nonmale-y'.
  * 'oxford': Oxford-format .gen + .sample.  With the 'gen-gz' modifier, the
    .gen file is gzipped.
  * 'ped': PLINK 1 sample-major (.ped + .map), loadable with --file.
  * 'compound-genotypes': Same as 'ped', except that the space between each
    pair of same-variant allele codes is removed.
  * 'structure': Structure-format.
  * 'transpose': PLINK 1 variant-major (.tped + .tfam), loadable with
    --tfile.
  * 'vcf', 'vcf-fid', 'vcf-iid': VCFv4.2.  'vcf-fid' and 'vcf-iid' cause
    family IDs or within-family IDs respectively to be used for the sample
    IDs in the last header row, while 'vcf' merges both IDs and puts an
    underscore between them.  If the 'bgz' modifier is added, the VCF file is
    block-gzipped.
    The A2 allele is saved as the reference and normally flagged as not based
    on a real reference genome (INFO:PR).  When it is important for reference
    alleles to be correct, you'll also want to include --a2-allele and
    --real-ref-alleles in your command.
  In addition,
  * The '12' modifier causes A1 (usually minor) alleles to be coded as '1'
    and A2 alleles to be coded as '2', while '01' maps A1 -> 0 and A2 -> 1.
  * The 'tab' modifier makes the output mostly tab-delimited instead of
    mostly space-delimited.  'tabx' and 'spacex' force all tabs and all
    spaces, respectively.

 

 

plink --bfile <prefix> --recode vcf-iid --out ./<out-prefix>
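
For a dataset of this size it may be worth writing a block-gzipped VCF directly, using the 'bgz' modifier described in the help text above. Below is a small sketch of the same command wrapped in a shell function. The function name bed_to_vcf and the PLINK variable are our own conventions, and the output prefix in the usage line is only an example.

```shell
# Hypothetical wrapper around the command above (names are our own).
# PLINK can point to a specific plink binary; it defaults to "plink" on PATH.
bed_to_vcf() {
    in_prefix="$1"    # prefix of the .bed/.bim/.fam fileset
    out_prefix="$2"   # prefix for the output files
    # vcf-iid: use within-family IDs as the VCF sample IDs
    # bgz:     block-gzip the resulting VCF
    "${PLINK:-plink}" --bfile "$in_prefix" --recode vcf-iid bgz --out "$out_prefix"
}

# Usage (assumes the fileset has already been decompressed):
# bed_to_vcf base_filtered_v0.7 ./3k_snps
```

With 'bgz', Plink should write a compressed VCF under the output prefix instead of a plain text one. As the help text notes, the A2 allele is written as the reference allele, so if the true reference alleles matter, --a2-allele and --real-ref-alleles are worth looking into.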

 

In this way, the information stored in the bed fileset is converted into a usable VCF.

