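We often daydream about publishing a high-impact paper using nothing but existing data. Such fairy tales really do happen every day, but many of us don't know how to take the first step, so let's start with how to use the rice SNP database.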
http://snp-seek.irri.org/
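This is the 3K rice variant database, which hosts ready-made SNP calls. Because the dataset is so large, the site's maintainers distribute the SNP data in PLINK binary format. That is understandable: genome-wide SNPs for 3025 rice accessions are not a small amount of data, however you count it.
The PLINK format consists of the following three files (.bed holds the binary genotypes, .bim the variant information, and .fam the sample information):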
base_filtered_v0.7.bed.gz
base_filtered_v0.7.bim.gz
base_filtered_v0.7.fam.gz
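PLINK's "--bfile" looks for the fileset under its uncompressed names, so the three downloads need unpacking first. The following is a minimal sketch, assuming the .gz files sit in the current directory and gunzip is available. The helper names decompress_fileset and count_lines are illustrative and not part of PLINK.

```shell
# Unpack PREFIX.bed.gz, PREFIX.bim.gz and PREFIX.fam.gz, keeping the .gz originals.
# (decompress_fileset is a hypothetical helper name, not a PLINK command.)
decompress_fileset() {
    for ext in bed bim fam; do
        gunzip -c "$1.$ext.gz" > "$1.$ext" || return 1
    done
}

# Print the number of newline-terminated lines in a text file.
# .fam has one line per sample and .bim has one line per variant.
count_lines() {
    wc -l < "$1" | tr -d ' '
}
```

After running "decompress_fileset base_filtered_v0.7", "count_lines base_filtered_v0.7.fam" gives the number of samples and "count_lines base_filtered_v0.7.bim" the number of variants, which is a cheap way to confirm the download is complete before converting anything.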
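PLINK's "--recode" option converts these three files to VCF format (decompress them first, since "--bfile" expects uncompressed .bed/.bim/.fam files). Its help text reads: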
--recode [output format] <01 | 12> <tab | tabx | spacex | bgz | gen-gz>
         <include-alt> <omit-nonmale-y>
  Create a new text fileset with all filters applied. The following output formats are supported:
  * '23': 23andMe 4-column format. This can only be used on a single sample's data (--keep may be handy), and does not support multicharacter allele codes.
  * 'A': Sample-major additive (0/1/2) coding, suitable for loading from R. If you need uncounted alleles to be named in the header line, add the 'include-alt' modifier.
  * 'AD': Sample-major additive (0/1/2) + dominant (het=1/hom=0) coding. Also supports 'include-alt'.
  * 'A-transpose': Variant-major 0/1/2.
  * 'beagle': Unphased per-autosome .dat and .map files, readable by early BEAGLE versions.
  * 'beagle-nomap': Single .beagle.dat file.
  * 'bimbam': Regular BIMBAM format.
  * 'bimbam-1chr': BIMBAM format, with a two-column .pos.txt file. Does not support multiple chromosomes.
  * 'fastphase': Per-chromosome fastPHASE files, with .chr-[chr #].recode.phase.inp filename extensions.
  * 'fastphase-1chr': Single .recode.phase.inp file. Does not support multiple chromosomes.
  * 'HV': Per-chromosome Haploview files, with .chr-[chr #][.ped + .info] filename extensions.
  * 'HV-1chr': Single Haploview .ped + .info file pair. Does not support multiple chromosomes.
  * 'lgen': PLINK 1 long-format (.lgen + .fam + .map), loadable with --lfile.
  * 'lgen-ref': .lgen + .fam + .map + .ref, loadable with --lfile + --reference.
  * 'list': Single genotype-based list, up to 4 lines per variant. To omit nonmale genotypes on the Y chromosome, add the 'omit-nonmale-y' modifier.
  * 'rlist': .rlist + .fam + .map fileset, where the .rlist file is a genotype-based list which omits the most common genotype for each variant. Also supports 'omit-nonmale-y'.
  * 'oxford': Oxford-format .gen + .sample. With the 'gen-gz' modifier, the .gen file is gzipped.
  * 'ped': PLINK 1 sample-major (.ped + .map), loadable with --file.
  * 'compound-genotypes': Same as 'ped', except that the space between each pair of same-variant allele codes is removed.
  * 'structure': Structure-format.
  * 'transpose': PLINK 1 variant-major (.tped + .tfam), loadable with --tfile.
  * 'vcf', 'vcf-fid', 'vcf-iid': VCFv4.2. 'vcf-fid' and 'vcf-iid' cause family IDs or within-family IDs respectively to be used for the sample IDs in the last header row, while 'vcf' merges both IDs and puts an underscore between them. If the 'bgz' modifier is added, the VCF file is block-gzipped. The A2 allele is saved as the reference and normally flagged as not based on a real reference genome (INFO:PR). When it is important for reference alleles to be correct, you'll also want to include --a2-allele and --real-ref-alleles in your command.
  In addition,
  * The '12' modifier causes A1 (usually minor) alleles to be coded as '1' and A2 alleles to be coded as '2', while '01' maps A1 -> 0 and A2 -> 1.
  * The 'tab' modifier makes the output mostly tab-delimited instead of mostly space-delimited. 'tabx' and 'spacex' force all tabs and all spaces, respectively.
plink --bfile <prefix> --recode vcf-iid --out ./<out-prefix>
This converts the information in the .bed fileset into a usable VCF file, written to <out-prefix>.vcf.
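To tie the steps together, here is a hedged sketch of the conversion plus a quick sanity check on the result. It assumes plink (1.9) is on the PATH and that the fileset has already been decompressed. recode_to_vcf and vcf_sample_count are hypothetical helper names. The check relies on the VCF layout, in which the #CHROM header line has nine fixed columns followed by one column per sample.

```shell
# Convert PREFIX.{bed,bim,fam} to OUT.vcf, using within-family IDs as sample names.
# (recode_to_vcf is a hypothetical wrapper; it only forwards to the plink command above.)
recode_to_vcf() {
    plink --bfile "$1" --recode vcf-iid --out "$2"
}

# Count the sample columns in a VCF: the #CHROM header line holds 9 fixed
# columns (CHROM through FORMAT) followed by one tab-separated column per sample.
vcf_sample_count() {
    awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "$1"
}
```

For example, with an arbitrary output prefix such as rice_3k, running "recode_to_vcf base_filtered_v0.7 ./rice_3k" and then "vcf_sample_count ./rice_3k.vcf" should print the same number as the line count of the .fam file. A mismatch would suggest the conversion was interrupted or a filter dropped samples.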