GWAS | Principles and workflow | Genome-wide association study | Linkage disequilibrium (LD) | Manhattan plot | QQ plot


Must-read introductory GWAS tutorial: Statistical analysis of genome-wide association (GWAS) data

 

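Terminology and basic questions:

Association analysis: the "AS" in GWAS (genome-wide association study). It uses millions of single-nucleotide polymorphisms (SNPs) across the genome as molecular genetic markers and runs a genome-wide case-control comparison or correlation analysis to find the genetic variants that influence complex traits. In practice, variants are selected across the whole genome and genotyped, the frequency of each variant is compared between the case group and the control group, the strength of association between each variant and the target trait is tested statistically, and the most strongly associated variants are taken forward for validation, which finally confirms or refutes their association with the trait.

Linkage disequilibrium (LD): if two loci are not in LD, their alleles are independent and combine at random, so the expected frequency of the haplotype carrying both A and B is P(A) * P(B). Writing P(AB) for the haplotype frequency actually observed in the population, P(AB) = D + P(A) * P(B), where D measures the degree of LD between the two loci (D = 0 means no LD).

Manhattan plot: in biology and statistics, when summarizing frequencies, mutation distributions, or GWAS associations, we often see Manhattan plots, which show the distribution and significance of candidate loci at a glance. The inputs are locus coordinates and p-values. The map file contains at least three columns: chromosome, SNP name, and SNP physical position. The assoc file contains the SNP name and p-value. Haploview can draw the plot from these.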

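What is the essential nature of a SNP? In the broad sense it is a kind of variation: the most common type of genetic variation, alongside indels, CNVs, and SVs. Each SNP represents a difference in a single DNA building block, called a nucleotide. In the narrow sense it is a marker: a biological marker. Because a SNP is a single base, it is also a site that marks one position on a chromosome. Human genomes are more than 99% identical between most people; SNPs are among the variable sites that differ across the population. These differences/markers can be used to study disease: statistical methods identify the sites most strongly associated with a disease and hence its risk alleles.

How does a SNP array work? A SNP array does not read out a single base; it calls the alleles at each SNP. So each genotype call takes one of three values (1 - AA; 2 - AB; 3 - BB), which may also be coded 0, 1, 2.

What is the difference between linkage disequilibrium (LD) and pairwise correlation?

How do you distinguish somatic from germline mutations? In multicellular organisms, mutations can be classed as either somatic or germline. Telling them apart usually requires sequencing trios or matched healthy tissue. The most obvious example is cancer, where most variations are somatic.

What is the difference between a SNP, a variant, and a mutation? By convention, "SNP" is used for variation presumed to be neutral, while "mutation" usually implies a link to disease. The second difference is frequency: a SNP is common in the population, whereas a mutation is rare. "Variant" and "variation" are synonyms and serve as the neutral umbrella term, so every SNP is a variant, although not every variant is a SNP.

Why do we still need haplotypes? What motivated the HapMap project? The HapMap is valuable by reducing the number of SNPs required to examine the entire genome for association with a phenotype from the 10 million SNPs that exist to roughly 500,000 tag SNPs.

On what basis are common and rare variants distinguished, and how should "common" and "rare" be read in papers? If a variant is a SNP and a SNP is a site, how can a site be common or uncommon? This is a bit counterintuitive. "Common" here refers to the minor allele, i.e. the second most common allele at the site. Take the SNP rs78601809: its position is known, its allele frequency in different populations is known, and its overall MAF is 0.39 (T). If a SNP has MAF < 1%, it is a rare variant; intuitively, the base at this site seldom differs between people in the population. Cutoffs vary between papers, for example: rare variants (MAF < 0.05) appeared more frequently in coding regions than common variants (MAF > 0.05) in this population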
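To make the MAF definition concrete, here is a minimal Python sketch that counts alleles at one biallelic SNP and labels the result with the common / low-frequency / rare cutoffs quoted in the next paragraph (MAF > 5%, 1 to 5%, < 1%). The function names and the "00" missing-call convention are assumptions made for this illustration, not part of any tool mentioned here.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor_allele, maf) for one biallelic SNP.

    genotypes: two-character strings such as "CT"; "00" marks a missing call
    (a convention assumed for this sketch).
    """
    counts = Counter()
    for genotype in genotypes:
        if genotype == "00":
            continue  # missing calls contribute no alleles
        counts.update(genotype)  # adds both alleles of the diploid call
    if len(counts) < 2:
        return None, 0.0  # monomorphic (or empty) site has no minor allele
    (_, n_major), (minor, n_minor) = counts.most_common(2)
    return minor, n_minor / (n_major + n_minor)


def frequency_class(maf):
    """Label a MAF using the >5% / 1-5% / <1% convention."""
    if maf > 0.05:
        return "common"
    if maf >= 0.01:
        return "low-frequency"
    return "rare"


minor, maf = minor_allele_frequency(["CC", "CT", "TT", "CC", "CT", "00"])
print(minor, maf, frequency_class(maf))
```

Note that the label depends on the population sampled: the same SNP can be common in one population and rare in another.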

Genetic variants that are outside the reach of the most statistically powered association studies [13] are thought to contribute to the missing heritability of many human traits, including common variants (here denoted by minor allele frequency [MAF] >5%) of very weak effect, low-frequency (MAF 1–5%) and rare variants (MAF <1%) of small to modest effect, or a combination of both, with several possible scenarios all deemed plausible in simulation studies [14]. 

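Why is genetics so preoccupied with MAF?

Because, from an evolutionary point of view (natural selection), a risk allele is more likely to be the minor allele. This is not absolute, but risk alleles are enriched among minor alleles. See the paper: Are minor alleles more likely to be risk alleles?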

 

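Common variants together account for a small proportion of heritability estimated from family studies. Common variants usually lie in non-coding regions, make up a small fraction of all variants, and have fairly small effect sizes.

What do "small effect" and "large effect" mean for a SNP? They refer to the effect size.

Terms that are extremely easy to confuse: SNP, mutation, variant, allele, genotype; allele frequency, genotype frequency, alternative allele frequency, MAF. You must be able to tell these apart quickly, otherwise what you are doing is not real statistical genetics.

What are gene-based rare-variant burden tests for? Example: Increased Burden of Rare Variants Among S-HSCR.

What are epistatic effects?

Why is L-HSCR said to be autosomal dominant? Inheritance is rarely strictly linear; dominance relationships are complex, with incomplete dominance and dosage effects.

How should alleles and dominance be viewed at the DNA sequence level? On understanding alleles: the classical notion of an allele is quite abstract. Dominant vs. recessive: we are diploid, so for every gene we carry two alleles. In a heterozygote the two copies differ in sequence, so the proteins they express can differ too, and the two alleles have a complex dominance relationship. This is why conventional gene-level expression analysis is rather coarse; ideally expression would be measured at the isoform level, since there is still some distance between gene and protein. The main reason isoform-level work is not yet routine is that our knowledge of proteins is still limited.

A new research question: worldwide, how did human populations gradually diverge into what we see today, and which core genetic factors determine the phenotypic differences between populations? Second, why do some diseases show markedly different frequencies between populations; for example, why is the incidence of HSCR higher in Asians? What role do genetic factors play?

Genetic effects: Additive genetic effects occur when two or more genes source a single contribution to the final phenotype, or when alleles of a single gene (in heterozygotes) combine so that their combined effects equal the sum of their individual effects.[1][2] Non-additive genetic effects involve dominance (of alleles at a single locus) or epistasis (of alleles at different loci). In other words, under an additive model, disease risk rises in proportion to the number of risk alleles carried.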
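The additive and non-additive (dominance) models differ only in how a genotype is turned into a number before it enters a statistical test. A minimal Python sketch of the three usual codings follows; the function name `encode_genotype` is hypothetical and used only for illustration.

```python
def encode_genotype(n_risk_alleles, model="additive"):
    """Code a genotype (0, 1, or 2 risk alleles) under a genetic model.

    additive : 0 / 1 / 2  (each extra risk allele adds the same effect)
    dominant : 0 / 1 / 1  (one copy is enough for the full effect)
    recessive: 0 / 0 / 1  (both copies are needed)
    """
    if n_risk_alleles not in (0, 1, 2):
        raise ValueError("a diploid genotype carries 0, 1, or 2 risk alleles")
    if model == "additive":
        return n_risk_alleles
    if model == "dominant":
        return int(n_risk_alleles >= 1)
    if model == "recessive":
        return int(n_risk_alleles == 2)
    raise ValueError(f"unknown model: {model}")


for model in ("additive", "dominant", "recessive"):
    print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
```

Epistasis is not covered by this sketch, since it involves interactions between different loci rather than the coding of a single one.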

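How many variants/SNPs are there in the human genome? The 1000 Genomes data report about 84.4 million. This is a conservative figure because only 2,504 people were included, roughly 100 per population; although that is reasonably representative, the real number must be higher. A rough guess of 300 million would mean about one variant every 10 bases of the roughly 3-billion-base genome. Within a single individual the number is about 3 million, i.e. about one in a thousand bases.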

 

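First, an intuitive view of how GWAS works:

The core is the association between SNP and phenotype. For each site in the genome, if a particular SNP allele keeps turning up together with a disease, that is, the SNP and the phenotype vary together, we can infer that the SNP is very likely associated with the phenotype (disease).

More formally, we ask whether the allele frequency of a SNP differs significantly between the case and control populations.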
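One simple way to test for an allele-frequency difference between cases and controls is a chi-square test on the 2x2 table of allele counts. The Python sketch below is an illustration under that assumption (it ignores covariates, population structure, and small-count corrections); the function name and the example genotype counts are made up.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Allelic chi-square test (1 degree of freedom) for one biallelic SNP.

    case_counts / control_counts: (n_AA, n_Aa, n_aa) genotype counts.
    Returns (chi2, p_value, odds_ratio of allele a, cases vs controls).
    """
    def allele_counts(genotype_counts):
        n_AA, n_Aa, n_aa = genotype_counts
        return 2 * n_AA + n_Aa, 2 * n_aa + n_Aa  # (count of A, count of a)

    a, b = allele_counts(case_counts)      # A and a alleles in cases
    c, d = allele_counts(control_counts)   # A and a alleles in controls
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (b * c) / (a * d)
    return chi2, p_value, odds_ratio


chi2, p, odds_ratio = allelic_chi_square((30, 50, 20), (50, 40, 10))
print(f"chi2={chi2:.2f} p={p:.4g} OR={odds_ratio:.2f}")
```

In a real GWAS the same test is repeated at every SNP, which is why a much stricter significance threshold than 0.05 is needed.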

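In reality, sample sizes are limited, cases and controls are sometimes unbalanced, samples differ by sex and ancestry, and a statistical test has to be run at millions of sites across a genome of roughly 3 billion bases.

Which metrics should we design to evaluate the association between a SNP and a phenotype?

Think: what if one site carries several SNPs and only one of them is associated with the disease? That is a misconception: a genomic site can hold only one SNP, although that SNP can have several alleles.

Remember: each point in a Manhattan plot is a SNP, not a sample.

Think: in a Manhattan plot, a significant SNP does not stick out alone; it looks as if it is lifted up by its neighbors, like a skyscraper rising from the ground. Each tower is a set of SNPs linked to one another, i.e. in high LD.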
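A Manhattan plot needs only two numbers per SNP: a genome-wide x position and -log10(p). The Python sketch below computes those coordinates from (chromosome, base-pair position, p-value) tuples; the function name and the toy data are assumptions for illustration, and the actual drawing is left to whatever plotting tool is preferred.

```python
import math


def manhattan_coordinates(results, gap=0):
    """Turn (chrom, bp, p_value) tuples into (x, -log10(p), chrom) points.

    Chromosomes are laid end to end: each chromosome is offset by the
    largest position seen on all previous chromosomes (plus an optional gap).
    """
    rows = sorted(results)  # sorts by chromosome, then position
    chrom_length = {}
    for chrom, bp, _ in rows:
        chrom_length[chrom] = max(chrom_length.get(chrom, 0), bp)

    offsets, running = {}, 0
    for chrom in sorted(chrom_length):
        offsets[chrom] = running
        running += chrom_length[chrom] + gap

    return [(offsets[chrom] + bp, -math.log10(p), chrom) for chrom, bp, p in rows]


toy_results = [(2, 100, 1e-8), (1, 500, 0.01), (1, 200, 0.5)]
for x, y, chrom in manhattan_coordinates(toy_results):
    print(chrom, x, round(y, 2))
```

Because neighboring SNPs in LD share similar p-values, the points around a true signal form the "tower" described above rather than a single isolated dot.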

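Think: although every point in a Manhattan plot is a SNP, the most significant SNP is usually assigned to a gene, because what people care about most is the causal origin of the disease. This approach, however, really only works for SNPs in coding regions.

Note: the most prominent SNP is very likely not the causal SNP; it is merely near the causal SNP. So how do we find the causal SNP? By fine mapping.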

 

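Background

What is a SNP? A single-base mutation that arose at random during evolution and is stably inherited in the population.

What is allele frequency in a population? Every variable site in the genome has two or more alleles, and the alleles differ clearly in frequency. The simple way to think of it is as the frequencies of A and a, except that here the unit is a base position rather than a trait gene.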

 

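Prerequisites for a GWAS

Adequate sample size: anyone who has studied statistics knows that sample size affects power, and without enough power you cannot reach reliable conclusions. GWAS usually needs many samples: a few thousand is standard, a few hundred is too few, and some studies now reach tens or hundreds of thousands;

A big misconception is that GWAS means whole-genome sequencing (WGS). It does not, because that is too expensive. Most studies use a DNA chip (properly, a SNP array), which covers only about 10^6 common SNPs. With a larger budget you can use WES and get all SNPs in coding regions; with the largest budget, WGS detects everything, coding and non-coding, common and rare, which is what makes the 1000 Genomes Project so powerful.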

 

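That covers the rough idea. There is also the statistical theory, which we skip for now and go straight to the hands-on part.

How do you run a GWAS with PLINK? See the YouTube video "GWAS in Plink", which provides the paper, example data, and code for download so you can run it and get familiar.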

 

References:

Analysis of Microarray Data

Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances

Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays

Discriminating somatic and germline mutations in tumor DNA samples without matching normals

The impact of rare and low-frequency genetic variants in common disease


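A published GWAS pipeline: A tutorial on conducting genome-wide association studies: Quality control and statistical analysis

GitHub repository: https://github.com/MareesAT/GWA_tutorial/

The following goes through the operational details of this workflow:

It covers four main analyses: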

  1. All essential GWAS QC steps along with scripts for data visualization.
  2. Dealing with population stratification, using 1000 genomes as a reference.
  3. Association analyses of GWAS data.
  4. Polygenic risk score (PRS) analyses.

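First, PLINK's text file formats:

ped: rows are individuals; columns hold the phenotype and the SNP genotype data;

map: per-SNP information;

The binary fileset has three files:

Essentially, ped is split into fam and bed, and map becomes bim.

Covariate analysis is usually needed, so there is also a covariate file.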

QC:

Step | Command | Function
1: Missingness of SNPs and individuals | --geno | Excludes SNPs that are missing in a large proportion of the subjects. In this step, SNPs with low genotype calls are removed.
1 (continued) | --mind | Excludes individuals who have high rates of genotype missingness. In this step, individuals with low genotype calls are removed.
2: Sex discrepancy | --check-sex | Checks for discrepancies between sex of the individuals recorded in the dataset and their sex based on X chromosome heterozygosity/homozygosity rates.
3: Minor allele frequency (MAF) | --maf | Includes only SNPs above the set MAF threshold.
4: Hardy-Weinberg equilibrium (HWE) | --hwe | Excludes markers which deviate from Hardy-Weinberg equilibrium.
5: Heterozygosity | For an example script see https://github.com/MareesAT/GWA_tutorial/ | Excludes individuals with high or low heterozygosity rates.
6: Relatedness | --genome | Calculates identity by descent (IBD) of all sample pairs.
6 (continued) | --min | Sets threshold and creates a list of individuals with relatedness above the chosen threshold. Meaning that subjects who are related at, for example, pi-hat > 0.2 (i.e., second degree relatives) can be detected.
7: Population stratification | --genome | Calculates identity by descent (IBD) of all sample pairs.
7 (continued) | --cluster --mds-plot k | Produces a k-dimensional representation of any substructure in the data, based on IBS.
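To show what the per-SNP filters in steps 1, 3, and 4 of the table compute, here is a minimal Python sketch that applies a missingness, MAF, and HWE filter to one SNP. It is an illustration only: PLINK itself uses an exact HWE test, whereas this sketch substitutes a simple chi-square approximation, and the function names and default thresholds (0.05, 0.05, 1e-4) are assumptions chosen for the example.

```python
import math


def hwe_chi_square_p(n_AA, n_Aa, n_aa):
    """Chi-square (1 df) p-value for deviation from Hardy-Weinberg proportions."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2))


def snp_passes_qc(genotypes, max_missing=0.05, min_maf=0.05, min_hwe_p=1e-4):
    """genotypes: list of 0/1/2 allele counts, with None for a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = 1 - len(called) / len(genotypes)
    n_AA, n_Aa, n_aa = called.count(0), called.count(1), called.count(2)
    freq = (n_Aa + 2 * n_aa) / (2 * len(called))
    maf = min(freq, 1 - freq)
    return (
        missing_rate <= max_missing
        and maf >= min_maf
        and hwe_chi_square_p(n_AA, n_Aa, n_aa) >= min_hwe_p
    )


print(snp_passes_qc([0] * 25 + [1] * 50 + [2] * 25))  # balanced SNP in HWE
print(snp_passes_qc([1] * 100))                       # all heterozygous: fails HWE
```

The per-individual steps (missingness per sample, sex check, heterozygosity, relatedness) work on the same genotype matrix but summarize across SNPs instead of across samples.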

 

fine mapping

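As background, GWAS only took off around 2007, which is why the well-known review on ten years of GWAS appeared in 2017; fine mapping of GWAS signals came after GWAS itself.

The lab started working on fine mapping early: 2009 - Fine mapping of the 9q31 Hirschsprung's disease locus

Look at the introduction: what is fine mapping?

The aim is simple: most variants found by GWAS are not the causal variants, and fine mapping fills this gap.

Once GWAS has produced a rough set of SNPs, two further analyses are needed:

The first step assigns each SNP a probability of being causal, which is fine-mapping; the second step uses functional annotation to establish which gene the SNP actually affects.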

The first is to assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping. The second step is to try to connect these variants to likely genes whose perturbation leads to altered disease risk by functional annotation. 
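As a rough illustration of "assigning probabilities of causality", the Python sketch below turns per-SNP effect estimates into posterior probabilities using Wakefield-style approximate Bayes factors. It rests on strong simplifying assumptions that are not taken from the text above: exactly one causal variant in the region, equal prior probability for every SNP, and a normal prior on the effect size with standard deviation 0.2. The SNP names and summary statistics are made up.

```python
import math


def log_approx_bayes_factor(beta, se, prior_sd=0.2):
    """Log of an approximate Bayes factor in favor of association."""
    v = se ** 2            # sampling variance of the effect estimate
    w = prior_sd ** 2      # prior variance of the true effect (assumed)
    shrink = w / (v + w)
    z = beta / se
    return 0.5 * math.log(1 - shrink) + 0.5 * z * z * shrink


def posterior_probabilities(summary_stats):
    """summary_stats: dict of SNP name -> (beta, se) within one locus.

    Returns dict of SNP name -> posterior probability of being the single
    causal variant, under the assumptions stated above.
    """
    log_bf = {snp: log_approx_bayes_factor(b, s) for snp, (b, s) in summary_stats.items()}
    top = max(log_bf.values())  # subtract the maximum to avoid overflow in exp
    weights = {snp: math.exp(value - top) for snp, value in log_bf.items()}
    total = sum(weights.values())
    return {snp: weight / total for snp, weight in weights.items()}


locus = {"snp1": (0.30, 0.05), "snp2": (0.25, 0.05), "snp3": (0.02, 0.05)}
for snp, prob in posterior_probabilities(locus).items():
    print(snp, round(prob, 4))
```

Dedicated fine-mapping tools relax these assumptions, for example by allowing several causal variants and by modeling LD between SNPs explicitly.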

Basic principles:

Strategies for fine-mapping complex traits - 

Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs

Although eQTLs are increasingly used to provide mechanistic interpretations for human disease associations, the cell type specificity of eQTLs presents a problem. Because the cell type from which a given physiological phenotype arises may not be known, and because eQTL data exist for a limited number of cell types, it is critical to quantify and understand the mechanisms generating cell type specific eQTLs. For example, if a GWAS identifies a set of SNPs associated with risk of type II diabetes, the researcher must choose a target cell type to develop a mechanistic model of the molecular phenotype that causes the gross physiological change. One can imagine that the relevant cell type might be adipose tissue, liver, pancreas, or another hormone-regulating tissue. Furthermore, if the GWAS SNP produces a molecular phenotype (i.e., is an eQTL) in lymphoblastoid cell lines (LCLs), it is not necessarily the case that the SNP will generate a similar molecular phenotype in the cell type of interest. Furthermore, there are many examples of cell types with particular relevance to common diseases, for example dopaminergic neurons and Parkinson's disease, that lack comprehensive eQTL data or catalogs of CREs. The utility of eQTLs for complex trait interpretation will therefore be improved by a more thorough annotation of their cell type specificity.

The biggest problem with eQTLs is still insufficient cell-type specificity; the key is to define cell types precisely enough.

 

 


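GWAS is by now a fairly old technique and has hit a serious bottleneck: SNP-phenotype association alone is no longer enough, and a concrete biological explanation of how these SNPs lead to disease is needed.

Moreover, for most diseases the result is not a few strongly significant SNPs but many SNPs, most of which fall in non-coding regions, which makes them very hard to interpret.

For a classic example of interpretation, see this New England Journal of Medicine article: FTO Obesity Variant Circuitry and Adipocyte Browning in Humans

 

A GWAS has two core outputs, the Manhattan plot and the QQ plot; being able to read these two is enough to start with.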
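A QQ plot compares the observed p-values with what a uniform distribution would give under the null hypothesis, and the genomic inflation factor (lambda) summarizes the same comparison in one number. The Python sketch below computes both from a list of p-values using only the standard library; the function names and the (i + 0.5) / n plotting positions are choices made for this illustration.

```python
import math
from statistics import NormalDist, median

_NORMAL = NormalDist()


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, smallest p-value first."""
    observed = sorted(p_values)
    n = len(observed)
    return [
        (-math.log10((i + 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed)
    ]


def genomic_inflation(p_values):
    """Median observed chi-square (1 df) divided by the null median."""
    # A two-sided p-value p corresponds to chi2 = z**2 with z = inv_cdf(p / 2).
    chi2 = [_NORMAL.inv_cdf(p / 2) ** 2 for p in p_values]
    null_median = _NORMAL.inv_cdf(0.75) ** 2  # about 0.455
    return median(chi2) / null_median


uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation(uniform_p), 3))
```

Points lying on the diagonal with lambda close to 1 suggest a well-behaved test; a lambda well above 1 across the whole distribution hints at confounding such as population stratification.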

Merely being able to run a GWAS pipeline is no longer worth much; the emphasis is now on downstream analysis, with several hot topics:

  • Polygenic risk score (PRS) analyses
  • meta-analysis

The International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/; Gibbs et al., 2003) described the patterns of common SNPs within the human DNA sequence whereas the 1000 Genomes (1KG) project (http://www.1000genomes.org/; Altshuler et al., 2012) provided a map of both common and rare SNPs.

Common and rare are defined by allele frequency, but there seems to be no fixed cutoff.

HapMap used arrays, so everything measured was a hand-picked set of sites, i.e. common SNPs; 1000 Genomes used WGS, so it covers all sites, both common and rare.

The core of GWAS is LD. Most current GWAS genotype with arrays because they are cheap.

A GWAS misses many sites, which is why fine-mapping exists, together with imputation of ungenotyped sites based on haplotypes.

 

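Linkage disequilibrium (LD): alleles at different loci occur in the population at given frequencies. LD is the phenomenon in which, in a population, two particular alleles at different loci occur on the same chromosome more often than expected by chance. (In other words, Mendelian independent assortment does not hold for these loci: the closer two alleles sit on a chromosome, the more they tend to travel together, which is a physical constraint.)

For example, take two neighboring genes with alleles A/a and B/b. If the two loci were inherited independently, the expected frequency of the haplotype AB in the offspring population would be P(A) * P(B). Let P(AB) be the frequency of haplotype AB actually observed in the population. The disequilibrium is then measured as D = P(AB) - P(A) * P(B).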
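The formula D = P(AB) - P(A) * P(B) can be checked numerically, along with the two normalized measures usually reported next to it, D' and r^2. The Python sketch below is a minimal illustration; the function name is hypothetical, and D' is returned as an absolute value.

```python
def ld_statistics(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) from haplotype and allele frequencies.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a : frequency of allele A at the first locus
    p_b : frequency of allele B at the second locus
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, d_prime, r_squared


print(ld_statistics(0.25, 0.5, 0.5))  # independent loci: D = 0
print(ld_statistics(0.50, 0.5, 0.5))  # A always travels with B: complete LD
print(ld_statistics(0.30, 0.6, 0.3))  # D' = 1 but r^2 < 1 (unequal allele frequencies)
```

The last case shows why both measures are reported: D' reaches 1 whenever one of the four haplotypes is absent, whereas r^2 reaches 1 only if the two SNPs are perfect proxies for each other.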

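In practice, one can genotype a large number of SNP markers spread across the genome, or markers near a candidate gene, to find sites that show association with the disease because they lie close enough to the causal site. This is the basic idea behind allelic association analysis, or LD mapping of genes.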

 

Paper to read: Strategies for fine-mapping complex traits

assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping.

 

Some other very important concepts:

effect size

power: statistical power, power analyses

Underestimated Effect Sizes in GWAS: Fundamental Limitations of Single SNP Analysis for Dichotomous Phenotypes

Understand it in context: One explanation of the missing heritability is that complex diseases are caused by a large number of causal variants with small effect sizes. 

 

PRS combines the effect sizes of multiple SNPs into a single aggregated score that can be used to predict disease risk
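In its simplest form, the aggregated score described above is a weighted sum: each SNP's effect size multiplied by the number of effect alleles the individual carries. The Python sketch below shows only that sum; the function name, SNP names, and weights are made up, and real PRS methods add steps such as LD clumping and p-value thresholding that are not shown here.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages.

    dosages: dict of SNP name -> number of effect alleles (0, 1, or 2)
    weights: dict of SNP name -> effect size (e.g. log odds ratio from a GWAS)
    SNPs present in only one of the two dicts are skipped.
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[snp] * dosages[snp] for snp in shared)


weights = {"snp1": 0.20, "snp2": -0.10, "snp3": 0.05}
person = {"snp1": 2, "snp2": 1, "snp4": 2}  # snp4 has no weight, snp3 was not genotyped
print(polygenic_risk_score(person, weights))
```

A raw score like this has no meaning on its own; it is interpreted relative to the distribution of scores in a reference population.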

 

 

 Haplotype phasing

Positions with 00 and 11 are called homozygous positions. Positions with 10 or 01 are called heterozygous positions. We note that the reference genome is neither the paternal nor the maternal genome but the genome of an un-related human (or more precisely the mixture of genomes of a few individuals). An individual’s haplotype is the set of variations in that individual’s chromosomes. We note that as any two human haplotypes are 99.9% similar, the mapping problem can be solved quite easily.

Haplotype phasing is the problem of inferring information about an individual’s haplotype. To solve this problem, there are many methods.

Lecture 10: Haplotype Phasing - Community Recovery

 


 

Reference: PLINK | File format reference

vcftools

 

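Main functions of PLINK: data management, basic summary statistics for quality control, population stratification analysis, basic single-locus association tests, transmission disequilibrium tests for family data, multipoint linkage analysis, haplotype association analysis, copy number variant analysis, meta-analysis, and more.

 

First you need to know PLINK's three binary formats: bed, fam, and bim. (Note: this bed is completely different from the BED interval file used for genomic regions.)

The formats PLINK needs can generally be converted from a VCF file (which is also a chance to meet the ped and map formats):

PED: Original standard text format for sample pedigree information and genotype calls. Normally must be accompanied by a .map file. It holds pedigree and genotype information; each line is one person.

MAP: Variant information file accompanying a .ped text pedigree + genotype table. It holds variant information; each line is one variant (SNP).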

# PED
     1 1 0 0 1  0    G G    2 2    C C
     1 2 0 0 1  0    A A    0 0    A C
     1 3 1 2 1  2    0 0    1 2    A C
     2 1 0 0 1  0    A A    2 2    0 0
     2 2 0 0 1  2    A A    2 2    0 0
     2 3 1 2 1  2    A A    2 2    A A
# MAP 
     1 snp1 0 1
     1 snp2 0 2
     1 snp3 0 3
# Convert VCF to ped and map
plink --vcf file.vcf --recode --out file
# Convert ped and map to bed, bim and fam
plink --file file --make-bed --out file
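To make the PED layout concrete (six leading fields, then two allele columns per variant), here is a minimal Python sketch that parses one line of the example above. The function name and the returned dictionary layout are assumptions for illustration; this is not how PLINK itself reads the file.

```python
def parse_ped_line(line):
    """Split one .ped line into sample fields and per-variant allele pairs.

    A genotype containing the allele code "0" is treated as missing (None).
    """
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2:
        raise ValueError("expected 6 fields plus two allele columns per variant")
    fid, iid, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    genotypes = [
        None if "0" in pair else pair
        for pair in zip(alleles[0::2], alleles[1::2])
    ]
    return {
        "fid": fid, "iid": iid, "father": father, "mother": mother,
        "sex": sex, "phenotype": phenotype, "genotypes": genotypes,
    }


record = parse_ped_line("1 3 1 2 1  2    0 0    1 2    A C")
print(record["iid"], record["phenotype"], record["genotypes"])
```

The order of the allele pairs matches the order of the rows in the accompanying .map file, which is why the two files must always be kept together.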

  

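Official descriptions of the three formats

.bed file (a real .bed file is binary and hard to read directly)

bed: Primary representation of genotype calls at biallelic variants. Must be accompanied by .bim and .fam files. Loaded with --bfile; generated in many situations, most notably when the --make-bed command is used. Do not confuse this with the UCSC Genome Browser's BED format, which is totally different. It holds the genotype information. Once decoded it is a matrix with one row per individual and one column per variant, where 0, 1, and 2 correspond to aa, Aa (or aA), and AA. The actual bases are not stored, because here we do not care which of A/T/G/C is involved.

fam: Sample information file accompanying a .bed binary genotype table. It holds sample information; each line is one sample.

bim: Extended variant information file accompanying a .bed binary genotype table. Each line is one variant together with its annotation.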

 

             rs4970383 rs3748592 rs9442373 rs1571150 rs6687029
2431:NA19916         2         0         0         0         1
2424:NA19835         1         0         1         2         0
2469:NA20282         1         0         1         0         1
2368:NA19703         0         0         0         2         0
2425:NA19901         1         0         1         2         2
OR
# xxd -b test.bed
00000000: 01101100 00011011 00000001 11011100 00001111 11100111 l.....
00000006: 00001111 01101011 00000001 .k.
  • First two bytes 01101100 00011011 for PLINK v1.00 BED file
  • Third byte is 00000001 (SNP-major) or 00000000 (individual-major)
  • Genotype data, either in SNP-major or individual-major order
  • New "row" always starts a new byte
  • Each byte encodes up to 4 genotypes
  • 10 indicates missing genotype, otherwise 0 and 1 point to allele 1 or allele 2 in the BIM file, respectively
  • Bits in each byte read in reverse order
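The bullet points above are enough to decode the nine example bytes by hand. The Python sketch below does so for a SNP-major file, assuming 6 samples (the number of rows in the PED example earlier); the function name and the genotype labels are choices made for this illustration. It extracts each 2-bit genotype starting from the low-order bits, so the "10 = missing" of the bullet list (written after reversing the bit order) appears here as the numeric value 0b01.

```python
MAGIC = bytes([0x6C, 0x1B, 0x01])  # PLINK v1 magic number + SNP-major flag

# 2-bit codes as numeric values, taken from the low-order bits upward
CODES = {0b00: "hom_a1", 0b01: "missing", 0b10: "het", 0b11: "hom_a2"}


def decode_bed(data, n_samples):
    """Decode a SNP-major PLINK 1 .bed byte string into per-variant call lists."""
    if data[:3] != MAGIC:
        raise ValueError("not a SNP-major PLINK 1 .bed file")
    bytes_per_variant = (n_samples + 3) // 4  # each byte holds up to 4 genotypes
    body = data[3:]
    if len(body) % bytes_per_variant:
        raise ValueError("file size does not match the number of samples")
    variants = []
    for start in range(0, len(body), bytes_per_variant):
        block = body[start:start + bytes_per_variant]
        calls = [
            CODES[(block[i // 4] >> (2 * (i % 4))) & 0b11]
            for i in range(n_samples)
        ]
        variants.append(calls)
    return variants


example = bytes([0x6C, 0x1B, 0x01, 0xDC, 0x0F, 0xE7, 0x0F, 0x6B, 0x01])
for calls in decode_bed(example, n_samples=6):
    print(calls)
```

The decoded calls line up with the PED example: for snp1, sample 1 is homozygous for allele 1 (G G), sample 3 is missing (0 0), and the remaining samples are homozygous for allele 2 (A A).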

 

.fam file

1 2431 NA19916  0  0  1
2 2424 NA19835  0  0  2
3 2469 NA20282  0  0  2
4 2368 NA19703  0  0  1
5 2425 NA19901  0  0  2
OR
1 1 0 0 1 0
1 2 0 0 1 0
1 3 1 2 1 2
2 1 0 0 1 0
2 2 0 0 1 2
2 3 1 2 1 2

  

.bim file

1  1 rs4970383  0  828418  A
2  1 rs3748592  0  870101  A
3  1 rs9442373  0 1052501  C
4  1 rs1571150  0 1464167  A
5  1 rs6687029  0 1508931  C
OR
1       snp1    0       1       G       A
1       snp2    0       2       1       2
1       snp3    0       3       A       C

 

 

Running PLINK

plink --bfile  --pheno  --pheno-name t16 --linear hide-covar --covar  --covar-name
 AGE,SEX,PC1,PC2,PC3,PC4 --ci 0.95 --out
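--bfile  loads the SNP data from a binary fileset (.bed/.bim/.fam)
--pheno  loads the phenotype file prepared earlier
--pheno-name t16  selects the phenotype column named t16
--linear hide-covar  fits a linear regression model; hide-covar leaves the covariate rows out of the output so that only the SNP results are reported
--covar  --covar-name AGE,SEX,PC1,PC2,PC3,PC4  adds the chosen covariates to the linear regression model; the covariates used here are AGE, SEX, PC1, PC2, PC3, PC4
--ci 0.95  reports 95% confidence intervals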
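To show what a per-SNP linear regression estimates, here is a minimal Python sketch that regresses a quantitative phenotype on allele dosage for a single SNP. It is a simplification and not what the command above does in full: the covariates (AGE, SEX, PCs) are left out, so this is ordinary simple regression, and the function name and toy data are made up.

```python
import math


def snp_linear_regression(dosages, phenotypes):
    """Simple least-squares regression of phenotype on allele dosage (0/1/2).

    Returns (beta, standard_error, t_statistic) for the dosage term.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx                      # change in phenotype per extra allele
    intercept = mean_y - beta * mean_x
    residual_ss = sum(
        (y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotypes)
    )
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return beta, standard_error, beta / standard_error


beta, se, t = snp_linear_regression([0, 1, 2, 0, 1, 2], [1.0, 2.1, 2.9, 1.2, 1.9, 3.1])
print(round(beta, 3), round(se, 3), round(t, 1))
```

With covariates included, the same dosage coefficient is estimated by multiple regression, which is what guards against confounding by age, sex, and ancestry.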

 

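SNP filtering

Filtering with vcftools:
1. Keep sites with MAF >= 0.05
vcftools --vcf test.vcf --maf 0.05 --recode --out XX
2. Keep sites with at least 90% of genotypes called
vcftools --vcf test.vcf --max-missing 0.9 --recode --out XX
3. Keep sites with a mean depth of at least 5
vcftools --vcf test.vcf --min-meanDP 5 --recode --out xx

Note:
For a gzip-compressed VCF, --gzvcf (in place of --vcf) reads the file directly, which is quicker than decompressing it first.
Filtering with PLINK:
1. Convert the VCF to PLINK format
vcftools --vcf test.vcf --plink --out xxx
2. plink --noweb --file xxx --geno 0.05 --maf 0.05 --hwe 0.0001 --make-bed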

  

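Following a tutorial from the official site requires no coding; the teaching materials are under "Resources available for download". It is very accessible and an easy way to get started.

.ped file: pedigree information and genotypes;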

Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file.

The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on.

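The first 6 columns are the same as in the .fam file: family ID, within-family ID, father ID, mother ID, sex, and phenotype.

The remaining columns come in pairs: for example, columns 7 and 8 are the two alleles of the first SNP in the .map file (humans carry two copies of each chromosome; each DNA molecule is double-stranded, but the second strand is ignored because it is determined by base pairing).

.fam file: sample information;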

  1. Family ID ('FID')
  2. Within-family ID ('IID'; cannot be '0')
  3. Within-family ID of father ('0' if father isn't in dataset)
  4. Within-family ID of mother ('0' if mother isn't in dataset)
  5. Sex code ('1' = male, '2' = female, '0' = unknown)
  6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

.map file: variant information;

  1. Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.
  2. Variant identifier
  3. Position in morgans or centimorgans (optional; also safe to use dummy value of '0')
  4. Base-pair coordinate

.bim file: extended variant information;

  1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
  2. Variant identifier
  3. Position in morgans or centimorgans (safe to use dummy value of '0')
  4. Base-pair coordinate (normally 1-based, but 0 ok; limited to 2^31-2)
  5. Allele 1 (corresponding to clear bits in .bed; usually minor)
  6. Allele 2 (corresponding to set bits in .bed; usually major)

MAF, minor allele frequency: SNPs with a minor allele frequency of 0.05 or greater were targeted by the HapMap project.

QC

The SNPs are currently coded according to NCBI build 36 coordinates on the forward strand. 

Data quality control in genetic case-control association studies

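PLINK can QC-filter SNPs on several metrics, such as MAF.

You need to understand what PLINK produces:

1. Converting the text .ped and .map files into the binary .bed, .bim, and .fam files;

2. The association results: each individual is assigned a phenotype, the association test is run, and every SNP gets a measure of association with the phenotype expressed as a p-value, from which a Manhattan plot can be drawn.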

 

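References:

GWAS analysis with PLINK (part 1)

Basic principles of GWAS (an accessible explanation)

QQ plot: assessing whether your statistical model is reasonable (a clear explanation)

How to run principal component analysis (PCA) on genome-wide SNP data with GCTA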

