GWAS (genome-wide association study) | summary statistics | meta-analysis


Several concepts need to be clearly distinguished:

Humans have 23 pairs of chromosomes: 22 pairs of autosomes and one pair of sex chromosomes, XX in females and XY in males.

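Chromosome band nomenclature: designating a specific band requires four items: (1) the chromosome number; (2) the arm symbol; (3) the region number; (4) the band number within that region.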

1p22 denotes chromosome 1, short arm (p), region 2, band 2.
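
A band designation like this can be split into its four parts mechanically. The following is a minimal Python sketch, assuming only simple designations of the form chromosome + arm + region + band, with an optional sub-band after a dot. The function name parse_band and the keys of the returned dictionary are illustrative and not taken from any standard library.

```python
import re

# Simple designations such as "1p22" or "Xq28"; an optional sub-band
# after a dot (e.g. "1p22.1") is also accepted.
BAND_RE = re.compile(
    r"^(?P<chrom>[1-9]|1[0-9]|2[0-2]|X|Y)"  # chromosome number or X/Y
    r"(?P<arm>[pq])"                        # p = short arm, q = long arm
    r"(?P<region>\d)"                       # region number
    r"(?P<band>\d)"                         # band number within the region
    r"(?:\.(?P<sub>\d+))?$"                 # optional sub-band
)


def parse_band(name):
    """Split a band designation into chromosome, arm, region and band."""
    match = BAND_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised band designation: {name!r}")
    return {
        "chromosome": match["chrom"],
        "arm": "short" if match["arm"] == "p" else "long",
        "region": int(match["region"]),
        "band": int(match["band"]),
        "sub_band": match["sub"],  # None when no sub-band is given
    }


print(parse_band("1p22"))
```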

 

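Alleles are really a set: the alternative variants of a gene that occur at the same locus are alleles of one another. "Aa" is not an allele (it is a genotype); the correct statement is that A and a are a pair of alleles. Homozygosity and heterozygosity are defined in terms of alleles.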

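In diploid and polyploid cells, when the homologous chromosomes carry the same allele at a given locus, the cell is called homozygous at that locus. If they carry different alleles at the same locus, it is called heterozygous.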
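
As a small illustration of these definitions, a genotype can be classified by checking whether all of its alleles are identical. This sketch assumes a genotype is written as a string of single-letter alleles (for example "AA" or "Aa"); the function name zygosity is hypothetical.

```python
def zygosity(genotype):
    """Classify a genotype given as a string of single-letter alleles."""
    if not genotype:
        raise ValueError("genotype must contain at least one allele")
    # A and a are different alleles, so the comparison is case-sensitive.
    distinct_alleles = set(genotype)
    return "homozygous" if len(distinct_alleles) == 1 else "heterozygous"


for g in ("AA", "Aa", "aa"):
    print(g, zygosity(g))
```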

 

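Summary statistics, as the name suggests, work like the summary function in R: they are a condensed overview of the GWAS data that contains the most essential information from the results.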

EBI also provides downloadable summary statistics for many GWAS studies: https://www.ebi.ac.uk/gwas/summary-statistics

 

Basic principles of GWAS

How do you run a GWAS?

See the companion post: GWAS | genome-wide association study | Linkage disequilibrium (LD) | Manhattan plot | QQ plot | haplotype phasing

 

Power

Effect size 

Major allele,

Minor allele,

Minor allele frequency (MAF),

Missingness per genotype,

Missingness per individual,

 

Metrics that we look at include

linkage disequilibrium (LD),

variance inflation factor (VIF),

runs of homozygosity (ROH), 

 

These provide a broad 'summary' of the data and allow us to appropriately set thresholds for quality control. It would be wrong, for example, to run a statistical test on a genotype with high missingness because the resulting P value would be misleading and could lead to erroneous conclusions from the data.

PLINK is usually the 'go to' program for analysing GWAS data, but alternatives exist. It is also possible to read PLINK data into R and run your own analyses, although for now few programs do that.

Further information can be found here: http://zzz.bwh.harvard.edu/plink/summary.shtml

 

 

A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis

Clumping: This is a procedure in which only the most significant SNP (i.e., lowest p value) in each LD block is identified and selected for further analyses. This reduces the correlation between the remaining SNPs, while retaining SNPs with the strongest statistical evidence.
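
The clumping idea can be sketched as a greedy loop: take the most significant remaining SNP as the index SNP, drop every SNP in high LD with it, and repeat. This is a simplified illustration and not PLINK's implementation. It assumes pairwise r² values are already available in a dictionary keyed by SNP pairs, ignores physical distance, and uses made-up SNP identifiers and illustrative thresholds.

```python
def clump(pvalues, r2, p_threshold=1e-4, r2_threshold=0.5):
    """Greedy clumping sketch.

    pvalues: dict mapping SNP id -> p value
    r2: dict mapping frozenset({snp1, snp2}) -> r^2 (missing pairs count as 0)
    Returns the index SNPs, most significant first.
    """
    # Only SNPs passing the significance threshold can start or join a clump.
    remaining = sorted(
        (snp for snp in pvalues if pvalues[snp] <= p_threshold),
        key=pvalues.get,
    )
    index_snps = []
    while remaining:
        lead = remaining.pop(0)  # lowest remaining p value
        index_snps.append(lead)
        # Discard SNPs correlated with the lead SNP; they belong to its clump.
        remaining = [
            snp for snp in remaining
            if r2.get(frozenset((lead, snp)), 0.0) < r2_threshold
        ]
    return index_snps


# Hypothetical example data
pvalues = {"rs1": 1e-8, "rs2": 1e-6, "rs3": 1e-5, "rs4": 0.3}
r2 = {frozenset(("rs1", "rs2")): 0.9}
print(clump(pvalues, r2))
```

Here rs2 is absorbed into the clump led by rs1, rs3 forms its own clump, and rs4 never passes the p value threshold.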

Co‐heritability: This is a measure of the genetic relationship between disorders. The SNP‐based co‐heritability is the proportion of covariance between disorder pairs (e.g., schizophrenia and bipolar disorder) that is explained by SNPs.

Gene: This is a sequence of nucleotides in the DNA that codes for a molecule (e.g., a protein)

Heterozygosity: This is the carrying of two different alleles of a specific SNP. The heterozygosity rate of an individual is the proportion of heterozygous genotypes. High levels of heterozygosity within an individual might be an indication of low sample quality whereas low levels of heterozygosity may be due to inbreeding.
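
A minimal sketch of the per-individual heterozygosity rate, assuming genotypes are coded as minor-allele counts (0, 1, 2) with None for a missing call. The coding and the function name are assumptions made for illustration.

```python
def heterozygosity_rate(genotypes):
    """Proportion of heterozygous calls among non-missing genotypes.

    genotypes: one individual's genotypes coded 0/1/2, None = missing.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no non-missing genotypes")
    # A genotype coded 1 carries one copy of each allele, i.e. heterozygous.
    return sum(1 for g in called if g == 1) / len(called)


print(heterozygosity_rate([0, 1, 1, 2, None]))  # 2 of 4 called genotypes
```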

Individual‐level missingness: This is the number of SNPs that are missing for a specific individual. High levels of missingness can be an indication of poor DNA quality or technical problems.

Linkage disequilibrium (LD): This is a measure of non‐random association between alleles at different loci on the same chromosome in a given population. SNPs are in LD when the frequency of association of their alleles is higher than expected under random assortment. LD concerns patterns of correlations between SNPs.
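
Two common ways to quantify LD between a pair of biallelic loci are the coefficient D (observed haplotype frequency minus the frequency expected under independence) and the squared correlation r². The sketch below assumes phased haplotypes coded 0/1 at each locus; the function name and data are illustrative.

```python
def ld_r2(hap_a, hap_b):
    """Return (D, r^2) for two loci given phased 0/1 haplotype codes.

    hap_a[i] and hap_b[i] are the alleles carried by haplotype i at
    locus A and locus B respectively.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype lists must be non-empty and equal length")
    p_a = sum(hap_a) / n                               # frequency of allele 1 at A
    p_b = sum(hap_b) / n                               # frequency of allele 1 at B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # frequency of haplotype 1-1
    d = p_ab - p_a * p_b                               # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d, d * d / denom


print(ld_r2([1, 1, 0, 0], [1, 1, 0, 0]))  # perfectly correlated loci
print(ld_r2([1, 1, 0, 0], [1, 0, 1, 0]))  # independent loci
```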

Minor allele frequency (MAF): This is the frequency of the least often occurring allele at a specific location. Most studies are underpowered to detect associations with SNPs with a low MAF and therefore exclude these SNPs.
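
A short sketch of how MAF can be computed for one SNP, assuming diploid genotypes coded as the count (0, 1, 2) of one designated allele, with None for missing calls. The coding and function name are assumptions for illustration.

```python
def minor_allele_frequency(genotypes):
    """MAF of one SNP from genotypes coded 0/1/2 (None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no non-missing genotypes")
    # Each diploid individual carries two alleles at the locus.
    freq = sum(called) / (2 * len(called))
    # The counted allele may be the major one, so take the smaller frequency.
    return min(freq, 1 - freq)


print(minor_allele_frequency([0, 1, 2, 0, None]))  # 3 of 8 alleles
```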

Population stratification: This is the presence of multiple subpopulations (e.g., individuals with different ethnic background) in a study. Because allele frequencies can differ between subpopulations, population stratification can lead to false positive associations and/or mask true associations. An excellent example of this is the chopstick gene, where a SNP, due to population stratification, accounted for nearly half of the variance in the capacity to eat with chopsticks (Hamer & Sirota, 2000).

Pruning: This is a method to select a subset of markers that are in approximate linkage equilibrium. In PLINK, this method uses the strength of LD between SNPs within a specific window (region) of the chromosome and selects only SNPs that are approximately uncorrelated, based on a user‐specified threshold of LD. In contrast to clumping, pruning does not take the p value of a SNP into account.
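
The pruning idea can be sketched as a greedy pass over SNPs in chromosomal order: a SNP is kept only if its r² with every recently kept SNP is below the threshold. This is a simplified illustration and not PLINK's exact sliding-window algorithm. It assumes genotypes coded 0/1/2 with no missing values, a window counted in SNPs, and illustrative parameter values.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP is treated as uncorrelated here
    return sxy * sxy / (sxx * syy)


def prune(genotypes, window=50, r2_threshold=0.2):
    """Return indices of SNPs kept after greedy LD pruning.

    genotypes: list of per-SNP genotype vectors, in chromosomal order.
    """
    kept = []
    for i, snp in enumerate(genotypes):
        # Only compare against kept SNPs that lie within the window.
        recent = [j for j in kept if i - j < window]
        if all(pearson_r2(snp, genotypes[j]) < r2_threshold for j in recent):
            kept.append(i)
    return kept


snps = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],  # identical to the first SNP, so it is pruned
    [1, 2, 1, 0, 1, 1],  # uncorrelated with the first SNP, so it is kept
]
print(prune(snps))
```

Unlike the clumping procedure, no p values enter this selection; only the correlation structure is used.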

Relatedness: This indicates how strongly a pair of individuals is genetically related. A conventional GWAS assumes that all subjects are unrelated (i.e., no pair of individuals is more closely related than second‐degree relatives). Without appropriate correction, the inclusion of relatives could lead to biased estimations of standard errors of SNP effect sizes. Note that specific tools for analysing family data have been developed.

Sex discrepancy: This is the difference between the assigned sex and the sex determined based on the genotype. A discrepancy likely points to sample mix‐ups in the lab. Note, this test can only be conducted when SNPs on the sex chromosomes (X and Y) have been assessed.

Single nucleotide polymorphism (SNP): This is a variation in a single nucleotide (i.e., A, C, G, or T) that occurs at a specific position in the genome. A SNP usually exists as two different forms (e.g., A vs. T). These different forms are called alleles. A SNP with two alleles has three different genotypes (e.g., AA, AT, and TT).

SNP‐heritability: This is the fraction of phenotypic variance of a trait explained by all SNPs in the analysis.

SNP‐level missingness: This is the number of individuals in the sample for whom information on a specific SNP is missing. SNPs with a high level of missingness can potentially lead to bias.
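
Both kinds of missingness can be computed from the same genotype matrix, one along rows and one along columns. The sketch below assumes a matrix with individuals as rows and SNPs as columns, None marking a missing call, and reports proportions rather than raw counts; the layout and function name are assumptions for illustration.

```python
def missingness(matrix):
    """Return (per-individual, per-SNP) missing-call proportions.

    matrix: list of rows, one row per individual, one column per SNP;
    None marks a missing genotype.
    """
    n_individuals = len(matrix)
    n_snps = len(matrix[0])
    # Individual-level: fraction of SNPs missing in each row.
    per_individual = [sum(g is None for g in row) / n_snps for row in matrix]
    # SNP-level: fraction of individuals missing in each column.
    per_snp = [
        sum(row[j] is None for row in matrix) / n_individuals
        for j in range(n_snps)
    ]
    return per_individual, per_snp


genotypes = [
    [0, 1, None, 2],
    [None, 1, None, 0],
    [2, 2, 1, 0],
]
per_individual, per_snp = missingness(genotypes)
print(per_individual)
print(per_snp)
```

SNPs or individuals whose proportion exceeds a chosen quality-control threshold would then be excluded.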

Summary statistics: These are the results obtained after conducting a GWAS, including information on chromosome number, position of the SNP, SNP(rs)‐identifier, MAF, effect size (odds ratio/beta), standard error, and p value. Summary statistics of GWAS are often freely accessible or shared between researchers.

The Hardy–Weinberg (dis)equilibrium (HWE) law: This concerns the relation between the allele and genotype frequencies. It assumes an infinitely large, randomly mating population, with no selection, mutation, or migration. Under these assumptions, the law states that the genotype and the allele frequencies are constant over generations, and the expected genotype frequencies follow from the allele frequencies (e.g., if the frequency of allele A = 0.20 and the frequency of allele T = 0.80, the expected frequency of genotype AT is 2*0.2*0.8 = 0.32). Violation of the HWE law means that the observed genotype frequencies are significantly different from these expectations. In GWAS, it is generally assumed that deviations from HWE are the result of genotyping errors. The HWE thresholds in cases are often less stringent than those in controls, as the violation of the HWE law in cases can be indicative of true genetic association with disease risk.
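
The comparison of observed and expected genotype counts can be sketched with a chi-square goodness-of-fit test with one degree of freedom. This is a simple approximation chosen for illustration; GWAS software such as PLINK uses an exact test instead. The function name and example counts are assumptions.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of HWE from observed genotype counts.

    Returns (expected counts, chi-square statistic, p value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the first allele
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value


# 100 individuals with allele frequencies 0.2 and 0.8, as in the text:
# the expected heterozygote count is 100 * 0.32 = 32.
print(hwe_chi_square(4, 32, 64))
# Same allele frequencies but no heterozygotes at all: strong deviation.
print(hwe_chi_square(20, 0, 80))
```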


Meta-analysis

Generally, if a sample includes multiple ethnic groups (e.g., Africans, Asians, and Europeans), it is recommended to perform tests of association in each of the ethnic groups separately and to use appropriate methods, such as meta‐analysis (Willer, Li, & Abecasis, 2010), to combine the results.
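
One common way to combine per-group results for a single SNP is an inverse-variance weighted fixed-effects meta-analysis, in which each study's effect estimate is weighted by 1/SE². The sketch below is a minimal illustration of that scheme under the assumption that all studies report the effect for the same allele and on the same scale; the function name and numbers are hypothetical.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effects meta-analysis of one SNP across studies.

    Returns (pooled beta, pooled standard error, z score, two-sided p value).
    """
    if not betas or len(betas) != len(standard_errors):
        raise ValueError("need one standard error per effect estimate")
    weights = [1 / se ** 2 for se in standard_errors]  # precise studies count more
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1 / total_weight)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return beta, se, z, p_value


# Hypothetical effect estimates from two ancestry-specific analyses
print(inverse_variance_meta([0.1, 0.3], [0.1, 0.1]))
```

With equal standard errors the pooled effect is simply the average of the two estimates, and the pooled standard error is smaller than either input.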

METAL: fast and efficient meta‐analysis of genomewide association scans (Willer, Li, & Abecasis, 2010)

 

