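Filtering
SOAPnuke: a tool developed in-house at BGI for filtering FASTQ files.
HTSeq-count: a lightweight program for counting reads. According to its author it works with the output of many mapping tools; I use it to count reads in tophat2 output. In practice, any output that can be converted to SAM format can apparently be counted with htseq-count.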
RSeQC: An RNA-seq Quality Control Package
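Alignment
BWA: the most widely used aligner; it can align both second-generation and third-generation sequencing reads.
Soap: an aligner developed by BGI; full name SOAPaligner/soap2.
bowtie2: commonly used for aligning RNA-seq reads.
BLASR: designed specifically for aligning third-generation reads.
pynast: a multiple sequence alignment tool, mainly used for 16S sequences.
FastTree: a very fast tree-building program that can handle on the order of 1M sequences at once; mainly used for building 16S trees.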
Data processing
SAMtools: built specifically for handling the SAM and BAM formats. SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
Picard: a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
VCFtools: a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project.
bcftools: utilities for variant calling and manipulating VCFs and BCFs.
bedtools: a powerful toolset for genome arithmetic; it allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.
MAKER: an easy-to-use genome annotation pipeline designed for small research groups with little bioinformatics experience.
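To make the "genome arithmetic" mentioned for bedtools a little more concrete, here is a minimal pure-Python sketch of merging and intersecting intervals. It does not call bedtools. The function names `merge_intervals` and `intersect_intervals` are hypothetical, and the intervals are assumed to be 0-based, half-open (start, end) pairs on a single chromosome, as in BED. Treating touching intervals as mergeable is a choice made for this sketch.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open, one chromosome


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or touching intervals into non-overlapping ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both interval sets."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = 0
    shared: List[Interval] = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


if __name__ == "__main__":
    exons = [(5, 10), (1, 3), (2, 6), (20, 25)]  # toy data
    print(merge_intervals(exons))
    print(intersect_intervals(exons, [(8, 22)]))
```

For real data with multiple chromosomes, strands, and file formats, bedtools itself is the better choice; the sketch only shows the idea behind two of the listed operations.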
Resequencing
Reseqtools: A Toolkit for analyzing next-generation DNA Re-Sequencing data. A set of tools put together internally at BGI.
Assembly
BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs
SOAPdenovo
Platanus
DBG2OLC
CANU
Falcon
HGAP
Variant detection
GATK: the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. The most commonly used tool for calling SNPs and indels.
BreakDancer: genome-wide detection of structural variants from next generation paired-end sequencing reads. A structural variation (SV) detection tool.
CREST: (Clipping Reveals Structure), a new algorithm for detecting genomic structural variations at base-pair resolution using next-generation sequencing data.
CNVnator: a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads. Used for CNV detection in human resequencing.
PennCNV: a free software tool for Copy Number Variation (CNV) detection from SNP genotyping arrays.
MIDAS: (Metagenomic Intra-species Diversity Analysis System), an integrated pipeline for estimating strain-level genomic variation from metagenomic data (it can call variants on metagenomes).
GWAS
PLINK: whole genome association analysis toolset
SSR analysis
MISA - MIcroSAtellite identification tool
SSRHunter - Simple Sequence Repeat Search tool
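As a rough idea of what SSR (microsatellite) identification involves, the sketch below scans a DNA string for tandem repeats with a regular expression. It is not how MISA or SSRHunter are implemented. The function name `find_ssrs` is hypothetical, and the thresholds (motifs of 2 to 6 bp repeated at least 4 times) are illustrative assumptions rather than the defaults of either tool.

```python
import re
from typing import List, Tuple


def find_ssrs(seq: str, min_unit: int = 2, max_unit: int = 6,
              min_repeats: int = 4) -> List[Tuple[int, str, int]]:
    """Return (0-based start, repeat unit, repeat count) for each tandem repeat."""
    seq = seq.upper()
    # A lazily matched unit followed by at least (min_repeats - 1) copies of itself.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # Skip homopolymer runs such as AAAA..., which are not 2-6 bp motifs.
            continue
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    print(find_ssrs("GGCATATATATATGG"))  # toy sequence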
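Statistical methods
DMM: (Dirichlet multinomial mixtures), probabilistic modelling of microbial metagenomics data. Input: frequency_matrix.csv, in which each row is a taxon and each column gives that taxon's frequency in one sample. Output: the community analysis results. The mixture components cluster communities into distinct 'metacommunities', and, hence, determine envirotypes or enterotypes, groups of communities with a similar composition. In practice the method plays a role similar to running a PCA on the communities and grouping the similar ones together, although it does this by model-based clustering rather than by ordination.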
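The building block of such a mixture is the Dirichlet-multinomial likelihood of one sample's taxon counts. The sketch below is not the DMM software itself: it only evaluates that likelihood and makes a hard assignment of a sample to the most probable component. The names `dm_log_likelihood` and `assign_component` are hypothetical, the component parameters are made-up toy values, and the input is assumed to be integer counts (one column of the matrix) rather than normalized frequencies.

```python
from math import exp, lgamma, log
from typing import List, Sequence


def dm_log_likelihood(counts: Sequence[int], alpha: Sequence[float]) -> float:
    """Log-probability of a count vector under a Dirichlet-multinomial."""
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: n! / prod(x_i!)
    ll = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Ratio of Dirichlet normalizers after integrating out the proportions.
    ll += lgamma(a) - lgamma(n + a)
    ll += sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return ll


def assign_component(counts: Sequence[int],
                     components: List[Sequence[float]],
                     weights: Sequence[float]) -> int:
    """Index of the mixture component with the highest posterior score."""
    scores = [log(w) + dm_log_likelihood(counts, alpha)
              for w, alpha in zip(weights, components)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy example: two taxa, two hypothetical metacommunities.
    components = [[10.0, 1.0], [1.0, 10.0]]
    weights = [0.5, 0.5]
    print(assign_component([9, 1], components, weights))
    print(exp(dm_log_likelihood([2, 3], [1.0, 1.0])))
```

Fitting the component parameters and choosing the number of components is the hard part and is what the actual method does; this sketch assumes they are already known.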
RNA
cd-hit: a very widely used program for clustering and comparing protein or nucleotide sequences. Used to remove redundancy.
CPAT: using logistic regression model based on 4 pure sequence-based, linguistic features. Predicts the coding potential of RNA.
GMAP: A Genomic Mapping and Alignment Program for mRNA and EST Sequences. Dedicated to RNA alignment.
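For the redundancy removal that cd-hit is used for above, the general idea of greedy incremental clustering can be sketched as follows: sequences are processed from longest to shortest, and each one either joins the first existing representative it is similar enough to or becomes a new representative. This is only a toy, not cd-hit's implementation. The names `toy_identity` and `greedy_cluster` are hypothetical, and the identity measure (ungapped position-by-position comparison from the first base) is a crude stand-in for a real alignment.

```python
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (no gaps)."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0.0
    matches = sum(x == y for x, y in zip(shorter, longer))
    return matches / len(shorter)


def greedy_cluster(seqs: List[str], threshold: float = 0.9) -> Dict[str, List[str]]:
    """Map each representative sequence to the members of its cluster."""
    clusters: Dict[str, List[str]] = {}
    # Longest sequences first, so representatives are the longest members.
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if toy_identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


if __name__ == "__main__":
    toy = ["ACGTACGTA", "TTTTTTTT", "ACGTACGTAC"]  # toy data
    for rep, members in greedy_cluster(toy).items():
        print(rep, members)
```

The keys of the returned dictionary are the non-redundant set; everything else in each list is what a redundancy-removal step would discard.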
More to be added~