有時候需要個性化處理原始序列,自己寫python腳本太慢,且速度太慢,可以用seqkit這個工具,開發得不錯。
2021年12月02日
另一個需求:需要從Cell Ranger處理后的bam文件里提取出未比對的reads,然后再去比對到YFP序列上,確定每個細胞是否被YFP標記了。
參考1:Cell Ranger的bam文件的解讀 Barcoded BAM
The cellranger pipeline outputs an indexed BAM file containing position-sorted reads aligned to the genome and transcriptome, as well as unaligned reads. Each read in this BAM file has Chromium cellular and molecular barcode information attached.
參考2:如何提取出未比對上的reads How do I identify the unmapped reads in my Cell Ranger or Long Ranger output?
samtools view -f 4 phased_possorted_bam.bam | cut -f1 > unmapped_reads.txt
總結:
提取未比對序列的命令
samtools view -b -f 4 7Ala-D60-BO-2.bam > 7Ala-D60-BO-2.unmapped.bam samtools view -b -f 4 UE-D60-BO-2.bam > UE-D60-BO-2.unmapped.bam samtools view -b -f 4 UE-D60-BO-3.bam > UE-D60-BO-3.unmapped.bam samtools view -b -f 4 7Ala-D60-BO-3.bam > 7Ala-D60-BO-3.unmapped.bam
samtools merge unmapped.bam 7Ala-D60-BO-2.unmapped.bam UE-D60-BO-2.unmapped.bam UE-D60-BO-3.unmapped.bam 7Ala-D60-BO-3.unmapped.bam
其中CB是校正后的細胞barcode,沒有的話也可以用CR代替。
之后的環節
- bam轉fastq,然后比對到YFP基因組
- 提取出比對reads對應的cell barcode
bam轉fastq bedtools bamtofastq
bedtools bamtofastq -i unmapped.bam -fq unmapped.fq
通過blast搜索所有的YFP相關序列,然后用STAR構建索引。
source activate /home/lizhixin/softwares/anaconda3/envs/splicing mkdir GFP_index STAR --runThreadN 6 --runMode genomeGenerate --genomeDir GFP_index --genomeFastaFiles GFP.fasta
比對
STAR --runThreadN 10 \ --genomeDir YFP_index \ --outSAMtype BAM SortedByCoordinate \ --outFileNamePrefix unmap/PHOX2B \ --readFilesIn /home/lizhixin/project/scRNA-seq/rawData/10x/7Ala/unmapped.fq
提取出比對上的序列
提取對應的cell barcode
比如提取10x genomics的barcode,fastq里的前16個鹼基【搞錯了,沒這么簡單】。
seqkit subseq Vcl-YFP-CNCC_3_S35_L004_R2_001.fastq.gz -r 1:16 > tmp.fastq
所有需要的信息都在這個bam文件里面,可以進行二次分析
possorted_genome_bam.bam BAM file containing both unaligned reads and reads aligned to the genome and transcriptome annotated with barcode information
統計fasta的每條序列的長度
目的:通過統計人基因組里轉錄本的長度分布,來評估不同的基因治療的載體的局限。
# ~/databases/ensembl.release-90/homo_sapiens/RSEM/bowtie2/ cat GRCh38.transcripts.fa | seqkit fx2tab -l | cut -f1,4 > transcript.len.txt
參考:
Single-Library Analysis with cellranger count
