有时候需要个性化处理原始序列,自己写python脚本太慢,且速度太慢,可以用seqkit这个工具,开发得不错。
2021年12月02日
另一个需求:需要从Cell Ranger处理后的bam文件里提取出未比对的reads,然后再去比对到YFP序列上,确定每个细胞是否被YFP标记了。
参考1:Cell Ranger的bam文件的解读 Barcoded BAM
The cellranger pipeline outputs an indexed BAM file containing position-sorted reads aligned to the genome and transcriptome, as well as unaligned reads. Each read in this BAM file has Chromium cellular and molecular barcode information attached.
参考2:如何提取出未比对上的reads How do I identify the unmapped reads in my Cell Ranger or Long Ranger output?
samtools view -f 4 phased_possorted_bam.bam | cut -f1 > unmapped_reads.txt
总结:
提取未比对序列的命令
samtools view -b -f 4 7Ala-D60-BO-2.bam > 7Ala-D60-BO-2.unmapped.bam samtools view -b -f 4 UE-D60-BO-2.bam > UE-D60-BO-2.unmapped.bam samtools view -b -f 4 UE-D60-BO-3.bam > UE-D60-BO-3.unmapped.bam samtools view -b -f 4 7Ala-D60-BO-3.bam > 7Ala-D60-BO-3.unmapped.bam
samtools merge unmapped.bam 7Ala-D60-BO-2.unmapped.bam UE-D60-BO-2.unmapped.bam UE-D60-BO-3.unmapped.bam 7Ala-D60-BO-3.unmapped.bam
其中CB是校正后的细胞barcode,没有的话也可以用CR代替。
之后的环节
- bam转fastq,然后比对到YFP基因组
- 提取出比对reads对应的cell barcode
bam转fastq bedtools bamtofastq
bedtools bamtofastq -i unmapped.bam -fq unmapped.fq
通过blast搜索所有的YFP相关序列,然后用STAR构建索引。
source activate /home/lizhixin/softwares/anaconda3/envs/splicing mkdir GFP_index STAR --runThreadN 6 --runMode genomeGenerate --genomeDir GFP_index --genomeFastaFiles GFP.fasta
比对
STAR --runThreadN 10 \ --genomeDir YFP_index \ --outSAMtype BAM SortedByCoordinate \ --outFileNamePrefix unmap/PHOX2B \ --readFilesIn /home/lizhixin/project/scRNA-seq/rawData/10x/7Ala/unmapped.fq
提取出比对上的序列
提取对应的cell barcode
比如提取10x genomics的barcode,fastq里的前16个碱基【搞错了,没这么简单】。
seqkit subseq Vcl-YFP-CNCC_3_S35_L004_R2_001.fastq.gz -r 1:16 > tmp.fastq
所有需要的信息都在这个bam文件里面,可以进行二次分析
possorted_genome_bam.bam BAM file containing both unaligned reads and reads aligned to the genome and transcriptome annotated with barcode information
统计fasta的每条序列的长度
目的:通过统计人基因组里转录本的长度分布,来评估不同的基因治疗的载体的局限。
# ~/databases/ensembl.release-90/homo_sapiens/RSEM/bowtie2/ cat GRCh38.transcripts.fa | seqkit fx2tab -l | cut -f1,4 > transcript.len.txt
参考:
Single-Library Analysis with cellranger count