seqkit | 序列处理利器 | fastq | fasta | barcode提取 | Cell Ranger

seqkit | 序列处理利器 | fastq | fasta | barcode提取 | Cell Ranger | 荧光标记比对

本文转载自查看原文 2021-10-18 18:14 126 工具

有时候需要个性化处理原始序列，自己写python脚本太慢，且速度太慢，可以用seqkit这个工具，开发得不错。

2021年12月02日

另一个需求：需要从Cell Ranger处理后的bam文件里提取出未比对的reads，然后再去比对到YFP序列上，确定每个细胞是否被YFP标记了。

参考1：Cell Ranger的bam文件的解读 Barcoded BAM

The cellranger pipeline outputs an indexed BAM file containing position-sorted reads aligned to the genome and transcriptome, as well as unaligned reads. Each read in this BAM file has Chromium cellular and molecular barcode information attached.

参考2：如何提取出未比对上的reads How do I identify the unmapped reads in my Cell Ranger or Long Ranger output?

samtools view -f 4 phased_possorted_bam.bam | cut -f1 > unmapped_reads.txt

总结：

提取未比对序列的命令

samtools view -b -f 4 7Ala-D60-BO-2.bam > 7Ala-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-2.bam > UE-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-3.bam > UE-D60-BO-3.unmapped.bam
samtools view -b -f 4 7Ala-D60-BO-3.bam > 7Ala-D60-BO-3.unmapped.bam

samtools merge unmapped.bam 7Ala-D60-BO-2.unmapped.bam UE-D60-BO-2.unmapped.bam UE-D60-BO-3.unmapped.bam 7Ala-D60-BO-3.unmapped.bam

其中CB是校正后的细胞barcode，没有的话也可以用CR代替。

之后的环节

bam转fastq，然后比对到YFP基因组
提取出比对reads对应的cell barcode

bam转fastq bedtools bamtofastq

bedtools bamtofastq -i unmapped.bam -fq unmapped.fq

通过blast搜索所有的YFP相关序列，然后用STAR构建索引。

source activate /home/lizhixin/softwares/anaconda3/envs/splicing

mkdir GFP_index
STAR --runThreadN 6 --runMode genomeGenerate --genomeDir GFP_index --genomeFastaFiles GFP.fasta

比对

STAR --runThreadN 10 \
--genomeDir YFP_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix unmap/PHOX2B \
--readFilesIn /home/lizhixin/project/scRNA-seq/rawData/10x/7Ala/unmapped.fq

提取出比对上的序列

提取对应的cell barcode

比如提取10x genomics的barcode，fastq里的前16个碱基【搞错了，没这么简单】。

seqkit subseq Vcl-YFP-CNCC_3_S35_L004_R2_001.fastq.gz -r 1:16 > tmp.fastq

所有需要的信息都在这个bam文件里面，可以进行二次分析

possorted_genome_bam.bam BAM file containing both unaligned reads and reads aligned to the genome and transcriptome annotated with barcode information　　

统计fasta的每条序列的长度

目的：通过统计人基因组里转录本的长度分布，来评估不同的基因治疗的载体的局限。

# ~/databases/ensembl.release-90/homo_sapiens/RSEM/bowtie2/
cat GRCh38.transcripts.fa | seqkit  fx2tab -l | cut -f1,4 > transcript.len.txt

参考：

fasta/fq文件处理万能工具——Seqkit学习记录

Single-Library Analysis with cellranger count

免责声明！

本站转载的文章为个人学习借鉴使用，本站对版权不负任何法律责任。如果侵犯了您的隐私权益，请联系本站邮箱yoyou2525@163.com删除。

猜您在找 FASTA/Q序列处理神器---seqkit seqkit一个FASTA/Q序列处理神器 fasta序列操作神器——seqkit biopython文件处理：fastq文件转换为fasta文件 fasta/fastq格式解读 python文本处理---fasta文件提取指定ID的序列 samtools faidx 命令处理fasta序列从fasta中提取或者过滤掉多个序列 FASTQ 文件格式转换为 FASTA 格式 fasta与fastq格式文件解读