seqkit | 序列處理利器 | fastq | fasta | barcode提取 | Cell Ranger

seqkit | 序列處理利器 | fastq | fasta | barcode提取 | Cell Ranger | 熒光標記比對

本文轉載自查看原文 2021-10-18 18:14 126 工具

有時候需要個性化處理原始序列，自己寫python腳本太慢，且速度太慢，可以用seqkit這個工具，開發得不錯。

2021年12月02日

另一個需求：需要從Cell Ranger處理后的bam文件里提取出未比對的reads，然后再去比對到YFP序列上，確定每個細胞是否被YFP標記了。

參考1：Cell Ranger的bam文件的解讀 Barcoded BAM

The cellranger pipeline outputs an indexed BAM file containing position-sorted reads aligned to the genome and transcriptome, as well as unaligned reads. Each read in this BAM file has Chromium cellular and molecular barcode information attached.

參考2：如何提取出未比對上的reads How do I identify the unmapped reads in my Cell Ranger or Long Ranger output?

samtools view -f 4 phased_possorted_bam.bam | cut -f1 > unmapped_reads.txt

總結：

提取未比對序列的命令

samtools view -b -f 4 7Ala-D60-BO-2.bam > 7Ala-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-2.bam > UE-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-3.bam > UE-D60-BO-3.unmapped.bam
samtools view -b -f 4 7Ala-D60-BO-3.bam > 7Ala-D60-BO-3.unmapped.bam

samtools merge unmapped.bam 7Ala-D60-BO-2.unmapped.bam UE-D60-BO-2.unmapped.bam UE-D60-BO-3.unmapped.bam 7Ala-D60-BO-3.unmapped.bam

其中CB是校正后的細胞barcode，沒有的話也可以用CR代替。

之后的環節

bam轉fastq，然后比對到YFP基因組
提取出比對reads對應的cell barcode

bam轉fastq bedtools bamtofastq

bedtools bamtofastq -i unmapped.bam -fq unmapped.fq

通過blast搜索所有的YFP相關序列，然后用STAR構建索引。

source activate /home/lizhixin/softwares/anaconda3/envs/splicing

mkdir GFP_index
STAR --runThreadN 6 --runMode genomeGenerate --genomeDir GFP_index --genomeFastaFiles GFP.fasta

比對

STAR --runThreadN 10 \
--genomeDir YFP_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix unmap/PHOX2B \
--readFilesIn /home/lizhixin/project/scRNA-seq/rawData/10x/7Ala/unmapped.fq

提取出比對上的序列

提取對應的cell barcode

比如提取10x genomics的barcode，fastq里的前16個鹼基【搞錯了，沒這么簡單】。

seqkit subseq Vcl-YFP-CNCC_3_S35_L004_R2_001.fastq.gz -r 1:16 > tmp.fastq

所有需要的信息都在這個bam文件里面，可以進行二次分析

possorted_genome_bam.bam BAM file containing both unaligned reads and reads aligned to the genome and transcriptome annotated with barcode information　　

統計fasta的每條序列的長度

目的：通過統計人基因組里轉錄本的長度分布，來評估不同的基因治療的載體的局限。

# ~/databases/ensembl.release-90/homo_sapiens/RSEM/bowtie2/
cat GRCh38.transcripts.fa | seqkit  fx2tab -l | cut -f1,4 > transcript.len.txt

參考：

fasta/fq文件處理萬能工具——Seqkit學習記錄

Single-Library Analysis with cellranger count

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 FASTA/Q序列處理神器---seqkit seqkit一個FASTA/Q序列處理神器 fasta序列操作神器——seqkit biopython文件處理：fastq文件轉換為fasta文件 fasta/fastq格式解讀 python文本處理---fasta文件提取指定ID的序列 samtools faidx 命令處理fasta序列從fasta中提取或者過濾掉多個序列 FASTQ 文件格式轉換為 FASTA 格式 fasta與fastq格式文件解讀