Index
1. Compute the length of fasta/fa and fastq files: for fastq, the number of reads, single-read length, and total read length; for fasta, the number of contigs, their names, individual lengths, and the total length.
2.1 Local assembly: create directories, cut the aligned reads into 100 kb windows with samtools, extract the read names with awk, and store each window's output in its own folder.
2.2 Local assembly: use the read names obtained above to pull the read sequences out of the original raw reads.
2.3 Local assembly: assemble each window with canu/SOAPdenovo.
Basic patterns:
1. Using an argument-parsing module
use Getopt::Long;

my ($first, $second, $third, $fourth);
GetOptions(
    "first=s"  => \$first,
    "second=s" => \$second,
    "third=s"  => \$third,
    "fourth=s" => \$fourth,
) or die $!;
print "$first-$second-$third-$fourth\n";
The module name is Getopt::Long and must be spelled exactly. You call its GetOptions function, which takes hash-like pairs that map --first to the $first variable, and so on for the rest.
This module is fairly involved; for details see: "Using the Getopt::Long module in Perl to receive command-line arguments".
2. Reading a tabular file: split each line, pull out the target fields, and build a data structure
open IN, "<", "test.txt" or die $!;
my %hash;
while (<IN>) {
    chomp;
    my @list = split / /;
    next if $list[0] =~ /\bchar1\b|char4|char6/;   # skip unwanted rows
    my $name  = $list[2];
    my $value = $list[1];
    $hash{$name} = $value;
}
close(IN);
# print the finished hash once, after the whole file has been read
foreach (keys %hash) {
    print "$_ => $hash{$_}\n";
}
3. Reading the raw reads into a hash, with the read name as the key and its base sequence as the value
#Store the original fasta file into a hash named %hash_reads.
my %hash_reads;
my $name;
open IN, "<", $original_fasta or die $!;
while (<IN>) {
    chomp;
    if (/^>(.*)$/) {
        $name = $1;
        $hash_reads{$name} = "";
    } else {
        $hash_reads{$name} .= $_;   # append wrapped sequence lines
    }
}
close IN;
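With the reads stored this way, step 2.2 (pulling sequences back out by name) reduces to a hash lookup. A minimal sketch, assuming %hash_reads was built as above and a hypothetical $read_name_file holding one read name per line:

```perl
# step 2.2 sketch: extract sequences for a list of read names from %hash_reads
# ($read_name_file and extracted.fasta are hypothetical names)
open NAMES, "<", $read_name_file or die $!;
open OUT, ">", "extracted.fasta" or die $!;
while (my $rname = <NAMES>) {
    chomp $rname;
    next unless exists $hash_reads{$rname};   # ignore names absent from the raw reads
    print OUT ">$rname\n$hash_reads{$rname}\n";
}
close NAMES;
close OUT;
```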
1. Computing the length of fasta/fa and fastq files: for fastq, the number of reads, single-read length, and total read length (the total length is the real job; the rest is trivial to get in Linux); for fasta, the number of contigs, their names, individual lengths, and the total length.
Approach: this is a classic read-line-by-line, grab-a-field, accumulate-a-length problem.
fastq is easy: records come in fixed four-line blocks, so many approaches work: read line by line and match each line, or read several lines per record. The output is small too: single-read length, read count, and total read length; there is no other information worth reporting.
fasta is slightly trickier: there is no fixed line count, because sequences are wrapped into short lines, so you must read and match line by line. Line-by-line reading has one catch: how do you detect the end? (This is the biggest weakness of line-by-line reading: you cannot act on the final record inside the loop!) The fix is to flush the last record after the loop, while every other record is emitted at the moment its next title line is matched.
The code is as follows:
#!/usr/bin/perl
#Author: zxlee
#Function: compute the length of a fastq or fasta/fa file.
#Usage: `perl script_name fastq/fasta/fa_file_name`; it prints the total length and writes a detail file.
use strict;
use warnings;

my $infile = shift;   # shift grabs the 1st command-line argument; call it again for the 2nd
open IN, "<$infile" or die $!;
open OUT, ">./result_len.txt" or die $!;

our $total_len = 0;
our $seq_num = 0;
our $len;

if ($infile =~ /fastq$/) {
    # a fastq record is exactly four lines: @header, sequence, '+', quality
    while (<IN>) {           # while(<IN>) is short for while($_ = <IN>)
        my $seq = <IN>;      # a bare <IN> does NOT set $_, so assign it explicitly
        <IN>;                # discard the '+' line
        <IN>;                # discard the quality line (it can itself start with '@')
        chomp($seq);
        $seq_num += 1;
        $total_len += length($seq);
        print OUT "\nreads_len = $total_len\n" if $seq_num == 1;   # after one read this equals the single-read length
    }
    print OUT "Total num of reads is $seq_num\n";
} elsif ($infile =~ /(fasta|fa)$/) {   # alternation: the $ must stay outside the group
    my $chr_len = 0;
    while (<IN>) {
        chomp;
        my $line = $_;
        if ($line =~ /^>(\S+)/) {
            print OUT "$chr_len\n" if $chr_len != 0;   # flush the previous record
            print OUT "$1\t";
            $chr_len = 0;
        } else {
            $len = length($line) if $total_len == 0;   # width of one wrapped line
            $chr_len += length($line);
            $total_len += length($line);
        }
    }
    print OUT "$chr_len\n";   # the last record never meets a next title line, so flush it here
    print OUT "one line has $len\n";
}
print "The total length is $total_len\n";
close(IN);
close(OUT);
Takeaway: slurping the whole file into an array first greatly reduces the bookkeeping and is much easier to reason about, but it is very memory-hungry; let the data size decide which approach to use.
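The slurp-into-an-array variant mentioned here, sketched for the fasta case (simple, but memory scales with file size):

```perl
my $infile = shift;
open IN, "<", $infile or die $!;
my @lines = <IN>;        # list context: one array element per line
close IN;
chomp @lines;

my $total_len = 0;
foreach my $line (@lines) {
    next if $line =~ /^>/;       # skip title lines
    $total_len += length($line);
}
print "The total length is $total_len\n";
```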
The parts of this script worth learning:
1. shift: pops command-line arguments one at a time, so the script stays self-contained and you never hard-code a file path inside it.
2. my, our, and local: in practice you almost always use my; a my variable is visible in its own scope and every scope nested inside it, so a my at the outermost level is effectively equivalent to our.
3. When $_ is set: while(<IN>) is equivalent to while($_ = <IN>), but a bare <IN> elsewhere does not set $_. This is one of a few special-cased idioms; stick to them rather than inventing your own syntax.
4. Alternation in a match: in /(fasta|fa)$/, the $ must stay outside the group. Written inside, as /(fasta|fa$)/, it would anchor only the last alternative, so the fasta branch could match anywhere in the name.
5. Plan the algorithm first: sketch the overall flow before you write a line of code; without that framework to lean on, writing becomes torture.
2.1 Local assembly: create directories, cut the aligned reads into 100 kb windows with samtools, extract the read names with awk, and store each window's output in its own folder
#!/usr/bin/perl
#Author: zxlee
#Function: from a filtered, sorted bwa bam against the ref, split reads into 100 kb windows for local assembly.
#Usage: perl script_name --size=100000 --bam=your_sorted.bam --jobs=200 --outpath=/split_reads_name --sample=DHX_Y255
use strict;
use warnings;
use Getopt::Long;   # note the capital L: "Getopt::long" will not load

# get the script parameters
my ($size, $bam, $outpath, $sample_name, $jobs);
GetOptions(
    "size=s"    => \$size,
    "bam=s"     => \$bam,
    "jobs=s"    => \$jobs,
    "sample=s"  => \$sample_name,
    "outpath=s" => \$outpath,
) or die $!;   # $! (not $1) holds the error string

# my hash, chr => chr_size
my %hash = (
    'Chr1'  => 43270923, 'Chr2'  => 35937250, 'Chr3'  => 36413819,
    'Chr4'  => 35502694, 'Chr5'  => 29958434, 'Chr6'  => 31248787,
    'Chr7'  => 29697621, 'Chr8'  => 28443022, 'Chr9'  => 23012720,
    'Chr10' => 23207287, 'Chr11' => 29021106, 'Chr12' => 27531856,
);

# iterate over each chr: mkdir, create pbs, submit pbs
foreach my $chr (keys %hash) {
    my $split_bam_dir        = $outpath . "/${chr}_split_bam/";   # absolute dirs
    my $split_pbs_dir        = $outpath . "/${chr}_split_pbs/";
    my $split_log_dir        = $outpath . "/${chr}_split_log/";
    my $split_read_names_dir = $outpath . "/${chr}_split_read_names/";
    system("mkdir $split_bam_dir")        if not -e $split_bam_dir;   # create dirs
    system("mkdir $split_pbs_dir")        if not -e $split_pbs_dir;
    system("mkdir $split_log_dir")        if not -e $split_log_dir;
    system("mkdir $split_read_names_dir") if not -e $split_read_names_dir;

    # create one pbs job per window
    for (my $start = 1; $start < $hash{$chr}; $start += $size) {
        my $end = $start + $size;
        my $startHK = int($start / $size);   # window index in 100 kb units, for convenient names
        my $endHK   = int($end / $size);
        my $bam_name  = "$split_bam_dir/${sample_name}_${chr}_${startHK}hk_${endHK}hk.bam";
        my $read_name = "$split_read_names_dir/${sample_name}_${chr}_${startHK}hk_${endHK}hk_reads_name.txt";
        my $pbs_name  = "$split_pbs_dir/${sample_name}_${chr}_${startHK}hk_${endHK}hk.pbs";
        open PBS, ">$pbs_name" or die $!;
        print PBS <<SET;
#PBS -N split_${startHK}_${endHK}
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -q low
#PBS -o $split_log_dir
hostname
date
cd \$PBS_O_WORKDIR
export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:/public/software/htslib-1.3/lib
/public/software/samtools-1.3/bin/samtools view -b $bam $chr:$start-$end -o $bam_name
/public/software/samtools-1.3/bin/samtools view $bam_name | awk '{arr[\$1]} END{for(key in arr) {print key} }' > $read_name
date
SET
        close PBS;

        # throttle: wait until fewer than $jobs of our split jobs are queued/running
        while (1) {
            my @run_jobs = `qstat -u zxli | grep "split"`;
            if (@run_jobs < $jobs) {
                last;
            } else {
                sleep(1);
            }
        }
        system("qsub $pbs_name");
        last;   # for test
    }
    last;   # for test
}
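The window bookkeeping in the loop above can be sketched on its own. A minimal demo for a hypothetical 250 kb chromosome cut into 100 kb windows (here with inclusive, non-overlapping windows, i.e. end = start + size - 1 clipped to the chromosome; the Perl loop above uses start + size, which overlaps neighbouring windows by one base):

```shell
size=100000
chr_len=250000
start=1
while [ "$start" -le "$chr_len" ]; do
    end=$((start + size - 1))
    if [ "$end" -gt "$chr_len" ]; then end=$chr_len; fi   # clip the last window
    echo "window: $start-$end"
    start=$((start + size))
done
# prints:
# window: 1-100000
# window: 100001-200000
# window: 200001-250000
```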