Index
1. Compute the length of fasta/fa and fastq files: for fastq, the number of reads, single-read length, and total read length; for fasta, the number of contigs, their names, individual lengths, and the total length.
2.1 Local assembly: create directories, cut the aligned reads into 100 kb windows with samtools, extract the read names with awk, and store each window's output in its own folder.
2.2 Local assembly: use the read names obtained above to pull the read sequences out of the original raw reads.
2.3 Local assembly: assemble each window with canu/SOAPdenovo.
Basic patterns:
1. Using an argument-parsing module
use Getopt::Long;

my ($first, $second, $third, $fourth);
GetOptions(
    "first=s"  => \$first,
    "second=s" => \$second,
    "third=s"  => \$third,
    "fourth=s" => \$fourth,
) or die $!;
print "$first-$second-$third-$fourth\n";
The module name is Getopt::Long and must be spelled exactly. You call its GetOptions function, which takes hash-like pairs that map --first to the $first variable, and so on for the rest.
This module is fairly involved; for details see: "Using the Getopt::Long module in Perl to receive command-line arguments".
2. Reading a tabular file: split each line, pull out the target fields, and build a data structure
open IN, "<", "test.txt" or die $!;
my %hash;
while (<IN>) {
    chomp;
    my @list = split / /;
    next if $list[0] =~ /\bchar1\b|char4|char6/;   # skip unwanted rows
    my $name  = $list[2];
    my $value = $list[1];
    $hash{$name} = $value;
}
close(IN);
# print the finished hash once, after the whole file has been read
foreach (keys %hash) {
    print "$_ => $hash{$_}\n";
}
3. Reading the raw reads into a hash, with the read name as the key and its base sequence as the value
#Store the original fasta file into a hash named %hash_reads.
my %hash_reads;
my $name;
open IN, "<", $original_fasta or die $!;
while (<IN>) {
    chomp;
    if (/^>(.*)$/) {
        $name = $1;
        $hash_reads{$name} = "";
    } else {
        $hash_reads{$name} .= $_;   # append wrapped sequence lines
    }
}
close IN;
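With the reads stored this way, step 2.2 (pulling sequences back out by name) reduces to a hash lookup. A minimal sketch, assuming %hash_reads was built as above and a hypothetical $read_name_file holding one read name per line:

```perl
# step 2.2 sketch: extract sequences for a list of read names from %hash_reads
# ($read_name_file and extracted.fasta are hypothetical names)
open NAMES, "<", $read_name_file or die $!;
open OUT, ">", "extracted.fasta" or die $!;
while (my $rname = <NAMES>) {
    chomp $rname;
    next unless exists $hash_reads{$rname};   # ignore names absent from the raw reads
    print OUT ">$rname\n$hash_reads{$rname}\n";
}
close NAMES;
close OUT;
```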
1. Computing the length of fasta/fa and fastq files: for fastq, the number of reads, single-read length, and total read length (the total length is the real job; the rest is trivial to get in Linux); for fasta, the number of contigs, their names, individual lengths, and the total length.
Approach: this is a classic read-line-by-line, grab-a-field, accumulate-a-length problem.
fastq is easy: records come in fixed four-line blocks, so many approaches work: read line by line and match each line, or read several lines per record. The output is small too: single-read length, read count, and total read length; there is no other information worth reporting.
fasta is slightly trickier: there is no fixed line count, because sequences are wrapped into short lines, so you must read and match line by line. Line-by-line reading has one catch: how do you detect the end? (This is the biggest weakness of line-by-line reading: you cannot act on the final record inside the loop!) The fix is to flush the last record after the loop, while every other record is emitted at the moment its next title line is matched.
The code is as follows:
#!/usr/bin/perl
#Author: zxlee
#Function: compute the length of a fastq or fasta/fa file.
#Usage: `perl script_name fastq/fasta/fa_file_name`; it prints the total length and writes a detail file.
use strict;
use warnings;

my $infile = shift;   # shift grabs the 1st command-line argument; call it again for the 2nd
open IN, "<$infile" or die $!;
open OUT, ">./result_len.txt" or die $!;

our $total_len = 0;
our $seq_num = 0;
our $len;

if ($infile =~ /fastq$/) {
    # a fastq record is exactly four lines: @header, sequence, '+', quality
    while (<IN>) {           # while(<IN>) is short for while($_ = <IN>)
        my $seq = <IN>;      # a bare <IN> does NOT set $_, so assign it explicitly
        <IN>;                # discard the '+' line
        <IN>;                # discard the quality line (it can itself start with '@')
        chomp($seq);
        $seq_num += 1;
        $total_len += length($seq);
        print OUT "\nreads_len = $total_len\n" if $seq_num == 1;   # after one read this equals the single-read length
    }
    print OUT "Total num of reads is $seq_num\n";
} elsif ($infile =~ /(fasta|fa)$/) {   # alternation: the $ must stay outside the group
    my $chr_len = 0;
    while (<IN>) {
        chomp;
        my $line = $_;
        if ($line =~ /^>(\S+)/) {
            print OUT "$chr_len\n" if $chr_len != 0;   # flush the previous record
            print OUT "$1\t";
            $chr_len = 0;
        } else {
            $len = length($line) if $total_len == 0;   # width of one wrapped line
            $chr_len += length($line);
            $total_len += length($line);
        }
    }
    print OUT "$chr_len\n";   # the last record never meets a next title line, so flush it here
    print OUT "one line has $len\n";
}
print "The total length is $total_len\n";
close(IN);
close(OUT);
Takeaway: slurping the whole file into an array first greatly reduces the bookkeeping and is much easier to reason about, but it is very memory-hungry; let the data size decide which approach to use.
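The slurp-into-an-array variant mentioned here, sketched for the fasta case (simple, but memory scales with file size):

```perl
my $infile = shift;
open IN, "<", $infile or die $!;
my @lines = <IN>;        # list context: one array element per line
close IN;
chomp @lines;

my $total_len = 0;
foreach my $line (@lines) {
    next if $line =~ /^>/;       # skip title lines
    $total_len += length($line);
}
print "The total length is $total_len\n";
```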
The parts of this script worth learning:
1. shift: pops command-line arguments one at a time, so the script stays self-contained and you never hard-code a file path inside it.
2. my, our, and local: in practice you almost always use my; a my variable is visible in its own scope and every scope nested inside it, so a my at the outermost level is effectively equivalent to our.
3. When $_ is set: while(<IN>) is equivalent to while($_ = <IN>), but a bare <IN> elsewhere does not set $_. This is one of a few special-cased idioms; stick to them rather than inventing your own syntax.
4. Alternation in a match: in /(fasta|fa)$/, the $ must stay outside the group. Written inside, as /(fasta|fa$)/, it would anchor only the last alternative, so the fasta branch could match anywhere in the name.
5. Plan the algorithm first: sketch the overall flow before you write a line of code; without that framework to lean on, writing becomes torture.
2.1 Local assembly: create directories, cut the aligned reads into 100 kb windows with samtools, extract the read names with awk, and store each window's output in its own folder
#!/usr/bin/perl
#Author: zxlee
#Function: from a filtered, sorted bwa bam against the ref, split reads into 100 kb windows for local assembly.
#Usage: perl script_name --size=100000 --bam=your_sorted.bam --jobs=200 --outpath=/split_reads_name --sample=DHX_Y255
use strict;
use warnings;
use Getopt::Long;   # note the capital L: "Getopt::long" will not load

# get the script parameters
my ($size, $bam, $outpath, $sample_name, $jobs);
GetOptions(
    "size=s"    => \$size,
    "bam=s"     => \$bam,
    "jobs=s"    => \$jobs,
    "sample=s"  => \$sample_name,
    "outpath=s" => \$outpath,
) or die $!;   # $! (not $1) holds the error string

# my hash, chr => chr_size
my %hash = (
    'Chr1'  => 43270923, 'Chr2'  => 35937250, 'Chr3'  => 36413819,
    'Chr4'  => 35502694, 'Chr5'  => 29958434, 'Chr6'  => 31248787,
    'Chr7'  => 29697621, 'Chr8'  => 28443022, 'Chr9'  => 23012720,
    'Chr10' => 23207287, 'Chr11' => 29021106, 'Chr12' => 27531856,
);

# iterate over each chr: mkdir, create pbs, submit pbs
foreach my $chr (keys %hash) {
    my $split_bam_dir        = $outpath . "/${chr}_split_bam/";   # absolute dirs
    my $split_pbs_dir        = $outpath . "/${chr}_split_pbs/";
    my $split_log_dir        = $outpath . "/${chr}_split_log/";
    my $split_read_names_dir = $outpath . "/${chr}_split_read_names/";
    system("mkdir $split_bam_dir")        if not -e $split_bam_dir;   # create dirs
    system("mkdir $split_pbs_dir")        if not -e $split_pbs_dir;
    system("mkdir $split_log_dir")        if not -e $split_log_dir;
    system("mkdir $split_read_names_dir") if not -e $split_read_names_dir;

    # create one pbs job per window
    for (my $start = 1; $start < $hash{$chr}; $start += $size) {
        my $end = $start + $size;
        my $startHK = int($start / $size);   # window index in 100 kb units, for convenient names
        my $endHK   = int($end / $size);
        my $bam_name  = "$split_bam_dir/${sample_name}_${chr}_${startHK}hk_${endHK}hk.bam";
        my $read_name = "$split_read_names_dir/${sample_name}_${chr}_${startHK}hk_${endHK}hk_reads_name.txt";
        my $pbs_name  = "$split_pbs_dir/${sample_name}_${chr}_${startHK}hk_${endHK}hk.pbs";
        open PBS, ">$pbs_name" or die $!;
        print PBS <<SET;
#PBS -N split_${startHK}_${endHK}
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -q low
#PBS -o $split_log_dir
hostname
date
cd \$PBS_O_WORKDIR
export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:/public/software/htslib-1.3/lib
/public/software/samtools-1.3/bin/samtools view -b $bam $chr:$start-$end -o $bam_name
/public/software/samtools-1.3/bin/samtools view $bam_name | awk '{arr[\$1]} END{for(key in arr) {print key} }' > $read_name
date
SET
        close PBS;

        # throttle: wait until fewer than $jobs of our split jobs are queued/running
        while (1) {
            my @run_jobs = `qstat -u zxli | grep "split"`;
            if (@run_jobs < $jobs) {
                last;
            } else {
                sleep(1);
            }
        }
        system("qsub $pbs_name");
        last;   # for test
    }
    last;   # for test
}
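The window bookkeeping in the loop above can be sketched on its own. A minimal demo for a hypothetical 250 kb chromosome cut into 100 kb windows (here with inclusive, non-overlapping windows, i.e. end = start + size - 1 clipped to the chromosome; the Perl loop above uses start + size, which overlaps neighbouring windows by one base):

```shell
size=100000
chr_len=250000
start=1
while [ "$start" -le "$chr_len" ]; do
    end=$((start + size - 1))
    if [ "$end" -gt "$chr_len" ]; then end=$chr_len; fi   # clip the last window
    echo "window: $start-$end"
    start=$((start + size))
done
# prints:
# window: 1-100000
# window: 100001-200000
# window: 200001-250000
```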