計算基因組外顯子長度

本文轉載自查看原文 2017-01-09 23:51 1375

 
下載基因組外顯子信心
網站 ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/
wget ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS.current.txt
運行下列代碼 得到外顯子大約36M
 
         import 
         re 
        
 
         import 
         os 
        
 
         from 
         collections  
         import 
         OrderedDict 
        
 
         from 
         operator  
         import 
         itemgetter 
        

            
        
 
         os.chdir( 
         '/Users/yangqin/Desktop/biotree/python/' 
         )  
        

            
        
 
         exonLength  
         = 
         0 
         # 外顯子長度尚未計算 
        
 
         overlapExons  
         = 
         OrderedDict()  
        
 
         with  
         open 
         ( 
         'CCDS.current.txt' 
         ,  
         'rt' 
         ) as f:  
         # 把CCCDS.current.txt打開定義為f 
        
 
              
         for 
         line  
         in 
         f:  
         # 逐行讀取f 
        
 
                  
         if 
         line.startswith( 
         '#' 
         ): # 如果該行以井號開頭，該行為表頭 
        
 
                      
         continue 
         # 跳過該行 
        
 
                  
         line  
         = 
         line.rstrip()   
         # 把右邊的空格去掉，_.rstrip()把右邊的空格去掉 
        
 
                  
         lst  
         = 
         line.split( 
         '\t' 
         )  
         # 用split分開該行的各列數據，以\t分隔 
        
 
                  
         if 
         lst[ 
         - 
         2 
         ]  
         = 
         = 
         '-' 
         :  
         # 如果倒數第二列是‘-’，沒有外顯子 
        
 
                      
         continue 
         # 跳過該行 
        
 
                  
         lst[ 
         - 
         2 
         ]  
         = 
         re.sub( 
         '\[|\]' 
         ,  
         ' ' 
         , lst[ 
         - 
         2 
         ])  
         # 正則表達式，去掉倒數第二列兩端的[]，換成空格 
        
 
                  
         exons  
         = 
         lst[ 
         - 
         2 
         ].split( 
         ', ' 
         )  
         # 把該列（每個基因）多個外顯子隔開 
        
 
                  
         for 
         exon  
         in 
         exons:   
         # 對於每一個外顯子 
        
 
                      
         start  
         = 
         int 
         (exon.split( 
         '-' 
         )[ 
         0 
         ])  
         # 拿到該外顯子的起始坐標 
        
 
                      
         end  
         = 
         int 
         (exon.split( 
         '-' 
         )[ 
         1 
         ])  
         # 拿到該外顯子的終止坐標 
        
 
                      
         coordinate  
         = 
         lst[ 
         0 
         ]  
         + 
         ':' 
         + 
         exon  
         # 定義每一個外顯子，以防重復計算 
        
 
                      
         if 
         coordinate  
         not 
         in 
         overlapExons.keys():  
         # 如果該外顯子尚未計算 
        
 
                          
         overlapExons[coordinate]  
         = 
         1 
         # 加入該外顯子 
        
 
                          
         exonLength  
         + 
         = 
         end  
         - 
         start  
         # 計算長度 
        
 
                       
        
 
         print 
         (exonLength) 
        

            
        

          36061455 
        

            
        

            
        

            
        

            
        

            
        

            
        

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 基因組的外顯子探究基因組測序、外顯子測序和靶向測序有什么樣的區別，如何選擇？獲取基因的所有轉錄本（不同亞型）的外顯子區域獲取基因的所有轉錄本（不同亞型）的外顯子區域參考基因組基因組注釋全基因組關聯分析（GWAS）的計算原理 NextDenovo 組裝基因組基因組轉座元件如何到NCBI提交基因組