[R] Why do read.table/read.delim read in fewer rows than the file has?


I thought I knew read.table/read.delim well by now, and then I fell into another pit.

I have a dataset of a little over 30,000 rows, containing sample expression values and annotation information.

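The file has 30,000-odd rows, but only 10,000-odd were read in, and read.delim and read.table lost different numbers of rows. When I opened the file in Excel, saved it as txt again, and read that copy, the row count went back to the expected 30,000-odd.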

MP  <- read.delim("combine_test.txt", sep = "\t", header = TRUE)
MP1 <- read.table("combine_test.txt", sep = "\t", header = TRUE)
# new_combine_test.txt: the copy re-saved from Excel as txt
MP2 <- read.delim("new_combine_test.txt", sep = "\t", header = TRUE)


So I wondered whether RStudio was the problem. I ran a test on Linux, and the result was even stranger.

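MP <- read.table("combine_test2.txt", header = TRUE, sep = "\t")
dim(MP)
MP2 <- read.delim("combine_test2.txt", header = TRUE, sep = "\t")
dim(MP2)
# write each data frame straight back out, unquoted
write.table(MP, "out.txt", col.names = TRUE, row.names = FALSE, sep = "\t", quote = FALSE)
# note: this overwrites the out.txt written on the previous line
write.table(MP2, "out.txt", col.names = TRUE, row.names = FALSE, sep = "\t", quote = FALSE)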

dim reports 10,000-odd rows in both cases, yet the data written straight back out has 30,000-odd lines!

At this point I realized it was a data format problem. Let's try readr:

library(readr)
MP2 <- as.data.frame(read_delim("combine_test.txt", delim = "\t"))

Back to normal. Is base R really worse than the tidyverse??? I searched online and finally found the cause: it all comes down to the quote argument.

MP3 <- read.table("combine_test.txt", sep = "\t", quote = "", header = TRUE)
MP4 <- read.delim("combine_test.txt", sep = "\t", quote = "", header = TRUE)

As for the quote argument, the answer I found explains it like this:

Explanation: Your data has a single quote on 59th line (( pyridoxamine 5'-phosphate oxidase (predicted)). Then there is another single quote, which complements the single quote on line 59, is on line 137 (5'-hydroxyl-kinase activity...). Everything within quote will be read as a single field of data, and quotes can include the newline character also. That's why you lose the lines in between. quote = "" disables quoting altogether.

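Put simply, my data contains single quotes ('). Everything between two single quotes is treated as one field, line breaks included, so the lines in between get swallowed. Setting quote = "" turns quote handling off, so quote characters are read as ordinary text. I checked, and my KEGG descriptions do contain quotes.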
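To convince myself, here is a minimal sketch of the effect on a toy file. The data below is made up (it is not the real combine_test.txt); two of the descriptions contain an apostrophe, imitating the lines cited in the answer. It also shows why writing the data back out with quote = FALSE brings the line count back: the line breaks absorbed into the merged field are written out as real line breaks.

```r
# Toy data (hypothetical): rows g1 and g3 each contain one apostrophe.
toy <- c(
  "id\tdesc",
  "g1\tpyridoxamine 5'-phosphate oxidase",
  "g2\tplain description",
  "g3\t5'-hydroxyl-kinase activity",
  "g4\tanother plain description"
)
tmp <- tempfile(fileext = ".txt")
writeLines(toy, tmp)

# read.table's default quote set includes the single quote, so the two
# apostrophes pair up and the lines between them end up in one field.
bad <- read.table(tmp, sep = "\t", header = TRUE)

# quote = "" disables quote handling: one line, one row.
good <- read.table(tmp, sep = "\t", header = TRUE, quote = "")

nrow(bad)   # fewer rows than the file really has
nrow(good)  # 4

# Round trip: the embedded newlines in `bad` come back as line breaks.
out <- tempfile(fileext = ".txt")
write.table(bad, out, col.names = TRUE, row.names = FALSE,
            sep = "\t", quote = FALSE)
length(readLines(out))
```

On real data the number of lost rows depends on where the quote characters happen to fall, so the toy counts are only illustrative.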

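Fields that contain double quotes (") or other special characters can trigger the same kind of error. To check for it, use count.fields to count the fields on each line: an NA in the result marks a line that was absorbed into a multi-line quoted field, which means the file is being misread.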

num.fields = count.fields("combine_test.txt", sep="\t")


num.fields = count.fields("combine_test.txt", sep="\t",quote = "")
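The count.fields check can be turned into a small diagnostic that reports which lines are affected. This is a sketch on the same kind of made-up toy file (not the real data); the variable names are my own.

```r
# Toy data (hypothetical): rows g1 and g3 each contain one apostrophe.
toy <- c(
  "id\tdesc",
  "g1\tpyridoxamine 5'-phosphate oxidase",
  "g2\tplain description",
  "g3\t5'-hydroxyl-kinase activity",
  "g4\tanother plain description"
)
tmp <- tempfile(fileext = ".txt")
writeLines(toy, tmp)

# With count.fields' default quote set, lines caught inside a quoted
# field are reported as NA.
n_default <- count.fields(tmp, sep = "\t")

# With quoting disabled, every line is counted on its own.
n_noquote <- count.fields(tmp, sep = "\t", quote = "")

which(is.na(n_default))  # line numbers swallowed by a quoted field
table(n_noquote)         # all lines should share one field count
```

If which(is.na(...)) returns anything, the lines just before and after those positions are the first places to look for a stray quote character.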


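read.csv seems less prone to this problem because its default is quote = "\"", so apostrophes are not treated as quote characters; an unbalanced double quote can still break it, though. read.delim has the same default, whereas read.table defaults to quote = "\"'", which would explain why the two lost different numbers of rows. Either way, read.table can clearly fail in ways you do not expect. It is worth getting to know fread (from data.table) and the readr family as well.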
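One quick way to see why these functions behave differently is to look at their default quote arguments. The snippet below only inspects the function signatures that ship with R's utils package; the variable names are my own.

```r
# Default quote sets, taken directly from the function definitions.
q_table <- formals(utils::read.table)$quote
q_delim <- formals(utils::read.delim)$quote
q_csv   <- formals(utils::read.csv)$quote

# read.table treats both " and ' as quote characters;
# read.delim and read.csv treat only " as a quote character.
print(c(read.table = q_table, read.delim = q_delim, read.csv = q_csv))
```

So an apostrophe in a description affects read.table with default settings, while read.delim and read.csv are only affected by double quotes. Passing quote = "" explicitly sidesteps both cases when the file is known not to use quoting.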

