python linecache讀取過程

本文轉載自查看原文 2017-04-17 22:08 6894 python/ 文件讀取

最近使用Python編寫日志處理腳本時，對Python的幾種讀取文件的方式進行了實驗。其中，linecache的行為引起了我的注意。
Python按行讀取文件的經典方式有以下幾種：

with open('blabla.log', 'r') as f:
    for line in f.readlines():
        ## do something

with open('blabla.log', 'r') as f:
    for line in f:
      ## do something

with open('blabla.log', 'r') as f:
    while 1:
        line = f.readline()
        if not line:
          break
        ## do something

以上幾種方式都不支持對於文件按行隨機訪問。在這樣的背景下，能夠支持訪直接訪問某一行內容的linecache模塊是一種很好的補充。
我們可以使用linecache模塊的getline方法訪問某一具體行的內容，官方文檔中給出了如下用法：

>>> import linecache
>>> linecache.getline('/etc/passwd', 4)

在使用過程中我注意到，基於linecache的getline方法的日志分析會在跑滿CPU資源之前首先占用大量內存空間，也就是在CPU使用率仍然很低的情況下，內存空間就會被迅速地消耗。
這一現象引起了我的興趣。我猜測linecache在隨機讀取文件時，是首先依序將文件讀入內存，之后尋找所要定位的行是否在內存當中。若不在，則進行相應的替換行為，直至尋找到所對應的行，再將其返回。
對linecache代碼的閱讀證實了這一想法。
在linecache.py中，我們可以看到getline的定義為：

def getline(filename, lineno, module_globals=None):
    lines = getlines(filename, module_globals)
    if 1 <= lineno <= len(lines):
        return lines[lineno-1]
    else:
        return ''

不難看出，getline方法通過getlines得到了文件行的List，以此來實現對於文件行的隨機讀取。繼續查看getlines的定義。

def getlines(filename, module_globals=None):
    """Get the lines for a file from the cache.
    Update the cache if it doesn't contain an entry for this file already."""

    if filename in cache:
        return cache[filename][2]
    else:
        return updatecache(filename, module_globals)

由此可見，getlines方法會首先確認文件是否在緩存當中，如果在則返回該文件的行的List，否則執行updatecache方法，對緩存內容進行更新。因此，在程序啟動階段，linecache不得不首先占用內存對文件進行緩存，才能進行后續的讀取操作。
而在updatecache方法中，我們可以看到一個有趣的事實是：

def updatecache(filename, module_globals=None):
    """Update a cache entry and return its list of lines.
    If something's wrong, print a message, discard the cache entry,
    and return an empty list."""

    ## ... 省略...

    try:
        fp = open(fullname, 'rU')
        lines = fp.readlines()
        fp.close()
    except IOError, msg:
##      print '*** Cannot open', fullname, ':', msg
        return []
    if lines and not lines[-1].endswith('\n'):
        lines[-1] += '\n'
    size, mtime = stat.st_size, stat.st_mtime
    cache[filename] = size, mtime, lines, fullname
    return lines

也就是說，linecache依然借助了文件對象的readlines方法。這也給了我們一個提示，當文件很大不適用readlines方法直接獲取行的List進行讀取解析時，linecache似乎也並不會成為一個很好的選擇。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 python開發_linecache SocketChannel 讀取ByteBuf 的過程 CPU從磁盤讀取數據過程 python 讀取文件路徑 Python讀取CSV文件 python 對Excel表格的讀取 Python如何讀取Yaml的數據 Python 讀取文件路徑 python 讀取xml文件 python讀取excel文件