Handling binary files in Python (bytes and bits)


1. For byte-level processing, use struct

https://docs.python.org/2/library/struct.html

 

By default, C types are represented in the machine’s native format and byte order, and properly aligned by skipping pad bytes if necessary (according to the rules used by the C compiler).

Alternatively, the first character of the format string can be used to indicate the byte order, size and alignment of the packed data, according to the following table:

Character   Byte order               Size       Alignment
@           native                   native     native
=           native                   standard   none
<           little-endian            standard   none
>           big-endian               standard   none
!           network (= big-endian)   standard   none

If the first character is not one of these, '@' is assumed.
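To make the effect of the prefix character concrete, here is a minimal sketch that packs the same value under several prefixes and compares sizes. It assumes Python 3, where packed data is a bytes object; the sample values are chosen only for illustration.

```python
import struct

# The same 32-bit unsigned integer packed with different byte orders.
little = struct.pack('<I', 1)
big = struct.pack('>I', 1)
network = struct.pack('!I', 1)

print(little)          # b'\x01\x00\x00\x00'
print(big)             # b'\x00\x00\x00\x01'
print(network == big)  # True: '!' is big-endian

# '=' uses standard sizes with no padding, so char + int is 1 + 4 bytes.
packed_size = struct.calcsize('=ci')
# '@' (the default) follows the C compiler's rules; the int is usually
# aligned to a 4-byte boundary, so this result is platform-dependent.
native_size = struct.calcsize('@ci')
print(packed_size, native_size)
```

When the layout has to match a file format or a wire protocol, an explicit '<', '>' or '!' prefix is usually the safer choice, because the result no longer depends on the machine running the code.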

 

Format characters have the following meaning; the conversion between C and Python values should be obvious given their types. The ‘Standard size’ column refers to the size of the packed value in bytes when using standard size; that is, when the format string starts with one of '<', '>', '!' or '='. When using native size, the size of the packed value is platform-dependent.

Format   C Type               Python type          Standard size   Notes
x        pad byte             no value
c        char                 string of length 1   1
b        signed char          integer              1               (3)
B        unsigned char        integer              1               (3)
?        _Bool                bool                 1               (1)
h        short                integer              2               (3)
H        unsigned short       integer              2               (3)
i        int                  integer              4               (3)
I        unsigned int         integer              4               (3)
l        long                 integer              4               (3)
L        unsigned long        integer              4               (3)
q        long long            integer              8               (2), (3)
Q        unsigned long long   integer              8               (2), (3)
f        float                float                4               (4)
d        double               float                8               (4)
s        char[]               string
p        char[]               string
P        void *               integer                              (5), (3)

Notes:

  1. The '?' conversion code corresponds to the _Bool type defined by C99. If this type is not available, it is simulated using a char. In standard mode, it is always represented by one byte.

    New in version 2.6.

  2. The 'q' and 'Q' conversion codes are available in native mode only if the platform C compiler supports C long long, or, on Windows, __int64. They are always available in standard modes.

    New in version 2.2.

  3. When attempting to pack a non-integer using any of the integer conversion codes, if the non-integer has a __index__() method then that method is called to convert the argument to an integer before packing. If no __index__() method exists, or the call to __index__() raises TypeError, then the __int__() method is tried. However, the use of __int__() is deprecated, and will raise DeprecationWarning.

    Changed in version 2.7: Use of the __index__() method for non-integers is new in 2.7.

    Changed in version 2.7: Prior to version 2.7, not all integer conversion codes would use the __int__() method to convert, and DeprecationWarning was raised only for float arguments.

  4. For the 'f' and 'd' conversion codes, the packed representation uses the IEEE 754 binary32 (for 'f') or binary64 (for 'd') format, regardless of the floating-point format used by the platform.

  5. The 'P' format character is only available for the native byte ordering (selected as the default or with the '@' byte order character). The byte order character '=' chooses to use little- or big-endian ordering based on the host system. The struct module does not interpret this as native ordering, so the 'P' format is not available.

A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'.
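The repeat count and the standard sizes from the table can be checked with struct.calcsize. The following is a small sketch assuming Python 3; the values packed are arbitrary.

```python
import struct

# A repeat count is shorthand for repeating the format character.
same_size = struct.calcsize('<4h') == struct.calcsize('<hhhh')
same_bytes = struct.pack('<4h', 1, 2, 3, 4) == struct.pack('<hhhh', 1, 2, 3, 4)

# Standard sizes from the table: h=2, i=4, q=8, d=8 bytes.
sizes = {code: struct.calcsize('<' + code) for code in 'hiqd'}

# For 's' the count is a length, not a repeat: '4s' is ONE 4-byte string.
(tag,) = struct.unpack('<4s', b'ABCD')

print(same_size, same_bytes, sizes, tag)
```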

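Example:

Suppose we have the following struct:

struct Header
{
    unsigned short id;
    char tag[4];
    unsigned int version;
    unsigned int count;
};

Suppose socket.recv has returned one such struct and it is stored in the byte string s. To parse it, use the unpack() function.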

import struct

id, tag, version, count = struct.unpack("!H4s2I", s)

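In the format string above, ! means the data is parsed in network byte order, because it was received from the network, where it travels in network byte order. The H that follows is the unsigned short id, 4s is a 4-byte string, and 2I stands for two unsigned int values. Note that ! also disables padding, so this format describes exactly 14 bytes; the sender must serialize the fields without the alignment padding a C compiler would typically insert after tag.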


After this single unpack call, id, tag, version, and count hold our data.

Likewise, local values can easily be packed back into the struct layout.

ss = struct.pack("!H4s2I", id, tag, version, count)

The pack function converts id, tag, version, and count into the Header layout using the given format. ss is now a byte string (effectively the byte stream of the C struct) and can be sent with socket.send(ss).
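The pack/unpack round trip for Header can be sketched as follows. This assumes Python 3, and the field values are hypothetical stand-ins for data that would normally arrive from a socket.

```python
import struct

HEADER_FORMAT = '!H4s2I'  # id, tag, version, count in network byte order

# Hypothetical field values standing in for data received from a socket.
packet = struct.pack(HEADER_FORMAT, 7, b'DATA', 1, 42)

header_size = struct.calcsize(HEADER_FORMAT)  # 2 + 4 + 4 + 4 = 14 bytes
msg_id, tag, version, count = struct.unpack(HEADER_FORMAT, packet[:header_size])
print(header_size, msg_id, tag, version, count)
```

In real network code, socket.recv may return fewer bytes than requested, so one would typically keep reading until header_size bytes have been collected before calling unpack.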


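Example 2:

import struct
a = 12.34
# Convert a to binary. 'd' is a C double; an integer code such as 'i' cannot hold 12.34.
data = struct.pack('d', a)

data is now a byte string whose bytes match the binary representation of a.

Now the reverse operation: given the binary data in data (really just a byte string), convert it back into a Python value:

a, = struct.unpack('d', data)

Note that unpack returns a tuple, so even when only one value was packed:

data = struct.pack('d', a)

it has to be decoded like this:

a, = struct.unpack('d', data)   or   (a,) = struct.unpack('d', data)

Writing a = struct.unpack('d', data) directly gives a = (12.34,), which is a tuple rather than the original float.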
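The tuple behaviour is easy to verify. The sketch below assumes Python 3 and also shows why the format character matters for floats: 'd' round-trips 12.34 exactly, while 'f' keeps only 32 bits.

```python
import struct

data = struct.pack('<d', 12.34)

result = struct.unpack('<d', data)             # a 1-tuple: (12.34,)
(value,) = struct.unpack('<d', data)           # tuple unpacking yields the float
value_by_index = struct.unpack('<d', data)[0]  # indexing works too

# 'f' stores only 32 bits, so the round trip is approximate.
(single,) = struct.unpack('<f', struct.pack('<f', 12.34))
print(result, value, single)
```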


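When the data consists of several values, you can do this:

a = b'hello'
b = b'world!'
c = 2
d = 45.123
data = struct.pack('5s6sif', a, b, c, d)

data is now in binary form and can be written straight to a file, for example binfile.write(data).

Later, when needed, it can be read back with data = binfile.read()

and decoded into Python variables with struct.unpack():

a, b, c, d = struct.unpack('5s6sif', data)

'5s6sif' is called the fmt, i.e. the format string, made up of digits and characters: 5s is a string occupying 5 bytes, 2i would be 2 integers, and so on. The available characters and their types are listed in the table above; the C Type column shows which C type each Python type maps to. Because f is single precision, d comes back as approximately 45.123 rather than exactly.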
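Writing the packed record to a file and reading it back can be sketched like this. It assumes Python 3, uses a temporary file whose path is generated only for the example, and adds a '<' prefix (not in the original format) so that the record size is the same on every platform.

```python
import os
import struct
import tempfile

RECORD_FORMAT = '<5s6sif'  # '<' fixes the byte order and removes padding
record_size = struct.calcsize(RECORD_FORMAT)  # 5 + 6 + 4 + 4 = 19 bytes

# Write one record to a temporary file.
fd, path = tempfile.mkstemp(suffix='.bin')
with os.fdopen(fd, 'wb') as binfile:
    binfile.write(struct.pack(RECORD_FORMAT, b'hello', b'world!', 2, 45.123))

# Read it back and decode it into Python values.
with open(path, 'rb') as binfile:
    a, b, c, d = struct.unpack(RECORD_FORMAT, binfile.read(record_size))
os.remove(path)

print(a, b, c, round(d, 3))
```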

 

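Example 3:

import struct

# file_name is the path of the binary file to read
with open(file_name, "rb") as f:
    short_data = struct.unpack('<h', f.read(2))[0]  # little-endian 16-bit signed integer
    float_data = struct.unpack('<f', f.read(4))[0]  # little-endian 32-bit float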
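When many fields are read one after another, it can help to wrap the read-then-unpack step in a small function that also detects a truncated file. The helper name read_struct is hypothetical, io.BytesIO stands in for a file opened in "rb" mode, and Python 3 is assumed.

```python
import io
import struct

def read_struct(stream, fmt):
    """Read exactly calcsize(fmt) bytes from stream and unpack them."""
    size = struct.calcsize(fmt)
    chunk = stream.read(size)
    if len(chunk) != size:
        raise EOFError('expected %d bytes, got %d' % (size, len(chunk)))
    return struct.unpack(fmt, chunk)

# io.BytesIO stands in for a file opened with mode "rb".
stream = io.BytesIO(struct.pack('<h', -2) + struct.pack('<f', 1.5))
(short_data,) = read_struct(stream, '<h')
(float_data,) = read_struct(stream, '<f')
print(short_data, float_data)  # -2 1.5
```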


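2. Some protocols define field lengths in bits (3 bits wide, 7 bits wide, and so on). struct is not a good fit for these.

  We can use bitstring instead, which makes them fairly easy to handle.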

 

https://pypi.org/project/bitstring/

 

Code example:

   

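import bitstring

# file_name is the path of the binary file to read
with open(file_name, "rb") as f:
    file_b = bitstring.BitStream(bytes=f.read())

print(file_b.read(3).int)   # next 3 bits as a signed integer
print(file_b.read(3).int)   # next 3 bits as a signed integer
print(file_b.read(7).uint)  # next 7 bits as an unsigned integer (.bytes needs a multiple of 8 bits)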
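For readers without bitstring installed, the same kind of bit-by-bit reading can be approximated with the standard library alone. The BitReader class below is a hypothetical helper written for this illustration, not part of bitstring and not how bitstring is implemented; it assumes Python 3 and reads bits most-significant-bit first.

```python
class BitReader:
    """Minimal MSB-first bit reader (hypothetical helper, not bitstring)."""

    def __init__(self, data):
        self._value = int.from_bytes(data, 'big')
        self._total = len(data) * 8
        self._pos = 0

    def read_uint(self, width):
        """Return the next `width` bits as an unsigned integer."""
        if self._pos + width > self._total:
            raise EOFError('not enough bits left')
        shift = self._total - self._pos - width
        self._pos += width
        return (self._value >> shift) & ((1 << width) - 1)

    def read_int(self, width):
        """Return the next `width` bits as a two's-complement signed integer."""
        value = self.read_uint(width)
        if value >= 1 << (width - 1):
            value -= 1 << width
        return value

# Two sample bytes: 10101111 11100000
reader = BitReader(bytes([0b10101111, 0b11100000]))
first = reader.read_int(3)   # bits 101     -> -3
second = reader.read_int(3)  # bits 011     -> 3
third = reader.read_uint(7)  # bits 1111100 -> 124
print(first, second, third)
```

The whole input is converted to one large integer up front, which keeps the sketch short but is only reasonable for small inputs.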

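You can also define a structure with a format string:

# The '...' lines mark fields elided in the original; they must be
# filled in or removed before this will run.
fmt = """sequence_header_code,
         uint:12=horizontal_size_value,
         uint:12=vertical_size_value,
         uint:4=aspect_ratio_information,
         ...
         """
d = {'sequence_header_code': '0x000001b3',
     'horizontal_size_value': 352,
     'vertical_size_value': 288,
     'aspect_ratio_information': 1,
     ...
    }
s = bitstring.pack(fmt, **d)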
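To show what such a bit-level pack produces, here is a standard-library sketch that packs only the four fields named above. pack_bits is a hypothetical helper, not bitstring's API; it assumes Python 3 and, unlike a bitstring object, zero-pads the result to a whole number of bytes.

```python
def pack_bits(fields):
    """Pack (value, width) pairs MSB-first and zero-pad to a whole byte.

    Hypothetical helper for illustration, not bitstring's implementation.
    """
    accumulator = 0
    total_bits = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError('%d does not fit in %d bits' % (value, width))
        accumulator = (accumulator << width) | value
        total_bits += width
    padding = -total_bits % 8  # bits needed to reach a byte boundary
    accumulator <<= padding
    return accumulator.to_bytes((total_bits + padding) // 8, 'big')

packed = pack_bits([
    (0x000001b3, 32),  # sequence_header_code
    (352, 12),         # horizontal_size_value
    (288, 12),         # vertical_size_value
    (1, 4),            # aspect_ratio_information
])
print(packed.hex())  # 000001b316012010
```

The four fields add up to 60 bits, so the helper appends 4 zero bits to fill the eighth byte.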

 

 

 
       

