A 5-Minute Introduction to the MP4 File Format

This article is a repost. 2020-12-08 08:17. Tags: lessons learned / mp4 / networking / live streaming / video

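Preface

This article covers what MP4 is, the basic structure of an MP4 file, the basic structure of a box, the most common and important boxes, the differences between regular MP4 and fMP4, and how to parse an MP4 file in code.

Background: recently I have been answering a lot of questions from teammates about live streaming and short video, such as "How does flv.js work?", "Why does the mp4 file our designer handed over fail to play in the browser while it plays fine locally?", and "MP4 has great compatibility, so can it be used for live streaming?"

While answering, I found that I kept having to explain the MP4 format. I had looked into it before and taken notes, so here I tidy them up to double as a reference document for the team. If you spot mistakes or omissions, please point them out.

What is MP4

First, a word on container formats. A multimedia container format (also called an encapsulation format) defines the rules for packing video data, audio data and so on into a single file. MKV, AVI and MP4, the subject of this article, are all container formats.

MP4 is one of the most common container formats and is widely used thanks to its cross-platform support. MP4 files use the .mp4 extension, and virtually all mainstream players and browsers support the format.

The MP4 file format is defined mainly by two parts of the standard, MPEG-4 Part 12 and MPEG-4 Part 14. MPEG-4 Part 12 defines the ISO base media file format, which stores time-based media content. MPEG-4 Part 14 defines the MP4 file format itself as an extension of MPEG-4 Part 12.

Anyone working on live streaming or audio/video should understand the MP4 format. A brief introduction follows.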

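MP4 file format overview

An MP4 file is made up of boxes. Each box stores a different kind of information, and the boxes are arranged in a tree, as shown in the figure below.

There are many box types. These are three of the most important top-level boxes:

ftyp: File Type Box, describes the MP4 specification and version the file conforms to;
moov: Movie Box, the metadata of the media; there is exactly one;
mdat: Media Data Box, holds the actual media data; a regular MP4 typically has one, an fMP4 has many;

Although there are many box types, they all share the same basic structure. The next section describes that structure, and later sections go through the common boxes in more detail.

The table below lists common boxes. Skim it to get a rough impression, then jump to the next section.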

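Introduction to MP4 boxes

A box consists of two parts, the box header and the box body.

box header: metadata about the box, such as the box type and box size.
box body: the payload of the box. What it actually contains depends on the box type; for example, the body of mdat stores media data.

In the box header, only type and size are mandatory. When size == 1, a largesize field is present. Some boxes also carry version and flags fields; such a box is called a Full Box. A box whose body contains other boxes is called a container box.

Box Header

The fields are defined as follows:

type: the box type, 4 bytes, either a predefined type or a custom extended type;
- Predefined types: ftyp, moov, mdat and so on;
- Custom extended types: if type == uuid, the box is a custom extended type, and the 16 bytes following size (or largesize) hold the value of the custom type (extended_type).
size: the size in bytes of the whole box, including the box header. The values 0 and 1 need special handling:
- size equals 0: the box extends to the end of the file, so it is the last box in the file (usually an mdat box);
- size equals 1: the real size is given by the largesize field that follows (in practice usually only the mdat box, which carries the media data, needs largesize);
largesize: the box size, 8 bytes;
extended_type: the custom extended type, 16 bytes;

The pseudocode for Box is: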

aligned(8) class Box (unsigned int(32) boxtype, optional unsigned int(8)[16] extended_type) {
    unsigned int(32) size;
    unsigned int(32) type = boxtype;
    if (size==1) {
        unsigned int(64) largesize;
    } else if (size==0) {
        // box extends to end of file
    }
    if (boxtype=='uuid') {
        unsigned int(8)[16] usertype = extended_type;
    } 
}
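The header rules above can be sketched in Node.js roughly as follows. This is only an illustration: `readBoxHeader` is a hypothetical helper name, and the 16-byte `free` box is hand-built for the demo rather than taken from a real file.

```javascript
// Minimal sketch: read one box header from a Node.js Buffer.
function readBoxHeader(buffer, offset = 0) {
  let size = buffer.readUInt32BE(offset);
  const type = buffer.toString('latin1', offset + 4, offset + 8);
  let headerSize = 8;

  if (size === 1) {
    // The real size is stored in the 64-bit largesize field after type.
    size = Number(buffer.readBigUInt64BE(offset + 8));
    headerSize += 8;
  } else if (size === 0) {
    // The box runs to the end of the file (here: the end of the buffer).
    size = buffer.length - offset;
  }

  let extendedType = null;
  if (type === 'uuid') {
    // 16-byte extended_type follows size (or largesize).
    extendedType = buffer.subarray(offset + headerSize, offset + headerSize + 16);
    headerSize += 16;
  }
  return { type, size, headerSize, extendedType };
}

// Hand-built empty 16-byte box, for illustration only.
const demo = Buffer.alloc(16);
demo.writeUInt32BE(16, 0);
demo.write('free', 4, 'latin1');

const header = readBoxHeader(demo);
console.log(header.type, header.size, header.headerSize); // free 16 8
```

Converting largesize with `Number()` loses precision above 2^53 bytes, which is an accepted simplification in this sketch.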

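Box Body

The body holds the box data. Its content differs from box to box, so refer to the definition of the specific box. Some box bodies are simple, such as ftyp. Others are more complex and may nest other boxes, such as moov.

Box vs FullBox

FullBox extends Box. Compared with Box, FullBox adds the version and flags fields.

version: the version of the box, reserved for future extension, 1 byte;
flags: flag bits, 24 bits, whose meaning is defined by each box;

The pseudocode for FullBox is: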

aligned(8) class FullBox(unsigned int(32) boxtype, unsigned int(8) v, bit(24) f) extends Box(boxtype) {
	unsigned int(8) version = v;
	bit(24) flags = f;
}

FullBox is used mainly by boxes inside moov, such as moov.mvhd, which is covered later.

aligned(8) class MovieHeaderBox extends FullBox('mvhd', version, 0) {
	// fields omitted...
}

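ftyp (File Type Box)

ftyp states which specifications the file conforms to. Before going into the details of ftyp, a quick note on isom.

What is isom

isom (ISO Base Media file) is the base file format defined in MPEG-4 Part 12. Common container formats such as MP4, 3gp and QT are all derived from it.

The specifications an MP4 file may conform to include mp41 and mp42, both of which are derived from isom.

3gp (3GPP): a container format used mainly on 3G phones;
QT: short for QuickTime; a .qt file is an Apple QuickTime media file;

ftyp definition

ftyp is defined as follows: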

aligned(8) class FileTypeBox extends Box('ftyp') {
  unsigned int(32) major_brand;  
  unsigned int(32) minor_version;  
  unsigned int(32) compatible_brands[]; // to end of the box  
}

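Below is the description of a brand. It is simply the code for a specific container format, expressed as a 4-byte code such as mp41.

A brand is a four-letter code representing a format or subformat. Each file has a major brand (or primary brand), and also a compatibility list of brands.

The ftyp fields mean the following:

major_brand: common values include isom, mp41, mp42, avc1 and qt. It indicates which format is the "best" one to use when parsing the file. For example, if major_brand is A and compatible_brands contains A1, a decoder that supports both A and A1 should preferably use A; a decoder that does not support A but does support A1 can use A1;
minor_version: informative detail about major_brand, such as a version number. It must not be used to decide whether the file conforms to a standard or specification;
compatible_brands: the list of brands the file is compatible with. For example, the compatible brand of mp41 is isom. A reader that implements one of the listed brands can decode the file partially or completely;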
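As a rough illustration of the three fields, the sketch below reads them from a Buffer holding a complete ftyp box. `parseFtyp` is a hypothetical helper, and the sample box (major brand mp42, compatible with mp42 and isom) is constructed by hand for the demo.

```javascript
// Minimal sketch: parse an ftyp box (8-byte header included) from a Buffer.
function parseFtyp(box) {
  const fourcc = (pos) => box.toString('latin1', pos, pos + 4);

  // compatible_brands[] runs from byte 16 to the end of the box.
  const compatibleBrands = [];
  for (let pos = 16; pos + 4 <= box.length; pos += 4) {
    compatibleBrands.push(fourcc(pos));
  }

  return {
    majorBrand: fourcc(8),
    minorVersion: box.readUInt32BE(12),
    compatibleBrands,
  };
}

// Hand-built 24-byte ftyp box, for illustration only.
const ftyp = Buffer.concat([
  Buffer.from([0, 0, 0, 24]),          // size
  Buffer.from('ftypmp42', 'latin1'),   // type + major_brand
  Buffer.from([0, 0, 0, 0]),           // minor_version
  Buffer.from('mp42isom', 'latin1'),   // compatible_brands
]);

console.log(parseFtyp(ftyp).majorBrand); // mp42
```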

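In practice isom should not be used as the major_brand; a concrete brand such as mp41 should be used instead. For that reason no file extension or MIME type is defined for isom.

Below are several common brands with their file extensions and MIME types. More brands can be found here.

Below is a screenshot of a real example; it needs no further explanation.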

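About AVC/AVC1

In discussions of the MP4 specifications, AVC sometimes means the "AVC file format" and sometimes the "AVC compression standard (H.264)". A quick distinction:

AVC file format: derived from the ISO base file format and using the AVC compression standard. It can be regarded as an extension of MP4; its brand is usually avc1, and it is defined in MPEG-4 Part 15.
AVC compression standard (H.264): defined in MPEG-4 Part 10.
ISO base file format (Base Media File Format): defined in MPEG-4 Part 12.

moov (Movie Box)

The Movie Box stores the metadata of the mp4. It usually sits near the start of the file, although some files place it at the end.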

aligned(8) class MovieBox extends Box('moov'){ }

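The two most important boxes inside moov are mvhd and trak:

mvhd: Movie Header Box, information about the mp4 file as a whole, such as creation time and duration;
trak: Track Box. An mp4 can contain one or more tracks (for example a video track and an audio track), and the information about each track lives in a trak. trak is a container box and holds at least two boxes, tkhd and mdia;

mvhd applies to the whole movie, tkhd to a single track, mdhd to the media, vmhd to video and smhd to audio. They go from general to specific, and the former is generally derived from the latter.

mvhd (Movie Header Box)

Information about the MP4 file as a whole, independent of any specific video or audio stream, such as creation time and duration.

It is defined as follows: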

aligned(8) class MovieHeaderBox extends FullBox('mvhd', version, 0) {
   if (version==1) {
      unsigned int(64)  creation_time;
      unsigned int(64)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(64)  duration;
   } else { // version==0
      unsigned int(32)  creation_time;
      unsigned int(32)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(32)  duration;
   }
   template int(32) rate = 0x00010000; // typically 1.0
   template int(16) volume = 0x0100; // typically, full volume
   const bit(16) reserved = 0;
   const unsigned int(32)[2] reserved = 0;
   template int(32)[9] matrix =
      { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 };
      // Unity matrix
   bit(32)[6]  pre_defined = 0;
   unsigned int(32)  next_track_ID;
}

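The fields mean the following:

creation_time: file creation time;
modification_time: file modification time;
timescale: the number of time units in one second (integer). For example, if timescale is 1000, one second contains 1000 time units. Later times, such as track durations, are converted with this value: a track duration of 10,000 corresponds to 10,000 / 1000 = 10 s;
duration: the movie duration (integer), derived from the tracks in the file; it equals the duration of the longest track;
rate: the preferred playback rate, a 32-bit integer whose high 16 bits and low 16 bits are the integer and fractional parts ([16.16]). For example, 0x0001 0000 is 1.0, normal playback speed;
volume: playback volume, a 16-bit integer whose high 8 bits and low 8 bits are the integer and fractional parts ([8.8]). For example, 0x01 00 is 1.0, full volume;
matrix: the transformation matrix for the video; it can usually be ignored;
next_track_ID: a non-zero 32-bit integer that can usually be ignored. It indicates a track ID that can be used when adding a new track to the movie, and it must be larger than every track ID already in use. In other words, adding a new track requires walking all tracks to confirm an available track ID;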
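To make the timescale and fixed-point conversions concrete, here is a small sketch that reads timescale, duration, rate and volume from an mvhd box. `readMvhd` is a hypothetical helper, the byte offsets follow the pseudocode above, and the 108-byte version 0 box is hand-built with the values from the example (timescale 1000, duration 10,000).

```javascript
// Minimal sketch: read the timing fields of an mvhd box (8-byte header included).
function readMvhd(box) {
  const version = box.readUInt8(8);
  let timescale;
  let duration;
  let pos; // offset of the rate field
  if (version === 1) {
    timescale = box.readUInt32BE(28);
    duration = Number(box.readBigUInt64BE(32));
    pos = 40;
  } else {
    timescale = box.readUInt32BE(20);
    duration = box.readUInt32BE(24);
    pos = 28;
  }
  return {
    timescale,
    duration,
    seconds: duration / timescale,
    rate: box.readInt32BE(pos) / 0x10000,     // 16.16 fixed point
    volume: box.readInt16BE(pos + 4) / 0x100, // 8.8 fixed point
  };
}

// Hand-built version 0 mvhd box, for illustration only.
const mvhd = Buffer.alloc(108);
mvhd.writeUInt32BE(108, 0);
mvhd.write('mvhd', 4, 'latin1');
mvhd.writeUInt32BE(1000, 20);      // timescale
mvhd.writeUInt32BE(10000, 24);     // duration
mvhd.writeInt32BE(0x00010000, 28); // rate
mvhd.writeInt16BE(0x0100, 32);     // volume

console.log(readMvhd(mvhd).seconds); // 10
```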

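tkhd (Track Header Box)

The metadata of a single track. It contains the following fields:

version: the version of the tkhd box;
flags: obtained with bitwise OR. The default value is 7 (0x000001 | 0x000002 | 0x000004), meaning the track is enabled, used in playback and used in preview.
- Track_enabled: value 0x000001, the track is enabled; when this bit is not set the track is disabled;
- Track_in_movie: value 0x000002, the track is used during playback;
- Track_in_preview: value 0x000004, the track is used in preview mode;
creation_time: creation time of the track;
modification_time: most recent modification time of the track;
track_ID: the unique identifier of the track; it cannot be 0 and cannot be reused;
duration: the full duration of the track (divide by timescale to get seconds);
layer: the stacking order of video tracks. Lower numbers are closer to the viewer, so 1 is above 2 and 0 is above 1;
alternate_group: the group ID of the track. Tracks with the same alternate_group value belong to the same group, and only one track in a group can be playing at any time. A value of 0 means the track is not grouped with any other track. A group may also contain a single track;
volume: the volume of an audio track, between 0.0 and 1.0;
matrix: the transformation matrix for the video;
width, height: the width and height of the video, stored as 16.16 fixed-point values;

It is defined as follows: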

aligned(8) class TrackHeaderBox 
  extends FullBox('tkhd', version, flags){
	if (version==1) {
	      unsigned int(64)  creation_time;
	      unsigned int(64)  modification_time;
	      unsigned int(32)  track_ID;
	      const unsigned int(32)  reserved = 0;
	      unsigned int(64)  duration;
	   } else { // version==0
	      unsigned int(32)  creation_time;
	      unsigned int(32)  modification_time;
	      unsigned int(32)  track_ID;
	      const unsigned int(32)  reserved = 0;
	      unsigned int(32)  duration;
	}
	const unsigned int(32)[2] reserved = 0;
	template int(16) layer = 0;
	template int(16) alternate_group = 0;
	template int(16) volume = {if track_is_audio 0x0100 else 0};
	const unsigned int(16) reserved = 0;
	template int(32)[9] matrix= { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // unity matrix
	unsigned int(32) width;
	unsigned int(32) height;
}

An example follows:

hdlr (Handler Reference Box)

Declares the type of the track and the corresponding handler.

handler_type takes the following values:

vide (0x76 69 64 65): video track;
soun (0x73 6f 75 6e): audio track;
hint (0x68 69 6e 74): hint track;

name is a UTF-8 string that describes the handler, for example L-SMASH Video Handler (see here).

aligned(8) class HandlerBox extends FullBox('hdlr', version = 0, 0) {
	unsigned int(32) pre_defined = 0;
	unsigned int(32) handler_type;
	const unsigned int(32)[3] reserved = 0;
   	string   name;
}

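stbl (Sample Table Box)

The media data of an MP4 file lives in the mdat box, while stbl holds the index and timing information for that data. Understanding stbl is key to decoding and rendering an MP4 file.

In an MP4 file the media data is split into chunks. Each chunk can contain several samples, and a sample is made up of frames (usually one sample corresponds to one frame). The relationship is as follows:

[Figure: relationship between chunks, samples and frames]

The key boxes in stbl are stsd, stco, stsc, stsz, stts, stss and ctts. A brief overview comes first, followed by the details of each.

Overview of stco / stsc / stsz / stts / stss / ctts / stsd

A brief description of these boxes:

stsd: gives the codec, width and height, volume and similar information for video or audio, as well as how many frames each sample contains;
stco: the offset of each chunk in the file;
stsc: how many samples each chunk contains;
stsz: the size of each sample in bytes;
stts: the duration of each sample;
stss: which samples are key frames;
ctts: the difference between the decode time and the presentation time of a frame, typically used when B-frames are present;

stsd (Sample Description Box)

stsd gives the description of the samples, including any initialization information needed at decode time, such as the codec. Video and audio need different initialization information; video is used as the example here.

The pseudocode is: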

aligned(8) abstract class SampleEntry (unsigned int(32) format) extends Box(format){
	const unsigned int(8)[6] reserved = 0;
	unsigned int(16) data_reference_index;
}

// Visual Sequences
class VisualSampleEntry(codingname) extends SampleEntry (codingname){ 
	unsigned int(16) pre_defined = 0;
	const unsigned int(16) reserved = 0;
	unsigned int(32)[3] pre_defined = 0;
	unsigned int(16) width;
	unsigned int(16) height;
	template unsigned int(32) horizresolution = 0x00480000; // 72 dpi 
	template unsigned int(32) vertresolution = 0x00480000; // 72 dpi 
	const unsigned int(32) reserved = 0;
	template unsigned int(16) frame_count = 1;
	string[32] compressorname;
	template unsigned int(16) depth = 0x0018;
	int(16) pre_defined = -1;
}

// AudioSampleEntry and HintSampleEntry definitions omitted


aligned(8) class SampleDescriptionBox (unsigned int(32) handler_type) extends FullBox('stsd', 0, 0){
	int i ;
	unsigned int(32) entry_count;
	for (i = 1 ; i <= entry_count ; i++) {
		switch (handler_type){
			case 'soun': // for audio tracks
				AudioSampleEntry();
				break;
			case 'vide': // for video tracks
				VisualSampleEntry();
				break;
			case 'hint': // Hint track
				HintSampleEntry();
				break;
		}
	}
}

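In SampleDescriptionBox, the handler_type parameter is the type of the track (soun, vide, hint), and entry_count is the number of sample descriptions in the box.

In stsc, sample_description_index is the index that points to these sample descriptions.

Depending on handler_type, SampleDescriptionBox uses a different SampleEntry type; for a video track it is VisualSampleEntry.

VisualSampleEntry contains the following fields:

data_reference_index: the data of an MP4 file can be split into several pieces, each with its own index and each fetched through its own URL. In that case data_reference_index points to the corresponding piece (rarely used);
width, height: the width and height of the video in pixels;
horizresolution, vertresolution: horizontal and vertical resolution (pixels per inch) as 16.16 fixed-point numbers; the default is 0x00480000 (72 dpi);
frame_count: how many frames a sample contains; for a video track the default is 1;
compressorname: an informative name, usually for display, occupying 32 bytes, for example AVC Coding. The first byte gives the number of bytes N the name actually uses, bytes 2 to N+1 hold the name, and bytes N+2 to 32 are padding. compressorname may be set to 0;
depth: the bit depth of the image, for example 0x0018 (24), meaning an image without an alpha channel;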

In video tracks, the frame_count field must be 1 unless the specification for the media format explicitly documents this template field and permits larger values. That specification must document both how the individual frames of video are found (their size information) and their timing established. That timing might be as simple as dividing the sample duration by the frame count to establish the frame duration.

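An example follows:

stco (Chunk Offset Box)

The offset of each chunk in the file. There are two box types, stco for small files and co64 for large files. Their structure is the same; only the field width differs.

chunk_offset is an offset within the file itself, not an offset inside some box.

When building an mp4 file, pay special attention to where moov sits, because it affects the chunk_offset values. Some MP4 files have moov at the end of the file. To speed up the first frame, moov has to be moved to the front, and chunk_offset must then be rewritten.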
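The offset rewrite can be sketched as follows, assuming moov moves from the end of the file to just before mdat, so every chunk shifts forward by the size of moov. `shiftStcoOffsets` is a hypothetical helper that patches a copy of a version 0 stco box; the sample box and the moov size of 4238 bytes are illustrative values.

```javascript
// Minimal sketch: add `delta` bytes to every chunk_offset in an stco box.
// Layout: 8-byte box header, 4 bytes version/flags, entry_count, then offsets.
function shiftStcoOffsets(stcoBox, delta) {
  const patched = Buffer.from(stcoBox); // work on a copy
  const entryCount = patched.readUInt32BE(12);
  for (let i = 0; i < entryCount; i++) {
    const pos = 16 + i * 4;
    const shifted = patched.readUInt32BE(pos) + delta;
    if (shifted > 0xffffffff) {
      throw new RangeError('offset no longer fits in stco; co64 would be needed');
    }
    patched.writeUInt32BE(shifted, pos);
  }
  return patched;
}

// Hand-built stco box with two chunk offsets, for illustration only.
const stco = Buffer.alloc(24);
stco.writeUInt32BE(24, 0);
stco.write('stco', 4, 'latin1');
stco.writeUInt32BE(2, 12);    // entry_count
stco.writeUInt32BE(40, 16);   // chunk 1
stco.writeUInt32BE(1040, 20); // chunk 2

const moved = shiftStcoOffsets(stco, 4238);
console.log(moved.readUInt32BE(16), moved.readUInt32BE(20)); // 4278 5278
```

A real tool would apply the same shift to every track's stco (or co64) box before writing moov ahead of mdat.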

stco is defined as follows:

# Box Type: 'stco', 'co64'
# Container: Sample Table Box ('stbl')
# Mandatory: Yes
# Quantity: Exactly one variant must be present

aligned(8) class ChunkOffsetBox
	extends FullBox('stco', version = 0, 0) {
	unsigned int(32) entry_count;
	for (i=1; i <= entry_count; i++) {
		unsigned int(32)  chunk_offset;
	}
}

aligned(8) class ChunkLargeOffsetBox
	extends FullBox('co64', version = 0, 0) {
	unsigned int(32) entry_count;
	for (i=1; i <= entry_count; i++) {
		unsigned int(64)  chunk_offset;
	}
}

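In the example, the first chunk is at offset 47564, the second at 120579, and so on.

stsc (Sample To Chunk Box)

Samples are grouped into chunks. Chunks can differ in size, and the samples inside a chunk can differ in size too.

entry_count: the number of table entries (each entry holds first_chunk, samples_per_chunk and sample_description_index);
first_chunk: the index of the first chunk the entry applies to;
samples_per_chunk: the number of samples in each chunk;
sample_description_index: the index of the sample description in stsd (see the stsd section);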

aligned(8) class SampleToChunkBox
	extends FullBox('stsc', version = 0, 0) {
	unsigned int(32) entry_count;
	for (i=1; i <= entry_count; i++) {
		unsigned int(32) first_chunk;
		unsigned int(32) samples_per_chunk; 
		unsigned int(32) sample_description_index;
	}
}

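The description above is abstract, so here is an example. The table below means:

Chunks 1 to 15 each contain 15 samples;
Chunk 16 contains 30 samples;
Chunk 17 and every chunk after it contain 28 samples each;
The samples in all of these chunks use sample description index 1;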

first_chunk	samples_per_chunk	sample_description_index
1	15	1
16	30	1
17	28	1
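The run-length style of the stsc table can be expanded into one sample count per chunk, as in the sketch below. `expandStsc` is a hypothetical helper, and `chunkCount` (the total number of chunks, which would come from the entry_count of stco) is assumed to be 18 for the demo.

```javascript
// Minimal sketch: expand stsc entries into a per-chunk sample count.
// Each entry applies from its firstChunk up to the chunk before the next entry.
function expandStsc(entries, chunkCount) {
  const counts = [];
  for (let i = 0; i < entries.length; i++) {
    const { firstChunk, samplesPerChunk } = entries[i];
    const lastChunk = i + 1 < entries.length
      ? entries[i + 1].firstChunk - 1
      : chunkCount;
    for (let chunk = firstChunk; chunk <= lastChunk; chunk++) {
      counts.push(samplesPerChunk);
    }
  }
  return counts;
}

// The table above, with an assumed total of 18 chunks.
const stscEntries = [
  { firstChunk: 1, samplesPerChunk: 15 },
  { firstChunk: 16, samplesPerChunk: 30 },
  { firstChunk: 17, samplesPerChunk: 28 },
];
const perChunk = expandStsc(stscEntries, 18);
console.log(perChunk[0], perChunk[15], perChunk[17]); // 15 30 28
```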

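stsz (Sample Size Boxes)

The size of each sample in bytes. The sample_count field tells how many samples (or frames) the track contains.

There are two box types, stsz and stz2.

stsz:

sample_size: the default sample size in bytes, usually 0. If sample_size is non-zero, all samples have that size. If sample_size is 0, the samples may differ in size;
sample_count: the number of samples in the track. If sample_size == 0, sample_count equals the number of entries below;
entry_size: the size of an individual sample (when sample_size == 0);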

aligned(8) class SampleSizeBox extends FullBox('stsz', version = 0, 0) {
	unsigned int(32) sample_size;
	unsigned int(32) sample_count;
	if (sample_size==0) {
		for (i=1; i <= sample_count; i++) {
			unsigned int(32)  entry_size;
		}
	}
}
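Together, stco, stsc and stsz are enough to locate every sample in the file: a sample starts at its chunk's offset plus the sizes of the samples that precede it in the same chunk. The sketch below shows that calculation. `sampleOffsets` is a hypothetical helper, `samplesInChunk` is assumed to be the per-chunk sample count already derived from stsc, and the numbers are made up for the demo.

```javascript
// Minimal sketch: compute the file offset of every sample.
//   chunkOffsets   - chunk_offset values from stco (or co64)
//   samplesInChunk - number of samples in each chunk (derived from stsc)
//   sampleSizes    - entry_size values from stsz
function sampleOffsets(chunkOffsets, samplesInChunk, sampleSizes) {
  const offsets = [];
  let sampleIndex = 0;
  chunkOffsets.forEach((chunkOffset, chunkIndex) => {
    let pos = chunkOffset;
    for (let s = 0; s < samplesInChunk[chunkIndex]; s++) {
      offsets.push(pos);
      pos += sampleSizes[sampleIndex];
      sampleIndex += 1;
    }
  });
  return offsets;
}

// Two chunks: the first holds two samples, the second holds one.
console.log(sampleOffsets([100, 1000], [2, 1], [10, 20, 30])); // [ 100, 110, 1000 ]
```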

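stz2:

field_size: the number of bits each entry_size occupies in the entry table; allowed values are 4, 8 and 16. The value 4 is special: one byte then holds two entries, entry[i] in the high 4 bits and entry[i+1] in the low 4 bits;
sample_count: equals the number of entries below;
entry_size: the size of the sample.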

aligned(8) class CompactSampleSizeBox extends FullBox('stz2', version = 0, 0) {
	unsigned int(24) reserved = 0;
	unsigned int(8) field_size;
	unsigned int(32) sample_count;
	for (i=1; i <= sample_count; i++) {
		unsigned int(field_size) entry_size;
	}
}
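The 4-bit packing is the only non-obvious part of stz2, so here is a small sketch of reading the entry table. `readCompactSizes` is a hypothetical helper that takes only the bytes of the entry table (not the whole box), and the two demo bytes are made up.

```javascript
// Minimal sketch: decode the stz2 entry table for field_size 4, 8 or 16.
// `table` is a Buffer containing only the entry_size values.
function readCompactSizes(table, fieldSize, sampleCount) {
  const sizes = [];
  for (let i = 0; i < sampleCount; i++) {
    if (fieldSize === 4) {
      // Two entries per byte: even index in the high nibble, odd in the low one.
      const byte = table[i >> 1];
      sizes.push(i % 2 === 0 ? byte >> 4 : byte & 0x0f);
    } else if (fieldSize === 8) {
      sizes.push(table[i]);
    } else if (fieldSize === 16) {
      sizes.push(table.readUInt16BE(i * 2));
    } else {
      throw new RangeError(`unsupported field_size: ${fieldSize}`);
    }
  }
  return sizes;
}

// 0x12 0x30 with field_size 4 and three samples: the last nibble is padding.
console.log(readCompactSizes(Buffer.from([0x12, 0x30]), 4, 3)); // [ 1, 2, 3 ]
```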

An example follows:

stts (Decoding Time to Sample Box)

stts holds the table that maps DTS to sample number. It is mainly used to derive the duration of each frame.

aligned(8) class TimeToSampleBox extends FullBox('stts', version = 0, 0) {
	unsigned int(32)  entry_count;
	int i;
	for (i=0; i < entry_count; i++) {
		unsigned int(32)  sample_count;
		unsigned int(32)  sample_delta;
	}
}

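entry_count: the number of entries in stts;
sample_count: within one entry, the number of consecutive samples that share the same duration (sample_delta);
sample_delta: the duration of a sample (in timescale units);

Another example: in the figure below entry_count is 3. The first 250 samples have a duration of 1000, sample 251 has a duration of 999, and samples 252 to 283 have a duration of 1000.

If timescale is 1000, divide by 1000 to get the duration in seconds.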
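Accumulating the sample_delta values gives the decode time (DTS) of every sample. The sketch below does this for the three entries of the example. `decodeTimes` is a hypothetical helper, and the first sample is assumed to start at DTS 0.

```javascript
// Minimal sketch: turn stts entries into a DTS per sample (in timescale units).
function decodeTimes(sttsEntries) {
  const times = [];
  let dts = 0; // assumption: the first sample decodes at time 0
  for (const { sampleCount, sampleDelta } of sttsEntries) {
    for (let i = 0; i < sampleCount; i++) {
      times.push(dts);
      dts += sampleDelta;
    }
  }
  return { times, totalDuration: dts };
}

// The example above: 250 samples of 1000, one of 999, then 32 of 1000.
const stts = [
  { sampleCount: 250, sampleDelta: 1000 },
  { sampleCount: 1, sampleDelta: 999 },
  { sampleCount: 32, sampleDelta: 1000 },
];
const timing = decodeTimes(stts);
console.log(timing.times.length, timing.totalDuration / 1000); // 283 282.999
```

With a timescale of 1000 the track is therefore 282.999 seconds long.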

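stss (Sync Sample Box)

The sample numbers of the key frames in the mp4 file. If stss is absent, every sample is a key frame.

entry_count: the number of entries, which can be read as the number of key frames;
sample_number: the sample number of a key frame (counted from 1);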

aligned(8) class SyncSampleBox
   extends FullBox('stss', version = 0, 0) {
   unsigned int(32)  entry_count;
   int i;
   for (i=0; i < entry_count; i++) {
      unsigned int(32)  sample_number;
   }
}

In the example, samples 1, 31, 61, 91, 121 ... 271 are key frames.
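A typical use of stss is seeking: given a target sample, find the closest key frame at or before it and start decoding there. The sketch below uses a binary search over the sorted sample numbers. `nearestSyncSample` is a hypothetical helper, and the list is a shortened version of the example.

```javascript
// Minimal sketch: find the last sync sample at or before `target`.
// `syncSamples` is the sorted list of sample_number values from stss (1-based).
function nearestSyncSample(syncSamples, target) {
  let lo = 0;
  let hi = syncSamples.length - 1;
  let best = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (syncSamples[mid] <= target) {
      best = syncSamples[mid];
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return best; // null when no key frame precedes the target
}

const syncSamples = [1, 31, 61, 91, 121];
console.log(nearestSyncSample(syncSamples, 75)); // 61
```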

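ctts (Composition Time to Sample Box)

The difference between decode time (dts) and presentation time (pts).

For video with only I-frames and P-frames, decode order and presentation order are the same, so ctts is unnecessary.

For video with B-frames, ctts is required. Whenever PTS and DTS differ, ctts is needed; the formula is CT(n) = DT(n) + CTTS(n).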

aligned(8) class CompositionOffsetBox extends FullBox('ctts', version = 0, 0) {
   unsigned int(32) entry_count;
   int i;
   for (i=0; i < entry_count; i++) {
      unsigned int(32)  sample_count;
      unsigned int(32)  sample_offset;
   }
}
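The formula CT(n) = DT(n) + CTTS(n) can be sketched as below. `compositionTimes` is a hypothetical helper, and the four-frame sequence (decode order I, P, B, B with a sample duration of 1000) is a made-up example rather than data from a real file.

```javascript
// Minimal sketch: apply ctts offsets to decode times to get presentation times.
//   dts         - decode time of each sample, in decode order
//   cttsEntries - run-length entries { sampleCount, sampleOffset } from ctts
function compositionTimes(dts, cttsEntries) {
  const offsets = [];
  for (const { sampleCount, sampleOffset } of cttsEntries) {
    for (let i = 0; i < sampleCount; i++) {
      offsets.push(sampleOffset);
    }
  }
  // Samples without a ctts entry are treated as having offset 0.
  return dts.map((t, n) => t + (offsets[n] || 0));
}

// Decode order I, P, B, B. The P-frame is decoded second but shown last.
const dts = [0, 1000, 2000, 3000];
const ctts = [
  { sampleCount: 1, sampleOffset: 1000 }, // I
  { sampleCount: 1, sampleOffset: 3000 }, // P
  { sampleCount: 2, sampleOffset: 0 },    // B, B
];
console.log(compositionTimes(dts, ctts)); // [ 1000, 4000, 2000, 3000 ]
```

Sorting the samples by the resulting times gives the display order I, B, B, P.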

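An example follows; it needs no further explanation:

fMP4 (Fragmented mp4)

fMP4 has the same basic file structure as regular mp4. Regular mp4 is used for video on demand, while fMP4 is typically used for live streaming.

They differ as follows:

The duration and content of a regular mp4 are usually fixed. The duration and content of an fMP4 are usually not fixed, and it can be played while it is being generated;
In a regular mp4 the complete metadata is in moov, and the moov box must be loaded before the media data in mdat can be decoded and rendered;
In fMP4 the metadata of the media data is in moof boxes, and moof and mdat (usually) appear in pairs. moof carries the sample duration, sample size and so on, which is why fMP4 can be played while it is being generated;

As an example, the top-level box structure of a regular mp4 and of an fMP4 might look like this. The output comes from a small MP4 parsing tool I wrote; the code is given at the end of the article.

// regular mp4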
ftyp size=32(8+24) curTotalSize=32
moov size=4238(8+4230) curTotalSize=4270
mdat size=1124105(8+1124097) curTotalSize=1128375

// fmp4
ftyp size=36(8+28) curTotalSize=36
moov size=1227(8+1219) curTotalSize=1263
moof size=1252(8+1244) curTotalSize=2515
mdat size=65895(8+65887) curTotalSize=68410
moof size=612(8+604) curTotalSize=69022
mdat size=100386(8+100378) curTotalSize=169408

How can you tell whether an mp4 file is a regular mp4 or an fMP4? In general, check whether mvex (Movie Extends Box) is present.
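That check can be sketched as follows: walk the top-level boxes, find moov, then look for an mvex child. `childBoxes`, `looksFragmented` and `makeBox` are hypothetical helpers, the two sample files are built by hand, and, like the rule itself, the result is a heuristic rather than a strict test.

```javascript
// Minimal sketch: list the boxes found between `start` and `end` in a Buffer.
function childBoxes(buffer, start, end) {
  const boxes = [];
  let pos = start;
  while (pos + 8 <= end) {
    let size = buffer.readUInt32BE(pos);
    let header = 8;
    if (size === 1) {
      if (pos + 16 > end) break;
      size = Number(buffer.readBigUInt64BE(pos + 8));
      header = 16;
    } else if (size === 0) {
      size = end - pos; // box extends to the end
    }
    if (size < header) break; // malformed box, stop instead of looping forever
    const type = buffer.toString('latin1', pos + 4, pos + 8);
    boxes.push({ type, start: pos + header, end: pos + size });
    pos += size;
  }
  return boxes;
}

// Heuristic: a file whose moov contains mvex is treated as fMP4.
function looksFragmented(file) {
  const moov = childBoxes(file, 0, file.length).find((b) => b.type === 'moov');
  if (!moov) return false;
  return childBoxes(file, moov.start, moov.end).some((b) => b.type === 'mvex');
}

// Demo helper: build a box from a type and optional payload buffers.
function makeBox(type, ...payloads) {
  const body = Buffer.concat(payloads);
  const header = Buffer.alloc(8);
  header.writeUInt32BE(8 + body.length, 0);
  header.write(type, 4, 'latin1');
  return Buffer.concat([header, body]);
}

const fragmentedFile = Buffer.concat([
  makeBox('ftyp', Buffer.alloc(8)),
  makeBox('moov', makeBox('mvex')),
]);
const plainFile = Buffer.concat([
  makeBox('ftyp', Buffer.alloc(8)),
  makeBox('moov', makeBox('mvhd')),
  makeBox('mdat'),
]);

console.log(looksFragmented(fragmentedFile), looksFragmented(plainFile)); // true false
```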

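mvex (Movie Extends Box)

The presence of mvex indicates that the file is fMP4 (loosely speaking). In that case the sample metadata is not in moov and has to be obtained by parsing the moof boxes.

The pseudocode is: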

aligned(8) class MovieExtendsBox extends Box('mvex'){ }

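mehd (Movie Extends Header Box)

mehd is optional and declares the full duration of the movie (fragment_duration). If it is absent, all fragments have to be walked to obtain the full duration. In fMP4 scenarios, fragment_duration usually cannot be known in advance.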

aligned(8) class MovieExtendsHeaderBox extends FullBox('mehd', version, 0) {
	if (version==1) {
		unsigned int(64)  fragment_duration;
	} else { // version==0
		unsigned int(32)  fragment_duration;
	}
}

trex (Track Extends Box)

Sets default values for the samples of an fMP4, such as duration and size.

aligned(8) class TrackExtendsBox extends FullBox('trex', 0, 0){
	unsigned int(32) track_ID;
	unsigned int(32) default_sample_description_index;
	unsigned int(32) default_sample_duration;
	unsigned int(32) default_sample_size;
	unsigned int(32) default_sample_flags;
}

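The fields mean the following:

track_ID: the ID of the corresponding track, for example the video track or audio track;
default_sample_description_index: the default sample description index (pointing into stsd);
default_sample_duration: the default sample duration, usually 0;
default_sample_size: the default sample size, usually 0;
default_sample_flags: the default sample flags, usually 0;

default_sample_flags occupies 4 bytes and is more involved. Its layout is:

In older versions of the specification the first 6 bits are reserved; in newer versions only the first 4 bits are reserved. The meaning of is_leading is not very intuitive and is covered in the next subsection.

reserved: 4 bits, reserved;
is_leading: 2 bits, whether the sample is a leading sample. Possible values:
- 0: it is unknown whether the sample is a leading sample (this is the usual value);
- 1: the sample is a leading sample that depends on a sample before the referenced I frame, so it cannot be decoded;
- 2: the sample is not a leading sample;
- 3: the sample is a leading sample that does not depend on any sample before the referenced I frame, so it can be decoded;
sample_depends_on: 2 bits, whether the sample depends on other samples. Possible values:
- 0: it is unknown whether the sample depends on other samples;
- 1: the sample depends on other samples (not an I-frame);
- 2: the sample does not depend on other samples (an I-frame);
- 3: reserved;
sample_is_depended_on: 2 bits, whether other samples depend on this one. Possible values:
- 0: it is unknown whether other samples depend on this sample;
- 1: other samples may depend on this sample;
- 2: no other sample depends on this sample;
- 3: reserved;
sample_has_redundancy: 2 bits, whether redundant coding is present. Possible values:
- 0: it is unknown whether redundant coding is present;
- 1: redundant coding is present;
- 2: no redundant coding is present;
- 3: reserved;
sample_padding_value: 3 bits, padding value;
sample_is_non_sync_sample: 1 bit, set when the sample is not a key frame;
sample_degradation_priority: 16 bits, the degradation priority (generally for problems that arise during streaming);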
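The bit layout above can be unpacked with shifts and masks, as in this sketch. `decodeSampleFlags` is a hypothetical helper, and the two input values are illustrative examples of a key frame and a non-key frame, not values taken from a specific file.

```javascript
// Minimal sketch: unpack the 32-bit sample flags, most significant bits first:
// reserved(4) is_leading(2) depends_on(2) is_depended_on(2) has_redundancy(2)
// padding_value(3) is_non_sync_sample(1) degradation_priority(16)
function decodeSampleFlags(flags) {
  return {
    isLeading: (flags >>> 26) & 0x3,
    sampleDependsOn: (flags >>> 24) & 0x3,
    sampleIsDependedOn: (flags >>> 22) & 0x3,
    sampleHasRedundancy: (flags >>> 20) & 0x3,
    samplePaddingValue: (flags >>> 17) & 0x7,
    sampleIsNonSyncSample: (flags >>> 16) & 0x1,
    sampleDegradationPriority: flags & 0xffff,
  };
}

// 0x02000000: depends on no other sample, sync sample (a key frame).
const keyFrame = decodeSampleFlags(0x02000000);
console.log(keyFrame.sampleDependsOn, keyFrame.sampleIsNonSyncSample); // 2 0

// 0x01010000: depends on other samples, non-sync sample.
const deltaFrame = decodeSampleFlags(0x01010000);
console.log(deltaFrame.sampleDependsOn, deltaFrame.sampleIsNonSyncSample); // 1 1
```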

An example follows:

About is_leading

is_leading is not easy to explain, so the original text is quoted here for reference.

A leading sample (usually a picture in video) is defined relative to a reference sample, which is the immediately prior sample that is marked as “sample_depends_on” having no dependency (an I picture). A leading sample has both a composition time before the reference sample, and possibly also a decoding dependency on a sample before the reference sample. Therefore if, for example, playback and decoding were to start at the reference sample, those samples marked as leading would not be needed and might not be decodable. A leading sample itself must therefore not be marked as having no dependency.

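To simplify the explanation, "leading frame" below stands for leading sample and "referenced frame" for referenced sample.

Take H264 as an example. H264 has I-frames, P-frames and B-frames. Because of B-frames, the decode order and presentation order of video frames can differ.

One feature of mp4 files is support for playback from an arbitrary position. On a video site, for example, you can drag the progress bar to skip ahead.

Often the moment the progress bar lands on does not correspond to an I-frame. To play smoothly, the player has to search backwards for the nearest I-frame and, if possible, start decoding and playing from it (in other words, playback cannot always start from the nearest preceding I-frame).

The frame located at that moment is called the leading frame. The nearest I-frame before the leading frame in decode order is called the referenced frame.

Recall the cases where is_leading is 1 or 3. Both are leading frames, so when is one decodable and when is it not decodable?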

1: this sample is a leading sample that has a dependency before the referenced I‐picture (and is therefore not decodable);
3: this sample is a leading sample that has no dependency before the referenced I‐picture (and is therefore decodable);

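1. Example of is_leading equal to 1: as shown below, decoding frame 2 (the leading frame) depends on frame 1 and frame 3 (the referenced frame). Searching backwards from frame 2 in the video stream, the nearest I-frame is frame 3. Even once frame 3 has been decoded, frame 2 still cannot be decoded.

2. Example of is_leading equal to 3: as shown below, frame 2 (the leading frame) can be decoded in this case.

moof (Movie Fragment Box)

moof is a container box. The metadata is in the nested boxes, such as mfhd, tfhd and trun.

The pseudocode is: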

aligned(8) class MovieFragmentBox extends Box('moof'){ }

mfhd (Movie Fragment Header Box)

The structure is simple. sequence_number is the sequence number of the movie fragment; it starts at 1 and increases in the order the movie fragments are produced.

aligned(8) class MovieFragmentHeaderBox extends FullBox('mfhd', 0, 0){
	unsigned int(32)  sequence_number;
}

traf (Track Fragment Box)

aligned(8) class TrackFragmentBox extends Box('traf'){ }

In fMP4 the data is split into multiple movie fragments. A movie fragment can contain several track fragments (zero or more per track), and each track fragment can contain several samples of its track.

Each track fragment contains several track runs, and each track run represents a contiguous group of samples.

tfhd (Track Fragment Header Box)

tfhd sets the default values for the metadata of the samples in a track fragment.

The pseudocode is below. Apart from track_ID, all fields are optional.

aligned(8) class TrackFragmentHeaderBox extends FullBox('tfhd', 0, tf_flags){
	unsigned int(32) track_ID;
	// all the following are optional fields
	unsigned int(64) base_data_offset;
	unsigned int(32) sample_description_index;
	unsigned int(32) default_sample_duration;
	unsigned int(32) default_sample_size;
	unsigned int(32) default_sample_flags;
}

sample_description_index, default_sample_duration and default_sample_size need little explanation, so only tf_flags and base_data_offset are covered here.

First tf_flags. The individual flag values are (again combined with bitwise OR):

0x000001 base-data-offset-present: the base_data_offset field is present; it gives the base offset of the data relative to the whole file;
0x000002 sample-description-index-present: the sample_description_index field is present;
0x000008 default-sample-duration-present: the default_sample_duration field is present;
0x000010 default-sample-size-present: the default_sample_size field is present;
0x000020 default-sample-flags-present: the default_sample_flags field is present;
0x010000 duration-is-empty: there are no samples for this time interval; the duration given by default_sample_duration (or by the default in trex) is empty;
0x020000 default-base-is-moof: ignored if base-data-offset-present is 1. If base-data-offset-present is 0, the base_data_offset of this track fragment is counted from the first byte of the moof;

The position of the sample data is base_data_offset + data_offset, where data_offset is defined per track run in the trun box. If base_data_offset is not given explicitly, sample positions are usually relative to the moof.

For example, tf_flags equal to 57 (0x000039) means base_data_offset, default_sample_duration, default_sample_size and default_sample_flags are present.

base_data_offset is 1263 (the sizes of ftyp and moov add up to 1263).
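A quick way to check such a value is to test each bit against the flag table. The sketch below does that for 57; `decodeTfFlags` and the `TF_FLAGS` table are hypothetical names introduced for the illustration.

```javascript
// Minimal sketch: list the tf_flags bits that are set in a given value.
const TF_FLAGS = [
  [0x000001, 'base-data-offset-present'],
  [0x000002, 'sample-description-index-present'],
  [0x000008, 'default-sample-duration-present'],
  [0x000010, 'default-sample-size-present'],
  [0x000020, 'default-sample-flags-present'],
  [0x010000, 'duration-is-empty'],
  [0x020000, 'default-base-is-moof'],
];

function decodeTfFlags(value) {
  return TF_FLAGS
    .filter(([mask]) => (value & mask) !== 0)
    .map(([, name]) => name);
}

// 57 = 0x39 = 0x20 | 0x10 | 0x08 | 0x01
console.log(decodeTfFlags(57));
```

For 57 this yields the base-data-offset, default-sample-duration, default-sample-size and default-sample-flags entries.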

trun (Track Fragment Run Box)

The pseudocode for trun is:

aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
   unsigned int(32)  sample_count;
   // the following are optional fields
   signed int(32) data_offset;
   unsigned int(32)  first_sample_flags;
   // all fields in the following array are optional
   {
      unsigned int(32)  sample_duration;
      unsigned int(32)  sample_size;
      unsigned int(32)  sample_flags;
      if (version == 0)
         { unsigned int(32) sample_composition_time_offset; }
      else
         { signed int(32) sample_composition_time_offset; }
   }[ sample_count ]
}

As mentioned earlier, a track run represents a contiguous group of samples, where:

sample_count: the number of samples;
data_offset: the offset of the data;
first_sample_flags: optional, settings for the first sample in the track run;

tr_flags are listed below and follow the same pattern:

0x000001 data-offset-present: the data_offset field is present;
0x000004 first-sample-flags-present: the first_sample_flags field is present. Its value overrides the flags of the first sample only, which makes it possible to mark the first frame of a group of samples as a key frame and the rest as non-key frames. When first_sample_flags is present, sample_flags is absent;
0x000100 sample-duration-present: each sample has its own sample_duration; otherwise the default is used;
0x000200 sample-size-present: each sample has its own sample_size; otherwise the default is used;
0x000400 sample-flags-present: each sample has its own sample_flags; otherwise the default is used;
0x000800 sample-composition-time-offsets-present: each sample has its own sample_composition_time_offset;

For example, with tr_flags equal to 2565 (0x000A05), data_offset, first_sample_flags, sample_size and sample_composition_time_offset are present.
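Because every optional field depends on tr_flags, a trun parser has to test the flags before each read. The sketch below follows the pseudocode above. `parseTrun` is a hypothetical helper, and the 40-byte box with tr_flags 2565 and two samples is hand-built with made-up sizes and offsets.

```javascript
// Minimal sketch: parse a trun box (8-byte header included) from a Buffer.
function parseTrun(box) {
  const version = box.readUInt8(8);
  const flags = box.readUIntBE(9, 3);
  const sampleCount = box.readUInt32BE(12);
  const run = { version, flags, samples: [] };
  let pos = 16;

  if (flags & 0x000001) { run.dataOffset = box.readInt32BE(pos); pos += 4; }
  if (flags & 0x000004) { run.firstSampleFlags = box.readUInt32BE(pos); pos += 4; }

  for (let i = 0; i < sampleCount; i++) {
    const sample = {};
    if (flags & 0x000100) { sample.duration = box.readUInt32BE(pos); pos += 4; }
    if (flags & 0x000200) { sample.size = box.readUInt32BE(pos); pos += 4; }
    if (flags & 0x000400) { sample.flags = box.readUInt32BE(pos); pos += 4; }
    if (flags & 0x000800) {
      // Unsigned in version 0, signed in later versions.
      sample.compositionTimeOffset = version === 0
        ? box.readUInt32BE(pos)
        : box.readInt32BE(pos);
      pos += 4;
    }
    run.samples.push(sample);
  }
  return run;
}

// Hand-built trun with tr_flags = 2565 and two samples, for illustration only.
const trun = Buffer.alloc(40);
trun.writeUInt32BE(40, 0);
trun.write('trun', 4, 'latin1');
trun.writeUIntBE(2565, 9, 3);        // version 0 stays at byte 8
trun.writeUInt32BE(2, 12);           // sample_count
trun.writeInt32BE(1260, 16);         // data_offset
trun.writeUInt32BE(0x02000000, 20);  // first_sample_flags
trun.writeUInt32BE(500, 24);         // sample 1: size
trun.writeUInt32BE(1000, 28);        // sample 1: composition time offset
trun.writeUInt32BE(300, 32);         // sample 2: size
trun.writeUInt32BE(0, 36);           // sample 2: composition time offset

const run = parseTrun(trun);
console.log(run.dataOffset, run.samples.length, run.samples[0].size); // 1260 2 500
```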

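Hands-on: parsing the structure of an MP4 file

What you learn on paper always feels shallow; to really understand something you have to code it. Following the mp4 specification, it is possible to write a simple mp4 parsing tool. The box structures of regular mp4 and fMP4 compared earlier were printed by an analysis script I wrote myself.

The core code is below. The complete code is rather long and can be found on my GitHub.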

class Box {
	constructor(boxType, extendedType, buffer) {
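		this.type = boxType; // required, string, 4 bytes: the box type
		this.size = 0; // required, integer, 4 bytes: the box size in bytes
		this.headerSize = 8; // size + type; grows when largesize or extended_type is present
		this.boxes = [];

		// this.largeSize = 0; // optional, 8 bytes
		// this.extendedType = extendedType || boxType; // optional, 16 bytes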
		this._initialize(buffer);
	}

	_initialize(buffer) {				
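		this.size = buffer.readUInt32BE(0); // 4 bytes
		this.type = buffer.slice(4, 8).toString(); // 4 bytes

		let offset = 8;

		if (this.size === 1) {
			// size === 1: the real size is the 64-bit largesize field.
			// readUIntBE handles at most 6 bytes, so read a BigInt and convert.
			this.size = Number(buffer.readBigUInt64BE(8));
			this.headerSize += 8;
			offset = 16;
		} else if (this.size === 0) {
			// size === 0: last box, it extends to the end of the buffer
			this.size = buffer.length;
		}

		if (this.type === 'uuid') {
			this.type = buffer.slice(offset, offset + 16); // 16 bytes, extended_type
			this.headerSize += 16;
		}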
	}

	setInnerBoxes(buffer, offset = 0) {
		const innerBoxes = getInnerBoxes(buffer.slice(this.headerSize + offset, this.size));

		innerBoxes.forEach(item => {
			let { type, buffer } = item;

			type = type.trim(); // note: some box types are shorter than four letters, e.g. url, urn

			if (this[type]) {
				const box = this[type](buffer);
				this.boxes.push(box);
			} else {
				this.boxes.push('TODO: not implemented yet');
				// console.log(`unknown type: ${type}`);
			}
		});
	}
}

class FullBox extends Box {
	constructor(boxType, buffer) {
		super(boxType, '', buffer);

		const headerSize = this.headerSize;

		this.version = buffer.readUInt8(headerSize); // required, 1 byte
		this.flags = buffer.readUIntBE(headerSize + 1, 3); // required, 3 bytes

		this.headerSize = headerSize + 4;
	}
}

// FileTypeBox, MovieBox, MediaDataBox and MovieFragmentBox are fairly long and are not shown here
class Movie {
	constructor(buffer) {

		this.boxes = [];
		this.bytesConsumed = 0;

		const innerBoxes = getInnerBoxes(buffer);

		innerBoxes.forEach(item => {
			const { type, buffer, size } = item;
			if (this[type]) {
				const box = this[type](buffer);
				this.boxes.push(box);
			} else {
				// custom box type
			}
			this.bytesConsumed += size;
		});
	}

	ftyp(buffer) {
		return new FileTypeBox(buffer);
	}

	moov(buffer) {
		return new MovieBox(buffer);
	}

	mdat(buffer) {
		return new MediaDataBox(buffer);
	}

	moof(buffer) {
		return new MovieFragmentBox(buffer);
	}
}

function getInnerBoxes(buffer) {
	let boxes = [];
	let offset = 0;
	let totalByteLen = buffer.byteLength;

	do {
		let box = getBox(buffer, offset);
		boxes.push(box);

		offset += box.size;
	} while(offset < totalByteLen);

	return boxes;
}

function getBox(buffer, offset = 0) {
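	let size = buffer.readUInt32BE(offset); // 4 bytes
	let type = buffer.slice(offset + 4, offset + 8).toString(); // 4 bytes

	if (size === 1) {
		// 8-byte largesize; readUIntBE is limited to 6 bytes, so read a BigInt
		size = Number(buffer.readBigUInt64BE(offset + 8));
	} else if (size === 0) {
		// last box: it extends to the end of the buffer
		size = buffer.byteLength - offset;
	}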

	let boxBuffer = buffer.slice(offset, offset + size);

	return {
		size,
		type,
		buffer: boxBuffer
	};
}

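Afterword

Due to limited time, and to keep the explanation simple, some of the content may not be fully rigorous. If you spot mistakes or omissions, please point them out, and feel free to get in touch with any questions.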