Key points of the MP4 format: boiling the spec down to its essentials

Reposted article (see the original) · 2018-03-11 11:55 · Category: media file formats

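MP4 actually stands for MPEG-4 Part 14, that is, it is just Part 14 of the MPEG-4 standard, and it is specified with reference to ISO/IEC standards. Its main benefits are fast seeking and playing while downloading. It started out from MOV and then developed into a format of its own. Related file formats include 3GP and M4V.

MP4 is somewhat more complex than FLV. It carries all of its data through nesting. In other words, every piece of content is an object, and to play something you only need to fetch the corresponding object.

The basic unit of MP4 is the box: a file is a series of independent boxes concatenated together. So we start with the box.

PS: the MP4 spec itself is not that complicated, and nothing in it is especially hard to understand. Its only real "complexity" is its size: the endless variety of nested child boxes is a nightmare for a muxer/remuxer (in GStreamer, the box-parsing code alone is more than 10,000 lines, not counting the rest of the element logic).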

MP4 box

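MP4 boxes come in two kinds: basic box and full box.

basic box: the plain box layout (size, type, payload), used by foundational boxes such as ftyp and moov.
full box: a box that additionally carries version and flags fields; it is used mainly by the boxes that describe the media itself.

Once again, the box is the heart of MP4. When writing a decoder or encoder, it pays to memorize the basic layout; the work becomes much more pleasant (speaking from experience).

OK, let's look at the concrete structure of a box.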

basic box

First, the structure of the basic box:

[Figure: layout of a basic box]

Expressed as code:

aligned(8) class Box (unsigned int(32) boxtype, optional unsigned int(8)[16] extended_type) {
    unsigned int(32) size;
    unsigned int(32) type = boxtype;
    if (size==1) {
        unsigned int(64) largesize;
    } else if (size==0) {
        // box extends to end of file
    }
    // This part only applies to MP4 extension (uuid) boxes. It rarely comes up.
    if (boxtype=='uuid') {
        unsigned int(8)[16] usertype = extended_type;
    }
}

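The code above is already fairly clear, but here is a short walkthrough.

size[4B]: the size of the box, including both header and body. Because 4 bytes may not be enough for a very large box, there is a special case: when size == 1, an 8-byte largesize field follows the type and holds the real size. When size == 0, the box extends to the end of the file (so it must be the last box).
type[4B]: identifies the kind of box. It is simply the ASCII codes of the box name, which is always 4 characters long, for example 'f' 't' 'y' 'p'.

The overall structure of a box looks like this:

[Figure: box header and body layout]

One thing to stress: in MP4, everything is written in big-endian byte order. So every 4-byte or 8-byte field mentioned above is written as BE.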
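The header rules above can be sketched in a few lines of JavaScript. This is only an illustration: `readBoxHeader` and the field names it returns are made up for this example, the input is assumed to be a Uint8Array, and the 16-byte usertype of uuid boxes is not handled.

```javascript
// Hypothetical helper: read one box header starting at `offset`.
function readBoxHeader(bytes, offset = 0) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let size = view.getUint32(offset, false); // false = big-endian
  const type = String.fromCharCode(
    bytes[offset + 4], bytes[offset + 5], bytes[offset + 6], bytes[offset + 7]);
  let headerSize = 8;
  if (size === 1) {
    // size == 1: the real size is the 64-bit largesize after the type
    size = Number(view.getBigUint64(offset + 8, false));
    headerSize = 16;
  } else if (size === 0) {
    // size == 0: the box runs to the end of the file
    size = bytes.byteLength - offset;
  }
  return { type, size, headerSize };
}

// Example: a 16-byte 'ftyp' box (8-byte header + 8-byte payload).
const sampleBox = new Uint8Array([
  0x00, 0x00, 0x00, 0x10, 0x66, 0x74, 0x79, 0x70,
  0x69, 0x73, 0x6F, 0x6D, 0x00, 0x00, 0x00, 0x01,
]);
const header = readBoxHeader(sampleBox);
console.log(header.type, header.size, header.headerSize);
```

Walking a file is then a loop that reads a header, handles the box, and advances the offset by `size`.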

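Didn't we say above that there are two basic box layouts?

The other one is the full box.

full box

The main difference between a full box and a box is the added version and flags fields. It shows up mostly in the boxes under trak (tkhd, mdhd, hdlr, the tables under stbl), although others such as mvhd use it too. Its format is:

aligned(8) class FullBox(unsigned int(32) boxtype, unsigned int(8) v, bit(24) f) extends Box(boxtype) {
    unsigned int(8) version = v;
    bit(24) flags = f;
}

In practice, if you have no particular use for version and flags, you can simply set them to their default, e.g. 0x00. Its structure looks like this:

[Figure: full box layout with version and flags]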
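To show how the two layouts relate when writing a file, here is a minimal serializer sketch. `makeBox` and `makeFullBox` are hypothetical names for this example; payloads are assumed to be Uint8Arrays, and boxes larger than 4 GB (largesize) are not handled.

```javascript
// Hypothetical helper: size (4B, big-endian) + type (4B ASCII) + payloads.
function makeBox(type, ...payloads) {
  const size = 8 + payloads.reduce((n, p) => n + p.byteLength, 0);
  const out = new Uint8Array(size);
  new DataView(out.buffer).setUint32(0, size, false); // big-endian size
  for (let i = 0; i < 4; i++) out[4 + i] = type.charCodeAt(i);
  let pos = 8;
  for (const p of payloads) {
    out.set(p, pos);
    pos += p.byteLength;
  }
  return out;
}

// A full box is a box whose payload starts with version (1B) + flags (3B).
function makeFullBox(type, version, flags, ...payloads) {
  const versionAndFlags = new Uint8Array([
    version & 0xFF,
    (flags >>> 16) & 0xFF, (flags >>> 8) & 0xFF, flags & 0xFF,
  ]);
  return makeBox(type, versionAndFlags, ...payloads);
}

// Example: a full box with a 4-byte payload, nested inside a container box.
const inner = makeFullBox('smhd', 0, 0, new Uint8Array([0, 0, 0, 0]));
const outer = makeBox('minf', inner);
console.log(inner.length, outer.length);
```

Because container boxes are just boxes whose payload is other boxes, nesting falls out of calling `makeBox` on already serialized children.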

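Next, let's look at the boxes that MP4 actually uses.

We follow MP4's own grouping of boxes. First, the structure diagram from the MP4 spec:

[Figure: table of MP4 boxes, with mandatory boxes marked by an asterisk]

A note: we only cover the boxes marked with an asterisk. The others are not mandatory, so we skip them. Even so, there are quite a few starred boxes. Since our main goal is to generate an MP4 file, keep in mind that a normally playable MP4 file does not actually need every starred box.

A playable MP4 file comes in two kinds: unfragmented MP4 (MP4 for short) and fragmented MP4 (fMP4 for short). What exactly is the difference?

They are completely different, because the way each one locates and plays the media stream follows a different model.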

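MP4 format

The basic boxes are:

[Figure: minimal MP4 box tree]

That is the bare minimum. A more complete layout is:

[Figure: more complete MP4 box tree]

MP4 uses the basic boxes under stbl inside trak (stts, stsc and so on) to index into the mdat box. So what is fMP4?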

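Non-standard (fMP4): typically used to generate files with a single trak.
- ftyp
- moov
- moof
- mdat
Standard (MP4): used to generate files containing multiple traks.
- ftyp
- moov
- mdat

The non-standard form seems to have one more box. In actual muxing and demuxing, however, the standard form requires much more attention to how the child boxes under stbl (stts, stco, ctts and so on) are encoded. The non-standard form does not need to care about stbl; the data that would have lived in stbl is moved into moof instead, and the layout inside moof is far simpler than stbl. So below we go through the boxes of both forms.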

Standard MP4 boxes

ftyp

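The ftyp box is the file's top-level declaration: it tells the demuxer which base specification the file follows and which formats it is compatible with. In short, it tells the client which standard to use to decode this MP4. ftyp is normally placed at the very start of the file.

Its format is:

aligned(8) class FileTypeBox extends Box('ftyp') {
    unsigned int(32) major_brand;
    unsigned int(32) minor_version;
    unsigned int(32) compatible_brands[];
}

All of the fields above live in the data (payload) part of the box; see the box description earlier.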

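major_brand: compatibility can be thought of as "recommended" versus "also acceptable", and major_brand is the recommended one. In general the catch-all isom is enough. If you need a specific format, set it accordingly.
minor_version: the minor version of the major brand (informational).
compatible_brands: similar to major_brand; it lists the additional brands the file is compatible with, usually reflecting the formats contained in the MP4, such as avc1 for AVC video.

Code says more than concepts, so here is how a generic ftyp box is created.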

FTYP: new Uint8Array([
    0x69, 0x73, 0x6F, 0x6D, // major_brand: isom
    0x0, 0x0, 0x0, 0x1,     // minor_version: 0x01
    0x69, 0x73, 0x6F, 0x6D, // isom
    0x61, 0x76, 0x63, 0x31  // avc1
])
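Going the other way, the same 16 bytes can be decoded back into brands. A minimal sketch, assuming the payload is passed without the 8-byte box header; `parseFtyp` and its result field names are hypothetical:

```javascript
// The ftyp payload from the example above (box header not included).
const ftypPayload = new Uint8Array([
  0x69, 0x73, 0x6F, 0x6D, // major_brand: isom
  0x00, 0x00, 0x00, 0x01, // minor_version: 1
  0x69, 0x73, 0x6F, 0x6D, // compatible brand: isom
  0x61, 0x76, 0x63, 0x31, // compatible brand: avc1
]);

// Hypothetical helper: decode the ftyp payload into readable fields.
function parseFtyp(payload) {
  const view = new DataView(payload.buffer, payload.byteOffset, payload.byteLength);
  const ascii = (o) => String.fromCharCode(...payload.subarray(o, o + 4));
  const compatibleBrands = [];
  // everything after the first 8 bytes is a list of 4-byte brands
  for (let o = 8; o + 4 <= payload.byteLength; o += 4) {
    compatibleBrands.push(ascii(o));
  }
  return {
    majorBrand: ascii(0),
    minorVersion: view.getUint32(4, false), // big-endian
    compatibleBrands,
  };
}

const ftypInfo = parseFtyp(ftypPayload);
console.log(ftypInfo);
```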

moov

The moov box is mainly an important container box; it has no real content of its own. Its job is to hold the traks. Its format is:

aligned(8) class MovieBox extends Box('moov') { }

mvhd

mvhd is the first box under moov and describes the media as a whole. Its content is:

aligned(8) class MovieHeaderBox extends FullBox('mvhd', version, 0) {
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) timescale;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) timescale;
        unsigned int(32) duration;
    }
    template int(32) rate = 0x00010000; // typically 1.0
    template int(16) volume = 0x0100; // typically, full volume
    const bit(16) reserved = 0;
    const unsigned int(32)[2] reserved = 0;
    template int(32)[9] matrix = { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // Unity matrix
    bit(32)[6] pre_defined = 0;
    unsigned int(32) next_track_ID;
}

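version: usually 0.
creation_time: creation time, in seconds counted from 1904-01-01.
timescale: the number of time units per second. Together with duration it gives the real time.
duration: the length of the media, in timescale units. The real time is duration / timescale seconds.
rate: playback rate; 0x00010000 means 1.0 (normal speed).
volume: playback volume; 0x0100 means 1.0 (full volume).
matrix: a transformation matrix for the video. I don't fully understand it myself; the default unity matrix shown above is what you normally write.
next_track_ID: must be larger than the largest track_ID in use. Usually any very large value will do.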
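The time fields above involve two small conversions that are easy to get wrong, sketched here. The function names are invented for this example; the constant is the number of seconds between 1904-01-01 and the Unix epoch 1970-01-01 (both at 00:00 UTC).

```javascript
// Seconds from 1904-01-01T00:00:00Z to 1970-01-01T00:00:00Z.
const MP4_EPOCH_OFFSET = 2082844800;

// Real length in seconds = duration / timescale.
function durationSeconds(duration, timescale) {
  return duration / timescale;
}

// Convert an MP4 creation_time / modification_time to a JS Date.
function mp4TimeToDate(mp4Seconds) {
  return new Date((mp4Seconds - MP4_EPOCH_OFFSET) * 1000);
}

// Convert a JS Date to the value stored in creation_time.
function dateToMp4Time(date) {
  return Math.floor(date.getTime() / 1000) + MP4_EPOCH_OFFSET;
}

// Example: timescale 1000 (milliseconds) and duration 90000 is 90 seconds.
console.log(durationSeconds(90000, 1000));
console.log(mp4TimeToDate(MP4_EPOCH_OFFSET).toISOString());
```

Note that a version 0 box stores these times in 32 bits, which is one reason version 1 with 64-bit fields exists.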

In practice, most mvhd fields can be set to fixed values:

new Uint8Array([
    0x00, 0x00, 0x00, 0x00, // version(0) + flags
    0x00, 0x00, 0x00, 0x00, // creation_time
    0x00, 0x00, 0x00, 0x00, // modification_time
    (timescale >>> 24) & 0xFF, // timescale: 4 bytes
    (timescale >>> 16) & 0xFF,
    (timescale >>> 8) & 0xFF,
    (timescale) & 0xFF,
    (duration >>> 24) & 0xFF, // duration: 4 bytes
    (duration >>> 16) & 0xFF,
    (duration >>> 8) & 0xFF,
    (duration) & 0xFF,
    0x00, 0x01, 0x00, 0x00, // Preferred rate: 1.0
    0x01, 0x00, 0x00, 0x00, // PreferredVolume(1.0, 2bytes) + reserved(2bytes)
    0x00, 0x00, 0x00, 0x00, // reserved: 4 + 4 bytes
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x01, 0x00, 0x00, // ----begin composition matrix----
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x01, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x40, 0x00, 0x00, 0x00, // ----end composition matrix----
    0x00, 0x00, 0x00, 0x00, // ----begin pre_defined 6 * 4 bytes----
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, // ----end pre_defined 6 * 4 bytes----
    0xFF, 0xFF, 0xFF, 0xFF  // next_track_ID
]);

trak

The trak box holds the content related to one media stream. Its format is just a plain box:

aligned(8) class TrackBox extends Box('trak') { }

Sometimes, though, it also carries a description of that media stream:

[Figure: trak box with its child boxes]

tkhd

tkhd is a direct child of the trak box. It describes the properties of that particular trak. Its content is:

aligned(8) class TrackHeaderBox extends FullBox('tkhd', version, flags) {
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) track_ID;
        const unsigned int(32) reserved = 0;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) track_ID;
        const unsigned int(32) reserved = 0;
        unsigned int(32) duration;
    }
    const unsigned int(32)[2] reserved = 0;
    template int(16) layer = 0;
    template int(16) alternate_group = 0;
    template int(16) volume = {if track_is_audio 0x0100 else 0};
    const unsigned int(16) reserved = 0;
    template int(32)[9] matrix = { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // unity matrix
    unsigned int(32) width;
    unsigned int(32) height;
}

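That is a lot of fields, but not all of them need a meaningful value. Briefly:

creation_time: creation time; optional.
modification_time: modification time; optional.
track_ID: the ID of the track being described.
duration: how long the track's content lasts. It is normally interpreted together with timescale.
layer: rarely useful. It gives the stacking order when video traks are layered.
alternate_group: marks interchangeable tracks. 0 means the track has no alternative; a non-zero value means it belongs to a group of alternative sources.
volume: the playback volume. Full volume is 1.0 (0x0100).
width and height: the display width and height of the video, stored as 16.16 fixed-point values.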
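Several of these fields are fixed-point numbers rather than plain integers: width, height and rate are 16.16, and volume is 8.8. A minimal sketch of the encoding follows; the helper names are made up for this example, and the 1280x720 size is only sample input.

```javascript
// 16.16 fixed point: upper 16 bits integer part, lower 16 bits fraction.
function toFixed16_16(value) {
  return Math.round(value * 65536) >>> 0;
}

// 8.8 fixed point: upper 8 bits integer part, lower 8 bits fraction.
function toFixed8_8(value) {
  return Math.round(value * 256) & 0xFFFF;
}

// Split a 32-bit unsigned integer into 4 big-endian bytes.
function u32be(n) {
  return [(n >>> 24) & 0xFF, (n >>> 16) & 0xFF, (n >>> 8) & 0xFF, n & 0xFF];
}

// Example: the tkhd width/height bytes for a 1280x720 video track.
const widthBytes = u32be(toFixed16_16(1280));
const heightBytes = u32be(toFixed16_16(720));
console.log(widthBytes, heightBytes, toFixed8_8(1.0).toString(16));
```

This is why a width of 1280 appears in a hex dump as 05 00 00 00 rather than 00 00 05 00.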

mdia

mdia wraps the media-related information. There is not much to say about it; the format is:

aligned(8) class MediaBox extends Box('mdia') { }

mdhd

mdhd is largely the same as tkhd. The difference is that tkhd sets properties for a given track, while mdhd describes the media on its own terms (note that it has its own timescale). In practice the two usually carry the same values.

The format is:

aligned(8) class MediaHeaderBox extends FullBox('mdhd', version, 0) {
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) timescale;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) timescale;
        unsigned int(32) duration;
    }
    bit(1) pad = 0;
    unsigned int(5)[3] language; // ISO-639-2/T language code
    unsigned int(16) pre_defined = 0;
}

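It has just 3 extra fields: pad, language and pre_defined.

Their names are self-explanatory:

pad: padding, normally 0.
language: the language of the current trak. The field is 15 bits in total (three 5-bit letters) and is combined with pad to make up 2 bytes.
pre_defined: defaults to 0.

In code, the bytes are built like this: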

new Uint8Array([
    0x00, 0x00, 0x00, 0x00, // version(0) + flags
    0x00, 0x00, 0x00, 0x00, // creation_time
    0x00, 0x00, 0x00, 0x00, // modification_time
    (timescale >>> 24) & 0xFF, // timescale: 4 bytes
    (timescale >>> 16) & 0xFF,
    (timescale >>> 8) & 0xFF,
    (timescale) & 0xFF,
    (duration >>> 24) & 0xFF, // duration: 4 bytes
    (duration >>> 16) & 0xFF,
    (duration >>> 8) & 0xFF,
    (duration) & 0xFF,
    0x55, 0xC4, // language: und (undetermined)
    0x00, 0x00  // pre_defined = 0
])
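The 0x55, 0xC4 pair is not arbitrary: each letter of the three-letter ISO-639-2/T code is stored as its character code minus 0x60, in 5 bits, with the pad bit on top. A small sketch of that packing (`packLanguage` and `unpackLanguage` are hypothetical names, and the input is assumed to be three lowercase ASCII letters):

```javascript
// Pack a 3-letter lowercase language code into the 2 bytes used by mdhd.
function packLanguage(code) {
  const [a, b, c] = Array.from(code, (ch) => ch.charCodeAt(0) - 0x60);
  const packed = (a << 10) | (b << 5) | c; // top bit (pad) stays 0
  return [(packed >>> 8) & 0xFF, packed & 0xFF];
}

// Reverse: turn the 2 bytes back into the 3-letter code.
function unpackLanguage(hi, lo) {
  const packed = (hi << 8) | lo;
  return String.fromCharCode(
    ((packed >> 10) & 0x1F) + 0x60,
    ((packed >> 5) & 0x1F) + 0x60,
    (packed & 0x1F) + 0x60);
}

console.log(packLanguage('und').map((b) => b.toString(16)));
console.log(unpackLanguage(0x55, 0xC4));
```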

hdlr

hdlr declares how each trak should be handled. The common handler types are:

vide : Video track
soun : Audio track
hint : Hint track
meta : Timed Metadata track
auxv : Auxiliary Video track

This is much like the Content-Type we set when sending or receiving a resource, for example application/javascript.

Its format is:

aligned(8) class HandlerBox extends FullBox('hdlr', version = 0, 0) {
    unsigned int(32) pre_defined = 0;
    unsigned int(32) handler_type;
    const unsigned int(32)[3] reserved = 0;
    string name;
}

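Two of the fields deserve a note:

handler_type: the handler type of the trak, i.e. one of the values listed above such as vide, soun or hint.
name: a name for the handler. It is meant for humans rather than machines, so anything that describes the track clearly will do.

The value written for handler_type is just the string converted to hex bytes. For example:

vide is 0x76, 0x69, 0x64, 0x65
soun is 0x73, 0x6F, 0x75, 0x6E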
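The string-to-bytes conversion and the rest of the hdlr payload can be sketched as follows. Both function names are invented for this example; the sketch assumes the name is written as a null-terminated UTF-8 string and leaves out the 4 bytes of version and flags.

```javascript
// Convert a 4-character type such as 'vide' into its ASCII bytes.
function fourcc(name) {
  if (name.length !== 4) {
    throw new Error('a box or handler type must be exactly 4 characters');
  }
  return Array.from(name, (ch) => ch.charCodeAt(0));
}

// Hypothetical helper: the hdlr fields after version/flags.
function hdlrPayload(handlerType, name) {
  const nameBytes = new TextEncoder().encode(name);
  return new Uint8Array([
    0x00, 0x00, 0x00, 0x00,   // pre_defined
    ...fourcc(handlerType),   // handler_type
    ...new Array(12).fill(0), // reserved: 3 * 4 bytes
    ...nameBytes, 0x00,       // name, null-terminated
  ]);
}

console.log(fourcc('vide').map((b) => '0x' + b.toString(16)));
console.log(hdlrPayload('vide', 'VideoHandler').length);
```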

minf

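minf is an important container box among the children; it holds the basic descriptive information of the current track. There is not much to say about the box itself. Its format is:

aligned(8) class MediaInformationBox extends Box('minf') { }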

v/smhd

vmhd and smhd describe the current trak: vmhd is for video, smhd is for audio. For decoding they are effectively required (it can depend on the player); if they are missing, the file may be treated as malformed.

First, the format of vmhd:

aligned(8) class VideoMediaHeaderBox extends FullBox('vmhd', version = 0, 1) {
    template unsigned int(16) graphicsmode = 0; // copy, see below
    template unsigned int(16)[3] opcolor = {0, 0, 0};
}

These are all simple default values, so I won't go into them.

The format of smhd is just as simple:

aligned(8) class SoundMediaHeaderBox extends FullBox('smhd', version = 0, 0) {
    template int(16) balance = 0;
    const unsigned int(16) reserved = 0;
}

The balance field corresponds to the familiar left/right channel setting.

balance: an 8.8 fixed-point value; 0 is center, 1.0 is right, -1.0 is left.

dinf

dinf says where the media data of the trak is located. It is just a container with no content of its own:

aligned(8) class DataInformationBox extends Box('dinf') { }

dref

dref holds the data_entry records that say where the data described by this box lives. Its format is:

aligned(8) class DataReferenceBox extends FullBox('dref', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        DataEntryBox(entry_version, entry_flags) data_entry;
    }
}

The DataEntryBox is either a DataEntryUrlBox or a DataEntryUrnBox. Put simply, it is one of dref's child boxes, url or urn. entry_version and entry_flags deserve a note.

entry_version: indicates the format of the entry.
entry_flags: not a fixed value, but one value is special: 0x000001 means the media data is in the same file as the moov box that contains this reference.

That said, I have never actually used a dref with real data in it, so I won't expand on it here.

url

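The url box is a direct child wrapped by dref. It describes where the data for the samples lives. It normally just tags along inside other boxes. Its format is:

aligned(8) class DataEntryUrlBox (bit(24) flags) extends FullBox('url ', version = 0, flags) {
    string location;
}

I have never actually used the location field, so it is usually just left out.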

stts

stts stores the refSampleDelta, i.e. the time interval between two adjacent frames. Its format is:

aligned(8) class TimeToSampleBox extends FullBox('stts', version = 0, 0) {
    unsigned int(32) entry_count;
    int i;
    for (i=0; i < entry_count; i++) {
        unsigned int(32) sample_count;
        unsigned int(32) sample_delta;
    }
}

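The code alone does not tell you much, so let's look at a real capture. Given the following frames:

[Figure: frame list showing a Decode delta column]

You can see that every Decode delta above is 10. That is the sample_delta value. sample_count is the number of consecutive samples that share that sample_delta. For example, the delta of 10 appears 14 times above, so sample_count is 14.

Mapped onto an RTMP Video Msg, sample_delta is the timestamp delta carried in the RTMP header of the following message.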
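In other words, stts is a run-length encoding of the per-sample deltas. A minimal sketch of both directions, assuming the deltas are already known in decode order (the function and field names are illustrative only):

```javascript
// Collapse per-sample deltas into stts entries (run-length encoding).
function buildSttsEntries(deltas) {
  const entries = [];
  for (const delta of deltas) {
    const last = entries[entries.length - 1];
    if (last && last.sampleDelta === delta) {
      last.sampleCount++;
    } else {
      entries.push({ sampleCount: 1, sampleDelta: delta });
    }
  }
  return entries;
}

// Expand stts entries into the decode timestamp (DTS) of every sample.
function decodeTimes(entries) {
  const dts = [];
  let t = 0;
  for (const { sampleCount, sampleDelta } of entries) {
    for (let i = 0; i < sampleCount; i++) {
      dts.push(t);
      t += sampleDelta;
    }
  }
  return dts;
}

// The case described above: 14 samples, each with a delta of 10.
const sttsEntries = buildSttsEntries(new Array(14).fill(10));
console.log(sttsEntries);
console.log(decodeTimes(sttsEntries).slice(0, 3));
```

A constant frame rate therefore always collapses to a single entry, no matter how long the video is.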

stco

stco is a critical box inside stbl. It defines where the data in mdat sits: each entry is the offset of one chunk, measured from the start of the file. Its format is:

aligned(8) class ChunkOffsetBox extends FullBox('stco', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        unsigned int(32) chunk_offset;
    }
}

See:

[Figure: example stco entries]

stco comes in two forms. If your video is very large, chunk_offset can exceed the 32-bit limit, so a separate co64 box exists for large videos. It works exactly like stco and also gives the location of the chunks, except that chunk_offset is 64 bits wide.

aligned(8) class ChunkLargeOffsetBox extends FullBox('co64', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        unsigned int(64) chunk_offset;
    }
}

stsc

stsc is a slightly confusing box, not because it has many fields but because their meaning is a little odd. Its format is:

aligned(8) class SampleToChunkBox extends FullBox('stsc', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        unsigned int(32) first_chunk;
        unsigned int(32) samples_per_chunk;
        unsigned int(32) sample_description_index;
    }
}

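The key is its three fields: first_chunk, samples_per_chunk and sample_description_index.

first_chunk: the index of the chunk at which this entry starts.
samples_per_chunk: how many samples each chunk contains.
sample_description_index: the index of the sample description that applies to these samples. It can usually be left at 1.

These 3 fields effectively determine how many samples each chunk of the MP4 holds. A quick word on chunks and samples: in an MP4 file, the smallest unit that is located by offset is the chunk, not the sample.

sample: the smallest unit of data, a slice that contains the actual NAL data.
chunk: a group of consecutive samples, there to make reads more efficient for I/O.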

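If the fields above made sense at first glance, you are either an expert or bluffing. The official document describes them the same way, and after one read I was lost; after a second read, still lost. So here is some extra explanation.

As said above, the smallest unit in MP4 is the chunk, and the chunk_offset fields in stco describe where the chunks are in mdat. Each chunk_offset in stco corresponds to the chunk with that index. first_chunk, then, says at which chunk index an stsc entry starts.

Does stsc need an entry for every chunk, then?

No. An stsc entry describes a whole run: as long as samples_per_chunk and sample_description_index do not change, all following chunks use the same pattern.

That is, if your stsc contains only: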

first_chunk: 1
samples_per_chunk: 4
sample_description_index: 1

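In other words, starting from the first chunk, every 4 samples form one chunk, and every sample uses description index 1. This pattern continues to the end. Of course, if the number of samples is not divisible by 4, the last few samples have to be handled as a special case.

Usually, though, the stsc values vary:

[Figure: example stsc table with several entries]

In the case above, chunk 1 contains 2 samples, chunks 2 to 4 contain 1 sample each, chunk 5 contains 2 samples, and chunk 6 through the last chunk contain 1 sample each.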
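The run logic is easier to follow in code. The sketch below expands stsc entries into a per-chunk sample count; it assumes the total number of chunks is known (in a real file it comes from stco's entry_count), and the entries and the chunk count of 8 are sample values matching the description above.

```javascript
// Expand stsc entries into "number of samples" for every chunk (1-based).
function samplesPerChunk(stscEntries, chunkCount) {
  const counts = [];
  for (let i = 0; i < stscEntries.length; i++) {
    const entry = stscEntries[i];
    // a run lasts until the chunk before the next entry's first_chunk,
    // or until the last chunk for the final entry
    const lastChunk = i + 1 < stscEntries.length
      ? stscEntries[i + 1].firstChunk - 1
      : chunkCount;
    for (let chunk = entry.firstChunk; chunk <= lastChunk; chunk++) {
      counts.push(entry.samplesPerChunk);
    }
  }
  return counts;
}

const stscExample = [
  { firstChunk: 1, samplesPerChunk: 2, sampleDescriptionIndex: 1 },
  { firstChunk: 2, samplesPerChunk: 1, sampleDescriptionIndex: 1 },
  { firstChunk: 5, samplesPerChunk: 2, sampleDescriptionIndex: 1 },
  { firstChunk: 6, samplesPerChunk: 1, sampleDescriptionIndex: 1 },
];
console.log(samplesPerChunk(stscExample, 8));
```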

ctts

ctts exists mainly for the B-frames in video. If your video has no B-frames, the ctts structure is trivial. Its job is to record the composition time offset (cts) of every sample. The format is:

aligned(8) class CompositionOffsetBox extends FullBox('ctts', version = 0, 0) {
    unsigned int(32) entry_count;
    int i;
    for (i=0; i < entry_count; i++) {
        unsigned int(32) sample_count;
        unsigned int(32) sample_offset;
    }
}

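Again, an example helps. Suppose the frames in your video are arranged like this:

[Figure: frame list showing a Composition offset column]

Here sample_offset is the Composition offset. Merging consecutive samples with the same Composition offset gives the corresponding sample_count. The resulting ctts is:

[Figure: resulting ctts entries]

And this is what a real capture looks like:

[Figure: ctts box from a captured file]

If the video comes from RTMP and has no B-frames, the whole ctts has just one sample_count and one sample_offset. For example:

sample_count: 100
sample_offset: 0

Usually only video tracks need ctts.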
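Put together with stts, ctts gives each sample's presentation time: PTS = DTS + sample_offset. A minimal sketch follows; the four-frame sequence and its numbers are invented for illustration (decode order I, P, B, B with a delta of 10), not taken from the capture above.

```javascript
// Compute presentation times from decode times and ctts entries.
function compositionTimes(dts, cttsEntries) {
  const offsets = [];
  for (const { sampleCount, sampleOffset } of cttsEntries) {
    for (let i = 0; i < sampleCount; i++) offsets.push(sampleOffset);
  }
  // samples not covered by ctts are treated as offset 0
  return dts.map((t, i) => t + (offsets[i] || 0));
}

// Decode order: I, P, B, B. The P frame is shown last, so its offset is largest.
const exampleDts = [0, 10, 20, 30];
const exampleCtts = [
  { sampleCount: 1, sampleOffset: 10 }, // I
  { sampleCount: 1, sampleOffset: 30 }, // P
  { sampleCount: 2, sampleOffset: 0 },  // B, B
];
console.log(compositionTimes(exampleDts, exampleCtts));
```

Sorting the result gives the display order I, B, B, P, which is exactly the reordering that B-frames require.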

stsz

stsz stores the size of every sample. Its format is:

aligned(8) class SampleSizeBox extends FullBox('stsz', version = 0, 0) {
    unsigned int(32) sample_size;
    unsigned int(32) sample_count;
    if (sample_size==0) {
        for (i=1; i <= sample_count; i++) {
            unsigned int(32) entry_size;
        }
    }
}

Not much to explain: it is the size of every sample. If sample_size is non-zero, all samples share that size and the per-sample table is omitted.

[Figure: stsz box from a captured file]
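With stco, stsc and stsz in hand, the byte range of every sample can be worked out, which is how the stbl tables "index into mdat". The sketch below assumes the three tables have already been parsed into plain arrays; the function name, field names and numbers are all illustrative.

```javascript
// Combine chunk offsets (stco), runs (stsc) and sizes (stsz) into
// an { offset, size } record per sample.
function sampleOffsets(chunkOffsets, stscEntries, sampleSizes) {
  const result = [];
  let sampleIndex = 0;
  let run = 0;
  for (let chunk = 1; chunk <= chunkOffsets.length; chunk++) {
    // move to the stsc entry that covers this chunk
    while (run + 1 < stscEntries.length &&
           stscEntries[run + 1].firstChunk <= chunk) {
      run++;
    }
    let offset = chunkOffsets[chunk - 1]; // first sample starts at the chunk offset
    const count = stscEntries[run].samplesPerChunk;
    for (let i = 0; i < count && sampleIndex < sampleSizes.length; i++) {
      result.push({ offset, size: sampleSizes[sampleIndex] });
      offset += sampleSizes[sampleIndex]; // samples in a chunk are contiguous
      sampleIndex++;
    }
  }
  return result;
}

const located = sampleOffsets(
  [48, 348, 448],                              // stco
  [{ firstChunk: 1, samplesPerChunk: 2 },
   { firstChunk: 2, samplesPerChunk: 1 }],     // stsc
  [100, 200, 100, 50]);                        // stsz
console.log(located);
```

This is the work that fMP4 avoids, since each trun carries sizes and an offset for its own samples.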

fragmented MP4

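Everything so far covers the standard boxes. fMP4 shares a large part of its content with the standard MP4 format, so we won't repeat that and only go through the boxes that differ.

mvex

mvex is the box that marks an fMP4. It tells the decoder that this is an fMP4 file, and that the per-sample information is no longer stored in trak but in each moof. Its format is:

aligned(8) class MovieExtendsBox extends Box('mvex') { }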

trex

trex is a direct child of mvex. It sets the default values for the samples of an fMP4. Its content is:

aligned(8) class TrackExtendsBox extends FullBox('trex', 0, 0) {
    unsigned int(32) track_ID;
    unsigned int(32) default_sample_description_index;
    unsigned int(32) default_sample_duration;
    unsigned int(32) default_sample_size;
    unsigned int(32) default_sample_flags;
}

Which values to set depends on what your use case requires. If you really don't know, most of them can simply be set to 0:

new Uint8Array([
    0x00, 0x00, 0x00, 0x00, // version(0) + flags
    (trackId >>> 24) & 0xFF, // track_ID
    (trackId >>> 16) & 0xFF,
    (trackId >>> 8) & 0xFF,
    (trackId) & 0xFF,
    0x00, 0x00, 0x00, 0x01, // default_sample_description_index
    0x00, 0x00, 0x00, 0x00, // default_sample_duration
    0x00, 0x00, 0x00, 0x00, // default_sample_size
    0x00, 0x01, 0x00, 0x01  // default_sample_flags
])

moof

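moof holds the fMP4-specific content. It has little content of its own; it is a container, and the per-track data sits in its traf children:

aligned(8) class MovieFragmentBox extends Box('moof') { }
aligned(8) class TrackFragmentBox extends Box('traf') { }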

tfhd

tfhd sets defaults for a given trak, for example the duration, size and offset of its samples. All of these can be left out as long as the other boxes provide complete values:

aligned(8) class TrackFragmentHeaderBox extends FullBox('tfhd', 0, tf_flags) {
    unsigned int(32) track_ID;
    // all the following are optional fields
    unsigned int(64) base_data_offset;
    unsigned int(32) sample_description_index;
    unsigned int(32) default_sample_duration;
    unsigned int(32) default_sample_size;
    unsigned int(32) default_sample_flags;
}

base_data_offset is the base from which the data offsets that follow are computed. If it is present it is used; otherwise a default base applies (typically the start of the enclosing moof).

tfdt

tfdt stores the absolute decode time of the samples. Because fMP4 is a streaming-oriented format, you cannot seek directly by sample the way you can in MP4. A reference timeline is needed to locate a particular fragment quickly.

Its format is:

aligned(8) class TrackFragmentBaseMediaDecodeTimeBox extends FullBox('tfdt', version, 0) {
    if (version==1) {
        unsigned int(64) baseMediaDecodeTime;
    } else { // version==0
        unsigned int(32) baseMediaDecodeTime;
    }
}

baseMediaDecodeTime is the sum of the durations of all earlier samples of the given track ID; in effect it is the DTS of the first sample in the current traf.
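That running sum can be sketched as below. The input shape (one array of sample durations per fragment, for a single track) and the function name are assumptions made for this example; the durations are sample values in timescale units.

```javascript
// For each fragment, baseMediaDecodeTime is the total duration of all
// samples in the fragments before it.
function baseMediaDecodeTimes(fragments) {
  const result = [];
  let elapsed = 0;
  for (const durations of fragments) {
    result.push(elapsed);
    elapsed += durations.reduce((sum, d) => sum + d, 0);
  }
  return result;
}

// Three fragments with 3, 2 and 1 samples, each sample lasting 3000 units.
const fragmentDurations = [[3000, 3000, 3000], [3000, 3000], [3000]];
console.log(baseMediaDecodeTimes(fragmentDurations));
```

A muxer usually just keeps `elapsed` as per-track state and writes it into tfdt each time it emits a new moof.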

trun

trun stores the sample information of this moof, for example the size, duration and offset of each sample. Its content is:

aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
    unsigned int(32) sample_count;
    // the following are optional fields
    signed int(32) data_offset;
    unsigned int(32) first_sample_flags;
    // all fields in the following array are optional
    {
        unsigned int(32) sample_duration;
        unsigned int(32) sample_size;
        unsigned int(32) sample_flags;
        if (version == 0) {
            unsigned int(32) sample_composition_time_offset;
        } else {
            signed int(32) sample_composition_time_offset;
        }
    }[ sample_count ]
}

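The fields of trun are arguably the most important ones inside traf:

tr_flags says which of the optional fields are present, for the box as a whole and for each sample:

0x000001: data-offset-present; the data_offset field is present.
0x000004: first_sample_flags is present and applies to the first sample only; the flags of the remaining samples are not affected.
0x000100: important; every sample carries its own duration, otherwise the default is used.
0x000200: every sample carries its own sample_size, otherwise the default is used.
0x000400: every sample carries its own flags, otherwise the default is used.
0x000800: every sample carries its own cts value.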
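Since every one of these optional fields is 4 bytes wide, tr_flags fully determines the size of a trun payload. A small sketch of that calculation; the constant names and `trunPayloadSize` are made up for this example, and the result excludes the 8-byte box header.

```javascript
// tr_flags bits as listed above (names are illustrative).
const TRUN_FLAGS = {
  DATA_OFFSET: 0x000001,
  FIRST_SAMPLE_FLAGS: 0x000004,
  SAMPLE_DURATION: 0x000100,
  SAMPLE_SIZE: 0x000200,
  SAMPLE_FLAGS: 0x000400,
  SAMPLE_CTS_OFFSET: 0x000800,
};

// Size in bytes of the trun payload (after the size/type header).
function trunPayloadSize(trFlags, sampleCount) {
  let size = 4 + 4; // version + flags, then sample_count
  if (trFlags & TRUN_FLAGS.DATA_OFFSET) size += 4;
  if (trFlags & TRUN_FLAGS.FIRST_SAMPLE_FLAGS) size += 4;
  let perSample = 0;
  for (const bit of [TRUN_FLAGS.SAMPLE_DURATION, TRUN_FLAGS.SAMPLE_SIZE,
                     TRUN_FLAGS.SAMPLE_FLAGS, TRUN_FLAGS.SAMPLE_CTS_OFFSET]) {
    if (trFlags & bit) perSample += 4; // each present field is 4 bytes per sample
  }
  return size + perSample * sampleCount;
}

// data_offset plus all four per-sample fields, 3 samples.
console.log(trunPayloadSize(0x000F01, 3));
```

A parser uses the same bit tests in the same order to decide which fields to read for each sample.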

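A brief look at the remaining fields:

data_offset: how many bytes lie between the start of the moof and the actual data in the mdat that goes with it. In effect it is moof.byteLength + mdat.headerSize.
sample_count: the total number of samples.
first_sample_flags: applies to the first sample only. In general it can be left at 0.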

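I won't go through the remaining fields one by one. One thing to note, though: sample_flags is very important, because it is what usually tells you which sample is a key frame. Its first two bytes are computed like this:

(flags.isLeading << 2) | flags.dependsOn, // sample_flags
(flags.isDepended << 6) | (flags.hasRedundancy << 4) | flags.isNonSync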
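As a self-contained illustration, the two expressions above can be wrapped into a function that produces the full 4-byte sample_flags, with the last two bytes (the degradation priority) left at 0. The function names and the two example flag objects are assumptions for this sketch.

```javascript
// Build the 4 bytes of sample_flags from a flags object shaped like the
// one used in the expressions above.
function sampleFlagsBytes(flags) {
  return [
    (flags.isLeading << 2) | flags.dependsOn,
    (flags.isDepended << 6) | (flags.hasRedundancy << 4) | flags.isNonSync,
    0x00, 0x00, // sample_degradation_priority, left at 0 here
  ];
}

// A sample is a sync sample (key frame) when the non-sync bit is 0.
function isSyncSample(bytes) {
  return (bytes[1] & 0x01) === 0;
}

const keyFrameFlags = sampleFlagsBytes(
  { isLeading: 0, dependsOn: 2, isDepended: 1, hasRedundancy: 0, isNonSync: 0 });
const otherFrameFlags = sampleFlagsBytes(
  { isLeading: 0, dependsOn: 1, isDepended: 0, hasRedundancy: 0, isNonSync: 1 });
console.log(keyFrameFlags, isSyncSample(keyFrameFlags));
console.log(otherFrameFlags, isSyncSample(otherFrameFlags));
```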

sdtp

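sdtp describes whether a given sample is an I-frame, whether it is a leading frame, and similar properties. It mainly serves as synchronization reference information during on-demand playback. It has 4 fields:

is_leading: whether the sample is a leading sample.
- 0: the leading property of this sample is unknown (the common case)
- 1: the sample is a leading sample and cannot be decoded
- 2: the sample is not a leading sample
- 3: the sample is a leading sample and can be decoded
sample_depends_on: whether the sample is an I-frame.
- 0: unknown whether the sample depends on other frames
- 1: the sample depends on others (a B/P frame)
- 2: the sample does not depend on others (an I-frame)
- 3: reserved
sample_is_depended_on: whether other frames depend on this one.
- 0: unknown whether it is depended on (typically used for B/P frames)
- 1: other samples depend on it (for example an I-frame)
- 2: no other sample depends on it
- 3: reserved
sample_has_redundancy: whether there is redundant coding.
- 0: unknown whether there is redundancy
- 1: there is redundant coding
- 2: there is no redundant coding
- 3: reserved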

The overall format is:

aligned(8) class SampleDependencyTypeBox extends FullBox('sdtp', version = 0, 0) {
    for (i=0; i < sample_count; i++) {
        unsigned int(2) is_leading;
        unsigned int(2) sample_depends_on;
        unsigned int(2) sample_is_depended_on;
        unsigned int(2) sample_has_redundancy;
    }
}

sdtp matters most for video, since its fields were designed mainly around video frames. For audio, the defaults are normally used:

isLeading: 0, dependsOn: 1, isDepended: 0, hasRedundancy: 0
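Each sample therefore takes exactly one byte in sdtp, with the four 2-bit fields packed from the most significant bits down. A minimal sketch of building the payload (function names are hypothetical; the flag objects reuse the field names from the defaults above):

```javascript
// Pack the four 2-bit sdtp fields of one sample into a byte.
function sdtpByte({ isLeading, dependsOn, isDepended, hasRedundancy }) {
  return (isLeading << 6) | (dependsOn << 4) | (isDepended << 2) | hasRedundancy;
}

// sdtp payload: version(0) + flags (4 bytes), then one byte per sample.
function sdtpPayload(samples) {
  const out = new Uint8Array(4 + samples.length);
  samples.forEach((sample, i) => {
    out[4 + i] = sdtpByte(sample);
  });
  return out;
}

const audioDefault = { isLeading: 0, dependsOn: 1, isDepended: 0, hasRedundancy: 0 };
const videoKeyFrame = { isLeading: 0, dependsOn: 2, isDepended: 1, hasRedundancy: 0 };
console.log(sdtpByte(audioDefault).toString(16), sdtpByte(videoKeyFrame).toString(16));
console.log(sdtpPayload([videoKeyFrame, audioDefault]));
```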

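That covers MP4 and fMP4. For more detail, see the MP4 & FMP4 doc.

Of course, this only scratches the surface. Knowing what the boxes contain is not enough to do real audio/video processing; most of the work lies in the fundamentals, such as dts/pts, audio/video synchronization, how the media is packed into boxes, and so on.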
