Lua Strings

This article is a repost. 2015-06-08 17:20, tags: lua, code reading

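This article is based on Lua 5.3.0.

Overview

A Lua string may contain any 1-byte value, including '\0', the character that terminates strings in C. Internally, Lua stores a string as a block of memory with an explicit length; the contents are binary data, and the interpreter does not truncate them at a '\0'. When interacting with C, Lua also guarantees that every internally stored string has a '\0' appended after its last byte, so the data can be passed to C library functions. Together these properties make Lua strings usable in a wide range of situations.

Once created, a Lua string cannot be modified. A Lua value of string type is held by reference. String objects are managed by the garbage collector: once a string is no longer referenced from anywhere, it can be reclaimed.

Lua manages and manipulates strings quite differently from C. Reading the implementation gives a deeper understanding of Lua strings and helps in using them more efficiently.

The TString structure

• Declaration of the TString structure

The C structure behind a Lua string is TString, defined in lobject.h.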

// lobject.h
/*
** Common Header for all collectable objects (in macro form, to be
** included in other objects)
*/
#define CommonHeader    GCObject *next; lu_byte tt; lu_byte marked

// lobject.h
/*
** Header for string value; string bytes follow the end of this structure
** (aligned according to 'UTString'; see next).
*/
typedef struct TString {
  CommonHeader;
  lu_byte extra;  /* reserved words for short strings; "has hash" for longs */
  unsigned int hash;
  size_t len;  /* number of characters in string */
  struct TString *hnext;  /* linked list for hash table */
} TString;

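CommonHeader : information used by the GC.

extra : auxiliary information. For short strings, this field marks whether the string is a reserved word, which lets the lexer recognize reserved words quickly. For long strings, it supports lazy hashing (the hash is computed only when it is first needed).

hash : the string's hash value, used to speed up string matching and lookup.

len : Lua does not rely on a terminating '\0' to determine a string's length, so a len field records it.

hnext : strings whose hash values map to the same slot of the hash table are chained into a list; hnext points to the next node in that list.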
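The lazy hashing described for the extra field can be sketched in isolation. This is a minimal illustration, not Lua's code: DemoLongString, demo_hash_bytes and demo_lazy_hash are hypothetical names, the hash function is a placeholder, and the idea that the hash field holds the seed until the first computation is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a long string's bookkeeping fields. */
typedef struct DemoLongString {
  unsigned char extra;   /* 0 = hash not computed yet, 1 = hash is valid */
  unsigned int hash;     /* assumed to hold the seed until the hash replaces it */
  size_t len;
  const char *data;
} DemoLongString;

static int demo_hash_runs = 0;  /* counts how often the bytes are hashed */

/* Placeholder hash, NOT Lua's algorithm: any function of the bytes works here. */
static unsigned int demo_hash_bytes(const char *p, size_t len, unsigned int seed) {
  unsigned int h = seed ^ (unsigned int)len;
  size_t i;
  for (i = 0; i < len; i++)
    h = h * 31u + (unsigned char)p[i];
  demo_hash_runs++;
  return h;
}

/* Compute the hash on the first request and return the cached value afterwards. */
static unsigned int demo_lazy_hash(DemoLongString *s) {
  assert(s != NULL);
  if (s->extra == 0) {
    s->hash = demo_hash_bytes(s->data, s->len, s->hash);
    s->extra = 1;  /* from now on the stored hash is reused */
  }
  return s->hash;
}
```

Calling demo_lazy_hash twice on the same object hashes the bytes only once; a long string that is never used as a table key never pays for hashing at all.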

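• TString storage layout

The contents of a Lua string are not stored in a separately allocated block; they are appended directly after the TString structure. The layout can be summarized as follows:

• Lua string object = TString structure + actual string data
• TString structure = common GC header (starting with a GCObject * pointer) + string bookkeeping fields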
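The "header followed by the bytes" layout can be tried out with a small standalone sketch. DemoString, demo_data and demo_new are hypothetical names used only for illustration; the real TString also carries the GC header and relies on Lua's own allocator and alignment helpers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for TString: bookkeeping only, the bytes follow it. */
typedef struct DemoString {
  unsigned int hash;
  size_t len;  /* byte count; the bytes may include '\0' */
} DemoString;

/* Address of the payload: the first byte after the header. */
static char *demo_data(DemoString *s) {
  return (char *)(s + 1);
}

/* A single allocation holds the header, len bytes and a trailing '\0'. */
static DemoString *demo_new(const char *bytes, size_t len) {
  DemoString *s;
  assert(bytes != NULL);
  s = malloc(sizeof(DemoString) + len + 1);
  if (s == NULL) return NULL;
  s->hash = 0;
  s->len = len;
  memcpy(demo_data(s), bytes, len);
  demo_data(s)[len] = '\0';  /* terminator for C library functions */
  return s;
}
```

Creating a string from the five bytes "ab\0cd" keeps all five bytes and records len as 5, while strlen on the payload stops at the embedded '\0'. That is why the length has to be stored explicitly.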

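Short strings and long strings

• How strings are classified as short or long

Strings are kept in a lua_State in one of two internal forms: short strings and long strings. Every basic built-in type in Lua has a corresponding macro, and the string type corresponds to LUA_TSTRING. To tell short from long strings, Lua extends LUA_TSTRING with two variant types, LUA_TSHRSTR and LUA_TLNGSTR, distinguished by storing 0 or 1 in the high four bits of the type byte. The two variants are internal and not visible through the external API, so end users see only the single type LUA_TSTRING.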

// lua.h
/*
** basic types
*/
#define LUA_TNONE             (-1)

#define LUA_TNIL              0
#define LUA_TBOOLEAN          1
#define LUA_TLIGHTUSERDATA    2
#define LUA_TNUMBER           3
#define LUA_TSTRING           4
#define LUA_TTABLE            5
#define LUA_TFUNCTION         6
#define LUA_TUSERDATA         7
#define LUA_TTHREAD           8

#define LUA_NUMTAGS           9

// lobject.h
/* Variant tags for strings */
#define LUA_TSHRSTR    (LUA_TSTRING | (0 << 4))  /* short strings */
#define LUA_TLNGSTR    (LUA_TSTRING | (1 << 4))  /* long strings */

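The boundary between short and long strings is set by the LUAI_MAXSHORTLEN macro in luaconf.h; its default value is 40 (bytes). By design, metamethod names and reserved words must be short strings, so the limit cannot be shorter than the longest metamethod name, __newindex (10 bytes), or the longest reserved word, function (8 bytes). In other words, LUAI_MAXSHORTLEN must not be set below 10 (bytes).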

// luaconf.h
/*
@@ LUAI_MAXSHORTLEN is the maximum length for short strings, that is,
** strings that are internalized. (Cannot be smaller than reserved words
** or tags for metamethods, as these strings must be internalized;
** #("function") = 8, #("__newindex") = 10.)
*/
#define LUAI_MAXSHORTLEN        40
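The tag encoding and the length threshold can be combined into a small sketch. The DEMO_ macros mirror the values shown above, while demo_basic_type and demo_variant_for_length are hypothetical helpers; treating a string of exactly 40 bytes as short is an assumption based on the phrase "maximum length for short strings".

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TSTRING      4                          /* same value as LUA_TSTRING */
#define DEMO_TSHRSTR      (DEMO_TSTRING | (0 << 4))  /* short variant */
#define DEMO_TLNGSTR      (DEMO_TSTRING | (1 << 4))  /* long variant */
#define DEMO_MAXSHORTLEN  40                         /* default LUAI_MAXSHORTLEN */

/* Strip the variant bits, leaving the basic type that API users see. */
static int demo_basic_type(int tag) {
  return tag & 0x0F;
}

/* Pick the variant tag for a string of the given byte length. */
static int demo_variant_for_length(size_t len) {
  int tag = (len <= DEMO_MAXSHORTLEN) ? DEMO_TSHRSTR : DEMO_TLNGSTR;
  assert(demo_basic_type(tag) == DEMO_TSTRING);  /* both variants are strings */
  return tag;
}
```

Both variants reduce to the same basic type once the high bits are masked off, which matches the statement that users only ever see LUA_TSTRING.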

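• Function call graph for string creation

Leaving aside the interning of short strings, every string is ultimately created by createstrobj, which creates a GC-managed object and copies the string contents into it.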

// lgc.c
/*
** create a new collectable object (with given type and size) and link
** it to 'allgc' list.
*/
GCObject *luaC_newobj (lua_State *L, int tt, size_t sz) {
  global_State *g = G(L);
  GCObject *o = cast(GCObject *, luaM_newobject(L, novariant(tt), sz));
  o->marked = luaC_white(g);
  o->tt = tt;
  // link the new object into the list of GC objects
  o->next = g->allgc;
  g->allgc = o;
  return o;
}

// lstring.c
/*
** creates a new string object
*/
static TString *createstrobj (lua_State *L, const char *str, size_t l,
                              int tag, unsigned int h) {
  TString *ts;
  GCObject *o;
  size_t totalsize;  /* total size of TString object */
  totalsize = sizelstring(l);
  o = luaC_newobj(L, tag, totalsize);
  ts = gco2ts(o);
  ts->len = l;
  ts->hash = h;
  ts->extra = 0;
  memcpy(getaddrstr(ts), str, l * sizeof(char));
  getaddrstr(ts)[l] = '\0';  /* ending 0 */
  return ts;
}

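The string hash algorithm

• The hash algorithm

Lua's string hash algorithm is implemented in luaS_hash. For longer strings (32 bytes or more), the hash is computed over a sample of the bytes rather than all of them, to speed up hashing. The sampling stride (step) is controlled by the LUAI_HASHLIMIT macro.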

// lstring.c
/*
** Lua will use at most ~(2^LUAI_HASHLIMIT) bytes from a string to
** compute its hash
*/
#if !defined(LUAI_HASHLIMIT)
#define LUAI_HASHLIMIT        5
#endif

// lstring.h
LUAI_FUNC unsigned int luaS_hash (const char *str, size_t l, unsigned int seed);

// lstring.c
unsigned int luaS_hash (const char *str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ cast(unsigned int, l);
  size_t l1;
  size_t step = (l >> LUAI_HASHLIMIT) + 1;
  for (l1 = l; l1 >= step; l1 -= step)
    h = h ^ ((h<<5) + (h>>2) + cast_byte(str[l1 - 1]));
  return h;
}

• str : the string to hash;

• l : the length of the string to hash (number of characters);

• seed : the random seed for the hash algorithm;
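To see how the stride limits the work done for long strings, the loop can be copied into a standalone program. The sketch below replaces Lua's cast macros with plain C casts; demo_hash and demo_sampled_bytes are hypothetical names, and demo_sampled_bytes only counts loop iterations.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_HASHLIMIT 5  /* same default as LUAI_HASHLIMIT */

/* Standalone copy of the hashing loop, with plain casts instead of Lua's macros. */
static unsigned int demo_hash(const char *str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ (unsigned int)l;
  size_t step = (l >> DEMO_HASHLIMIT) + 1;
  size_t l1;
  assert(str != NULL || l == 0);
  for (l1 = l; l1 >= step; l1 -= step)
    h ^= (h << 5) + (h >> 2) + (unsigned char)str[l1 - 1];
  return h;
}

/* Number of bytes the loop above reads for a string of length l. */
static size_t demo_sampled_bytes(size_t l) {
  size_t step = (l >> DEMO_HASHLIMIT) + 1;
  size_t count = 0;
  size_t l1;
  for (l1 = l; l1 >= step; l1 -= step)
    count++;
  return count;
}
```

A 31-byte string has every byte hashed (step is 1), a 32-byte string has every second byte hashed, and a 1000-byte string is sampled with a stride of 32, so roughly 2^LUAI_HASHLIMIT bytes are read no matter how long the string is.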

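• The random seed

Hash DoS attacks: the attacker constructs tens of millions of distinct strings that share the same hash value, slowing down by a factor of tens the insertion of external strings into Lua's internal string table. When Lua is used in services that depend heavily on string processing (such as HTTP), the input strings are outside the program's control and are easy to exploit maliciously.

To prevent Hash DoS attacks, Lua does two things. First, it treats long strings separately: large text inputs are no longer hashed and interned into the global string table. Second, it uses a random seed when computing string hashes, so an attacker cannot easily construct distinct strings with the same hash value.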
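The effect of the seed can be observed with a standalone copy of the hashing loop from luaS_hash. This is only an illustration under simplifying assumptions: demo_hash and demo_bucket are hypothetical names, the seeds 0 and 1 are arbitrary, and a table of 64 slots is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Standalone copy of the hashing loop, with plain casts instead of Lua's macros. */
static unsigned int demo_hash(const char *str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ (unsigned int)l;
  size_t step = (l >> 5) + 1;  /* 5 is the default LUAI_HASHLIMIT */
  size_t l1;
  assert(str != NULL || l == 0);
  for (l1 = l; l1 >= step; l1 -= step)
    h ^= (h << 5) + (h >> 2) + (unsigned char)str[l1 - 1];
  return h;
}

/* Slot index in a table whose size is a power of two. */
static unsigned int demo_bucket(unsigned int h, unsigned int size) {
  assert(size != 0 && (size & (size - 1)) == 0);
  return h & (size - 1);
}
```

Hashing the key "if" with seed 0 and with seed 1 selects different slots in a 64-slot table. A set of keys precomputed to collide under one seed therefore has no reason to collide under another, and the attacker does not know which seed a given state uses.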

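The random seed is built when the virtual machine's global_State is created, and it is stored in global_State. The seed is itself produced with luaS_hash, from a combination of memory-address randomness and a user-configurable random quantity (the luai_makeseed macro).

Users can define luai_makeseed in luaconf.h to supply their own source of randomness; by default Lua builds the seed from the current system time obtained with the time function. The default behavior can make debugging slightly harder: because string hash values differ, the internal layout varies a little from one run to the next. Since the string pool uses open hashing (separate chaining), the effect is very small. Users who want an embedded Lua to behave identically on every run can define their own luai_makeseed macro (keeping in mind that makeseed, shown below, also mixes in memory addresses).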

// lstate.c
/*
** a macro to help the creation of a unique random seed when a state is
** created; the seed is used to randomize hashes.
*/
#if !defined(luai_makeseed)
#include <time.h>
#define luai_makeseed()        cast(unsigned int, time(NULL))
#endif

// lstate.c
/*
** Compute an initial seed as random as possible. Rely on Address Space
** Layout Randomization (if present) to increase randomness..
*/
#define addbuff(b,p,e) \
  { size_t t = cast(size_t, e); \
    memcpy(buff + p, &t, sizeof(t)); p += sizeof(t); }

static unsigned int makeseed (lua_State *L) {
  char buff[4 * sizeof(size_t)];
  unsigned int h = luai_makeseed();
  int p = 0;
  addbuff(buff, p, L);  /* heap variable */
  addbuff(buff, p, &h);  /* local variable */
  addbuff(buff, p, luaO_nilobject);  /* global variable */
  addbuff(buff, p, &lua_newstate);  /* public function */
  lua_assert(p == sizeof(buff));
  return luaS_hash(buff, p, h);
}

// lstate.c
LUA_API lua_State *lua_newstate (lua_Alloc f, void *ud) {
  int i;
  lua_State *L;
  global_State *g;
  LG *l = cast(LG *, (*f)(ud, NULL, LUA_TTHREAD, sizeof(LG)));
  if (l == NULL) return NULL;
  L = &l->l.l;
  g = &l->g;
  ......
  g->seed = makeseed(L);
  ......
  return L;
}

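Interning of short strings

All short strings in Lua are stored in the strt field of the global state (global_State). strt is short for stringtable, and it is a hash table.

Within one lua_State, identical short strings exist as a single instance; this is called string interning. Merging identical strings greatly reduces memory usage and shortens string comparison time. Because only one copy of each string is kept in memory, matching a string used as a key only requires checking whether the addresses are equal, with no byte-by-byte comparison. The rest of this section focuses on stringtable.

• The stringtable structure type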

// lstate.h
typedef struct stringtable {
  TString **hash;
  int nuse;  /* number of elements */
  int size;
} stringtable;

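• hash : the separate-chaining (open hashing) hash table for strings. hash points to a one-dimensional array whose elements are of type TString * (pointers to TString objects); it is not a pointer to a two-dimensional array whose elements are TString;

• nuse : the number of strings currently in the string table;

• size : the number of slots (chains) in the hash array; the table is enlarged once nuse reaches this value;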

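• stringtable storage layout

• Interning a short string (the hashing process)

First, compute the hash of the incoming short string and take it modulo the size of the stringtable; this gives the string's position in the table (a linked list of strings that map to the same slot). Then, starting from the head of that list, compare the contents of each string with the incoming one. If they are equal, the incoming string is already in the table; if not, it is a different string and the search continues down the list. If no match is found in the whole list, a new string must be created. During creation, a string whose hash maps to an occupied slot is simply linked into the list at that slot. In one sentence: when a string is about to enter the string table, Lua first checks whether an identical string is already there, reuses it if so, and creates a new one otherwise.

Because Lua's garbage collection runs incrementally, and a new string can be added to the stringtable between any two collection steps, the lookup has to check whether a string found in the table is already dead (marked as collectable). A string may be marked dead and then, in a later step, an identical string may be requested, which brings it back to life.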

// lstring.c
/*
** checks whether short string exists and reuses it or creates a new one
*/
static TString *internshrstr (lua_State *L, const char *str, size_t l) {
  TString *ts;
  global_State *g = G(L);
  // compute the hash of the incoming string
  unsigned int h = luaS_hash(str, l, g->seed);
  // locate the chain for the target slot
  TString **list = &g->strt.hash[lmod(h, g->strt.size)];
  // search the chain for the incoming string
  for (ts = *list; ts != NULL; ts = ts->hnext) {
    if (l == ts->len &&
        (memcmp(str, getstr(ts), l * sizeof(char)) == 0)) {
      /* found! */
      if (isdead(g, ts))  /* dead (but not collected yet)? */
        changewhite(ts);  /* resurrect it */
      return ts;
    }
  }
  if (g->strt.nuse >= g->strt.size && g->strt.size <= MAX_INT/2) {
    luaS_resize(L, g->strt.size * 2);
    list = &g->strt.hash[lmod(h, g->strt.size)];  /* recompute with new size */
  }
  // not found: create a new string
  ts = createstrobj(L, str, l, LUA_TSHRSTR, h);
  ts->hnext = *list;
  *list = ts;
  g->strt.nuse++;
  return ts;
}
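The same lookup-then-create pattern can be exercised outside Lua with a small separate-chaining table. This sketch leaves out garbage collection and resizing; DemoNode, DemoTable, demo_hash and demo_intern are hypothetical names, and the hash function is a placeholder rather than luaS_hash.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BUCKETS 8  /* fixed power-of-two size; no resizing in this sketch */

typedef struct DemoNode {
  struct DemoNode *hnext;  /* next string in the same slot */
  unsigned int hash;
  size_t len;
  char data[];             /* payload stored right after the header */
} DemoNode;

typedef struct DemoTable {
  DemoNode *hash[DEMO_BUCKETS];
  int nuse;
} DemoTable;

/* Placeholder hash (not Lua's): enough to spread a few keys. */
static unsigned int demo_hash(const char *s, size_t len) {
  unsigned int h = (unsigned int)len;
  size_t i;
  for (i = 0; i < len; i++)
    h = h * 31u + (unsigned char)s[i];
  return h;
}

/* Return the unique node for these bytes, creating it on first sight. */
static DemoNode *demo_intern(DemoTable *t, const char *s, size_t len) {
  unsigned int h;
  DemoNode **list;
  DemoNode *n;
  assert(t != NULL && s != NULL);
  h = demo_hash(s, len);
  list = &t->hash[h & (DEMO_BUCKETS - 1)];
  for (n = *list; n != NULL; n = n->hnext) {
    if (n->len == len && memcmp(n->data, s, len) == 0)
      return n;  /* already interned: reuse it */
  }
  n = malloc(sizeof(DemoNode) + len + 1);
  if (n == NULL) return NULL;
  n->hash = h;
  n->len = len;
  memcpy(n->data, s, len);
  n->data[len] = '\0';
  n->hnext = *list;  /* head insertion into the slot's chain */
  *list = n;
  t->nuse++;
  return n;
}
```

Interning "abc" twice returns the same pointer and leaves nuse at 1, so later equality checks between interned strings reduce to a pointer comparison.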

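• Growing the stringtable and rehashing the strings

When the number of strings in the stringtable (the stringtable.nuse field) reaches its capacity (the stringtable.size field), the table is too crowded: many strings may hash to the same slot, which slows down traversal of the chains. At that point luaS_resize is called to enlarge the array of hash chains and redistribute all strings.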

// lstring.h
LUAI_FUNC void luaS_resize (lua_State *L, int newsize);

// lstring.c
/*
** resizes the string table
*/
void luaS_resize (lua_State *L, int newsize) {
  int i;
  // get the global stringtable
  stringtable *tb = &G(L)->strt;
  // if the new capacity is larger than the old one, reallocate
  if (newsize > tb->size) {  /* grow table if needed */
    luaM_reallocvector(L, tb->hash, tb->size, newsize, TString *);
    for (i = tb->size; i < newsize; i++)
      tb->hash[i] = NULL;
  }
  // rehash according to the new capacity
  for (i = 0; i < tb->size; i++) {  /* rehash */
    TString *p = tb->hash[i];
    tb->hash[i] = NULL;
    // move every element of the chain to its new position (head insertion)
    while (p) {  /* for each node in the list */
      TString *hnext = p->hnext;  /* save next */
      unsigned int h = lmod(p->hash, newsize);  /* new position */
      p->hnext = tb->hash[h];  /* chain it */
      tb->hash[h] = p;
      p = hnext;
    }
  }
  // if the new capacity is smaller than the old one, shrink the table
  if (newsize < tb->size) {  /* shrink table if needed */
    /* vanishing slice should be empty */
    lua_assert(tb->hash[newsize] == NULL && tb->hash[tb->size - 1] == NULL);
    luaM_reallocvector(L, tb->hash, tb->size, newsize, TString *);
  }
  tb->size = newsize;
}
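The growing half of this procedure can be reproduced with the standard allocator. The sketch below covers only growth to a larger power-of-two size; DemoNode, DemoTable, demo_grow, demo_insert and demo_count_well_placed are hypothetical names, and realloc stands in for luaM_reallocvector.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct DemoNode {
  struct DemoNode *hnext;
  unsigned int hash;
} DemoNode;

typedef struct DemoTable {
  DemoNode **hash;
  int nuse;
  int size;  /* number of slots, always a power of two (or 0 when empty) */
} DemoTable;

/* Grow the slot array to newsize and rechain every node. Returns 0 on success. */
static int demo_grow(DemoTable *tb, int newsize) {
  DemoNode **fresh;
  int i;
  assert(newsize > tb->size && (newsize & (newsize - 1)) == 0);
  fresh = realloc(tb->hash, (size_t)newsize * sizeof(DemoNode *));
  if (fresh == NULL) return -1;
  tb->hash = fresh;
  for (i = tb->size; i < newsize; i++)
    tb->hash[i] = NULL;  /* the new slots start out empty */
  for (i = 0; i < tb->size; i++) {
    DemoNode *p = tb->hash[i];
    tb->hash[i] = NULL;
    while (p != NULL) {
      DemoNode *next = p->hnext;  /* save next before relinking */
      unsigned int h = p->hash & (unsigned int)(newsize - 1);
      p->hnext = tb->hash[h];     /* head insertion at the new slot */
      tb->hash[h] = p;
      p = next;
    }
  }
  tb->size = newsize;
  return 0;
}

/* Link a node at the head of the chain its hash selects. */
static void demo_insert(DemoTable *tb, DemoNode *n) {
  unsigned int h = n->hash & (unsigned int)(tb->size - 1);
  n->hnext = tb->hash[h];
  tb->hash[h] = n;
  tb->nuse++;
}

/* Count the nodes that sit in the slot their hash selects. */
static int demo_count_well_placed(const DemoTable *tb) {
  int i, count = 0;
  for (i = 0; i < tb->size; i++) {
    const DemoNode *p;
    for (p = tb->hash[i]; p != NULL; p = p->hnext)
      if ((p->hash & (unsigned int)(tb->size - 1)) == (unsigned int)i)
        count++;
  }
  return count;
}
```

Because the size is a power of two, the slot index is just the low bits of the stored hash. Growing the table never requires hashing the bytes again, only masking the saved hash with the new size.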

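The initial size of the stringtable is controlled by the MINSTRTABSIZE macro, which defaults to 64; users can redefine MINSTRTABSIZE in luaconf.h to change it. The first allocation of the stringtable also goes through luaS_resize, which takes the table from size 0 to MINSTRTABSIZE.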

// llimits.h
/* minimum size for the string table (must be power of 2) */
#if !defined(MINSTRTABSIZE)
#define MINSTRTABSIZE    64    /* minimum size for "predefined" strings */
#endif

// lstate.c
/*
** open parts of the state that may cause memory-allocation errors.
** ('g->version' != NULL flags that the state was completely build)
*/
static void f_luaopen (lua_State *L, void *ud) {
  global_State *g = G(L);
  UNUSED(ud);
  stack_init(L, L);  /* init stack */
  init_registry(L, g);
  luaS_resize(L, MINSTRTABSIZE);  /* initial size of string table */
  ...
}

// lstate.c
LUA_API lua_State *lua_newstate (lua_Alloc f, void *ud) {
  int i;
  lua_State *L;
  global_State *g;
  LG *l = cast(LG *, (*f)(ud, NULL, LUA_TTHREAD, sizeof(LG)));
  if (l == NULL) return NULL;
  L = &l->l.l;
  g = &l->g;
  ...
  g->strt.size = g->strt.nuse = 0;
  g->strt.hash = NULL;
  ...
  if (luaD_rawrunprotected(L, f_luaopen, NULL) != LUA_OK) {
    /* memory allocation error: free partial state */
    close_state(L);
    L = NULL;
  }
  return L;
}

The growth policy of the stringtable during interning resembles that of the STL vector: when space runs out, the size is doubled.

// lstring.c
/*
** checks whether short string exists and reuses it or creates a new one
*/
static TString *internshrstr (lua_State *L, const char *str, size_t l) {
  TString *ts;
  global_State *g = G(L);
  unsigned int h = luaS_hash(str, l, g->seed);
  TString **list = &g->strt.hash[lmod(h, g->strt.size)];
  ...
  if (g->strt.nuse >= g->strt.size && g->strt.size <= MAX_INT/2) {
    luaS_resize(L, g->strt.size * 2);
    list = &g->strt.hash[lmod(h, g->strt.size)];  /* recompute with new size */
  }
  ...
  return ts;
}

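String comparison

Because short and long strings are implemented differently, comparing two strings for equality has to distinguish between them. Two strings of different variants (one short, one long) are never equal. If the variants match, the comparison strategy depends on whether the strings are short or long.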

// lvm.c
/*
** Main operation for equality of Lua values; return 't1 == t2'. 
** L == NULL means raw equality (no metamethods)
*/
int luaV_equalobj (lua_State *L, const TValue *t1, const TValue *t2) {
  const TValue *tm;
  // if the types (including variants) differ
  if (ttype(t1) != ttype(t2)) {  /* not the same variant? */
    // if the basic types differ, or the basic type is not number
    if (ttnov(t1) != ttnov(t2) || ttnov(t1) != LUA_TNUMBER)
      return 0;  /* only numbers can be equal with different variants */
    else {  /* two numbers with different variants */
      lua_Number n1, n2;  /* compare them as floats */
      lua_assert(ttisnumber(t1) && ttisnumber(t2));
      cast_void(tofloat(t1, &n1)); cast_void(tofloat(t2, &n2));
      return luai_numeq(n1, n2);
    }
  }
  /* values have same type and same variant */
  switch (ttype(t1)) {
    case LUA_TNIL: return 1;
    ...
    // pick the string comparison strategy according to the variant
    case LUA_TSHRSTR: return eqshrstr(tsvalue(t1), tsvalue(t2));
    case LUA_TLNGSTR: return luaS_eqlngstr(tsvalue(t1), tsvalue(t2));
    ...
    default:
      return gcvalue(t1) == gcvalue(t2);
  }
  if (tm == NULL)  /* no TM? */
    return 0;  /* objects are different */
  luaT_callTM(L, tm, t1, t2, L->top, 1);  /* call TM */
  return !l_isfalse(L->top);
}

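• Comparison strategy for short strings

Short strings are interned, so there is no need to compare their contents; comparing object addresses is enough. Lua implements this efficiently with the eqshrstr macro: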

// lstring.h
/*
** equality for short strings, which are always internalized
*/
#define eqshrstr(a,b)    check_exp((a)->tt == LUA_TSHRSTR, (a) == (b))

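• Comparison strategy for long strings

If two long strings have the same object address, they are the same instance and therefore equal. If the addresses differ, strings of different lengths are different; when the lengths match, a byte-by-byte comparison is required.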

// lstring.h
LUAI_FUNC int luaS_eqlngstr (TString *a, TString *b);

// lstring.c
/*
** equality for long strings
*/
int luaS_eqlngstr (TString *a, TString *b) {
  size_t len = a->len;
  lua_assert(a->tt == LUA_TLNGSTR && b->tt == LUA_TLNGSTR);
  return (a == b) ||  /* same instance or... */
    ((len == b->len) &&  /* equal length and ... */
     (memcmp(getstr(a), getstr(b), len) == 0));  /* equal contents */
}
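Both strategies, plus the variant check in front of them, fit into a short standalone sketch. DemoStr and the demo_eq functions are hypothetical; the is_long flag is set by hand here, whereas Lua derives the variant from the string's length when the string is created.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct DemoStr {
  int is_long;       /* 0 = short (interned), 1 = long */
  size_t len;
  const char *data;
} DemoStr;

/* Short strings are interned, so identity decides equality. */
static int demo_eq_short(const DemoStr *a, const DemoStr *b) {
  assert(!a->is_long && !b->is_long);
  return a == b;
}

/* Long strings: the same object, or the same length and the same bytes. */
static int demo_eq_long(const DemoStr *a, const DemoStr *b) {
  assert(a->is_long && b->is_long);
  return a == b ||
         (a->len == b->len && memcmp(a->data, b->data, a->len) == 0);
}

/* Different variants are never equal; otherwise dispatch on the variant. */
static int demo_eq(const DemoStr *a, const DemoStr *b) {
  if (a->is_long != b->is_long) return 0;
  return a->is_long ? demo_eq_long(a, b) : demo_eq_short(a, b);
}
```

Because the comparison uses the stored length and memcmp, two long strings that differ only after an embedded '\0' are still reported as different.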
