Multi-byte and wide-character strings: converting between string and wstring

Reposted article, 2019-06-15 13:28. Categories: VC&MFC, C&Cpp

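Multi-byte character set (MBCS): a character encoding that uses a variable number of bytes per character. English letters usually take 1 byte, while Chinese and similar scripts take 2 bytes. MBCS encodings are compatible with 7-bit ASCII (values 0 to 127).

In the beginning the Internet had only one character set, ANSI's ASCII. It uses 7 bits per character and defines 128 characters in total, covering English letters, digits, punctuation and other common symbols.

To extend ASCII so that it could display their own languages, different countries and regions drew up their own standards, which produced GB2312, BIG5, JIS and others. On Windows these extended encodings, which represent a Chinese or Japanese character with 2 bytes, are called ANSI encodings, also known as MBCS (Multi-Byte Character Set).

The different ANSI encodings are mutually incompatible, so text in two languages cannot be stored in the same ANSI-encoded document when information is exchanged internationally. A major drawback is that the same code value stands for different characters in different encodings, which easily causes confusion. This led to the birth of Unicode.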

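Wide character set: usually means a character set encoded as Unicode.

Unicode, the "universal code", unifies the character encodings of different countries.

On Windows a Unicode (wide) character is normally stored as a 2-byte UTF-16 code unit. The existing English characters grow from one byte to two simply by filling the high byte with 0. Characters above U+FFFF do not fit in 2 bytes and take two code units, a surrogate pair.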
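To make the 2-byte claim concrete, here is a minimal sketch of how one code point maps to UTF-16 code units. The helper name encode_utf16 is made up for this illustration, and the code assumes its input is a valid code point; it is not taken from any library.

```cpp
#include <cassert>
#include <string>

// Encode a single Unicode code point as UTF-16 code units.
// Hypothetical helper for illustration only.
std::u16string encode_utf16(char32_t cp)
{
    // Precondition: a valid scalar value (in range, not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit. For ASCII the high byte is 0.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Above U+FFFF: split the remaining 20 bits across a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

Under these assumptions encode_utf16(0x41) yields the single unit 0x0041, while a code point such as U+1F600 yields two units, which is why "one wchar_t per character" does not hold for every character on Windows.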

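Unicode came about to unify the encoding of all scripts. It puts every language into a single code space, so text no longer turns into garbage just because two sides assumed different code pages.

Unicode unifies the code space, but a fixed-width encoding of it is inefficient. UCS-4 (one of the Unicode encoding forms), for example, stores every symbol in 4 bytes, so each English letter is preceded by three zero bytes, which wastes storage and bandwidth. UTF-8 was introduced to improve efficiency: it chooses the encoded length per symbol, from 1 to 4 bytes, and an English letter needs only 1 byte.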
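The variable length of UTF-8 is easiest to see in code. The following is a minimal sketch, assuming a valid code point as input; encode_utf8 is a hypothetical name chosen for this example, not a standard function.

```cpp
#include <cassert>
#include <string>

// Encode a single Unicode code point as UTF-8 bytes.
// Hypothetical helper for illustration only.
std::string encode_utf8(char32_t cp)
{
    assert(cp <= 0x10FFFF); // precondition: inside the Unicode range

    std::string out;
    if (cp < 0x80) {
        // 1 byte: identical to ASCII.
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (most Chinese characters)
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

With this sketch the letter A (U+0041) stays one byte, and 漢 (U+6F22) becomes the three bytes E6 BC A2.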

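UTF stands for "Unicode Transformation Format", that is, how the numbers defined by Unicode are turned into program data. UTF-8, UTF-16 and UTF-32 use char, char16_t and char32_t as their code units: 8-bit, 16-bit and 32-bit integers respectively (char16_t and char32_t are unsigned; whether plain char is signed is implementation-defined). (Note: char16_t and char32_t are keywords added in C++11. If your compiler does not support C++11, use unsigned short and unsigned int instead; unsigned long is 32 bits on Windows but 64 bits on most 64-bit Unix systems.) The UTF-8 encoding of "漢字" needs 3 bytes per character, 6 bytes in total. Its UTF-16 encoding needs two char16_t, 4 bytes in total. Its UTF-32 encoding needs two char32_t, 8 bytes in total.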
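The sizes for "漢字" can be checked at compile time. This is a small sketch, assuming a C++11 or later compiler; code_units and the three constants are names invented for the example. The characters are written as universal character names (\u6F22\u5B57) so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>

// Number of code units in a string literal, not counting the terminating null.
// Hypothetical helper for illustration only.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N])
{
    return N - 1;
}

// "漢字" in the three encoding forms.
constexpr std::size_t utf8_units  = code_units(u8"\u6F22\u5B57");
constexpr std::size_t utf16_units = code_units(u"\u6F22\u5B57");
constexpr std::size_t utf32_units = code_units(U"\u6F22\u5B57");

// Total storage in bytes: units multiplied by the size of one unit.
static_assert(utf8_units == 6, "two characters, 3 bytes each in UTF-8");
static_assert(utf16_units * sizeof(char16_t) == 4, "two 2-byte units in UTF-16");
static_assert(utf32_units * sizeof(char32_t) == 8, "two 4-byte units in UTF-32");
```

If any of the three totals were different, the file would fail to compile, so the static_assert lines double as the documentation of the expected sizes.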

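Prefixing a character or string literal with L makes it a wide literal stored as wchar_t (2 bytes per character on Windows; wchar_t is 4 bytes on most Unix-like systems), for example L'看' or L"abc啊". _T("sf飛") becomes a wide literal only when _UNICODE is defined.

Converting between MFC's CString and std::string:

1. With the Unicode character set, CString is equivalent to CStringW; with the multi-byte character set, CString is equivalent to CStringA.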

2. CString --> std::string

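// 1. In a Unicode build: CString --> std::string
// The methods below are alternatives; each snippet stands on its own.
// Method 1 (direct-initialization, so only one user-defined conversion is needed)
CString str = L"sdf";
std::string s(CT2A(str.GetString()));
    // GetString() exists in newer versions of Visual Studio; older ones can use GetBuffer()
    std::string s(CT2A(str.GetBuffer()));
    str.ReleaseBuffer();
// Method 2
CString str = L"dshf";
CStringA stra(str);
std::string s(stra);
// or in one step; the extra parentheses stop the line from being parsed as a function declaration
std::string s((CStringA(str)));

// Method 3
USES_CONVERSION;
CString str = L"djg";
std::string s = W2A(str);
// str first converts to const wchar_t*, then W2A converts const wchar_t* to char*,
// and finally that char* initializes s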

3. std::string --> CStringW / std::wstring

std::string s("dhhh");
CStringW strw(CStringA(s.c_str()));
std::wstring sw(strw);

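1) TCHAR to const wchar_t *: in a Unicode build TCHAR already is wchar_t, so a TCHAR* can be cast directly by writing (const wchar_t *) in front of it. In a multi-byte build TCHAR is char and the cast is wrong; the string has to be converted instead.

2) BSTR: a Unicode string of type OLECHAR*. It is a COM string with a length prefix, related to VB, and is rarely needed outside COM code.

LPSTR: char *, a pointer to an array of 8-bit (single-byte) ANSI characters terminated by '\0'

LPWSTR: wchar_t *, a pointer to an array of 16-bit (double-byte) Unicode characters terminated by '\0'

LPCSTR: const char *

LPCWSTR: const wchar_t *

LPTSTR: either LPSTR or LPWSTR, depending on whether the UNICODE macro is defined

LPCTSTR: either LPCSTR or LPCWSTR, depending on whether the UNICODE macro is defined

The following is copied from the MFC library: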

#ifdef UNICODE 
typedef LPWSTR LPTSTR; 
typedef LPCWSTR LPCTSTR;
#else 
typedef LPSTR LPTSTR; 
typedef LPCSTR LPCTSTR; 
#endif

Conversion macros:

LPWSTR->LPTSTR:    W2T();
LPTSTR->LPWSTR:    T2W();
LPCWSTR->LPCTSTR:  W2CT();
LPCTSTR->LPCWSTR:  T2CW();
ANSI->UNICODE:     A2W();
UNICODE->ANSI:     W2A();

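3)

Converting LPWSTR to LPCSTR

CW2A narrow(pwsz);     // pwsz is the LPWSTR to convert; narrow owns the converted buffer
LPCSTR psz = narrow;   // valid only while narrow is alive

Converting between CString and LPCWSTR (http://www.cnblogs.com/foolboy/archive/2005/07/25/199869.html)

Background:
While using WritePrivateProfileString to write an .ini configuration file, I read in MSDN that for the written settings to take effect immediately, WritePrivateProfileStringW must first be called to make the system re-read the target .ini file. The original text is: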

// force the system to re-read the mapping into shared memory  
// so that future invocations of the application will see it  
//  without the user having to reboot the system  
WritePrivateProfileStringW( NULL, NULL, NULL, L"appname.ini" );

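The prototype of WritePrivateProfileStringW given in MSDN is:

WINBASEAPI BOOL WINAPI WritePrivateProfileStringW ( 
 LPCWSTR lpAppName,  // section: the string inside []
 LPCWSTR lpKeyName,  // key: the string to the left of "="
 LPCWSTR lpString,   // the value to write
 LPCWSTR lpFileName ); // path of the configuration file
For example:
[section]
key=string

Every parameter has type LPCWSTR, but in practice the file name is held in a CString, and that is where the problem arises.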

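Analysis:

LPCWSTR is a pointer to a constant Unicode string. It owns no storage: it merely points at a buffer whose size was fixed when that buffer was created, and writing past the end of that buffer has unpredictable results. This is where it differs from CString, which is a string class that manages its memory automatically. An LPCWSTR is initialized like this:

LPCWSTR Name=L"TestlpCwstr";

Because an LPCWSTR must point to a Unicode string, the key question becomes how to convert between ANSI and Unicode characters, that is, between encodings. The ATL conversion macros can do this as follows: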

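// Method 1: copy into a buffer that the caller owns
CString str=_T("TestStr"); 
USES_CONVERSION; 
LPWSTR pwStr=new wchar_t[str.GetLength()+1]; 
wcscpy(pwStr,T2CW((LPCTSTR)str)); // T2CW accepts the const pointer in both build types
// ... use pwStr, then release it with delete[] pwStr;
 
// Method 2: multi-byte (MBCS) build only, where CString holds char data
CString str=_T("TestStr"); 
USES_CONVERSION; 
LPCWSTR pwcStr = A2CW((LPCSTR)str); // T2CW((LPCTSTR)str) compiles in both build types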

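In a multi-byte build a CString converts implicitly to LPCSTR. A2CW means (LPCSTR) -> (LPCWSTR), and USES_CONVERSION declares the temporary variables the macros need; it must appear before any ATL conversion macro is used.
Going the other way, from LPCWSTR to CString, is even easier. The CString class documentation in MSDN says a CString can be constructed directly from an LPCWSTR, so the conversion can be written as: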

LPCWSTR pcwStr = L"TestpwcStr";
CString str(pcwStr);

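Summary:
The header <atlconv.h> defines all the conversion macros that ATL provides, for example:

  A2CW  (LPCSTR)  -> (LPCWSTR)
  A2W   (LPCSTR)  -> (LPWSTR)
  W2CA  (LPCWSTR) -> (LPCSTR)
  W2A   (LPCWSTR) -> (LPSTR)

The full set of macros is listed in the table below: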

A2BSTR	OLE2A	T2A	W2A
A2COLE	OLE2BSTR	T2BSTR	W2BSTR
A2CT	OLE2CA	T2CA	W2CA
A2CW	OLE2CT	T2COLE	W2COLE
A2OLE	OLE2CW	T2CW	W2CT
A2T	OLE2T	T2OLE	W2OLE
A2W	OLE2W	T2W	W2T

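The macro names in the table follow a regular pattern, and each letter has a precise meaning:

2	"2" sounds like "to", so it is borrowed to mean "convert to".
A	ANSI string, that is, MBCS.
W, OLE	Wide string, that is, UNICODE.
T	The generic type T. If _UNICODE is defined, T means W; if _MBCS is defined, T means A.
C	Short for const.

    These macros make conversions between the various string types quick. Before using them, include the header and declare USES_CONVERSION. They are convenient because the temporary buffer never has to be freed: it is allocated on the stack and released when the enclosing function returns. Given the limited stack size (1 MB by default in VC), keep a few points in mind:
    1. They are only suitable for converting short strings.
    2. Do not convert inside a loop that runs many times.
    3. Do not convert the contents of text files, because files are generally large.
    4. For cases 2 and 3, use MultiByteToWideChar() and WideCharToMultiByte() instead.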

Using MultiByteToWideChar() and WideCharToMultiByte():
www.cnblogs.com/ranjiewen/p/5770639.html

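int MultiByteToWideChar(
    UINT CodePage, // code page of the multi-byte source string. CP_ACP: ANSI code page, CP_UTF8: UTF-8
    DWORD dwFlags, // usually 0
    LPCSTR lpMultiByteStr, // [in] the string to convert
    int cbMultiByte,  // length in bytes of lpMultiByteStr; pass -1 if the string ends with \0
    LPWSTR lpWideCharStr, // [out] buffer that receives the wide string
    int cchWideChar  // size of the lpWideCharStr buffer in wide characters. If 0, no conversion is performed and the function returns the number of wide characters the buffer needs.
    );

int WideCharToMultiByte(
    UINT CodePage,  // code page to convert to
    DWORD dwFlags,  // usually 0
    LPCWSTR lpWideCharStr, // the wide string to convert
    int cchWideChar, // its length in wide characters; pass -1 if it ends with a null character
    LPSTR lpMultiByteStr, // buffer that receives the converted string
    int cbMultiByte,  // size of that buffer in bytes; if 0, the function returns the required size
    LPCSTR lpDefaultChar, // NULL to use the system default character (must be NULL for CP_UTF8)
    LPBOOL lpUsedDefaultChar // NULL if not needed (must be NULL for CP_UTF8)
    );

Example: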

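#include <Windows.h>
#include <cwchar>
#include <fstream>
#include <string>

/// std::string ==> std::wstring (CP_ACP: the ANSI code page)
// With cchWideChar = 0 the call only returns the required length in wide characters.
// Source length -1:       the returned length includes the \0, so allocate wchar_t[nLen].
// Source length s.size(): the returned length excludes the \0, so allocate wchar_t[nLen+1].
// 1. Using -1
std::wstring s2ws(std::string s)
{
    int nLen = ::MultiByteToWideChar(CP_ACP, 0, s.c_str(), -1, NULL, 0);
    if (nLen <= 0)
        return std::wstring(); // conversion failed
    wchar_t *buf = new wchar_t[nLen];
    // no need to zero buf: the conversion writes the \0 as well
    ::MultiByteToWideChar(CP_ACP, 0, s.c_str(), -1, buf, nLen);
    std::wstring ws(buf);
    delete[] buf;
    return ws;
}
// 2. Using s.size() (a separate function here so that both variants compile)
std::wstring s2ws_sized(std::string s)
{
    if (s.empty())
        return std::wstring(); // a source length of 0 makes the API fail
    int nLen = ::MultiByteToWideChar(CP_ACP, 0, s.c_str(), (int)s.size(), NULL, 0);
    if (nLen <= 0)
        return std::wstring(); // conversion failed
    wchar_t *buf = new wchar_t[nLen+1];
    wmemset(buf, 0, nLen+1); // the conversion does not write a \0 here, so zero the buffer first
    ::MultiByteToWideChar(CP_ACP, 0, s.c_str(), (int)s.size(), buf, nLen);
    std::wstring ws(buf);
    delete[] buf;
    return ws;
}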
/// std::wstring ==> std::string
std::string ws2s(std::wstring ws)
{
    int nLen = ::WideCharToMultiByte(CP_ACP, 0, ws.c_str(), -1, NULL, 0, NULL, NULL);
    if (nLen <= 0)
        return std::string(); // conversion failed
    char* buf = new char[nLen];
    ::WideCharToMultiByte(CP_ACP, 0, ws.c_str(), -1, buf, nLen, NULL, NULL);
    std::string s(buf);
    delete[] buf;
    return s;
}
 
///// Converting between different code pages (ANSI: CP_ACP, UTF-8: CP_UTF8)
///// goes through WideCharToMultiByte and MultiByteToWideChar (I have not found another way yet; experts, please advise)
// ANSI ==> UTF8
std::string ANSI_to_UTF8(std::string sAnsi)
{
    std::wstring wsAnsi = s2ws(sAnsi);
    int nLen = ::WideCharToMultiByte(CP_UTF8, 0, wsAnsi.c_str(), -1, NULL, 0, NULL, NULL);
    if (nLen <= 0)
        return std::string(); // conversion failed
    char* buf = new char[nLen];
    ::WideCharToMultiByte(CP_UTF8, 0, wsAnsi.c_str(), -1, buf, nLen, NULL, NULL);
    std::string sUtf8(buf);
    delete[] buf;
    return sUtf8;
}
// UTF8 ==> ANSI
std::string UTF8_to_ANSI(std::string sUtf8)
{
    //std::wstring wsUtf8 = s2ws(sUtf8); // cannot use this: s2ws converts from the ANSI code page
    int nLen = ::MultiByteToWideChar(CP_UTF8, 0, sUtf8.c_str(), -1, NULL, 0);
    if (nLen <= 0)
        return std::string(); // conversion failed
    wchar_t *wbuf = new wchar_t[nLen];
    ::MultiByteToWideChar(CP_UTF8, 0, sUtf8.c_str(), -1, wbuf, nLen);
    std::wstring wsUtf8(wbuf);
    delete[] wbuf;
 
    // ws2s performs the wide ==> ANSI step; the same WideCharToMultiByte(CP_ACP, ...)
    // calls could be written out here instead
    std::string sAnsi = ws2s(wsUtf8);
    return sAnsi;
}
 
int main(int argc, char* argv[])
{
    // assumes this source file is saved in the system ANSI code page, so the literal holds ANSI bytes
    std::string s( "Hello world.你好，中國。");
    std::wstring ws = s2ws(s);
    std::string s1 = ws2s(ws);
    std::string sAnsi(s);
    std::string sUtf8 = ANSI_to_UTF8(sAnsi);
    std::string sAnsi2 = UTF8_to_ANSI(sUtf8);
 
    std::ofstream file("1.txt"); // write the UTF-8 bytes to a file
    file << sUtf8.c_str();
    return 0;
}
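For comparison, UTF-8 can also be decoded by hand, without the Win32 API. The sketch below is only an illustration under stated assumptions: decode_utf8 is a hypothetical name, the input is assumed to be well-formed UTF-8, and the output is UTF-32 code points rather than the UTF-16 wchar_t that Windows uses, so it is not a drop-in replacement for MultiByteToWideChar.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode well-formed UTF-8 into UTF-32 code points.
// Hypothetical helper for illustration; it does not reject malformed input
// such as overlong forms or stray continuation bytes.
std::u32string decode_utf8(const std::string &in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { cp = lead;        extra = 0; } // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; } // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; } // 1110xxxx
        else                  { cp = lead & 0x07; extra = 3; } // 11110xxx

        // The sequence must not run past the end of the input.
        assert(i + extra < in.size());

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Production code should validate its input or use a maintained library; the point here is only to show what the CP_UTF8 conversion has to do for each character.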

The functions above, tidied up:

#include <Windows.h>
#include <string>

// Multi-byte string in code page cp ==> std::wstring
// (mb2ws and ws2mb are helper names introduced in this tidy-up to remove the duplicated code)
static bool mb2ws(UINT cp, const std::string &s, std::wstring &ws)
{
    ws.clear();
    if (s.empty())
        return true;
    int nLen = ::MultiByteToWideChar(cp, 0, s.c_str(), -1, NULL, 0); // -1: nLen includes the \0
    if (nLen <= 0)
        return false; // conversion failed
    wchar_t *buf = new wchar_t[nLen];
    int nWritten = ::MultiByteToWideChar(cp, 0, s.c_str(), -1, buf, nLen); // -1: the \0 is converted too
    if (nWritten == nLen)
        ws = buf;
    delete[] buf;
    return nWritten == nLen;
}
// std::wstring ==> multi-byte string in code page cp
static bool ws2mb(UINT cp, const std::wstring &ws, std::string &s)
{
    s.clear();
    if (ws.empty())
        return true;
    int nLen = ::WideCharToMultiByte(cp, 0, ws.c_str(), -1, NULL, 0, NULL, NULL);
    if (nLen <= 0)
        return false; // conversion failed
    char *buf = new char[nLen];
    int nWritten = ::WideCharToMultiByte(cp, 0, ws.c_str(), -1, buf, nLen, NULL, NULL);
    if (nWritten == nLen)
        s = buf;
    delete[] buf;
    return nWritten == nLen;
}

// std::string ==> std::wstring
bool s2ws(const std::string &s, std::wstring &ws)
{
    return mb2ws(CP_ACP, s, ws);
}
// std::wstring ==> std::string
bool ws2s(const std::wstring &ws, std::string &s)
{
    return ws2mb(CP_ACP, ws, s);
}

///// Converting between code pages (ANSI: CP_ACP, UTF-8: CP_UTF8)
// ANSI ==> UTF8
bool ANSI_to_UTF8(const std::string &sAnsi, std::string &sUtf8)
{
    sUtf8.clear();
    std::wstring wide; // intermediate UTF-16 form
    return mb2ws(CP_ACP, sAnsi, wide) && ws2mb(CP_UTF8, wide, sUtf8);
}
// UTF8 ==> ANSI
bool UTF8_to_ANSI(const std::string &sUtf8, std::string &sAnsi)
{
    sAnsi.clear();
    std::wstring wide; // intermediate UTF-16 form
    return mb2ws(CP_UTF8, sUtf8, wide) && ws2mb(CP_ACP, wide, sAnsi);
}


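Going through _bstr_t (the COM support class from <comutil.h>):

#include <comutil.h>
#include <string>
#pragma comment(lib, "comsuppw.lib")
 
std::string ws2s(const std::wstring& ws)
{
    _bstr_t t = ws.c_str();
    char* pchar = (char*)t;  // _bstr_t converts to the ANSI code page on demand
    std::string result = pchar;
    return result;
}
 
std::wstring s2ws(const std::string& s)
{
    _bstr_t t = s.c_str();
    wchar_t* pwchar = (wchar_t*)t;  // the wide form is what _bstr_t stores internally
    std::wstring result = pwchar;
    return result;
}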
--------------------- 
Original: https://blog.csdn.net/liminwang0311/article/details/79975174

Using MFC's CString:

#include <atlstr.h>
std::string ws2s(std::wstring ws)
{
	return std::string(CStringA(CStringW(ws.c_str())));
}
std::wstring s2ws(std::string s)
{
	return std::wstring(CStringW(CStringA(s.c_str())));
}

// In effect: ws => const wchar_t* => CStringW => LPCWSTR => CStringA => LPCSTR => string
std::string s(CStringA(CStringW(ws.c_str())));
std::wstring ws(CStringW(CStringA(s.c_str())));
// CStringW => LPCWSTR and CStringA => LPCSTR happen implicitly, because CString defines a conversion operator to its const character pointer type
	LPCSTR pStr = "kkk"; //LPCSTR == const char*
	LPCWSTR pwStr = L"hhh"; //LPCWSTR == const wchar_t*
	CStringA a(pwStr);//"hhh"
	CStringW w(pStr);//L"kkk"

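The simplest std::string <--> std::wstring conversion uses basic_string's iterator-range constructor
(Note: it does not support Chinese. Each element is copied individually, so the result is only correct for ASCII.)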

	std::string s("hello world.");
	std::wstring ws(s.begin(), s.end());

	std::wstring ws2(L"hello China.");
	std::string s2(ws2.begin(), ws2.end());
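Why this shortcut breaks on Chinese text can be shown with a small portable check. This is a sketch only; widen_per_char and check_widen_per_char are hypothetical names for the illustration, and the non-ASCII input is written as explicit UTF-8 bytes so that the source-file encoding does not matter.

```cpp
#include <cassert>
#include <string>

// Widen by copying each char into its own wchar_t, which is all the
// iterator-range constructor does. No code page is consulted.
std::wstring widen_per_char(const std::string &s)
{
    return std::wstring(s.begin(), s.end());
}

// Hypothetical self-check, named for this illustration only.
void check_widen_per_char()
{
    // ASCII survives: every char has the same value as its wide counterpart.
    assert(widen_per_char("hello") == L"hello");

    // "\xE6\xBC\xA2" is the UTF-8 encoding of U+6F22, a single character.
    // Copying byte by byte yields three wide elements, none of them U+6F22.
    const std::wstring w = widen_per_char("\xE6\xBC\xA2");
    assert(w.size() == 3);
    assert(w != std::wstring(1, static_cast<wchar_t>(0x6F22)));
}
```

The same holds for GBK or any other multi-byte code page: the bytes of one character end up as separate wide elements, so a real conversion function is needed for anything beyond ASCII.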
