[Erlang 0107] Implementing Text Truncation in Erlang

2013-10-11 08:42, tags: string truncation / erlang / Erlang / Chinese text truncation

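I'm taking some time to work through a backlog of notes. A while ago a user, @稻草人, asked about string truncation: "Which function do you all usually use to take a substring in Erlang?" Someone suggested string:substr/3, and he then added: "My string mixes Chinese characters and letters. I want to truncate it, but whatever method I use, each Chinese character has length 3 and each letter has length 1, and the truncated result is always garbled. Any advice would be appreciated." Let's work through the problem step by step.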

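First let's see what happens in Eshell. The result looks ideal, but because the Erlang shell and source files handle encodings differently, we still need to write some code in a file and test it: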

6> string:substr("abcd我们就是喜欢Erlang,就是喜欢OTP",4,3).
[100,25105,20204]
7> io:format("~ts",[v(6)]).
d我们ok

Paste the same code into a file, compile it, run it, and look at the result:

Eshell V5.10.2  (abort with ^G)
1> u:sub().
[100,230,136]
2> io:format("~ts",[v(1)]).
dæok
3> q().
ok
4>

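Surprised? This happens because Eshell reads the input as UTF-8, while the Erlang compiler parses source code as ISO-latin-1 (ISO-8859-1). Each Chinese character in the file is therefore seen as three separate bytes, and taking three elements starting at position 4 yields "d" plus the first two bytes of the first Chinese character, which is the inconsistency shown above. How do we fix it? If you are also on R16B01 or later, it is simple: just add a file encoding declaration at the top of the source file: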

%% coding: utf-8

Recompile, then run it again:

Eshell V5.10.2  (abort with ^G)
1> u:sub().
[100,25105,20204]
2> io:format("~ts",[v(1)]).
d我们ok
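
The difference between the two runs comes down to whether the string literal is seen as code points or as bytes. A minimal sketch that reproduces both results independently of the source file encoding, by writing the two Chinese characters as explicit code points (the module name substr_sketch and its function names are made up for illustration):

```erlang
-module(substr_sketch).
-export([by_codepoints/0, by_bytes/0]).

%% "abcd" followed by the code points 25105 and 20204, given as integers
%% so the result does not depend on how this file is decoded.
text() -> "abcd" ++ [25105, 20204].

%% What a UTF-8 aware reader sees: one list element per character.
by_codepoints() ->
    string:substr(text(), 4, 3).

%% What a latin1 reader sees: one list element per UTF-8 byte.
by_bytes() ->
    Bytes = binary_to_list(unicode:characters_to_binary(text())),
    string:substr(Bytes, 4, 3).
```

by_codepoints/0 gives the list from the Eshell session, and by_bytes/0 gives the list from the compiled module without the coding declaration.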

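The encoding declaration is handled by Erlang's epp module. epp is the preprocessor for Erlang source code (it handles macro substitution and include files), so it is the first component to run into encoding issues. The code snippet below shows that the default encoding is latin1 and that the currently supported encoding options are latin-1 and utf-8: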

-define(DEFAULT_ENCODING, latin1).

-spec default_encoding() -> source_encoding().

default_encoding() ->
    ?DEFAULT_ENCODING.

-spec encoding_to_string(Encoding) -> string() when
      Encoding :: source_encoding().

encoding_to_string(latin1) -> "coding: latin-1";
encoding_to_string(utf8) -> "coding: utf-8".

The R16B documentation for the epp module adds the following note:

The Erlang source file encoding is selected by a comment in one of the first two lines of the source file. The first string that matches the regular expression coding\s*[:=]\s*([-a-zA-Z0-9])+ selects the encoding. If the matching string is not a valid encoding it is ignored. The valid encodings are Latin-1 and UTF-8 where the case of the characters can be chosen freely.

http://www.erlang.org/doc/man/epp.html#encoding
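
To make the rule concrete, here is a rough sketch of checking a single line for a coding declaration with the re module. It is not epp's actual implementation: the module name coding_sketch, the function declared_encoding/1 and the slightly adjusted capture group (which captures the whole encoding name) are assumptions for illustration only.

```erlang
-module(coding_sketch).
-export([declared_encoding/1]).

%% Return utf8, latin1 or none for one line of source text.
%% Hypothetical helper; epp itself only inspects the first two lines
%% of a file and only looks inside comments.
declared_encoding(Line) ->
    Pattern = "coding\\s*[:=]\\s*([-a-zA-Z0-9]+)",
    case re:run(Line, Pattern, [{capture, all_but_first, list}]) of
        {match, [Name]} -> normalize(string:to_lower(Name));
        nomatch -> none
    end.

%% The case of the characters can be chosen freely, so compare lowercase.
normalize("utf-8") -> utf8;
normalize("latin-1") -> latin1;
normalize(_) -> none.
```

For example, declared_encoding("%% coding: utf-8") returns utf8, and a line without a declaration returns none.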

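In other languages you can find some rather odd practices, such as an XX management system that defines all of its classes and properties in Chinese to achieve so-called "Chinese programming". In Erlang, for now only strings and comments can use Unicode; everything else still has to stay within the ISO-latin-1 range. From the documentation: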

As of Erlang/OTP R16 Erlang source files can be written in either UTF-8 or bytewise encoding (a.k.a. latin1 encoding). The details on how to state the encoding of an Erlang source file can be found in epp(3). Strings and comments can be written using Unicode, but functions still have to be named using characters from the ISO-latin-1 character set and atoms are restricted to the same ISO-latin-1 range. These restrictions in the language are of course independent of the encoding of the source file. Erlang/OTP R18 is expected to handle functions named in Unicode as well as Unicode atoms.

http://www.erlang.org/doc/apps/stdlib/unicode_usage.html

Versions before R16B

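The fix above will please some readers and frustrate others: not everyone has upgraded to R16B, and some of its breaking changes in particular have led many people to postpone the upgrade. So how do we solve the problem on earlier versions? I used ErlyDTL in the past, and it offers a feature that is common on the Web: truncating text that is too long and showing an ellipsis. Following that lead, I found the code in erlydtl_filters.erl and extracted its two text truncation functions in full. One truncates by characters, the other by words. The code is as follows: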

-module(u).
-compile(export_all).
test() ->
  t("abcd我们就是喜欢Erlang,就是喜欢OTP",10).

test2() ->
  tw("Youth is not a time of life; it is a state of mind; it is not a matter of
rosy cheeks, red lips and supple knees; it is a matter of the will, a
quality of the imagination, a vigor of the emotions; it is the freshness of
the deep springs of life.",10).


dump(FileName,Data)->
  file:write_file(FileName, io_lib:fwrite("~s.\n", [Data])).


sub()->
  string:substr("abcd我们就是喜欢Erlang,就是喜欢OTP",4,3).


t(Input,Max) ->
  truncatechars(Input,Max).

tw(Input,Max) ->
  truncatewords(Input,Max).


%% @doc Truncates a string after a certain number of characters.
truncatechars(_Input, Max) when Max =< 0 ->
    "";
truncatechars(Input, Max) when is_binary(Input) ->
    list_to_binary(truncatechars(binary_to_list(Input), Max));
truncatechars(Input, Max) ->
    truncatechars(Input, Max, []).

%% @doc Truncates a string after a certain number of words.
truncatewords(_Input, Max) when Max =< 0 ->
    "";
truncatewords(Input, Max) when is_binary(Input) ->
    list_to_binary(truncatewords(binary_to_list(Input), Max));
truncatewords(Input, Max) ->
    truncatewords(Input, Max, []).

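%% The input is a list of UTF-8 bytes, not code points. Every character must
%% cost exactly one unit of CharsLeft, so a lead byte pre-credits CharsLeft
%% and the -1 applied by each continuation byte nets out to -1 per character.
truncatechars([], _CharsLeft, Acc) ->
    lists:reverse(Acc);
truncatechars(_Input, 0, Acc) ->
    lists:reverse("..." ++ Acc);
%% 1111110x and 111110xx: lead bytes of the obsolete 6- and 5-byte forms.
truncatechars([C|Rest], CharsLeft, Acc) when C >= 2#11111100 ->
    truncatechars(Rest, CharsLeft + 4, [C|Acc]);
truncatechars([C|Rest], CharsLeft, Acc) when C >= 2#11111000 ->
    truncatechars(Rest, CharsLeft + 3, [C|Acc]);
%% 11110xxx: lead byte of a 4-byte character (3 continuation bytes follow).
truncatechars([C|Rest], CharsLeft, Acc) when C >= 2#11110000 ->
    truncatechars(Rest, CharsLeft + 2, [C|Acc]);
%% 1110xxxx: lead byte of a 3-byte character (2 continuation bytes follow).
truncatechars([C|Rest], CharsLeft, Acc) when C >= 2#11100000 ->
    truncatechars(Rest, CharsLeft + 1, [C|Acc]);
%% 110xxxxx: lead byte of a 2-byte character (1 continuation byte follows).
truncatechars([C|Rest], CharsLeft, Acc) when C >= 2#11000000 ->
    truncatechars(Rest, CharsLeft, [C|Acc]);
%% 0xxxxxxx (ASCII) or 10xxxxxx (continuation byte): consume one unit.
truncatechars([C|Rest], CharsLeft, Acc) ->
    truncatechars(Rest, CharsLeft - 1, [C|Acc]).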

truncatewords(Value, _WordsLeft, _Acc) when is_atom(Value) ->
    Value;
truncatewords([], _WordsLeft, Acc) ->
    lists:reverse(Acc);
truncatewords(_Input, 0, Acc) ->
    lists:reverse("..." ++ Acc);
truncatewords([C1, C2|Rest], WordsLeft, Acc) when C1 =/= $\  andalso C2 =:= $\  ->
    truncatewords([C2|Rest], WordsLeft - 1, [C1|Acc]);
truncatewords([C1|Rest], WordsLeft, Acc) ->
    truncatewords(Rest, WordsLeft, [C1|Acc]).

The test code is as follows:

  
test() ->
  t("abcd我们就是喜欢Erlang,就是喜欢OTP",10).

dump(FileName,Data)->
  file:write_file(FileName, io_lib:fwrite("~s.\n", [Data])).

 
Eshell V5.10.2  (abort with ^G)
1> u:test().
[97,98,99,100,230,136,145,228,187,172,229,176,177,230,152,
175,229,150,156,230,172,162,46,46,46]
2>
2> u:dump("u_result",v(1)).
ok
3>

Is the result correct this time? A list such as [97,98,99,100,230,136,145,228,187,172,229,176,177,230,152,175,229,150,156,230,172,162,46,46,46] is hard to judge by eye, so we use the dump function above to write it to a text file and look at the contents:

[root@nimbus demo]# cat u_result
abcd我们就是喜欢....
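
As an alternative to writing a file, the byte list can be decoded back into code points directly. A small sketch, assuming the list really is valid UTF-8 (the module name decode_sketch and the function to_codepoints/1 are hypothetical):

```erlang
-module(decode_sketch).
-export([to_codepoints/1]).

%% Interpret a list of bytes as UTF-8 and return the list of code points.
%% For invalid input, unicode:characters_to_list/2 returns an error or
%% incomplete tuple instead of a list.
to_codepoints(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).
```

In the shell, io:format("~ts~n", [decode_sketch:to_codepoints(v(1))]) then prints the truncated text without the extra trailing period that dump appends.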

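The result is correct. Now let's look at how ErlyDTL implements it. UTF-8 is a variable-length encoding of Unicode: characters in different code point ranges are encoded with different numbers of bytes. Below we walk through the encoding of the two characters "开心". The following table maps code point ranges to encoding templates; for more information see: http://www.cl.cam.ac.uk/~mgk25/unicode.html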

Unicode code point (hex)	UTF-8 byte template
000000 - 00007F	0xxxxxxx
000080 - 0007FF	110xxxxx 10xxxxxx
000800 - 00FFFF	1110xxxx 10xxxxxx 10xxxxxx
010000 - 10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
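
The table can be turned into code almost row by row. The following is a sketch only, with a made-up module name utf8_sketch and function encode/1; it fills the x positions of the matching template with the bits of the code point and does not reject the surrogate range D800 - DFFF, which a real encoder would.

```erlang
-module(utf8_sketch).
-export([encode/1]).

%% One clause per table row: split the code point into bit groups and
%% place them behind the fixed prefix bits of the template.
encode(CP) when CP >= 0, CP =< 16#7F ->
    <<CP>>;
encode(CP) when CP >= 16#80, CP =< 16#7FF ->
    <<A:5, B:6>> = <<CP:11>>,
    <<2#110:3, A:5, 2#10:2, B:6>>;
encode(CP) when CP >= 16#800, CP =< 16#FFFF ->
    <<A:4, B:6, C:6>> = <<CP:16>>,
    <<2#1110:4, A:4, 2#10:2, B:6, 2#10:2, C:6>>;
encode(CP) when CP >= 16#10000, CP =< 16#10FFFF ->
    <<A:3, B:6, C:6, D:6>> = <<CP:21>>,
    <<2#11110:5, A:3, 2#10:2, B:6, 2#10:2, C:6, 2#10:2, D:6>>.
```

For the code point 24320, encode/1 returns <<229,188,128>>, the same bytes that unicode:characters_to_binary/1 produces for that character.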

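First we produce the data we need in Eshell. Both code points fall in the range 000800 - 00FFFF, so the three-byte template applies: pad the binary form to 16 bits and distribute it over the x positions as 4 + 6 + 6 bits. For 24320 that is 0101 111100 000000, which gives 11100101 10111100 10000000, i.e. the bytes 229,188,128 seen in the session: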

Eshell V5.10.2  (abort with ^G)
1> unicode:characters_to_binary("开心").
<<229,188,128,229,191,131>>
2> unicode:characters_to_list("开心").
[24320,24515]
3> integer_to_list(24320,2).
"101111100000000"
4> integer_to_list(24515,2).
"101111111000011"
5> integer_to_list(23383,2).
"101101101010111"

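The key point of truncatechars is that it makes its decisions according to the code point ranges and variable-length byte rules above, which is visible in the guards. Implementations in other languages mostly follow the same idea. truncatewords truncates by words, so it has to detect word boundaries. For example, truncating the text "Youth is not a time of life; it is a state of mind; it is not a matter of rosy cheeks, red lips and supple knees; it is a matter of the will, a quality of the imagination, a vigor of the emotions; it is the freshness of the deep springs of life." by words gives "Youth is not a time of life; it is a...", and there is not much more to say about it.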
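
On a UTF-8 binary the same per-character counting can also be sketched with the utf8 segment type of the bit syntax, which decodes one code point at a time so that no lead-byte arithmetic is needed. This is an alternative sketch, not ErlyDTL's code: the module name trunc_sketch and the function truncate/2 are hypothetical, and invalid UTF-8 input simply fails with a function_clause error.

```erlang
-module(trunc_sketch).
-export([truncate/2]).

%% Keep at most Max characters of a UTF-8 binary and append "..." when
%% something was cut off.
truncate(Bin, Max) when is_binary(Bin), Max > 0 ->
    truncate(Bin, Max, <<>>);
truncate(Bin, _Max) when is_binary(Bin) ->
    <<>>.

truncate(<<>>, _Left, Acc) ->
    Acc;
truncate(_Rest, 0, Acc) ->
    <<Acc/binary, "...">>;
%% C/utf8 matches one whole character, whatever its byte length.
truncate(<<C/utf8, Rest/binary>>, Left, Acc) ->
    truncate(Rest, Left - 1, <<Acc/binary, C/utf8>>).
```

Because the counting happens on decoded characters, a three-byte Chinese character and a one-byte letter each use up exactly one unit of the limit.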

