SAS optimization tips (2): reducing data storage space with LENGTH, COMPRESS, REUSE, and DATA step views


1: Controlling the storage size of SAS data sets

 

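1.1: Reducing the storage space of character variables

How does SAS store character variables?

When a character variable is created by an assignment statement, its length is taken from the first value assigned to it.

For example, after name='yi' the length of name is 2. If you later assign name='can', only 'ca' is stored.

For character values read with list input from DATALINES or from an external file, SAS uses a default length of 8 bytes unless a LENGTH statement or an informat says otherwise.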

 

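Use the LENGTH statement to change the length. In a DATA step it only takes effect for a character variable if it appears before the first reference to that variable.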
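The truncation described above, and the fix with LENGTH, can be sketched as follows. This is a minimal illustration; the data set names work.short_name and work.full_name and the length of 10 are assumptions chosen for the example.

```sas
/* Without LENGTH: name gets length 2 from the first assignment */
data work.short_name;
  name = 'yi';
  output;
  name = 'can';      /* stored as 'ca' because name is only 2 bytes long */
  output;
run;

/* With LENGTH placed before the first reference to name */
data work.full_name;
  length name $ 10;  /* 10 is an arbitrary length chosen for this sketch */
  name = 'yi';
  output;
  name = 'can';      /* now stored in full */
  output;
run;

/* Check the stored length of name in the output data set */
proc contents data=work.full_name;
run;
```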

 

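1.2: Reducing the storage space of numeric variables

How does SAS store numeric variables?

By default SAS stores numeric variables with a length of 8 bytes.

 

Changing the length with LENGTH, and the scope of its effect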

Numeric variables always have a length of 8 bytes in the program data vector and during processing.

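In other words, the LENGTH statement does not change how numeric values are handled while the step runs: whatever length a value was stored with, it is expanded to 8 bytes in the program data vector.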

keep in mind that the LENGTH statement affects the length of a numeric variable only in the output data set.

It only affects the length of the variable in the output data set.

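You should never use the LENGTH statement to reduce the length of your numeric variables if the values are not integers.

Do not shorten numeric variables that hold non-integer values (for example 0.1): the stored value is truncated and precision is lost. Reduce the length only for variables whose values are all integers.

Even for integers, the largest value that can be stored exactly depends on the length you choose, so check the range of your data first.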
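As a rough sketch of the rule above, the step below shortens only an integer-valued variable and leaves a fractional one at the default 8 bytes. The names work.small_ints, id, and amount are hypothetical. The choice of 4 bytes assumes a Windows or UNIX host, where the SAS documentation gives 2,097,152 as the largest integer a 4-byte numeric represents exactly; other hosts may differ.

```sas
data work.small_ints;
  length id 4;            /* id is stored in 4 bytes in the output data set */
  do id = 1 to 1000;      /* integers well inside the exact 4-byte range    */
    amount = id * 0.1;    /* non-integer: left at the default 8 bytes       */
    output;
  end;
run;

/* The Len column should show 4 for id and 8 for amount */
proc contents data=work.small_ints;
run;
```

During the loop itself id is still an 8-byte value in the program data vector; the shorter length applies only when the observation is written out.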

 

To check whether precision was lost after using LENGTH, compare the original and shortened data sets with PROC COMPARE.
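One way such a check might look is sketched below. The data sets work.full and work.truncated are made up for the illustration, and the values i/3 are deliberately non-integer so that shortening them should produce differences.

```sas
/* Reference data stored at the default 8-byte length */
data work.full;
  do i = 1 to 5;
    x = i / 3;          /* fractional values */
    output;
  end;
run;

/* Same data, but x is stored in only 4 bytes */
data work.truncated;
  length x 4;
  set work.full;
run;

/* Reports differing attributes and any values that no longer match */
proc compare base=work.full compare=work.truncated;
run;
```

If PROC COMPARE reports unequal values for a variable, that variable should not be shortened to the tested length.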

 

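2: Compressing data files

Compressing a data file makes it smaller, which reduces the number of I/O operations. However, every read has to decompress the data, which costs extra CPU time. Compression is also not guaranteed to shrink the file; it can make it larger.

What does an uncompressed data file look like?

1: Every value of a given variable occupies the same number of bytes, and every observation occupies the same number of bytes.

2: Character values are padded with blanks; numeric values are padded with binary zeros.

3: Each page has a 16-byte overhead at its beginning.

4: The descriptor portion is stored at the end of the first page.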

 

What does a compressed data file look like?

1: Treat an observation as a single string of bytes by ignoring variable types and boundaries.
2: Collapse consecutive repeating characters and numbers into fewer bytes.
3: Contain a 28-byte overhead at the beginning of each page.
4: Contain a 12-byte- or 24-byte-per-observation overhead following the page overhead. This space is used for deletion status, compressed length, pointers, and flags.

 

Which data sets are good candidates for compression?

- It is large.
- It contains many long character values.
- It contains many values that have repeated characters or binary zeros.
- It contains many missing values.
- It contains repeated values in variables that are physically stored next to one another.

 

Which data sets are poor candidates for compression?

- few repeated characters
- small physical size
- few missing values
- short text strings.

 

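COMPRESS= settings

COMPRESS=YES (or CHAR) uses run-length encoding. It suits data sets with many numeric variables that contain a lot of zeros, or many character variables whose values contain a lot of blanks.

COMPRESS=BINARY uses Ross Data Compression and is particularly suited to medium to large numeric data.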

Use binary compression only if the observation length is several hundred bytes or more.
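The settings above might be tried out as in the following sketch. The data set names and the 200-byte comment variable are assumptions made for the example; the point is only to show where the option goes.

```sas
/* Long, mostly blank character values: a typical case for COMPRESS=CHAR */
data work.orders_char (compress=char);
  length comment $ 200;
  do order_id = 1 to 10000;
    comment = 'ok';       /* 2 characters followed by 198 blanks */
    output;
  end;
run;

/* The same data written with binary compression, for comparison */
data work.orders_bin (compress=binary);
  set work.orders_char;
run;

/* COMPRESS= can also be set as a system option for all new data sets */
options compress=yes;
```

After each step SAS normally writes a note to the log saying by what percentage compression decreased or increased the size of the data set, which is the quickest way to see whether a given setting pays off.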

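Access to a compressed data set by observation number (direct access) is controlled by the POINTOBS= option. If you do not need direct access, it is best to avoid it, which you can do with POINTOBS=NO.

POINTOBS= only has an effect on compressed data sets.

Like COMPRESS=, REUSE= can be specified in two forms: as a system option and as a data set option.

What it does: track and reuse free space within the data set when you delete or update observations. If the REUSE= option is set to YES, observations that are added to the SAS data set are inserted wherever enough free space exists, instead of at the end of the SAS data set.

Specifying REUSE=YES implies POINTOBS=NO.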
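A small sketch of REUSE= in use follows. All data set and variable names here are hypothetical. It creates a compressed data set with REUSE=YES, deletes some observations to free space, and then appends a new one.

```sas
/* Compressed data set that tracks and reuses freed space */
data work.events (compress=yes reuse=yes);
  length note $ 100;
  do id = 1 to 100;
    note = 'initial';
    output;
  end;
run;

/* Deleting observations leaves free space inside the file */
proc sql;
  delete from work.events where id <= 10;
quit;

/* One new observation with the same variables */
data work.new_events;
  length note $ 100;
  id = 101;
  note = 'added later';
  output;
run;

/* With REUSE=YES the new observation may be placed in freed space */
/* rather than at the end of work.events                           */
proc append base=work.events data=work.new_events;
run;
```

Because new observations are not guaranteed to land at the end, do not rely on physical observation order in a data set created with REUSE=YES.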

 

3: Views

A view contains only descriptor information about the data and instructions on how to retrieve data values that are stored elsewhere.

A DATA step view can be created only in a DATA step. A DATA step view cannot contain global statements, host-specific data set options, or most host-specific FILE and INFILE statements. Also, a DATA step view cannot be indexed or compressed.

The VIEW= option tells SAS to compile, but not to execute, the source program and to store the compiled code in the input DATA step view that is named in the option.
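As an illustration of VIEW=, the step below defines a DATA step view over the sample table sashelp.shoes that ships with SAS. The view name work.high_sales and the sales threshold are assumptions made for this sketch.

```sas
/* Compiles and stores the step as a view; no data is read yet */
data work.high_sales / view=work.high_sales;
  set sashelp.shoes;
  where sales > 100000;
run;

/* The stored step executes only now, when the view is read */
proc print data=work.high_sales (obs=5);
run;
```

The view itself stores no data values, so it takes very little space, at the price of re-running the step every time the view is referenced.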

Describing a view

/* DESCRIBE writes the stored source of the view to the SAS log */
data view=company.newdata;
  describe;
run;

 

