SAS optimization tips (3): sorting


1:Avoiding unnecessary sorts

The following four techniques let you avoid a sort.

1.1:BY-group processing with an index to avoid a sort

The BY statement does not use the index in the following cases:

The BY statement includes the DESCENDING or NOTSORTED option, or SAS detects that the data file is physically stored in sorted order on the BY variables.

Pros and cons of using an index for BY-group processing

Pro: the data does not have to be sorted first.

Cons:

1:It is generally less efficient than sequentially reading a sorted data set, because processing BY groups typically means retrieving the entire file.
2:It requires storage space for the index.
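
As a rough illustration of the idea, the sketch below builds a simple index and then does BY-group processing without calling PROC SORT. The table and variable names (work.class, work.counts, n) are made up for the example; sashelp.class is the sample table shipped with SAS.

```sas
/* Write index-usage notes to the log */
options msglevel=i;

/* Copy a sample table into WORK so the index can be created there */
data work.class;
    set sashelp.class;
run;

/* Create a simple index on the BY variable */
proc datasets library=work nolist;
    modify class;
    index create sex;
quit;

/* No PROC SORT: the BY statement can read the rows through the index */
data work.counts;
    set work.class;
    by sex;
    if first.sex then n = 0;
    n + 1;                      /* running count within the BY group */
    if last.sex then output;
    keep sex n;
run;
```

With MSGLEVEL=I set, the log should indicate whether the index was actually used for the BY statement.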


1.2:BY-group processing with the NOTSORTED option/GROUPFORMAT option

BY variable(s) <GROUPFORMAT> <NOTSORTED>;

The NOTSORTED option specifies that observations that have the same BY value are grouped together but are not necessarily sorted in alphabetical or numeric order.

The NOTSORTED option works best when observations that have the same BY value are stored together.

Notes:

The NOTSORTED option turns off sequence checking. If your data is not grouped, using the NOTSORTED option can produce a large amount of output.

The NOTSORTED option cannot be used with the MERGE or UPDATE statements.

 

GROUPFORMAT option

The GROUPFORMAT option uses the formatted values of a variable instead of the internal values to determine where a BY group begins and ends.

In other words, the groups are formed from the formatted values of the variable, not from the raw values stored in the data set.

by order_date groupformat notsorted;
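
A small sketch of that BY statement in context, assuming a made-up table work.orders whose dates are displayed with the MONYY7. format. The rows are grouped by month but not sorted as a whole, which is the case NOTSORTED is meant for.

```sas
/* Hypothetical sample data: dates grouped by month */
data work.orders;
    input order_date :date9. amount;
    format order_date monyy7.;
    datalines;
03JAN2020 10
17JAN2020 20
05FEB2020 5
21FEB2020 7
02JAN2021 4
;
run;

/* One output row per run of identical formatted values (one per month) */
data work.monthly;
    set work.orders;
    by order_date groupformat notsorted;
    if first.order_date then total = 0;
    total + amount;             /* accumulate within the month */
    if last.order_date then output;
    keep order_date total;
run;
```

Without GROUPFORMAT, every distinct date would start its own BY group, because the internal values differ even when the formatted month is the same.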

 

1.3:Using a CLASS statement

Sorting the data beforehand gives almost no benefit for a CLASS statement, but it helps a BY statement a great deal.
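
To make the contrast concrete, here is a minimal sketch using PROC MEANS on the sashelp.class sample table. The output table name work.class_sorted is only for illustration.

```sas
/* CLASS: works directly on unsorted data */
proc means data=sashelp.class mean;
    class sex;
    var height;
run;

/* BY: the data must first be sorted (or indexed) on the BY variable */
proc sort data=sashelp.class out=work.class_sorted;
    by sex;
run;

proc means data=work.class_sorted mean;
    by sex;
    var height;
run;
```

Both steps report the mean height per value of sex; only the BY version needs the extra sort.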

 

 

1.4:Using the SORTEDBY= data set option

If you are working with input data that is already sorted, you can specify how the data is ordered by using the SORTEDBY= data set option.

Although the SORTEDBY= option does not sort a data set, it sets the value of the Sorted flag. It does not set the value of the Validated sort flag. (PROC SORT sets the Validated sort flag.)

data company.transactions (sortedby=invoice);

Here invoice is the column the data is already sorted by; the option declares that the data set is already in that order.
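
A self-contained sketch of the same idea, using a made-up work.transactions table whose rows are typed in invoice order. PROC CONTENTS is used here only to inspect the sort information that the option records.

```sas
/* Rows are entered in ascending invoice order, so the assertion is true */
data work.transactions (sortedby=invoice);
    input invoice amount;
    datalines;
1001 25
1002 40
1003 15
;
run;

/* The sort information section lists the sort key and the validated flag */
proc contents data=work.transactions;
run;
```

SAS takes the SORTEDBY= assertion on trust, so it should only be used when the data really is in that order.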

 

 

Space requirements for sorting

When data is sorted, SAS requires enough space in the data library for two copies of the data file that is being sorted, as well as additional workspace. In total this comes to roughly four times the size of the original data set. This applies when disk space is used to sort the data.

 

 

2:Multithreaded sorting

PROC SORT DATA=SAS-data-set-name THREADS | NOTHREADS;

How a threaded sort works

When a threaded sort is used, the observations in the input data set are divided into equal temporary subsets, based on the number of processors that are allocated to the SORT procedure. Each subset is then sorted on a different processor. The sorted subsets are then interleaved to re-create the sorted version of the input data set.

 

Setting CPUCOUNT= higher than the actual number of CPUs degrades performance.

OPTIONS CPUCOUNT=n | ACTUAL;
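
Put together, a threaded sort might be requested as in the sketch below. It uses the sashelp.cars sample table, and the output name work.cars_sorted is arbitrary.

```sas
/* Let SAS use the number of CPUs that are actually available */
options cpucount=actual;

/* Request a threaded sort for this step */
proc sort data=sashelp.cars out=work.cars_sorted threads;
    by make model;
run;
```

On a table this small threading is unlikely to make any visible difference; the point is only to show where the options go.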

 

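3:Sorting large data sets

If there is not enough space to sort a large data set in one pass, it can be sorted in chunks.

When recombining the chunks: if the data set was split by observation number (OBS=), the sorted pieces cannot be combined with PROC APPEND, because their value ranges overlap. They have to be interleaved instead.

For the five ways of splitting the data, see the SAS advanced programming material.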
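
A minimal sketch of the split-by-observation-number case, using the sashelp.cars sample table as a stand-in for a large data set. The cut point (200) and the names work.part1, work.part2 and work.cars_sorted are chosen only for illustration.

```sas
/* Sort the first 200 observations */
proc sort data=sashelp.cars (firstobs=1 obs=200) out=work.part1;
    by msrp;
run;

/* Sort the remaining observations */
proc sort data=sashelp.cars (firstobs=201) out=work.part2;
    by msrp;
run;

/* Interleave the two sorted pieces: SET with BY keeps the result sorted */
data work.cars_sorted;
    set work.part1 work.part2;
    by msrp;
run;
```

Appending work.part2 to work.part1 would simply stack the two pieces, so the combined table would not be in msrp order.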

 

 

Sorting with TAGSORT (multithreading is not supported)

PROC SORT DATA=SAS-data-set-name TAGSORT;

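How it works: the TAGSORT option stores only the BY variables and the observation numbers in temporary files. The BY variables and the observation numbers are called tags. At the completion of the sorting process, PROC SORT uses the tags to retrieve records from the input data set in sorted order.

Compared with the first approach (sorting in chunks), TAGSORT uses more time but less space.

Compared with a regular sort, it takes much more time and much more I/O when the data set is badly out of order. If the data is already mostly sorted, time and I/O are only slightly higher.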
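
For reference, a TAGSORT call on the sashelp.cars sample table could look like the sketch below. The output name work.cars_tag is arbitrary.

```sas
/* Only the BY variable and observation numbers go to the temporary files */
proc sort data=sashelp.cars out=work.cars_tag tagsort;
    by make;
run;
```

The saving matters most when the observations are long and the BY variables are short, since the temporary files then hold only a small part of each record.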

 


 

 

4:Removing duplicates efficiently

4.1:Using the NODUPKEY Option

With NODUPKEY, PROC SORT compares all BY-variable values for each observation to those for the previous observation that was written to the output data set. Observations whose BY values match are not written.

PROC SORT DATA=SAS-data-set-name NODUPKEY;

4.2:Using the NODUPRECS (NODUP) Option

The NODUPRECS option compares all of the variable values for each observation to those for the previous observation that was written to the output data set.

PROC SORT DATA=SAS-data-set-name NODUPRECS;

Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables. (In other words, the option only removes duplicates that are adjacent after the sort.)
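
The difference between the two options can be seen on a tiny made-up table, work.dups, with four rows. The output names work.by_key and work.by_all are illustrative.

```sas
data work.dups;
    input id value;
    datalines;
1 10
2 20
1 10
1 30
;
run;

/* NODUPKEY: one observation per distinct BY value (here, per id) */
proc sort data=work.dups out=work.by_key nodupkey;
    by id;
run;

/* NODUPRECS with BY _ALL_: sorting on every variable makes exact
   duplicates adjacent, so all of them are removed */
proc sort data=work.dups out=work.by_all noduprecs;
    by _all_;
run;
```

Here work.by_key ends up with two observations (one per id), while work.by_all keeps three, dropping only the repeated row 1 10.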

 

4.3:Using the EQUALS | NOEQUALS Option

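EQUALS maintains, in the output data set, the relative order that observations with identical BY values had in the input data set. NOEQUALS does not necessarily preserve this order in the output data set. NOEQUALS can save CPU time and memory resources.

To see what this means, take these two observations:

1 2

1 3

and run a step such as proc sort data=old out=new nodupkey equals | noequals; by ...; run; where the BY variable is the first column.

With EQUALS, 1 2 is kept, because it comes first in the input.

With NOEQUALS, the order within the BY group is not guaranteed, so 1 3 may be kept instead.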
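
The example above can be written out as a runnable sketch. The variable names key and value are invented here; the data set names old and new follow the text.

```sas
data work.old;
    input key value;
    datalines;
1 2
1 3
;
run;

/* EQUALS keeps input order within the BY group, so NODUPKEY
   retains the first observation, 1 2 */
proc sort data=work.old out=work.new nodupkey equals;
    by key;
run;
```

Replacing EQUALS with NOEQUALS makes the retained observation depend on how the sort happens to order the group, so code that relies on "first one wins" should state EQUALS explicitly.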

 

5:Host sort utilities

Host sort utilities are third-party sort packages that are available in some operating environments. In some cases, using a host sort utility with PROC SORT might be more efficient than using the SAS sort utility with PROC SORT. Whether it helps depends on the particular data set.

 

5.1:Using the SORTPGM= System Option

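The SORTPGM= system option tells SAS whether to use the SAS sort, to use the host sort, or to determine which sort utility is best for the data set.

That is, you either name the sort utility to use or let SAS pick the best one.

OPTIONS SORTPGM=BEST | HOST | SAS;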

 

5.2:Using the SORTCUTP= System Option

The SORTCUTP= system option specifies the number of bytes above which the host sort utility is used instead of the SAS sort utility.

OPTIONS SORTCUTP=n / nK / nM / nG / MIN / MAX / hexX;

5.3:Using the SORTCUT= System Option

Beginning with SAS 9, the SORTCUT= system option can be used to specify the number of observations above which the host sort utility is used instead of the SAS sort utility.

OPTIONS SORTCUT=n / nK / nM / nG / MIN / MAX / hexX;

 

5.4:Using the SORTNAME= System Option

The SORTNAME= option specifies the host sort utility that will be used if the value of SORTPGM= is BEST or HOST.

OPTIONS SORTNAME=host-sort-utility-name;
options sortpgm=best sortcutp=10000 sortname=syncsort;

 

