The train_test_split function in sklearn.model_selection and its parameters


train_test_split is the scikit-learn function for splitting a dataset, that is, dividing the original data into a training set and a test set.

from sklearn.model_selection import train_test_split

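1. Its signature and docstring in the source code (taken from a release earlier than 0.21, as the note under test_size shows):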

def train_test_split(*arrays, **options):
    """Split arrays or matrices into random train and test subsets

    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.

    test_size : float, int, None, optional
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. By default, the value is set to 0.25.
        The default will change in version 0.21. It will remain 0.25 only
        if ``train_size`` is unspecified, otherwise it will complement
        the specified ``train_size``.

    train_size : float, int, or None, default None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

    shuffle : boolean, optional (default=True)
        Whether or not to shuffle the data before splitting. If shuffle=False
        then stratify must be None.

    stratify : array-like or None (default is None)
        If not None, data is split in a stratified fashion, using this as
        the class labels.

    Returns
    -------
    splitting : list, length=2 * len(arrays)
        List containing train-test split of inputs.

        .. versionadded:: 0.16
            If the input is sparse, the output will be a
            ``scipy.sparse.csr_matrix``. Else, output type is the same as the
            input type.
    """

2. Parameters

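train_size: size of the training set

  float: a value between 0 and 1, the proportion of the dataset used for training

  int: the absolute number of training samples

  None: set automatically to the complement of the test set, i.e. the original dataset minus the test set

test_size: size of the test set; 0.25 is used when neither test_size nor train_size is given

  float: a value between 0 and 1, the proportion of the dataset used for testing

  int: the absolute number of test samples

  None: set automatically to the complement of the training set, i.e. the original dataset minus the training set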
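The three ways of giving a size can be compared on a toy array. This is a minimal sketch: the data is made up for illustration, and the complement behaviour in the last call assumes scikit-learn 0.21 or later, as the docstring above describes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features (made up for illustration).
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Neither size given: the test set takes 25% of the samples.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
default_sizes = (len(X_tr), len(X_te))

# Float: a proportion of the dataset.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
float_sizes = (len(X_tr), len(X_te))

# Int: an absolute number of samples; the other set is the complement
# (assumes scikit-learn 0.21 or later).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=60, random_state=0)
int_sizes = (len(X_tr), len(X_te))

print(default_sizes, float_sizes, int_sizes)
```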

random_state: can be thought of as the random seed; it is set mainly so that a split can be reproduced
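A small sketch of that reproducibility, using made-up toy data: two calls with the same integer seed should return the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features (made up for illustration).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same integer seed on both calls, so both calls shuffle identically.
_, _, _, y_test_a = train_test_split(X, y, random_state=14)
_, _, _, y_test_b = train_test_split(X, y, random_state=14)

same_seed_matches = np.array_equal(y_test_a, y_test_b)
print(same_seed_matches)
```

With random_state=None the split generally differs from run to run, because the global NumPy random state is used.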

shuffle: whether to shuffle the data before splitting, True or False; the default is True
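To illustrate, here is a minimal sketch with toy data: with shuffle=False the samples keep their original order, so the test set is simply the tail of the arrays. As the docstring notes, stratify must be left as None in this case.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 ordered samples (made up for illustration).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# No shuffling: the first 7 samples go to training, the last 3 to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=3, shuffle=False
)

print(y_train)
print(y_test)
```

This can be useful for ordered data such as time series, where shuffling would mix later samples into the training set.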

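stratify: whether to split the dataset so that the class proportions are preserved. For example, if the original dataset has class A : class B = 75% : 25%, then both the training set and the test set will have A : B = 75% : 25%. This is useful when the classes are very imbalanced. It is usually passed as stratify=y, that is, the split is stratified by the dataset's labels y.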
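The 75% : 25% case can be checked directly. This is a sketch with made-up labels (0 stands for class A, 1 for class B); the class counts in each subset are obtained with np.bincount.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 samples: 75 of class 0 (A) and 25 of class 1 (B), made up for illustration.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 75 + [1] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=14, stratify=y
)

# Samples per class in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Both subsets keep the 3 : 1 ratio of the full dataset. When the class counts do not divide evenly, the proportions are only approximately preserved.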

 

3. Typical usage:

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=14, stratify=y)
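A self-contained version of this call, as a sketch: X, y and the extra ids array are synthetic and hypothetical, not from the original post. It also shows that any number of equal-length arrays can be passed, and that the result holds two pieces per input array, as the Returns section of the docstring states.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(40, 3)               # 40 samples, 3 features (synthetic)
y = np.repeat([0, 1], [30, 10])   # 75% class 0, 25% class 1
ids = np.arange(40)               # hypothetical per-sample IDs

# Three input arrays give 2 * 3 = 6 outputs, all split with the same indices.
parts = train_test_split(
    X, y, ids, train_size=0.75, random_state=14, stratify=y
)
X_train, X_test, y_train, y_test, ids_train, ids_test = parts

print(len(parts), X_train.shape, X_test.shape)
```

Passing an index array like ids is one way to trace each row of the split back to the original dataset.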

 

References:

https://blog.csdn.net/liuxiao214/article/details/79019901

https://blog.csdn.net/qq_38410428/article/details/94054920

