Using Python for Data Analysis the Way You Use Excel

This article is a repost. 2017-06-16 10:34. Tags: pandas / Excel / Python basics / numpy / Python

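Preface

Excel is the most widely used tool in data analysis. By comparing Python with Excel feature by feature, this article shows how to use Python function calls to carry out the data processing and analysis work normally done in Excel.

In Python, the pandas library handles data processing. From the 1,787 pages of the official pandas documentation we picked out the 36 most commonly used functions, and we use them to show how Python handles data creation and import, data cleaning, preprocessing, and the most common operations: classification, filtering, grouped summaries, and pivoting.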

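Chapter 7 Data Summarization

This chapter explains how to summarize data by category.

In Excel, the Subtotal feature and pivot tables summarize data along specific dimensions;

in Python, the main functions used are groupby and pivot_table.

The following sections describe how to use each of these two functions.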

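Chapter 1 Creating a Data Table

Excel

There are two common ways to create a data table: importing external data, or entering the data directly.

Excel's "Get External Data" feature supports importing from several kinds of sources, including databases, text files, and web pages.

[Screenshot: the Excel ribbon's "Get External Data" group, with the buttons From Access, From Web, From Text, From Other Sources, Existing Connections, and Refresh All.]

Python

Python can import data from many kinds of sources.

Before importing data with Python, import the pandas library; for convenience we import the numpy library at the same time.

1. Importing a data table

Below are the ways to import data from an Excel file and from a CSV file and create a data table from each.

The code is the minimal form; there are many optional parameters, such as column names, the index column, and data types.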

import numpy as np
import pandas as pd

df = pd.DataFrame(pd.read_csv('name.csv', header=1))
df = pd.DataFrame(pd.read_excel('name.xlsx'))
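As a rough illustration of the optional parameters mentioned above (column names, index column, data types), here is a minimal sketch. The CSV text, the column names, and the parameter choices are all made up for the example; an in-memory buffer stands in for a real file such as name.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "id,date,price\n1001,2013-01-02,1200\n1002,2013-01-03,\n"

df = pd.read_csv(
    io.StringIO(csv_text),       # a path such as 'name.csv' is passed the same way
    header=0,                    # the first line holds the column names
    index_col="id",              # use the id column as the row index
    parse_dates=["date"],        # parse the date column into datetime values
    dtype={"price": "float64"},  # force the price column to float
)
print(df)
```

The empty price in the second row is read as NaN, which is the kind of null value handled later in the cleaning chapter.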

help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

Read CSV (comma-separated) file into DataFrame

Also supports optionally iterating or breaking of the file

into chunks.

Additional help can be found in the `online docs for IO Tools

<http://pandas.pydata.org/pandas-docs/stable/io.html>`_.

Parameters

----------

filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)

The string could be a URL. Valid URL schemes include http, ftp, s3, and

file. For file URLs, a host is expected. For instance, a local file could

be file://localhost/path/to/table.csv

sep : str, default ','

Delimiter to use. If sep is None, will try to automatically determine

this. Separators longer than 1 character and different from ``'\s+'`` will

be interpreted as regular expressions, will force use of the python parsing

engine and will ignore quotes in the data. Regex example: ``'\r\t'``

delimiter : str, default ``None``

Alternative argument name for sep.

delim_whitespace : boolean, default False

Specifies whether or not whitespace (e.g. ``' '`` or ``' '``) will be

used as the sep. Equivalent to setting ``sep='\s+'``. If this option

is set to True, nothing should be passed in for the ``delimiter``

parameter.

.. versionadded:: 0.18.1 support for the Python parser.

header : int or list of ints, default 'infer'

Row number(s) to use as the column names, and the start of the data.

Default behavior is as if set to 0 if no ``names`` passed, otherwise

``None``. Explicitly pass ``header=0`` to be able to replace existing

names. The header can be a list of integers that specify row locations for

a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not

specified will be skipped (e.g. 2 in this example is skipped). Note that

this parameter ignores commented lines and empty lines if

``skip_blank_lines=True``, so header=0 denotes the first line of data

rather than the first line of the file.

names : array-like, default None

List of column names to use. If file contains no header row, then you

should explicitly pass header=None. Duplicates in this list are not

allowed unless mangle_dupe_cols=True, which is the default.

index_col : int or sequence or False, default None

Column to use as the row labels of the DataFrame. If a sequence is given, a

MultiIndex is used. If you have a malformed file with delimiters at the end

of each line, you might consider index_col=False to force pandas to _not_

use the first column as the index (row names)

usecols : array-like, default None

Return a subset of the columns. All elements in this array must either

be positional (i.e. integer indices into the document columns) or strings

that correspond to column names provided either by the user in `names` or

inferred from the document header row(s). For example, a valid `usecols`

parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter

results in much faster parsing time and lower memory usage.

as_recarray : boolean, default False

DEPRECATED: this argument will be removed in a future version. Please call

`pd.read_csv(...).to_records()` instead.

Return a NumPy recarray instead of a DataFrame after parsing the data.

If set to True, this option takes precedence over the `squeeze` parameter.

In addition, as row indices are not available in such a format, the

`index_col` parameter will be ignored.

squeeze : boolean, default False

If the parsed data only contains one column then return a Series

prefix : str, default None

Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...

mangle_dupe_cols : boolean, default True

Duplicate columns will be specified as 'X.0'...'X.N', rather than

'X'...'X'. Passing in False will cause data to be overwritten if there

are duplicate names in the columns.

dtype : Type name or dict of column -> type, default None

Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}

(Unsupported with engine='python'). Use `str` or `object` to preserve and

not interpret dtype.

engine : {'c', 'python'}, optional

Parser engine to use. The C engine is faster while the python engine is

currently more feature-complete.

converters : dict, default None

Dict of functions for converting values in certain columns. Keys can either

be integers or column labels

true_values : list, default None

Values to consider as True

false_values : list, default None

Values to consider as False

skipinitialspace : boolean, default False

Skip spaces after delimiter.

skiprows : list-like or integer, default None

Line numbers to skip (0-indexed) or number of lines to skip (int)

at the start of the file

skipfooter : int, default 0

Number of lines at bottom of file to skip (Unsupported with engine='c')

skip_footer : int, default 0

DEPRECATED: use the `skipfooter` parameter instead, as they are identical

nrows : int, default None

Number of rows of file to read. Useful for reading pieces of large files

na_values : scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific

per-column NA values. By default the following values are interpreted as

NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',

'1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'`.

keep_default_na : bool, default True

If na_values are specified and keep_default_na is False the default NaN

values are overridden, otherwise they're appended to.

na_filter : boolean, default True

Detect missing value markers (empty strings and the value of na_values). In

data without any NAs, passing na_filter=False can improve the performance

of reading a large file

verbose : boolean, default False

Indicate number of NA values placed in non-numeric columns

skip_blank_lines : boolean, default True

If True, skip over blank lines rather than interpreting as NaN values

parse_dates : boolean or list of ints or names or list of lists or dict, default False

* boolean. If True -> try parsing the index.

* list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3

each as a separate date column.

* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as

a single date column.

* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result

'foo'

Note: A fast-path exists for iso8601-formatted dates.

infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format

of the datetime strings in the columns, and if it can be inferred, switch

to a faster method of parsing them. In some cases this can increase the

parsing speed by ~5-10x.

keep_date_col : boolean, default False

If True and parse_dates specifies combining multiple columns then

keep the original columns.

date_parser : function, default None

Function to use for converting a sequence of string columns to an array of

datetime instances. The default uses ``dateutil.parser.parser`` to do the

conversion. Pandas will try to call date_parser in three different ways,

advancing to the next if an exception occurs: 1) Pass one or more arrays

(as defined by parse_dates) as arguments; 2) concatenate (row-wise) the

string values from the columns defined by parse_dates into a single array

and pass that; and 3) call date_parser once for each row using one or more

strings (corresponding to the columns defined by parse_dates) as arguments.

dayfirst : boolean, default False

DD/MM format dates, international and European format

iterator : boolean, default False

Return TextFileReader object for iteration or getting chunks with

``get_chunk()``.

chunksize : int, default None

Return TextFileReader object for iteration. `See IO Tools docs for more

information

<http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_ on

``iterator`` and ``chunksize``.

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'

For on-the-fly decompression of on-disk data. If 'infer', then use gzip,

bz2, zip or xz if filepath_or_buffer is a string ending in '.gz', '.bz2',

'.zip', or 'xz', respectively, and no decompression otherwise. If using

'zip', the ZIP file must contain only one data file to be read in.

Set to None for no decompression.

.. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.

thousands : str, default None

Thousands separator

decimal : str, default '.'

Character to recognize as decimal point (e.g. use ',' for European data).

float_precision : string, default None

Specifies which converter the C engine should use for floating-point

values. The options are `None` for the ordinary converter,

`high` for the high-precision converter, and `round_trip` for the

round-trip converter.

lineterminator : str (length 1), default None

Character to break file into lines. Only valid with C parser.

quotechar : str (length 1), optional

The character used to denote the start and end of a quoted item. Quoted

items can include the delimiter and it will be ignored.

quoting : int or csv.QUOTE_* instance, default 0

Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of

QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

doublequote : boolean, default ``True``

When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate

whether or not to interpret two consecutive quotechar elements INSIDE a

field as a single ``quotechar`` element.

escapechar : str (length 1), default None

One-character string used to escape delimiter when quoting is QUOTE_NONE.

comment : str, default None

Indicates remainder of line should not be parsed. If found at the beginning

of a line, the line will be ignored altogether. This parameter must be a

single character. Like empty lines (as long as ``skip_blank_lines=True``),

fully commented lines are ignored by the parameter `header` but not by

`skiprows`. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3'

with `header=0` will result in 'a,b,c' being

treated as the header.

encoding : str, default None

Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python

standard encodings

<https://docs.python.org/3/library/codecs.html#standard-encodings>`_

dialect : str or csv.Dialect instance, default None

If None defaults to Excel dialect. Ignored if sep longer than 1 char

See csv.Dialect documentation for more details

tupleize_cols : boolean, default False

Leave a list of tuples on columns as is (default is to convert to

a Multi Index on the columns)

error_bad_lines : boolean, default True

Lines with too many fields (e.g. a csv line with too many commas) will by

default cause an exception to be raised, and no DataFrame will be returned.

If False, then these "bad lines" will dropped from the DataFrame that is

returned. (Only valid with C parser)

warn_bad_lines : boolean, default True

If error_bad_lines is False, and warn_bad_lines is True, a warning for each

"bad line" will be output. (Only valid with C parser).

low_memory : boolean, default True

Internally process the file in chunks, resulting in lower memory use

while parsing, but possibly mixed type inference. To ensure no mixed

types either set False, or specify the type with the `dtype` parameter.

Note that the entire file is read into a single DataFrame regardless,

use the `chunksize` or `iterator` parameter to return the data in chunks.

(Only valid with C parser)

buffer_lines : int, default None

DEPRECATED: this argument will be removed in a future version because its

value is not respected by the parser

compact_ints : boolean, default False

DEPRECATED: this argument will be removed in a future version

If compact_ints is True, then for any column that is of integer dtype,

the parser will attempt to cast it as the smallest integer dtype possible,

either signed or unsigned depending on the specification from the

`use_unsigned` parameter.

use_unsigned : boolean, default False

DEPRECATED: this argument will be removed in a future version

If integer columns are being compacted (i.e. `compact_ints=True`), specify

whether the column should be compacted to the smallest signed or unsigned

integer dtype.

memory_map : boolean, default False

If a filepath is provided for `filepath_or_buffer`, map the file object

directly onto memory and access the data directly from there. Using this

option can improve performance because there is no longer any I/O overhead.

Returns

-------

result : DataFrame or TextParser

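2. Creating a data table

The other approach is to create the data table by entering the data directly. In Excel you simply type the values into cells; in Python the table is built in code, and the result is shown below.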

In [45]: df
Out[45]:
     id       date       city category  age   price
0  1001 2013-01-02    Beijing    100-A   23  1200.0
1  1002 2013-01-03         SH    100-B   44     NaN
2  1003 2013-01-04  guangzhou    110-A   54  2133.0
3  1004 2013-01-05   Shenzhen    110-C   32  5433.0
4  1005 2013-01-06   shanghai    210-A   34     NaN
5  1006 2013-01-07    BEIJING    130-F   32  4432.0

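The table is created with the DataFrame function from the pandas library. It has 6 rows, each with 6 fields. We deliberately put some NA values and problematic entries into the data, such as values containing stray spaces.

These will be dealt with in the data cleaning step.

From here on we will refer to the data table as df, short for DataFrame.

Above is the table we just created. We did not set an index column; the price field contains NA values, and the city field contains some dirty data.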
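A table like the one shown above could be built roughly as follows. This is a sketch reconstructed from the printed output: the values match the table, but the exact placement of the stray spaces in the city strings is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1001, 1002, 1003, 1004, 1005, 1006],
        "date": pd.date_range("20130102", periods=6),
        # Mixed case and stray spaces are deliberate dirty data;
        # where exactly the spaces sit is assumed for illustration.
        "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
        "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
        "age": [23, 44, 54, 32, 34, 32],
        "price": [1200, np.nan, 2133, 5433, np.nan, 4432],
    },
    columns=["id", "date", "city", "category", "age", "price"],
)
print(df)
```

Passing columns explicitly fixes the column order regardless of how the dictionary is ordered.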

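Chapter 2 Inspecting the Data Table

This chapter covers how to inspect a data table.

Datasets handled in Python are usually fairly large; the New York taxi data and the Citibike trip data, for example, each run to tens of millions of rows. We cannot take in the whole table at a glance, so we need methods that pull out its key information.

Another purpose of inspection is to get an overview of the data, such as the size of the table, the memory it occupies, the data types, whether there are null values, and what the data actually contains, in preparation for the cleaning and preprocessing that follow.

1. Dimensions (rows and columns)

In Excel you can press Ctrl+Down Arrow and Ctrl+Right Arrow to jump to the last row and column and read off the row number and column letter.

In Python the shape attribute gives the dimensions of the table, that is, the number of rows and columns. The result (6, 6) means the table has 6 rows and 6 columns.

Here is the code.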

# View the dimensions of the data table
df.shape
(6, 6)

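2. Table information

The info function shows overall information about the table. It returns quite a lot, including the dimensions, column names, data types, and memory usage.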

# Table information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
id          6 non-null int64
date        6 non-null datetime64[ns]
city        6 non-null object
category    6 non-null object
age         6 non-null int64
price       4 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 368.0+ bytes

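3. Viewing data types

In Excel you determine a cell's format by selecting it and looking at the number format shown on the Home tab.

[Screenshot: the Home tab of the Excel ribbon, showing the format controls for the selected cell.]

In Python the dtypes attribute returns the data types.

dtypes reports the data type of every column in the table at once; to check a single column, use the dtype attribute of that column.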

# View the data type of each column
df.dtypes
id                   int64
date        datetime64[ns]
city                object
category            object
age                  int64
price              float64
dtype: object

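4. Viewing null values

In Excel, null values are found with the "Go To Special" feature, which locates the blank cells in the table.

"Go To Special" is under "Find & Select" on the Home tab.

[Screenshot: Excel's "Go To Special" dialog, with options such as Comments, Constants, Formulas, Blanks, Current region, Current array, Objects, Row differences, Column differences, Precedents, Dependents, Last cell, Visible cells only, Conditional formats, and Data validation.]

isnull is the function that tests for null values in Python. It returns Boolean values: True where a value is null and False where it is not.

You can check the whole table or just a single column.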

In [59]: df.isnull()
Out[59]:
      id   date   city  category    age  price
0  False  False  False     False  False  False
1  False  False  False     False  False   True
2  False  False  False     False  False  False
3  False  False  False     False  False  False
4  False  False  False     False  False   True
5  False  False  False     False  False  False

# Check a specific column for null values
df['price'].isnull()
0    False
1     True
2    False
3    False
4     True
5    False
Name: price, dtype: bool

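5. Viewing unique values

In Excel, unique values are found by using "Conditional Formatting" to color-mark them.

[Screenshot: Excel's "Duplicate Values" conditional-formatting dialog, set to format matching cells with light red fill and dark red text.]

In Python the unique function shows the unique values.

unique works on a single column of the table, not on the whole table.

The code is below; it returns the unique values in that column,

much like the result of "Remove Duplicates" in Excel.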

# View the unique values in the city column
df['city'].unique()
array(['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '], dtype=object)

6. Viewing the table's values

In Python the values attribute returns the values in the data table

as an array, without the header.
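A small sketch of what values returns, using a two-row stand-in table (the rows are made up to keep the example short):

```python
import numpy as np
import pandas as pd

# A cut-down stand-in for df; the values are illustrative only.
df = pd.DataFrame(
    {"id": [1001, 1002], "city": ["Beijing ", "SH"], "price": [1200.0, np.nan]}
)

values = df.values  # a NumPy array of the cell values, without column names
print(type(values))
print(values)
```

Because the columns hold different types, the resulting array has the generic object dtype.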

7. Viewing column names

The columns attribute is used to view just the column names of the table.

# View column names
df.columns
Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')

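8. Viewing the first rows

The head() function shows the first N rows of the table. By default head() shows the first 5 rows; pass a number to choose how many rows to view.

The code below views the first 3 rows.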

# View the first 3 rows
df.head(3)
     id       date       city category  age   price
0  1001 2013-01-02    Beijing    100-A   23  1200.0
1  1002 2013-01-03         SH    100-B   44     NaN
2  1003 2013-01-04  guangzhou    110-A   54  2133.0

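9. Viewing the last rows

The tail function is the opposite of head: it shows the last N rows of the table. By default tail() shows the last 5 rows; pass a number to choose how many rows to view.

The code below views the last 3 rows.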

# View the last 3 rows
df.tail(3)
     id       date      city category  age   price
3  1004 2013-01-05  Shenzhen    110-C   32  5433.0
4  1005 2013-01-06  shanghai    210-A   34     NaN
5  1006 2013-01-07   BEIJING    130-F   32  4432.0

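Chapter 3 Cleaning the Data Table

This chapter covers cleaning up problems in the data table, mainly null values, inconsistent letter case, data types, and duplicate values.

Logical validation between fields is not covered here.

1. Handling null values (drop or fill)

When we created the table we deliberately put a few NA values into the price field.

There are many ways to handle null values: you can drop the rows that contain them, or fill them, for example with 0 or with the mean. You can also infer the missing values from the logic of the other fields.

In Excel, null values can be handled with "Find and Replace", replacing every blank with 0 or with the mean. The same can be done by locating the blanks with "Go To Special".

[Screenshot: Excel's "Find and Replace" dialog, with the "Find what" and "Replace with" fields and the Replace All, Replace, Find All, and Find Next buttons.]

Python handles null values more flexibly: the dropna function removes the rows that contain null values, and the fillna function fills them. In the code and result below, after dropna the two rows that contained NA values are gone, and what is returned is a table without null values.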

# Drop the rows of the table that contain null values
df.dropna(how='any')
     id       date       city category  age   price
0  1001 2013-01-02    Beijing    100-A   23  1200.0
2  1003 2013-01-04  guangzhou    110-A   54  2133.0
3  1004 2013-01-05   Shenzhen    110-C   32  5433.0
5  1006 2013-01-07    BEIJING    130-F   32  4432.0

Alternatively, null values can be filled with a number. The code below uses fillna to fill the null fields with 0.

# Fill the null values in the table with a number
df.fillna(value=0)

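We choose to fill: the NA entries are filled with the mean of the price column. fillna is used again, and inside it the mean function first computes the current mean of the price column, which is then used as the fill value. The two formerly null entries now show 3299.5.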

# Fill NA with the mean of the price column
df['price'].fillna(df['price'].mean())
0    1200.0
1    3299.5
2    2133.0
3    5433.0
4    3299.5
5    4432.0
Name: price, dtype: float64

     id       date       city category  age   price
0  1001 2013-01-02    Beijing    100-A   23  1200.0
1  1002 2013-01-03         SH    100-B   44  3299.5
2  1003 2013-01-04  guangzhou    110-A   54  2133.0
3  1004 2013-01-05   Shenzhen    110-C   32  5433.0
4  1005 2013-01-06   shanghai    210-A   34  3299.5
5  1006 2013-01-07    BEIJING    130-F   32  4432.0

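2. Removing whitespace

Besides null values, stray spaces inside fields are another common data cleaning problem. Below is the code that strips the spaces from the strings.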

# Strip whitespace from the values in the city field
df['city'] = df['city'].map(str.strip)

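3. Changing letter case

In English-language fields, inconsistent capitalization is another common problem.

Excel has functions such as UPPER and LOWER, and Python has similarly named string methods, upper and lower, for the same purpose. The city column of our table has exactly this problem.

We convert every letter in the city column to lowercase. Below is the code.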

# Convert the city column to lowercase
df['city'] = df['city'].str.lower()

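4. Changing data types

In Excel, the "Format Cells" feature changes the data format.

In Python, the astype function changes the data type.

[Screenshot: Excel's "Format Cells" dialog on the Number tab, with options for decimal places, the thousands separator, and the display of negative numbers.]

In Python, dtype shows the data type, and its counterpart astype changes it. The code below converts the values of the price field to int.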

# Change the data type
df['price'].astype('int')
0    1200
1    3299
2    2133
3    5433
4    3299
5    4432
Name: price, dtype: int32

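5. Renaming columns

rename is the function that changes column names. Here we rename the category column of the table to category-size.

Below are the code and the result after the change.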

# Rename a column
df.rename(columns={'category': 'category-size'})
     id       date       city category-size  age  price
0  1001 2013-01-02    beijing         100-A   23   1200
1  1002 2013-01-03         sh         100-B   44   3299
2  1003 2013-01-04  guangzhou         110-A   54   2133
3  1004 2013-01-05   shenzhen         110-C   32   5433
4  1005 2013-01-06   shanghai         210-A   34   3299
5  1006 2013-01-07    beijing         130-F   32   4432

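6. Removing duplicate values

Many data tables also contain duplicate values. Excel's Data tab has a "Remove Duplicates" feature for deleting them.

By default Excel keeps the first occurrence and deletes the later duplicates.

[Screenshot: the Data Tools group on Excel's Data tab, including Text to Columns, Flash Fill, Remove Duplicates, Data Validation, Consolidate, What-If Analysis, and Relationships.]

In Python, the drop_duplicates function removes duplicate values.

We take the city column of the table as the example; the city field contains a duplicate.

By default drop_duplicates() removes the later occurrences of a duplicated value, the same logic as Excel. With the keep='last' argument it removes the earlier occurrences instead and keeps the last one.

Below are the code and a comparison of the results.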

In the original city column, beijing is duplicated: it appears in the first and the last position.

df['city']
0      beijing
1           sh
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object

With the default drop_duplicates(), the result shows that the first beijing is kept and the last beijing is removed.

In [108]: df['city'].drop_duplicates()
Out[108]:
0      beijing
1           sh
2    guangzhou
3     shenzhen
4     shanghai
Name: city, dtype: object

With keep='last' the result is the reverse of the previous one: the first beijing is removed and the last beijing is kept.

# Remove the earlier occurrences of duplicated values
df['city'].drop_duplicates(keep='last')
1           sh
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object

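7. Modifying and replacing values

The last data cleaning problem is modifying or replacing values. In Excel this is done with "Find and Replace".

[Screenshot: Excel's "Find and Replace" dialog, with the "Find what" and "Replace with" fields and the Find All and Find Next buttons.]

In Python the replace function replaces data.

In the city field of the table, Shanghai is written in two ways: shanghai and SH (now sh, after the lowercase conversion).

We use replace to substitute shanghai for sh.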

# Replace data
df['city'].replace('sh', 'shanghai')
0      beijing
1     shanghai
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object

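Chapter 4 Data Preprocessing

This chapter covers preprocessing: organizing the cleaned data so it is ready for later statistics and analysis.

It mainly includes merging tables, sorting, splitting values into columns, and grouping and labeling data.

1. Merging tables

First we merge different data tables. Here we create a new table, df1, and merge the two tables df and df1.

Excel has no feature that merges tables directly; it can be done step by step with the VLOOKUP function.

In Python the merge function does it in one step.

Below we create the df1 table to merge with df.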

In [113]: df1 = pd.DataFrame({
     ...:     "id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
     ...:     "gender": ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
     ...:     "pay": ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
     ...:     "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})

In [114]: df1
Out[114]:
   gender    id  m-point pay
0    male  1001       10   Y
1  female  1002       12   N
2    male  1003       20   Y
3  female  1004       40   Y
4    male  1005       40   N
5  female  1006       40   Y
6    male  1007       30   N
7  female  1008       20   Y

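The merge function merges the two tables. With how='inner', the rows the two tables have in common are matched together into a new table, which we name df_inner.

Besides inner, the other merge methods are left, right, and outer. The differences between them are explained and compared in detail in my other articles.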

df_left = pd.merge(df, df1, how='left')
df_right = pd.merge(df, df1, how='right')
df_outer = pd.merge(df, df1, how='outer')
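To make the difference between the merge methods concrete, here is a minimal sketch with cut-down versions of df and df1 (only a couple of columns each, chosen for illustration). With no on argument, merge joins on the columns the two tables share, here id.

```python
import pandas as pd

# Cut-down stand-ins for df and df1; only id is shared between them.
df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "age": [23, 44, 54, 32, 34, 32]})
df1 = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
                    "gender": ["male", "female", "male", "female",
                               "male", "female", "male", "female"]})

df_inner = pd.merge(df, df1, how="inner")  # ids present in both tables
df_left = pd.merge(df, df1, how="left")    # every row of df
df_right = pd.merge(df, df1, how="right")  # every row of df1
df_outer = pd.merge(df, df1, how="outer")  # ids present in either table

print(len(df_inner), len(df_left), len(df_right), len(df_outer))
```

Ids 1007 and 1008 exist only in df1, so in the right and outer results their age is NaN.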

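2. Setting the index column

After the merge, we set an index column on the df_inner table. An index column serves many purposes: it can be used to extract data, to summarize it, and to filter it.

The function for setting the index is set_index.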
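A minimal sketch of set_index, assuming the id column is the one chosen as the index (the three-row table is a stand-in for df_inner):

```python
import pandas as pd

# A small stand-in for df_inner.
df_inner = pd.DataFrame({"id": [1001, 1002, 1003],
                         "city": ["beijing", "shanghai", "guangzhou"]})

# set_index returns a new table; the chosen column becomes the row labels.
df_indexed = df_inner.set_index("id")
print(df_indexed)

# Rows can now be looked up by id label.
print(df_indexed.loc[1002, "city"])
```

Note that set_index does not modify df_inner in place unless the result is assigned back.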

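3. Sorting (by index, by value)

In Excel the Sort button on the Data tab sorts the table directly, which is straightforward.

[Screenshot: the Sort & Filter group on Excel's Data tab, with the Sort, Filter, and Reapply buttons.]

In Python, sorting is done with the sort_values and sort_index functions.

In Python you can sort the table by its index or by the values of a specified column.

First we sort the table by the users' ages in the age column.

The function used is sort_values.

The sort_index function sorts the table by the values of the index.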
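The two sorts described above might look like this; the table is a two-column stand-in for df_inner.

```python
import pandas as pd

# A stand-in for df_inner with just the columns needed here.
df_inner = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                         "age": [23, 44, 54, 32, 34, 32]})

by_age = df_inner.sort_values(by=["age"])  # sort rows by the age column
by_index = by_age.sort_index()             # sort rows back into index order

print(by_age)
print(by_index)
```

Passing ascending=False to either function reverses the order.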

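4. Grouping data

In Excel, values can be grouped with an approximate-match VLOOKUP, or with a pivot table.

The Python counterpart is numpy's where function.

where tests the data and assigns groups. Here we test the values of the price column, put the rows that meet the condition in one group and the rest in another, and record the result in a group field.

Beyond a single where test, you can also test the values of several fields before grouping: rows where the city column equals beijing and the price column is greater than or equal to 4000 are marked with 1.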
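Both kinds of grouping can be sketched as follows. The price threshold of 3000 and the labels high and low are assumptions chosen to match the group values visible in later output; the sign column name also comes from that output.

```python
import numpy as np
import pandas as pd

# A stand-in for df_inner with the two columns the conditions use.
df_inner = pd.DataFrame({
    "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
    "price": [1200, 3299, 2133, 5433, 3299, 4432],
})

# Single condition: label each row by price (threshold is an assumption).
df_inner["group"] = np.where(df_inner["price"] > 3000, "high", "low")

# Several conditions: mark rows where city is beijing AND price >= 4000.
mask = (df_inner["city"] == "beijing") & (df_inner["price"] >= 4000)
df_inner.loc[mask, "sign"] = 1

print(df_inner)
```

Rows that do not satisfy the combined condition are left as NaN in the sign column.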

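5. Splitting columns

The opposite of grouping is splitting values into separate columns. Excel's Data tab provides the "Text to Columns" feature for this.

In Python the split function does the splitting.

[Screenshot: the Data Tools group on Excel's Data tab, with Text to Columns, Flash Fill, Remove Duplicates, Data Validation, Consolidate, What-If Analysis, and Relationships.]

The values in the category column of the table carry two pieces of information: the number in front is the category id and the letter after it is the size, joined by a hyphen. We use split to break this field apart and then match the resulting table back to the original table.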

# Split each value of the category field in turn and build a data table whose index is the index of df_inner and whose columns are named category and size
In [142]: pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])
Out[142]: a table with the two columns category and size
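The split and the step of matching the result back to the original table can be sketched like this. The new column is named category_1 here, an assumption based on the column names in later output, which also avoids a clash with the existing category column during the merge.

```python
import pandas as pd

# A stand-in for df_inner with the column to be split.
df_inner = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
})

# Split each value on the hyphen; reuse df_inner's index so rows line up.
split = pd.DataFrame(
    (x.split("-") for x in df_inner["category"]),
    index=df_inner.index,
    columns=["category_1", "size"],
)

# Match the split columns back to the original table by index.
df_inner = pd.merge(df_inner, split, left_index=True, right_index=True)
print(df_inner)
```

The vectorized form df_inner['category'].str.split('-', expand=True) produces the same two columns without the explicit generator.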

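Chapter 5 Data Extraction

Data extraction is one of the most common tasks in data analysis.

This part mainly uses three functions: loc, iloc, and ix.

loc extracts by label;

iloc extracts by position;

ix can extract by label and by position at the same time.

How to use each one is described below.

1. Extracting by label: loc

loc extracts by the index labels of the table. The code below extracts the single row whose index label is 3.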

# Extract a single row by index
df_inner.loc[3]
id                           1004
date          2013-01-05 00:00:00
city                     shenzhen
category                    110-C
age                            32
price                        5433
gender                     female
m-point                        40
pay                             Y
group                        high
sign                          NaN
category_1                    110
size                            C
Name: 3, dtype: object

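A colon limits the range of data extracted: the label before the colon is the start and the label after it is the end.

Below, the rows from label 0 through label 5 are extracted.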

# Extract a range of rows by index
df_inner.loc[0:5]
     id       date      city category  age  price  gender  m-point pay group  sign category_1 size
0  1001 2013-01-02   beijing    100-A   23   1200    male       10   Y   low   NaN        100    A
3  1004 2013-01-05  shenzhen    110-C   32   5433  female       40   Y  high   NaN        110    C
5  1006 2013-01-07   beijing    130-F   32   4432  female       40   Y  high   1.0        130    F

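The reset_index function restores the default index. Here we then set the dates in the date field as the index of the table and extract data by date.

A colon again limits the range; leaving the part before the colon empty means starting from the first row.

Extract all the data up to and including January 4, 2013.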

# Extract all data up to the 4th
df_inner[:'2013-01-04']
              id       city category  age  price  gender  m-point pay group  sign category_1 size
date
2013-01-02  1001    beijing    100-A   23   1200    male       10   Y   low   NaN        100    A
2013-01-03  1002   shanghai    100-B   44   3299  female       12   N  high   NaN        100    B
2013-01-04  1003  guangzhou    110-A   54   2133    male       20   Y   low   NaN        110    A
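The index steps that lead up to this date slice can be sketched as below. The starting point, a table indexed by id, is an assumption; the sketch uses loc for the slice and sorts the index first so that label slicing on dates is well defined.

```python
import pandas as pd

# A stand-in for df_inner, assumed to be indexed by id at this point.
df_inner = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "date": pd.date_range("20130102", periods=6),
    "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
}).set_index("id")

df_inner = df_inner.reset_index()                   # id is an ordinary column again
df_inner = df_inner.set_index("date").sort_index()  # dates become the row labels

# Label slices include the end label, so 2013-01-04 itself is returned.
early = df_inner.loc[:"2013-01-04"]
print(early)
```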

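2. Extracting by position: iloc

iloc extracts data from the table by position. Here the numbers before and after the colon are no longer index labels but positions, counted from 0.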

# Extract a region of data by position with iloc
df_inner.iloc[:3, :2]
              id      city
date
2013-01-02  1001   beijing
2013-01-05  1004  shenzhen
2013-01-07  1006   beijing

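Besides extracting a region, iloc can also pick out individual positions. The 0, 2, 5 in the first pair of square brackets are the positions of the rows, and the numbers in the second pair are the positions of the columns.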

In [159]: df_inner.iloc[[0, 2, 5], [4, 5]]
Out[159]:
            price  gender
date
2013-01-02   1200    male
2013-01-04   2133    male
2013-01-07   4432  female

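3. Extracting by label and position: ix

ix is a hybrid of loc and iloc: it can extract by index label and by position. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so current code uses loc or iloc instead.

In the code below, the rows are selected by index date and the columns by position.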

# Mixed extraction by index label and position with ix
df_inner.ix[:'2013-01-03', :4]
              id      city category  age
date
2013-01-02  1001   beijing    100-A   23
2013-01-03  1002  shanghai    100-B   44

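4. Extracting by condition (region and condition values)

Besides extracting by label and position, you can also extract data by specific conditions.

Below, loc and isin are used together to extract data by a given condition.

isin tests whether the values in city are beijing.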

In [162]: df_inner['city'].isin(['beijing'])
Out[162]:
date
2013-01-02     True
2013-01-03    False
2013-01-04    False
2013-01-05    False
2013-01-06    False
2013-01-07     True
Name: city, dtype: bool

Nesting isin inside loc extracts the rows for which the test is True.

Here we change the condition to whether the city value is beijing or shanghai; if it is, the row is extracted.
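Nesting isin inside loc might look like this; the table is a stand-in for df_inner with only the columns needed.

```python
import pandas as pd

# A stand-in for df_inner.
df_inner = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
})

# isin yields a Boolean Series; loc keeps the rows where it is True.
matches = df_inner.loc[df_inner["city"].isin(["beijing", "shanghai"])]
print(matches)
```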

Data extraction can also do work similar to splitting columns, pulling a specified part out of a combined value.

category = df_inner['category']
0    100-A
3    110-C
5    130-F
4    210-A
1    100-B
2    110-A
Name: category, dtype: object

# Extract the first three characters and build a data table
pd.DataFrame(category.str[:3])

[Screenshot: the same two steps run on the date-indexed table, where the column is named category_x, ending with In [168]: pd.DataFrame(category.str[:3]).]

# Extract the first 3 characters of the category field

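Chapter 6 Data Filtering

Data is filtered using the three conditions AND, OR, and NOT together with greater-than, less-than, and equal-to comparisons, and the filtered results are counted and summed.

This is similar to Excel's filter feature and its COUNTIFS and SUMIFS functions.

1. Filtering by condition (AND, OR, NOT)

Excel's Data tab provides a "Filter" feature for filtering the table by different conditions.

[Screenshot: the Sort & Filter group on Excel's Data tab, with the Sort, Filter, Reapply, and Advanced buttons.]

In Python, loc combined with filter conditions does the filtering.

Together with the sum and count functions it can also reproduce Excel's SUMIF and COUNTIF.

Filter with an AND condition: age greater than 25 and city equal to beijing. Only one row meets the condition.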

In [172]: df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']]
Out[172]:
              id     city  age category  gender
date
2013-01-07  1006  beijing   32    130-F  female

Filter with an OR condition: age greater than 25 or city equal to beijing.

In [173]: df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']]
Out[173]:
              id       city  age category  gender
date
2013-01-02  1001    beijing   23    100-A    male
2013-01-03  1002   shanghai   44    100-B  female
2013-01-04  1003  guangzhou   54    110-A    male
2013-01-05  1004   shenzhen   32    110-C  female
2013-01-06  1005   shanghai   34    210-A    male
2013-01-07  1006    beijing   32    130-F  female

Appending the price field and the sum function to the preceding code sums the price values of the filtered rows, which corresponds to SUMIFS in Excel.
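A sketch of the SUMIFS-style calculation, using the OR condition from above on a stand-in table:

```python
import pandas as pd

# A stand-in for df_inner with the columns the filter uses.
df_inner = pd.DataFrame({
    "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
    "age": [23, 44, 54, 32, 34, 32],
    "price": [1200, 3299, 2133, 5433, 3299, 4432],
})

# Filter with the OR condition, keep only price, then sum it.
condition = (df_inner["age"] > 25) | (df_inner["city"] == "beijing")
total = df_inner.loc[condition, "price"].sum()
print(total)
```

With this data every row satisfies the OR condition, so the total is simply the sum of the whole price column.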

Filter with a NOT condition: city not equal to beijing. Four rows meet the condition. The filtered result is sorted by the id column.

In [178]: df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(by=['id'])
Out[178]:
              id       city  age category  gender
date
2013-01-03  1002   shanghai   44    100-B  female
2013-01-04  1003  guangzhou   54    110-A    male
2013-01-05  1004   shenzhen   32    110-C  female
2013-01-06  1005   shanghai   34    210-A    male

Appending the city column and the count function to the preceding code counts the filtered rows, which corresponds to COUNTIFS in Excel.
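A sketch of the COUNTIFS-style calculation, using the NOT condition from above on a stand-in table:

```python
import pandas as pd

# A stand-in for df_inner.
df_inner = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
})

# Filter out beijing, keep the city column, and count the remaining values.
count = df_inner.loc[df_inner["city"] != "beijing", "city"].count()
print(count)
```

count ignores null values, so it counts the non-null cities among the filtered rows.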

Another way to filter is the query function.

Below are the code and the filtered result.

df_inner.query('city == ["beijing", "shanghai"]')
              id      city  age  price
date
2013-01-02  1001   beijing   23   1200
2013-01-03  1002  shanghai   44   3299
2013-01-06  1005  shanghai   34   3299
2013-01-07  1006   beijing   32   4432
(the remaining columns of the output are omitted here)