Advanced Pandas Tutorial: Statistical Methods

Posted by 绝代码农 on 2021-7-8 10:50:28

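Introduction

Data analysis relies heavily on statistical methods. This article covers the statistical methods that Pandas provides.

Percent change

Both Series and DataFrame have a pct_change() method that computes the change between each element and the one before it, expressed as a fraction (0.1 means 10%). The first element has no predecessor, so its result is NaN. Missing values in the data affect the result, so it is worth filling or dropping NaN values before calling it.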
In : import numpy as np

In : import pandas as pd

In : ser = pd.Series(np.random.randn(8))

In : ser.pct_change()
Out:
0         NaN
1   -1.264716
2    4.125006
3   -1.159092
4   -0.091292
5    4.837752
6   -1.182146
7   -8.721482
dtype: float64

In : ser
Out:
0   -0.950515
1    0.251617
2    1.289537
3   -0.205155
4   -0.186426
5   -1.088310
6    0.198231
7   -1.530635
dtype: float64
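To make the definition concrete, here is a minimal sketch that compares pct_change() with the same calculation written out using shift(). The price values and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical closing prices, chosen only for illustration.
prices = pd.Series([100.0, 110.0, 99.0, 118.8])

# Built-in percent change: (current - previous) / previous.
changes = prices.pct_change()

# The same computation written out with shift(); the first element has
# no predecessor, so both versions yield NaN there.
manual = prices / prices.shift(1) - 1

print(changes)
print(manual)
```

Both Series should agree element by element, which shows that the result is a fraction of the previous value rather than a value multiplied by 100.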
pct_change also takes a periods parameter, which sets how many elements apart the two compared values are:
In : df = pd.DataFrame(np.random.randn(10, 4))

In : df.pct_change(periods=3)
Out:
          0         1         2         3
0       NaN       NaN       NaN       NaN
1       NaN       NaN       NaN       NaN
2       NaN       NaN       NaN       NaN
3 -0.218320 -1.054001  1.987147 -0.510183
4 -0.439121 -1.816454  0.649715 -4.822809
5 -0.127833 -3.042065 -5.866604 -1.776977
6 -2.596833 -1.959538 -2.111697 -3.798900
7 -0.117826 -2.169058  0.036094 -0.067696
8  2.492606 -1.357320 -1.205802 -1.558697
9 -1.012977  2.324558 -1.003744 -0.371806
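A small deterministic example may make the effect of periods easier to follow. The sketch below uses made-up figures and periods=2, so each row is compared with the row two positions earlier and the first two rows are NaN.

```python
import pandas as pd

# Hypothetical figures for two products, for illustration only.
sales = pd.DataFrame({"x": [10.0, 20.0, 40.0, 30.0, 10.0],
                      "y": [5.0, 5.0, 10.0, 10.0, 20.0]})

# Compare each row with the row two positions earlier.
two_back = sales.pct_change(periods=2)

# Equivalent formulation with an explicit shift of two rows.
manual = sales / sales.shift(2) - 1

print(two_back)
```

For column x, row 2 is 40 / 10 - 1 = 3.0, row 3 is 30 / 20 - 1 = 0.5, and row 4 is 10 / 40 - 1 = -0.75.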
Covariance

Series.cov() computes the covariance between two Series, ignoring NaN values.
In : s1 = pd.Series(np.random.randn(1000))

In : s2 = pd.Series(np.random.randn(1000))

In : s1.cov(s2)
Out: 0.0006801088174310875
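The handling of NaN can be checked by hand. The following sketch, with made-up values, drops every pair in which either value is missing and then applies the sample covariance formula with the n - 1 denominator, which is the default pandas uses.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

# Pairs where either value is NaN are excluded from the calculation.
cov_xy = x.cov(y)

# Manual check on the complete pairs only: (1, 2), (2, 4), (5, 10).
pairs = pd.concat([x, y], axis=1).dropna()
dx = pairs[0] - pairs[0].mean()
dy = pairs[1] - pairs[1].mean()
manual = (dx * dy).sum() / (len(pairs) - 1)

print(cov_xy, manual)
```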
Similarly, DataFrame.cov() computes the pairwise covariances between the columns of a DataFrame, again ignoring NaN values.
In : frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In : frame.cov()
Out:
          a         b         c         d         e
a  1.000882 -0.003177 -0.002698 -0.006889  0.031912
b -0.003177  1.024721  0.000191  0.009212  0.000857
c -0.002698  0.000191  0.950735 -0.031743 -0.005087
d -0.006889  0.009212 -0.031743  1.002983 -0.047952
e  0.031912  0.000857 -0.005087 -0.047952  1.042487
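Two properties of the covariance matrix are visible in the output above: it is symmetric, and its diagonal holds the variance of each column. A short sketch to confirm both, assuming an arbitrary seed for repeatability:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
frame = pd.DataFrame(rng.standard_normal((200, 3)), columns=["a", "b", "c"])

cov = frame.cov()

# The diagonal of the covariance matrix is the per-column variance.
diagonal = pd.Series(np.diag(cov), index=cov.index)

# An off-diagonal entry matches the Series-level covariance of that pair.
pair_ab = frame["a"].cov(frame["b"])

print(diagonal)
print(frame.var())
```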
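DataFrame.cov takes a min_periods parameter that sets the minimum number of valid observations required for each pair of columns. Pairs with fewer observations yield NaN, which avoids reporting a covariance estimated from too little data.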
In : frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

# blank out the first five values of a and the next five values of b
In : frame.loc[frame.index[:5], "a"] = np.nan

In : frame.loc[frame.index[5:10], "b"] = np.nan

In : frame.cov()
Out:
          a         b         c
a  1.123670 -0.412851  0.018169
b -0.412851  1.154141  0.305260
c  0.018169  0.305260  1.301149

In : frame.cov(min_periods=12)
Out:
          a         b         c
a  1.123670       NaN  0.018169
b       NaN  1.154141  0.305260
c  0.018169  0.305260  1.301149
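The effect of min_periods depends on how many rows are valid in both columns of a pair. The sketch below uses a small made-up frame in which columns a and b share only two valid rows, so their covariance is dropped once min_periods exceeds that count.

```python
import numpy as np
import pandas as pd

base = np.arange(10.0)
frame = pd.DataFrame({"a": base, "b": base ** 2, "c": base[::-1]})

frame.loc[frame.index[:4], "a"] = np.nan  # a is valid in rows 4-9
frame.loc[frame.index[6:], "b"] = np.nan  # b is valid in rows 0-5

# a and b are both valid only in rows 4 and 5 (two observations).
loose = frame.cov()
strict = frame.cov(min_periods=5)

print(loose)
print(strict)
```

With the default setting the (a, b) entry is computed from just two points. With min_periods=5 it becomes NaN, while the pairs involving c keep their values because they each have six shared observations.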
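Correlation

The corr() method computes correlation coefficients. Three methods are available:

Method name         Description
pearson (default)   Standard correlation coefficient
kendall             Kendall Tau correlation coefficient
spearman            Spearman rank correlation coefficient

In : frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])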

In : frame.iloc[::2] = np.nan

# Series with Series
In : frame["a"].corr(frame["b"])
Out: 0.013479040400098775

In : frame["a"].corr(frame["b"], method="spearman")
Out: -0.007289885159540637

# Pairwise correlation of DataFrame columns
In : frame.corr()
Out:
          a         b         c         d         e
a  1.000000  0.013479 -0.049269 -0.042239 -0.028525
b  0.013479  1.000000 -0.020433 -0.011139  0.005654
c -0.049269 -0.020433  1.000000  0.018587 -0.054269
d -0.042239 -0.011139  0.018587  1.000000 -0.017060
e -0.028525  0.005654 -0.054269 -0.017060  1.000000
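Random data makes it hard to see how the methods differ. The sketch below uses made-up data with a monotonic but strongly non-linear relationship: Pearson stays well below 1, while Spearman reaches 1 because it only looks at the ordering. It also checks that Spearman equals Pearson applied to the ranks. Kendall is left out here only to keep the example short.

```python
import numpy as np
import pandas as pd

x = np.arange(1.0, 11.0)
# y always grows with x, but far from linearly.
data = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = data.corr(method="pearson").loc["x", "y"]
spearman = data.corr(method="spearman").loc["x", "y"]

# Spearman's coefficient is Pearson's coefficient computed on the ranks.
on_ranks = data.rank().corr(method="pearson").loc["x", "y"]

print(pearson, spearman, on_ranks)
```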
corr also supports min_periods:
In : frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

# blank out the first five values of a and the next five values of b
In : frame.loc[frame.index[:5], "a"] = np.nan

In : frame.loc[frame.index[5:10], "b"] = np.nan

In : frame.corr()
Out:
          a         b         c
a  1.000000 -0.121111  0.069544
b -0.121111  1.000000  0.051742
c  0.069544  0.051742  1.000000

In : frame.corr(min_periods=12)
Out:
          a         b         c
a  1.000000       NaN  0.069544
b       NaN  1.000000  0.051742
c  0.069544  0.051742  1.000000
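corrwith computes the correlations between like-labeled rows or columns of two different DataFrames. Labels that appear in only one of them yield NaN.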
In : index = ["a", "b", "c", "d", "e"]

In : columns = ["one", "two", "three", "four"]

In : df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)

In : df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)

In : df1.corrwith(df2)
Out:
one -0.125501
two -0.493244
three 0.344056
four 0.004183
dtype: float64

In : df2.corrwith(df1, axis=1)
Out:
a -0.675817
b 0.458296
c 0.190809
d -0.186275
e       NaN
dtype: float64
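A deterministic example makes the label matching in corrwith easier to verify. In this sketch the frames and their values are made up: column one is perfectly positively related across the two frames, column two perfectly negatively, and column three exists only in the second frame.

```python
import pandas as pd

idx = ["a", "b", "c", "d"]
left = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0],
                     "two": [4.0, 3.0, 2.0, 1.0]}, index=idx)
right = pd.DataFrame({"one": [2.0, 4.0, 6.0, 8.0],
                      "two": [1.0, 2.0, 3.0, 4.0],
                      "three": [0.0, 1.0, 0.0, 1.0]}, index=idx)

# Columns are matched by label; "three" has no partner in left.
by_column = left.corrwith(right)

print(by_column)
```

The result should be close to 1 for one, close to -1 for two, and NaN for three.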
Rank

The rank method assigns a rank to each value in a Series. What is a rank? An example:
In : s = pd.Series(np.random.randn(5), index=list("abcde"))

In : s
Out:
a    0.336259
b    1.073116
c   -0.402291
d    0.624186
e   -0.422478
dtype: float64

In : s["d"] = s["b"]  # so there's a tie

In : s
Out:
a    0.336259
b    1.073116
c   -0.402291
d    1.073116
e   -0.422478
dtype: float64

In : s.rank()
Out:
a    3.0
b    4.5
c    2.0
d    4.5
e    1.0
dtype: float64
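The Series created above, sorted in ascending order, is:
-0.422478 < -0.402291 < 0.336259 < 1.073116 = 1.073116
so the corresponding positions are 1, 2, 3, 4, 5.
Because two values are equal, by default each of them receives the average of their two positions, which is 4.5.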
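The averaging step can be reproduced in two stages. This is only a sketch of the idea with rounded, made-up values, not how pandas implements it internally: first assign positions 1 to n in sorted order, then average the positions within each group of equal values.

```python
import pandas as pd

s = pd.Series([0.3, 1.1, -0.4, 1.1, -0.5], index=list("abcde"))

# Step 1: positions 1..n in sorted order, ties broken by order of appearance.
ordinal = s.rank(method="first")

# Step 2: average the positions within each group of equal values.
averaged = ordinal.groupby(s).transform("mean")

print(ordinal)
print(averaged)
print(s.rank())
```

The tied values at b and d occupy positions 4 and 5, so both end up with 4.5, matching the default output of rank().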
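The default behavior (method='average') is stored as default_rank in the example that follows. Passing method='max' (max_rank) instead gives every tied value the highest rank in its group; the two tied values above would both rank 5.
Passing na_option='bottom' (NA_bottom) ranks NaN values as well and places them at the bottom, that is, gives them the highest rank.
Passing pct=True (pct_rank) returns the ranks as percentiles.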
>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df
    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN
>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN
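The method parameter accepts more values than the two shown above. As a rough side-by-side comparison, the sketch below ranks the same leg counts with each of the tie-breaking methods that pandas documents (average, min, max, first, dense). The variable names are made up for the example.

```python
import numpy as np
import pandas as pd

legs = pd.Series([4, 2, 4, 8, np.nan],
                 index=["cat", "penguin", "dog", "spider", "snake"])

# One column per tie-breaking method; NaN stays unranked by default.
ranks = pd.DataFrame({m: legs.rank(method=m)
                      for m in ["average", "min", "max", "first", "dense"]})

print(ranks)
```

The tied values (cat and dog) get 2.5 with average, 2 with min, 3 with max, and 2 and 3 in order of appearance with first. With dense they both get 2 and the next distinct value gets 3 rather than 4.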
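For a DataFrame, rank can run down the rows (axis=0, ranking the values within each column) or across the columns (axis=1, ranking the values within each row).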
In : df = pd.DataFrame(np.random.randn(10, 6))

In : df[4] = df[2][:5]  # some ties

In : df
Out:
          0         1         2         3         4         5
0 -0.904948 -1.163537 -1.457187  0.135463 -1.457187  0.294650
1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809
2  0.401965  1.460840  1.256057  1.308127  1.256057  0.876004
3  0.205954  0.369552 -0.669304  0.038378 -0.669304  1.140296
4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196
5 -1.092970 -0.689246  0.908114  0.204848       NaN  0.463347
6  0.376892  0.959292  0.095572 -0.593740       NaN -0.069180
7 -1.002601  1.957794 -0.120708  0.094214       NaN -1.467422
8 -0.547231  0.664402 -0.519424 -0.073254       NaN -1.263544
9 -0.250277 -0.237428 -1.056443  0.419477       NaN  1.375064

In : df.rank(axis=1)
Out:
     0    1    2    3    4    5
0  4.0  3.0  1.5  5.0  1.5  6.0
1  2.0  6.0  4.5  1.0  4.5  3.0
2  1.0  6.0  3.5  5.0  3.5  2.0
3  4.0  5.0  1.5  3.0  1.5  6.0
4  5.0  3.0  1.5  4.0  1.5  6.0
5  1.0  2.0  5.0  3.0  NaN  4.0
6  4.0  5.0  3.0  1.0  NaN  2.0
7  2.0  5.0  3.0  4.0  NaN  1.0
8  2.0  5.0  3.0  4.0  NaN  1.0
9  2.0  3.0  1.0  4.0  NaN  5.0
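The difference between the two axes is easier to see on a tiny labeled table. In this sketch the names and scores are made up: axis=0 ranks the students within each subject, and axis=1 ranks the subjects within each student.

```python
import pandas as pd

scores = pd.DataFrame({"math": [90, 70, 80], "art": [60, 95, 85]},
                      index=["ann", "bob", "cy"])

# axis=0: compare values down each column (students within a subject).
within_columns = scores.rank(axis=0)

# axis=1: compare values across each row (subjects within a student).
within_rows = scores.rank(axis=1)

print(within_columns)
print(within_rows)
```

With axis=0, ann ranks 3 in math and 1 in art. With axis=1, ann's row ranks math 2 and art 1, because her math score is the higher of her two scores.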
This article is also published at http://www.flydean.com/10-python-pandas-statistical/
Source: 51CTO技术博客, https://blog.51cto.com/u_11256213/3006953
