Version 0.13.1 (February 3, 2014)#

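This is a minor release from 0.13.0 and includes a small number of API changes, several new features, enhancements, and performance improvements, along with a large number of bug fixes. We recommend that all users upgrade to this version.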

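Highlights include:

Added the infer_datetime_format keyword to read_csv/to_datetime to allow speedups for homogeneously formatted datetimes.
Display precision for datetime/timedelta formats is now limited intelligently.
Enhanced Panel apply() method.
Suggested tutorials in the new Tutorials section.
Our pandas ecosystem is growing; we now feature related projects in a new Pandas Ecosystem section.
Much work has gone into improving the docs, and a new Contributing section has been added.
Even though it may only be of interest to devs, we <3 our new CI status page: ScatterCI.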

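Warning

0.13.1 fixes a bug that was caused by a combination of having numpy < 1.8 and doing chained assignment on a string-like array. Please review the docs: chained indexing can have unexpected results and should generally be avoided.

This would previously segfault: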

df = pd.DataFrame({"A": np.array(["foo", "bar", "bah", "foo", "bar"])})
df["A"].iloc[0] = np.nan

The recommended way to do this type of assignment is:

In [1]: df = pd.DataFrame({"A": np.array(["foo", "bar", "bah", "foo", "bar"])})

In [2]: df.loc[0, "A"] = np.nan

In [3]: df
Out[3]: 
     A
0  NaN
1  bar
2  bah
3  foo
4  bar
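
The difference between the two forms can be checked directly. The sketch below is a minimal illustration using the same frame as above; the names first_is_missing and remaining are introduced here only for the check and are not part of the release.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.array(["foo", "bar", "bah", "foo", "bar"])})

# Chained form (avoid): df["A"] selects first, then .iloc assigns on the
# intermediate object, so whether df itself changes is not guaranteed.
# df["A"].iloc[0] = np.nan

# Single-step form: one .loc call addresses the row and the column together.
df.loc[0, "A"] = np.nan

first_is_missing = bool(pd.isna(df.loc[0, "A"]))
remaining = df["A"].iloc[1:].tolist()
```

Only the first row is changed; the other four values are left as they were.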

Output formatting enhancements#

df.info() view now displays dtype info per column (GH 5682)

df.info() now honors the option max_info_rows, to disable null counts for large frames (GH 5974)

In [4]: max_info_rows = pd.get_option("max_info_rows")

In [5]: df = pd.DataFrame(
   ...:     {
   ...:         "A": np.random.randn(10),
   ...:         "B": np.random.randn(10),
   ...:         "C": pd.date_range("20130101", periods=10),
   ...:     }
   ...: )
   ...: 

In [6]: df.iloc[3:6, [0, 2]] = np.nan

# set to not display the null counts
In [7]: pd.set_option("max_info_rows", 0)

In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Dtype         
---  ------  -----         
 0   A       float64       
 1   B       float64       
 2   C       datetime64[ns]
dtypes: datetime64[ns](1), float64(2)
memory usage: 368.0 bytes

# this is the default (same as in 0.13.0)
In [9]: pd.set_option("max_info_rows", max_info_rows)

In [10]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       7 non-null      float64       
 1   B       10 non-null     float64       
 2   C       7 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(2)
memory usage: 368.0 bytes
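
When the option should only apply to a single call, it can be scoped instead of set and restored by hand. The following is a sketch, assuming the full option key display.max_info_rows and using pd.option_context; the helper info_text is a hypothetical name introduced here to capture the printed text.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": np.random.randn(10),
        "B": np.random.randn(10),
        "C": pd.date_range("20130101", periods=10),
    }
)


def info_text(frame):
    """Capture the text that frame.info() would print (hypothetical helper)."""
    buf = io.StringIO()
    frame.info(buf=buf)
    return buf.getvalue()


# Inside the block the frame has more rows than max_info_rows,
# so the per-column null counts are skipped.
with pd.option_context("display.max_info_rows", 0):
    without_counts = info_text(df)

# Outside the block the previous setting is back in effect.
with_counts = info_text(df)
```

This assumes the default value of the option is large enough for a ten-row frame to show its counts.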

Add show_dimensions display option for the new DataFrame repr to control whether the dimensions print.

In [11]: df = pd.DataFrame([[1, 2], [3, 4]])

In [12]: pd.set_option("show_dimensions", False)

In [13]: df
Out[13]: 
   0  1
0  1  2
1  3  4

In [14]: pd.set_option("show_dimensions", True)

In [15]: df
Out[15]: 
   0  1
0  1  2
1  3  4

[2 rows x 2 columns]

The ArrayFormatter for datetime and timedelta64 now intelligently limits precision based on the values in the array (GH 3401)

Previously, output might look like:

  age                 today               diff
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00

Now the output looks like:

In [16]: df = pd.DataFrame(
   ....:     [pd.Timestamp("20010101"), pd.Timestamp("20040601")], columns=["age"]
   ....: )
   ....: 

In [17]: df["today"] = pd.Timestamp("20130419")

In [18]: df["diff"] = df["today"] - df["age"]

In [19]: df
Out[19]: 
         age      today      diff
0 2001-01-01 2013-04-19 4491 days
1 2004-06-01 2013-04-19 3244 days

[2 rows x 3 columns]

API changes#

Add -NaN and -nan to the default set of NA values (GH 5952). See NA Values.
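
A small sketch of the effect when reading text data. The CSV content and the column names x and y are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text; "-NaN" and "-nan" stand in for missing entries.
csv_text = "x,y\n1.5,-NaN\n-nan,2.5\n"

df = pd.read_csv(io.StringIO(csv_text))

# Both markers are read as missing values, so the columns stay numeric.
missing_per_column = df.isna().sum().to_dict()
column_dtypes = df.dtypes.astype(str).to_dict()
```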

Added Series.str.get_dummies vectorized string method (GH 6021), to extract dummy/indicator variables for separated string columns:

In [20]: s = pd.Series(["a", "a|b", np.nan, "a|c"])

In [21]: s.str.get_dummies(sep="|")
Out[21]: 
   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1

[4 rows x 3 columns]

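Added the NDFrame.equals() method to compare whether two NDFrames are equal, that is, have equal axes, dtypes, and values. Added the array_equivalent function to compare whether two ndarrays are equal. NaNs in identical locations are treated as equal. (GH 5283) See also the docs for a motivating example.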
```
df = pd.DataFrame({"col": ["foo", 0, np.nan]})
df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])
df.equals(df2)
df.equals(df2.sort_index())
```
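
The reason a dedicated method helps is that element-wise comparison never treats NaN as equal to itself. The sketch below contrasts the two on a small Series; the data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
same = s.copy()

# Element-wise == follows IEEE rules: NaN != NaN, so one position is False.
elementwise_all_equal = bool((s == same).all())

# equals() treats NaNs in the same positions as equal.
equals_result = s.equals(same)

# A different dtype makes equals() return False even when values match.
as_int = pd.Series([1, 2, 3])
as_float = pd.Series([1.0, 2.0, 3.0])
dtype_sensitive = as_int.equals(as_float)
```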

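DataFrame.apply will use the reduce argument to determine whether a Series or a DataFrame should be returned when the DataFrame is empty (GH 6007).

Previously, calling DataFrame.apply on an empty DataFrame would return either a DataFrame if there were no columns, or the function being applied would be called with an empty Series to guess whether a Series or DataFrame should be returned: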

In [32]: def applied_func(col):
  ....:    print("Apply function being called with: ", col)
  ....:    return col.sum()
  ....:

In [33]: empty = DataFrame(columns=['a', 'b'])

In [34]: empty.apply(applied_func)
Apply function being called with:  Series([], Length: 0, dtype: float64)
Out[34]:
a   NaN
b   NaN
Length: 2, dtype: float64

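Now, when apply is called on an empty DataFrame: if the reduce argument is True, a Series will be returned; if it is False, a DataFrame will be returned; and if it is None (the default), the function being applied will be called with an empty Series to try to guess the return type.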

In [35]: empty.apply(applied_func, reduce=True)
Out[35]:
a   NaN
b   NaN
Length: 2, dtype: float64

In [36]: empty.apply(applied_func, reduce=False)
Out[36]:
Empty DataFrame
Columns: [a, b]
Index: []

[0 rows x 2 columns]

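Prior version deprecations/changes#

There are no announced changes in 0.13 or prior that are taking effect as of 0.13.1

Deprecations#

There are no deprecations of prior behavior in 0.13.1

Enhancements#

pd.read_csv and pd.to_datetime learned a new infer_datetime_format keyword which greatly improves parsing performance in many cases. Thanks to @lexual for suggesting and @danbirken for rapidly implementing. (GH 5490, GH 6021)

If parse_dates is enabled and this flag is set, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.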
```
# Try to infer the format for the index column
df = pd.read_csv(
    "foo.csv", index_col=0, parse_dates=True, infer_datetime_format=True
)
```
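
The faster path depends on every string in a column sharing one format. As an illustration of that idea rather than of the keyword itself, the sketch below supplies the format by hand through the format argument of pd.to_datetime. The sample strings and the "%Y-%m-%d" layout are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data: every string uses the same layout.
date_strings = ["2014-02-01", "2014-02-02", "2014-02-03"]

# With the format given up front, each string is parsed against one known
# pattern instead of being guessed individually.
parsed = pd.to_datetime(date_strings, format="%Y-%m-%d")

last_day = parsed[-1]
```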
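The date_format and datetime_format keywords can now be specified when writing to Excel files (GH 4133)

MultiIndex.from_product is a convenience function for creating a MultiIndex from the Cartesian product of a set of iterables (GH 6055):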

In [22]: shades = ["light", "dark"]

In [23]: colors = ["red", "green", "blue"]

In [24]: pd.MultiIndex.from_product([shades, colors], names=["shade", "color"])
Out[24]: 
MultiIndex([('light',   'red'),
            ('light', 'green'),
            ('light',  'blue'),
            ( 'dark',   'red'),
            ( 'dark', 'green'),
            ( 'dark',  'blue')],
           names=['shade', 'color'])
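
To make the "Cartesian product" wording concrete, the sketch below builds the same index a second way, from explicit tuples produced by itertools.product, and compares the two. The variable names by_product and by_tuples are introduced here for illustration.

```python
import itertools

import pandas as pd

shades = ["light", "dark"]
colors = ["red", "green", "blue"]
names = ["shade", "color"]

by_product = pd.MultiIndex.from_product([shades, colors], names=names)

# Every (shade, color) pair, in the same order as the nested loops would give.
pairs = list(itertools.product(shades, colors))
by_tuples = pd.MultiIndex.from_tuples(pairs, names=names)

same_index = by_product.equals(by_tuples)
```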

Panel apply() will work on non-ufuncs. See the docs.

In [28]: import pandas._testing as tm

In [29]: panel = tm.makePanel(5)

In [30]: panel
Out[30]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D

In [31]: panel['ItemA']
Out[31]:
                   A         B         C         D
2000-01-03 -0.673690  0.577046 -1.344312 -1.469388
2000-01-04  0.113648 -1.715002  0.844885  0.357021
2000-01-05 -1.478427 -1.039268  1.075770 -0.674600
2000-01-06  0.524988 -0.370647 -0.109050 -1.776904
2000-01-07  0.404705 -1.157892  1.643563 -0.968914

[5 rows x 4 columns]

Specifying an apply that operates on a Series (to return a single element)

In [32]: panel.apply(lambda x: x.dtype, axis='items')
Out[32]:
                  A        B        C        D
2000-01-03  float64  float64  float64  float64
2000-01-04  float64  float64  float64  float64
2000-01-05  float64  float64  float64  float64
2000-01-06  float64  float64  float64  float64
2000-01-07  float64  float64  float64  float64

[5 rows x 4 columns]

A similar reduction type operation

In [33]: panel.apply(lambda x: x.sum(), axis='major_axis')
Out[33]:
      ItemA     ItemB     ItemC
A -1.108775 -1.090118 -2.984435
B -3.705764  0.409204  1.866240
C  2.110856  2.960500 -0.974967
D -4.532785  0.303202 -3.685193

[4 rows x 3 columns]

This is equivalent to

In [34]: panel.sum('major_axis')
Out[34]:
      ItemA     ItemB     ItemC
A -1.108775 -1.090118 -2.984435
B -3.705764  0.409204  1.866240
C  2.110856  2.960500 -0.974967
D -4.532785  0.303202 -3.685193

[4 rows x 3 columns]

A transformation operation that returns a Panel, but is computing the z-score across the major_axis

In [35]: result = panel.apply(lambda x: (x - x.mean()) / x.std(),
  ....:                      axis='major_axis')
  ....:

In [36]: result
Out[36]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D

In [37]: result['ItemA']                           # noqa E999
Out[37]:
                  A         B         C         D
2000-01-03 -0.535778  1.500802 -1.506416 -0.681456
2000-01-04  0.397628 -1.108752  0.360481  1.529895
2000-01-05 -1.489811 -0.339412  0.557374  0.280845
2000-01-06  0.885279  0.421830 -0.453013 -1.053785
2000-01-07  0.742682 -0.474468  1.041575 -0.075499

[5 rows x 4 columns]
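
The z-score transformation itself can be tried on a plain DataFrame. This is a minimal sketch, assuming a single frame stands in for one item of the panel; the random data and the names frame and zscores are illustrative and not taken from the release.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a single Panel item such as panel['ItemA'].
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    rng.standard_normal((5, 4)),
    index=pd.date_range("2000-01-03", periods=5),
    columns=list("ABCD"),
)

# Column-wise z-score: subtract each column's mean, divide by its std.
zscores = frame.apply(lambda x: (x - x.mean()) / x.std())

column_means = zscores.mean()
column_stds = zscores.std()
```

After the transformation each column has mean 0 and standard deviation 1, up to floating-point error.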

Panel apply() operating on cross-sectional slabs. (GH 1148)

In [38]: def f(x):
   ....:     return ((x.T - x.mean(1)) / x.std(1)).T
   ....:

In [39]: result = panel.apply(f, axis=['items', 'major_axis'])

In [40]: result
Out[40]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: A to D
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: ItemA to ItemC

In [41]: result.loc[:, :, 'ItemA']
Out[41]:
                   A         B         C         D
2000-01-03  0.012922 -0.030874 -0.629546 -0.757034
2000-01-04  0.392053 -1.071665  0.163228  0.548188
2000-01-05 -1.093650 -0.640898  0.385734 -1.154310
2000-01-06  1.005446 -1.154593 -0.595615 -0.809185
2000-01-07  0.783051 -0.198053  0.919339 -1.052721

[5 rows x 4 columns]

This is equivalent to the following

In [42]: result = pd.Panel({ax: f(panel.loc[:, :, ax]) for ax in panel.minor_axis})

In [43]: result
Out[43]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: A to D
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: ItemA to ItemC

In [44]: result.loc[:, :, 'ItemA']
Out[44]:
                   A         B         C         D
2000-01-03  0.012922 -0.030874 -0.629546 -0.757034
2000-01-04  0.392053 -1.071665  0.163228  0.548188
2000-01-05 -1.093650 -0.640898  0.385734 -1.154310
2000-01-06  1.005446 -1.154593 -0.595615 -0.809185
2000-01-07  0.783051 -0.198053  0.919339 -1.052721

[5 rows x 4 columns]

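Performance#

Performance improvements for 0.13.1

Series datetime/timedelta binary operations (GH 5801)
DataFrame count/dropna for axis=1
Series.str.contains now has a regex=False keyword which can be faster for plain (non-regex) string patterns. (GH 5879)
Series.str.extract (GH 5944)
dtypes/ftypes methods (GH 5968)
Indexing with object dtypes (GH 5968)
DataFrame.apply (GH 6013)
Regression in JSON IO (GH 5765)
Index construction from Series (GH 6150)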
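
Besides speed, regex=False changes how the pattern is interpreted. A short sketch of the difference, with sample strings invented for the example:

```python
import pandas as pd

s = pd.Series(["a.b", "axb", "a+b"])

# Default: the pattern is a regular expression, so "." matches any character.
as_regex = s.str.contains("a.b").tolist()

# regex=False: the pattern is a literal substring.
as_literal = s.str.contains("a.b", regex=False).tolist()
```

With the literal interpretation only the first string matches, while the regular expression matches all three.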

Experimental#

There are no experimental changes in 0.13.1

Bug fixes#

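Bug in io.wb.get_countries not including all countries (GH 6008)
Bug in Series replace with timestamp dict (GH 5797)
read_csv/read_table now respects the prefix kwarg (GH 5732).
Bug in selection with missing values via .ix from a duplicate indexed DataFrame failing (GH 5835)
Fix issue of boolean comparison on empty DataFrames (GH 5808)
Bug in isnull handling NaT in an object array (GH 5443)
Bug in to_datetime when passed a np.nan or integer datelike and a format string (GH 5863)
Bug in groupby dtype conversion with datetimelike (GH 5869)
Regression in handling of empty Series as indexers to Series (GH 5877)
Bug in internal caching, related to (GH 5727)
Testing bug in reading JSON/msgpack from a non-filepath on Windows under py3 (GH 5874)
Bug when assigning to .ix[tuple(…)] (GH 5896)
Bug in fully reindexing a Panel (GH 5905)
Bug in idxmin/max with object dtypes (GH 5914)
Bug in BusinessDay when adding n days to a date not on offset when n>5 and n%5==0 (GH 5890)
Bug in assigning to chained series with a series via ix (GH 5928)
Bug in creating an empty DataFrame, copying, then assigning (GH 5932)
Bug in DataFrame.tail with empty frame (GH 5846)
Bug in propagating metadata on resample (GH 5862)
Fixed string-representation of NaT to be "NaT" (GH 5708)
Fixed string-representation for Timestamp to show nanoseconds if present (GH 5912)
pd.match not returning passed sentinel
Panel.to_frame() no longer fails when major_axis is a MultiIndex (GH 5402).
Bug in pd.read_msgpack with inferring a DateTimeIndex frequency incorrectly (GH 5947)
Fixed to_datetime for array with both Tz-aware datetimes and NaT's (GH 5961)
Bug in rolling skew/kurtosis when passed a Series with bad data (GH 5749)
Bug in scipy interpolate methods with a datetime index (GH 5975)
Bug in NaT comparison if a mixed datetime/np.datetime64 with NaT were passed (GH 5968)
Fixed bug with pd.concat losing dtype information if all inputs are empty (GH 5742)
Recent changes in IPython cause warnings to be emitted when using previous versions of pandas in QTConsole, now fixed. If you're using an older version and need to suppress the warnings, see (GH 5922).
Bug in merging timedelta dtypes (GH 5695)
Bug in plotting.scatter_matrix function. Wrong alignment among diagonal and off-diagonal plots, see (GH 5497).
Regression in Series with a MultiIndex via ix (GH 6018)
Bug in Series.xs with a MultiIndex (GH 6018)
Bug in Series construction of mixed type with datelike and an integer (which should result in object type and not automatic conversion) (GH 6028)
Possible segfault when chained indexing with an object array under NumPy 1.7.1 (GH 6026, GH 6056)
Bug in setting using fancy indexing a single element with a non-scalar (e.g. a list) (GH 6043)
to_sql did not respect if_exists (GH 4110, GH 4304)
Regression in .get(None) indexing from 0.12 (GH 5652)
Subtle iloc indexing bug, surfaced in (GH 6059)
Bug with insert of strings into DatetimeIndex (GH 5818)
Fixed unicode bug in to_html/HTML repr (GH 6098)
Fixed missing arg validation in get_options_data (GH 6105)
Bug in assignment with duplicate columns in a frame where the locations are a slice (e.g. next to each other) (GH 6120)
Bug in propagating _ref_locs during construction of a DataFrame with dups index/columns (GH 6121)
Bug in DataFrame.apply when using mixed datelike reductions (GH 6125)
Bug in DataFrame.append when appending a row with different columns (GH 6129)
Bug in DataFrame construction with recarray and non-ns datetime dtype (GH 6140)
Bug in .loc setitem indexing with a dataframe on rhs, multiple item setting, and a datetimelike (GH 6152)
Fixed a bug in query/eval during lexicographic string comparisons (GH 6155).
Fixed a bug in query where the index of a single-element Series was being thrown away (GH 6148).
Bug in HDFStore on appending a dataframe with MultiIndexed columns to an existing table (GH 6167)
Consistency with dtypes in setting an empty DataFrame (GH 6171)
Bug in selecting on a MultiIndex HDFStore even in the presence of under specified column spec (GH 6169)
Bug in nanops.var with ddof=1 and 1 elements would sometimes return inf rather than nan on some platforms (GH 6136)
Bug in Series and DataFrame bar plots ignoring the use_index keyword (GH 6209)
Bug in groupby with mixed str/int under python3 fixed; argsort was failing (GH 6212)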
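
A few of the fixes above describe behavior that is easy to check interactively. The sketch below covers three of them: the NaT string representation, nanoseconds in the Timestamp string, and the variance of a single element. The sample values are chosen here for illustration.

```python
import pandas as pd

# NaT renders as the string "NaT".
nat_text = str(pd.NaT)

# A Timestamp with a non-zero nanosecond component shows all nine digits.
ts_text = str(pd.Timestamp("2014-02-03 00:00:00.000000001"))

# The sample variance (ddof=1) of a single value is undefined, hence NaN.
single_var = pd.Series([1.0]).var(ddof=1)
```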

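Contributors#

A total of 52 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.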

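Alex Rothberg
Alok Singhal +
Andrew Burrows +
Andy Hayden
Bjorn Arneson +
Brad Buran
Caleb Epstein
Chapman Siu
Chase Albert +
Clark Fitzgerald +
DSM
Dan Birken
Daniel Waeber +
David Wolever +
Doran Deluz +
Douglas McNeil +
Douglas Rudd +
Dražen Lučanin
Elliot S +
Felix Lawrence +
George Kuan +
Guillaume Gay +
Jacob Schaer
Jan Wagner +
Jeff Tratner
John McNamara
Joris Van den Bossche
Julia Evans +
Kieran O'Mahony
Michael Schatzow +
Naveen Michaud-Agrawal +
Patrick O'Keeffe +
Phillip Cloud
Roman Pekar
Skipper Seabold
Spencer Lyon
Tom Augspurger +
TomAugspurger
acorbe +
akittredge +
bmu +
bwignall +
chapman siu
danielballan
david +
davidshinn
immerrr +
jreback
lexual
mwaskom +
unutbu
y-p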