In [1]: import pandas as pd
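Data used for this tutorial:
  • This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:

    • PassengerId: Id of every passenger.

    • Survived: Indication whether the passenger survived. 0 for no and 1 for yes.

    • Pclass: One of 3 ticket classes: Class 1, Class 2 and Class 3.

    • Name: Name of the passenger.

    • Sex: Gender of the passenger.

    • Age: Age of the passenger in years.

    • SibSp: Number of siblings or spouses aboard.

    • Parch: Number of parents or children aboard.

    • Ticket: Ticket number of the passenger.

    • Fare: Indicating the fare.

    • Cabin: Cabin number of the passenger.

    • Embarked: Port of embarkation.

    To raw data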
    In [2]: titanic = pd.read_csv("data/titanic.csv")
    
    In [3]: titanic.head()
    Out[3]: 
       PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
    0            1         0       3  ...   7.2500   NaN         S
    1            2         1       1  ...  71.2833   C85         C
    2            3         1       3  ...   7.9250   NaN         S
    3            4         1       1  ...  53.1000  C123         S
    4            5         0       3  ...   8.0500   NaN         S
    
    [5 rows x 12 columns]
    

How to manipulate textual data

  • Make all name characters lowercase.

    In [4]: titanic["Name"].str.lower()
    Out[4]: 
    0                                braund, mr. owen harris
    1      cumings, mrs. john bradley (florence briggs th...
    2                                 heikkinen, miss. laina
    3           futrelle, mrs. jacques heath (lily may peel)
    4                               allen, mr. william henry
                                 ...                        
    886                                montvila, rev. juozas
    887                         graham, miss. margaret edith
    888             johnston, miss. catherine helen "carrie"
    889                                behr, mr. karl howell
    890                                  dooley, mr. patrick
    Name: Name, Length: 891, dtype: object
    

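    To make each of the strings in the Name column lowercase, select the Name column (see the tutorial on selection of data), add the str accessor and apply the lower method. As such, each of the strings is converted element-wise.

Similar to datetime objects in the time series tutorial having a dt accessor, a number of specialized string methods are available when using the str accessor. These methods generally have names matching the equivalent built-in string methods for single elements, but are applied element-wise (remember element-wise calculations?) to each of the values of the column.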
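As a minimal sketch of that element-wise behaviour, the snippet below applies a few str methods to a small hand-made Series. The sample values and the variable names (names, cleaned, starts_with_b) are assumptions for illustration and are not part of the Titanic data set.

```python
import pandas as pd

# Hypothetical sample values, not taken from the Titanic data set.
names = pd.Series(["  Braund, Mr. Owen  ", "HEIKKINEN, Miss. Laina", None])

# Each str method mirrors the built-in Python string method of the same
# name and is applied to every element; the missing value stays missing.
cleaned = names.str.strip().str.lower()

# Methods that return booleans work the same way, one result per element.
starts_with_b = cleaned.str.startswith("b")

print(cleaned)
print(starts_with_b)
```

Unlike calling lower() on a plain Python string, the accessor does not fail on the missing entry; it simply propagates it.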

  • Create a new column Surname that contains the surname of the passengers by extracting the part before the comma.

    In [5]: titanic["Name"].str.split(",")
    Out[5]: 
    0                             [Braund,  Mr. Owen Harris]
    1      [Cumings,  Mrs. John Bradley (Florence Briggs ...
    2                              [Heikkinen,  Miss. Laina]
    3        [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
    4                            [Allen,  Mr. William Henry]
                                 ...                        
    886                             [Montvila,  Rev. Juozas]
    887                      [Graham,  Miss. Margaret Edith]
    888          [Johnston,  Miss. Catherine Helen "Carrie"]
    889                             [Behr,  Mr. Karl Howell]
    890                               [Dooley,  Mr. Patrick]
    Name: Name, Length: 891, dtype: object
    

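    Using the Series.str.split() method, each of the values is returned as a list of 2 elements. The first element is the part before the comma and the second element is the part after the comma.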

    In [6]: titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
    
    In [7]: titanic["Surname"]
    Out[7]: 
    0         Braund
    1        Cumings
    2      Heikkinen
    3       Futrelle
    4          Allen
             ...    
    886     Montvila
    887       Graham
    888     Johnston
    889         Behr
    890       Dooley
    Name: Surname, Length: 891, dtype: object
    

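    As we are only interested in the first part representing the surname (element 0), we can again use the str accessor and apply Series.str.get() to extract the relevant part. Indeed, these string functions can be chained to combine multiple functions at once!

To user guide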
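The split-then-get chain can be tried on a couple of names without loading the CSV. This is a sketch under stated assumptions: the two names are copied from the output above, while the variables parts, surname, rest and table, and the choice of n=1 and expand=True, are illustrative additions rather than part of the tutorial.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# n=1 splits at the first comma only, so each list has at most two elements
# even if a name were to contain a second comma.
parts = names.str.split(",", n=1)

# Element 0 is the surname.
surname = parts.str.get(0)

# Element 1 starts with the space that followed the comma, so strip it.
rest = parts.str.get(1).str.strip()

# expand=True returns the pieces as DataFrame columns instead of lists.
table = names.str.split(",", n=1, expand=True)

print(surname.tolist())
print(rest.tolist())
print(table.shape)
```

Whether to keep the pieces as lists or expand them into columns is a matter of taste; the list form suits a single str.get() call, while the column form is handier when both parts are needed.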

More information on extracting parts of strings is available in the user guide section on splitting and replacing strings.

  • Extract the passenger data about the countesses on board the Titanic.

    In [8]: titanic["Name"].str.contains("Countess")
    Out[8]: 
    0      False
    1      False
    2      False
    3      False
    4      False
           ...  
    886    False
    887    False
    888    False
    889    False
    890    False
    Name: Name, Length: 891, dtype: bool
    
    In [9]: titanic[titanic["Name"].str.contains("Countess")]
    Out[9]: 
         PassengerId  Survived  Pclass  ... Cabin Embarked  Surname
    759          760         1       1  ...   B77        S   Rothes
    
    [1 rows x 13 columns]
    

    (Interested in her story? See Wikipedia!)

    The string method Series.str.contains() checks for each of the values in the column Name if the string contains the word Countess and returns for each of the values True (Countess is part of the name) or False (Countess is not part of the name). This output can be used to subselect the data using conditional (boolean) indexing introduced in the subsetting of data tutorial. As there was only one countess on the Titanic, we get one row as a result.

Note

More powerful extractions on strings are supported, as the Series.str.contains() and Series.str.extract() methods accept regular expressions, but they are out of the scope of this tutorial.
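For readers who want a first taste anyway, here is a small sketch of both methods with regular expressions. The three sample names follow the "Surname, Title. Given names" layout seen in the Name column; the patterns and the variable names (is_countess_or_miss, title) are assumptions chosen for this illustration.

```python
import pandas as pd

# Sample names in the same layout as the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Heikkinen, Miss. Laina",
])

# contains() interprets the pattern as a regular expression by default,
# so "Countess|Miss" matches names holding either word.
is_countess_or_miss = names.str.contains("Countess|Miss")

# extract() returns the first capture group: the text between the comma
# and the next full stop, which is the title in this layout.
title = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(is_countess_or_miss.tolist())
print(title.tolist())
```

With expand=False and a single capture group, extract() returns a Series, which can be assigned to a new column in the same way as Surname above.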

To user guide

More information on extracting parts of strings is available in the user guide section on string matching and extracting.

  • Which passenger of the Titanic has the longest name?

    In [10]: titanic["Name"].str.len()
    Out[10]: 
    0      23
    1      51
    2      22
    3      44
    4      24
           ..
    886    21
    887    28
    888    40
    889    21
    890    19
    Name: Name, Length: 891, dtype: int64
    

    To get the longest name we first have to get the lengths of each of the names in the Name column. By using pandas string methods, the Series.str.len() function is applied to each of the names individually (element-wise).

    In [11]: titanic["Name"].str.len().idxmax()
    Out[11]: 307
    

    Next, we need to get the corresponding location, preferably the index label, in the table for which the name length is the largest. The idxmax() method does exactly that. It is not a string method and is applied to integers, so no str is used.

    In [12]: titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
    Out[12]: 'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'
    

    Based on the index name of the row (307) and the column (Name), we can do a selection using the loc operator, introduced in the tutorial on subsetting.
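The three steps can be replayed on a tiny table. The sketch below assumes a hand-made DataFrame whose index labels (10, 20, 30) deliberately differ from the row positions, to show that idxmax() returns a label that loc understands; the names df, lengths, longest_label and longest_name are illustrative.

```python
import pandas as pd

# Names copied from the output above; the index labels are made up.
df = pd.DataFrame(
    {"Name": ["Dooley, Mr. Patrick", "Allen, Mr. William Henry", "Behr, Mr. Karl Howell"]},
    index=[10, 20, 30],
)

# Step 1: length of every name (a column of integers).
lengths = df["Name"].str.len()

# Step 2: index label of the largest length; no str accessor here,
# because idxmax() works on the integer column.
longest_label = lengths.idxmax()

# Step 3: label-based selection of the row and the Name column.
longest_name = df.loc[longest_label, "Name"]

print(lengths.tolist())   # [19, 24, 21]
print(longest_label)      # 20
print(longest_name)       # Allen, Mr. William Henry
```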

  • In the “Sex” column, replace values of “male” by “M” and values of “female” by “F”.

    In [13]: titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})
    
    In [14]: titanic["Sex_short"]
    Out[14]: 
    0      M
    1      F
    2      F
    3      F
    4      M
          ..
    886    M
    887    F
    888    F
    889    M
    890    M
    Name: Sex_short, Length: 891, dtype: object
    

    Although replace() is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values. It requires a dictionary to define the mapping {from : to}.
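A short sketch of the dictionary form follows. The third value, "unknown", does not occur in the Titanic Sex column; it is an assumption added here to show that values missing from the mapping are left untouched.

```python
import pandas as pd

# "unknown" is a hypothetical value used only for this illustration.
sex = pd.Series(["male", "female", "unknown"])

# Series.replace() with a dictionary matches whole values, not substrings.
sex_short = sex.replace({"male": "M", "female": "F"})

print(sex_short.tolist())  # ['M', 'F', 'unknown']
```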

Warning

There is also a Series.str.replace() method available to replace a specific set of characters. However, with a mapping of multiple values, this would become:

titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")

This would become cumbersome and easily lead to mistakes. Just think (or try out yourself) what would happen if those two statements are applied in the opposite order…
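For those who prefer to see the outcome, the following sketch runs both orders on a two-value Series. The variable names in_order and reversed_order are illustrative, and regex=False is passed explicitly so the calls behave as plain substring replacement regardless of the pandas version's default.

```python
import pandas as pd

sex = pd.Series(["male", "female"])

# Order used in the warning: "female" first, then "male".
in_order = (
    sex.str.replace("female", "F", regex=False)
       .str.replace("male", "M", regex=False)
)

# Opposite order: "male" is a substring of "female", so the first call
# already turns "female" into "feM" and the second call finds no match.
reversed_order = (
    sex.str.replace("male", "M", regex=False)
       .str.replace("female", "F", regex=False)
)

print(in_order.tolist())        # ['M', 'F']
print(reversed_order.tolist())  # ['M', 'feM']
```

The dictionary-based replace() avoids this trap because it compares whole values rather than substrings.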

REMEMBER

  • String methods are available using the str accessor.

  • String methods work element-wise and can be used for conditional indexing.

  • The replace method is a convenient method to convert values according to a given dictionary.

To user guide

A full overview is provided in the user guide pages on working with text data.