Zgulde's pandas extensions

This module adds functionality to pandas Series and DataFrame objects; the pandas classes are modified simply by importing this module.

>>> import zgulde.extend_pandas

The following methods are added to all Series:

cut, get_scaler, ln, log, log2, outliers, qcut, top_n, and zscore

and the following are added to all DataFrames:

chi2, cleanup_column_names, correlation_heatmap, crosstab (alias: xtab), drop_outliers, get_scalers, hdtl, nnull (alias: nna), n_outliers, pluck, select, sql, ttest, ttest_2samp, and unnest

It also overloads the left and right shift operators to behave like pandas.DataFrame.pipe. For example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(dict(x=np.arange(4)))
>>> df
   x
0  0
1  1
2  2
3  3
>>> create_y = lambda df: df.assign(y=df.x + 1)
>>> df >> create_y
   x  y
0  0  1
1  1  2
2  2  3
3  3  4
>>> # This gives the same results as .pipe
>>> ((df >> create_y) == df.pipe(create_y)).all(axis=None)
True
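Under the hood, such an operator can be attached with a few lines of monkey-patching. This is a minimal sketch of the idea, not the module's actual source:

```python
import pandas as pd

# Sketch (assumed implementation): forward the right-shift operator
# to DataFrame.pipe so `df >> fn` behaves like `df.pipe(fn)`.
def _rshift(self, fn):
    return self.pipe(fn)

pd.DataFrame.__rshift__ = _rshift

df = pd.DataFrame({"x": range(4)})
result = df >> (lambda d: d.assign(y=d.x + 1))
print(result.y.tolist())  # [1, 2, 3, 4]
```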

Data Frame Extensions

chi2

Performs a chi-squared contingency table test for every pair of columns in the data frame.

Returns

(pvals, chi2s)

A tuple of two data frames, each of which has all of the columns from the original data frame as both its index and its columns. The values in the first are the p-values; the values in the second are the chi-squared test statistics.

Example

>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> p_vals, chi2s = tips[['smoker', 'time', 'day']].chi2()
>>> p_vals
             smoker        time          day
smoker          NaN    0.477149  1.05676e-05
time       0.477149         NaN   8.4499e-47
day     1.05676e-05  8.4499e-47          NaN
>>> chi2s
          smoker      time      day
smoker       NaN  0.505373  25.7872
time    0.505373       NaN  217.113
day      25.7872   217.113      NaN
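Internally, a pairwise test like this could be built on scipy.stats.chi2_contingency. The following is a sketch (pairwise_chi2 is a hypothetical helper, not the module's actual implementation):

```python
from itertools import combinations

import pandas as pd
from scipy import stats

def pairwise_chi2(df):
    # Sketch (assumed implementation): run scipy's chi2_contingency on
    # the crosstab of every pair of columns, collecting p-values and
    # test statistics into two symmetric frames.
    cols = df.columns
    pvals = pd.DataFrame(index=cols, columns=cols, dtype=float)
    chi2s = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for a, b in combinations(cols, 2):
        chi2, p, _, _ = stats.chi2_contingency(pd.crosstab(df[a], df[b]))
        pvals.loc[a, b] = pvals.loc[b, a] = p
        chi2s.loc[a, b] = chi2s.loc[b, a] = chi2
    return pvals, chi2s

demo = pd.DataFrame({"u": list("xxyyxy"), "v": list("ababab")})
p, c = pairwise_chi2(demo)
```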

cleanup_column_names

Returns a data frame with the column names cleaned up. Special characters are removed and spaces, dots, and dashes are replaced with underscores.

Parameters

  • inplace : Whether or not to modify the data frame in-place and return
    None.

Example

>>> df = pd.DataFrame({'*Feature& A': [1, 2], ' feature.b  ': [2, 3], 'FEATURE-C': [3, 4]})
>>> df
   *Feature& A   feature.b    FEATURE-C
0            1             2          3
1            2             3          4
>>> df.cleanup_column_names()
   feature_a  feature_b  feature_c
0          1          2          3
1          2          3          4
>>> df
   *Feature& A   feature.b    FEATURE-C
0            1             2          3
1            2             3          4
>>> df.cleanup_column_names(inplace=True)
>>> df
   feature_a  feature_b  feature_c
0          1          2          3
1          2          3          4
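The cleanup above could be sketched with a couple of regular expressions (an assumed implementation, not the module's source):

```python
import re

import pandas as pd

def clean_names(df):
    # Sketch (assumed implementation): lowercase, replace spaces, dots,
    # and dashes with underscores, then drop remaining special characters.
    cleaned = [
        re.sub(r"[^a-z0-9_]", "", re.sub(r"[ .\-]+", "_", c.strip().lower()))
        for c in df.columns
    ]
    return df.set_axis(cleaned, axis=1)

df = pd.DataFrame({"*Feature& A": [1], "FEATURE-C": [2]})
print(clean_names(df).columns.tolist())  # ['feature_a', 'feature_c']
```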

correlation_heatmap

Plot a heatmap of the correlation matrix for the data frame.

Any additional kwargs are passed to seaborn.heatmap and the resulting axes object is returned.

>>> x = np.arange(0, 10)
>>> y = x / 2
>>> df = pd.DataFrame(dict(x=x, y=y))
>>> df.correlation_heatmap()
<matplotlib.axes._subplots.AxesSubplot object at ...>

crosstab

Shortcut for pd.crosstab. Also available under the alias xtab.

Parameters

  • rows : name of the column that will make up the rows of the resulting
    contingency table
  • cols : name of the column that will make up the columns of the resulting
    contingency table
  • values : (optional) name of the column to use for the cell values in the
    resulting contingency table. If supplied, aggfunc must be provided as well. See pd.crosstab for more details.
  • kwargs : any additional keyword arguments to pass along to
    pd.crosstab

Example

>>> df = pd.DataFrame(dict(x=list('aaabbb'), y=list('cdcdcd'), z=range(6)))
>>> df
   x  y  z
0  a  c  0
1  a  d  1
2  a  c  2
3  b  d  3
4  b  c  4
5  b  d  5
>>> df.crosstab('x', 'y')
y  c  d
x
a  2  1
b  1  2
>>> (df.crosstab('x', 'y') == pd.crosstab(df.x, df.y)).all(axis=None)
True
>>> df.crosstab('x', 'y', margins=True)
y    c  d  All
x
a    2  1    3
b    1  2    3
All  3  3    6
>>> df.xtab(rows='x', cols='y', values='z', aggfunc='mean')
y  c  d
x
a  1  1
b  4  4

drop_outliers

Drop rows with outliers in the given columns from the dataframe.

See the docs for .outliers for more details on parameters, and to customize how the outliers are detected.

Parameters

  • cols : either a string or a list of strings naming the column(s) in which
    to drop outliers
  • kwargs : additional keyword arguments passed on to
    pandas.Series.outliers

Example

>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 1000], y=[1000, 2, 3, 4, 5, 6]))
>>> df
      x     y
0     1  1000
1     2     2
2     3     3
3     4     4
4     5     5
5  1000     6
>>> df.drop_outliers('x')
   x     y
0  1  1000
1  2     2
2  3     3
3  4     4
4  5     5
>>> df.drop_outliers('y')
      x  y
1     2  2
2     3  3
3     4  4
4     5  5
5  1000  6
>>> df.drop_outliers(['x', 'y'])
   x  y
1  2  2
2  3  3
3  4  4
4  5  5
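The dropping logic can be sketched by combining per-column IQR masks (iqr_mask is a hypothetical helper assumed to match the default how='iqr', k=1.5 behaviour of Series.outliers):

```python
import pandas as pd

def iqr_mask(s, k=1.5):
    # Sketch of IQR-based detection (assumed to match the default
    # how='iqr', k=1.5 behaviour).
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 1000], "y": [1000, 2, 3, 4, 5, 6]})
# Keep only rows that are not an outlier in any of the listed columns.
keep = ~(iqr_mask(df.x) | iqr_mask(df.y))
print(df[keep].x.tolist())  # [2, 3, 4, 5]
```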

get_scalers

Obtain a function that will scale multiple columns on a data frame.

The returned function accepts a data frame and returns the data frame with the specified column(s) scaled.

This can be useful to make sure you apply the same transformation to both training and test data sets.

See the docstring for Series.get_scaler for more details.

Parameters

  • columns : Either a single string, or a list of strings where each string
    is a column name to scale
  • kwargs : any additional arguments are passed to Series.get_scaler for each
    column specified

Example

>>> df = pd.DataFrame(dict(x=[1, 2, 3, 10], y=[-10, 1, 1, 2]))
>>> df
    x   y
0   1 -10
1   2   1
2   3   1
3  10   2
>>> scale_x = df.get_scalers('x', how='minmax')
>>> scale_x(df)
          x   y
0  0.000000 -10
1  0.111111   1
2  0.222222   1
3  1.000000   2
>>> scale_x_and_y = df.get_scalers(['x', 'y'])
>>> scale_x_and_y(df)
          x         y
0 -0.734847 -1.494836
1 -0.489898  0.439658
2 -0.244949  0.439658
3  1.469694  0.615521
>>> df.pipe(scale_x_and_y)
          x         y
0 -0.734847 -1.494836
1 -0.489898  0.439658
2 -0.244949  0.439658
3  1.469694  0.615521
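The train/test use case can be illustrated with a closure over the training column's statistics. Here get_minmax_scaler is a hypothetical stand-in for the how='minmax' mode, not the module's source:

```python
import pandas as pd

def get_minmax_scaler(s):
    # Sketch (assumed implementation): capture the training column's
    # min/max in a closure so the exact same transformation can be
    # replayed on new data.
    lo, hi = s.min(), s.max()
    name = s.name
    return lambda df: df.assign(**{name: (df[name] - lo) / (hi - lo)})

train = pd.DataFrame({"x": [0.0, 5.0, 10.0]})
test = pd.DataFrame({"x": [2.5, 20.0]})

scale = get_minmax_scaler(train.x)
print(scale(train).x.tolist())  # [0.0, 0.5, 1.0]
print(scale(test).x.tolist())   # [0.25, 2.0] -- unseen values can fall outside [0, 1]
```

Because the scaler remembers the training statistics, test values are scaled consistently with the training set rather than by their own min and max.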

hdtl

Return the head and the tail of the data frame.

Parameters

  • n : number of rows to get from both the head and tail

Example

>>> df = pd.DataFrame(dict(x=np.arange(10), y=np.arange(10)))
>>> df
   x  y
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9
>>> df.hdtl(1)
   x  y
0  0  0
9  9  9
>>> df.hdtl()
   x  y
0  0  0
1  1  1
2  2  2
7  7  7
8  8  8
9  9  9
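A sketch of the idea, assuming simple head/tail concatenation:

```python
import numpy as np
import pandas as pd

def hdtl(df, n=3):
    # Sketch (assumed implementation): concatenate the head and the tail.
    return pd.concat([df.head(n), df.tail(n)])

df = pd.DataFrame({"x": np.arange(10)})
print(hdtl(df, 1).x.tolist())  # [0, 9]
```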

nnull

Provide a summary of null values in each column.

alias of nna

Example

>>> df = pd.DataFrame(dict(x=[1, 2, np.nan], y=[4, np.nan, np.nan]))
>>> df
     x    y
0  1.0  4.0
1  2.0  NaN
2  NaN  NaN
>>> nulls_by_column = df.nnull()
>>> nulls_by_column
   n_missing  p_missing
x          1   0.333333
y          2   0.666667
>>> nulls_by_row = df.nnull(axis=1)
>>> nulls_by_row
   n_missing  p_missing
0          0        0.0
1          1        0.5
2          2        1.0
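The summary can be sketched directly from isna (an assumed implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, np.nan], "y": [4, np.nan, np.nan]})
# Sketch (assumed implementation): isna().sum() counts missing values
# per column, isna().mean() gives the proportion missing.
summary = pd.DataFrame({"n_missing": df.isna().sum(),
                        "p_missing": df.isna().mean()})
print(summary.n_missing.tolist())  # [1, 2]
```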

n_outliers

Provide a summary of the number of outliers in each numeric column.

Parameters

  • kwargs : any additional arguments to pass along to
    pandas.Series.outliers

Returns

A pandas.DataFrame indexed by the column names of the data frame, with columns that indicate the number of outliers and the percentage of outliers in each column.

Example

>>> x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> y = [1, 2, 3, 4, 5, 100, 2, 3, 4, 5]
>>> z = [1, 2, 3, 4, 5, -100, 2, 3, 100, 5]
>>> df = pd.DataFrame(dict(x=x, y=y, z=z))
>>> df
   x    y    z
0  1    1    1
1  2    2    2
2  3    3    3
3  4    4    4
4  5    5    5
5  1  100 -100
6  2    2    2
7  3    3    3
8  4    4  100
9  5    5    5
>>> df.n_outliers()
   n_outliers  p_outliers
x           0         0.0
y           1         0.1
z           2         0.2

pluck

Return each of the named columns as a separate pandas Series, convenient for tuple unpacking.

Example

>>> df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
>>> x, y = df.pluck('x', 'y')
>>> x
0    1
1    2
2    3
Name: x, dtype: int64
>>> y
0    a
1    b
2    c
Name: y, dtype: object

ttest

Runs a 1-sample t-test comparing the specified target variable against the overall mean within each of the possible subgroups.

Parameters

  • target : name of the column that holds the target (continuous) variable

Example

>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> tips = tips[['total_bill', 'day', 'time']]
>>> tips.ttest('total_bill')
                 statistic    pvalue    n
variable value
day      Sun      1.603035  0.113130   76
         Sat      0.644856  0.520737   87
         Thur    -2.099957  0.039876   62
         Fri     -1.383042  0.183569   19
time     Dinner   1.467432  0.144054  176
         Lunch   -2.797882  0.006710   68
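The per-group test could be sketched with scipy.stats.ttest_1samp (an assumed implementation of the idea, shown on made-up data):

```python
import pandas as pd
from scipy import stats

# Sketch (assumed implementation): each subgroup of the target is
# compared against the overall mean with a 1-sample t-test.
df = pd.DataFrame({"bill": [10.0, 12.0, 11.0, 30.0, 35.0, 32.0],
                   "time": ["Lunch", "Lunch", "Lunch",
                            "Dinner", "Dinner", "Dinner"]})
overall = df.bill.mean()
rows = {}
for group, sub in df.groupby("time"):
    stat, p = stats.ttest_1samp(sub.bill, overall)
    rows[group] = {"statistic": stat, "pvalue": p, "n": len(sub)}
result = pd.DataFrame(rows).T
```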

ttest_2samp

Runs a 2-sample t-test on the specified target variable for every unique value of every other column in the data frame.

For each unique value of each column, the data is split into the rows that belong to that group and the rows that do not, and the reported t-statistic and p-value describe the comparison between those two samples.

Parameters

  • target : name of the column that holds the target (continuous) variable

Example

>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> tips = tips[['total_bill', 'day', 'time']]
>>> tips.ttest_2samp('total_bill')
                 statistic    pvalue    n
variable value
day      Sun      1.927317  0.055111   76
         Sat      0.855634  0.393046   87
         Thur    -2.170294  0.030958   62
         Fri     -1.345462  0.179735   19
time     Dinner   2.897638  0.004105  176
         Lunch   -2.897638  0.004105   68

unnest

Expands a column containing multiple values per row into separate rows, each containing a single value.

Parameters

  • col : name of the column to unnest
  • split : default True. whether or not to split the data in the column. Set
    this to False if the column already contains lists in each row
  • sep : separator to split on. Ignored if split=False
  • reset_index : default True. whether to reset the index in the resulting
    data frame. If False, the resulting data frame will have an index that could contain duplicates.

Example

>>> df = pd.DataFrame(dict(x=list('abc'), y=['a,b,c', 'd,e', 'f']))
>>> df
   x      y
0  a  a,b,c
1  b    d,e
2  c      f
>>> df.unnest('y')
   x  y
0  a  a
1  a  b
2  a  c
3  b  d
4  b  e
5  c  f
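An unnest like this can be sketched with Series.str.split plus DataFrame.explode (an assumed implementation, not the module's source):

```python
import pandas as pd

def unnest(df, col, sep=","):
    # Sketch (assumed implementation): split the column into lists, then
    # expand the lists into one row per element with DataFrame.explode.
    out = df.assign(**{col: df[col].str.split(sep)}).explode(col)
    return out.reset_index(drop=True)

df = pd.DataFrame({"x": list("abc"), "y": ["a,b,c", "d,e", "f"]})
print(unnest(df, "y").y.tolist())  # ['a', 'b', 'c', 'd', 'e', 'f']
```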

select

Return specified columns from a dataframe, optionally renaming some.

Parameters

  • args: strings that are column names to include
  • kwargs: columns to include and rename, given as old_name='new_name' pairs

Returns

A subset of the dataframe

Examples

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9
>>> df.select('a')
   a
0  1
1  2
2  3
>>> df.select('a', 'b')
   a  b
0  1  4
1  2  5
2  3  6
>>> df.select('a', c='the_letter_c')
   a  the_letter_c
0  1             7
1  2             8
2  3             9
>>> df.select(a='A', b='BBB')
   A  BBB
0  1    4
1  2    5
2  3    6

sql

Run a SQL query against the dataframe.

The dataframe is converted to a sqlite database and the provided query is run against it. As such, any SQL that is valid in sqlite is supported.

Parameters

  • query: The SQL query to run
  • table: (optional) name of the table the dataframe is registered as.
    Defaults to df
  • in_memory: whether or not the sqlite database should be in memory or
    written to an external file. If False, a temporary file will be created and cleaned up after the query is run.
  • index: whether or not to include the dataframe's index in the table that
    is queried against.

Examples

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': ['a', 'a', 'b']})
>>> df
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT * FROM df')
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT * FROM my_df', table='my_df')
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT c, AVG(a + b) FROM df GROUP BY c')
   c  AVG(a + b)
0  a         6.0
1  b         9.0
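A sketch of the approach, assuming an in-memory sqlite database built with DataFrame.to_sql and queried with pandas.read_sql (run_sql is a hypothetical helper, not the module's source):

```python
import sqlite3

import pandas as pd

def run_sql(df, query, table="df"):
    # Sketch (assumed implementation): load the frame into an in-memory
    # sqlite database, then read the query result back as a DataFrame.
    with sqlite3.connect(":memory:") as conn:
        df.to_sql(table, conn, index=False)
        return pd.read_sql(query, conn)

df = pd.DataFrame({"a": [1, 2, 3], "c": ["a", "a", "b"]})
result = run_sql(df, "SELECT c, SUM(a) AS total FROM df GROUP BY c ORDER BY c")
print(result.total.tolist())  # [3, 3]
```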

Series Extensions

cut

Bin series values into discrete intervals.

Shortcut to pd.cut

Parameters

  • args : positional arguments passed to pandas.cut
  • kwargs : keyword arguments passed to pandas.cut

Example

>>> x = pd.Series(range(1, 7))
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
>>> x.cut(2)
0    (0.995, 3.5]
1    (0.995, 3.5]
2    (0.995, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (2, interval[float64]): [(0.995, 3.5] < (3.5, 6.0]]
>>> x.cut(bins=[0, 3, 6])
0    (0, 3]
1    (0, 3]
2    (0, 3]
3    (3, 6]
4    (3, 6]
5    (3, 6]
dtype: category
Categories (2, interval[int64]): [(0, 3] < (3, 6]]

get_scaler

Obtain a function that will scale this column on a data frame.

The returned function accepts a data frame and returns the data frame with the specified column scaled.

This can be useful to make sure you apply the same transformation to both training and test data sets.

Two scaling strategies are supported:

  • zscore : (x - mu) / sigma
  • minmax : (x - min) / (max - min)

Parameters

  • how : One of {'zscore', 'minmax'} to either apply z-score or min-max
    scaling

Example

>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 1000], y=[1000, 2, 3, 4, 5, 6]))
>>> scale_x = df.x.get_scaler()
>>> scale_x(df)
          x     y
0 -0.413160  1000
1 -0.410703     2
2 -0.408246     3
3 -0.405789     4
4 -0.403332     5
5  2.041229     6
>>> scale_y = df.y.get_scaler('minmax')
>>> scale_y(df)
      x         y
0     1  1.000000
1     2  0.000000
2     3  0.001002
3     4  0.002004
4     5  0.003006
5  1000  0.004008
>>> df.pipe(scale_x).pipe(scale_y)
          x         y
0 -0.413160  1.000000
1 -0.410703  0.000000
2 -0.408246  0.001002
3 -0.405789  0.002004
4 -0.403332  0.003006
5  2.041229  0.004008

ln

Returns the natural log of the values in the series using np.log

Example

>>> x = pd.Series([1, np.e, np.e ** 2, np.e ** 3])
>>> x
0     1.000000
1     2.718282
2     7.389056
3    20.085537
dtype: float64
>>> x.ln()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

log

Returns the log base 10 of the values in the series using np.log10

Example

>>> x = pd.Series([1, 10, 100, 1000])
>>> x
0       1
1      10
2     100
3    1000
dtype: int64
>>> x.log()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

log2

Returns the log base 2 of the values in the series using np.log2

Example

>>> x = pd.Series([1, 2, 4, 8, 16])
>>> x
0     1
1     2
2     4
3     8
4    16
dtype: int64
>>> x.log2()
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

outliers

Detect outliers in the series.

Returns

A pandas Series of boolean values indicating whether each point is an outlier or not.

Parameters

  • how : {'iqr', 'std'}, default 'iqr'. With 'iqr', a point is an outlier
    when it is > q3 + k * iqr or < q1 - k * iqr; with 'std', when it is
    further than a specified number of standard deviations from the mean
  • k : value to multiply the iqr by for outlier detection. Ignored when
    how='std'. Default 1.5
  • std_cutoff : cutoff for identifying an outlier based on standard
    deviation. Ignored when how='iqr'. Default 2

Example

>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 6, 100],
...                        y=[-100, 5, 3, 4, 1, 2, 0]))
>>> df
     x    y
0    1 -100
1    2    5
2    3    3
3    4    4
4    5    1
5    6    2
6  100    0
>>> df.x.outliers()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
Name: x, dtype: bool
>>> df[df.x.outliers()]
     x  y
6  100  0
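The how='std' path can be sketched as follows (an assumed implementation):

```python
import pandas as pd

def std_outliers(s, cutoff=2):
    # Sketch of the how='std' path (assumed implementation): flag points
    # more than `cutoff` standard deviations from the mean.
    z = (s - s.mean()) / s.std()
    return z.abs() > cutoff

s = pd.Series([1, 2, 3, 4, 5, 100])
print(std_outliers(s).tolist())  # [False, False, False, False, False, True]
```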

qcut

Bin series values into quantile-based discrete intervals.

Shortcut for pd.qcut

Parameters

  • args : positional arguments passed to pandas.qcut
  • kwargs : keyword arguments passed to pandas.qcut

Example

>>> x = pd.Series(range(1, 7))
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
>>> x.qcut(2)
0    (0.999, 3.5]
1    (0.999, 3.5]
2    (0.999, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 3.5] < (3.5, 6.0]]

top_n

Keep the most frequent n values in the series and replace the rest with other_val.

Parameters

  • n: number of values to consider
  • other_val: value that will be used for everything that isn't the most
    frequent n values

Returns

A pandas Series

Examples

>>> s = pd.Series(['a', 'a', 'b', 'b', 'c', 'd'])
>>> s.top_n(2)
0        a
1        a
2        b
3        b
4    Other
5    Other
dtype: object
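The behaviour can be sketched with value_counts (top_n here is a hypothetical re-implementation, not the module's source):

```python
import pandas as pd

def top_n(s, n, other_val="Other"):
    # Sketch (assumed implementation): keep the n most frequent values
    # and replace everything else with other_val.
    keep = s.value_counts().index[:n]
    return s.where(s.isin(keep), other_val)

s = pd.Series(["a", "a", "b", "b", "c", "d"])
print(top_n(s, 2).tolist())  # ['a', 'a', 'b', 'b', 'Other', 'Other']
```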

zscore

Returns the z-score for every value in the series.

Z = (x - mu) / sigma

Example

>>> x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64
>>> x.zscore()
0   -1.460593
1   -1.095445
2   -0.730297
3   -0.365148
4    0.000000
5    0.365148
6    0.730297
7    1.095445
8    1.460593
dtype: float64