Zgulde's pandas extensions

This module adds functionality to pandas Series and DataFrame objects; the pandas classes are modified simply by importing this module.

>>> import zgulde.extend_pandas

The following methods are added to all Series:

cut, get_scaler, ln, log, log2, outliers, qcut, top_n, and zscore

and the following are added to all DataFrames:

chi2, cleanup_column_names, correlation_heatmap, crosstab (alias: xtab), drop_outliers, get_scalers, hdtl, nnull (alias: nna), n_outliers, pluck, select, sql, ttest, ttest_2samp, and unnest

It also overloads the left and right shift operators to behave like pandas.DataFrame.pipe. For example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(dict(x=np.arange(4)))
>>> df
   x
0  0
1  1
2  2
3  3
>>> create_y = lambda df: df.assign(y=df.x + 1)
>>> df >> create_y
   x  y
0  0  1
1  1  2
2  2  3
3  3  4
>>> # This gives the same results as .pipe
>>> ((df >> create_y) == df.pipe(create_y)).all(axis=None)
True
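Under the hood, such an operator can be attached with a few lines of monkey-patching. This is a minimal sketch of the idea, not the module's actual source:

```python
import pandas as pd

# Sketch (assumed implementation): forward the right-shift operator
# to DataFrame.pipe so `df >> fn` behaves like `df.pipe(fn)`.
def _rshift(self, fn):
    return self.pipe(fn)

pd.DataFrame.__rshift__ = _rshift

df = pd.DataFrame({"x": range(4)})
result = df >> (lambda d: d.assign(y=d.x + 1))
print(result.y.tolist())  # [1, 2, 3, 4]
```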

Data Frame Extensions

chi2

Performs a chi-squared contingency table test for every pair of columns in the data frame.

Returns

(pvals, chi2s)

A tuple of two data frames, each of which has all of the columns from the original data frame as both its index and its columns. The values in the first are the p-values; the values in the second are the chi-squared test statistics.

Example

>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> p_vals, chi2s = tips[['smoker', 'time', 'day']].chi2()
>>> p_vals
             smoker        time          day
smoker          NaN    0.477149  1.05676e-05
time       0.477149         NaN   8.4499e-47
day     1.05676e-05  8.4499e-47          NaN
>>> chi2s
          smoker      time      day
smoker       NaN  0.505373  25.7872
time    0.505373       NaN  217.113
day      25.7872   217.113      NaN
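Internally, a pairwise test like this could be built on scipy.stats.chi2_contingency. The following is a sketch (pairwise_chi2 is a hypothetical helper, not the module's actual implementation):

```python
from itertools import combinations

import pandas as pd
from scipy import stats

def pairwise_chi2(df):
    # Sketch (assumed implementation): run scipy's chi2_contingency on
    # the crosstab of every pair of columns, collecting p-values and
    # test statistics into two symmetric frames.
    cols = df.columns
    pvals = pd.DataFrame(index=cols, columns=cols, dtype=float)
    chi2s = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for a, b in combinations(cols, 2):
        chi2, p, _, _ = stats.chi2_contingency(pd.crosstab(df[a], df[b]))
        pvals.loc[a, b] = pvals.loc[b, a] = p
        chi2s.loc[a, b] = chi2s.loc[b, a] = chi2
    return pvals, chi2s

demo = pd.DataFrame({"u": list("xxyyxy"), "v": list("ababab")})
p, c = pairwise_chi2(demo)
```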

cleanup_column_names

Returns a data frame with the column names cleaned up. Special characters are removed and spaces, dots, and dashes are replaced with underscores.

Parameters

  • inplace : Whether or not to modify the data frame in-place and return
    None.

Example

>>> df = pd.DataFrame({'*Feature& A': [1, 2], ' feature.b  ': [2, 3], 'FEATURE-C': [3, 4]})
>>> df
   *Feature& A   feature.b    FEATURE-C
0            1             2          3
1            2             3          4
>>> df.cleanup_column_names()
   feature_a  feature_b  feature_c
0          1          2          3
1          2          3          4
>>> df
   *Feature& A   feature.b    FEATURE-C
0            1             2          3
1            2             3          4
>>> df.cleanup_column_names(inplace=True)
>>> df
   feature_a  feature_b  feature_c
0          1          2          3
1          2          3          4
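The cleanup above could be sketched with a couple of regular expressions (an assumed implementation, not the module's source):

```python
import re

import pandas as pd

def clean_names(df):
    # Sketch (assumed implementation): lowercase, replace spaces, dots,
    # and dashes with underscores, then drop remaining special characters.
    cleaned = [
        re.sub(r"[^a-z0-9_]", "", re.sub(r"[ .\-]+", "_", c.strip().lower()))
        for c in df.columns
    ]
    return df.set_axis(cleaned, axis=1)

df = pd.DataFrame({"*Feature& A": [1], "FEATURE-C": [2]})
print(clean_names(df).columns.tolist())  # ['feature_a', 'feature_c']
```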

correlation_heatmap

Plot a heatmap of the correlation matrix for the data frame.

Any additional kwargs are passed to seaborn.heatmap and the resulting axes object is returned.

>>> x = np.arange(0, 10)
>>> y = x / 2
>>> df = pd.DataFrame(dict(x=x, y=y))
>>> df.correlation_heatmap()
<matplotlib.axes._subplots.AxesSubplot object at ...>

crosstab

Shortcut for pd.crosstab. Also available under the alias xtab.

Parameters

  • rows : name of the column that will make up the rows of the resulting
    contingency table
  • cols : name of the column that will make up the columns of the resulting
    contingency table
  • values : (optional) name of the column to use for the cell values in the
    resulting contingency table. If supplied, aggfunc must be provided as well. See pd.crosstab for more details.
  • kwargs : any additional keyword arguments to pass along to
    pd.crosstab

Example

>>> df = pd.DataFrame(dict(x=list('aaabbb'), y=list('cdcdcd'), z=range(6)))
>>> df
   x  y  z
0  a  c  0
1  a  d  1
2  a  c  2
3  b  d  3
4  b  c  4
5  b  d  5
>>> df.crosstab('x', 'y')
y  c  d
x
a  2  1
b  1  2
>>> (df.crosstab('x', 'y') == pd.crosstab(df.x, df.y)).all(axis=None)
True
>>> df.crosstab('x', 'y', margins=True)
y    c  d  All
x
a    2  1    3
b    1  2    3
All  3  3    6
>>> df.xtab(rows='x', cols='y', values='z', aggfunc='mean')
y  c  d
x
a  1  1
b  4  4

drop_outliers

Drop rows with outliers in the given columns from the dataframe.

See the docs for .outliers for more details on parameters, and to customize how the outliers are detected.

Parameters

  • cols : either a string or a list of strings naming the column(s) in which
    to drop outliers
  • kwargs : additional keyword arguments passed on to
    pandas.Series.outliers

Example

>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 1000], y=[1000, 2, 3, 4, 5, 6]))
>>> df
      x     y
0     1  1000
1     2     2
2     3     3
3     4     4
4     5     5
5  1000     6
>>> df.drop_outliers('x')
   x     y
0  1  1000
1  2     2
2  3     3
3  4     4
4  5     5
>>> df.drop_outliers('y')
      x  y
1     2  2
2     3  3
3     4  4
4     5  5
5  1000  6
>>> df.drop_outliers(['x', 'y'])
   x  y
1  2  2
2  3  3
3  4  4
4  5  5
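The dropping logic can be sketched by combining per-column IQR masks (iqr_mask is a hypothetical helper assumed to match the default how='iqr', k=1.5 behaviour of Series.outliers):

```python
import pandas as pd

def iqr_mask(s, k=1.5):
    # Sketch of IQR-based detection (assumed to match the default
    # how='iqr', k=1.5 behaviour).
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 1000], "y": [1000, 2, 3, 4, 5, 6]})
# Keep only rows that are not an outlier in any of the listed columns.
keep = ~(iqr_mask(df.x) | iqr_mask(df.y))
print(df[keep].x.tolist())  # [2, 3, 4, 5]
```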

get_scalers

Obtain a function that will scale multiple columns on a data frame.

The returned function accepts a data frame and returns the data frame with the specified column(s) scaled.

This can be useful to make sure you apply the same transformation to both training and test data sets.

See the docstring for Series.get_scaler for more details.

Parameters

  • columns : Either a single string, or a list of strings where each string
    is a column name to scale
  • kwargs : any additional arguments are passed to Series.get_scaler for each
    column specified

Example

>>> df = pd.DataFrame(dict(x=[1, 2, 3, 10], y=[-10, 1, 1, 2]))
>>> df
    x   y
0   1 -10
1   2   1
2   3   1
3  10   2
>>> scale_x = df.get_scalers('x', how='minmax')
>>> scale_x(df)
          x   y
0  0.000000 -10
1  0.111111   1
2  0.222222   1
3  1.000000   2
>>> scale_x_and_y = df.get_scalers(['x', 'y'])
>>> scale_x_and_y(df)
          x         y
0 -0.734847 -1.494836
1 -0.489898  0.439658
2 -0.244949  0.439658
3  1.469694  0.615521
>>> df.pipe(scale_x_and_y)
          x         y
0 -0.734847 -1.494836
1 -0.489898  0.439658
2 -0.244949  0.439658
3  1.469694  0.615521
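The train/test use case can be illustrated with a closure over the training column's statistics. Here get_minmax_scaler is a hypothetical stand-in for the how='minmax' mode, not the module's source:

```python
import pandas as pd

def get_minmax_scaler(s):
    # Sketch (assumed implementation): capture the training column's
    # min/max in a closure so the exact same transformation can be
    # replayed on new data.
    lo, hi = s.min(), s.max()
    name = s.name
    return lambda df: df.assign(**{name: (df[name] - lo) / (hi - lo)})

train = pd.DataFrame({"x": [0.0, 5.0, 10.0]})
test = pd.DataFrame({"x": [2.5, 20.0]})

scale = get_minmax_scaler(train.x)
print(scale(train).x.tolist())  # [0.0, 0.5, 1.0]
print(scale(test).x.tolist())   # [0.25, 2.0] -- unseen values can fall outside [0, 1]
```

Because the scaler remembers the training statistics, test values are scaled consistently with the training set rather than by their own min and max.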

hdtl

Return the head and the tail of the data frame.

Parameters

  • n : number of rows to get from both the head and tail

Example

>>> df = pd.DataFrame(dict(x=np.arange(10), y=np.arange(10)))
>>> df
   x  y
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9
>>> df.hdtl(1)
   x  y
0  0  0
9  9  9
>>> df.hdtl()
   x  y
0  0  0
1  1  1
2  2  2
7  7  7
8  8  8
9  9  9
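A sketch of the idea, assuming simple head/tail concatenation:

```python
import numpy as np
import pandas as pd

def hdtl(df, n=3):
    # Sketch (assumed implementation): concatenate the head and the tail.
    return pd.concat([df.head(n), df.tail(n)])

df = pd.DataFrame({"x": np.arange(10)})
print(hdtl(df, 1).x.tolist())  # [0, 9]
```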

nnull

Provide a summary of null values in each column.

alias of nna

Example

>>> df = pd.DataFrame(dict(x=[1, 2, np.nan], y=[4, np.nan, np.nan]))
>>> df
     x    y
0  1.0  4.0
1  2.0  NaN
2  NaN  NaN
>>> nulls_by_column = df.nnull()
>>> nulls_by_column
   n_missing  p_missing
x          1   0.333333
y          2   0.666667
>>> nulls_by_row = df.nnull(axis=1)
>>> nulls_by_row
   n_missing  p_missing
0          0        0.0
1          1        0.5
2          2        1.0
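The summary can be sketched directly from isna (an assumed implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, np.nan], "y": [4, np.nan, np.nan]})
# Sketch (assumed implementation): isna().sum() counts missing values
# per column, isna().mean() gives the proportion missing.
summary = pd.DataFrame({"n_missing": df.isna().sum(),
                        "p_missing": df.isna().mean()})
print(summary.n_missing.tolist())  # [1, 2]
```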

n_outliers

Provide a summary of the number of outliers in each numeric column.

Parameters

  • kwargs : any additional arguments to pass along to
    pandas.Series.outliers

Returns

A pandas.DataFrame indexed by the column names of the data frame, with columns that indicate the number of outliers and the percentage of outliers in each column.

Example

>>> x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> y = [1, 2, 3, 4, 5, 100, 2, 3, 4, 5]
>>> z = [1, 2, 3, 4, 5, -100, 2, 3, 100, 5]
>>> df = pd.DataFrame(dict(x=x, y=y, z=z))
>>> df
   x    y    z
0  1    1    1
1  2    2    2
2  3    3    3
3  4    4    4
4  5    5    5
5  1  100 -100
6  2    2    2
7  3    3    3
8  4    4  100
9  5    5    5
>>> df.n_outliers()
   n_outliers  p_outliers
x           0         0.0
y           1         0.1
z           2         0.2

pluck

Return each of the named columns as a separate pandas Series, convenient for tuple unpacking.

Example

>>> df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
>>> x, y = df.pluck('x', 'y')
>>> x
0    1
1    2
2    3
Name: x, dtype: int64
>>> y
0    a
1    b
2    c
Name: y, dtype: object

ttest

Runs a 1-sample t-test comparing the specified target variable against the overall mean within each of the possible subgroups.

Parameters

  • target : name of the column that holds the target (continuous) variable

Example

>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> tips = tips[['total_bill', 'day', 'time']]
>>> tips.ttest('total_bill')
                 statistic    pvalue    n
variable value
day      Sun      1.603035  0.113130   76
         Sat      0.644856  0.520737   87
         Thur    -2.099957  0.039876   62
         Fri     -1.383042  0.183569   19
time     Dinner   1.467432  0.144054  176
         Lunch   -2.797882  0.006710   68
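The per-group test could be sketched with scipy.stats.ttest_1samp (an assumed implementation of the idea, shown on made-up data):

```python
import pandas as pd
from scipy import stats

# Sketch (assumed implementation): each subgroup of the target is
# compared against the overall mean with a 1-sample t-test.
df = pd.DataFrame({"bill": [10.0, 12.0, 11.0, 30.0, 35.0, 32.0],
                   "time": ["Lunch", "Lunch", "Lunch",
                            "Dinner", "Dinner", "Dinner"]})
overall = df.bill.mean()
rows = {}
for group, sub in df.groupby("time"):
    stat, p = stats.ttest_1samp(sub.bill, overall)
    rows[group] = {"statistic": stat, "pvalue": p, "n": len(sub)}
result = pd.DataFrame(rows).T
```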

ttest_2samp

Runs a 2-sample t-test on the specified target variable for every unique value of every other column in the data frame.

For each unique value of each column, the data is split into the rows that belong to that group and the rows that do not, and the reported t-statistic and p-value describe the comparison between those two samples.

Parameters

  • target : name of the column that holds the target (continuous) variable

Example

>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> tips = tips[['total_bill', 'day', 'time']]
>>> tips.ttest_2samp('total_bill')
                 statistic    pvalue    n
variable value
day      Sun      1.927317  0.055111   76
         Sat      0.855634  0.393046   87
         Thur    -2.170294  0.030958   62
         Fri     -1.345462  0.179735   19
time     Dinner   2.897638  0.004105  176
         Lunch   -2.897638  0.004105   68

unnest

Expands a column containing multiple values per row into separate rows, each containing a single value.

Parameters

  • col : name of the column to unnest
  • split : default True. whether or not to split the data in the column. Set
    this to False if the column already contains lists in each row
  • sep : separator to split on. Ignored if split=False
  • reset_index : default True. whether to reset the index in the resulting
    data frame. If False, the resulting data frame will have an index that could contain duplicates.

Example

>>> df = pd.DataFrame(dict(x=list('abc'), y=['a,b,c', 'd,e', 'f']))
>>> df
   x      y
0  a  a,b,c
1  b    d,e
2  c      f
>>> df.unnest('y')
   x  y
0  a  a
1  a  b
2  a  c
3  b  d
4  b  e
5  c  f
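An unnest like this can be sketched with Series.str.split plus DataFrame.explode (an assumed implementation, not the module's source):

```python
import pandas as pd

def unnest(df, col, sep=","):
    # Sketch (assumed implementation): split the column into lists, then
    # expand the lists into one row per element with DataFrame.explode.
    out = df.assign(**{col: df[col].str.split(sep)}).explode(col)
    return out.reset_index(drop=True)

df = pd.DataFrame({"x": list("abc"), "y": ["a,b,c", "d,e", "f"]})
print(unnest(df, "y").y.tolist())  # ['a', 'b', 'c', 'd', 'e', 'f']
```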

select

Return specified columns from a dataframe, optionally renaming some.

Parameters

  • args: strings that are column names to include
  • kwargs: columns to include and rename, given as old_name='new_name' pairs

Returns

A subset of the dataframe

Examples

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9
>>> df.select('a')
   a
0  1
1  2
2  3
>>> df.select('a', 'b')
   a  b
0  1  4
1  2  5
2  3  6
>>> df.select('a', c='the_letter_c')
   a  the_letter_c
0  1             7
1  2             8
2  3             9
>>> df.select(a='A', b='BBB')
   A  BBB
0  1    4
1  2    5
2  3    6

sql

Run a SQL query against the dataframe.

The dataframe is converted to a sqlite database and the provided query is run against it. As such, any SQL that is valid in sqlite is supported.

Parameters

  • query: The SQL query to run
  • table: (optional) name of the table the dataframe is registered as.
    Defaults to df
  • in_memory: whether or not the sqlite database should be in memory or
    written to an external file. If False, a temporary file will be created and cleaned up after the query is run.
  • index: whether or not to include the dataframe's index in the table that
    is queried against.

Examples

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': ['a', 'a', 'b']})
>>> df
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT * FROM df')
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT * FROM my_df', table='my_df')
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT c, AVG(a + b) FROM df GROUP BY c')
   c  AVG(a + b)
0  a         6.0
1  b         9.0
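A sketch of the approach, assuming an in-memory sqlite database built with DataFrame.to_sql and queried with pandas.read_sql (run_sql is a hypothetical helper, not the module's source):

```python
import sqlite3

import pandas as pd

def run_sql(df, query, table="df"):
    # Sketch (assumed implementation): load the frame into an in-memory
    # sqlite database, then read the query result back as a DataFrame.
    with sqlite3.connect(":memory:") as conn:
        df.to_sql(table, conn, index=False)
        return pd.read_sql(query, conn)

df = pd.DataFrame({"a": [1, 2, 3], "c": ["a", "a", "b"]})
result = run_sql(df, "SELECT c, SUM(a) AS total FROM df GROUP BY c ORDER BY c")
print(result.total.tolist())  # [3, 3]
```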

Series Extensions

cut

Bin series values into discrete intervals.

Shortcut to pd.cut

Parameters

  • args : positional arguments passed to pandas.cut
  • kwargs : keyword arguments passed to pandas.cut

Example

>>> x = pd.Series(range(1, 7))
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
>>> x.cut(2)
0    (0.995, 3.5]
1    (0.995, 3.5]
2    (0.995, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (2, interval[float64]): [(0.995, 3.5] < (3.5, 6.0]]
>>> x.cut(bins=[0, 3, 6])
0    (0, 3]
1    (0, 3]
2    (0, 3]
3    (3, 6]
4    (3, 6]
5    (3, 6]
dtype: category
Categories (2, interval[int64]): [(0, 3] < (3, 6]]

get_scaler

Obtain a function that will scale this column on a data frame.

The returned function accepts a data frame and returns the data frame with the specified column scaled.

This can be useful to make sure you apply the same transformation to both training and test data sets.

Two scaling strategies are supported:

  • zscore : (x - mu) / sigma
  • minmax : (x - min) / (max - min)

Parameters

  • how : One of {'zscore', 'minmax'} to either apply z-score or min-max
    scaling

Example

>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 1000], y=[1000, 2, 3, 4, 5, 6]))
>>> scale_x = df.x.get_scaler()
>>> scale_x(df)
          x     y
0 -0.413160  1000
1 -0.410703     2
2 -0.408246     3
3 -0.405789     4
4 -0.403332     5
5  2.041229     6
>>> scale_y = df.y.get_scaler('minmax')
>>> scale_y(df)
      x         y
0     1  1.000000
1     2  0.000000
2     3  0.001002
3     4  0.002004
4     5  0.003006
5  1000  0.004008
>>> df.pipe(scale_x).pipe(scale_y)
          x         y
0 -0.413160  1.000000
1 -0.410703  0.000000
2 -0.408246  0.001002
3 -0.405789  0.002004
4 -0.403332  0.003006
5  2.041229  0.004008

ln

Returns the natural log of the values in the series using np.log

Example

>>> x = pd.Series([1, np.e, np.e ** 2, np.e ** 3])
>>> x
0     1.000000
1     2.718282
2     7.389056
3    20.085537
dtype: float64
>>> x.ln()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

log

Returns the log base 10 of the values in the series using np.log10

Example

>>> x = pd.Series([1, 10, 100, 1000])
>>> x
0       1
1      10
2     100
3    1000
dtype: int64
>>> x.log()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

log2

Returns the log base 2 of the values in the series using np.log2

Example

>>> x = pd.Series([1, 2, 4, 8, 16])
>>> x
0     1
1     2
2     4
3     8
4    16
dtype: int64
>>> x.log2()
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

outliers

Detect outliers in the series.

Returns

A pandas Series of boolean values indicating whether each point is an outlier or not.

Parameters

  • how : {'iqr', 'std'}, default 'iqr'. With 'iqr', a point is an outlier
    when it is > q3 + k * iqr or < q1 - k * iqr; with 'std', when it is
    further than a specified number of standard deviations from the mean
  • k : value to multiply the iqr by for outlier detection. Ignored when
    how='std'. Default 1.5
  • std_cutoff : cutoff for identifying an outlier based on standard
    deviation. Ignored when how='iqr'. Default 2

Example

>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 6, 100],
...                        y=[-100, 5, 3, 4, 1, 2, 0]))
>>> df
     x    y
0    1 -100
1    2    5
2    3    3
3    4    4
4    5    1
5    6    2
6  100    0
>>> df.x.outliers()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
Name: x, dtype: bool
>>> df[df.x.outliers()]
     x  y
6  100  0
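The how='std' path can be sketched as follows (an assumed implementation):

```python
import pandas as pd

def std_outliers(s, cutoff=2):
    # Sketch of the how='std' path (assumed implementation): flag points
    # more than `cutoff` standard deviations from the mean.
    z = (s - s.mean()) / s.std()
    return z.abs() > cutoff

s = pd.Series([1, 2, 3, 4, 5, 100])
print(std_outliers(s).tolist())  # [False, False, False, False, False, True]
```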

qcut

Bin series values into quantile-based discrete intervals.

Shortcut for pd.qcut

Parameters

  • args : positional arguments passed to pandas.qcut
  • kwargs : keyword arguments passed to pandas.qcut

Example

>>> x = pd.Series(range(1, 7))
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
>>> x.qcut(2)
0    (0.999, 3.5]
1    (0.999, 3.5]
2    (0.999, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 3.5] < (3.5, 6.0]]

top_n

Keep the most frequent n values in the series and replace the rest with other_val.

Parameters

  • n: number of values to consider
  • other_val: value that will be used for everything that isn't the most
    frequent n values

Returns

A pandas Series

Examples

>>> s = pd.Series(['a', 'a', 'b', 'b', 'c', 'd'])
>>> s.top_n(2)
0        a
1        a
2        b
3        b
4    Other
5    Other
dtype: object
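The behaviour can be sketched with value_counts (top_n here is a hypothetical re-implementation, not the module's source):

```python
import pandas as pd

def top_n(s, n, other_val="Other"):
    # Sketch (assumed implementation): keep the n most frequent values
    # and replace everything else with other_val.
    keep = s.value_counts().index[:n]
    return s.where(s.isin(keep), other_val)

s = pd.Series(["a", "a", "b", "b", "c", "d"])
print(top_n(s, 2).tolist())  # ['a', 'a', 'b', 'b', 'Other', 'Other']
```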

zscore

Returns the z-score for every value in the series.

Z = (x - mu) / sigma

Example

>>> x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64
>>> x.zscore()
0   -1.460593
1   -1.095445
2   -0.730297
3   -0.365148
4    0.000000
5    0.365148
6    0.730297
7    1.095445
8    1.460593
dtype: float64