This module adds functionality to pandas Series and DataFrame objects. Importing the module is all that is needed to modify the pandas objects:
>>> import zgulde.extend_pandas
The following methods are added to all Series:
and the following are added to all DataFrames:
It also defines the left and right shift operators to be similar to pandas.DataFrame.pipe. For example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(dict(x=np.arange(4)))
>>> df
   x
0  0
1  1
2  2
3  3
>>> create_y = lambda df: df.assign(y=df.x + 1)
>>> df >> create_y
   x  y
0  0  1
1  1  2
2  2  3
3  3  4
>>> # This gives the same results as .pipe
>>> ((df >> create_y) == df.pipe(create_y)).all(axis=None)
True
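How the operator is wired up is not shown here. A minimal sketch, assuming it is enough to assign a `__rshift__` method on `pd.DataFrame` (the helper name `_rshift_pipe` is hypothetical, and this is not necessarily how the module does it):

```python
import numpy as np
import pandas as pd


def _rshift_pipe(df, func):
    # Hypothetical helper: make `df >> func` behave like `df.pipe(func)`.
    return df.pipe(func)


# Assumption: patching the class makes the operator available on every DataFrame.
pd.DataFrame.__rshift__ = _rshift_pipe

df = pd.DataFrame(dict(x=np.arange(4)))
create_y = lambda df: df.assign(y=df.x + 1)
result = df >> create_y
```

Because the patch is applied to the class, it affects every DataFrame in the running process, which is the same trade-off the import-time side effect described above implies.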
Performs a chi-squared contingency table test for every pair of columns in the data frame.
(pvals, chi2s)
A tuple of two data frames, each of which has all of the columns from the original data frame as both its index and its columns. The values in the first are the p-values; the values in the second are the chi-squared test statistics.
>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> p_vals, chi2s = tips[['smoker', 'time', 'day']].chi2()
>>> p_vals
             smoker        time          day
smoker          NaN    0.477149  1.05676e-05
time       0.477149         NaN   8.4499e-47
day     1.05676e-05  8.4499e-47          NaN
>>> chi2s
          smoker      time      day
smoker       NaN  0.505373  25.7872
time    0.505373       NaN  217.113
day      25.7872   217.113      NaN
Returns a data frame with the column names cleaned up. Special characters are removed and spaces, dots, and dashes are replaced with underscores.
>>> df = pd.DataFrame({'*Feature& A': [1, 2], ' feature.b  ': [2, 3], 'FEATURE-C': [3, 4]})
>>> df
   *Feature& A   feature.b    FEATURE-C
0            1             2          3
1            2             3          4
>>> df.cleanup_column_names()
   feature_a  feature_b  feature_c
0          1          2          3
1          2          3          4
>>> df
   *Feature& A   feature.b    FEATURE-C
0            1             2          3
1            2             3          4
>>> df.cleanup_column_names(inplace=True)
>>> df
   feature_a  feature_b  feature_c
0          1          2          3
1          2          3          4
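The renaming rule described above can be approximated with two regular expressions. This is a sketch under assumptions, not the module's code: the helper name `clean_name` is hypothetical, and lowercasing and trimming are inferred from the example output.

```python
import re

import pandas as pd


def clean_name(name):
    # Hypothetical helper; lowercasing and stripping are inferred from the example.
    name = name.strip().lower()
    name = re.sub(r"[\s.\-]+", "_", name)   # spaces, dots, dashes -> underscore
    return re.sub(r"[^0-9a-z_]", "", name)  # drop any remaining special characters


df = pd.DataFrame({'*Feature& A': [1, 2], ' feature.b  ': [2, 3], 'FEATURE-C': [3, 4]})
cleaned = df.rename(columns=clean_name)  # returns a new frame; df is unchanged
```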
Plot a heatmap of the correlation matrix for the data frame.
Any additional kwargs are passed to seaborn.heatmap and the resulting axes object is returned.
>>> x = np.arange(0, 10)
>>> y = x / 2
>>> df = pd.DataFrame(dict(x=x, y=y))
>>> df.correlation_heatmap()
<matplotlib.axes._subplots.AxesSubplot object at ...>
Shortcut for calling pd.crosstab.
>>> df = pd.DataFrame(dict(x=list('aaabbb'), y=list('cdcdcd'), z=range(6)))
>>> df
   x  y  z
0  a  c  0
1  a  d  1
2  a  c  2
3  b  d  3
4  b  c  4
5  b  d  5
>>> df.crosstab('x', 'y')
y  c  d
x
a  2  1
b  1  2
>>> (df.crosstab('x', 'y') == pd.crosstab(df.x, df.y)).all(axis=None)
True
>>> df.crosstab('x', 'y', margins=True)
y    c  d  All
x
a    2  1    3
b    1  2    3
All  3  3    6
>>> df.xtab(rows='x', cols='y', values='z', aggfunc='mean')
y  c  d
x
a  1  1
b  4  4
Drop rows with outliers in the given columns from the data frame.
See the docs for .outliers for more details on parameters, and to customize how the outliers are detected.
>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 1000], y=[1000, 2, 3, 4, 5, 6]))
>>> df
      x     y
0     1  1000
1     2     2
2     3     3
3     4     4
4     5     5
5  1000     6
>>> df.drop_outliers('x')
   x     y
0  1  1000
1  2     2
2  3     3
3  4     4
4  5     5
>>> df.drop_outliers('y')
      x  y
1     2  2
2     3  3
3     4  4
4     5  5
5  1000  6
>>> df.drop_outliers(['x', 'y'])
   x  y
1  2  2
2  3  3
3  4  4
4  5  5
Obtain a function that will scale multiple columns on a data frame.
The returned function accepts a data frame and returns the data frame with the specified column(s) scaled.
This can be useful to make sure you apply the same transformation to both training and test data sets.
See the docstring for Series.get_scaler for more details.
>>> df = pd.DataFrame(dict(x=[1, 2, 3, 10], y=[-10, 1, 1, 2]))
>>> df
    x   y
0   1 -10
1   2   1
2   3   1
3  10   2
>>> scale_x = df.get_scalers('x', how='minmax')
>>> scale_x(df)
          x   y
0  0.000000 -10
1  0.111111   1
2  0.222222   1
3  1.000000   2
>>> scale_x_and_y = df.get_scalers(['x', 'y'])
>>> scale_x_and_y(df)
          x         y
0 -0.734847 -1.494836
1 -0.489898  0.439658
2 -0.244949  0.439658
3  1.469694  0.615521
>>> df.pipe(scale_x_and_y)
          x         y
0 -0.734847 -1.494836
1 -0.489898  0.439658
2 -0.244949  0.439658
3  1.469694  0.615521
Return the head and the tail of the data frame.
>>> df = pd.DataFrame(dict(x=np.arange(10), y=np.arange(10)))
>>> df
   x  y
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9
>>> df.hdtl(1)
   x  y
0  0  0
9  9  9
>>> df.hdtl()
   x  y
0  0  0
1  1  1
2  2  2
7  7  7
8  8  8
9  9  9
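The same result can probably be obtained by concatenating `head` and `tail`. A sketch, where the function name `head_and_tail` is hypothetical and the default of 3 rows is inferred from the example output rather than documented:

```python
import numpy as np
import pandas as pd


def head_and_tail(df, n=3):
    # Assumption: the default n=3 is read off the example, not from the source.
    return pd.concat([df.head(n), df.tail(n)])


df = pd.DataFrame(dict(x=np.arange(10), y=np.arange(10)))
both_ends = head_and_tail(df)
first_and_last = head_and_tail(df, 1)
```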
Provide a summary of null values in each column.
alias of nna
>>> df = pd.DataFrame(dict(x=[1, 2, np.nan], y=[4, np.nan, np.nan]))
>>> df
     x    y
0  1.0  4.0
1  2.0  NaN
2  NaN  NaN
>>> nulls_by_column = df.nnull()
>>> nulls_by_column
   n_missing  p_missing
x          1   0.333333
y          2   0.666667
>>> nulls_by_row = df.nnull(axis=1)
>>> nulls_by_row
   n_missing  p_missing
0          0        0.0
1          1        0.5
2          2        1.0
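For reference, the two summary columns can be reproduced with plain pandas. This is a sketch assuming `n_missing` is a count and `p_missing` a proportion, as the example output suggests; `null_summary` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def null_summary(df, axis=0):
    # Boolean mask of missing cells; summing counts them, averaging gives the share.
    is_null = df.isna()
    return pd.DataFrame({
        'n_missing': is_null.sum(axis=axis),
        'p_missing': is_null.mean(axis=axis),
    })


df = pd.DataFrame(dict(x=[1, 2, np.nan], y=[4, np.nan, np.nan]))
by_column = null_summary(df)
by_row = null_summary(df, axis=1)
```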
Provide a summary of the number of outliers in each numeric column.
A pandas.DataFrame indexed by the column names of the data frame, with columns that indicate the number of outliers and the percentage of outliers in each column.
>>> x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> y = [1, 2, 3, 4, 5, 100, 2, 3, 4, 5]
>>> z = [1, 2, 3, 4, 5, -100, 2, 3, 100, 5]
>>> df = pd.DataFrame(dict(x=x, y=y, z=z))
>>> df
   x    y    z
0  1    1    1
1  2    2    2
2  3    3    3
3  4    4    4
4  5    5    5
5  1  100 -100
6  2    2    2
7  3    3    3
8  4    4  100
9  5    5    5
>>> df.n_outliers()
   n_outliers  p_outliers
x           0         0.0
y           1         0.1
z           2         0.2
>>> df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
>>> x, y = df.pluck('x', 'y')
>>> x
0    1
1    2
2    3
Name: x, dtype: int64
>>> y
0    a
1    b
2    c
Name: y, dtype: object
Runs a 1 sample t-test comparing the specified target variable to the overall mean among all of the possible subgroups.
>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> tips = tips[['total_bill', 'day', 'time']]
>>> tips.ttest('total_bill')
                 statistic    pvalue    n
variable value
day      Sun      1.603035  0.113130   76
         Sat      0.644856  0.520737   87
         Thur    -2.099957  0.039876   62
         Fri     -1.383042  0.183569   19
time     Dinner   1.467432  0.144054  176
         Lunch   -2.797882  0.006710   68
Runs a 2 sample t-test comparing the specified target variable for every unique value from every other column in the data frame.
For each column, the data is split on each unique value: rows that have that value versus rows that do not. The reported t-statistic and p-value compare the target variable between those two groups.
>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> tips = tips[['total_bill', 'day', 'time']]
>>> tips.ttest_2samp('total_bill')
                 statistic    pvalue    n
variable value
day      Sun      1.927317  0.055111   76
         Sat      0.855634  0.393046   87
         Thur    -2.170294  0.030958   62
         Fri     -1.345462  0.179735   19
time     Dinner   2.897638  0.004105  176
         Lunch   -2.897638  0.004105   68
Splits a column that holds multiple values per row into separate rows, each with a single value.
>>> df = pd.DataFrame(dict(x=list('abc'), y=['a,b,c', 'd,e', 'f']))
>>> df
   x      y
0  a  a,b,c
1  b    d,e
2  c      f
>>> df.unnest('y')
   x  y
0  a  a
1  a  b
2  a  c
3  b  d
4  b  e
5  c  f
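A comparable result can be had with `str.split` followed by `DataFrame.explode`. This is a sketch, assuming a comma separator as in the example; the function name `unnest_column` and the `sep` parameter are hypothetical.

```python
import pandas as pd


def unnest_column(df, col, sep=','):
    # Split each cell into a list, then give every list element its own row.
    split = df.assign(**{col: df[col].str.split(sep)})
    return split.explode(col).reset_index(drop=True)


df = pd.DataFrame(dict(x=list('abc'), y=['a,b,c', 'd,e', 'f']))
long_df = unnest_column(df, 'y')
```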
Return specified columns from a dataframe, optionally renaming some.
A subset of the dataframe
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9
>>> df.select('a')
   a
0  1
1  2
2  3
>>> df.select('a', 'b')
   a  b
0  1  4
1  2  5
2  3  6
>>> df.select('a', c='the_letter_c')
   a  the_letter_c
0  1             7
1  2             8
2  3             9
>>> df.select(a='A', b='BBB')
   A  BBB
0  1    4
1  2    5
2  3    6
Run a SQL query against the dataframe.
The dataframe is converted to a sqlite database and the provided query is run against it. As such, any SQL that is valid in sqlite is supported.
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': ['a', 'a', 'b']})
>>> df
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT * FROM df')
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT * FROM my_df', table='my_df')
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT c, AVG(a + b) FROM df GROUP BY c')
   c  AVG(a + b)
0  a         6.0
1  b         9.0
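The conversion to sqlite described above might look like the following sketch. It assumes an in-memory database and a default table name of `df`; the function name `run_sql` is hypothetical, and the module's actual implementation may differ.

```python
import sqlite3

import pandas as pd


def run_sql(df, query, table='df'):
    # Assumption: a throwaway in-memory database is created for each call.
    conn = sqlite3.connect(':memory:')
    try:
        df.to_sql(table, conn, index=False)   # copy the frame into sqlite
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()


df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': ['a', 'a', 'b']})
averages = run_sql(df, 'SELECT c, AVG(a + b) FROM df GROUP BY c ORDER BY c')
renamed = run_sql(df, 'SELECT * FROM my_df', table='my_df')
```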
Bin series values into discrete intervals.
Shortcut to pd.cut
>>> x = pd.Series(range(1, 7))
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
>>> x.cut(2)
0    (0.995, 3.5]
1    (0.995, 3.5]
2    (0.995, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (2, interval[float64]): [(0.995, 3.5] < (3.5, 6.0]]
>>> x.cut(bins=[0, 3, 6])
0    (0, 3]
1    (0, 3]
2    (0, 3]
3    (3, 6]
4    (3, 6]
5    (3, 6]
dtype: category
Categories (2, interval[int64]): [(0, 3] < (3, 6]]
Obtain a function that will scale the series on a data frame.
The returned function accepts a data frame and returns the data frame with the specified column scaled.
This can be useful to make sure you apply the same transformation to both training and test data sets.
>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 1000], y=[1000, 2, 3, 4, 5, 6]))
>>> scale_x = df.x.get_scaler()
>>> scale_x(df)
          x     y
0 -0.413160  1000
1 -0.410703     2
2 -0.408246     3
3 -0.405789     4
4 -0.403332     5
5  2.041229     6
>>> scale_y = df.y.get_scaler('minmax')
>>> scale_y(df)
      x         y
0     1  1.000000
1     2  0.000000
2     3  0.001002
3     4  0.002004
4     5  0.003006
5  1000  0.004008
>>> df.pipe(scale_x).pipe(scale_y)
          x         y
0 -0.413160  1.000000
1 -0.410703  0.000000
2 -0.408246  0.001002
3 -0.405789  0.002004
4 -0.403332  0.003006
5  2.041229  0.004008
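The point about reusing one transformation on training and test data comes down to a closure that remembers statistics from the data it was built on. A sketch under assumptions: `make_scaler` is a hypothetical name, and the z-score branch uses the sample standard deviation (pandas' default `ddof=1`), which is consistent with the numbers printed above.

```python
import pandas as pd


def make_scaler(train, col, how='zscore'):
    # Statistics are computed once, from the training data only.
    if how == 'zscore':
        center, scale = train[col].mean(), train[col].std()
    elif how == 'minmax':
        center = train[col].min()
        scale = train[col].max() - center
    else:
        raise ValueError(f'unknown scaling method: {how}')

    def scale_frame(df):
        # Any frame passed in later is scaled with the remembered statistics.
        return df.assign(**{col: (df[col] - center) / scale})

    return scale_frame


train = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 1000], y=[1000, 2, 3, 4, 5, 6]))
test = pd.DataFrame(dict(x=[10], y=[502]))
scale_x = make_scaler(train, 'x')
scale_y = make_scaler(train, 'y', how='minmax')
scaled_test = test.pipe(scale_x).pipe(scale_y)
```

Here `scaled_test` is scaled with the minimum, maximum, mean, and standard deviation of `train`, not of `test`, which is what keeps the two data sets on the same scale.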
Returns the natural log of the values in the series using np.log
>>> x = pd.Series([1, np.e, np.e ** 2, np.e ** 3])
>>> x
0     1.000000
1     2.718282
2     7.389056
3    20.085537
dtype: float64
>>> x.ln()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
Returns the log base 10 of the values in the series using np.log10
>>> x = pd.Series([1, 10, 100, 1000])
>>> x
0       1
1      10
2     100
3    1000
dtype: int64
>>> x.log()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
Returns the log base 2 of the values in the series using np.log2
>>> x = pd.Series([1, 2, 4, 8, 16])
>>> x
0     1
1     2
2     4
3     8
4    16
dtype: int64
>>> x.log2()
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
Detect outliers in the series.
A pandas Series of boolean values indicating whether each point is an outlier or not.
>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 6, 100],
...                        y=[-100, 5, 3, 4, 1, 2, 0]))
>>> df
     x    y
0    1 -100
1    2    5
2    3    3
3    4    4
4    5    1
5    6    2
6  100    0
>>> df.x.outliers()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
Name: x, dtype: bool
>>> df[df.x.outliers()]
     x  y
6  100  0
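The detection rule itself is not stated here. One common choice is the interquartile-range (IQR) fence, sketched below as an assumption rather than as the module's method; the name `iqr_outliers` and the multiplier `k` are hypothetical. On the example data it flags the same row.

```python
import pandas as pd


def iqr_outliers(s, k=1.5):
    # Assumed rule: a point is an outlier if it lies more than k * IQR
    # below the first quartile or above the third quartile.
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 6, 100],
                       y=[-100, 5, 3, 4, 1, 2, 0]))
x_flags = iqr_outliers(df.x)
y_flags = iqr_outliers(df.y)
```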
Bin series values into quantile-based intervals.
Shortcut to pd.qcut
>>> x = pd.Series(range(1, 7))
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
>>> x.qcut(2)
0    (0.999, 3.5]
1    (0.999, 3.5]
2    (0.999, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 3.5] < (3.5, 6.0]]
Convert the series to the most frequent n values and use other_val for the rest.
A pandas Series
>>> s = pd.Series(['a', 'a', 'b', 'b', 'c', 'd'])
>>> s.top_n(2)
0        a
1        a
2        b
3        b
4    Other
5    Other
dtype: object
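One way to get this behavior with plain pandas is `value_counts` plus `where`. A sketch: the function name `keep_top_n` is hypothetical, and the default replacement `'Other'` is inferred from the example output.

```python
import pandas as pd


def keep_top_n(s, n, other_val='Other'):
    # The n most frequent values are kept; everything else is replaced.
    top = s.value_counts().head(n).index
    return s.where(s.isin(top), other_val)


s = pd.Series(['a', 'a', 'b', 'b', 'c', 'd'])
collapsed = keep_top_n(s, 2)
```

How ties at the cutoff are broken depends on the order `value_counts` returns, so this sketch makes no promise about which of several equally frequent values is kept.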
Returns the z-score for every value in the series.
Z = (x - mu) / sigma
>>> x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64
>>> x.zscore()
0   -1.460593
1   -1.095445
2   -0.730297
3   -0.365148
4    0.000000
5    0.365148
6    0.730297
7    1.095445
8    1.460593
dtype: float64
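The formula leaves open which sigma is meant. The printed values are consistent with the sample standard deviation (pandas' default `ddof=1`); the sketch below assumes that, and the function name `zscore` here is a standalone illustration, not the module's method.

```python
import pandas as pd


def zscore(s):
    # Series.std() defaults to ddof=1, i.e. the sample standard deviation.
    return (s - s.mean()) / s.std()


x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
z = zscore(x)
z_population = (x - x.mean()) / x.std(ddof=0)  # population sigma, for comparison
```

With the population standard deviation the first value would be about -1.549 instead of -1.460593, so the choice matters when comparing against other tools.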