This module adds functionality to pandas Series and DataFrame objects. Simply importing this module modifies the pandas classes in place.
>>> import zgulde.extend_pandas
The following methods are added to all Series:
and the following are added to all DataFrames:
It also defines the left and right shift operators to be similar to pandas.DataFrame.pipe. For example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(dict(x=np.arange(4)))
>>> df
   x
0  0
1  1
2  2
3  3
>>> create_y = lambda df: df.assign(y=df.x + 1)
>>> df >> create_y
   x  y
0  0  1
1  1  2
2  2  3
3  3  4
>>> # This gives the same results as .pipe
>>> ((df >> create_y) == df.pipe(create_y)).all(axis=None)
True
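One way such operators can be attached is by assigning dunder methods directly on the pandas classes, since they are ordinary mutable Python classes. This is a minimal sketch of the idea, not necessarily the module's actual implementation:

```python
import pandas as pd

# Delegate `df >> f` to `df.pipe(f)`. Plain functions have no __lshift__,
# so `f << df` falls back to DataFrame.__rlshift__, giving the mirrored form.
pd.DataFrame.__rshift__ = lambda self, f: self.pipe(f)
pd.DataFrame.__rlshift__ = lambda self, f: self.pipe(f)

df = pd.DataFrame(dict(x=range(4)))
create_y = lambda df: df.assign(y=df.x + 1)
result = df >> create_y  # same as df.pipe(create_y)
```

Because `pipe` does all the work, anything that works with `.pipe` (including extra positional arguments via `functools.partial`) works with the operator form as well.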
Performs a chi-squared contingency table test for every pairwise combination of columns in the data frame.
(pvals, chi2s)
A tuple of two data frames, each of which has all of the columns from the original data frame as both its index and its columns. The values in the first are the p-values; the values in the second are the chi-square test statistics.
>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> p_vals, chi2s = tips[['smoker', 'time', 'day']].chi2()
>>> p_vals
             smoker        time          day
smoker          NaN    0.477149  1.05676e-05
time       0.477149         NaN   8.4499e-47
day     1.05676e-05  8.4499e-47          NaN
>>> chi2s
          smoker      time      day
smoker       NaN  0.505373  25.7872
time    0.505373       NaN  217.113
day      25.7872   217.113      NaN
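The pairwise test above can be sketched with scipy.stats.chi2_contingency. The helper name pairwise_chi2 is illustrative, not the module's own:

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def pairwise_chi2(df):
    """Chi-squared test for every pair of columns; diagonal stays NaN."""
    cols = df.columns
    pvals = pd.DataFrame(index=cols, columns=cols, dtype=float)
    chi2s = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for a, b in combinations(cols, 2):
        # chi2_contingency returns (chi2, p, dof, expected_frequencies)
        chi2, p, _, _ = chi2_contingency(pd.crosstab(df[a], df[b]))
        pvals.loc[a, b] = pvals.loc[b, a] = p
        chi2s.loc[a, b] = chi2s.loc[b, a] = chi2
    return pvals, chi2s

demo = pd.DataFrame({'a': list('xxyy') * 5, 'b': list('uvuv') * 5})
pvals, chi2s = pairwise_chi2(demo)
```

Both result frames are symmetric, since the chi-squared statistic does not depend on which column is treated as the rows of the contingency table.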
Returns a data frame with the column names cleaned up. Special characters are removed and spaces, dots, and dashes are replaced with underscores.
>>> df = pd.DataFrame({'*Feature& A': [1, 2], ' feature.b ': [2, 3], 'FEATURE-C': [3, 4]})
>>> df
   *Feature& A   feature.b   FEATURE-C
0            1            2          3
1            2            3          4
>>> df.cleanup_column_names()
   feature_a  feature_b  feature_c
0          1          2          3
1          2          3          4
>>> df
   *Feature& A   feature.b   FEATURE-C
0            1            2          3
1            2            3          4
>>> df.cleanup_column_names(inplace=True)
>>> df
   feature_a  feature_b  feature_c
0          1          2          3
1          2          3          4
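The cleanup can be sketched with a small regex-based renamer (the helper name clean_name is an assumption for illustration):

```python
import re

import pandas as pd

def clean_name(name):
    """Lowercase, turn spaces/dots/dashes into underscores, drop the rest."""
    name = name.strip().lower()
    name = re.sub(r'[\s.\-]+', '_', name)   # spaces, dots, dashes -> _
    return re.sub(r'[^a-z0-9_]', '', name)  # remove remaining special chars

df = pd.DataFrame({'*Feature& A': [1, 2], ' feature.b ': [2, 3], 'FEATURE-C': [3, 4]})
cleaned = df.rename(columns=clean_name)  # original df is left untouched
```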
Plot a heatmap of the correlation matrix for the data frame.
Any additional kwargs are passed to seaborn.heatmap and the resulting axes object is returned.
>>> x = np.arange(0, 10)
>>> y = x / 2
>>> df = pd.DataFrame(dict(x=x, y=y))
>>> df.correlation_heatmap()
<matplotlib.axes._subplots.AxesSubplot object at ...>
Shortcut to pd.crosstab.
>>> df = pd.DataFrame(dict(x=list('aaabbb'), y=list('cdcdcd'), z=range(6)))
>>> df
   x  y  z
0  a  c  0
1  a  d  1
2  a  c  2
3  b  d  3
4  b  c  4
5  b  d  5
>>> df.crosstab('x', 'y')
y  c  d
x
a  2  1
b  1  2
>>> (df.crosstab('x', 'y') == pd.crosstab(df.x, df.y)).all(axis=None)
True
>>> df.crosstab('x', 'y', margins=True)
y    c  d  All
x
a    2  1    3
b    1  2    3
All  3  3    6
>>> df.xtab(rows='x', cols='y', values='z', aggfunc='mean')
y  c  d
x
a  1  1
b  4  4
Drop rows with outliers in the given columns from the data frame.
See the docs for .outliers for more details on parameters, and to customize how the outliers are detected.
>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 1000], y=[1000, 2, 3, 4, 5, 6]))
>>> df
      x     y
0     1  1000
1     2     2
2     3     3
3     4     4
4     5     5
5  1000     6
>>> df.drop_outliers('x')
   x     y
0  1  1000
1  2     2
2  3     3
3  4     4
4  5     5
>>> df.drop_outliers('y')
      x  y
1     2  2
2     3  3
3     4  4
4     5  5
5  1000  6
>>> df.drop_outliers(['x', 'y'])
   x  y
1  2  2
2  3  3
3  4  4
4  5  5
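A sketch of the underlying idea, assuming the common 1.5 * IQR rule for detecting outliers (see the .outliers docs for the detection the module actually uses):

```python
import pandas as pd

def iqr_outliers(s, k=1.5):
    """Boolean mask: True where a value falls outside k * IQR of the quartiles."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 1000], y=[1000, 2, 3, 4, 5, 6]))
no_x_outliers = df[~iqr_outliers(df.x)]  # drops the row where x == 1000
```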
Obtain a function that will scale multiple columns on a data frame.
The returned function accepts a data frame and returns the data frame with the specified column(s) scaled.
This can be useful to make sure you apply the same transformation to both training and test data sets.
See the docstring for Series.get_scaler for more details.
>>> df = pd.DataFrame(dict(x=[1, 2, 3, 10], y=[-10, 1, 1, 2]))
>>> df
    x   y
0   1 -10
1   2   1
2   3   1
3  10   2
>>> scale_x = df.get_scalers('x', how='minmax')
>>> scale_x(df)
          x   y
0  0.000000 -10
1  0.111111   1
2  0.222222   1
3  1.000000   2
>>> scale_x_and_y = df.get_scalers(['x', 'y'])
>>> scale_x_and_y(df)
          x         y
0 -0.734847 -1.494836
1 -0.489898  0.439658
2 -0.244949  0.439658
3  1.469694  0.615521
>>> df.pipe(scale_x_and_y)
          x         y
0 -0.734847 -1.494836
1 -0.489898  0.439658
2 -0.244949  0.439658
3  1.469694  0.615521
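The key design point is that the returned function is a closure over statistics computed at creation time, so applying it to a test set reuses the training set's statistics. A minimal sketch (make_scaler is an illustrative name, not the module's):

```python
import pandas as pd

def make_scaler(df, columns, how='zscore'):
    """Capture per-column stats from `df` now; apply them to any frame later."""
    columns = [columns] if isinstance(columns, str) else list(columns)
    stats = {c: (df[c].mean(), df[c].std(), df[c].min(), df[c].max())
             for c in columns}

    def scale(data):
        data = data.copy()
        for c, (mean, std, lo, hi) in stats.items():
            if how == 'minmax':
                data[c] = (data[c] - lo) / (hi - lo)
            else:
                data[c] = (data[c] - mean) / std
        return data

    return scale

train = pd.DataFrame(dict(x=[1, 2, 3, 10]))
scale_x = make_scaler(train, 'x', how='minmax')
scaled = scale_x(train)
```

A frame with x values outside [1, 10] would scale to values outside [0, 1], which is exactly the desired behavior for unseen test data.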
Return the head and the tail of the data frame.
>>> df = pd.DataFrame(dict(x=np.arange(10), y=np.arange(10)))
>>> df
   x  y
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9
>>> df.hdtl(1)
   x  y
0  0  0
9  9  9
>>> df.hdtl()
   x  y
0  0  0
1  1  1
2  2  2
7  7  7
8  8  8
9  9  9
Provide a summary of null values in each column.
alias of nna
>>> df = pd.DataFrame(dict(x=[1, 2, np.nan], y=[4, np.nan, np.nan]))
>>> df
     x    y
0  1.0  4.0
1  2.0  NaN
2  NaN  NaN
>>> nulls_by_column = df.nnull()
>>> nulls_by_column
   n_missing  p_missing
x          1   0.333333
y          2   0.666667
>>> nulls_by_row = df.nnull(axis=1)
>>> nulls_by_row
   n_missing  p_missing
0          0        0.0
1          1        0.5
2          2        1.0
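The same summary can be built from plain pandas primitives; this sketch (null_summary is an illustrative name) shows the likely shape of the computation:

```python
import numpy as np
import pandas as pd

def null_summary(df, axis=0):
    """Missing-value count and proportion per column (axis=0) or per row (axis=1)."""
    n = df.isna().sum(axis=1) if axis == 1 else df.isna().sum()
    p = n / (df.shape[1] if axis == 1 else df.shape[0])
    return pd.DataFrame({'n_missing': n, 'p_missing': p})

df = pd.DataFrame(dict(x=[1, 2, np.nan], y=[4, np.nan, np.nan]))
by_column = null_summary(df)
by_row = null_summary(df, axis=1)
```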
Provide a summary of the number of outliers in each numeric column.
A pandas.DataFrame indexed by the column names of the data frame, with columns that indicate the number and the percentage of outliers in each column.
>>> x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> y = [1, 2, 3, 4, 5, 100, 2, 3, 4, 5]
>>> z = [1, 2, 3, 4, 5, -100, 2, 3, 100, 5]
>>> df = pd.DataFrame(dict(x=x, y=y, z=z))
>>> df
   x    y    z
0  1    1    1
1  2    2    2
2  3    3    3
3  4    4    4
4  5    5    5
5  1  100 -100
6  2    2    2
7  3    3    3
8  4    4  100
9  5    5    5
>>> df.n_outliers()
   n_outliers  p_outliers
x           0         0.0
y           1         0.1
z           2         0.2
Return the given columns from the data frame as individual Series, so they can be unpacked in a single assignment.

>>> df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
>>> x, y = df.pluck('x', 'y')
>>> x
0    1
1    2
2    3
Name: x, dtype: int64
>>> y
0    a
1    b
2    c
Name: y, dtype: object
Runs a one-sample t-test comparing, within every possible subgroup, the specified target variable against the overall mean.
>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> tips = tips[['total_bill', 'day', 'time']]
>>> tips.ttest('total_bill')
                 statistic    pvalue    n
variable value
day      Sun      1.603035  0.113130   76
         Sat      0.644856  0.520737   87
         Thur    -2.099957  0.039876   62
         Fri     -1.383042  0.183569   19
time     Dinner   1.467432  0.144054  176
         Lunch   -2.797882  0.006710   68
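A sketch of the computation using scipy.stats.ttest_1samp: for every other column, each group's values are tested against the overall mean of the target. The helper name ttest_by_group is illustrative, not the module's:

```python
import pandas as pd
from scipy import stats

def ttest_by_group(df, target):
    """One-sample t-test of each subgroup's values against the overall mean."""
    overall_mean = df[target].mean()
    rows = []
    for col in df.columns.drop(target):
        for value, subset in df.groupby(col)[target]:
            t, p = stats.ttest_1samp(subset, overall_mean)
            rows.append((col, value, t, p, len(subset)))
    cols = ['variable', 'value', 'statistic', 'pvalue', 'n']
    return pd.DataFrame(rows, columns=cols).set_index(['variable', 'value'])

df = pd.DataFrame({'total': [1.0, 2.0, 3.0, 4.0], 'group': ['a', 'a', 'b', 'b']})
result = ttest_by_group(df, 'total')
```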
Runs a two-sample t-test on the specified target variable for every unique value of every other column in the data frame.
For each unique value of each column, the observations are split into those that belong to that group and those that do not, and the t-statistic and p-value are computed from those two samples.
>>> from seaborn import load_dataset
>>> tips = load_dataset('tips')
>>> tips = tips[['total_bill', 'day', 'time']]
>>> tips.ttest_2samp('total_bill')
                 statistic    pvalue    n
variable value
day      Sun      1.927317  0.055111   76
         Sat      0.855634  0.393046   87
         Thur    -2.170294  0.030958   62
         Fri     -1.345462  0.179735   19
time     Dinner   2.897638  0.004105  176
         Lunch   -2.897638  0.004105   68
Turns a column whose rows each contain multiple values into separate rows, each holding a single value.
>>> df = pd.DataFrame(dict(x=list('abc'), y=['a,b,c', 'd,e', 'f']))
>>> df
   x      y
0  a  a,b,c
1  b    d,e
2  c      f
>>> df.unnest('y')
   x  y
0  a  a
1  a  b
2  a  c
3  b  d
4  b  e
5  c  f
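In modern pandas this is a short combination of str.split and explode; a sketch assuming comma-separated values (the helper name unnest here is only illustrative):

```python
import pandas as pd

def unnest(df, col, sep=','):
    """Split `col` on `sep` and give each piece its own row."""
    return (df.assign(**{col: df[col].str.split(sep)})
              .explode(col)
              .reset_index(drop=True))

df = pd.DataFrame(dict(x=list('abc'), y=['a,b,c', 'd,e', 'f']))
flat = unnest(df, 'y')
```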
Return specified columns from a dataframe, optionally renaming some.
A subset of the dataframe
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9
>>> df.select('a')
   a
0  1
1  2
2  3
>>> df.select('a', 'b')
   a  b
0  1  4
1  2  5
2  3  6
>>> df.select('a', c='the_letter_c')
   a  the_letter_c
0  1             7
1  2             8
2  3             9
>>> df.select(a='A', b='BBB')
   A  BBB
0  1    4
1  2    5
2  3    6
Run a SQL query against the dataframe.
The dataframe is converted to a sqlite database and the provided query is run against it. As such, any SQL that is valid in sqlite is supported.
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': ['a', 'a', 'b']})
>>> df
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT * FROM df')
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT * FROM my_df', table='my_df')
   a  b  c
0  1  4  a
1  2  5  a
2  3  6  b
>>> df.sql('SELECT c, AVG(a + b) FROM df GROUP BY c')
   c  AVG(a + b)
0  a         6.0
1  b         9.0
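A sketch of the round trip through sqlite using only the standard library and pandas' own to_sql/read_sql (run_sql is an illustrative name, not the module's):

```python
import sqlite3

import pandas as pd

def run_sql(df, query, table='df'):
    """Load the frame into an in-memory sqlite database and run the query."""
    with sqlite3.connect(':memory:') as conn:
        df.to_sql(table, conn, index=False)
        return pd.read_sql(query, conn)

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': ['a', 'a', 'b']})
result = run_sql(df, 'SELECT c, AVG(a + b) AS avg_ab FROM df GROUP BY c')
```

Because the data actually lives in sqlite for the duration of the query, joins, window functions, and anything else sqlite supports are available.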
Bin series values into discrete intervals.
Shortcut to pd.cut
>>> x = pd.Series(range(1, 7))
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
>>> x.cut(2)
0    (0.995, 3.5]
1    (0.995, 3.5]
2    (0.995, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (2, interval[float64]): [(0.995, 3.5] < (3.5, 6.0]]
>>> x.cut(bins=[0, 3, 6])
0    (0, 3]
1    (0, 3]
2    (0, 3]
3    (3, 6]
4    (3, 6]
5    (3, 6]
dtype: category
Categories (2, interval[int64]): [(0, 3] < (3, 6]]
Obtain a function that will scale the series on a data frame.
The returned function accepts a data frame and returns the data frame with the specified column scaled.
This can be useful to make sure you apply the same transformation to both training and test data sets.
>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 1000], y=[1000, 2, 3, 4, 5, 6]))
>>> scale_x = df.x.get_scaler()
>>> scale_x(df)
          x     y
0 -0.413160  1000
1 -0.410703     2
2 -0.408246     3
3 -0.405789     4
4 -0.403332     5
5  2.041229     6
>>> scale_y = df.y.get_scaler('minmax')
>>> scale_y(df)
      x         y
0     1  1.000000
1     2  0.000000
2     3  0.001002
3     4  0.002004
4     5  0.003006
5  1000  0.004008
>>> df.pipe(scale_x).pipe(scale_y)
          x         y
0 -0.413160  1.000000
1 -0.410703  0.000000
2 -0.408246  0.001002
3 -0.405789  0.002004
4 -0.403332  0.003006
5  2.041229  0.004008
Returns the natural log of the values in the series using np.log
>>> x = pd.Series([1, np.e, np.e ** 2, np.e ** 3])
>>> x
0     1.000000
1     2.718282
2     7.389056
3    20.085537
dtype: float64
>>> x.ln()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
Returns the log base 10 of the values in the series using np.log10
>>> x = pd.Series([1, 10, 100, 1000])
>>> x
0       1
1      10
2     100
3    1000
dtype: int64
>>> x.log()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
Returns the log base 2 of the values in the series using np.log2
>>> x = pd.Series([1, 2, 4, 8, 16])
>>> x
0     1
1     2
2     4
3     8
4    16
dtype: int64
>>> x.log2()
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
Detect outliers in the series.
A pandas Series of boolean values indicating whether each point is an outlier or not.
>>> df = pd.DataFrame(dict(x=[1, 2, 3, 4, 5, 6, 100],
...                        y=[-100, 5, 3, 4, 1, 2, 0]))
>>> df
     x    y
0    1 -100
1    2    5
2    3    3
3    4    4
4    5    1
5    6    2
6  100    0
>>> df.x.outliers()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
Name: x, dtype: bool
>>> df[df.x.outliers()]
     x  y
6  100  0
Bin series values into equal-sized buckets based on sample quantiles.
Shortcut to pd.qcut
>>> x = pd.Series(range(1, 7))
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
>>> x.qcut(2)
0    (0.999, 3.5]
1    (0.999, 3.5]
2    (0.999, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 3.5] < (3.5, 6.0]]
Keep the n most frequent values in the series and replace the rest with other_val.
A pandas Series
>>> s = pd.Series(['a', 'a', 'b', 'b', 'c', 'd'])
>>> s.top_n(2)
0        a
1        a
2        b
3        b
4    Other
5    Other
dtype: object
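A sketch of the likely implementation using value_counts and Series.where (the helper name top_n shadows the method it illustrates; it is not the module's code):

```python
import pandas as pd

def top_n(s, n, other_val='Other'):
    """Keep the n most frequent values; replace everything else with other_val."""
    top = s.value_counts().head(n).index
    return s.where(s.isin(top), other_val)

s = pd.Series(['a', 'a', 'b', 'b', 'c', 'd'])
result = top_n(s, 2)
```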
Returns the z-score for every value in the series.
Z = (x - mu) / sigma
>>> x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> x
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64
>>> x.zscore()
0   -1.460593
1   -1.095445
2   -0.730297
3   -0.365148
4    0.000000
5    0.365148
6    0.730297
7    1.095445
8    1.460593
dtype: float64
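The formula maps directly onto pandas operations. Note that the output above matches the sample standard deviation (ddof=1), which is pandas' default for Series.std:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
# Z = (x - mu) / sigma, with sigma the sample standard deviation (ddof=1)
z = (x - x.mean()) / x.std()
```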