Pandas
Note
Use `pandas` only for interviews or small datasets, so don't drill down too much on it; just get familiar with the basic operations and pass the interview. For real-world applications, use PySpark, Polars, or DuckDB, which support SQL-like syntax and are faster than `pandas`.
Practice
Common operations
- `merge` performs an inner join by default. Note that `how='left'` is a left *outer* join (there is no such thing as a "left inner join"); the same applies to the `join` method, whose default is `how='left'`.
- `concat` performs an outer join by default, either by rows or by columns. It can join multiple dataframes at once, runs in linear time, and allows an inner join on the index.
- `join` performs a left join by default. Note: `join` doesn't have `left_index`, `right_index`, `left_on`, or `right_on` parameters; it only has `on`. Without `on`, it joins on the index.
- `append` is a shortcut for `concat` by rows (deprecated and removed in pandas 2.0; use `concat` instead).
- `str` accessor for string manipulation. Note that it does not change the column into a `str` type, and `object` is not a string type. Using it on non-string columns produces unexpected results: e.g. `df.col.str.replace('.0', '')` on an object column containing the float `1.0` produces `NaN`, because pandas tries to apply the string method to a float, which raises internally and falls back to `NaN`.
- `df.shape[0] == len(df) != df.size` (`size` counts all cells, i.e. rows × columns).
- Use `pd.concat([pd.Series(arr1), pd.Series(arr2)], axis=1)` instead of `pd.DataFrame([arr1, arr2])` to preserve dtypes (`concat` needs Series, not raw numpy arrays).
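A quick sketch of the join defaults and the `str`-accessor pitfall above (the column names `key`, `x`, `y` are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

# merge defaults to an inner join: only matching keys survive
inner = left.merge(right, on='key')
assert list(inner['key']) == ['b']

# how='left' keeps every left row (a left *outer* join)
outer = left.merge(right, on='key', how='left')
assert len(outer) == 2 and outer['y'].isna().sum() == 1

# .str methods on non-string values in an object column yield NaN, not an error
mixed = pd.Series([1.0, 'a'], dtype=object)
assert mixed.str.upper().isna().tolist() == [True, False]
```

The silent `NaN` in the last case is exactly why the `str` accessor on an `object` column deserves a dtype check first.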
Data manipulation
How to create a dataframe from a bunch of numpy arrays with preserved dtypes?
- Turn each array into a series
- Concatenate the series into a dataframe along the column axis
```python
import numpy as np
import pandas as pd

arr1 = np.arange(10)
arr2 = np.random.rand(10)
arr3 = np.random.choice(['a', 'b'], 10)

series = map(pd.Series, [arr1, arr2, arr3])
df = pd.concat(series, axis=1)
print(df.dtypes)
```

Note: `pd.DataFrame(np.c_[arr1, arr2, arr3])` and `pd.DataFrame([arr1, arr2, arr3]).T` do not preserve dtypes, so avoid them.
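To see why the transpose trick loses dtypes, compare the two approaches on a small int/float pair (a minimal sketch of the note above):

```python
import numpy as np
import pandas as pd

arr1 = np.arange(3)       # integer dtype
arr2 = np.random.rand(3)  # float64

# stacking then transposing upcasts everything to a common dtype
bad = pd.DataFrame([arr1, arr2]).T
assert (bad.dtypes == 'float64').all()  # the integer column is gone

# concatenating Series column-wise keeps each original dtype
good = pd.concat([pd.Series(arr1), pd.Series(arr2)], axis=1)
assert good.dtypes.tolist() == [arr1.dtype, arr2.dtype]
```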
How to merge a dataframe and a series?
```python
# rename the series to the target column name so merge can use it
df.merge(series.rename('col2'), how='left', left_on='col1', right_index=True)
```

How to turn a string column into a numerical form that can be handled by a model?
```python
# ordinal encoding
df.col.astype('category').cat.reorder_categories(['xs', 'md', 'xl'], ordered=True).cat.codes

# one-hot encoding
df.col.str.get_dummies(sep='|')

# one-hot encoding for a column holding lists of strings
df.col.str.join('|').str.get_dummies(sep='|')
```
```python
# sklearn alternatives
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder, OrdinalEncoder

# ordinal encoding; note that `categories` defines the ordering
enc = OrdinalEncoder(categories=[['xs', 'md', 'xl']], dtype=int)

# one-hot encoding (not for lists of strings)
enc = OneHotEncoder(sparse_output=False, dtype=int)

# one-hot encoding for lists of strings
# note it takes the column directly (single brackets, 1-D input)
enc = MultiLabelBinarizer(sparse_output=False)
enc.fit_transform(df.col)                 # list column (storing lists of strings)
enc.fit_transform(df.col.str.split(' '))  # string column (storing sentences)
```
Note: without `sparse_output=False`, the transform output will be a sparse matrix, which can be turned into a dense matrix by calling its `toarray()` method.

Note 2: `pd.factorize()` is not recommended because it doesn't store the mapping.

How to manipulate a column with a list as its value?
- If it is a list of strings, you can use the `str` accessor, e.g. `df.col.str.join('|')` to join the strings in each list with a pipe. Note that the only `str` methods that work on list values are `len` and `join`; other string methods will produce very strange results that you may not catch and that can cost you hours of debugging.
- If it is a list of numbers, you can use `apply` to expand it into columns, e.g. `df.col.apply(pd.Series).mean(axis=1)` to calculate the mean of the numbers in each list.
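The two bullets above can be sketched together (the `tags` and `scores` column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'tags': [['a', 'b'], ['b', 'c']],   # list of strings
    'scores': [[1, 2, 3], [4, 5]],      # list of numbers
})

# list of strings: join then one-hot encode via get_dummies
dummies = df['tags'].str.join('|').str.get_dummies(sep='|')
assert list(dummies.columns) == ['a', 'b', 'c']

# list of numbers: expand to columns with apply(pd.Series), then aggregate
# (missing entries become NaN and are skipped by mean)
means = df['scores'].apply(pd.Series).mean(axis=1)
assert means.tolist() == [2.0, 4.5]
```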