Pandas
Note
Use `pandas` only for interviews or small datasets, so don't drill down too much on it. Just get familiar with the basic operations and pass the interview. For real-world applications, use `PySpark`, `Polars`, or `DuckDB`, which support SQL-like syntax and are faster than `pandas`.
Practice
Common operations
- `merge` performs an inner join by default. Note that `how='left'` is a left outer join, not a left inner join. The same holds for the `join` method, whose default is `how='left'`.
- `concat` performs an outer join by default, either by rows or by columns. It can join multiple dataframes at once, runs in linear time, and allows an inner join on the index via `join='inner'`.
- `join` performs a left join by default. Note: `join` doesn't have `left_index`, `right_index`, `left_on`, or `right_on` parameters. It only has the `on` parameter. Without `on`, it joins on the index.
- `append` is a shortcut for `concat` by rows. It was deprecated in pandas 1.4 and removed in 2.0, so prefer `concat`.
- The `str` accessor is for string manipulation. Note that it does not convert the column to `str` type, and `object` is not a string type. Using it on non-string values produces unexpected results. E.g. `df.col.str.replace('.0', '')` on an object column containing the float `1.0` produces `NaN`, because pandas tries to call `.replace('.0', '')` on a float, which raises an exception internally and is turned into `NaN`.
- `df.shape[0] == len(df) != df.size`: the first two give the row count, while `df.size` is rows times columns.
- Use `pd.concat([pd.Series(arr1), pd.Series(arr2)], axis=1)` instead of `pd.DataFrame([arr1, arr2])` to preserve dtypes.
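The join defaults and the `str` accessor pitfall listed above can be checked with a small sketch. The frames `left` and `right` and the series `mixed` are made-up examples, not data from this post.

```python
import pandas as pd

# Hypothetical toy frames sharing a "key" column
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = left.merge(right, on="key")                    # default how='inner'
left_outer = left.merge(right, on="key", how="left")   # keeps unmatched left rows

# join matches on the index unless `on` is given, and defaults to how='left'
joined = left.set_index("key").join(right.set_index("key"))

# concat defaults to an outer join on the other axis
stacked = pd.concat([left, right])                     # union of columns
shared = pd.concat([left, right], join="inner")        # only shared columns

# str accessor on an object column holding a real string and a float
mixed = pd.Series(["1.0", 1.0], dtype=object)
cleaned = mixed.str.replace(".0", "", regex=False)     # the float becomes NaN

print(len(inner), len(left_outer), len(joined))
print(list(shared.columns))
print(cleaned.tolist())
print(left.shape[0] == len(left), left.size)
```

Passing `regex=False` keeps `'.0'` from being read as a regular expression, where `.` would match any character.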
Data manipulation
How to create a dataframe from a bunch of numpy arrays with preserved dtypes?
- Turn each array into a series
- Concatenate the series into a dataframe along the column axis
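    import numpy as np
    import pandas as pd

    arr1 = np.arange(10)                      # integers
    arr2 = np.random.rand(10)                 # floats
    arr3 = np.random.choice(['a', 'b'], 10)   # strings
    # wrap each array in a Series so every column keeps its own dtype
    series = [pd.Series(arr) for arr in (arr1, arr2, arr3)]
    df = pd.concat(series, axis=1)
    print(df.dtypes)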
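Note: `pd.DataFrame(np.c_[arr1, arr2, arr3])` or `pd.DataFrame([arr1, arr2, arr3]).T` do not preserve dtypes, so avoid them. `np.c_` coerces all arrays to one common dtype, and building the frame row by row mixes types within each column.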
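To see the difference, the three constructions can be compared side by side. This is a minimal sketch with small made-up arrays, and the variable names `good`, `stacked`, and `transposed` are only for illustration.

```python
import numpy as np
import pandas as pd

arr1 = np.arange(3)                   # integers
arr2 = np.array([0.1, 0.2, 0.3])      # floats
arr3 = np.array(["a", "b", "a"])      # strings

# One Series per array: each column keeps its dtype
good = pd.concat([pd.Series(a) for a in (arr1, arr2, arr3)], axis=1)

# np.c_ first builds one array with a single common dtype (strings here)
stacked = pd.DataFrame(np.c_[arr1, arr2, arr3])

# Row-wise construction mixes types in every column before the transpose
transposed = pd.DataFrame([arr1, arr2, arr3]).T

print(good.dtypes.tolist())
print(stacked.dtypes.tolist())
print(transposed.dtypes.tolist())
```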
How to merge a dataframe and a series?
    # rename the series to the column name
    df.merge(series.rename('col2'), how='left', left_on='col1', right_index=True)
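A small runnable version of the merge above, assuming a made-up frame whose `col1` holds the lookup keys and a made-up series indexed by those keys:

```python
import pandas as pd

# Hypothetical data: "z" has no match in the series index
df = pd.DataFrame({"col1": ["a", "b", "z"]})
series = pd.Series({"a": 1, "b": 2})

# The series name becomes the new column name; its index is the join key
merged = df.merge(
    series.rename("col2"), how="left", left_on="col1", right_index=True
)
print(merged)
```

Because this is a left join, the unmatched row stays in the result with `NaN` in `col2`.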
How to turn a string column into a numerical form that a model can handle?

    # for ordinal encoding
    df.col.astype('category').cat.reorder_categories(['xs', 'md', 'xl'], ordered=True).cat.codes
    # for one-hot encoding
    df.col.str.get_dummies(sep='|')
    # for a list of strings
    df.col.str.join('|').str.get_dummies(sep='|')

    # sklearn alternative
    from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder, OrdinalEncoder
    # for ordinal encoding, note that categories sets the ordering
    enc = OrdinalEncoder(categories=[['xs', 'md', 'xl']], dtype=int)
    # for one-hot encoding (not for a list of strings)
    enc = OneHotEncoder(sparse_output=False, dtype=int)
    # for one-hot encoding (for a list of strings)
    # note it takes a 1-D column selected with single brackets
    enc = MultiLabelBinarizer(sparse_output=False)
    enc.fit_transform(df.col)  # for a list column (storing lists of strings)
    enc.fit_transform(df.col.str.split(" "))  # for a string column (storing a sentence)

    # Note: without sparse_output=False, OneHotEncoder returns a sparse matrix,
    # which can be turned into a dense matrix by calling its toarray() method
    # Note 2: pd.factorize() is not recommended because it doesn't store the mapping
    # in a reusable encoder object
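The pandas-only encodings can be tried on toy data. This is a sketch with invented series (`sizes`, `tags`, `tag_lists`), and it leaves out the sklearn encoders.

```python
import pandas as pd

# Ordinal encoding: codes follow the category order given explicitly
sizes = pd.Series(["md", "xs", "xl", "md"])
ordinal = (
    sizes.astype("category")
    .cat.reorder_categories(["xs", "md", "xl"], ordered=True)
    .cat.codes
)

# One-hot encoding of a delimited string column
tags = pd.Series(["red|blue", "blue", "green"])
one_hot = tags.str.get_dummies(sep="|")

# A list column is joined into a delimited string first
tag_lists = pd.Series([["red", "blue"], ["blue"], ["green"]])
multi_hot = tag_lists.str.join("|").str.get_dummies(sep="|")

print(ordinal.tolist())
print(one_hot)
print(multi_hot)
```

Both one-hot variants should produce the same indicator columns, since the list column is reduced to the delimited form before encoding.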
How to manipulate a column with a list as its value?
- If it is a list of strings, you can use the `str` accessor to manipulate it, e.g. `df.col.str.join('|')` joins the strings in each list with a pipe. Note that the only `str` methods that work reliably on a list of strings are `len` and `join` (plus element access with `get` or indexing). Other string methods produce very strange results that you may not catch and that can cost you hours of debugging.
- If it is a list of numbers, you can use `apply` to expand each list into a row of a dataframe, e.g. `df.col.apply(pd.Series).mean(axis=1)` calculates the mean of the numbers in each list.
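Both cases can be sketched on a made-up frame. The column names `tags` and `scores` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "tags": [["a", "b"], ["c"]],        # lists of strings
    "scores": [[1, 2, 3], [4, 6]],      # lists of numbers, unequal lengths
})

joined = df["tags"].str.join("|")       # one delimited string per row
lengths = df["tags"].str.len()          # number of items in each list

# Each list becomes a row; shorter lists are padded with NaN,
# which mean() skips by default
means = df["scores"].apply(pd.Series).mean(axis=1)

print(joined.tolist())
print(lengths.tolist())
print(means.tolist())
```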