4.1. Change Values¶
pandas’ methods to change the values of a pandas DataFrame or a pandas Series.
4.1.1. pandas.DataFrame.agg: Aggregate over Columns or Rows Using Multiple Operations¶
If you want to aggregate over columns or rows using one or more operations, try pd.DataFrame.agg
.
from collections import Counter
import pandas as pd
def count_two(nums: list):
return Counter(nums)[2]
df = pd.DataFrame({"coll": [1, 3, 5], "col2": [2, 4, 6]})
df.agg(["sum", count_two])
coll | col2 | |
---|---|---|
sum | 9 | 12 |
count_two | 0 | 1 |
4.1.2. pandas.DataFrame.agg: Apply Different Aggregations to Different Columns¶
If you want to apply different aggregations to different columns, insert a dictionary of column and aggregation methods to the pd.DataFrame.agg
method.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df.agg({"a": ["sum", "mean"], "b": ["min", "max"]})
a | b | |
---|---|---|
sum | 10.0 | NaN |
mean | 2.5 | NaN |
min | NaN | 2.0 |
max | NaN | 5.0 |
4.1.3. pandas.DataFrame.pipe: Increase the Readability of your Code when Applying Multiple Functions to a DataFrame¶
If you want to increase the readability of your code when applying multiple functions to a DataFrame, use pands.DataFrame.pipe
method.
from textblob import TextBlob
def remove_white_space(df: pd.DataFrame):
df['text'] = df['text'].apply(lambda row: row.strip())
return df
def get_sentiment(df: pd.DataFrame):
df['sentiment'] = df['text'].apply(lambda row:
TextBlob(row).sentiment[0])
return df
df = pd.DataFrame({'text': ["It is a beautiful day today ",
" This movie is terrible"]})
df = (df.pipe(remove_white_space)
.pipe(get_sentiment)
)
df
text | sentiment | |
---|---|---|
0 | It is a beautiful day today | 0.85 |
1 | This movie is terrible | -1.00 |
4.1.4. pandas.Series.map: Change Values of a Pandas Series Using a Dictionary¶
If you want to change values of a pandas Series using a dictionary, use pd.Series.map
.
s = pd.Series(["a", "b", "c"])
s.map({"a": 1, "b": 2, "c": 3})
0 1
1 2
2 3
dtype: int64
4.1.5. pandas.Series.str: Manipulate Text Data in a Pandas Series¶
If you are working the text data in a pandas Series, instead of creating your own functions, use pandas.Series.str
to access common methods to process string.
The code below shows how to convert text to lower case then replace “e” with “a”.
fruits = pd.Series(['Orange', 'Apple', 'Grape'])
fruits
0 Orange
1 Apple
2 Grape
dtype: object
fruits.str.lower()
0 orange
1 apple
2 grape
dtype: object
fruits.str.lower().str.replace("e", "a")
0 oranga
1 appla
2 grapa
dtype: object
Find other useful string methods here.
4.1.6. set_categories in Pandas: Sort Categorical Column by a Specific Ordering¶
If you want to sort pandas DataFrame’s categorical column by a specific ordering such as small, medium, large, use df.col.cat.set_categories()
method.
df = pd.DataFrame(
{"col1": ["large", "small", "mini", "medium", "mini"], "col2": [1, 2, 3, 4, 5]}
)
ordered_sizes = "large", "medium", "small", "mini"
df.col1 = df.col1.astype("category")
df.col1.cat.set_categories(ordered_sizes, ordered=True, inplace=True)
df.sort_values(by="col1")
col1 | col2 | |
---|---|---|
0 | large | 1 |
3 | medium | 4 |
1 | small | 2 |
2 | mini | 3 |
4 | mini | 5 |
4.1.7. parse_dates: Convert Columns into Datetime When Using Pandas to Read CSV Files¶
If there are datetime columns in your csv file, use the parse_dates
parameter when reading csv file with pandas. This reduces one extra step to convert these columns from string to datetime after reading the file.
df = pd.read_csv("data1.csv", parse_dates=["date_column_1", "date_column_2"])
df
date_column_1 | date_column_2 | value | |
---|---|---|---|
0 | 2021-02-10 | 2021-02-11 | 3 |
1 | 2021-02-12 | 2021-02-13 | 3 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_column_1 2 non-null datetime64[ns]
1 date_column_2 2 non-null datetime64[ns]
2 value 2 non-null int64
dtypes: datetime64[ns](2), int64(1)
memory usage: 176.0 bytes
4.1.8. Pandas.Series.isin: Filter Rows Only If Column Contains Values From Another List¶
When working with a pandas Dataframe, if you want to select the rows when a column contains values from another list, the fastest way is to use isin
.
In the example below, row 2
is filtered out because 3
is not in the list.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df
a | b | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
l = [1, 2, 6, 7]
df.a.isin(l)
0 True
1 True
2 False
Name: a, dtype: bool
df = df[df.a.isin(l)]
df
a | b | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
4.1.9. Specify Suffixes When Using df.merge()¶
If you are merging 2 dataframes that have the same features using df.merge()
, it might be confusing to know which dataframe a_x
or a_y
belongs to.
df1 = pd.DataFrame({"left_key": [1, 2, 3], "a": [4, 5, 6]})
df2 = pd.DataFrame({"right_key": [1, 2, 3], "a": [5, 6, 7]})
df1.merge(df2, left_on="left_key", right_on="right_key")
left_key | a_x | right_key | a_y | |
---|---|---|---|---|
0 | 1 | 4 | 1 | 5 |
1 | 2 | 5 | 2 | 6 |
2 | 3 | 6 | 3 | 7 |
A better way is to specify suffixes of the features in each Dataframe like below. Now a_x
becomes a_left
and a_y
becomes a_right
.
df1.merge(df2, left_on="left_key", right_on="right_key", suffixes=("_left", "_right"))
left_key | a_left | right_key | a_right | |
---|---|---|---|---|
0 | 1 | 4 | 1 | 5 |
1 | 2 | 5 | 2 | 6 |
2 | 3 | 6 | 3 | 7 |
Try it if you want the names of your columns to be less confusing.
4.1.10. Highlight your pandas DataFrame¶
Have you ever wanted to highlight your pandas DataFrame to analyze it easier? For example, positive values will be highlighted as green and negative values will be highlighted as red.
That could be done with df.style.apply(highlight_condition_func)
.
df = pd.DataFrame({"col1": [-5, -2, 1, 4], "col2": [2, 3, -1, 4]})
def highlight_number(row):
return [
"background-color: red; color: white"
if cell <= 0
else "background-color: green; color: white"
for cell in row
]
df.style.apply(highlight_number)
col1 | col2 | |
---|---|---|
0 | -5 | 2 |
1 | -2 | 3 |
2 | 1 | -1 |
3 | 4 | 4 |
4.1.11. Assign Values to Multiple New Columns¶
If you want to assign values to multiple new columns, instead of assigning them separately, you can do everything in one line of code with df.assign
.
In the code below, I first created col3
then use col3
to create col4
. Everything is in one line of code.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
df = df.assign(col3=lambda x: x.col1 * 100 + x.col2).assign(
col4=lambda x: x.col2 * x.col3
)
df
col1 | col2 | col3 | col4 | |
---|---|---|---|---|
0 | 1 | 3 | 103 | 309 |
1 | 2 | 4 | 204 | 816 |
4.1.12. Reduce pandas.DataFrame’s Memory¶
If you want to reduce the memory of your pandas DataFrame, start with changing the data type of a column. If your categorical variable has low cardinality, change the data type to category like below.
from sklearn.datasets import load_iris
X, y = load_iris(as_frame=True, return_X_y=True)
df = pd.concat([X, pd.DataFrame(y, columns=["target"])], axis=1)
df.memory_usage()
Index 128
sepal length (cm) 1200
sepal width (cm) 1200
petal length (cm) 1200
petal width (cm) 1200
target 1200
dtype: int64
df["target"] = df["target"].astype("category")
df.memory_usage()
Index 128
sepal length (cm) 1200
sepal width (cm) 1200
petal length (cm) 1200
petal width (cm) 1200
target 282
dtype: int64
The memory is now is reduced to almost a fifth of what it was!
4.1.13. pandas.DataFrame.explode: Transform Each Element in an Iterable to a Row¶
When working with pandas DataFrame
, if you want to transform each element in an iterable to a row, use explode
.
df = pd.DataFrame({"a": [[1, 2], [4, 5]], "b": [11, 13]})
df
a | b | |
---|---|---|
0 | [1, 2] | 11 |
1 | [4, 5] | 13 |
df.explode("a")
a | b | |
---|---|---|
0 | 1 | 11 |
0 | 2 | 11 |
1 | 4 | 13 |
1 | 5 | 13 |
4.1.14. pandas.cut: Bin a DataFrame’s values into Discrete Intervals¶
If you want to bin your Dataframe’s values into discrete intervals, use pd.cut
.
df = pd.DataFrame({"a": [1, 3, 7, 11, 14, 17]})
bins = [0, 5, 10, 15, 20]
df["binned"] = pd.cut(df["a"], bins=bins)
df
a | binned | |
---|---|---|
0 | 1 | (0, 5] |
1 | 3 | (0, 5] |
2 | 7 | (5, 10] |
3 | 11 | (10, 15] |
4 | 14 | (10, 15] |
5 | 17 | (15, 20] |
4.1.15. Forward Fill in pandas: Use the Previous Value to Fill the Current Missing Value¶
If you want to use the previous value in a column or a row to fill the current missing value in a pandas DataFrame, use df.fillna(method=’ffill’)
. ffill
stands for forward fill.
import numpy as np
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan], "c": [1, 2, 3]})
df
a | b | c | |
---|---|---|---|
0 | 1.0 | 4.0 | 1 |
1 | NaN | 5.0 | 2 |
2 | 3.0 | NaN | 3 |
df = df.fillna(method="ffill")
df
a | b | c | |
---|---|---|---|
0 | 1.0 | 4.0 | 1 |
1 | 1.0 | 5.0 | 2 |
2 | 3.0 | 5.0 | 3 |
4.1.16. pandas.pivot_table: Turn Your DataFrame Into a Pivot Table¶
A pivot table is useful to summarize and analyze the patterns in your data. If you want to turn your DataFrame into a pivot table, use pandas.pivot_table
.
df = pd.DataFrame(
{
"item": ["apple", "apple", "apple", "apple", "apple"],
"size": ["small", "small", "large", "large", "large"],
"location": ["Walmart", "Aldi", "Walmart", "Aldi", "Aldi"],
"price": [3, 2, 4, 3, 2.5],
}
)
df
item | size | location | price | |
---|---|---|---|---|
0 | apple | small | Walmart | 3.0 |
1 | apple | small | Aldi | 2.0 |
2 | apple | large | Walmart | 4.0 |
3 | apple | large | Aldi | 3.0 |
4 | apple | large | Aldi | 2.5 |
pivot = pd.pivot_table(
df, values="price", index=["item", "size"], columns=["location"], aggfunc="mean"
)
pivot
location | Aldi | Walmart | |
---|---|---|---|
item | size | ||
apple | large | 2.75 | 4.0 |
small | 2.00 | 3.0 |
4.1.17. pandas’ DateOffset: Add a Time Interval to a pandas Timestamp¶
If you want to add days, months, or other time intervals to a pandas Timestamp
, use pd.DateOffset
.
import pandas as pd
from pandas.tseries.offsets import DateOffset, BDay
ts = pd.Timestamp('2021-10-10 9:00:00')
# Increase the timestamp by 3 months
ts + DateOffset(months=3)
Timestamp('2022-01-10 09:00:00')
# Increase the timestamp by 3 years and 3 hours
ts + DateOffset(years=3, hours=3)
Timestamp('2024-10-10 12:00:00')
You can also increase the timestamp by n
business days using BDay
.
# Increase the timestamp by 6 business days
ts + BDay(n=6)
Timestamp('2021-10-18 09:00:00')
4.1.18. pandas.DataFrame.combine_first: Update Null Elements Using Another DataFrame¶
If you want to fill null values in one DataFrame with non-null values at the same locations from another DataFrame, use pandas.DataFrame.combine_first
.
import pandas as pd
store1 = pd.DataFrame({"orange": [None, 5], "apple": [None, 1]})
store2 = pd.DataFrame({"orange": [1, 3], "apple": [4, 2]})
print(store1)
orange apple
0 NaN NaN
1 5.0 1.0
print(store2)
orange apple
0 1 4
1 3 2
In the code below, the values at the first row of store1
are updated with the values at the first row of store2
.
print(store1.combine_first(store2))
orange apple
0 1.0 4.0
1 5.0 1.0