Change Values¶

pandas’ methods to change the values of a pandas DataFrame or a pandas Series.

pandas.DataFrame.agg: Aggregate over Columns or Rows Using Multiple Operations¶

If you want to aggregate over columns or rows using one or more operations, try pd.DataFrame.agg.

from collections import Counter
import pandas as pd


def count_two(nums: list):
    return Counter(nums)[2]


df = pd.DataFrame({"coll": [1, 3, 5], "col2": [2, 4, 6]})
df.agg(["sum", count_two])

	coll	col2
sum	9	12
count_two	0	1

pandas.DataFrame.agg: Apply Different Aggregations to Different Columns¶

If you want to apply different aggregations to different columns, insert a dictionary of column and aggregation methods to the pd.DataFrame.agg method.

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})

df.agg({"a": ["sum", "mean"], "b": ["min", "max"]})

	a	b
sum	10.0	NaN
mean	2.5	NaN
min	NaN	2.0
max	NaN	5.0

pandas.DataFrame.pipe: Increase the Readability of your Code when Applying Multiple Functions to a DataFrame¶

If you want to increase the readability of your code when applying multiple functions to a DataFrame, use pands.DataFrame.pipe method.

from textblob import TextBlob

def remove_white_space(df: pd.DataFrame):
    df['text'] = df['text'].apply(lambda row: row.strip())
    return df

def get_sentiment(df: pd.DataFrame):
    df['sentiment'] = df['text'].apply(lambda row:
                                    TextBlob(row).sentiment[0])
    return df

df = pd.DataFrame({'text': ["It is a beautiful day today  ",
                        "  This movie is terrible"]})

df = (df.pipe(remove_white_space)
    .pipe(get_sentiment)
)

df

	text	sentiment
0	It is a beautiful day today	0.85
1	This movie is terrible	-1.00

pandas.Series.map: Change Values of a Pandas Series Using a Dictionary¶

If you want to change values of a pandas Series using a dictionary, use pd.Series.map.

s = pd.Series(["a", "b", "c"])

s.map({"a": 1, "b": 2, "c": 3})

  1
  2
  3
dtype: int64

pandas.Series.str: Manipulate Text Data in a Pandas Series¶

If you are working the text data in a pandas Series, instead of creating your own functions, use pandas.Series.str to access common methods to process string.

The code below shows how to convert text to lower case then replace “e” with “a”.

fruits = pd.Series(['Orange', 'Apple', 'Grape'])
fruits

  Orange
   Apple
   Grape
dtype: object

fruits.str.lower()

  orange
   apple
   grape
dtype: object

fruits.str.lower().str.replace("e", "a")

  oranga
   appla
   grapa
dtype: object

Find other useful string methods here.

set_categories in Pandas: Sort Categorical Column by a Specific Ordering¶

If you want to sort pandas DataFrame’s categorical column by a specific ordering such as small, medium, large, use df.col.cat.set_categories() method.

df = pd.DataFrame(
    {"col1": ["large", "small", "mini", "medium", "mini"], "col2": [1, 2, 3, 4, 5]}
)
ordered_sizes = "large", "medium", "small", "mini"

df.col1 = df.col1.astype("category")
df.col1.cat.set_categories(ordered_sizes, ordered=True, inplace=True)
df.sort_values(by="col1")

	col1	col2
0	large	1
3	medium	4
1	small	2
2	mini	3
4	mini	5

parse_dates: Convert Columns into Datetime When Using Pandas to Read CSV Files¶

If there are datetime columns in your csv file, use the parse_dates parameter when reading csv file with pandas. This reduces one extra step to convert these columns from string to datetime after reading the file.

df = pd.read_csv("data1.csv", parse_dates=["date_column_1", "date_column_2"])

df

	date_column_1	date_column_2	value
0	2021-02-10	2021-02-11	3
1	2021-02-12	2021-02-13	3

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date_column_1  2 non-null      datetime64[ns]
 1   date_column_2  2 non-null      datetime64[ns]
 2   value          2 non-null      int64         
dtypes: datetime64[ns](2), int64(1)
memory usage: 176.0 bytes

Pandas.Series.isin: Filter Rows Only If Column Contains Values From Another List¶

When working with a pandas Dataframe, if you want to select the rows when a column contains values from another list, the fastest way is to use isin.

In the example below, row 2 is filtered out because 3 is not in the list.

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df

	a	b
0	1	4
1	2	5
2	3	6

l = [1, 2, 6, 7]
df.a.isin(l)

   True
   True
  False
Name: a, dtype: bool

df = df[df.a.isin(l)]
df

	a	b
0	1	4
1	2	5

Specify Suffixes When Using df.merge()¶

If you are merging 2 dataframes that have the same features using df.merge(), it might be confusing to know which dataframe a_x or a_y belongs to.

df1 = pd.DataFrame({"left_key": [1, 2, 3], "a": [4, 5, 6]})
df2 = pd.DataFrame({"right_key": [1, 2, 3], "a": [5, 6, 7]})
df1.merge(df2, left_on="left_key", right_on="right_key")

	left_key	a_x	right_key	a_y
0	1	4	1	5
1	2	5	2	6
2	3	6	3	7

A better way is to specify suffixes of the features in each Dataframe like below. Now a_x becomes a_left and a_y becomes a_right.

df1.merge(df2, left_on="left_key", right_on="right_key", suffixes=("_left", "_right"))

	left_key	a_left	right_key	a_right
0	1	4	1	5
1	2	5	2	6
2	3	6	3	7

Try it if you want the names of your columns to be less confusing.

Highlight your pandas DataFrame¶

Have you ever wanted to highlight your pandas DataFrame to analyze it easier? For example, positive values will be highlighted as green and negative values will be highlighted as red.

That could be done with df.style.apply(highlight_condition_func).

df = pd.DataFrame({"col1": [-5, -2, 1, 4], "col2": [2, 3, -1, 4]})

def highlight_number(row):
    return [
        "background-color: red; color: white"
        if cell <= 0
        else "background-color: green; color: white"
        for cell in row
    ]

df.style.apply(highlight_number)

	col1	col2
0	-5	2
1	-2	3
2	1	-1
3	4	4

Assign Values to Multiple New Columns¶

If you want to assign values to multiple new columns, instead of assigning them separately, you can do everything in one line of code with df.assign.

In the code below, I first created col3 then use col3 to create col4. Everything is in one line of code.

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

df = df.assign(col3=lambda x: x.col1 * 100 + x.col2).assign(
    col4=lambda x: x.col2 * x.col3
)
df

	col1	col2	col3	col4
0	1	3	103	309
1	2	4	204	816

Reduce pandas.DataFrame’s Memory¶

If you want to reduce the memory of your pandas DataFrame, start with changing the data type of a column. If your categorical variable has low cardinality, change the data type to category like below.

from sklearn.datasets import load_iris

X, y = load_iris(as_frame=True, return_X_y=True)
df = pd.concat([X, pd.DataFrame(y, columns=["target"])], axis=1)
df.memory_usage()

Index                 128
sepal length (cm)    1200
sepal width (cm)     1200
petal length (cm)    1200
petal width (cm)     1200
target               1200
dtype: int64

df["target"] = df["target"].astype("category")
df.memory_usage()

Index                 128
sepal length (cm)    1200
sepal width (cm)     1200
petal length (cm)    1200
petal width (cm)     1200
target                282
dtype: int64

The memory is now is reduced to almost a fifth of what it was!

pandas.DataFrame.explode: Transform Each Element in an Iterable to a Row¶

When working with pandas DataFrame, if you want to transform each element in an iterable to a row, use explode.

df = pd.DataFrame({"a": [[1, 2], [4, 5]], "b": [11, 13]})
df

	a	b
0	[1, 2]	11
1	[4, 5]	13

df.explode("a")

	a	b
0	1	11
0	2	11
1	4	13
1	5	13

pandas.cut: Bin a DataFrame’s values into Discrete Intervals¶

If you want to bin your Dataframe’s values into discrete intervals, use pd.cut.

df = pd.DataFrame({"a": [1, 3, 7, 11, 14, 17]})

bins = [0, 5, 10, 15, 20]
df["binned"] = pd.cut(df["a"], bins=bins)

df

	a	binned
0	1	(0, 5]
1	3	(0, 5]
2	7	(5, 10]
3	11	(10, 15]
4	14	(10, 15]
5	17	(15, 20]

Forward Fill in pandas: Use the Previous Value to Fill the Current Missing Value¶

If you want to use the previous value in a column or a row to fill the current missing value in a pandas DataFrame, use df.fillna(method=’ffill’). ffill stands for forward fill.

import numpy as np

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan], "c": [1, 2, 3]})
df

	a	b	c
0	1.0	4.0	1
1	NaN	5.0	2
2	3.0	NaN	3

df = df.fillna(method="ffill")
df

	a	b	c
0	1.0	4.0	1
1	1.0	5.0	2
2	3.0	5.0	3

pandas.pivot_table: Turn Your DataFrame Into a Pivot Table¶

A pivot table is useful to summarize and analyze the patterns in your data. If you want to turn your DataFrame into a pivot table, use pandas.pivot_table.

df = pd.DataFrame(
    {
        "item": ["apple", "apple", "apple", "apple", "apple"],
        "size": ["small", "small", "large", "large", "large"],
        "location": ["Walmart", "Aldi", "Walmart", "Aldi", "Aldi"],
        "price": [3, 2, 4, 3, 2.5],
    }
)

df

	item	size	location	price
0	apple	small	Walmart	3.0
1	apple	small	Aldi	2.0
2	apple	large	Walmart	4.0
3	apple	large	Aldi	3.0
4	apple	large	Aldi	2.5

pivot = pd.pivot_table(
    df, values="price", index=["item", "size"], columns=["location"], aggfunc="mean"
)
pivot

	location	Aldi	Walmart
item	size
apple	large	2.75	4.0
apple	small	2.00	3.0

Effective Python for Data Scientists