6.9. Visualization

This section covers some tools to visualize your data and model.

6.9.1. Graphviz: Create a Flowchart to Capture Your Ideas in Python

A flowchart helps you summarize and visualize your workflow, and it helps your team understand that workflow too. Wouldn’t it be nice if you could create a flowchart using Python?

Graphviz makes it easy to create a flowchart like the one below.

!pip install graphviz
from graphviz import Graph 

# Instantiate a new Graph object
dot = Graph('Data Science Process', format='png')

# Add nodes
dot.node('A', 'Get Data')
dot.node('B', 'Clean, Prepare, & Manipulate Data')
dot.node('C', 'Train Model')
dot.node('D', 'Test Data')
dot.node('E', 'Improve')

# Connect these nodes
dot.edges(['AB', 'BC', 'CD', 'DE'])

# Save chart
dot.render('data_science_flowchart', view=True)
'data_science_flowchart.png'
dot 
../_images/visualization_6_0.svg
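
Graphviz renders a description written in the DOT language. The following is a minimal pure-Python sketch of roughly what the nodes and edges above translate to. The to_dot helper is hypothetical (it is not part of the graphviz package), and the text the library actually emits may be formatted differently. It also shows how each two-character string passed to dot.edges is read as a tail node followed by a head node.

```python
def to_dot(name, nodes, edges, directed=False):
    """Build DOT source text for a small graph (hypothetical helper)."""
    keyword = "digraph" if directed else "graph"
    connector = "->" if directed else "--"
    lines = [f'{keyword} "{name}" {{']
    for node_id, label in nodes.items():
        lines.append(f'    {node_id} [label="{label}"]')
    for pair in edges:
        tail, head = pair  # 'AB' unpacks into tail 'A' and head 'B'
        lines.append(f"    {tail} {connector} {head}")
    lines.append("}")
    return "\n".join(lines)


nodes = {
    "A": "Get Data",
    "B": "Clean, Prepare, & Manipulate Data",
    "C": "Train Model",
    "D": "Test Data",
    "E": "Improve",
}
edges = ["AB", "BC", "CD", "DE"]

source = to_dot("Data Science Process", nodes, edges)
print(source)
```

Passing directed=True switches the connector to ->, which is the form the library's Digraph class works with. Directed edges are drawn with arrowheads, which is often what a flowchart needs.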

Link to graphviz

6.9.2. folium: Create an Interactive Map in Python

!pip install folium

If you want to create a map of a given location in a few lines of code, try folium. folium is a Python library for creating interactive maps.

import folium
m = folium.Map(location=[45.5236, -122.6750])

tooltip = 'Click me!'
folium.Marker([45.3288, -121.6625], popup='<i>Mt. Hood Meadows</i>',
              tooltip=tooltip).add_to(m)
m 

View the folium documentation here.

I used this library to view the locations of the owners of top machine learning repositories. Pretty cool to see their locations through an interactive map.
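
When plotting several locations, as in the repository-owner map mentioned above, one practical detail is where to center the map. The sketch below averages the marker coordinates, which is a reasonable assumption only for points that are fairly close together. The center_of and build_map helpers and the places list are hypothetical. The folium calls are the same ones used earlier in this section.

```python
def center_of(points):
    """Return the mean latitude and longitude of (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return [sum(lats) / len(lats), sum(lons) / len(lons)]


def build_map(places):
    """Create a folium map with one marker per (name, lat, lon) entry."""
    import folium  # imported lazily so center_of works without folium

    m = folium.Map(location=center_of([(lat, lon) for _, lat, lon in places]))
    for name, lat, lon in places:
        folium.Marker([lat, lon], popup=name, tooltip=name).add_to(m)
    return m


# Hypothetical list of places, reusing the coordinates from the example above
places = [
    ("Map center from the example above", 45.5236, -122.6750),
    ("Mt. Hood Meadows", 45.3288, -121.6625),
]

center = center_of([(lat, lon) for _, lat, lon in places])
print(center)

# In a notebook: m = build_map(places), then evaluate `m` to display the map
```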

6.9.3. dtreeviz: Visualize and Interpret a Decision Tree Model

!pip install dtreeviz

If you want to find an easy way to visualize and interpret a decision tree model, use dtreeviz.

from dtreeviz.trees import dtreeviz
from sklearn import tree
from sklearn.datasets import load_wine

wine = load_wine()
classifier = tree.DecisionTreeClassifier(max_depth=2)
classifier.fit(wine.data, wine.target)

vis = dtreeviz(
    classifier,
    wine.data,
    wine.target,
    target_name="wine_type",
    feature_names=wine.feature_names,
)

vis.view()

The image below shows the output of dtreeviz when applied to a DecisionTreeClassifier.

image
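
If you also want a text-only view of the same splits (for example, to paste into a report or to check against the picture), scikit-learn's export_text prints the fitted tree as indented rules. This is a companion sketch rather than part of dtreeviz. It refits the same model as above, and random_state=0 is an added assumption to make the result repeatable.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()

# Same model as above; random_state is fixed here only for repeatability
classifier = DecisionTreeClassifier(max_depth=2, random_state=0)
classifier.fit(wine.data, wine.target)

# Render the decision rules as indented text, one line per split or leaf
rules = export_text(classifier, feature_names=list(wine.feature_names))
print(rules)
```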

Link to dtreeviz.

6.9.4. HiPlot: High-Dimensional Interactive Plotting

!pip install hiplot 

If you are tuning hyperparameters of your machine learning model, it can be difficult to understand the relationships between different combinations of hyperparameters and a specific metric.

That is when HiPlot comes in handy. HiPlot allows you to discover patterns in high-dimensional data using parallel plots like the one below.

import hiplot as hip
data = [{'lr': 0.001, 'loss': 10.0, 'r2': 0.8, 'optimizer': 'SGD'},
        {'lr': 0.01, 'loss': 2.5, 'r2': 0.9, 'optimizer': 'Adam'},
        {'lr': 0.1, 'loss': 4, 'r2': 0.86, 'optimizer': 'Adam'}]
hip.Experiment.from_iterable(data).display()
<hiplot.ipython.IPythonExperimentDisplayed at 0x7f10ba5ebbb0>
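
In practice, the list of dictionaries usually comes from a hyperparameter search rather than being typed by hand. The sketch below builds rows in the same shape from a small grid. The evaluate function is a hypothetical stand-in for training and scoring a model, and the loss it returns is a made-up formula, not a real result.

```python
from itertools import product


def evaluate(lr, optimizer):
    """Hypothetical stand-in for training a model and measuring its loss."""
    penalty = 0.0 if optimizer == "Adam" else 1.0
    loss = abs(lr - 0.01) * 100 + penalty  # placeholder formula, not real data
    return {"loss": round(loss, 3)}


learning_rates = [0.001, 0.01, 0.1]
optimizers = ["SGD", "Adam"]

# One dictionary per combination, in the shape used in the example above
rows = [
    {"lr": lr, "optimizer": opt, **evaluate(lr, opt)}
    for lr, opt in product(learning_rates, optimizers)
]

for row in rows:
    print(row)

# With HiPlot installed: hip.Experiment.from_iterable(rows).display()
```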

Link to HiPlot.

6.9.5. missingno.dendrogram: Visualize Correlation Between Missing Data

!pip install missingno scikit-learn

Missing values can sometimes tell you how strongly the presence or absence of one variable affects the presence of another. To visualize the correlation between columns based on their missing values, use missingno.dendrogram.

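import missingno as msno
from sklearn.datasets import fetch_openml

# Load the soybean dataset as a DataFrame
soybean = fetch_openml(name="soybean", as_frame=True)['data']

# Cluster columns by how similarly their values are missing
msno.dendrogram(soybean)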
<AxesSubplot:>
../_images/visualization_29_1.png

The dendrogram uses a hierarchical clustering algorithm to bin variables against one another by their nullity correlation. Cluster leaves that are linked together at a distance of zero fully predict one another’s presence. In the graph above, the nullity of seed-discolor fully predicts the nullity of germination.
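
To see what "a distance of zero" means, here is a small sketch on a hypothetical DataFrame. It turns each column into a 0/1 missing-value indicator and compares columns. The Euclidean distance used here is an assumption for illustration, and missingno's own clustering settings may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a and b are missing in the same rows, c is not
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan, 5.0],
        "b": [10.0, np.nan, 30.0, np.nan, 50.0],
        "c": [np.nan, 2.0, 3.0, 4.0, 5.0],
    }
)

# 1 where a value is missing, 0 where it is present
nullity = df.isnull().astype(int)

# Correlation between the missing-value indicators of each pair of columns
nullity_corr = nullity.corr()
print(nullity_corr.round(2))

# Distance between indicator columns: zero means identical missing patterns
dist_ab = float(np.linalg.norm(nullity["a"].to_numpy() - nullity["b"].to_numpy()))
dist_ac = float(np.linalg.norm(nullity["a"].to_numpy() - nullity["c"].to_numpy()))
print(dist_ab, dist_ac)
```

Columns a and b have identical missing patterns, so knowing whether one is missing tells you whether the other is.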

Link to missingno.

6.9.6. matplotlib-venn: Create a Venn Diagram Using Python

!pip install matplotlib-venn

If you want to draw a Venn diagram using Python, try matplotlib-venn. To create a Venn diagram using matplotlib-venn, you can specify the size of each region:

import matplotlib.pyplot as plt
from matplotlib_venn import venn2

venn2(subsets = (8, 10, 5), set_labels = ('Are Healthy', 'Do Exercise'))
plt.show()
../_images/visualization_35_0.png

… or specify the elements in each set:

venn2([set(['A', 'B', 'C', 'D']), set(['D', 'E', 'F'])], set_labels=['Group1', 'Group2'])
plt.show()
../_images/visualization_37_0.png
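
The two calling styles are related: the subsets tuple lists the size of the region only in the first set, the region only in the second set, and the overlap, in that order. A minimal sketch of that conversion, using a hypothetical venn2_sizes helper:

```python
def venn2_sizes(a, b):
    """Return (only in a, only in b, in both) region sizes for two sets."""
    a, b = set(a), set(b)
    return (len(a - b), len(b - a), len(a & b))


group1 = {"A", "B", "C", "D"}
group2 = {"D", "E", "F"}

sizes = venn2_sizes(group1, group2)
print(sizes)  # (3, 2, 1)

# With matplotlib-venn installed, this draws the same diagram as above:
# venn2(subsets=sizes, set_labels=["Group1", "Group2"])
```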

You can also draw three circles using venn3:

from matplotlib_venn import venn3
venn3(subsets = (5, 5, 3, 5, 3, 3, 2), set_labels = ('Are Healthy', 'Do Exercise', 'Eat Well'))
plt.show()
../_images/visualization_39_0.png
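
The seven numbers passed to venn3 follow a fixed region order: only A, only B, A and B, only C, A and C, B and C, then all three. The sketch below computes them from three sets. The venn3_sizes helper and the example sets are hypothetical.

```python
def venn3_sizes(a, b, c):
    """Return the seven region sizes in the order venn3 expects."""
    a, b, c = set(a), set(b), set(c)
    return (
        len(a - b - c),    # only A
        len(b - a - c),    # only B
        len((a & b) - c),  # A and B, not C
        len(c - a - b),    # only C
        len((a & c) - b),  # A and C, not B
        len((b & c) - a),  # B and C, not A
        len(a & b & c),    # all three
    )


# Hypothetical membership, identified by person ID
healthy = {1, 2, 3, 4}
exercise = {3, 4, 5}
eat_well = {4, 5, 6, 7}

sizes = venn3_sizes(healthy, exercise, eat_well)
print(sizes)  # (2, 0, 1, 2, 0, 1, 1)
```

The sizes always add up to the number of distinct elements across the three sets, which is a quick sanity check before plotting.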

Link to matplotlib-venn.

6.9.7. UMAP: Dimension Reduction in Python

!pip install umap-learn[plot]

It can be difficult to visualize a multi-dimensional dataset. Luckily, UMAP allows you to reduce the dimension of your dataset and create a 2D view of your data.

To understand how UMAP works, let’s try it on a fruit dataset. In this dataset, we have the features of different fruits, such as mass, width, height, and color score. Our task is to classify which fruit a sample belongs to based on its features.

from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})

Let’s start with loading and visualizing the data.

data = pd.read_table("https://raw.githubusercontent.com/susanli2016/Machine-Learning-with-Python/master/fruit_data_with_colors.txt")
data.head(10)
   fruit_label  fruit_name  fruit_subtype  mass  width  height  color_score
0            1       apple   granny_smith   192    8.4     7.3         0.55
1            1       apple   granny_smith   180    8.0     6.8         0.59
2            1       apple   granny_smith   176    7.4     7.2         0.60
3            2    mandarin       mandarin    86    6.2     4.7         0.80
4            2    mandarin       mandarin    84    6.0     4.6         0.79
5            2    mandarin       mandarin    80    5.8     4.3         0.77
6            2    mandarin       mandarin    80    5.9     4.3         0.81
7            2    mandarin       mandarin    76    5.8     4.0         0.81
8            1       apple       braeburn   178    7.1     7.8         0.92
9            1       apple       braeburn   172    7.4     7.0         0.89
sns.pairplot(data.drop(columns=['fruit_label', 'fruit_subtype']), hue='fruit_name')
<seaborn.axisgrid.PairGrid at 0x7f09bcc1b250>
../_images/visualization_49_1.png

We can see some distinctions between different fruits in the pairwise feature scatterplot matrix. Now to visualize all 4 features in a 2D plot, we start with creating a UMAP object.

import umap

reducer = umap.UMAP()

Next, we scale the features so that they are all on the same scale.

features = data.iloc[:, 3:].values
scaled_features = StandardScaler().fit_transform(features)
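
Scaling matters here because mass is in the hundreds while color_score stays below 1, so unscaled distances would be dominated by mass. As a rough sketch of what standardization does to a single column, the hypothetical standardize helper below subtracts the mean and divides by the population standard deviation, using the first five mass values from the table above.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mean) / std for v in values]


mass = [192, 180, 176, 86, 84]  # first five mass values from the table above
scaled = standardize(mass)
print([round(v, 3) for v in scaled])
```

After this step every feature has mean 0 and standard deviation 1, so each contributes on a comparable scale.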

Lastly, we use the UMAP object to reduce the dimension of the dataset and plot the features as a scatter plot.

embedding = reducer.fit_transform(scaled_features)
embedding.shape
(59, 2)
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in data.fruit_label])
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the fruit dataset', fontsize=24)
Text(0.5, 1.0, 'UMAP projection of the fruit dataset')
../_images/visualization_56_1.png

Now we can see some distinctions in features between 4 different fruits in a 2D plot.

Link to UMAP.

6.9.8. Evidently: Detect and Visualize Data Drift

!pip install evidently

Data drift refers to unexpected changes in model input data that can lead to model performance degradation. Since your code is built around the characteristics of your data, it is important to detect data drift when it occurs. Evidently allows you to do exactly that in a few lines of code.

In the code below, we use Evidently to detect changes in feature distribution.

import pandas as pd
from sklearn import datasets

from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab

california = datasets.fetch_california_housing()
california = pd.DataFrame(california.data, columns = california.feature_names)
california_data_drift_report = Dashboard(tabs=[DataDriftTab])
california_data_drift_report.calculate(california[:1000], california[1000:], column_mapping = None)
california_data_drift_report.show()
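
To make "a change in feature distribution" concrete, one common approach is to compare the empirical distributions of a feature in the reference and current data. The sketch below computes the two-sample Kolmogorov-Smirnov statistic by hand. It illustrates the general idea only and is not Evidently's implementation: which tests Evidently applies depends on the version and feature type. The ks_statistic helper and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0 to 1)."""
    ref = sorted(reference)
    cur = sorted(current)
    points = sorted(set(ref) | set(cur))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )


# Hypothetical feature values
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
same = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [11.0, 12.0, 13.0, 14.0, 15.0]

print(ks_statistic(reference, same))     # 0.0, identical distributions
print(ks_statistic(reference, shifted))  # 1.0, no overlap at all
```

A value near 0 suggests the feature looks the same in both datasets, while a value near 1 suggests a strong shift.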

Find other features of Evidently here.