6.1. Feature Extraction

6.1.1. distfit: Find The Best Theoretical Distribution For Your Data in Python

!pip install distfit

If you want to find the best theoretical distribution for your data in Python, try distfit.

import numpy as np
from distfit import distfit

X = np.random.normal(0, 3, 1000)

# Initialize model
dist = distfit()

# Find best theoretical distribution for empirical data X
distribution = dist.fit_transform(X)
dist.plot()
[distfit] >fit..
[distfit] >transform..
[distfit] >[norm      ] [0.00 sec] [RSS: 0.0037316] [loc=-0.018 scale=2.999]
[distfit] >[expon     ] [0.00 sec] [RSS: 0.1588997] [loc=-14.019 scale=14.001]
[distfit] >[dweibull  ] [0.00 sec] [RSS: 0.0079433] [loc=-0.012 scale=2.529]
[distfit] >[t         ] [0.02 sec] [RSS: 0.0036884] [loc=-0.012 scale=2.873]
[distfit] >[genextreme] [0.07 sec] [RSS: 0.0049831] [loc=-1.132 scale=3.037]
[distfit] >[gamma     ] [0.04 sec] [RSS: 0.0038504] [loc=-101.098 scale=0.089]
[distfit] >[lognorm   ] [0.09 sec] [RSS: 0.0037897] [loc=-237.099 scale=237.056]
[distfit] >[uniform   ] [0.00 sec] [RSS: 0.1145382] [loc=-14.019 scale=24.469]
[distfit] >[loggamma  ] [0.04 sec] [RSS: 0.0036960] [loc=-239.858 scale=44.472]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot..
[Figure: distfit plot of the best-fitting distribution (t, df=24.44, loc=-0.01, scale=2.87); x-axis: Values, y-axis: Frequency]

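In the output above, the t distribution has the lowest residual sum of squares (RSS), narrowly ahead of norm, so distfit selects it as the best fit.

Besides finding the best theoretical distribution, distfit is also useful for detecting outliers: once a distribution has been fitted, new data points that fall far into its tails can be marked as outliers.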
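
The idea behind this kind of outlier detection can be sketched with the standard library alone. This is a minimal illustration, not distfit's implementation: it assumes a normal fit and a two-sided 95% interval, and the helper name is_outlier is hypothetical.

```python
import random
from statistics import NormalDist

random.seed(0)
# Training sample drawn from a normal distribution (mean 0, std 3)
train = [random.gauss(0, 3) for _ in range(1000)]

# Fit a normal distribution to the sample (assumed stand-in for the best model)
fitted = NormalDist.from_samples(train)

# Two-sided 95% interval of the fitted distribution (alpha is an assumption)
alpha = 0.05
lower = fitted.inv_cdf(alpha / 2)
upper = fitted.inv_cdf(1 - alpha / 2)


def is_outlier(value):
    """Return True when value falls outside the fitted 95% interval."""
    return value < lower or value > upper


# New observations to check against the fitted distribution
new_points = [0.5, -2.0, 12.0, -15.0]
flags = {point: is_outlier(point) for point in new_points}
print(flags)
```

Points near the center of the fitted distribution are kept, while points deep in either tail are flagged.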

Link to distfit.

6.1.2. Geopy: Extract Location Based on Python String

!pip install geopy

If you work with location data, you might want to visualize it on a map. Geopy makes it easy to look up the coordinates of addresses across the globe from a Python string.

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="find_location")
location = geolocator.geocode("30 North Circle Drive, Edwardsville, IL")

After defining the app name and passing in the location string, use location.address to get the full address of the location.

location.address
'30, Circle Drive, Edwardsville, Madison County, Illinois, 62025, United States'

To extract the latitude and longitude, use location.latitude and location.longitude.

location.latitude, location.longitude
(38.80371599362934, -89.93842706888563)
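
geocode returns None when it finds no match, so accessing location.latitude directly can fail for unknown addresses. Below is a small sketch of a guard around the lookup. The helper get_coordinates and the offline stand-in fake_geocode are hypothetical names used only so the example runs without network access; in practice you would pass geolocator.geocode instead.

```python
from types import SimpleNamespace


def get_coordinates(geocode, address):
    """Return (latitude, longitude) for address, or None if nothing matched.

    geocode is any callable that behaves like geolocator.geocode:
    it returns an object with latitude/longitude, or None.
    """
    location = geocode(address)
    if location is None:
        return None
    return (location.latitude, location.longitude)


def fake_geocode(address):
    """Hypothetical offline stand-in for geolocator.geocode."""
    known = {
        "30 North Circle Drive, Edwardsville, IL": SimpleNamespace(
            latitude=38.80371599362934, longitude=-89.93842706888563
        )
    }
    return known.get(address)


found = get_coordinates(fake_geocode, "30 North Circle Drive, Edwardsville, IL")
missing = get_coordinates(fake_geocode, "not a real address")
print(found)
print(missing)
```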

Link to Geopy

6.1.3. fastai’s cont_cat_split: Get a DataFrame’s Continuous and Categorical Variables Based on Their Cardinality

!pip install fastai

To get a DataFrame’s continuous and categorical variables based on their cardinality, use fastai’s cont_cat_split function.

A column of integers whose cardinality does not exceed the max_card parameter (20 by default) is treated as a categorical variable. Float columns are always treated as continuous.

import pandas as pd
from fastai.tabular.core import cont_cat_split

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5],
        "col2": ["a", "b", "c", "d", "e"],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

cont_names, cat_names = cont_cat_split(df)
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)
Continuous columns: ['col3']
Categorical columns: ['col1', 'col2']

cont_names, cat_names = cont_cat_split(df, max_card=3)
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)
Continuous columns: ['col1', 'col3']
Categorical columns: ['col2']
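
To make the cardinality rule concrete, here is a simplified pure-Python sketch of the same decision. It is an illustration under stated assumptions, not fastai's source: the function name split_by_cardinality is hypothetical, and columns are plain lists rather than DataFrame columns.

```python
def split_by_cardinality(columns, max_card=20):
    """Split column names into (continuous, categorical) lists.

    columns maps a column name to a list of values. Floats are continuous;
    integers are continuous only when they have more than max_card distinct
    values; everything else is categorical.
    """
    cont_names, cat_names = [], []
    for name, values in columns.items():
        is_float = all(isinstance(v, float) for v in values)
        is_int = all(
            isinstance(v, int) and not isinstance(v, bool) for v in values
        )
        if is_float or (is_int and len(set(values)) > max_card):
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names


columns = {
    "col1": [1, 2, 3, 4, 5],
    "col2": ["a", "b", "c", "d", "e"],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
}

default_split = split_by_cardinality(columns)
strict_split = split_by_cardinality(columns, max_card=3)
print(default_split)
print(strict_split)
```

With max_card=3, col1 has five distinct values, which exceeds the threshold, so it moves from the categorical list to the continuous one.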

Link to the documentation.

6.1.4. Patsy: Build Features with Arbitrary Python Code

!pip install patsy

If you want to quickly experiment with different combinations of features, try Patsy. Patsy allows you to create features for your model using human-readable syntax.

import pandas as pd
from sklearn.datasets import load_wine

# load_wine returns a Bunch: 'data' holds the features, 'target' the class labels
wine = load_wine(as_frame=True)
data = pd.concat([wine['data'], wine['target']], axis=1)
data.head(10)
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  od280/od315_of_diluted_wines  proline  target
0    14.23        1.71  2.43               15.6      127.0           2.80        3.06                  0.28             2.29             5.64  1.04                          3.92   1065.0       0
1    13.20        1.78  2.14               11.2      100.0           2.65        2.76                  0.26             1.28             4.38  1.05                          3.40   1050.0       0
2    13.16        2.36  2.67               18.6      101.0           2.80        3.24                  0.30             2.81             5.68  1.03                          3.17   1185.0       0
3    14.37        1.95  2.50               16.8      113.0           3.85        3.49                  0.24             2.18             7.80  0.86                          3.45   1480.0       0
4    13.24        2.59  2.87               21.0      118.0           2.80        2.69                  0.39             1.82             4.32  1.04                          2.93    735.0       0
5    14.20        1.76  2.45               15.2      112.0           3.27        3.39                  0.34             1.97             6.75  1.05                          2.85   1450.0       0
6    14.39        1.87  2.45               14.6       96.0           2.50        2.52                  0.30             1.98             5.25  1.02                          3.58   1290.0       0
7    14.06        2.15  2.61               17.6      121.0           2.60        2.51                  0.31             1.25             5.05  1.06                          3.58   1295.0       0
8    14.83        1.64  2.17               14.0       97.0           2.80        2.98                  0.29             1.98             5.20  1.08                          2.85   1045.0       0
9    13.86        1.35  2.27               16.0       98.0           2.98        3.15                  0.22             1.85             7.22  1.01                          3.55   1045.0       0

from patsy import dmatrices

y, X = dmatrices('target ~ alcohol + flavanoids + proline', data=data)
X
DesignMatrix with shape (178, 4)
  Intercept  alcohol  flavanoids  proline
          1    14.23        3.06     1065
          1    13.20        2.76     1050
          1    13.16        3.24     1185
          1    14.37        3.49     1480
          1    13.24        2.69      735
          1    14.20        3.39     1450
          1    14.39        2.52     1290
          1    14.06        2.51     1295
          1    14.83        2.98     1045
          1    13.86        3.15     1045
          1    14.10        3.32     1510
          1    14.12        2.43     1280
          1    13.75        2.76     1320
          1    14.75        3.69     1150
          1    14.38        3.64     1547
          1    13.63        2.91     1310
          1    14.30        3.14     1280
          1    13.83        3.40     1130
          1    14.19        3.93     1680
          1    13.64        3.03      845
          1    14.06        3.17      780
          1    12.93        2.41      770
          1    13.71        2.88     1035
          1    12.85        2.37     1015
          1    13.50        2.61      845
          1    13.05        2.68      830
          1    13.39        2.94     1195
          1    13.30        2.19     1285
          1    13.87        2.97      915
          1    14.02        2.33     1035
  [148 rows omitted]
  Terms:
    'Intercept' (column 0)
    'alcohol' (column 1)
    'flavanoids' (column 2)
    'proline' (column 3)
  (to view full data, use np.asarray(this_obj))

The features created by Patsy can be used directly by popular machine learning libraries such as scikit-learn.

from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)
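
Formulas are not limited to plain column names. The sketch below shows a Python expression and an interaction term inside a formula. It uses only the first five rows of two wine columns shown above, typed in by hand so the example is self-contained; the choice of np.log and of these two columns is purely illustrative.

```python
import numpy as np
from patsy import dmatrix

# First five rows of two columns from the wine data shown above
data = {
    "alcohol": np.array([14.23, 13.20, 13.16, 14.37, 13.24]),
    "flavanoids": np.array([3.06, 2.76, 3.24, 3.49, 2.69]),
}

# np.log(alcohol): transform a column with a Python expression
# alcohol:flavanoids: interaction (element-wise product) of two columns
X = dmatrix("np.log(alcohol) + alcohol:flavanoids", data=data)

print(X.design_info.column_names)
print(np.asarray(X))
```

Patsy evaluates the expression in the calling scope, which is why np must be imported before the formula is parsed.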

Link to Patsy.

Link to my full article on Patsy.

6.1.5. yarl: Create and Extract Elements from a URL Using Python

!pip install yarl

If you want to easily create and extract elements from a URL using Python, try yarl. In the code below, I use yarl to build the URL https://github.com/search?q=data+science and then extract its elements.

from yarl import URL

url = URL('https://github.com')
# "/" appends a path segment; "%" adds the query string
new_url = url / "search" % 'q=data+science'
print(new_url)
https://github.com/search?q=data+science
print(new_url.host) 
github.com
print(new_url.path) 
/search
print(new_url.query_string) 
q=data science
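
Note that query_string is printed in decoded form, with the "+" turned back into a space. For comparison, here is a rough equivalent using only the standard library's urllib.parse. It is a swapped-in technique to show what yarl is doing, not yarl's API.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Build the URL from its parts; urlencode turns the space into "+"
query = urlencode({"q": "data science"})
url = urlunsplit(("https", "github.com", "/search", query, ""))
print(url)

# Split the URL back into its elements
parts = urlsplit(url)
print(parts.hostname)
print(parts.path)

# parse_qs decodes the query string, so "+" becomes a space again
params = parse_qs(parts.query)
print(params)
```

The standard library version needs several functions and a five-element tuple, whereas yarl expresses the same steps with the "/" and "%" operators and attribute access.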

Link to yarl.