6.10. Tools for Best Python Practices

This section covers tools that encourage best Python practices.

6.10.1. Don’t Hard-Code. Use Hydra Instead

!pip install hydra-core

When writing code, it is a good practice to put the values that you might change in a separate file from your original script.

This practice not only saves you from wasting time searching for a specific variable in your scripts but also makes your scripts more reproducible.

My favorite tool to handle config files is Hydra. The code below shows how to get values from a config file using Hydra.

All parameters are specified in a configuration file named config.yaml:

# config.yaml
data: data1 
variables: 
  drop_features: ['iid', 'id', 'idg', 'wave']
  categorical_vars: ['undergra', 'zipcode']

In a separate file named main.py, the parameters in config.yaml are read using Hydra:

# main.py
import hydra

# config_path is relative to this file; version_base=None (Hydra 1.2+)
# opts into the current defaults without a deprecation warning
@hydra.main(config_path='.', config_name='config', version_base=None)
def main(config):
    print(f'Process {config.data}')
    print(f'Drop features: {config.variables.drop_features}')

if __name__ == '__main__':
    main()

On your terminal, type:

$ python main.py

Output:

!python hydra_examples/main.py

Process data1
Drop features: ['iid', 'id', 'idg', 'wave']

Link to my article about Hydra.

Link to Hydra.

6.10.2. python-dotenv: How to Load Secret Information from a .env File

!pip install python-dotenv

An alternative to storing your secret information directly in environment variables is to save it in a .env file at the root of your project.

# .env
USERNAME=my_user_name
PASSWORD=secret_password

The easiest way to load the environment variables from a .env file is to use the python-dotenv library.

from dotenv import load_dotenv
import os

# Read the key-value pairs in .env and add them to os.environ
load_dotenv()
PASSWORD = os.getenv('PASSWORD')
print(PASSWORD)

secret_password
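To make the loading step less opaque, here is a simplified, standard-library-only sketch of roughly what happens when a .env file is read. It is not python-dotenv's implementation: the function name load_env_file and the DEMO_ variable names are hypothetical, and the sketch ignores features the real library handles, such as export prefixes and variable expansion. Like load_dotenv's default behavior, it leaves variables that are already set untouched unless override is requested.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path, override=False):
    """Hypothetical helper: copy KEY=VALUE pairs from a file into os.environ."""
    loaded = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes
        # Keep variables that already exist unless override is requested
        if override or key not in os.environ:
            os.environ[key] = value
        loaded[key] = value
    return loaded


# Write a throwaway .env file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / ".env"
    env_path.write_text(
        "# demo secrets\n"
        "DEMO_USERNAME=my_user_name\n"
        "DEMO_PASSWORD=secret_password\n"
    )
    parsed = load_env_file(env_path)

password = os.getenv("DEMO_PASSWORD")
```

In a real project, prefer load_dotenv() itself, and keep the .env file out of version control so the secrets never reach the repository.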

Link to python-dotenv

6.10.3. kedro Pipeline: Create a Pipeline for Your Data Science Projects in Python

!pip install kedro

When writing code for a data science project, it can be difficult to understand the workflow of the code. Is there a way to split the code into components based on what they do and then combine them into one pipeline?

That is when kedro comes in handy. In the code below, each node is one component of the pipeline. For each node, we specify the function to run and the names of its inputs and outputs, then combine the nodes using Pipeline.

DataCatalog holds the datasets that the pipeline can read from and write to. Structuring your code this way makes it easier for you and others to follow your code logic.

from kedro.pipeline import node, Pipeline
# Newer Kedro releases rename MemoryDataSet to MemoryDataset
from kedro.io import DataCatalog, MemoryDataSet
from kedro.runner import SequentialRunner

# Prepare a data catalog
data_catalog = DataCatalog({"data.csv": MemoryDataSet()})

# Prepare first node: takes no input and produces "processed_data"
def process_data():
    return "processed data"

process_data_node = node(
    func=process_data, inputs=None, outputs="processed_data"
)

# Prepare second node: consumes "processed_data" and produces "trained_model"
def train_model(data: str):
    return f"Training model using {data}"

train_model_node = node(
    func=train_model, inputs="processed_data", outputs="trained_model"
)

# Assemble nodes into a pipeline
pipeline = Pipeline([process_data_node, train_model_node])

# Create a runner to run the pipeline
runner = SequentialRunner()
print(runner.run(pipeline, data_catalog))

{'trained_model': 'Training model using processed data'}
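To see why naming inputs and outputs is enough to wire the nodes together, here is a toy, standard-library-only sketch of the idea. It is not how Kedro is implemented: the tuple layout and the run_sequentially function are hypothetical stand-ins for node and SequentialRunner, and the sketch assumes the nodes are already listed in a valid order.

```python
def process_data():
    return "processed data"


def train_model(data):
    return f"Training model using {data}"


# Each toy node is (function, input names, output name); not Kedro's node class
nodes = [
    (process_data, [], "processed_data"),
    (train_model, ["processed_data"], "trained_model"),
]


def run_sequentially(nodes):
    """Run nodes in order, passing results between them by dataset name."""
    datasets = {}
    for func, input_names, output_name in nodes:
        # Look up each named input among the results produced so far
        args = [datasets[name] for name in input_names]
        datasets[output_name] = func(*args)
    # Report only the outputs that no other node consumed as an input
    consumed = {name for _, input_names, _ in nodes for name in input_names}
    return {name: value for name, value in datasets.items() if name not in consumed}


result = run_sequentially(nodes)
print(result)
```

Because the nodes refer to each other only through dataset names, a function can be swapped or reused without touching the rest of the pipeline, which is the property the Kedro example relies on.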

Link to my article about Kedro

Link to Kedro

6.10.4. docopt: Create Beautiful Command-line Interfaces for Documentation in Python

!pip install docopt

Writing documentation for your Python script helps others understand how to use it. However, instead of making them spend time finding the documentation in your script, wouldn’t it be nice if they could view it in the terminal?

That is when docopt comes in handy. docopt allows you to create beautiful command-line interfaces by passing a Python string.

To understand how docopt works, we can add a docstring at the beginning of the file named docopt_example.py.

# docopt_example.py
"""Extract keywords of an input file
Usage:
    docopt_example.py --data-dir=<data-directory> [--input-path=<path>]
Options:
    --data-dir=<path>    Directory of the data
    --input-path=<path>  Name of the input file [default: input_text.txt]
"""

from docopt import docopt 

if __name__ == '__main__':
    args = docopt(__doc__, argv=None, help=True)
    data_dir = args['--data-dir']
    input_path = args['--input-path']

    if data_dir:
        print(f"Extracting keywords from {data_dir}/{input_path}")

Running docopt_example.py without the required --data-dir option prints the usage message:

$ python docopt_example.py

!python docopt_example.py

Usage:
    docopt_example.py --data-dir=<data-directory> [--input-path=<path>]
Options:
    --data-dir=<path>    Directory of the data
    --input-path=<path>  Name of the input file [default: input_text.txt]

Link to docopt.

Effective Python for Data Scientists
