Data

faker: Create Fake Data in One Line of Code

To quickly create fake data for testing, use faker.

from faker import Faker

fake = Faker()

fake.color_name()

'Aquamarine'

fake.name()

'Susan Martin'

fake.address()

'68109 Steven Via\nStephaniechester, MA 66854'

fake.date_of_birth(minimum_age=22)

datetime.date(1908, 6, 6)

fake.city()

'Port Walter'

fake.job()

'Buyer, retail'

Link to faker

Link to my full article on faker.

DVC: A Data Version Control Tool for your Data Science Projects

Git is a powerful tool for moving back and forth between different versions of your code. Is there a way to control different versions of your data as well?

That is when DVC comes in handy. With DVC, you can keep the information about different versions of your data in Git while storing the original data somewhere else.

It works much like Git, but for data. The code below shows how to use DVC.

# Initialize
$ dvc init

# Track data directory
$ dvc add data # Create data.dvc
$ git add data.dvc
$ git commit -m "add data"

# Store the data remotely
$ dvc remote add -d remote gdrive://lynNBbT-4J0ida0eKYQqZZbC93juUUUbVH

# Push the data to remote storage
$ dvc push 

# Get the data
$ dvc pull 

# Switch between different versions
$ git checkout HEAD^1 data.dvc
$ dvc checkout
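To make the idea behind these commands concrete, here is a conceptual sketch in plain Python of keeping a small pointer file next to the code while the data itself lives in a separate store. This is an illustration only, not DVC's implementation: the functions track and restore, the .pointer.json file, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return a SHA-256 hex digest of the file contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy the data into a hash-addressed cache and write a small pointer file."""
    digest = file_hash(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, cache_dir / digest)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"hash": digest, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> None:
    """Rewrite the data file with the version that the pointer file describes."""
    meta = json.loads(pointer.read_text())
    shutil.copy2(cache_dir / meta["hash"], pointer.with_name(meta["path"]))


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    cache = workdir / "cache"
    data = workdir / "data.csv"

    data.write_text("a,b\n1,2\n")       # version 1 of the data
    pointer = track(data, cache)
    pointer_v1 = pointer.read_text()    # what the first Git commit would hold

    data.write_text("a,b\n1,2\n3,4\n")  # version 2 of the data
    track(data, cache)

    pointer.write_text(pointer_v1)      # roughly: git checkout HEAD^1 data.dvc
    restore(pointer, cache)             # roughly: dvc checkout
    restored = data.read_text()

print(restored)
```

Only the tiny pointer file would need to go into Git. The cache directory plays the role of the storage that holds the actual data.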

Link to DVC

Find step-by-step instructions on how to use DVC in my article.

fetch_openml: Get OpenML’s Dataset in One Line of Code

OpenML has many interesting datasets. The easiest way to get OpenML data in Python is to use the sklearn.datasets.fetch_openml function.

In one line of code, you get an OpenML dataset to play with!

from sklearn.datasets import fetch_openml

monk = fetch_openml(name="monks-problems-2", as_frame=True)
print(monk["data"].head(10))

  attr1 attr2 attr3 attr4 attr5 attr6
   1     1     1     1     2     2
   1     1     1     1     4     1
   1     1     1     2     1     1
   1     1     1     2     1     2
   1     1     1     2     2     1
   1     1     1     2     3     1
   1     1     1     2     4     1
   1     1     1     3     2     1
   1     1     1     3     4     1
   1     1     2     1     1     1

Autoscraper

If you want to get data from websites, Beautiful Soup makes it easy to do so. But can scraping be automated even more? If you are looking for a faster way to scrape complicated websites such as Stack Overflow or GitHub in a few lines of code, try autoscraper.

All you need to do is give it some sample text so it can learn the rule, and it will take care of the rest for you!

from autoscraper import AutoScraper

url = "https://stackoverflow.com/questions/2081586/web-scraping-with-python"

wanted_list = ["How to check version of python modules?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)

for res in result:
    print(res)

How to execute a program or call a system command?
What are metaclasses in Python?
Does Python have a ternary conditional operator?
Convert bytes to a string
Does Python have a string 'contains' substring method?
How to check version of python modules?

pandas-datareader: Extract series data from various Internet sources directly into a pandas DataFrame

Have you ever wanted to extract series data from various Internet sources directly into a pandas DataFrame? That is when pandas-datareader comes in handy.

Below is a snippet that extracts daily data for the symbol AD from 2008 to 2018.

import os
from datetime import datetime
import pandas_datareader.data as web

df = web.DataReader(
    "AD",
    "av-daily",
    start=datetime(2008, 1, 1),
    end=datetime(2018, 2, 28),
    api_key=os.getenv("ALPHAVANTAGE_API_KEY"),
)
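The snippet reads the API key from the ALPHAVANTAGE_API_KEY environment variable, and os.getenv quietly returns None when that variable is not set. A minimal sketch of a guard that fails early with a clear message instead is shown below. The helper require_env and the DEMO_API_KEY variable are hypothetical and are not part of pandas-datareader.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before requesting data.")
    return value


# Demo with a throwaway variable so the example runs without a real key
os.environ["DEMO_API_KEY"] = "abc123"
print(require_env("DEMO_API_KEY"))
```

In the call above, you could then pass api_key=require_env("ALPHAVANTAGE_API_KEY").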

sweetviz: Compare the similar features between 2 different datasets

Sometimes it is important to compare similar features between two datasets side by side, for example the train and test sets. If you want to quickly compare two datasets through graphs, check out sweetviz.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import sweetviz as sv

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

report = sv.compare([X_train, "train data"], [X_test, "test data"])
report.show_html()

Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.

Opening in existing browser session.

Running the code above generates an HTML report that compares the two datasets side by side.

Link to sweetviz

newspaper3k: Extract Meaningful Information From an Article in 2 Lines of Code

If you want to quickly extract meaningful information from an article in a few lines of code, try newspaper3k.

from newspaper import Article
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/khuyen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

True

url = "https://www.dataquest.io/blog/learn-data-science/"
article = Article(url)
article.download()
article.parse()

article.title

'How to Learn Data Science (A step-by-step guide)'

article.publish_date

datetime.datetime(2020, 5, 4, 7, 1, tzinfo=tzutc())

article.top_image

'https://www.dataquest.io/wp-content/uploads/2020/05/learn-data-science.jpg'

article.nlp()

article.summary

'How to Learn Data ScienceSo how do you start to learn data science?\nIf you want to learn data science or just pick up some data science skills, your first goal should be to learn to love data.\nRather, consider it as a rough set of guidelines to follow as you learn data science on your own path.\nI personally believe that anyone can learn data science if they approach it with the right frame of mind.\nI’m also the founder of Dataquest, a site that helps you learn data science in your browser.'

article.keywords

['scientists',
 'guide',
 'learning',
 'youre',
 'science',
 'work',
 'skills',
 'youll',
 'data',
 'learn',
 'stepbystep',
 'need']

Link to newspaper3k.

Effective Python for Data Scientists