Effective Python for Data Scientists

9.1. Find Top Most Popular Languages

!pip install observable_jupyter

What are the top languages used by data scientists, data analysts, data engineers, and machine learning engineers? I answered this question using a dataset of the 100 most popular skills among people who hold these job titles.

The data was collected with Diffbot, the world’s largest knowledge graph, from 160k+ data scientists, 570k data analysts, 100k+ data engineers, and 19k+ machine learning engineers from all over the world. Find more instructions on how to use Diffbot here.

I uploaded the dataset used in this tutorial to this repository so that you can try out the dataset yourself.

from observable_jupyter import embed
import pandas as pd

Start by loading the dataset:

skill_count = pd.read_csv(
    "https://media.githubusercontent.com/media/khuyentran1401/dataset/master/data_science_market/all_skills.csv",
    index_col=0,
)
skill_count.head(10)
    count                 skill           Title
0  131292              teaching  Data Scientist
1  113898             economics  Data Scientist
2  106630  programming language  Data Scientist
3  105294           mathematics  Data Scientist
4   79871      machine learning  Data Scientist
5   79735                python  Data Scientist
6   77810              robotics  Data Scientist
7   70540  software development  Data Scientist
8   69262               phython  Data Scientist
9   62084         data analysis  Data Scientist

Next, we will visualize the dataset using a bubble matrix created on Observable.

A bubble matrix uses sizes and colors to represent two-dimensional information. The rows represent the job titles and the columns represent the languages. The bigger a bubble is, the more frequently the language is used in a certain job category.
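The table behind such a matrix can be sketched in pandas. This is a minimal illustration, not the code behind the Observable notebook: the `toy` frame mirrors the columns of `skill_count`, but its counts are invented, and the `languages` list is a hypothetical choice of which skills to treat as languages.

```python
import pandas as pd

# Toy rows with the same columns as skill_count. The counts are invented
# for illustration and are NOT taken from the real dataset.
toy = pd.DataFrame(
    {
        "count": [500, 300, 50, 400, 450, 80],
        "skill": ["python", "sql", "teaching", "python", "sql", "economics"],
        "Title": ["Data Scientist"] * 3 + ["Data Analyst"] * 3,
    }
)

# Hypothetical list of skills to treat as programming languages.
languages = ["python", "sql", "r", "java", "c++"]

# Keep only the language rows, then reshape so that each row is a job title
# and each column is a language. Missing combinations are filled with 0.
matrix = toy[toy["skill"].isin(languages)].pivot_table(
    index="Title", columns="skill", values="count", aggfunc="sum", fill_value=0
)
print(matrix)
```

Each cell of `matrix` corresponds to one bubble: the row is the job title, the column is the language, and the value would drive the bubble size.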

The bubbles are highlighted if they are above a certain number of occurrences. You can use the slider to choose the threshold above which the bubbles are highlighted. For example, if you choose the threshold to be 100k, only the bubbles with a count above 100k are colored dark purple.
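The highlighting rule can be approximated with a boolean mask. This is only a sketch: the counts below are invented, `threshold` stands in for the slider value, and the strict "greater than" comparison is an assumption based on the phrase "above 100k".

```python
import pandas as pd

# Invented job-title x language counts, NOT values from the real dataset.
matrix = pd.DataFrame(
    {"python": [120_000, 80_000], "sql": [150_000, 60_000]},
    index=["Data Analyst", "Data Scientist"],
)

threshold = 100_000  # stands in for the value chosen with the slider

# True where a bubble would be highlighted (count strictly above the threshold).
highlighted = matrix > threshold
print(highlighted)
```

With these toy numbers, only the two Data Analyst bubbles would be colored, because both Data Scientist counts fall below the threshold.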

embed("@khuyentran1401/languages-between-jobs", cells=["chart", "viewof options"])

To sort the bubbles by a specific job title, click that job title.

Based on this plot, we can see that:

  • The top 3 languages of data analysts, in descending order, are SQL, Python, and R

  • The top 3 languages of data engineers, in descending order, are SQL, Python, and Java

  • The top 3 languages of data scientists, in descending order, are Python, R, and SQL

  • The top 3 languages of machine learning engineers, in descending order, are Python, Java, and C++
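A ranking like the one above could also be computed directly in pandas rather than read off the plot. The sketch below assumes a frame that has already been filtered down to language rows. The `toy` counts are invented so that the example is self-contained, and they do not reproduce the real figures.

```python
import pandas as pd

# Invented counts for two job titles, NOT values from the real dataset.
toy = pd.DataFrame(
    {
        "count": [90, 70, 60, 40, 85, 95, 30, 20],
        "skill": ["python", "r", "sql", "java", "python", "sql", "r", "java"],
        "Title": ["Data Scientist"] * 4 + ["Data Analyst"] * 4,
    }
)

# Sort by count, keep the 3 largest rows per job title,
# then collect the language names into one list per title.
top3 = (
    toy.sort_values("count", ascending=False)
    .groupby("Title")
    .head(3)
    .groupby("Title")["skill"]
    .apply(list)
)
print(top3)
```

Sorting before `groupby(...).head(3)` keeps each list in descending order of count, which matches how the bullets above are phrased.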

By Khuyen Tran
© Copyright 2021.