9.1. Find Top Most Popular Languages
!pip install observable_jupyter
What are the top languages used by data scientists, data analysts, data engineers, and machine learning engineers? To answer this question, I used a dataset of the 100 most popular skills listed by people who hold these job titles.
The data was collected with Diffbot, the world’s largest knowledge graph, from 160k+ data scientists, 570k data analysts, 100k+ data engineers, and 19k+ machine learning engineers all over the world. Find more instructions on how to use Diffbot here.
I uploaded the dataset used in this tutorial to this repository so that you can try it out yourself.
from observable_jupyter import embed
import pandas as pd
Start by loading the dataset:
skill_count = pd.read_csv(
"https://media.githubusercontent.com/media/khuyentran1401/dataset/master/data_science_market/all_skills.csv",
index_col=0,
)
skill_count.head(10)
| | count | skill | Title |
|---|---|---|---|
| 0 | 131292 | teaching | Data Scientist |
| 1 | 113898 | economics | Data Scientist |
| 2 | 106630 | programming language | Data Scientist |
| 3 | 105294 | mathematics | Data Scientist |
| 4 | 79871 | machine learning | Data Scientist |
| 5 | 79735 | python | Data Scientist |
| 6 | 77810 | robotics | Data Scientist |
| 7 | 70540 | software development | Data Scientist |
| 8 | 69262 | phython | Data Scientist |
| 9 | 62084 | data analysis | Data Scientist |
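The table mixes languages with general skills, so comparing languages across jobs requires narrowing it to language rows and reshaping it into a job title by language grid. The sketch below shows one way this could be done with pandas. The `LANGUAGES` list and the `language_matrix` helper are assumptions made for illustration, not part of the original notebook; the sample frame simply re-types the ten rows printed above so that the code runs without a download.

```python
import pandas as pd

# The ten rows printed above, re-typed so this sketch runs offline.
sample = pd.DataFrame(
    {
        "count": [131292, 113898, 106630, 105294, 79871,
                  79735, 77810, 70540, 69262, 62084],
        "skill": ["teaching", "economics", "programming language",
                  "mathematics", "machine learning", "python", "robotics",
                  "software development", "phython", "data analysis"],
        "Title": ["Data Scientist"] * 10,
    }
)

# Assumption: a hand-picked list of language names (hypothetical).
LANGUAGES = ["python", "r", "sql", "java", "c++"]


def language_matrix(skill_count, languages):
    """Return a table with one row per job title and one column per language."""
    # Keep only the rows whose skill is an exact match for a listed language.
    is_language = skill_count["skill"].str.lower().isin(languages)
    return skill_count[is_language].pivot_table(
        index="Title", columns="skill", values="count",
        aggfunc="sum", fill_value=0,
    )


matrix = language_matrix(sample, LANGUAGES)
print(matrix)
```

Because the match is exact, the misspelled `phython` row is left out. Whether to merge such variants into `python` is a judgment call that this sketch does not make.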
Next, we will visualize the dataset using a bubble matrix created on Observable.
A bubble matrix uses sizes and colors to represent two-dimensional information. The rows represent the job titles and the columns represent the languages. The bigger a bubble, the more frequently the language appears among people with that job title.
Bubbles are highlighted when their count exceeds a threshold, which you can set with the slider. For example, with a threshold of 100k, only the bubbles with a count above 100k are colored dark purple.
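The highlighting rule amounts to a simple comparison against the threshold. Here is a minimal sketch in pandas, assuming a strict "greater than" test as the description suggests. The chart itself is rendered on Observable, so this is not its actual code; the counts are four values copied from the table above and serve only to illustrate the rule.

```python
import pandas as pd

# A few counts copied from the table above (Data Scientist rows).
counts = pd.Series(
    {
        "teaching": 131292,
        "mathematics": 105294,
        "python": 79735,
        "data analysis": 62084,
    }
)

threshold = 100_000  # the value you would pick with the slider

# True where a bubble would be highlighted, False otherwise.
highlighted = counts > threshold
print(highlighted)
```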
embed("@khuyentran1401/languages-between-jobs", cells=["chart", "viewof options"])
To sort the bubbles by a specific job title, click that job title.
Based on this plot, we can see that:
The top 3 languages of data analysts in descending order are SQL, Python, and R.
The top 3 languages of data engineers in descending order are SQL, Python, and Java.
The top 3 languages of data scientists in descending order are Python, R, and SQL.
The top 3 languages of machine learning engineers in descending order are Python, Java, and C++.
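The rankings above are read off the plot, but a similar ranking could be computed directly from the data. The sketch below is one possible approach, not the method used to build the chart. `top_languages` is a hypothetical helper, and the `languages` argument is a list of language names that you would have to supply yourself.

```python
import pandas as pd


def top_languages(skill_count, languages, n=3):
    """Return, for each job title, its n most frequent languages as a list."""
    # Keep only the rows whose skill is one of the listed languages.
    is_language = skill_count["skill"].str.lower().isin(languages)
    # Within each title, order the rows from the highest count to the lowest.
    ranked = skill_count[is_language].sort_values(
        ["Title", "count"], ascending=[True, False]
    )
    # Collect the first n skills of each title into a list.
    return ranked.groupby("Title")["skill"].apply(lambda s: list(s.head(n)))
```

Calling `top_languages(skill_count, ["python", "r", "sql", "java", "c++"])` on the frame loaded earlier would return one list per job title. The result depends on which spellings the language list includes, so it may not match the plot exactly.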