Data Scientist Job Skills Web Scraping Analysis


It is well known that the field of data science is rapidly changing. For an aspiring data scientist, this can be troubling. With such a diverse set of data science topics, so many available MOOCs, and only so many hours in the day, which topics should I spend my time on?

We will use a combination of web scraping, statistics, and undirected graphs to answer this in the form of two questions:

1. What data scientist skills are the most popular?

2. What pairs of data scientist skills are the most popular?


The Data Set

Before analyzing the data, I would like to mention a bit about how it was collected. The data set is fairly large: using the Python web scraping modules Requests and bs4.BeautifulSoup, I obtained 8065 job descriptions for the query “Data Scientist”, with up to 250 job descriptions accessed from each state.

Now… what data did I collect?

Before the scrape, I put together a list of 43 data science skills. Then for every job description, I checked for the existence of each skill and stored a 1 if the skill existed, and a 0 otherwise. I also stored the name of the state and the name of the job. A row of data gathered from a job post looked something like this:

A row of data obtained for a job post titled “Data Scientist — Predictive Modeling”

Take a look at this row of data. What does it tell us?

First, the job is titled “Data Scientist — Predictive Modeling”. Second, the job is listed within the state of New York. Third, somewhere in the job description Excel and Git are mentioned. Fourth, nowhere in the job description are C++, Cassandra, CouchDB, D3, Db2, or Flume mentioned. After we include the other 35 columns of skills in this row and create 8064 more rows like it, we have the entire data set.
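The row encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the original scraping code: the `SKILLS` list is a short hypothetical subset of the 43 skills, and `skill_pattern`, `skill_flags`, and the matching rule (case-insensitive, whole-token matching) are my assumptions.

```python
import re

# Hypothetical subset of the 43 skills; the full list is not reproduced here.
SKILLS = ["C++", "Cassandra", "CouchDB", "D3", "Db2", "Excel", "Flume", "Git"]


def skill_pattern(skill):
    """Build a case-insensitive pattern that matches the skill as a whole token."""
    # The guards keep a skill from matching inside a longer word or symbol run,
    # so "Excel" does not match "excellent".
    return re.compile(
        r"(?<![A-Za-z0-9+#])" + re.escape(skill) + r"(?![A-Za-z0-9+#])",
        re.IGNORECASE,
    )


def skill_flags(description, skills=SKILLS):
    """Return {skill: 1 or 0} depending on whether the skill appears in the text."""
    return {s: int(bool(skill_pattern(s).search(description))) for s in skills}


# A made-up job description, stored alongside the state as in the post.
row = {"state": "New York"}
row.update(skill_flags("Strong Excel skills; version control with git."))
```

The whole-token guard matters for short skill names: a plain substring check would count “excellent” as a mention of Excel.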


The Analysis

This analysis is fueled by two powerful visualization tools: histograms and undirected graphs. Histograms will be used to answer the first question, and undirected graphs will be used to answer the second. If you are unfamiliar with either of these terms, that is alright; the visualizations should be able to speak for themselves.

Question 1: What data scientist skills are the most popular?

Since I am a proud New Yorker, we will look at the most popular data science skills in New York. Then, we will compare these preferences to the job posts from all of the United States, all with the use of histograms.

Part a: What data scientist skills are the most popular in NY?

Below is the first histogram. Each bar represents the mean value of the data obtained for a skill in the job posts from New York. These mean values approximate the probability that we will see a skill in a job post in NY. For example, Python has a mean value of 0.7, so if we looked at 100 data scientist job posts from NY, around 70 of them should mention Python. Let’s take a look at this first histogram.
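To make the “mean of a 0/1 column” idea concrete, here is a small sketch. The rows and the helper name `skill_frequency` are made up for illustration; only the idea of averaging the 0/1 flags comes from the analysis above.

```python
from statistics import mean

# Hypothetical rows in the 0/1 format described earlier (values are made up).
rows = [
    {"state": "NY", "python": 1, "sql": 1},
    {"state": "NY", "python": 1, "sql": 0},
    {"state": "CA", "python": 0, "sql": 1},
    {"state": "NY", "python": 0, "sql": 1},
]


def skill_frequency(rows, skill, state=None):
    """Mean of a 0/1 column, optionally restricted to one state."""
    values = [r[skill] for r in rows if state is None or r["state"] == state]
    return mean(values)


ny_python = skill_frequency(rows, "python", state="NY")  # 2 of the 3 NY posts
us_sql = skill_frequency(rows, "sql")                    # 3 of the 4 posts overall
```

Because every value is 0 or 1, the mean is simply the fraction of posts that mention the skill.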

Frequency of data scientist skills that appear in job posts in New York

What can we see?

I used color labels that adjust according to the value, so we can visually identify a few groups in the data. Let’s make some observations.

  • Python, R and SQL have the highest demand.
  • Hadoop, Spark, and Java have second highest demand.
  • Tableau, SAS, Scala, Hive, Matlab, Excel, NoSQL, C++ and tensorflow have third highest demand.
  • All other skills appear in fewer than 1 in 10 job posts.

It looks like it’s a safe bet to invest our time in Python, R, and SQL. Big data skills like Hadoop and Spark are also very popular. Before making any more commitments, let’s continue our analysis.

Part b: How do the most popular data science skills in NY compare to the rest of the US?

Take a look at the pair of histograms below. The figure at the top is the same as our first plot for the popularity of data science skills in NY. The figure at the bottom is the plot that shows the popularity of data science skills for the whole US.

Frequency of data scientist skills that appear in job posts in New York and the US

What can we see?

Again, I assigned a color according to the scale of the values for each bar for the two plots. Let’s state our observations.

  • Python, R, and SQL are the most popular set of data science skills for both.
  • Hadoop, Java, and Spark are the second most popular set of skills, with the addition of SAS for the US.
  • The shape of the histogram for the US looks similar to the shape of the histogram for NY.
  • The scale difference makes the comparison difficult.

The data science skills for NY are similar in ordering to those for the entire US, but we can see that SAS has made its way into the second most popular group of skills. With these plots, we can’t compare the rest of the groups that we found in the first graph.

We can do better.

We saw that the shapes of these histograms were similar. Let’s normalize the data in the plots and confirm this. To do this, we will use an equation that looks like this:

This equation just takes the data in each of the plots and makes all of their values span between zero and one. This will increase the visibility of the shape for each histogram and afterwards, we will be able to use this normalized data to clearly visualize all of the differences. But first, let’s take a look at the normalized histograms.
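A common way to make values span zero to one is min-max scaling. Assuming that is the equation meant here, a minimal sketch looks like the following; the function name and the frequencies are made up for illustration.

```python
def min_max_normalize(freqs):
    """Rescale a {skill: frequency} mapping so its values span 0 to 1."""
    lo, hi = min(freqs.values()), max(freqs.values())
    if hi == lo:  # all values equal: avoid dividing by zero
        return {skill: 0.0 for skill in freqs}
    return {skill: (v - lo) / (hi - lo) for skill, v in freqs.items()}


# Made-up frequencies for illustration only.
ny = {"python": 0.70, "sql": 0.50, "d3": 0.10}
ny_norm = min_max_normalize(ny)
```

After scaling, the most frequent skill sits at 1 and the least frequent at 0, which removes the scale difference between the NY and US plots.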

Normalized frequency of data scientist skills that appear in job posts in New York and the US

What can we see?

  • As stated, the shapes of the histograms are in fact very similar.
  • Python, R, and SQL are the most popular in both.
  • Hadoop, Spark, and Java are the second most popular, with the addition of SAS for the whole US.
  • Beyond Tableau, lots of differences arise.

Let’s use this data to directly visualize all of the differences in skill popularity with a single and final plot.

We are going to take the difference between our two normalized plots.
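As a rough sketch of this step, the snippet below scales two made-up sets of frequencies to the zero-to-one range (assuming min-max scaling) and subtracts the US value from the NY value for each skill. All names and numbers are hypothetical.

```python
def min_max(freqs):
    """Rescale values to span 0 to 1 (assumes they are not all equal)."""
    lo, hi = min(freqs.values()), max(freqs.values())
    return {k: (v - lo) / (hi - lo) for k, v in freqs.items()}


# Made-up skill frequencies for NY and for the whole US.
ny = {"python": 0.7, "spark": 0.4, "tableau": 0.2, "d3": 0.1}
us = {"python": 0.6, "spark": 0.2, "tableau": 0.3, "d3": 0.1}

ny_norm, us_norm = min_max(ny), min_max(us)

# Positive values favor NY, negative values favor the US as a whole.
diff = {skill: ny_norm[skill] - us_norm[skill] for skill in ny_norm}
ranked = sorted(diff.items(), key=lambda kv: kv[1], reverse=True)
```

Sorting the differences puts the skills that NY favors most at the front of `ranked` and the ones it favors least at the end.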

What can we see?

This plot clearly visualizes all of the differences that we were looking for in the earlier plots. It tells us the percent difference in how often data scientist skills appear in data scientist job posts for NY in comparison to how often they appear in job posts for the whole US. Now, let’s state our observations.

  • New York has a greater preference for big data in comparison to the whole country, as the leading positive differences are Spark, Scala, and Hadoop (all relating to big data). There are also positive differences for modern database technologies (Hive, Pig, MongoDB, Cassandra, MySQL, and the list goes on).
  • On the other hand, New York cares less than the rest of the country about data visualization technologies like Tableau, Excel, and D3. Consistent with the preference for modern databases, it is also less interested in a traditional database like Oracle.

With this final plot, we have finished answering our first question. To summarize, Python, R, and SQL are important for data science. Big data technologies like Spark and Hadoop are second in demand. Then, assuming that New York is a leader in data scientist trends, it would be smart to focus on learning modern database technologies rather than traditional databases like Oracle.

Question 2: What pairs of data scientist skills are the most popular?

Now the fun starts. We are going to answer this question using undirected graphs. But what are undirected graphs? I think it’s best to answer this by jumping into the first example. To get the data to build this graph, we need to add some columns to our original rows of data. We are going to make a new column for every possible combination of two skills. When the existing values for both skills in a pair are 1, we will assign a 1 to the new column; otherwise we will assign a 0. A row of our modified data will look something like this:

A row of data obtained for a job post titled “Senior Data Scientist, Analytics Lab Team”

Take a look at this row of data. What does it tell us?

First, the job is titled “Senior Data Scientist, Analytics Lab Team”. Second, the job is listed within the state of New York. Third, somewhere in the job description C#, C++, and Cassandra are mentioned. Fourth, nowhere in the job description are CouchDB, D3, Db2, or Excel mentioned. Fifth, somewhere in the job description the pair of skills C# and C++ and the pair of skills C# and Cassandra are mentioned. Sixth, nowhere in the job description is the pair of skills C# and CouchDB mentioned. After we include the other 36 columns of skills and the other 900 columns of skill pairs, and create 8064 more rows like this one, we have the entire new and improved data set.
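The pair columns can be sketched with `itertools.combinations`. The helper name `add_pair_columns` and the `"A & B"` column naming are hypothetical; the rule (1 only when both skills are 1) is the one described above.

```python
from itertools import combinations
from math import comb


def add_pair_columns(row, skills):
    """Return a copy of row with one 0/1 column per unordered skill pair."""
    out = dict(row)
    for a, b in combinations(skills, 2):
        out[f"{a} & {b}"] = row[a] * row[b]  # 1 only when both flags are 1
    return out


# Four of the skills from the example row above.
skills = ["C#", "C++", "Cassandra", "CouchDB"]
row = {"C#": 1, "C++": 1, "Cassandra": 1, "CouchDB": 0}
wide = add_pair_columns(row, skills)

# Number of unordered pairs that 43 skills produce.
n_pairs_for_43_skills = comb(43, 2)
```

With 43 skills there are 903 unordered pairs, which matches the 3 pair columns shown in the example row plus the other 900.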


On to the analysis with the undirected graphs.

We will repeat the pattern used to answer the first question, starting with an initial focus on New York, followed by a comparison to the rest of the US.

Part a: What data scientist pairs of skills are the most popular in NY?

Using the NetworkX Python module, we make the first graph, which is displayed below.

How do we read this graph?

  • The size of each node in the graph corresponds to the probability of occurrence for each individual skill. This is the mean value of the data obtained for a skill using all of the rows of job posts in New York. The larger the node, the more likely the skill occurs.
  • Next, the thickness and color of the lines connecting two nodes (we call these edges) correspond to the probability that the pair of skills occur in a job post. This is the mean value of the column for the pair of skills using all of the rows of job posts in New York. The thicker the line, the more likely the pair, and the color of the line corresponds to a region of probability.
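The two quantities behind the graph, node weights and edge weights, can be computed as below. This is a sketch on made-up rows using only the standard library; the variable names are mine, and the original plotting code is not shown here.

```python
from itertools import combinations

# Made-up NY rows in the 0/1 format (three skills only, for brevity).
rows = [
    {"python": 1, "sql": 1, "spark": 0},
    {"python": 1, "sql": 1, "spark": 1},
    {"python": 0, "sql": 1, "spark": 0},
    {"python": 1, "sql": 0, "spark": 0},
]
skills = ["python", "sql", "spark"]
n = len(rows)

# Node weight: fraction of posts that mention the skill.
node_weight = {s: sum(r[s] for r in rows) / n for s in skills}

# Edge weight: fraction of posts that mention both skills of a pair.
edge_weight = {
    (a, b): sum(r[a] * r[b] for r in rows) / n
    for a, b in combinations(skills, 2)
}
```

With NetworkX, these values could be attached through `G.add_node(skill, weight=...)` and `G.add_edge(a, b, weight=...)`, then mapped to the `node_size` and `width` arguments of `nx.draw_networkx_nodes` and `nx.draw_networkx_edges`.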

What can we see?

  • I see a lot of turquoise lines in this graph. These are pairs of skills that occur with near zero probability and they are causing a lot of clutter.

We can do better.

Let’s prune the graph. I’m going to remove all of the skills that are connected solely by these low probability turquoise lines.
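One way to express this pruning is to drop every edge below a cutoff and then keep only the skills that still have at least one edge. The cutoff of 0.05 and the edge values below are assumptions for illustration; the post does not state the exact threshold used for this graph.

```python
def prune(edge_weight, threshold):
    """Drop weak edges, then keep only nodes that still touch an edge."""
    kept_edges = {pair: w for pair, w in edge_weight.items() if w >= threshold}
    kept_nodes = {skill for pair in kept_edges for skill in pair}
    return kept_nodes, kept_edges


# Made-up pair probabilities.
edge_weight = {
    ("python", "sql"): 0.50,
    ("python", "spark"): 0.25,
    ("sql", "flume"): 0.01,
    ("flume", "couchdb"): 0.00,
}
nodes, edges = prune(edge_weight, threshold=0.05)
```

Here the two hypothetical skills that appear only on near-zero edges disappear from the graph, while the strongly connected skills remain.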

Here is our pruned result.

What can we see?

Let’s break our observations into two sets. The first set is for the data scientist who focuses on learning Python, and the second set is for the data scientist who focuses on learning R.

First, let’s look at what skills occur together with Python:

  • We see that Python and SQL occur together the most.
  • We see that Python and Spark, and Python and Hadoop, occur the second most, with the same frequency. (This is an enlightening observation.)

Even though Hadoop is written in Java, Python and Hadoop appear together more frequently than Java and Hadoop. This indicates that there is most likely support for using Hadoop with Python and that it is in demand (Hadoop Streaming, Hadoopy, Pydoop, and MRJob — Ref).

Furthermore, even though Spark is written in Scala, Python and Spark appear together more frequently than Spark and Scala. This also indicates that there is most likely support for using Spark with Python and it is in demand (PySpark).

Second, let’s look at what skills occur together with R:

  • We see that R and SQL occur together the most.
  • We see that R and Spark, and R and Hadoop, occur the second most, with the same frequency. (This is also an enlightening observation.)

We can make the same argument. Even though Hadoop is written in Java, R and Hadoop appear together more frequently than Java and Hadoop. This indicates that there is most likely support for using Hadoop with R that is in demand (Hadoop Streaming, RHadoop, RHipe, and ORCH — Ref).

Again, even though Spark is written in Scala, R and Spark appear together more frequently than Spark and Scala. This indicates that there is most likely support for using Spark with R that is in demand (sparklyr).

Let’s summarize.

From our first question, we discovered the importance of Python, R, and SQL to the data science profession. We also realized the importance of big data technologies. Using the undirected graph, we highlighted two routes into these big data technologies: Python with Spark and Hadoop, or R with Spark and Hadoop.

Part b: How do the most popular data science skills in NY compare to the rest of the US?

I will use the same normalization method from the first question to build the undirected graph that answers this question. To speed things up, I will save the discussion for the final graph, but here are the steps I take to get there:

  • First, we create the undirected graph for the skills occurrence and skill pairs occurrence for the job posts that occur in the entire US.
  • Second, we normalize the data for the nodes and the data for the edges in the NY and the US graphs.
  • Third, we take the difference between these normalized graphs. For this next plot, I am going to change the color palette of the graph to something that better visualizes the percent differences in data science skill popularity in favor of and against New York. Let’s call this the percent preference of NY vs. US.
  • Fourth, this plot contains a lot of insignificant, low probability edges. Let’s prune most of the low probability nodes, those with edge values primarily between -0.05 and 0.05.
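The steps above can be sketched for the edges as follows. This assumes min-max scaling for the normalization and treats the band between -0.05 and 0.05 as a hard cutoff; the function names and the pair probabilities are made up.

```python
def min_max(weights):
    """Rescale values to span 0 to 1 (assumes they are not all equal)."""
    lo, hi = min(weights.values()), max(weights.values())
    return {k: (v - lo) / (hi - lo) for k, v in weights.items()}


def preference_edges(ny_edges, us_edges, band=0.05):
    """Normalized NY minus normalized US, keeping only edges outside the band."""
    ny_n, us_n = min_max(ny_edges), min_max(us_edges)
    diff = {pair: ny_n[pair] - us_n[pair] for pair in ny_n}
    return {pair: d for pair, d in diff.items() if abs(d) > band}


# Made-up pair probabilities for NY and for the whole US.
ny_edges = {("python", "spark"): 0.30, ("python", "sql"): 0.50,
            ("excel", "sql"): 0.10, ("d3", "excel"): 0.00}
us_edges = {("python", "spark"): 0.15, ("python", "sql"): 0.40,
            ("excel", "sql"): 0.16, ("d3", "excel"): 0.00}

kept = preference_edges(ny_edges, us_edges)
# Nodes that lose all of their edges drop out of the graph.
kept_nodes = {skill for pair in kept for skill in pair}
```

In this toy data, the Python and Spark pair ends up with a positive value (favored in NY), the Excel and SQL pair ends up negative, and D3 is pruned because its only edge falls inside the band.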

What can we see?

  • Matlab and Hadoop, Matlab and Python, Python and Spark, Python and Hadoop, R and Spark, and R and Hadoop are the job skill pairs with the highest percent preference for NY job descriptions in comparison to the rest of the US.

This graph suggests that Matlab can be used with both Hadoop and Spark, and that NY is loving it. This adds a third route to consider when seeking to gain experience with big data.

Moving Forward

In this blog post, we analyzed web-scraped job data and revealed which data science skills are worth spending our time to learn. We asked the question, “What data science skills are the most popular?” Our analysis responded with Python, R, SQL, Hadoop, Spark, and modern database technologies. We then asked the question, “What are the most popular pairs of data scientist skills?” Our analysis responded with Python and SQL or R and SQL, followed by three possibilities for incorporating big data: Python with Spark and Python with Hadoop, R with Spark and R with Hadoop, or Matlab with Spark and Matlab with Hadoop. Using these insights, we can confidently direct our next efforts toward becoming the data scientists of tomorrow.
