Analyzing Python Job Market with Pandas
In this post I’m doing some simple data analytics of job market for python programmes. I will be using Python Pandas
My dataset comes from reed.co.uk - UK job board. I created simple Scrapy project that crawls reed.co.uk python job section, and parses all ads it finds. While crawling I set high download delay of 2 seconds, low number of max concurrent requests per domain and added descriptive user agent header linking back to my blog. If you are interested in source code for spider let me know.
I’ve chosen this specific job board because in contrast with other sites of this type it displays interesting information about each post. Aside from boring marketing speech describing how exciting each position is, reed shows more interesting facts about each position such as salary range and number of applications. I was curious if one can find some patterns in all this, also analyzing this data is good way to learn/demonstrate some Python Pandas functions.
My data is stored in .csv file that has 615 records, all job ads found when searching for Python, you can download it here
Let’s feed our data to Pandas.
Since my data comes from Internet it made sense to do some normalization at the level of extraction. What you get in .csv is not exactly raw data, it is partly processed. For example dates on reed are displayed either as specific date in ‘day month’ format (e.g. ‘24 December’), or as string ‘just now’ or ‘yesterday’, so I had to use some regular expressions to extract proper data and then format all as ISO date string. Similarly with salary data it could be either posted as string ‘From 20 000 to 30 000’ or just ‘20 000 per annum’. I decided to use two fields: “salary_max” which is higher value of salary range, and “salary_min” as lower value. If there was only one value posted I assumed value present is salary_max.
Location, location, location
First of all it makes sense to ask: where are Python positions located?
If you know UK job market you’re probably not suprised by domination of London. Almost 1/3 of all jobs are in London. Cambridge’s second spot is interesting, as is high position of Bristol and Reading.
Given high amount of open positions in each city, one wonders if market is perhaps saturated in London or Cambridge. How many applications per position do we have in top 10 locations?
Let’s create smaller data set with job ads only from top 10 cities for Python programmers:
and now let’s see how many applications are there per job post
Seems like everyone wants to work in London and no one wants work in Manchester and Cambridge. Bristol attracts talent as does Oxford, but keep in mind low number of positions there. Devon has unusually high number of applications per ad, which seems interesting.
Let’s check if our calculations are correct (as you read my post please feel free to check all my calculations, also let me know if there is some smarter, easier way of getting some specific result).
What determines number of applications
When browsing data you quickly notice uneven distribution of applications. Some positions have zero applications, and some have relatively high number. For example this query will give you number of applicantions for each Cambridge job.
You can see lots of ads with zero applications and some unusually popular posts. One position is particularly attractive, it attract 17 applicants.
Why are there so many applications there?
Is it the salary?
No, salary is not given. Actually nothing in my data explains why this position is so popular so I had to follow link (I store all links to job records in link columns, this is mostly for testing of accuracy of data extraction), perhaps I missed something?
If you follow link and read description you’ll see that it’s just entry level position, keywords “graduate” probably attract people without experience with Python, so this would explain high application rate.
This got me thinking: is there a strong link between a post being entry level position and high number of applicants? We could do some natural language processing to identify entry level positions, but this would probably require separate blog article, so for now, let’s just try checking if some keyword is present in description, for example keyword ‘graduate’.
Which in nutshell means: if you have some experience in Python you have on average 4 competitors to your position. Given that the most of the time recruiters are inviting 4-5 people to interview, you should probably get to interview if you worked with Python before.
Salaries
To get data about salary grouped by location:
Two things are strange here, unusually high values for Berkhamsted and USA and absolutely no wage data for Devon. Are people really getting that much in Berhamsted and USA?
As a side note, the fact that we have USA in our data set is because of inconsitency in job postings on reed. For UK posts location is specified as city, for US posts we get USA as location, without adding city. To get actual US city where position is located we would have to parse description which would be difficult to do, so I decided to keep it as it is without normalizing this to city string.
Are these wages in Berkhamsted so high? I suspected either cheating here or error in my script extracting data, so to actually check this I had to follow those links and see raw data.
Turns out it’s cheating on agency side. Payroll superstar title hides rather small wages of 20k per annum. Since agency that uses this trick is responsible for 10 job postings out of all 15 Python jobs in Berkhamsted it is natural that it distorts results. What’s more this job is actually not work of Python programmer, but was caught in our results because of reed indexing which caught reference to monty python.
And for USA jobs? Are they really getting so much more then UK engineers? Let’s look at descriptions…
All salaries are given in dollars not in pounds! Actual difference is not as big as it seems. Yet even after adjusting to pound it seems that US salaries are much higher at around 84-100k per annum. Whether actual take home wages are higher in US is not completely evident. When comparing salaries between two countries you have to ask yourself if you really compare apple to apples. Taxes, social security contributions and health care make a big difference to actual take home pay and it is not clear if they are always specified in same way. Perhaps they publish salaries without tax in US and with tax in UK, just like they publish prices of goods with tax in UK and without tax in US.
Finally aggregate data for UK as a whole:
Position without applicants
One interesting thing about job market for Python programmers is unusually high number of positions for which there is no applications.
One out of six jobs has zero applications. It has to be difficult to find experienced python dev these days.
But one can also ask why some positions don’t get any interest. First of all maybe they were just recently published and noone had the time to apply yet. Luckily I’m storing date published and date found in my data, so we can get number of days each add is on market from these fields. My script stores both dates as isoformat string. To get number of days we need to convert isoformat date to timedelta and then to integer.
With pandas we can actually easily add new columns to DataFrame, so let’s do this. I will add new column “daysOn” - that contains timedelta between date published and date found by my spider.
Adding new column is simple, just assign another series to DataFrame:
At this point we have extra column “daysOn” which contains timedelta object representing time that passed between date of publication and date when each ad was found. Note that we’re using numpy datetime which exposes slighly different api from python’s native datetime.timedelta object To actually get number of days we need to cast our timedelta to int and representing number of days.o
Now let’s eliminate those ads which are on job market only for zero days
At this point ‘zeros’ frame contains only ads that are on market for more then one day, and no one had applied for them yet.
Average time on market is two weeks.
Where are those positions located?
Fun fact: salary for those position is actually higher from average.
What types of jobs are these? Probably those that require lots of experience, we can tell that only two of them contain word graduate. Juding by presence of some keywords like: “lead” or “experience” and salary above mean they are probably roles for experienced devs, and description seems scares people off.
I think we’ve got quite an insight into Python job market at this point. There are still some interesting questions to ask but I think we can safely leave them for part two