Analyzing Python Job Market with Pandas
In this post I’m doing some simple data analysis of the job market for Python programmers, using Python Pandas.
My dataset comes from reed.co.uk, a UK job board. I created a simple Scrapy project that crawls the reed.co.uk Python job section and parses all the ads it finds. While crawling I set a high download delay of 2 seconds and a low limit on concurrent requests per domain, and added a descriptive user agent header linking back to my blog. If you are interested in the source code for the spider, let me know.
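For reference, polite crawling like that is configured in the Scrapy project's settings.py. A minimal sketch of the relevant settings could look like the snippet below (the exact values mirror what I described; the blog URL is a placeholder, not a real address):

# settings.py - polite crawling configuration
DOWNLOAD_DELAY = 2                  # wait 2 seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # keep per-domain concurrency low
USER_AGENT = 'jobs-research-bot (+http://my-blog.example.com)'  # descriptive UA linking back to the blog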
I chose this specific job board because, in contrast with other sites of this type, it displays interesting information about each post. Aside from the usual marketing speech describing how exciting each position is, reed shows harder facts such as the salary range and the number of applications. I was curious whether one can find some patterns in all this, and analyzing this data is also a good way to learn/demonstrate some Python Pandas functions.
My data is stored in a .csv file with 615 records, all job ads found when searching for Python; you can download it here.
Let’s feed our data to Pandas.
In [1]: import pandas as pd
In [2]: data = pd.read_csv('reed.csv')
In [3]: data.columns
Out[3]: Index([u'salary_min', u'description', u'title', u'salary_max', u'applications', u'page_number', u'location', u'published', u'link', u'found', u'id'], dtype='object')
Since my data comes from the Internet, it made sense to do some normalization at the extraction stage, so what you get in the .csv is not exactly raw data; it is partly processed. For example, dates on reed are displayed either as a specific date in ‘day month’ format (e.g. ‘24 December’) or as the string ‘just now’ or ‘yesterday’, so I had to use some regular expressions to extract the proper data and then format everything as an ISO date string. Similarly with salary data: it could be posted either as the string ‘From 20 000 to 30 000’ or just ‘20 000 per annum’. I decided to use two fields: “salary_max”, the higher value of the salary range, and “salary_min”, the lower value. If only one value was posted, I assumed the value present is salary_max.
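To give an idea, here is a simplified sketch of that normalization logic. This is not the exact spider code; extract_date and extract_salary are illustrative helpers, and the date parser assumes the posting is from the current year:

import re
from datetime import date, timedelta

MONTHS = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
          'august', 'september', 'october', 'november', 'december']

def extract_date(raw, today=None):
    """Turn '24 December', 'just now' or 'yesterday' into an ISO date string."""
    today = today or date.today()
    raw = raw.strip().lower()
    if raw == 'just now':
        return today.isoformat()
    if raw == 'yesterday':
        return (today - timedelta(days=1)).isoformat()
    day, month = re.match(r'(\d{1,2}) (\w+)', raw).groups()
    return date(today.year, MONTHS.index(month) + 1, int(day)).isoformat()

def extract_salary(raw):
    """Return (salary_min, salary_max); a single posted value becomes salary_max."""
    amounts = [int(a.replace(' ', '').replace(',', ''))
               for a in re.findall(r'\d[\d, ]*', raw)]
    if len(amounts) >= 2:
        return min(amounts), max(amounts)
    if len(amounts) == 1:
        return None, amounts[0]
    return None, None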
Location, location, location
First of all, it makes sense to ask: where are the Python positions located?
In [37]: data.location.value_counts().head(10)
Out[37]:
London 203
Cambridge 42
Reading 19
Bristol 19
Manchester 16
Devon 15
Berkhamsted 15
Oxford 11
Cardiff 8
USA 8
dtype: int64
If you know the UK job market you’re probably not surprised by the dominance of London; almost 1/3 of all jobs are there. Cambridge’s second spot is interesting, as are the high positions of Bristol and Reading.
Given the high number of open positions in each city, one wonders if the market is perhaps saturated in London or Cambridge. How many applications per position do we have in the top 10 locations?
Let’s create a smaller data set with job ads only from the top 10 locations for Python programmers:
In [113]: toplocations = data.location.value_counts().head(10)
In [114]: mask = data.location.isin(toplocations.keys())
In [115]: tops = data[mask]
and now let’s see how many applications there are per job post:
In [183]: applications = tops.groupby('location').applications.sum()
In [184]: ads = tops.groupby('location').id.count()
In [185]: applications / ads
Out[185]:
location
Berkhamsted 2.333333
Bristol 4.157895
Cambridge 2.523810
Cardiff 3.250000
Devon 7.666667
London 7.935961
Manchester 2.125000
Oxford 4.181818
Reading 3.631579
USA 5.375000
dtype: float64
It seems everyone wants to work in London and no one wants to work in Manchester or Cambridge. Bristol attracts talent, as does Oxford, but keep in mind the low number of positions there. Devon has an unusually high number of applications per ad, which seems interesting.
Let’s check if our calculations are correct (as you read this post, feel free to verify all my calculations, and let me know if there is a smarter, easier way of getting a specific result).
In [371]: cam = data[data.location=='Cambridge'].applications.sum() / float(data[data.location == 'Cambridge'].id.count())
In [372]: cam
Out[372]: 2.5238095238095237
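Incidentally, since summed applications divided by the number of ads is just the mean, the whole per-location table above can be computed in one line (assuming there are no missing values in the applications column):

# mean applications per ad, by location - same result as applications / ads above
tops.groupby('location').applications.mean()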
What determines the number of applications
When browsing the data you quickly notice the uneven distribution of applications. Some positions have zero applications, and some have a relatively high number. For example, this query will give you the number of applications for each Cambridge job:
In [373]: data[data.location == 'Cambridge'][['location', 'applications']]
You can see lots of ads with zero applications and some unusually popular posts. One position is particularly attractive: it attracted 17 applicants.
Why are there so many applications there?
Is it the salary?
# top_ad holds the highest application count seen in Cambridge
In [221]: top_ad = data[data.location == 'Cambridge'].applications.max()
In [222]: data[(data.applications == top_ad) & (data.location == 'Cambridge')][['applications', 'salary_max', 'salary_min']]
Out[222]:
applications salary_max salary_min
445 17 NaN NaN
No, the salary is not given. Actually, nothing in my data explains why this position is so popular, so I had to follow the link (I store all links to job records in the link column, mostly for testing the accuracy of data extraction); perhaps I missed something?
In [238]: data[(data.applications == top_ad) & (data.location == 'Cambridge')].link.values
Out[238]: array([ 'http://www.reed.co.uk/jobs/senior-graduate-software-engineers-web-developers-iot/25336524#/jobs?keywords=python&cached=True&pageno=18'], dtype=object)
If you follow the link and read the description you’ll see that it’s just an entry level position; the keyword “graduate” probably attracts people without Python experience, which would explain the high application rate.
This got me thinking: is there a strong link between a post being an entry level position and a high number of applicants? We could do some natural language processing to identify entry level positions, but that would probably require a separate blog article, so for now let’s just check whether some keyword is present in the description, for example the keyword ‘graduate’.
In [389]: filter = data.description.str.contains('graduate', case=False)
# mean of applications if there is keyword 'graduate' in description
In [390]: data[filter].applications.mean()
Out[390]: 12.0
# 'graduate' not present in description
In [391]: data[filter == False].applications.mean()
Out[391]: 4.3551236749116606
# mean salary if 'graduate' in description
In [392]: data[filter].salary_max.mean()
Out[392]: 27714.285714285714
# mean salary if 'graduate' not in description
In [393]: data[filter == False].salary_max.mean()
Out[393]: 62014.893506493507
Which in a nutshell means: if you have some experience in Python, you have on average 4 competitors for a position. Given that recruiters most of the time invite 4-5 people to interview, you should probably get an interview if you have worked with Python before.
Salaries
To get salary data grouped by location:
In [244]: tops.groupby('location').salary_max.mean()
Out[244]:
location
Berkhamsted 203714.285714
Bristol 52352.941176
Cambridge 46720.000000
Cardiff 36714.285714
Devon NaN
London 62846.376812
Manchester 74733.333333
Oxford 56666.666667
Reading 41642.857143
USA 154285.714286
Name: salary_max, dtype: float64
In [245]: tops.groupby('location').salary_min.mean()
Out[245]:
location
Berkhamsted 177142.857143
Bristol 38294.117647
Cambridge 36400.000000
Cardiff 27285.714286
Devon NaN
London 47963.768116
Manchester 59866.666667
Oxford 46000.000000
Reading 32428.571429
USA 127857.142857
Name: salary_min, dtype: float64
Two things are strange here: unusually high values for Berkhamsted and the USA, and absolutely no wage data for Devon. Are people really getting that much in Berkhamsted and the USA?
As a side note, the USA shows up in our data set because of an inconsistency in job postings on reed. For UK posts the location is specified as a city; for US posts we get USA as the location, without a city. To get the actual US city where a position is located we would have to parse the description, which would be difficult, so I decided to keep it as it is without normalizing to a city string.
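If one did want the city, a crude first pass could simply scan the description for a handful of known city names. A naive sketch, nowhere near real geocoding (the city list and guess_us_city helper are purely illustrative):

us_cities = ['San Francisco', 'New York', 'Boston', 'Seattle']  # far from exhaustive

def guess_us_city(description):
    """Return the first known city name found in the ad text, else 'USA'."""
    for city in us_cities:
        if city in description:
            return city
    return 'USA'

data[data.location == 'USA'].description.apply(guess_us_city)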
Are the wages in Berkhamsted really that high? I suspected either cheating or an error in my extraction script, so to check I had to follow the links and look at the raw data.
It turns out it’s cheating on the agency’s side. A payroll superstar title hides rather small wages of 20k per annum. Since the agency that uses this trick is responsible for 10 of the 15 Python job postings in Berkhamsted, it naturally distorts the results. What’s more, this job is not actually Python programming work; it was caught in our results because reed’s indexing picked up a reference to Monty Python.
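The quickest way to spot this pattern from the data itself is to eyeball the Berkhamsted rows; repeated titles from a single agency stand out immediately:

# list titles and both ends of the salary range for Berkhamsted ads
tops[tops.location == 'Berkhamsted'][['title', 'salary_min', 'salary_max']]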
And the USA jobs? Are they really getting so much more than UK engineers? Let’s look at the descriptions…
In[260]: tops[tops.location=='USA'].description.values
Out[260]:
array([ 'Software Developer - C & Linux/Unix - San Francisco - To c.$150,000 + bonus + relocation + bens. This role will involve designing and developing systems for the delivery of media content to consumers on a variety of devices, worldwide. This role offers a starting salary of up to $150,000 (possibly...',
' Senior Data Scientist / Manager New York / New Jersey $150,000 - $200,000 base salary + performance related annual bonus and exceptional benefits Are you interested in working as a Senior Data Scientist / Manager for a hugely exciting, fast-paced, dynamic and globally renowned financial services...',
'Software Developer - iOS, C/C++ and/or Java - San Francisco - To c.$130,000 + bonus + relocation + bens. This role will involve designing and developing SDKs and APIs that are used by iPhone, iPad and iOS developers around the world. This role offers a starting salary of up to $130,000 (possibly higher...',
'Software Developer - C & Linux/Unix - San Francisco - To c.$150,000 + bonus + relocation + bens. This role will involve designing and developing systems for the delivery of media content to consumers on a variety of devices, worldwide. This role offers a starting salary of up to $150,000 (possibly...',
'Software Developer - C & Linux/Unix - San Francisco - To c.$150,000 + bonus + relocation + bens. This role will involve designing and developing systems for the delivery of media content to consumers on a variety of devices, worldwide. This role offers a starting salary of up to $150,000 (possibly...',
' Data Scientist, Analytics New York City, New York $100,000 - $150,000 base salary + performance related annual bonus Are you interested in working as a Data Scientist for a forward thinking, dynamic, innovative and customer focused online retailer where exceptional levels of career progression...',
'Cloud Engineer - C/C++, Linux, AWS, Openstack - San Francisco - To c.$150,000 + bonus + relocation + bens. This role will involve designing and developing cloud software applications to improve the delivery of media content to consumers on a variety of devices, worldwide. This role offers a starting salary...',
' Algorithm Engineer New York City, New York $120,000 - $150,000 base salary + performance related annual bonus + benefits The opportunity to work for the coolest, most innovative and cutting-edge digital media organization in New York City and arguably across the entire globe should not be passed upon...'], dtype=object)
All salaries are given in dollars, not in pounds! The actual difference is not as big as it seems. Yet even after adjusting to pounds, US salaries seem much higher, at around 84-100k per annum. Whether actual take-home wages are higher in the US is not completely evident. When comparing salaries between two countries you have to ask yourself if you are really comparing apples to apples. Taxes, social security contributions and health care make a big difference to actual take-home pay, and it is not clear if they are always specified in the same way. Perhaps salaries are published before tax in the US and after tax in the UK, just as prices of goods are published with tax in the UK and without tax in the US.
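As a rough sanity check we can convert the US means to pounds. The exchange rate below is my approximation of the late-2014 rate, not part of the data:

gbp_per_usd = 0.65  # approximate late-2014 USD to GBP rate (my assumption)
usa = tops[tops.location == 'USA']
usa.salary_min.mean() * gbp_per_usd  # roughly 83k GBP
usa.salary_max.mean() * gbp_per_usd  # roughly 100k GBP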
Finally, the aggregate figures for the data set as a whole:
In [263]: data.salary_max.mean()
Out[263]: 59689.428571428572
In [264]: data.salary_min.mean()
Out[264]: 46540.220338983054
Positions without applicants
One interesting thing about the job market for Python programmers is the unusually high number of positions with no applications.
In [88]: zeros = data[data.applications.eq(0)].copy()  # copy so we can add columns later without warnings
In [89]: zeros.id.count()
Out[89]: 101
One out of six jobs has zero applications. It must be difficult to find an experienced Python dev these days.
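That one-in-six figure is just the ratio of the two counts:

# 101 zero-application ads out of 615 total, i.e. about 16%
zeros.id.count() / float(data.id.count())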
But one can also ask why some positions don’t get any interest. First of all, maybe they were just recently published and no one has had time to apply yet. Luckily I store both the date published and the date found, so we can get the number of days each ad has been on the market from these fields. My script stores both dates as isoformat strings; to get the number of days we need to convert the isoformat dates to a timedelta and then to an integer.
With pandas we can easily add new columns to a DataFrame, so let’s do that. I will add a new column “daysOn” that contains the timedelta between the date published and the date found by my spider.
Adding a new column is simple: just assign another Series to the DataFrame:
In [290]: import numpy as np
In [291]: zeros["daysOn"] = zeros.found.astype(np.datetime64) - zeros.published.astype(np.datetime64)
In [292]: zeros[["found", "published", "daysOn"]].head(3)
Out[292]:
found published \
0 2014-12-26T21:54:34.050102 2014-12-22T21:54:34.049654
2 2014-12-26T21:54:34.054577 2014-12-22T21:54:34.054197
5 2014-12-26T21:54:34.063716 2014-12-26T21:54:34.063350
daysOn
0 4 days 00:00:00.000448
2 4 days 00:00:00.000380
5 0 days 00:00:00.000366
At this point we have an extra column “daysOn” which contains a timedelta object representing the time that passed between the date of publication and the date when each ad was found. Note that we’re using numpy datetimes, which expose a slightly different API from Python’s native datetime.timedelta object. To actually get the number of days we need to cast our timedelta to an int representing the number of days.
In [293]: zeros.daysOn = zeros.daysOn.apply(lambda a:np.timedelta64(a, 'D').astype(int))
In [294]: zeros[["found", "published", "daysOn"]].head(3)
Out[294]:
found published daysOn
0 2014-12-26T21:54:34.050102 2014-12-22T21:54:34.049654 4
2 2014-12-26T21:54:34.054577 2014-12-22T21:54:34.054197 4
5 2014-12-26T21:54:34.063716 2014-12-26T21:54:34.063350 0
Now let’s eliminate the ads that have been on the job market for zero days:
In [307]: zeros = zeros[zeros.daysOn.eq(0) == False]
In [309]: zeros[["found", "published", "daysOn"]].head(3)
Out[309]:
found published daysOn
0 2014-12-26T21:54:34.050102 2014-12-22T21:54:34.049654 4
2 2014-12-26T21:54:34.054577 2014-12-22T21:54:34.054197 4
12 2014-12-26T21:54:34.081788 2014-12-17T00:00:00 9
At this point the ‘zeros’ frame contains only ads that have been on the market for at least one day and that no one has applied for yet.
The average time on the market is two weeks:
In [337]: zeros.daysOn.mean()
Out[337]: 14.329787234042554
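The mean alone can hide a skewed distribution, so it’s worth glancing at the quartiles as well:

# quartiles show whether a few very old ads are dragging the mean up
zeros.daysOn.describe()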
Where are those positions located?
In [311]: zeros.location.value_counts()
Out[311]:
London 25
Cambridge 13
Manchester 6
Southampton 3
Surrey 3
Bristol 2
Cheltenham 2
Fun fact: the salary for those positions is actually higher than average.
In [315]: zeros.salary_max.describe()
Out[315]:
count 58.000000
mean 66736.793103
std 46585.508446
min 28000.000000
25% 40000.000000
50% 56300.000000
75% 75000.000000
max 276000.000000
Name: salary_max, dtype: float64
In [316]: data.salary_max.describe()
Out[316]:
count 413.000000
mean 59689.428571
std 43279.344537
min 22000.000000
25% 36000.000000
50% 50000.000000
75% 65000.000000
max 276000.000000
Name: salary_max, dtype: float64
What types of jobs are these? Probably those that require lots of experience: only two of them contain the word graduate. Judging by the presence of keywords like “lead” or “experience”, and salaries above the mean, these are probably roles for experienced devs, and the descriptions seem to scare people off.
# probably not entry level positions; only two contain 'graduate'
In [351]: zeros[zeros.description.str.contains('graduate')].id.count()
Out[351]: 2
# which keywords are present? 'lead'...
In [358]: zeros[zeros.description.str.contains('lead', case=False)].id.count()
Out[358]: 42
# ... 'experience'
In [359]: zeros[zeros.description.str.contains('experience', case=False)].id.count()
Out[359]: 40
I think we’ve gained quite an insight into the Python job market at this point. There are still some interesting questions to ask, but I think we can safely leave them for part two.