Analyzing Python Job Market with Pandas
In this post I’m doing some simple data analysis of the job market for Python programmers, using Python Pandas.
My dataset comes from reed.co.uk, a UK job board. I created a simple Scrapy project that crawls the reed.co.uk Python job section and parses all the ads it finds. While crawling I set a high download delay of 2 seconds and a low limit on concurrent requests per domain, and added a descriptive user agent header linking back to my blog. If you are interested in the source code for the spider, let me know.
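For reference, polite crawling like that is configured in the Scrapy project's settings.py. A minimal sketch of the relevant settings could look like the snippet below (the exact values mirror what I described; the blog URL is a placeholder, not a real address):

# settings.py - polite crawling configuration
DOWNLOAD_DELAY = 2                  # wait 2 seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # keep per-domain concurrency low
USER_AGENT = 'jobs-research-bot (+http://my-blog.example.com)'  # descriptive UA linking back to the blog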
I chose this specific job board because, in contrast with other sites of this type, it displays interesting information about each post. Aside from the usual marketing speech describing how exciting each position is, reed shows harder facts such as the salary range and the number of applications. I was curious whether one can find some patterns in all this, and analyzing this data is also a good way to learn/demonstrate some Python Pandas functions.
My data is stored in a .csv file with 615 records, all job ads found when searching for Python; you can download it here.
Let’s feed our data to Pandas.
In [1]: import pandas as pd
In [2]: data = pd.read_csv('reed.csv')
In [3]: data.columns
Out[3]: Index([u'salary_min', u'description', u'title', u'salary_max', u'applications', u'page_number', u'location', u'published', u'link', u'found', u'id'], dtype='object')
Since my data comes from the Internet, it made sense to do some normalization at the extraction stage, so what you get in the .csv is not exactly raw data; it is partly processed. For example, dates on reed are displayed either as a specific date in ‘day month’ format (e.g. ‘24 December’) or as the string ‘just now’ or ‘yesterday’, so I had to use some regular expressions to extract the proper data and then format everything as an ISO date string. Similarly with salary data: it could be posted either as the string ‘From 20 000 to 30 000’ or just ‘20 000 per annum’. I decided to use two fields: “salary_max”, the higher value of the salary range, and “salary_min”, the lower value. If only one value was posted, I assumed the value present is salary_max.
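To give an idea, here is a simplified sketch of that normalization logic. This is not the exact spider code; extract_date and extract_salary are illustrative helpers, and the date parser assumes the posting is from the current year:

import re
from datetime import date, timedelta

MONTHS = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
          'august', 'september', 'october', 'november', 'december']

def extract_date(raw, today=None):
    """Turn '24 December', 'just now' or 'yesterday' into an ISO date string."""
    today = today or date.today()
    raw = raw.strip().lower()
    if raw == 'just now':
        return today.isoformat()
    if raw == 'yesterday':
        return (today - timedelta(days=1)).isoformat()
    day, month = re.match(r'(\d{1,2}) (\w+)', raw).groups()
    return date(today.year, MONTHS.index(month) + 1, int(day)).isoformat()

def extract_salary(raw):
    """Return (salary_min, salary_max); a single posted value becomes salary_max."""
    amounts = [int(a.replace(' ', '').replace(',', ''))
               for a in re.findall(r'\d[\d, ]*', raw)]
    if len(amounts) >= 2:
        return min(amounts), max(amounts)
    if len(amounts) == 1:
        return None, amounts[0]
    return None, None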
Location, location, location
First of all, it makes sense to ask: where are the Python positions located?
In [37]: data.location.value_counts().head(10)
Out[37]:
London 203
Cambridge 42
Reading 19
Bristol 19
Manchester 16
Devon 15
Berkhamsted 15
Oxford 11
Cardiff 8
USA 8
dtype: int64
If you know the UK job market you’re probably not surprised by the dominance of London; almost 1/3 of all jobs are there. Cambridge’s second spot is interesting, as are the high positions of Bristol and Reading.
Given the high number of open positions in each city, one wonders if the market is perhaps saturated in London or Cambridge. How many applications per position do we have in the top 10 locations?
Let’s create a smaller data set with job ads only from the top 10 locations for Python programmers:
In [113]: toplocations = data.location.value_counts().head(10)
In [114]: mask = data.location.isin(toplocations.keys())
In [115]: tops = data[mask]
and now let’s see how many applications there are per job post:
In [183]: applications = tops.groupby('location').applications.sum()
In [184]: ads = tops.groupby('location').id.count()
In [185]: applications / ads
Out[185]:
location
Berkhamsted 2.333333
Bristol 4.157895
Cambridge 2.523810
Cardiff 3.250000
Devon 7.666667
London 7.935961
Manchester 2.125000
Oxford 4.181818
Reading 3.631579
USA 5.375000
dtype: float64
It seems everyone wants to work in London and no one wants to work in Manchester or Cambridge. Bristol attracts talent, as does Oxford, but keep in mind the low number of positions there. Devon has an unusually high number of applications per ad, which seems interesting.
Let’s check if our calculations are correct (as you read this post, feel free to verify all my calculations, and let me know if there is a smarter, easier way of getting a specific result).
In [371]: cam = data[data.location=='Cambridge'].applications.sum() / float(data[data.location == 'Cambridge'].id.count())
In [372]: cam
Out[372]: 2.5238095238095237
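Incidentally, since summed applications divided by the number of ads is just the mean, the whole per-location table above can be computed in one line (assuming there are no missing values in the applications column):

# mean applications per ad, by location - same result as applications / ads above
tops.groupby('location').applications.mean()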
What determines the number of applications
When browsing the data you quickly notice the uneven distribution of applications. Some positions have zero applications, and some have a relatively high number. For example, this query will give you the number of applications for each Cambridge job:
In [373]: data[data.location == 'Cambridge'][['location', 'applications']]
You can see lots of ads with zero applications and some unusually popular posts. One position is particularly attractive: it attracted 17 applicants.
Why are there so many applications there?
Is it the salary?
# top_ad holds the highest application count seen in Cambridge
In [221]: top_ad = data[data.location == 'Cambridge'].applications.max()
In [222]: data[(data.applications == top_ad) & (data.location == 'Cambridge')][['applications', 'salary_max', 'salary_min']]
Out[222]:
applications salary_max salary_min
445 17 NaN NaN
No, the salary is not given. Actually, nothing in my data explains why this position is so popular, so I had to follow the link (I store all links to job records in the link column, mostly for testing the accuracy of data extraction); perhaps I missed something?
In [238]: data[(data.applications == top_ad) & (data.location == 'Cambridge')].link.values
Out[238]: array([ 'http://www.reed.co.uk/jobs/senior-graduate-software-engineers-web-developers-iot/25336524#/jobs?keywords=python&cached=True&pageno=18'], dtype=object)
If you follow the link and read the description you’ll see that it’s just an entry level position; the keyword “graduate” probably attracts people without Python experience, which would explain the high application rate.
This got me thinking: is there a strong link between a post being an entry level position and a high number of applicants? We could do some natural language processing to identify entry level positions, but that would probably require a separate blog article, so for now let’s just check whether some keyword is present in the description, for example the keyword ‘graduate’.
In [389]: filter = data.description.str.contains('graduate', case=False)
# mean of applications if there is keyword 'graduate' in description
In [390]: data[filter].applications.mean()
Out[390]: 12.0
# 'graduate' not present in description
In [391]: data[filter == False].applications.mean()
Out[391]: 4.3551236749116606
# mean salary if 'graduate' in description
In [392]: data[filter].salary_max.mean()
Out[392]: 27714.285714285714
# mean salary if 'graduate' not in description
In [393]: data[filter == False].salary_max.mean()
Out[393]: 62014.893506493507
Which in a nutshell means: if you have some experience in Python, you have on average 4 competitors for a position. Given that recruiters most of the time invite 4-5 people to interview, you should probably get an interview if you have worked with Python before.
Salaries
To get salary data grouped by location:
In [244]: tops.groupby('location').salary_max.mean()
Out[244]:
location
Berkhamsted 203714.285714
Bristol 52352.941176
Cambridge 46720.000000
Cardiff 36714.285714
Devon NaN
London 62846.376812
Manchester 74733.333333
Oxford 56666.666667
Reading 41642.857143
USA 154285.714286
Name: salary_max, dtype: float64
In [245]: tops.groupby('location').salary_min.mean()
Out[245]:
location
Berkhamsted 177142.857143
Bristol 38294.117647
Cambridge 36400.000000
Cardiff 27285.714286
Devon NaN
London 47963.768116
Manchester 59866.666667
Oxford 46000.000000
Reading 32428.571429
USA 127857.142857
Name: salary_min, dtype: float64
Two things are strange here: unusually high values for Berkhamsted and the USA, and absolutely no wage data for Devon. Are people really getting that much in Berkhamsted and the USA?
As a side note, the USA shows up in our data set because of an inconsistency in job postings on reed. For UK posts the location is specified as a city; for US posts we get USA as the location, without a city. To get the actual US city where a position is located we would have to parse the description, which would be difficult, so I decided to keep it as it is without normalizing to a city string.
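If one did want the city, a crude first pass could simply scan the description for a handful of known city names. A naive sketch, nowhere near real geocoding (the city list and guess_us_city helper are purely illustrative):

us_cities = ['San Francisco', 'New York', 'Boston', 'Seattle']  # far from exhaustive

def guess_us_city(description):
    """Return the first known city name found in the ad text, else 'USA'."""
    for city in us_cities:
        if city in description:
            return city
    return 'USA'

data[data.location == 'USA'].description.apply(guess_us_city)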
Are the wages in Berkhamsted really that high? I suspected either cheating or an error in my extraction script, so to check I had to follow the links and look at the raw data.
It turns out it’s cheating on the agency’s side. A payroll superstar title hides rather small wages of 20k per annum. Since the agency that uses this trick is responsible for 10 of the 15 Python job postings in Berkhamsted, it naturally distorts the results. What’s more, this job is not actually Python programming work; it was caught in our results because reed’s indexing picked up a reference to Monty Python.
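The quickest way to spot this pattern from the data itself is to eyeball the Berkhamsted rows; repeated titles from a single agency stand out immediately:

# list titles and both ends of the salary range for Berkhamsted ads
tops[tops.location == 'Berkhamsted'][['title', 'salary_min', 'salary_max']]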
And the USA jobs? Are they really getting so much more than UK engineers? Let’s look at the descriptions…
In[260]: tops[tops.location=='USA'].description.values
Out[260]:
array([ 'Software Developer - C & Linux/Unix - San Francisco - To c.$150,000 + bonus + relocation + bens. This role will involve designing and developing systems for the delivery of media content to consumers on a variety of devices, worldwide. This role offers a starting salary of up to $150,000 (possibly...',
' Senior Data Scientist / Manager New York / New Jersey $150,000 - $200,000 base salary + performance related annual bonus and exceptional benefits Are you interested in working as a Senior Data Scientist / Manager for a hugely exciting, fast-paced, dynamic and globally renowned financial services...',
'Software Developer - iOS, C/C++ and/or Java - San Francisco - To c.$130,000 + bonus + relocation + bens. This role will involve designing and developing SDKs and APIs that are used by iPhone, iPad and iOS developers around the world. This role offers a starting salary of up to $130,000 (possibly higher...',
'Software Developer - C & Linux/Unix - San Francisco - To c.$150,000 + bonus + relocation + bens. This role will involve designing and developing systems for the delivery of media content to consumers on a variety of devices, worldwide. This role offers a starting salary of up to $150,000 (possibly...',
'Software Developer - C & Linux/Unix - San Francisco - To c.$150,000 + bonus + relocation + bens. This role will involve designing and developing systems for the delivery of media content to consumers on a variety of devices, worldwide. This role offers a starting salary of up to $150,000 (possibly...',
' Data Scientist, Analytics New York City, New York $100,000 - $150,000 base salary + performance related annual bonus Are you interested in working as a Data Scientist for a forward thinking, dynamic, innovative and customer focused online retailer where exceptional levels of career progression...',
'Cloud Engineer - C/C++, Linux, AWS, Openstack - San Francisco - To c.$150,000 + bonus + relocation + bens. This role will involve designing and developing cloud software applications to improve the delivery of media content to consumers on a variety of devices, worldwide. This role offers a starting salary...',
' Algorithm Engineer New York City, New York $120,000 - $150,000 base salary + performance related annual bonus + benefits The opportunity to work for the coolest, most innovative and cutting-edge digital media organization in New York City and arguably across the entire globe should not be passed upon...'], dtype=object)
All salaries are given in dollars, not in pounds! The actual difference is not as big as it seems. Yet even after adjusting to pounds, US salaries seem much higher, at around 84-100k per annum. Whether actual take-home wages are higher in the US is not completely evident. When comparing salaries between two countries you have to ask yourself if you are really comparing apples to apples. Taxes, social security contributions and health care make a big difference to actual take-home pay, and it is not clear if they are always specified in the same way. Perhaps salaries are published before tax in the US and after tax in the UK, just as prices of goods are published with tax in the UK and without tax in the US.
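As a rough sanity check we can convert the US means to pounds. The exchange rate below is my approximation of the late-2014 rate, not part of the data:

gbp_per_usd = 0.65  # approximate late-2014 USD to GBP rate (my assumption)
usa = tops[tops.location == 'USA']
usa.salary_min.mean() * gbp_per_usd  # roughly 83k GBP
usa.salary_max.mean() * gbp_per_usd  # roughly 100k GBP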
Finally, the aggregate figures for the data set as a whole:
In [263]: data.salary_max.mean()
Out[263]: 59689.428571428572
In [264]: data.salary_min.mean()
Out[264]: 46540.220338983054
Positions without applicants
One interesting thing about the job market for Python programmers is the unusually high number of positions with no applications.
In [88]: zeros = data[data.applications.eq(0)].copy()  # copy so we can add columns later without warnings
In [89]: zeros.id.count()
Out[89]: 101
One out of six jobs has zero applications. It must be difficult to find an experienced Python dev these days.
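That one-in-six figure is just the ratio of the two counts:

# 101 zero-application ads out of 615 total, i.e. about 16%
zeros.id.count() / float(data.id.count())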
But one can also ask why some positions don’t get any interest. First of all, maybe they were just recently published and no one has had time to apply yet. Luckily I store both the date published and the date found, so we can get the number of days each ad has been on the market from these fields. My script stores both dates as isoformat strings; to get the number of days we need to convert the isoformat dates to a timedelta and then to an integer.
With pandas we can easily add new columns to a DataFrame, so let’s do that. I will add a new column “daysOn” that contains the timedelta between the date published and the date found by my spider.
Adding a new column is simple: just assign another Series to the DataFrame:
In [290]: import numpy as np
In [291]: zeros["daysOn"] = zeros.found.astype(np.datetime64) - zeros.published.astype(np.datetime64)
In [292]: zeros[["found", "published", "daysOn"]].head(3)
Out[292]:
found published \
0 2014-12-26T21:54:34.050102 2014-12-22T21:54:34.049654
2 2014-12-26T21:54:34.054577 2014-12-22T21:54:34.054197
5 2014-12-26T21:54:34.063716 2014-12-26T21:54:34.063350
daysOn
0 4 days 00:00:00.000448
2 4 days 00:00:00.000380
5 0 days 00:00:00.000366
At this point we have an extra column “daysOn” which contains a timedelta object representing the time that passed between the date of publication and the date when each ad was found. Note that we’re using numpy datetimes, which expose a slightly different API from Python’s native datetime.timedelta object. To actually get the number of days we need to cast our timedelta to an int representing the number of days.
In [293]: zeros.daysOn = zeros.daysOn.apply(lambda a:np.timedelta64(a, 'D').astype(int))
In [294]: zeros[["found", "published", "daysOn"]].head(3)
Out[294]:
found published daysOn
0 2014-12-26T21:54:34.050102 2014-12-22T21:54:34.049654 4
2 2014-12-26T21:54:34.054577 2014-12-22T21:54:34.054197 4
5 2014-12-26T21:54:34.063716 2014-12-26T21:54:34.063350 0
Now let’s eliminate the ads that have been on the job market for zero days:
In [307]: zeros = zeros[zeros.daysOn.eq(0) == False]
In [309]: zeros[["found", "published", "daysOn"]].head(3)
Out[309]:
found published daysOn
0 2014-12-26T21:54:34.050102 2014-12-22T21:54:34.049654 4
2 2014-12-26T21:54:34.054577 2014-12-22T21:54:34.054197 4
12 2014-12-26T21:54:34.081788 2014-12-17T00:00:00 9
At this point the ‘zeros’ frame contains only ads that have been on the market for at least one day and that no one has applied for yet.
The average time on the market is two weeks:
In [337]: zeros.daysOn.mean()
Out[337]: 14.329787234042554
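The mean alone can hide a skewed distribution, so it’s worth glancing at the quartiles as well:

# quartiles show whether a few very old ads are dragging the mean up
zeros.daysOn.describe()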
Where are those positions located?
In [311]: zeros.location.value_counts()
Out[311]:
London 25
Cambridge 13
Manchester 6
Southampton 3
Surrey 3
Bristol 2
Cheltenham 2
Fun fact: the salary for those positions is actually higher than average.
In [315]: zeros.salary_max.describe()
Out[315]:
count 58.000000
mean 66736.793103
std 46585.508446
min 28000.000000
25% 40000.000000
50% 56300.000000
75% 75000.000000
max 276000.000000
Name: salary_max, dtype: float64
In [316]: data.salary_max.describe()
Out[316]:
count 413.000000
mean 59689.428571
std 43279.344537
min 22000.000000
25% 36000.000000
50% 50000.000000
75% 65000.000000
max 276000.000000
Name: salary_max, dtype: float64
What types of jobs are these? Probably those that require lots of experience: only two of them contain the word graduate. Judging by the presence of keywords like “lead” or “experience”, and salaries above the mean, these are probably roles for experienced devs, and the descriptions seem to scare people off.
# probably not entry level positions; only two contain 'graduate'
In [351]: zeros[zeros.description.str.contains('graduate')].id.count()
Out[351]: 2
# which keywords are present? 'lead'...
In [358]: zeros[zeros.description.str.contains('lead', case=False)].id.count()
Out[358]: 42
# ... 'experience'
In [359]: zeros[zeros.description.str.contains('experience', case=False)].id.count()
Out[359]: 40
I think we’ve gained quite an insight into the Python job market at this point. There are still some interesting questions to ask, but I think we can safely leave them for part two.