
There comes a point in every person's life when they will have to start searching for a job.

Scouting for all the information necessary to find the right roles can be very stressful, especially when it is spread across so many different websites and pages. Indeed.co.uk is a job search engine that gathers millions of jobs on one site, so I will use the Spyder IDE, which is built for the Python language, to scrape only the essential information and get you closer to your ideal job. The aim of this project is to extract the most relevant information for any occupation listed on Indeed and classify it by job title, location, role description and salary in a table created automatically in Excel. We will also see the total number of jobs scraped for the occupation chosen by the user, along with the average, minimum and maximum wage, the median and the standard deviation.

I've decided to use Python for a couple of reasons, mostly its simple syntax and its advantages over other languages when it comes to data science. As we go along, there will be an explanation of the code with a screenshot of the code right below it. Note that every line beginning with a '#' is a comment to help you understand the process and the inner workings (the file will also be available for download at the end of the paper to make it easier to replicate).

First, we will have to learn a bit about the HTML of our website. Once we are on Indeed.co.uk, right-click on any blank spot of the page and click 'Inspect'. This reveals the HTML of the page, which we can look through to find out where specific information is stored. To pull the data out of that HTML we use BeautifulSoup, a Python library built for exactly this purpose. Each job's information is stored in a 'jobsearch-SerpJobCard' element, so the library now knows where to extract the raw data from.
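As a minimal sketch of this step (assuming the BeautifulSoup library and the 'jobsearch-SerpJobCard' class name seen while inspecting, which Indeed may rename at any time):

```python
from bs4 import BeautifulSoup

def extract_cards(html):
    """Return one element per job from a results page's HTML.
    The class name is the one found during the 'Inspect' step;
    check the live page if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("div", class_="jobsearch-SerpJobCard")

# In the real scraper the HTML comes from the live page, e.g.
#   html = requests.get("https://www.indeed.co.uk/jobs?q=...").text
```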

To extract the job title, we find that titles are nested under the umbrella "div" tag, inside an "a" tag, which defines a link. Furthermore, to make the output work smoothly with Excel, we will replace commas with dashes. We will define a function like this for each of our variables, changing the tags accordingly.
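A sketch of the title extractor, assuming a card element like the one found above (the exact tag layout may differ on the live page):

```python
from bs4 import BeautifulSoup

def extract_title(card):
    """Pull the job title out of one card: the text of the 'a' link
    nested under the card's 'div', with commas swapped for dashes so
    the title cannot break the comma-separated output file."""
    link = card.find("a")
    return link.get_text(strip=True).replace(",", "-")
```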
To extract the summary, we will have to make a small alteration, as every job with a summary shows an initial description which can later be expanded by clicking on it. We must therefore define the summary in two parts: small and large. Each small summary has "li" tags attached to it, so we will find all of these tags under the umbrella "div" and define them so the program recognises them as the small summary.
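The small-summary step can be sketched the same way, joining the bullet points found inside the card (again an assumption about the card's markup):

```python
from bs4 import BeautifulSoup

def extract_small_summary(card):
    """Join the bullet-point lines ('li' tags) found under the card's
    'div' into one string, again swapping commas for dashes."""
    items = card.find_all("li")
    text = " ".join(li.get_text(strip=True) for li in items)
    return text.replace(",", "-")
```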

To show the enlarged summary, we will have to grab the link from the title so the program can navigate to the page which contains the full description; we then search for the description's tag via BeautifulSoup and pull all of its text out.
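Once the job's own page has been fetched by following that link, the text-gathering half might look like this. The 'jobDescriptionText' id is my assumption about Indeed's markup, not something confirmed in the text, so check it against the live page:

```python
from bs4 import BeautifulSoup

def extract_large_summary(job_page_html):
    """Given the HTML of a single job's own page (reached by following
    the link grabbed from the title), return the full description text.
    The 'jobDescriptionText' id is an assumption and may differ."""
    soup = BeautifulSoup(job_page_html, "html.parser")
    desc = soup.find(id="jobDescriptionText")
    return desc.get_text(" ", strip=True) if desc else ""
```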

Now we arrive at the main functionality of the program: the build search command. This is what allows the program to act on our requests. The input function asks the user for the job title and location. Furthermore, we must remove the spaces and replace them with dashes, as spaces break the program - this is due to how the URL is built.
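A sketch of that URL-building step, keeping the space-to-dash replacement described above; the exact query-string layout ("?q=...&l=...") is my assumption and may need adjusting against a real results URL:

```python
BASE_URL = "https://www.indeed.co.uk/jobs"  # assumed URL shape

def build_search(job, location):
    """Turn the two answers into a search URL. Spaces are replaced
    with dashes, as described above, because a raw space breaks
    the URL the program builds."""
    job = job.strip().replace(" ", "-")
    location = location.strip().replace(" ", "-")
    return BASE_URL + "?q=" + job + "&l=" + location
```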

Moreover, we will create a loop to make sure a valid input is given. The input function asks "Small or large summary? (none/small/large)", displaying the only answers accepted, and if the question is answered correctly we can move on to the next stage. The next line defines 'exit' as the fourth answer, to which the program replies "GOODBYE" and ends the process. We will apply this structure to the next two questions: "Only show jobs which advertise salary (y/n)..." and "Please enter the number of pages to scrape". For the latter, if the number entered by the user is greater than the number of pages that exist for the specific role, an error notification will alert the user.
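That validation loop can be sketched as a small helper. The 'ask' parameter (defaulting to the built-in input) is my addition so the loop can be exercised without a keyboard:

```python
def ask_choice(question, accepted, ask=input):
    """Keep asking until the answer is one of the accepted strings;
    'exit' prints GOODBYE and ends the process, as described above."""
    prompt = question + " (" + "/".join(accepted) + ")... "
    while True:
        answer = ask(prompt).strip().lower()
        if answer == "exit":
            print("GOODBYE")
            raise SystemExit
        if answer in accepted:
            return answer
        print("Please answer with one of:", "/".join(accepted))
```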

Indeed shows 10 jobs per page by default, so if the user only wants one page, the first line of the next code block does the trick. For page 2 onwards, however, we must build the rest of the URL: page 2 starts at '&start=10', page 3 at '&start=20', and so on.
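That pagination rule is simple enough to sketch in one function:

```python
def page_urls(search_url, pages):
    """Indeed shows 10 jobs per page: page 1 is the search URL itself,
    page 2 appends '&start=10', page 3 '&start=20', and so on."""
    return [search_url] + [search_url + "&start=" + str(10 * n)
                           for n in range(1, pages)]
```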
The next block of code creates a .csv file, which Excel opens, where all the information will be collected and displayed in a table, naming the file after the job title followed by the location of the position. The header is placed on row 0, so the number of rows below it equals the number of jobs scraped.
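A sketch of the file-writing step using Python's standard csv module (the column names are mine, matching the four fields listed at the start):

```python
import csv

def write_results(filename, jobs):
    """Write the scraped jobs to a .csv file that Excel opens directly.
    The header goes on the first row, so the number of rows under it
    equals the number of jobs scraped."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "Location", "Summary", "Salary"])
        writer.writerows(jobs)
```

The filename itself would be built from the job title and the location, e.g. 'data-analyst-London.csv'.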

This next part works out the minimum, maximum, average, median and standard deviation using the standard Python libraries together with the statistics library, and passes them to the 'print nice' function; the figures also appear at the bottom of the Excel table created earlier, and are displayed in the program after the questions.

Finally, we will create a loop which asks the user "Do you wish to start another search (y/n)"; answering yes starts another scrape, while answering no finishes the function, prints 'GOODBYE' and ends the process.
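The top-level loop can be sketched like this; 'run_search' stands in for the whole scrape described above, and the 'ask' parameter is my addition so the loop can be exercised without a keyboard:

```python
def run(run_search, ask=input):
    """Perform a scrape, then ask whether to start another;
    any answer other than 'y' prints GOODBYE and ends."""
    while True:
        run_search()
        again = ask("Do you wish to start another search (y/n)? ")
        if again.strip().lower() != "y":
            print("GOODBYE")
            break
```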

If there are any further questions or doubts about the code, there will be a link at the bottom which contains all the code with further comments and insight to help you finish your web scraper. Now, let's go through an example of the results:
Once you open or create your code in Spyder, your next move is to run the whole file by clicking the green arrow on the top bar or pressing F5. In the bottom-right box, the questions will appear one by one.

After you answer the last question and press Enter, your results will appear.

Finally, when you get your results, open your documents and go to the folder where your Spyder code is kept - in this case 'finishscrapper' - and you will find the Excel file.
