
Data scraping
Project Outline

For this project, we are using the simple online data mining tool phantombuster.com to
download information from thousands of linkedin.com profiles. We will interact with
Phantom Buster through a customised Python script, which significantly speeds up the process.

Detailed Project Guide

Step 0.5 – Logging in to Phantom Buster.

Please open the following Google Sheet:

https://docs.google.com/spreadsheets/d/
1oh4hPkSl4DWK1Uml7H4r0DJivd2J7ARzJS6U9XmW6Go/edit#gid=0

This Google Sheet contains a list of Phantom Buster accounts and login credentials.

Column “D” shows the status of each account. If the value is set to “ready”, the account
is available. If the status is set to “Used”, the account has already been used and is no
longer available.

Please log in to the first “ready” account that you can see.

Step 1 – Configuring the Python scripts with the “ready” account information.

You have now logged in to Phantom Buster.

Now, please extract the “phantom-scripts.zip” file. In this folder, you will see 3 scripts:

- phantom.py
- const.py
- save.py

We need to make some edits to these scripts based on the “ready” account information.

1. API Key

First, on Phantom Buster, go to user settings, enable developer mode, and click “save
settings”.

Next, go to workspace settings, click “add API key”, and copy this key to your clipboard.

Now open the const.py script from the extracted “phantom-scripts” folder, and assign the
key to the “API_Key” variable on line 1.

2. S_Cookie

Second, in the const.py script, on line 2 you will see the S_COOKIE value. Please ensure that
this variable is set to the following value:
- AQEDAR_0OHkFMkeeAAABhrZnkjEAAAGHayx0XE4Aoeb_f7UJL_Pu8G5N5Ni9JRkqTjcPbu
9D9WcBacrAnVYxjWt9Wbky86cWQJcG6Vezjay64Wj31LgagbgV5NZUWmOE9K_tZEga31x
J1T_9kmKj7eV4

Now, save the file.
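After both edits, the top of const.py should look something like the sketch below. Both values shown are placeholders, not real credentials: paste in the API key you copied from workspace settings and the S_COOKIE value given in step 2 above.

```python
# const.py -- the first two lines hold the credentials for the "ready" account.
# Both values below are placeholders: replace them with the API key copied
# from workspace settings and the S_COOKIE value from step 2.
API_Key = "paste-your-api-key-here"
S_COOKIE = "paste-the-s-cookie-value-here"
```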

3. Launch Cookie

Third and finally, please go to the following page on Phantom Buster:

- https://phantombuster.com/automations/sales-navigator/6988/sales-navigator-search-
export

Click on “Use this phantom”.

Next, click save and proceed despite the error message.

Click save again and proceed again despite the error message.

Click save again and proceed.

Now, you will reach the “settings” page.

Right click the “save” button and choose “inspect”.

Your web browser’s developer tools will now open.

Next, select “Network” from the options at the top of the interface.

Next, whilst this interface is still open, click “save” on the phantom buster settings page.

Next, click on the “save” request in the developer tools interface.

This will display the request’s “cookie” value. This is the value we need. Copy the whole
value.

Now go to the “save.py” script in the “phantom-scripts” folder. On line 28, you will
see the ‘cookie’ variable.

Replace the cookie value seen there with the cookie value you just copied from the developer
tools.

Now, save the file.
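As a rough illustration of why this value matters: a cookie copied from the browser is typically attached to outgoing requests as a “Cookie” header. The sketch below is hypothetical (the real save.py may structure this differently), and the cookie value is a placeholder.

```python
# Hypothetical sketch of the relevant part of save.py: the cookie copied
# from the browser's developer tools is stored on line 28 and attached to
# outgoing requests as a "Cookie" header. The value below is a placeholder.
cookie = "paste-the-cookie-value-from-developer-tools-here"

def build_headers(cookie_value):
    """Build request headers that carry the browser session cookie."""
    return {
        "Cookie": cookie_value,
        "Content-Type": "application/json",
    }

headers = build_headers(cookie)
```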

Step 3 – URLs: understanding the main input data type.

Please navigate to the following Google Sheet:

- https://docs.google.com/spreadsheets/d/1_NShVfrXKqvHcrml4KbtE-
5myzcViEirhJlqs0aRz8c/edit#gid=0

In this sheet, you will see 4 columns:

- Column A: this contains numbers from 1–5.
- Column B: this contains URLs.
- Column C: this also contains URLs.
- Column D.

Each one of these URLs is a link to a maximum of 2,500 people in the LinkedIn database. Our
phantom Python script is going to feed these URLs to Phantom Buster, so that Phantom
Buster can download information on all of the people in each link.

The phantom python script loads 5 URLs at a time from column B.

The URLs in column C will need to be moved over into column B, 5 at a time, once the 5
previous URLs have finished their scraping.
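The batching described above can be sketched as follows. This is a minimal illustration of the five-at-a-time rule only, assuming the script simply walks the URL list in order; the real phantom.py may differ.

```python
BATCH_SIZE = 5  # the script processes 5 URLs at a time from column B

def next_batches(urls):
    """Yield successive batches of up to BATCH_SIZE URLs.

    Each batch corresponds to one round of scraping: the current five
    URLs sit in column B, while the remaining ones wait in column C.
    """
    for start in range(0, len(urls), BATCH_SIZE):
        yield urls[start:start + BATCH_SIZE]

# Example: 12 URLs produce batches of 5, 5 and 2.
batches = list(next_batches([f"https://example.com/search/{i}" for i in range(12)]))
```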

Step 4 – Starting the script and scraping the data.


Now, everything is in place.

Open a command terminal on your device and navigate to the “phantom-scripts” folder.

Type the following terminal command to run the “phantom.py” script:

- python phantom.py

Authenticate using “me@georgegriffiths.uk”.

If you now refresh the phantom buster dashboard, you will see that 5 tasks have been
created, and two tasks have started running.

Click on 1 of the 3 tasks not yet running, and launch it, so that there are now a total of 3
tasks running.

It will take 3-15 minutes for all 3 tasks to finish running. Once each task has finished running,
a “success” label will be displayed in the dashboard for each task.

Ensure that there are always a maximum of 3 tasks running. Therefore, if one of them
finishes, click to launch the next task.
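The “at most 3 running” rule can be sketched as a simple scheduler. This is for illustration only — in practice you launch the tasks by hand in the dashboard, as described above — and the task names and statuses are invented for the example.

```python
MAX_RUNNING = 3  # never let more than 3 Phantom Buster tasks run at once

def tasks_to_launch(statuses):
    """Given a mapping of task name -> status ("running", "queued" or
    "success"), return the queued tasks to launch so that MAX_RUNNING
    tasks are running (or fewer, if the queue is short)."""
    running = [t for t, s in statuses.items() if s == "running"]
    queued = [t for t, s in statuses.items() if s == "queued"]
    free_slots = max(0, MAX_RUNNING - len(running))
    return queued[:free_slots]

# Example: 2 tasks running and 3 queued -> launch exactly 1 more.
statuses = {"task1": "running", "task2": "running",
            "task3": "queued", "task4": "queued", "task5": "queued"}
launch = tasks_to_launch(statuses)
```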

Once each phantom has finished running successfully, open each phantom, click “download
CSV”, and save the file in a folder called “phantomData-[“name”]”, where “name” is the value
in cell D2 of the URL Google Sheet:

- https://docs.google.com/spreadsheets/d/1_NShVfrXKqvHcrml4KbtE-
5myzcViEirhJlqs0aRz8c/edit#gid=0

Step 5 – restarting the script for the next set of URLs.

Once all five tasks have finished, and you have downloaded the results for each one, you
should now have 5 CSV files in the “phantomData-[“name”]” folder.

Next, go back to the sheet containing all the URLs:

- https://docs.google.com/spreadsheets/d/1_NShVfrXKqvHcrml4KbtE-
5myzcViEirhJlqs0aRz8c/edit#gid=0

Replace all 5 URLs in column B with the next 5 URLs in column C.

Now repeat step 4.

Step 6 – Repeat step 5

Repeat step 5 until all URLs in the Google Sheet have been processed. Then use this tool to
merge all of the CSVs together into 1 single CSV: https://filesmerge.com/merge-csv-files. Save this
in the “phantomData-[“name”]” folder.

Muhammad – There should be between 85,000 and 100,000 unique rows in the final
output.

PLEASE NOTE: if there are any URLs that refuse to run in a Phantom Buster task, please copy
each URL and paste it into the anomalies txt file, one per line, in “phantomData-[“name”]”.

Step 7 - Project complete

Project complete.
Please email the “phantomData-[“name”]” folder as a zip file to me@georgegriffiths.uk, with
the following as the email subject:

- Phantom scraping complete: “phantomData-[“name”]”
