Data scraping
Project Outline
For this project, we are using the online data mining tool phantombuster.com to
download information from thousands of linkedin.com profiles. We will interact with
Phantom Buster through a customized Python script, which significantly speeds up the process.
https://docs.google.com/spreadsheets/d/1oh4hPkSl4DWK1Uml7H4r0DJivd2J7ARzJS6U9XmW6Go/edit#gid=0
This Google Sheet contains a list of Phantom Buster accounts and login credentials.
Column “D” shows the status of each account. If the value is “ready”, the account is
available. If the value is “Used”, the account has already been used and is no longer
available.
Please log in to the first available account in the list.
Now, please open the “phantom-scripts.zip” file. In this folder, you will see 3 scripts:
Phantom.py
Const.py
Save.py
We need to make some edits to these scripts based on some of the “available” account
information.
1. API Key
First, on Phantom Buster, go to user settings, enable developer mode, and click “save
settings”.
Next, go to workspace settings, click “add API key”, and copy this key to your clipboard.
Now open the const.py script from the “phantom-scripts.zip” file and assign the key value
to “API_Key” on line 1.
2. S_Cookie
Second, in the const.py script, on line 2 you will see the S_COOKIE value. Please ensure that
this variable is set to the following value:
- AQEDAR_0OHkFMkeeAAABhrZnkjEAAAGHayx0XE4Aoeb_f7UJL_Pu8G5N5Ni9JRkqTjcPbu
9D9WcBacrAnVYxjWt9Wbky86cWQJcG6Vezjay64Wj31LgagbgV5NZUWmOE9K_tZEga31x
J1T_9kmKj7eV4
3. Launch Cookie
- https://phantombuster.com/automations/sales-navigator/6988/sales-navigator-search-export
Click save again and proceed again despite the error message.
Next, select “Network” from the options at the top of the interface.
Next, whilst this interface is still open, click “save” on the phantom buster settings page.
This will display the following “cookie” value. This is the value we need to copy.
Copy the whole value like so:
Now go to the “save.py” script in the “phantom-scripts” folder. On line 28, you will
see the ‘cookie’ variable.
Replace the cookie value seen there with the cookie value you just copied from the developer
tools.
- https://docs.google.com/spreadsheets/d/1_NShVfrXKqvHcrml4KbtE-5myzcViEirhJlqs0aRz8c/edit#gid=0
Each of these URLs links to a maximum of 2,500 people in the LinkedIn database. Our
Python script feeds these URLs to Phantom Buster so that it can download information
on every person behind each link.
The URLs in column C need to be moved into column B, 5 at a time, each time the
previous 5 URLs have finished scraping.
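The "5 at a time" rule can be sketched in Python to plan the batches before moving anything in the sheet. This is only an illustration: the `batches` helper and the example.com URLs are hypothetical placeholders and are not part of the phantom scripts.

```python
from typing import Iterator, List


def batches(urls: List[str], size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` URLs, preserving order."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Hypothetical placeholder URLs standing in for the contents of column C.
pending = [f"https://example.com/search/{n}" for n in range(1, 13)]
groups = list(batches(pending))

for number, group in enumerate(groups, start=1):
    print(f"Batch {number}: {len(group)} URLs")
```

With 12 pending URLs this gives two full batches of 5 and a final batch of 2, so the last batch in the sheet may be smaller than the others.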
Open a command terminal on your device, navigate to the “phantom-scripts” folder, and run:
- python phantom.py
If you now refresh the Phantom Buster dashboard, you will see that 5 tasks have been
created and that 2 of them have started running.
Click on 1 of the 3 tasks that are not running and launch it, so that a total of 3
tasks are running.
It will take 3-15 minutes for all 3 tasks to finish running. Once a task has finished,
a “success” label will be displayed for it in the dashboard.
Keep 3 tasks running at any one time, and never more than 3. Whenever one of them
finishes, launch the next task.
Once a Phantom has finished running successfully, open it, click “download CSV”, and
save the file in a folder called “phantomData-[“name”]”, where “name” is the value
seen in cell D2 of the URL Google Sheet:
- https://docs.google.com/spreadsheets/d/1_NShVfrXKqvHcrml4KbtE-5myzcViEirhJlqs0aRz8c/edit#gid=0
Once all 5 tasks have finished and you have downloaded the results for each one, you
should have 5 CSV files in the “phantomData-[“name”]” folder.
- https://docs.google.com/spreadsheets/d/1_NShVfrXKqvHcrml4KbtE-5myzcViEirhJlqs0aRz8c/edit#gid=0
Repeat step 5 until all URLs in the Google Sheet have been processed. Then use this tool to
merge all of the CSV files into a single CSV: https://filesmerge.com/merge-csv-files. Save the
merged file in phantomData-[“name”].
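As an alternative to the online tool, the merge could also be done locally with Python's standard csv module. The sketch below is not part of the phantom scripts. The function name `merge_csv_files` and the output filename `merged.csv` are assumptions, and it assumes every downloaded file has a header row.

```python
import csv
from pathlib import Path
from typing import Dict, List


def merge_csv_files(folder: Path, output_name: str = "merged.csv") -> int:
    """Merge every CSV in `folder` into one file; return the number of data rows written."""
    # Skip a previous merge result so it is not merged into itself.
    sources = sorted(p for p in folder.glob("*.csv") if p.name != output_name)
    rows: List[Dict[str, str]] = []
    columns: List[str] = []
    for source in sources:
        with source.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Collect the union of column names, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            rows.extend(reader)
    with (folder / output_name).open("w", newline="", encoding="utf-8") as handle:
        # Columns missing from a given file are written as empty cells.
        writer = csv.DictWriter(
            handle, fieldnames=columns, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Example call (the folder name is a placeholder):
# merge_csv_files(Path("phantomData-name"))
```

Taking the union of column names means the merge still works if individual exports differ slightly in their columns. All rows are held in memory, which should be fine at the scale described here.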
Muhammad – There should be between 85,000 and 100,000 unique rows in the final
output.
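To check the merged file against this expected range, a small Python sketch can count the distinct rows. It treats two rows as duplicates only when every column matches, which is an assumption, since the instructions do not say which column defines uniqueness. The function name `count_unique_rows` is hypothetical.

```python
import csv
from pathlib import Path


def count_unique_rows(csv_path: Path) -> int:
    """Count distinct data rows (header excluded), comparing every column."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return len({tuple(row) for row in reader})


# Example call (the path is a placeholder):
# print(count_unique_rows(Path("phantomData-name") / "merged.csv"))
```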
PLEASE NOTE: if any URLs refuse to run in a Phantom Buster task, please copy each such
URL and paste it into the anomalies txt file, one URL per line, in “phantomData-[“name”]”.
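If you prefer to log failed URLs from Python rather than by hand, a minimal sketch follows. The filename `anomalies.txt` and the function name `record_anomaly` are assumptions, since the instructions only refer to "the anomalies txt file".

```python
from pathlib import Path


def record_anomaly(folder: Path, url: str, filename: str = "anomalies.txt") -> None:
    """Append one URL on its own line to the anomalies file inside `folder`."""
    folder.mkdir(parents=True, exist_ok=True)
    with (folder / filename).open("a", encoding="utf-8") as handle:
        handle.write(url.strip() + "\n")


# Example call (folder name and URL are placeholders):
# record_anomaly(Path("phantomData-name"), "https://example.com/search/7")
```

Opening the file in append mode keeps earlier entries, so the function can be called once per failed URL.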
Project complete.
Please email the “phantomData-[“name”]” folder as a zip file to me@georgegriffiths.uk, with
the following as the email subject: