Scraping Instagram With Python
from selenium import webdriver

username = 'pickuplimes'
browser = webdriver.Chrome('/path/to/chromedriver')
browser.get('https://www.instagram.com/' + username + '/?hl=en')
# Scroll to the bottom of the page so more posts load.
pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
If you want to open a hashtag page instead:
hashtag = 'food'
browser = webdriver.Chrome('/path/to/chromedriver')
browser.get('https://www.instagram.com/explore/tags/' + hashtag)
pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
3. Parse the HTML source page: open the page source and parse it with Beautiful Soup. Walk through the body of the HTML, extract the link for each image on the page, and append it to an empty list, links.
import re
from bs4 import BeautifulSoup as bs

links = []
source = browser.page_source
data = bs(source, 'html.parser')
body = data.find('body')
script = body.find('span')
for link in script.findAll('a'):
    if re.match("/p", link.get('href')):
        links.append('https://www.instagram.com' + link.get('href'))

# Scroll part of the way back up so more posts load.
pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight/1.5);")
import time

links = []
source = browser.page_source
data = bs(source, 'html.parser')
body = data.find('body')
script = body.find('span')
for link in script.findAll('a'):
    if re.match("/p", link.get('href')):
        links.append('https://www.instagram.com' + link.get('href'))

# Sleep time is required: without it Instagram may interrupt the
# script and the page won't scroll through.
time.sleep(5)
pagelength = browser.execute_script("window.scrollTo(document.body.scrollHeight/1.5, document.body.scrollHeight/3.0);")
source = browser.page_source
data = bs(source, 'html.parser')
body = data.find('body')
script = body.find('span')
for link in script.findAll('a'):
    if re.match("/p", link.get('href')):
        links.append('https://www.instagram.com' + link.get('href'))
This may not be the most efficient way to scroll through pages. I haven't tried other methods, but you could instead read end_cursor and has_next_page (True or False) from the response and loop until has_next_page is False.
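That cursor-based alternative can be sketched as a generic loop: keep requesting the next batch while has_next_page is true, passing end_cursor each time. The fetch_page callable below is a hypothetical placeholder for whatever request returns Instagram's paginated JSON; only the loop structure is the point here.

```python
def collect_all_posts(fetch_page):
    """Cursor-based pagination sketch.

    `fetch_page(cursor)` is a hypothetical callable returning a dict
    shaped like Instagram's paginated payload:
    {'edges': [...], 'page_info': {'has_next_page': bool, 'end_cursor': str}}
    """
    posts = []
    cursor = None  # first request has no cursor
    while True:
        data = fetch_page(cursor)
        posts.extend(edge['node'] for edge in data['edges'])
        page_info = data['page_info']
        if not page_info['has_next_page']:
            break
        cursor = page_info['end_cursor']  # resume from here next time
    return posts
```

The same loop works for profile and hashtag pages; only the request inside fetch_page would differ.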
import json
from urllib.request import urlopen

import pandas as pd

result = pd.DataFrame()
for i in range(len(links)):
    try:
        page = urlopen(links[i]).read()
        data = bs(page, 'html.parser')
        body = data.find('body')
        script = body.find('script')
        # The first <script> tag holds the shared-data JSON.
        raw = script.text.strip().replace('window._sharedData =', '').replace(';', '')
        json_data = json.loads(raw)
        posts = json_data['entry_data']['PostPage'][0]['graphql']
        # Flatten the nested JSON into a single row of columns.
        x = pd.DataFrame.from_dict(pd.json_normalize(posts), orient='columns')
        x.columns = x.columns.str.replace("shortcode_media.", "", regex=False)
        result = pd.concat([result, x], ignore_index=True)
    except Exception:
        # Skip posts that fail to load or parse.
        continue
Then drop any duplicate posts and reset the index:
result = result.drop_duplicates(subset = 'shortcode')
result.index = range(len(result.index))
The columns you get might differ slightly between a user profile page and a hashtag page. Check out the columns and filter whatever you need.
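As a sketch of that filtering step (the toy frame and the like-count column name here are assumptions; print result.columns to see what your scrape actually produced):

```python
import pandas as pd

# Toy frame standing in for `result`; real column names may differ.
result = pd.DataFrame({
    'shortcode': ['abc', 'xyz'],
    'display_url': ['https://example.com/1.jpg', 'https://example.com/2.jpg'],
    'edge_media_preview_like.count': [10, 25],  # hypothetical column
})

# Keep only the columns present that you need for downloading.
keep = [c for c in ['shortcode', 'display_url'] if c in result.columns]
filtered = result[keep]
```

Filtering via a membership check like this keeps the same code working whether the scrape came from a profile page or a hashtag page.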
import os
import requests

result.index = range(len(result.index))
directory = "/directory/you/want/to/save/images/"
for i in range(len(result)):
    r = requests.get(result['display_url'][i])
    with open(os.path.join(directory, result['shortcode'][i] + ".jpg"), 'wb') as f:
        f.write(r.content)
Thanks for reading, and I hope you find this article useful. If you have any questions, I'd be more than happy to discuss them.