Data Analytics using Python Module IV

Data Analytics Using Python

Module 4: Web Scraping and Numerical Analysis

Presented By,
Kavya HV
Asst Professor
Dept. of MCA
PESITM

Kavya HV, Dept. of MCA, PESITM, Shimoga Page 1



What is Web Scraping?


• Web scraping is a technique for fetching data from websites.
• Many websites do not let the user save their data for personal use.
• One option is to copy and paste the data manually, which is both tedious and time-consuming.
• Instead, with the help of some libraries, we can extract the data from websites automatically.
• This task can also be done with web scraping software, known as web scrapers.
• Web scrapers automatically load and extract data from websites based on user requirements.


Techniques of Web Scraping:

There are two ways of extracting data from websites: the manual extraction technique and the automated extraction technique.

Manual Extraction Techniques: Manually copy-pasting the site content comes under this technique. It is a tedious, time-consuming, and repetitive task.

Automated Extraction Techniques: Web scraping software is used to automatically extract data from sites based on user requirements.


Web scraping is used for:

1. Price Monitoring:

Companies can use web scraping to collect product data for their own products and for competing products, and to see how it affects their pricing strategies.

2. Market Research:

Web scraping can be used for market research by companies. High-quality web scraped data obtained in large volumes can be very helpful for companies in analyzing consumer trends.

3. News Monitoring:

Web scraping news sites can provide detailed reports on the current news to a company. This is even more essential for companies that are frequently in the news or that depend on daily news for their day-to-day functioning.

Data Acquisition by scraping web applications:

• First, we have to install the Requests library.
• Requests is a library used for making HTTP requests to a specific URL; it returns the response.
• Requests provides built-in functionality for managing both the request and the response.

pip install requests

Making a Request:

• The Python requests module has several built-in methods for making HTTP requests to a specified URI, such as GET, POST, and PUT.
• An HTTP request is meant either to retrieve data from a specified URI or to push data to a server.
• HTTP works as a request-response protocol between a client and a server. Here we will be using the GET request.
• A GET request is used to retrieve information from the given server using a given URI. The GET method sends the encoded user information appended to the URL as a query string.
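
To see how a GET request carries user information in the URL, the sketch below builds a request without sending it over the network. The host `example.com`, the path `/search`, and the parameter names `q` and `page` are illustrative assumptions and do not come from the slides.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration
base_url = "https://example.com/search"
params = {"q": "web scraping", "page": 2}

# Build the request and prepare it; nothing is sent over the network
prepared = requests.Request("GET", base_url, params=params).prepare()

# The parameters are URL-encoded and appended to the URL as a query string
print(prepared.method)
print(prepared.url)
```

When the request should actually be sent, the same dictionary can be passed as `requests.get(base_url, params=params)`.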

Example:

import requests

# Specify the URL we want to make a GET request to
url = 'https://www.w3schools.com'

# Make the GET request
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Print the content of the response
    print("Response content:")
    print(response.content)
else:
    # Print an error message if the request was not successful
    print(f"Error: {response.status_code}")

Output: [screenshot of the printed response content]

BeautifulSoup Library:

• BeautifulSoup is a Python library used to pull data out of HTML and XML files for web scraping purposes.
• It can extract tables, lists, and paragraphs, and you can also apply filters to extract specific information from web pages.
• BeautifulSoup does not fetch the web page for us, so we use Requests to download the page first.


To import:

from bs4 import BeautifulSoup
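
As a minimal sketch of the extraction described above, the example below parses a small inline HTML string instead of a downloaded page, so it runs without network access. The HTML content, the class name `note`, and all variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Small inline HTML document, made up for this illustration
html = """
<html><body>
  <h1>Price list</h1>
  <p class="note">Prices updated daily.</p>
  <p>Contact us for bulk orders.</p>
  <ul>
    <li>Pen</li>
    <li>Notebook</li>
  </ul>
  <table>
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Pen</td><td>10</td></tr>
    <tr><td>Notebook</td><td>45</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract the text of every paragraph
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Apply a filter: only paragraphs with the CSS class "note"
notes = [p.get_text() for p in soup.find_all("p", class_="note")]

# Extract the list items
items = [li.get_text() for li in soup.find_all("li")]

# Extract the table as a list of rows, each row a list of cell texts
rows = [
    [cell.get_text() for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]

print(paragraphs)
print(notes)
print(items)
print(rows)
```

With a real page, `html` would typically be replaced by the `content` of a response obtained with Requests.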

Submitting a form:
Steps for submitting a form:

• Import the webdriver from selenium
• Create a driver instance by specifying the browser
• Find the form elements
• Send the values to the elements
• Use the click function to submit

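import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the driver instance for the Chrome browser
driver = webdriver.Chrome()
driver.maximize_window()

# Open the login form
driver.get('https://the-internet.herokuapp.com/login')

# Find the form elements (the login page uses the ids "username" and "password")
username = driver.find_element(By.ID, "username")
password = driver.find_element(By.ID, "password")
submit = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")

# Send the values to the elements
username.send_keys('Ruthvi')
password.send_keys('Ruthvi@798')

# Use the click function to submit
submit.click()

# Keep the browser open briefly to see the result, then close it
time.sleep(3)
driver.quit()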

Fetching web pages:


import requests
from bs4 import BeautifulSoup

import creds  # local module that defines username and password

loginurl = 'https://the-internet.herokuapp.com/authenticate'
secure_url = 'https://the-internet.herokuapp.com/secure'

payload = {
    'username': creds.username,
    'password': creds.password,
}

# A session keeps the login cookie between requests
with requests.Session() as s:
    # Log in by posting the form data
    s.post(loginurl, data=payload)

    # Fetch the page that is only available after login
    r = s.get(secure_url)

    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup.prettify())


Downloading web pages through form submission:

from selenium import webdriver
import time
from selenium.webdriver.common.by import By

# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
driver.maximize_window()
time.sleep(3)

# Navigate to the form page
driver.get('https://the-internet.herokuapp.com/login')

# Locate form elements
username_field = driver.find_element(By.ID, "username")
password_field = driver.find_element(By.ID, "password")
submit_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")

# Fill in form fields
username_field.send_keys('Ruthvi')
password_field.send_keys('Ruthvi@798')

# Submit the form
submit_button.click()
time.sleep(3)

# Locate the status message on the resulting page
# (assumes the page shows it in an element with id "flash")
welcome_message = driver.find_element(By.ID, "flash")

# Print or use the scraped values
print(type(welcome_message))
html_content = welcome_message.get_attribute('outerHTML')

# Print the HTML content
print("HTML Content:", html_content)

# Close the browser
driver.quit()

Introduction to NumPy:
• NumPy is a Python library used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• NumPy was created in 2005 by Travis Oliphant. It is an open source project and we can use it freely.
• NumPy stands for Numerical Python.
• In Python we have lists that serve the purpose of arrays, but they are slow to process.
• NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
• The array object in NumPy is called ndarray. It comes with many supporting functions that make working with ndarray very easy.


To import NumPy:

import numpy as np
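
As a minimal sketch of the ndarray object, the example below converts an ordinary Python list into an array and applies arithmetic to all elements at once. The values and variable names are made up for illustration.

```python
import numpy as np

prices = [10, 45, 30, 25]   # ordinary Python list
arr = np.array(prices)      # the same values as an ndarray

# Arithmetic applies to every element at once, with no explicit loop
doubled = arr * 2

print(type(arr).__name__)
print(doubled.tolist())
print(arr.sum(), arr.mean())
```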

Basics of NumPy Arrays:


Data manipulation in Python is nearly synonymous with NumPy array manipulation.
A few categories of basic array manipulations are:
1) Attributes of arrays: Determining the size, shape, memory consumption, and data types of arrays.
2) Indexing of arrays: Getting and setting the value of individual array elements.
3) Slicing of arrays: Getting and setting smaller subarrays within a larger array.
4) Reshaping of arrays: Changing the shape of a given array.
5) Joining and splitting of arrays: Combining multiple arrays into one, and splitting one array into many.

Attributes of Arrays:
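
A minimal sketch of the common array attributes, using a small 3 x 4 integer array chosen only for illustration (the explicit `int64` data type is an assumption made so that the sizes are the same on every platform):

```python
import numpy as np

# 12 integers arranged as 3 rows and 4 columns
x = np.arange(12, dtype=np.int64).reshape(3, 4)

print("ndim:    ", x.ndim)      # number of dimensions
print("shape:   ", x.shape)     # size of each dimension
print("size:    ", x.size)      # total number of elements
print("dtype:   ", x.dtype)     # data type of the elements
print("itemsize:", x.itemsize)  # bytes per element
print("nbytes:  ", x.nbytes)    # total bytes used by the elements
```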

Indexing of Arrays:
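
A minimal sketch of getting and setting individual elements, using made-up values. Indices start at 0, negative indices count from the end, and a two-dimensional array takes a (row, column) pair.

```python
import numpy as np

a = np.array([5, 0, 3, 3, 7, 9])
first = a[0]     # first element
last = a[-1]     # last element

m = np.array([[3, 5, 2, 4],
              [7, 6, 8, 8],
              [1, 6, 7, 7]])
value = m[1, 2]  # row 1, column 2

# Setting a value uses the same index notation
m[0, 0] = 12

print(first, last, value)
print(m)
```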


Slicing of Arrays:
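
A minimal sketch of slicing with the `start:stop:step` notation, using made-up arrays. Note that a slice is a view of the original data, so changing the subarray also changes the larger array.

```python
import numpy as np

x = np.arange(10)         # 0, 1, ..., 9
first_five = x[:5]        # elements 0 to 4
every_other = x[::2]      # every second element
reversed_x = x[::-1]      # all elements in reverse order

grid = np.arange(12).reshape(3, 4)
sub = grid[:2, :3]        # first two rows, first three columns

# A slice is a view: modifying it modifies the original array
sub[0, 0] = 99

print(first_five, every_other, reversed_x)
print(sub)
print(grid)
```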

Concatenation of Arrays:
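
A minimal sketch of joining arrays with `np.concatenate`, `np.vstack`, and `np.hstack`, using made-up values:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])

# Join two one-dimensional arrays end to end
joined = np.concatenate([x, y])

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# Join along the first axis (rows) and along the second axis (columns)
by_rows = np.concatenate([grid, grid])
by_cols = np.concatenate([grid, grid], axis=1)

# vstack and hstack are convenient for arrays of mixed dimensions
stacked_v = np.vstack([x, grid])
stacked_h = np.hstack([grid, np.array([[99], [99]])])

print(joined)
print(by_rows.shape, by_cols.shape)
print(stacked_v)
print(stacked_h)
```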


Splitting of Arrays:
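
A minimal sketch of splitting arrays with `np.split`, `np.vsplit`, and `np.hsplit`, using made-up values. The list passed as the second argument gives the indices at which the array is cut.

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])

# Cut before index 3 and before index 5, giving three pieces
x1, x2, x3 = np.split(x, [3, 5])

grid = np.arange(16).reshape(4, 4)

# Split into upper and lower halves (by rows)
upper, lower = np.vsplit(grid, [2])

# Split into left and right halves (by columns)
left, right = np.hsplit(grid, [2])

print(x1, x2, x3)
print(upper)
print(lower)
print(left)
print(right)
```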
