
Babu Banarasi Das

National Institute of Technology & Management


Lucknow

Department of I.T.
An
INDUSTRIAL TRAINING REPORT
On
Python using Data Science
Project Internship Title: “Internship Finder Application”
Training Organization: HCL TSS, Lucknow
[Sub code: RIT-753]
Semester: 7th
Section: IT-41
Session: 2019-2020

Submitted to:                                Submitted by:

Mr. Asit Gahalaut                            Ameesha Saxena
(Assistant Professor)                        Roll No.: 1605413010
ACKNOWLEDGEMENT

I would like to acknowledge all those who have given me moral support
and helped in making this project a success.

Firstly, I would like to profess my deep gratitude to the final-year industrial
training coordinators of BBDNITM, Mr. A.K. Gahalaut and Mr. Anurag Tiwari,
whose valuable suggestions and constant encouragement helped me shape this
report into a good output.

I am extremely thankful to HCL TSS Lucknow, which gave me the golden
opportunity of completing my summer internship at its educational Skill
Development Centre.

I would like to express my gratitude to the senior manager, Mr. Hazari Singhal,
for not only providing me the opportunity to work on this project, but also for
his support and encouragement throughout the internship and his guidance in
learning and understanding the new objectives.

I also wish to express my sincere gratitude to our project guide, Mr. Manmeet
Jalota of HCL TSS, for his valuable guidance and support in completing this
project.

I wish to express my gratitude to my teammate, Harshit Bhadouria, who guided
and helped me to understand and relate the different concepts. His suggestions,
and our teamwork as a whole, helped me achieve the desired goal of completing
the project.

I would also like to appreciate the guidance given by the other supervisors and
the panels, who motivated me and built my confidence in my presentation skills.

And finally, I would like to offer many thanks to all my colleagues for their
valuable suggestions and constructive feedback.

Ameesha Saxena (IT-41)


TABLE OF CONTENTS

S.No. Topic

1. Training Objective

2. Training Organization Details

3. Introduction to Python with Data Science

3.1 Introduction to Python

3.1.1 History

3.1.2 Features

3.1.3 Python 3-“Based on Libraries”

3.1.4 Application and Uses

3.2 Introduction to Data Science

3.2.1 Data Analysis

3.2.2 Data Mining

3.2.3 Business Intelligence

3.2.4 Data Visualization

3.2.5 Data Modification

3.2.6 Data Manipulation

4. Detailed Study Plan and Schedule

4.1 Training Schedule Chart

4.2 Study 1

4.2.1 GUI Used

4.3 Study 2

4.3.1 Web Scraping

4.4 Study 3

4.4.1 Creation and storage into CSV File

4.5 Study 4

4.5.1 Displaying of the data

5. Project Internship and Learning

5.1 Project Title: “Internship Finder Application” and Introduction

5.2 Problem Definition

6. Detailed Project Installation to the Development Phases

6.1 Customer Requirement Specification

6.2 Architecture and Design Model

6.3 Entity Relationship Diagram

6.4 Table Design

6.5 Snapshots of Pages

6.6 Display Window

6.7 Snapshots of Database Tables

6.8 Project Directory Structure

7. Conclusion

8. References
CHAPTER 1
TRAINING OBJECTIVE
The main aim behind the industrial training was to develop and enhance project
skills. The basic focus was hard work and perseverance towards the specific
motive of learning something new. We learn, we implement and finally we
produce, which leads to a beneficial output. The Skill Development Centre at
the HCL TSS organization focuses on what the trainee wants to be trained in.

I chose Python as both the backend and the frontend language in this project.
The major focus is on data science, a concept trending nowadays, which helps
enhance a project with new technology-based development.

The major objectives of this internship programme are highlighted below:

1. The main aim was to learn a trending technology, Python 3, with which
coding becomes simpler and more efficient.
2. The concept of data science is so influential that it focuses on the major
concepts of scraping, modification, analysis, displaying and many more.
3. Learning Python with data science helped me to complete my project
during the internship sessions.
4. Learning this new technology took me a step ahead towards developing
my career.
5. The internship not only developed the keen potential to learn something
new but also to produce a productive output.
6. It developed my interest in what I have newly learned and what I will
follow up on in future for better analysis.
7. Python helped me to learn the concepts from installation to the final
binding.
8. It helped in data visualization, data modification, data analysis and the
displaying of data.
9. A major objective was to get to know the working environment of the
corporate sector.

CHAPTER 2
TRAINING ORGANISATION DETAILS

HCL Training & Staffing Services (HCL TSS) is a subsidiary of HCL
Technologies Limited created with a vision to provide a trained and skilled
workforce through its multiple training and hiring programs. Given the ever-
increasing demand for quality talent within HCL, there was a significant need
to create a talent pool that would be equipped with the requisite expertise and
be technically and professionally prepared to join the highly specialized
workforce at HCL. Thus was conceived the idea of HCL TSS, with the
objective of becoming the largest integrated talent-solutions company in India,
preparing a skilled workforce for the future.
Recruitment experts have noted that the skill gap across industries is primarily
an education issue – one that creates a mismatch between what gets taught to
aspiring students in institutions and the expectations awaiting them in the real-
life job environment. HCL TSS creates the much-needed bridge between
deserving talent across the country and vacant jobs that are difficult to fill due
to a crisis of skills. HCL TSS offers best-in-class, skill-based training programs
for entry-level job roles across HCL. Candidates interested in kick-starting
their IT career with HCL can apply for its fee-based training and hiring
programs. HCL TSS offers training programs for science graduates and
engineering graduates / postgraduates.
HCL Technologies Limited is an Indian multinational information
technology (IT) service and consulting company headquartered in Noida, Uttar
Pradesh. It is a subsidiary of HCL Enterprise. Originally a research and
development division of HCL, it emerged as an independent company in 1991
when HCL entered into the software services business.
The company has offices in 42 countries including the United Kingdom, the
United States, France, and Germany with a worldwide network of R&D,
"innovation labs" and "delivery centers", and 137,000+ employees and its
customers include 250 of the Fortune 500 and 650 of the Global 2000
companies. It operates across sectors including aerospace and defense,
automotive, banking, capital markets, chemical and process industries, energy
and utilities, healthcare, hi-tech, industrial manufacturing, consumer goods,
insurance, life sciences, manufacturing, media and entertainment, mining and
natural resources, oil and gas, retail, telecom, and travel, transportation,
logistics and hospitality.

HCL Technologies is on the Forbes Global 2000 list. It is among the top 20
largest publicly traded companies in India, with a market capitalisation of
$18.7 billion as of May 2017. As of May 2018, the company, along with its
subsidiaries, had a consolidated revenue of $7.8 billion.
HCL Enterprise was founded in 1976.
The first three subsidiaries of parent HCL Enterprise were:

 HCL Technologies - originally HCL's R&D division, it emerged as a
subsidiary in 1991
 HCL Infosystems
 HCL Healthcare
The company tried to stay focused on hardware but, via HCL Technologies,
software and services are now a main focus.
Revenues for 2007 were US$4.9 billion.
Revenues for 2017 were US$6.5 billion, and HCL employed over 105,000
professionals in 31 countries.
Revenues for 2018 were US$9 billion, and HCL employed over 110,000
professionals in 31 countries. A unit named HCL Enterprise Solutions (India)
Limited was formed in July 2001.
Currently HCL Technologies is a subsidiary of Vamasundari (Delhi) through a
chain of entities in between. Vamasundari (Delhi) is owned by Shiv Nadar and
it in turn holds the majority of shares in most HCL group companies.
On 1 July 2019, HCL Technologies acquired a select few products of IBM.
HCL Technologies took full ownership of research and development, sales,
marketing, delivery, and support for AppScan, BigFix, Commerce,
Connections, Digital Experience (Portal and Content Manager), Notes Domino,
and Unica.
Business lines

1. Applications Services and Systems Integrations


2. BPO/Business Services: This division has "delivery centers" in India, the
Philippines, Latin America, USA, HCL BPO Northern Ireland, and
Europe.
3. Engineering and R&D Services (ERS)
4. Infrastructure Management Services (IMS)
5. IoT WoRKS
6. DRYiCE

CHAPTER 3

INTRODUCTION TO PYTHON WITH DATA SCIENCE

3.1 INTRODUCTION TO PYTHON-


Python is an interpreted, high-level, general-purpose programming language.
Created by Guido van Rossum and first released in 1991, Python's design
philosophy emphasizes code readability with its notable use of significant
whitespace. Its language constructs and object-oriented approach aim to help
programmers write clear, logical code for small and large-scale projects.
Python is dynamically typed and garbage-collected. It supports
multiple programming paradigms, including procedural, object-oriented,
and functional programming. Python is often described as a "batteries included"
language due to its comprehensive standard library.

3.1.1 HISTORY-
Python was conceived in the late 1980s by Guido van Rossum at Centrum
Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC
language (itself inspired by SETL), capable of exception handling and
interfacing with the Amoeba operating system.
Python 3.0 was released on 3 December 2008. It was a major revision of the
language that is not completely backward-compatible. Python 3 includes the
2to3 utility, which automates (at least partially) the translation of Python 2
code to Python 3.

3.1.2 FEATURES-
Python is a multi-paradigm programming language. Object-oriented
programming and structured programming are fully supported, and many of its
features support functional programming and aspect-oriented programming
(including metaprogramming and metaobjects (magic methods)). Many other
paradigms are supported via extensions, including design by contract and logic
programming.
Python uses dynamic typing, and a combination of reference counting and a
cycle-detecting garbage collector for memory management. It also features
dynamic name resolution (late binding), which binds method and variable
names during program execution.

Python's design offers some support for functional programming in
the Lisp tradition. It has filter, map, and reduce functions; list
comprehensions, dictionaries, sets and generator expressions. The standard
library has two modules (itertools and functools) that implement functional
tools borrowed from Haskell and Standard ML.
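The functional tools named above can be sketched in a few lines:

```python
# Functional-style tools in Python: filter, map, reduce, and a comprehension.
from functools import reduce

numbers = [1, 2, 3, 4, 5]

evens = list(filter(lambda n: n % 2 == 0, numbers))    # keep only even values
squares = list(map(lambda n: n * n, numbers))          # transform each value
total = reduce(lambda a, b: a + b, numbers)            # fold the list into one value
square_set = {n * n for n in numbers}                  # set comprehension

print(evens)    # [2, 4]
print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```
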

 Beautiful is better than ugly.


 Explicit is better than implicit.
 Simple is better than complex.
 Complex is better than complicated.
 Readability counts.
In contrast to Perl's "there is more than one way to do it" motto, Python
embraces a "there should be one—and preferably only one—obvious way to do
it" design philosophy.

3.1.3 Python 3: “Based on Libraries”


Python's large standard library, commonly cited as one of its greatest
strengths, provides tools suited to many tasks. For Internet-facing applications,
many standard formats and protocols such as MIME and HTTP are supported. It
includes modules for creating graphical user interfaces, connecting to relational
databases, generating pseudorandom numbers, arithmetic with arbitrary-
precision decimals, manipulating regular expressions, and unit testing.
Some parts of the standard library are covered by specifications (for example,
the Web Server Gateway Interface (WSGI) implementation wsgiref follows
PEP 333), but most modules are not. Commonly covered areas include:
 Web frameworks
 Multimedia
 Databases
 Networking
 Test frameworks
 Automation
 Web scraping
 Documentation
 System administration
 Scientific computing
 Text processing
 Image processing

3.1.4 APPLICATIONS AND USES:
Python can serve as a scripting language for web applications, e.g.,
via mod_wsgi for the Apache web server. With the Web Server Gateway
Interface, a standard API has evolved to facilitate these applications. Web
frameworks like Django, Pylons, Pyramid, TurboGears, web2py, Tornado,
Flask, Bottle and Zope support developers in the design and maintenance of
complex applications.
Libraries such as NumPy, SciPy and Matplotlib allow the effective use of
Python in scientific computing, with specialized libraries such
as Biopython and Astropy providing domain-specific functionality.

3.2 INTRODUCTION TO DATA SCIENCE-


Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights from
structured and unstructured data. Data science is the same concept as data
mining and big data: "use the most powerful hardware, the most powerful
programming systems, and the most efficient algorithms to solve problems".
Data science is a "concept to unify statistics, data analysis, machine
learning and their related methods" in order to "understand and analyze actual
phenomena" with data. It employs techniques and theories drawn from many
fields within the context of mathematics, statistics, computer science,
and information science. Turing Award winner Jim Gray imagined data science
as a "fourth paradigm" of science (empirical, theoretical, computational and
now data-driven) and asserted that "everything about science is changing
because of the impact of information technology" and the data deluge. In 2015,
the American Statistical Association identified database management, statistics
and machine learning, and distributed and parallel systems as the three
emerging foundational professional communities.
In 2012, when Harvard Business Review called it "The Sexiest Job of the 21st
Century", the term "data science" became a buzzword. It is now often used
interchangeably with earlier concepts like business analytics, business
intelligence, predictive modelling, and statistics. Even the suggestion that data
science is sexy was paraphrasing Hans Rosling, featured in a 2011 BBC
documentary with the quote, "Statistics is now the sexiest subject around."
Nate Silver referred to data science as a sexed-up term for statistics. In many
cases, earlier approaches and solutions are now simply rebranded as "data
science" to be more attractive, which can cause the term to become "dilute[d]
beyond usefulness". While many university programs now offer a data science
degree, there exists no consensus on a definition or suitable curriculum
contents. To its discredit, however, many data-science and big-data projects
fail to deliver useful results, often as a result of poor management and
utilization of resources.

3.2.1 DATA ANALYSIS-

Data analysis is the process of evaluating data using analytical and statistical
tools to discover useful information and aid in business decision making. There
are several data analysis methods, including data mining, text analytics,
business intelligence and data visualization.

Data analysis is a part of a larger process of deriving business intelligence. The


process includes one or more of the following steps:
 Defining Objectives: Any study must begin with a set of clearly defined
business objectives. Much of the decisions made in the rest of the process
depends on how clearly the objectives of the study have been stated.
 Posing Questions: An attempt is made to ask a question in the problem
domain. For example, do red sports cars get into accidents more often than
others?
 Data Collection: Data relevant to the question must be collected from the
appropriate sources. In the example above, data might be collected from a
variety of sources including: DMV or police accident reports, insurance
claims and hospitalization details. When data is being collected using
surveys, a questionnaire to be presented to the subjects is needed. The
questions should be appropriately modeled for the statistical method being
used.
 Data Wrangling: Raw data may be collected in several different formats.
The collected data must be cleaned and converted so that data analysis tools
can import it. For our example, we may receive DMV accident reports as
text files, insurance claims from a relational database and hospitalization
details as an API. The data analyst must aggregate these different forms of
data and convert it into a form suitable for the analysis tools.
 Data Analysis: This is the step where the cleaned and aggregated data is
imported into analysis tools. These tools allow you to explore the data, find
patterns in it, and ask and answer what-if questions. This is the process by
which sense is made of data gathered in research by proper application of
statistical methods.
 Drawing Conclusions and Making Predictions: This is the step where,
after sufficient analysis, conclusions can be drawn from the data and
appropriate predictions can be made. These conclusions and predictions
may then be summarized in a report delivered to end-users.
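The steps above can be sketched in miniature with the standard library alone. The records below are invented stand-ins for the accident data in the running example, not real collected data:

```python
# A miniature version of the pipeline above: collect raw records in mixed
# formats, wrangle them into one shape, then analyze with basic statistics.
import statistics

# Data collection: raw records arrive in different formats.
csv_rows = ["red,2", "blue,0", "red,3"]                      # e.g. from a text file
api_records = [{"color": "red", "accidents": 1},             # e.g. from an API
               {"color": "blue", "accidents": 1}]

# Data wrangling: convert everything to one common structure.
records = [{"color": c, "accidents": int(n)}
           for c, n in (row.split(",") for row in csv_rows)] + api_records

# Data analysis: compare accident counts by group.
red = [r["accidents"] for r in records if r["color"] == "red"]
other = [r["accidents"] for r in records if r["color"] != "red"]
red_mean, other_mean = statistics.mean(red), statistics.mean(other)

# Drawing conclusions: summarize for the report.
print(f"red cars: {red_mean:.1f} accidents on average, others: {other_mean:.1f}")
```
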

3.2.2 DATA MINING-
Data mining is a method of data analysis for discovering patterns in large data
sets using the methods of statistics, artificial intelligence, machine learning and
databases. The goal is to transform raw data into understandable business
information. This might include identifying groups of data records (also
known as cluster analysis), or identifying anomalies and dependencies between
data groups.

Applications of data mining:


 Anomaly detection can process huge amounts of data (“big data”) and
automatically identify outlier cases, possibly for exclusion from decision
making or detection of fraud (e.g. bank fraud).
 Learning customer purchase habits. Machine learning techniques can be
used to model customer purchase habits and determine frequently bought
items.
 Clustering can identify previously unknown groups within the data.
 Classification is used to automatically classify data entries into pre-
specified bins. A common example is classifying email messages as
“spam” or “not-spam” and having the system learn from the user.
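As a small sketch of the anomaly-detection application above, the following flags values far from the mean using a z-score; the transaction amounts are made up for illustration:

```python
# Flag outliers in a list of transaction amounts using a z-score:
# values more than 2 standard deviations from the mean are anomalies.
import statistics

amounts = [40, 55, 47, 62, 51, 980, 58, 49]  # one suspicious transaction

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

anomalies = [a for a in amounts if abs(a - mean) / stdev > 2]
print(anomalies)  # the 980 transaction stands out
```
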

3.2.3 BUSINESS INTELLIGENCE-


Business intelligence transforms data into actionable intelligence for business
purposes and may be used in an organization’s strategic and tactical business
decision making. It offers a way for people to examine trends from collected
data and derive insights from them. Uses include:
 An organization’s operating decisions such as product placement and
pricing.
 Identifying new markets, assessing the demand and suitability of products
for different market segments.
 Budgeting and rolling forecasts.
 Using visual tools such as heat maps, pivot tables and geographical
mapping.

3.2.4 DATA VISUALIZATION-


Data visualization refers very simply to the visual representation of data. In the
context of data analysis, it means using the tools of statistics, probability, pivot
tables and other artifacts to present data visually. It makes complex data more
understandable and usable.

Increasing amounts of data are being generated by a number of sensors in the
environment (referred to as “Internet of Things” or “IOT”). This data (referred
to as “big data”) presents challenges in understanding which can be eased by
using the tools of Data visualization. Data visualization is used in the following
applications.
 Extracting summary data from the raw data of IOT.
 Using a bar chart to represent sales performance over several quarters.
 A histogram shows distribution of a variable such as income by dividing
the range into bins.
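The histogram application above can be sketched without a plotting library by doing the same binning by hand (a library such as Matplotlib would draw the bars instead of printing them); the income values are invented:

```python
# Bin incomes into ranges and print a text histogram: the same binning
# step a plotting library performs before drawing the bars.
from collections import Counter

incomes = [12, 25, 31, 45, 47, 52, 58, 63, 78, 91]
bin_width = 25

# Map each value to the start of its bin, then count per bin.
bins = Counter((income // bin_width) * bin_width for income in incomes)
for start in sorted(bins):
    print(f"{start:3d}-{start + bin_width - 1:3d} | {'#' * bins[start]}")
```
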

3.2.5 DATA MODIFICATION-


Data modification means modifying or editing the data that has been scraped.
It occurs when a saved (or stored) value in a computer is changed to a different
value.

So if data is manipulated then stored in the same place it is modified.


A simple example is a spreadsheet:

 If the value in column A1 is 100, and you change that value you are
modifying it.
 If you create a formula in A2 that uses the value in A1, you are
manipulating it, storing the result in A2.
 If you write a macro that picks up the value in A1, runs a calculation
then saves the result back in A1, you are both manipulating and
modifying it.
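The spreadsheet example above can be mimicked with a plain dictionary standing in for the sheet; the cell names follow the text, and nothing here is taken from the project code:

```python
# Model the A1/A2 spreadsheet example: modification stores a new value in
# the same place, manipulation computes a new value from an existing one.
cells = {"A1": 100}

cells["A1"] = 150                  # modification: same cell, new value
cells["A2"] = cells["A1"] * 2      # manipulation: derived value stored elsewhere
cells["A1"] = cells["A1"] + 10     # macro case: manipulate, then modify A1 itself

print(cells)  # {'A1': 160, 'A2': 300}
```
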

3.2.6 DATA MANIPULATION

Data manipulation is the process of changing data to make it easier to


read or be more organized. For example, a log of data could be organized
in alphabetical order, making individual entries easier to locate. Data
manipulation is often used on web server logs to allow a website owner to
view their most popular pages as well as their traffic sources. Users in the
accounting field or similar fields often manipulate data to figure out
product costs, sales trends, or potential tax obligations. Stock market
analysts frequently use data manipulation to predict trends in the
stock market and how stocks might perform in the near future.
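The web-server-log example above reduces to a few lines; the page paths are invented:

```python
# Manipulate a server log to find the most popular pages:
# the stored log itself is left unchanged, only reorganized for reading.
from collections import Counter

log = ["/home", "/jobs", "/home", "/about", "/jobs", "/home"]

hits = Counter(log)               # count visits per page
popular = hits.most_common(2)     # the two most visited pages
print(popular)                    # [('/home', 3), ('/jobs', 2)]
```
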

CHAPTER 4

DETAILED PLAN OF STUDY INCLUDING SCHEDULE OF


TRAINING

4.1 Training Schedule in the form of Chart

Project Title: Internship Finder
Project Ref. No.: Date of Preparation of Activity Plan:

No.  Task                              Start Date   Days   Team Mates   Status
1.   Study and planning                10/06/2019   5      Ameesha      Done
2.   Basic Design                      15/06/2019   2      Ameesha      Done
3.   Database Design                   18/06/2019   5      Ameesha      Done
4.   Interface Design                  24/06/2019   6      Ameesha      Done
5.   1st Documentation                 01/07/2019   3      Ameesha      Done
6.   Interface Coding                  04/07/2019   7      Ameesha      Done
7.   Database Connection               12/07/2019   3      Ameesha      Done
8.   Interface Design                  16/07/2019   4      Ameesha      Done
9.   2nd Documentation                 21/07/2019   2      Ameesha      Done
10.  Snapshots                         23/07/2019   1      Ameesha      Done
11.  Frontend Development              24/07/2019   3      Ameesha      Done
12.  Development process of Database   27/07/2019   4      Ameesha      Done
13.  Final Coding                      01/08/2019   3      Ameesha      Done
14.  Final Documentation               04/08/2019   1      Ameesha      Done
15.  Final Documentation               05/08/2019   5      Ameesha      Done
16.  Testing                           10/08/2019   1      Ameesha      Done

4.2 STUDY 1:

4.2.1 GUI Used:

Python offers multiple options for developing a GUI (Graphical User
Interface). Out of all the GUI methods, tkinter is the most commonly used. It is
a standard Python interface to the Tk GUI toolkit shipped with Python. Python
with tkinter is the fastest and easiest way to create GUI applications. Creating
a GUI using tkinter is an easy task.
To create a tkinter application:
1. Import the module – tkinter
2. Create the main window (container)
3. Add any number of widgets to the main window
4. Apply the event trigger on the widgets.
Importing tkinter is the same as importing any other module in Python code.
Note that the name of the module in Python 2.x is ‘Tkinter’ and in Python 3.x
is ‘tkinter’.

import tkinter

1. Tk(screenName=None, baseName=None, className=’Tk’, useTk=1): To
create a main window, tkinter offers the method ‘Tk(screenName=None,
baseName=None, className=’Tk’, useTk=1)’. To change the name of the
window, you can change the className to the desired one. The basic code
used to create the main window of the application is:

m = tkinter.Tk(), where m is the name of the main window object

2. mainloop(): The mainloop() method is used when you are ready for the
application to run. mainloop() is an infinite loop used to run the application,
wait for an event to occur and process the event as long as the window is not
closed.

m.mainloop()
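Putting the pieces above together, a minimal sketch of such a window might look like the following; the window and widget labels are illustrative, not taken from the project code:

```python
# Minimal tkinter sketch: import the module, create the main window,
# add widgets, and bind an event trigger to a button.
import tkinter as tk

def build_window():
    root = tk.Tk(className="InternshipFinder")
    label = tk.Label(root, text="Internship Finder")
    label.pack()
    # The command option triggers an event handler when the button is clicked.
    tk.Button(root, text="Search",
              command=lambda: label.config(text="Searching...")).pack()
    return root

# To run the application: build_window().mainloop()
```

Calling `build_window().mainloop()` opens the window and runs the event loop until it is closed.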
tkinter also offers access to the geometric configuration of the widgets, which
can organize the widgets in the parent windows. There are mainly three
geometry manager classes:
1. pack() method: It organizes the widgets in blocks before placing them in
the parent widget.

2. grid() method: It organizes the widgets in a grid (table-like structure)
before placing them in the parent widget.
3. place() method: It organizes the widgets by placing them at specific
positions directed by the programmer.
There are a number of widgets which you can put in your tkinter application.
Some of the major widgets are explained below:
1. Button:To add a button in your application, this widget is used.
The general syntax is:
w = Button(master, option=value)
There are number of options which are used to change the format of the
Buttons. Number of options can be passed as parameters separated by commas.
Some of them are listed below.
 activebackground: to set the background color when button is under the
cursor.
 activeforeground: to set the foreground color when button is under the
cursor.
 bg: to set the normal background color.
 command: to call a function.
 font: to set the font on the button label.
 image: to set the image on the button.
 width: to set the width of the button.
 height: to set the height of the button.
2. Canvas: It is used to draw pictures and other complex layout like graphics,
text and widgets.
The general syntax is:
w = Canvas(master, option=value)
master is the parameter used to represent the parent window.
There are number of options which are used to change the format of the widget.
Number of options can be passed as parameters separated by commas. Some of
them are listed below.
 bd: to set the border width in pixels.
 bg: to set the normal background color.
 cursor: to set the cursor used in the canvas.
 highlightcolor: to set the color shown in the focus highlight.
 width: to set the width of the widget.
 height: to set the height of the widget.
3. CheckButton: To select any number of options by displaying a number of
options to a user as toggle buttons. The general syntax is:
w = CheckButton(master, option=value)

There are number of options which are used to change the format of this widget.
Number of options can be passed as parameters separated by commas. Some of
them are listed below.
 Title: To set the title of the widget.
 activebackground: to set the background color when widget is under the
cursor.
 activeforeground: to set the foreground color when widget is under the
cursor.
 bg: to set the normal background color.
 command: to call a function.
 font: to set the font on the button label.
 image: to set the image on the widget.
4. Entry: It is used to input single-line text from the user. For multi-line text
input, the Text widget is used.
The general syntax is: w = Entry(master, option=value)
master is the parameter used to represent the parent window.
There are number of options which are used to change the format of the widget.
Number of options can be passed as parameters separated by commas. Some of
them are listed below.
 bd: to set the border width in pixels.
 bg: to set the normal background color.
 cursor: to set the cursor used.
 command: to call a function.
 highlightcolor: to set the color shown in the focus highlight.
 width: to set the width of the button.
 height: to set the height of the button.
 Frame: It acts as a container to hold the widgets. It is used for grouping
and organizing the widgets. The general syntax is:
w = Frame(master, option=value)
master is the parameter used to represent the parent window.
5. Label: It refers to the display box where you can put any text or image
which can be updated any time as per the code.
The general syntax is:
w = Label(master, option=value)
There are number of options which are used to change the format of the
widget. Number of options can be passed as parameters separated by commas.

6. MenuButton: It is a part of a drop-down menu which stays on the window
all the time. Every menubutton has its own functionality. The general
syntax is:
w = MenuButton(master, option=value)
7. Menu: It is used to create all kinds of menus used by the application.
The general syntax is:
w = Menu(master, option=value)
8. Message: It refers to the multi-line and non-editable text.
The general syntax is:

w = Message(master, option=value)
9. RadioButton: It is used to offer a multiple-choice option to the user. It
offers several options and the user has to choose one.
The general syntax is:
w = RadioButton(master, option=value)
10.Scrollbar: It refers to the slide controller which will be used to implement
listed widgets.
The general syntax is:
w = Scrollbar(master, option=value)
master is the parameter used to represent the parent window.
11.Text: To edit a multi-line text and format the way it has to be displayed.
The general syntax is:
w = Text(master, option=value)

4.3 STUDY 2:

4.3.1 WEB SCRAPING

Web scraping is used to collect large amounts of information from websites.
But why does someone have to collect such large amounts of data from
websites? To know about this, let’s look at the applications of web scraping:

 Price Comparison: Services such as ParseHub use web scraping to


collect data from online shopping websites and use it to compare the
prices of products.
 Email address gathering: Many companies that use email as a medium
for marketing use web scraping to collect email IDs and then send bulk
emails.

 Social Media Scraping: Web scraping is used to collect data from Social
Media websites such as Twitter to find out what’s trending.
 Research and Development: Web scraping is used to collect a large set
of data (Statistics, General Information, Temperature, etc.) from websites,
which are analyzed and used to carry out Surveys or for R&D.
 Job listings: Details regarding job openings and interviews are collected
from different websites and then listed in one place so that they are easily
accessible to the user.

What is Web Scraping?

Web scraping is an automated method used to extract large amounts of data
from websites. The data on websites is unstructured. Web scraping helps
collect this unstructured data and store it in a structured form. There are
different ways to scrape websites, such as online services, APIs or writing
your own code. Below, we see how to implement web scraping with Python.

Why Python for Web Scraping?

You’ve probably heard of how awesome Python is. But, so are other languages
too. Then why should we choose Python over other languages for web
scraping?

Here is the list of features of Python which makes it more suitable for web
scraping.

 Ease of Use: Python is simple to code. You do not have to add
semicolons “;” or curly braces “{}” anywhere. This makes it less messy
and easy to use.
 Large Collection of Libraries: Python has a huge collection of libraries
such as NumPy, Matplotlib, Pandas etc., which provide methods and
services for various purposes. Hence, it is suitable for web scraping and
for further manipulation of extracted data.
 Dynamically typed: In Python, you don’t have to define datatypes for
variables, you can directly use the variables wherever required. This
saves time and makes your job faster.
 Easily Understandable Syntax: Python syntax is easily understandable
mainly because reading a Python code is very similar to reading a
statement in English. It is expressive and easily readable, and the
indentation used in Python also helps the user to differentiate between
different scope/blocks in the code.
 Small code, large task: Web scraping is used to save time. But what’s
the use if you spend more time writing the code? Well, you don’t have to.
15
In Python, you can write small codes to do large tasks. Hence, you save
time even while writing the code.
 Community: What if you get stuck while writing the code? You don’t
have to worry. Python community has one of the biggest and most active
communities, where you can seek help from.

How to scrape?

When you run code for web scraping, a request is sent to the URL that you have mentioned. In response to the request, the server sends the data and allows you to read the HTML or XML page. The code then parses the HTML or XML page, finds the data and extracts it.

To extract data using web scraping with Python, you need to follow these basic steps:

1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format

The following Python libraries are used to extract data from websites:

•  Selenium: Selenium is a web-testing library. It is used to automate browser activities.
•  BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that are helpful for extracting the data easily.
•  Pandas: Pandas is a library used for data manipulation and analysis. It is used to extract the data and store it in the desired format.

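The six steps above can be sketched with the requests and BeautifulSoup libraries. The tag names and CSS classes below are hypothetical placeholders; in practice they must match whatever Step 2 (inspecting the page) actually reveals.

```python
import csv
from bs4 import BeautifulSoup

def fetch_page(url):
    """Steps 1-2: download the page at the chosen URL (needs the requests
    package and a network connection, so it is not invoked in this sketch)."""
    import requests
    return requests.get(url).text

def parse_internships(html):
    """Steps 3-5: parse the HTML and pull out the fields we need.
    The div/span class names here are placeholders, not a real site's markup."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="internship"):
        rows.append({
            "Title": card.find("h3").get_text(strip=True),
            "Location": card.find("span", class_="location").get_text(strip=True),
            "Stipend": card.find("span", class_="stipend").get_text(strip=True),
        })
    return rows

def save_to_csv(rows, path):
    """Step 6: store the scraped data in the required (CSV) format."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["Title", "Location", "Stipend"])
        writer.writeheader()
        writer.writerows(rows)

# Typical flow, once a real URL and real selectors are known:
# save_to_csv(parse_internships(fetch_page("https://example.com/internships")),
#             "My_InternshipFinder.csv")
```

The three functions map directly onto steps 1 to 6; only parse_internships changes from site to site.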

4.4 STUDY 3:

4.4.1 Storing the data in the CSV File-

CSV is a simple file format used to store tabular data, such as a spreadsheet or database. Files in the CSV format can be imported to and exported from programs that store data in tables, such as Microsoft Excel or OpenOffice Calc. CSV stands for "comma-separated values".

A CSV file is a text file, so it can be created and edited using any text editor. A CSV file is typically created by exporting (File menu -> Export) a spreadsheet or database from the program that created it.

In Python, the contents of the CSV file are handled in the form of dictionaries and lists. The data scraped from the online websites is stored this way, and then the filters are applied; more specifically, the rows are sorted according to a particular column before being displayed in the GUI.
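As a rough illustration (column names borrowed from the table design in section 6.4; the sample data is made up), the scraped CSV can be loaded as a list of dictionaries and filtered on a particular column:

```python
import csv
import io

def load_internships(csv_text):
    """Read the scraped CSV text into a list of dictionaries, one per internship."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def filter_by(rows, column, value):
    """Keep only the rows whose given column matches the user's filter."""
    return [row for row in rows if row[column] == value]

# Small stand-in for the scraped file's contents.
sample = "Title,Location,Duration\nWeb Dev,Lucknow,2 months\nData Science,Delhi,1 month\n"
rows = load_internships(sample)
lucknow = filter_by(rows, "Location", "Lucknow")
```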

4.5 STUDY 4:

4.5.1 Displaying of Data-

After the proper visualization and modification of the data stored in the CSV file, it is sorted so that it can be displayed. The Tkinter GUI framework allows the program to display the data stored in the form of dictionaries.
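The display step might be sketched as below; the widget layout is illustrative, not the project's actual screen, and the dictionary keys are assumed to match the scraped columns.

```python
import tkinter as tk

def format_row(internship):
    """Turn one internship dictionary into a single display line."""
    return f'{internship["Title"]} | {internship["Location"]} | {internship["Stipend"]}'

def show_internships(rows):
    """Open a window listing the internships in a simple listbox.
    Not invoked here, since it needs a display; call it from the
    application's main flow."""
    root = tk.Tk()
    root.title("Internship Finder")
    box = tk.Listbox(root, width=60)
    for row in rows:
        box.insert(tk.END, format_row(row))
    box.pack(padx=10, pady=10)
    root.mainloop()
```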

CHAPTER 5

DETAILS ABOUT THE CONCEPTS AND FEATURES OF LEARNING

Project Internship:

5.1 TITLE: INTERNSHIP FINDER:

Internship Finder helps the user find the best internship according to the given requirements.

The information about all the internships is collected from online websites. This scraped data is then sent for analysis: filters are applied to it, and the matching internships are finally retrieved for further analysis.

The general process of the Internship Finder is to scrape data from online internship websites such as internshala.com, WorkIndia.com, shine.com, AllCareerPoint.com and many more. It is basically a web scraper used to collect data for data analysis. Filters are applied to the data, according to which the information is fetched. These filters are based on specific choices, such as the gross value of the companies, whether a stipend is given, the duration in days or months, preferred locations, facilities provided, employee feedback and the overall working environment.

This project is aimed at developing a web scraper, called the Internship Finder, that helps users search for a quality internship. The Internship Finder is an application that can be accessed by all users, based on certain applied filters. The system can be used to automate the workflow of finding internships that satisfy the chosen criteria. Its features consist of scraping, analyzing and displaying the final top internships.

Functions of the Internship Finder:

There will be registered users in the system, including an administrator and other students or users. A user can apply for an internship after logging in. After an internship is completed, the administrator can modify or delete the account made by the user. Users can choose an internship based on their requirements. After the internship is over, the administrator can change passwords and can also delete the accounts of previously registered users. The administrator maintains records of the users who have registered for a particular internship on a specific website.

User Functions:
•  Can update the details in his/her profile.
•  Can view the various internship details, with detailed descriptions, criteria, etc.
•  Can register for an internship.
•  Can log in with the registered user ID and password.
•  Can give a preferred location, duration and qualification for the internship.
•  Can search for internships according to their needs.
•  Can choose among the internships scraped with the applied filters.
•  Receives a message on entering an incorrect email ID or password, prompting for the correct credentials.
•  Can edit the details accordingly.

5.2 PROBLEM DEFINITION-

Our users are college students and other people searching online, across various websites, for the best internship in order to gain good experience and learn new things.

The problem for every user searching for an internship is that they are often unable to find a suitable location or duration, to tell whether the company provides a stipend, and so on. The information to be displayed can be gathered by answering the following questions:

1. How to know the most suitable time duration?

2. How to find a preferred location for an internship in a company?

3. How to scrape the particular information?

4. What problems do students face while searching for an internship on the large websites?

5. What are the specific criteria for choosing an internship?

CHAPTER 6

DESCRIPTION OF THE PROJECT DEVELOPMENT PHASE

6.1 Customer Requirement Specification

User: Ameesha Saxena

Business/Project Objective:

Construct an online web scraper named Internship Finder to scrape data from different websites such as Internshala.com, Shine.com and many other such sites. The scraped data will be stored as a comma-separated (CSV) file. The CSV file will contain five columns, on which filters will be applied, and the top five internships will be displayed for the user based on their needs.

Inputs provided by the users:

•  Inputs to the system
•  Outputs from the system
•  Processes involved in the system
•  Expected top five to ten internships
•  List of scraped data
•  Data constraints for the registration table
•  Data constraints for the My Internship Finder table

Hardware Requirements:

•  A laptop with an i5 core processor, which will help you access all the tools needed for the completion of the project.
•  4 gigabytes of RAM or higher.

Software Requirements:

•  Jupyter Notebook
•  Python 3.7 / Anaconda 64-bit
•  Visual Studio Code
•  MySQL database connectivity
•  SQLite
•  Web browser
Scope of the Work (in brief):

Depending on the scraped data, the top list of internships will be displayed, and the clickable links will open in the web browser. With one click, a user searching for an internship can apply directly.

•  Verification and validation of registration details for a particular internship.
•  A CSV file is created to store the title, duration, location, stipend and links.
•  Data analysis is done to display the top internships.
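The clickable links described above could be opened with Python's standard webbrowser module; the validity check here is a minimal sketch, not the project's actual verification logic.

```python
import webbrowser

def is_valid_link(link):
    """Basic sanity check before handing a scraped link to the browser."""
    return link.startswith(("http://", "https://"))

def open_internship(link):
    """Open a valid scraped link in the user's default browser, in a new
    tab where possible; returns False for links that fail the check."""
    if is_valid_link(link):
        return webbrowser.open(link, new=2)
    return False

# Example (not invoked here): open_internship("https://internshala.com")
```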

6.2 ARCHITECTURE & DESIGN OF THE PROJECT

Fig: 6.2.1 describes the two-tier architecture:

Tier-1 (Presentation & Business Logic tier): the user interface and business logic, built with Tkinter using Python 3.

Tier-2 (Database tier): database access through SQLite 3 / SQL Server.

6.3 ENTITY RELATIONSHIP DIAGRAM


22
Password
Username

Login

Fig: 6.3.1

Username
password
Name

REGISTER

Email ID
Re-Type Password

Fig: 6.3.2

23
1 month

Duration
2 months
Field

more

FIND
Location

Payed
Full time
Stipend Type

Work from home

Unpayed

Fig: 6.3.3

6.4 TABLE DESIGN

6.4.1 Table: Staff (for registration)

Field Name | Data Type   | Null | Key | References Table | Description
-----------|-------------|------|-----|------------------|-----------------------------
Name       | Varchar(50) | No   |     |                  | Stores the name of the user
User       | Varchar(50) | No   | PK  |                  | Stores the user ID
Passw      | Varchar(50) | No   |     |                  | Stores the password
e-mail     | Varchar(50) | No   |     |                  | Stores the email of the user
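The registration table above could be created in SQLite roughly as follows; the column names mirror the table design, but the exact schema used in the project may differ.

```python
import sqlite3

def create_staff_table(conn):
    """Create the Staff registration table from section 6.4.1,
    with User as the primary key."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS Staff (
               Name  VARCHAR(50) NOT NULL,
               User  VARCHAR(50) NOT NULL PRIMARY KEY,
               Passw VARCHAR(50) NOT NULL,
               email VARCHAR(50) NOT NULL
           )"""
    )

def register_user(conn, name, user, passw, email):
    """Insert one registration row; raises sqlite3.IntegrityError
    if the user ID is already taken."""
    conn.execute("INSERT INTO Staff VALUES (?, ?, ?, ?)",
                 (name, user, passw, email))
    conn.commit()

# In-memory database for demonstration; the project would use a file.
conn = sqlite3.connect(":memory:")
create_staff_table(conn)
register_user(conn, "Ameesha", "ameesha01", "secret", "a@example.com")
```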

6.4.2 Table in CSV File: My_InternshipFinder

Field Name | Data Type    | Null | Description
-----------|--------------|------|--------------------------------------------
Title      | Varchar(50)  | No   | Stores the name of the internship
Location   | Varchar(50)  | No   | Stores the location of the internship
Duration   | Varchar(50)  | No   | Stores the time duration of the internship
Stipend    | Varchar(50)  | No   | Stores the stipend provided
Links      | Varchar(100) | No   | Stores the link of the website

6.5 SNAPSHOTS OF THE PAGES

6.5.1 Registration Page:

The registration page is used to register the details of the user applying for the internship. The page contains Name, user name, password and email ID fields for entering the user's data. On clicking the 'REGISTER ME' button, the user is registered and a dialog box appears confirming this. On clicking the OK button, the user can go back to the login window to log in.
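The validation behind such a form might look like the sketch below; the specific rules (required fields, matching passwords, email pattern) are illustrative assumptions, not the project's own checks.

```python
import re

def validate_registration(name, user, password, retyped, email):
    """Return the error text for the dialog box, or None when the form
    is valid; the exact rules here are illustrative."""
    if not all([name, user, password, retyped, email]):
        return "All fields are required."
    if password != retyped:
        return "Passwords do not match."
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        return "Please enter a valid email ID."
    return None

def on_register_click(fields):
    """Wire the check to the 'REGISTER ME' button (not invoked here,
    since showing the dialog needs a display)."""
    from tkinter import messagebox
    error = validate_registration(*fields)
    if error is None:
        messagebox.showinfo("Registration", "User registered successfully.")
    else:
        messagebox.showerror("Registration", error)
```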

Img: 6.5.1

6.5.2 Login Page :
This page allows the user to log in with the registered user name and password by clicking the login button.

Img: 6.5.2

6.5.3 Finder Page :


On clicking the login button, the FINDER PAGE opens. This page collects the details of the internship. After entering the details about the type, duration and location of the internship, click the Find button, which opens a window displaying all the internships that suit those criteria.

Img: 6.5.3

6.6 Display Window :


This is the CSV (Comma-Separated Values) table containing the list of internships scraped from the online websites. Filters are then applied to display the top five.
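The top-five selection can be sketched with pandas; the filter criterion (an exact location match) and the sample data are illustrative stand-ins for the real scraped table.

```python
import pandas as pd

def top_internships(df, location=None, n=5):
    """Apply an (illustrative) location filter to the scraped table and
    keep the first n rows as the 'top' internships."""
    if location is not None:
        df = df[df["Location"] == location]
    return df.head(n)

# Small stand-in for the scraped My_InternshipFinder.csv contents.
data = pd.DataFrame({
    "Title": ["Web Dev", "Data Science", "Python Intern"],
    "Location": ["Lucknow", "Delhi", "Lucknow"],
    "Stipend": ["5000", "Unpaid", "3000"],
})
top = top_internships(data, location="Lucknow")
```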

6.7 SNAPSHOTS OF DATABASE TABLES
These are the final top five internships, scraped, analysed and displayed in the form of a CSV file in table format.

6.8 Project Directory Structure :

Internship Finder/
    Images/
        users/
        others/
    Desktop INF/
    myDatabase/
    Online Internship/

•  The My_InternshipFinder directory contains all the .csv files at the root.
•  The others folder contains all the other images used in the project.
•  The Online Internship directory contains all the .py files required by the project.
•  The myDatabase folder is used to store the registration details during the file-upload process.

CHAPTER 7
CONCLUSION
This project is aimed at developing a web scraper, called the Internship Finder, that helps users search for a quality internship. The Internship Finder is an application that can be accessed by all users, based on certain applied filters. The system can be used to automate the workflow of finding internships that satisfy the following criteria.

•  Helps to determine the most suitable time duration for an internship.

•  Helps to find a preferred location for an internship in a company.

•  Helps to scrape the particular information from a website.

•  Tells about the stipend criteria.

•  Helps in sorting by the specific criteria for choosing an internship.

•  Solves the problems students face while searching for an internship on large websites like internshala.com.

To reduce the workload of searching for a suitable internship through endless scrolling, the best approach is the introduction of the web scraper named Internship Finder, which provides filters based on location, duration and whether a stipend is offered. This scraper will finally display the top five best scraped internships.

CHAPTER 8

REFERENCES

•  www.geeksforgeeks.org
•  www.edureka.com
•  www.stackoverflow.com
•  www.towardsdatascience.com
•  www.wikipedia.com
•  www.google.com
•  PYTHON FOR EVERYBODY (Exploring Data in Python 3) by Charles Severance
