
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

“Jnana Sangama”, Belagavi, Karnataka

A report on
“Introduction to Python and Data Science”
Submitted in partial fulfillment for Internship
in

SUMMER INTERNSHIP - I (21INT36)

INTER / INTRA INSTITUTIONAL INTERNSHIP

of

BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
Submitted by

PRATIKSHA SOORI
(1RF21CS077)

RV INSTITUTE OF TECHNOLOGY AND MANAGEMENT®


(Affiliated to Visvesvaraya Technological University, Belagavi & Approved by AICTE, New Delhi)
Chaitanya Layout, JP Nagar 8th Phase, Kothanur, Bengaluru-560076

2022-23
RV Educational Institutions

Department of Computer Science Engineering

CERTIFICATE

Certified that the Inter / Intra institutional internship-I work titled “Introduction to Python
and Data Science” has been carried out by PRATIKSHA SOORI (1RF21CS077), a
bonafide student of RV Institute of Technology and Management, Bengaluru, in partial
fulfillment for the award of summer internship-I in Computer Science and Engineering of
the Visvesvaraya Technological University, Belagavi during the academic year 2022-2023.
It is certified that all corrections/suggestions indicated for the internal assessment have been
incorporated in the report. The internship report has been approved as it satisfies the
academic requirements prescribed by the university.

Dr. Roopashree S
Assistant Professor, Department of CSE, RVITM, Bengaluru–76

Dr. Anitha J
Professor & Head, Department of CSE, RVITM, Bengaluru–76

Dr. Jayapal R
Principal, RVITM, Bengaluru–76

DECLARATION

I, PRATIKSHA SOORI (1RF21CS077), a student of the third semester B.E., Computer
Science and Engineering, RV Institute of Technology and Management, Bengaluru, hereby
declare that the Inter / Intra institutional internship-I titled “Introduction to Python
and Data Science” has been carried out by me and submitted in partial fulfillment for the
award of summer internship-I of third semester, Bachelor of Engineering in Computer
Science and Engineering, Visvesvaraya Technological University, Belagavi during the
academic year 2022-2023. I declare that the matter embodied in this report has not been
submitted to any other university or institution for the award of any other degree.

Place: Bengaluru Signature


Date:

Pratiksha Soori
(1RF21CS077)
ACKNOWLEDGEMENT

The successful presentation of the summer internship-I would be incomplete without the
mention of the people who made it possible and whose constant guidance crowned my effort
with success.

I would like to extend my gratitude to the RV Institute of Technology and Management,
Bengaluru, and Dr. Jayapal R, Principal, RV Institute of Technology and Management,
Bengaluru for providing all the facilities to carry out the Internship.

I thank Dr. Anitha J, Professor and Head, Department of Computer Science and
Engineering, RV Institute of Technology and Management, Bengaluru, for her initiative and
encouragement.

I would like to thank my internship resource person, Dr. Roopashree S, Assistant Professor,
Department of Computer Science and Engineering, RV Institute of Technology and
Management, Bengaluru, for their constant guidance and inputs.

I would like to thank all the Teaching Staff and Non-Teaching Staff of the college for their
co-operation.

Finally, I extend my heart-felt gratitude to my family for their encouragement and support
without which I would not have come so far. Moreover, I thank all my friends for their
invaluable support and cooperation.

PRATIKSHA SOORI

(1RF21CS077)
ABSTRACT
Python is a high-level object-oriented programming language that is used in a wide
variety of application domains. It has the right combination of performance and features that
demystify program writing. Python follows a modular programming approach, which is a
software design technique that emphasizes separating the functionality of a program into
independent, inter-changeable modules, such that each contains everything necessary to
execute only one aspect of the desired functionality. Conceptually, modules represent a
separation of concerns, and improve maintainability by enforcing logical boundaries
between components. 
Data science encompasses a set of principles, problem definitions, algorithms, and
processes for extracting nonobvious and useful patterns from large data sets. Many of the
elements of data science have been developed in related fields such as machine learning and
data mining. The commonality across these disciplines is a focus on improving decision
making through the analysis of data. Machine learning (ML) focuses on the design and
evaluation of algorithms for extracting patterns from data. Data science takes these
considerations into account but also takes up other challenges, such as the capturing,
cleaning, and transforming of unstructured social media and web data; the use of big-data
technologies to store and process big, unstructured data sets; and questions related to data
ethics and regulation.
This internship report reflects the three weeks of training received. It incorporates
the details of the practical experience and the academic knowledge gained during the
internship.
Table of Contents
Chapter No. Contents

Acknowledgement
Abstract
Table of Contents
List of Figures

Chapter 1 INTRODUCTION TO PYTHON
1.1 Python-Pluses
1.2 Python Libraries for Data Science
1.3 Training Contents
Chapter 2 INTRODUCTION TO DATA SCIENCE
2.1 Basic Concepts
2.2 Training Contents
2.3 Diabetes Dataset
Chapter 3 CONCLUSION AND FUTURE SCOPE
LIST OF FIGURES

Figure No. Figure Name

2.1 Program code for analyzing diabetes dataset
2.2 Output graph
Chapter-1
INTRODUCTION TO PYTHON
Python-Pluses
The Python programming language was developed by Guido van Rossum and first released in
February 1991. Though it came into existence in the early 1990s, it competes with perennially
popular languages such as C, C++ and Java in popularity indices. Its design philosophy emphasizes
code readability, and its syntax allows programmers to express concepts in fewer lines of
code. The language provides constructs intended to enable clear programs on both a small
and large scale. Python supports multiple programming paradigms, including object-oriented,
imperative, functional and procedural styles.
The main advantages of using Python are:
1. Python is a compact and very easy to use object-oriented language with very simple
syntax rules. It is a very high-level language and thus very programmer friendly.
2. Python is an interpreted language and not a compiled language. This means that the
Python installation interprets and executes the code line by line. This makes
it easy to debug and thus suitable for beginners as well as advanced users.
3. Python is an efficient language to write in, as programs do more in fewer lines
of code than in many other languages.
4. The Python language is freely available without any cost.
5. Python has evolved into a powerful, complete and useful language over the years.
These days, it is being used in many diverse fields and applications such as scripting,
rapid prototyping, web applications, GUI programs, game development, database
applications and system administration.
6. Python runs equally well on a variety of platforms: Windows, Linux/UNIX,
Macintosh, supercomputers, smartphones etc. In other words, Python is a portable
language.
When it comes to data science, we need some sort of programming language or tool.
Being quick to develop in, having a large number of packages available and a syntax that is
easy to understand, Python helps in building applications with a readable code base. It
has been used in data science, IoT, AI and other technologies, which has added to its
popularity.
Python Libraries for Data Science

Python has libraries with large collections of mathematical functions and analytical tools.
The following are the libraries that give users the necessary functionality when crunching
data:
1. NumPy:
NumPy stands for Numerical Python. It is a general-purpose array-processing
package and the fundamental package for scientific computing with Python. Its most
powerful feature is the n-dimensional array. The library also contains
basic linear algebra functions, Fourier transforms, advanced random number
capabilities and tools for integration with low-level languages like Fortran, C
and C++. NumPy can also be used as an efficient multi-dimensional container of
generic data.
2. Pandas:
Pandas is an open-source library that provides high-performance data
manipulation in Python. It is built on the NumPy package, so NumPy is
required to use Pandas. It can perform the five significant steps required for
processing and analysis of data irrespective of the origin of the data, i.e., load,
manipulate, prepare, model, and analyse.
Pandas has a fast and efficient DataFrame object with default and customized
indexing. It can be used for reshaping and pivoting data sets. Datasets in a
variety of formats, such as matrix data, heterogeneous tabular data and time
series, can be processed. It can also integrate with other libraries such as SciPy
and scikit-learn.
3. Matplotlib:
Matplotlib is a quintessential Python library for plotting and visualisation. A vast
variety of graphs, from histograms to line plots to heat maps, can be drawn. One
of the greatest benefits of visualization is that it presents huge
amounts of data in easily digestible visuals. The library provides an object-oriented
API for embedding plots into applications. Its plotting interface closely resembles
that of MATLAB. Matplotlib also supports
labels, grids, legends, and other formatting elements.
4. Scikit-learn:
Scikit-learn is a robust machine learning library for Python. It features ML
algorithms like SVMs, random forests, k-means clustering, spectral clustering and mean
shift, along with tools such as cross-validation. It is built on NumPy and SciPy and
works alongside the rest of the SciPy stack. It implements a
range of machine learning, pre-processing, cross-validation, and visualization
algorithms using a unified interface.
The scikit-learn library provides many different algorithms which can be imported
into the code and then used to build models, just as we would import any other
Python library. This makes it easier to quickly build different models and compare
them to select the highest scoring one. But to really appreciate its true power,
we need to start using it on different open data sets and build predictive models using
them.
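The NumPy and Pandas features described above can be illustrated with a minimal sketch. The small arrays and the `readings` table below are hypothetical values chosen only for illustration; they are not data from the internship.

```python
import numpy as np
import pandas as pd

# NumPy: a 2-dimensional array and a basic linear algebra routine.
a = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # 2 x 2 n-dimensional array (ndarray)
b = np.array([3.0, 5.0])
x = np.linalg.solve(a, b)             # solves the linear system a @ x = b

# Pandas: a DataFrame holding hypothetical readings in "long" form.
readings = pd.DataFrame({
    "patient": ["p1", "p1", "p2", "p2"],
    "visit": ["first", "second", "first", "second"],
    "glucose": [90, 110, 100, 130],
})

# Pivoting reshapes the table so that each visit becomes a column.
wide = readings.pivot(index="patient", columns="visit", values="glucose")

# Grouping gives a per-patient summary.
means = readings.groupby("patient")["glucose"].mean()

print(x)
print(wide)
print(means)
```

Here `np.linalg.solve` stands in for the linear algebra functions mentioned in the text, and `pivot` for the reshaping and pivoting of data sets.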
Training Contents

This internship was carried out from 10th of October to 31st of October 2022. In the first
week, I learned the basic Python programming skills. The following topics were covered:
1. Introduction to Python: The starting steps were to install Python, distinguish between
important datatypes and use the basic features of the Python interpreter and IDLE.
The difference between a module and a script was also
covered.
2. Using variables in Python: I learned about numeric, string, sequence and dictionary data
types along with relevant operations while practicing Python syntax.
3. Basic concepts in Python: The basic idea of using conditional statements, loops and
iterators was studied. This helped me in developing Python programs using these
statements.
4. Python Datatypes: After understanding the basics, I moved on to exploring the
various datatypes available in Python such as lists, dictionaries, tuples and sets. I also
learned the operations that can be performed on them.
5. Functions and Packages: Optimisation of Python code was studied further by dividing
the program into functions.
By the end of this week, I was able to develop programs in Python. I got to
put the concepts learned into practice in my assignments.
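A short sketch of the kind of program these topics lead to is given below. It is not one of the actual assignments; the names `records` and `summarise_marks` and the marks themselves are made up for illustration.

```python
def summarise_marks(marks):
    """Return (highest, lowest, average) for a non-empty list of numbers."""
    return max(marks), min(marks), sum(marks) / len(marks)


# Dictionary mapping a student name (string) to a list of marks.
records = {
    "asha": [72, 85, 90],
    "ravi": [65, 58, 70],
}

passed = set()                                   # a set stores unique names
for name, marks in records.items():              # loop over the dictionary
    highest, lowest, average = summarise_marks(marks)   # unpack the tuple
    if average >= 70:                            # conditional statement
        passed.add(name)

# An iterator hands out the items of a sequence one at a time.
marks_iter = iter(records["asha"])
first_mark = next(marks_iter)

print(passed, first_mark)
```

Dividing the work into a function such as `summarise_marks` keeps the loop body short and lets the same calculation be reused for any list of marks.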
Chapter-2
INTRODUCTION TO DATA SCIENCE
Basic Concepts
Data science is the domain of study that deals with vast volumes of data using modern
tools and techniques to find unseen patterns, derive meaningful information, and make
business decisions. It uses complex machine learning algorithms to build predictive models.
The data used for analysis can come from many different sources and be presented in various
formats.
A data science life cycle is an iterative set of steps taken to deliver a project or
analysis. Since every data science project and team are different, every specific data science
life cycle is different. However, most data science projects tend to flow through the same
general life cycle of data science steps which are as follows:
1. Capture: In this stage, the data science team is trained in researching the issue to
create context and gain understanding. Raw structured and unstructured data are
gathered. The team comes up with an initial hypothesis, which can be later confirmed
with evidence.
2. Maintain: This stage covers methods to investigate the possibilities of pre-processing,
analysing, and preparing data before analysis and modelling. The raw data is taken
and put into a form that can be used.
3. Process: After pre-processing, data scientists take the data and examine its patterns,
ranges, and biases to determine how useful it will be in predictive analysis.
4. Analyze: This stage involves performing the various analyses on the data. It involves
exploratory and predictive analysis.
5. Communicate: In the final stage, analysts prepare the analyses in easily readable
forms such as charts, graphs, and reports. Thus, it is a data reporting or data
visualisation stage.
Training Contents
In the second week of the internship, I gained knowledge about data analysis,
data visualisation and machine learning, and about applying them to real-life data science
projects. The following topics were covered during this week:
1. Python libraries: The available libraries in Python such as NumPy, Pandas, SciPy and
Matplotlib, which are useful for analysing data, were studied.
2. Model building: I learnt the basic steps involved in model building using scikit-learn,
which are loading a dataset, splitting the dataset and training the model. Data
visualisation can be done with Matplotlib.
3. Machine learning: Understanding data processing and operations on NumPy arrays,
including the reciprocal, power and modulus functions, and executing programs.
4. Data cleaning: After learning the basic tools available for analysing data sets, an
overview of data cleaning was covered, which involves removing unwanted observations,
fixing structural errors, managing unwanted outliers, handling missing data and
imputing the missing values from past observations.
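The NumPy operations and data cleaning steps listed above can be sketched as follows. The `raw` table is hypothetical, and the 60 degree cut-off used to flag an outlier is an assumption made only for this example.

```python
import numpy as np
import pandas as pd

# Element-wise NumPy operations named in the training contents.
values = np.array([1.0, 2.0, 4.0])
reciprocals = np.reciprocal(values)     # 1 / value
squares = np.power(values, 2)           # value ** 2
remainders = np.mod(values, 3)          # value % 3

# Hypothetical daily readings with typical data-quality problems.
raw = pd.DataFrame({
    "day": [1, 2, 2, 3, 4, 5],
    "city": ["Bengaluru", "bengaluru ", "bengaluru ",
             "Bengaluru", "Bengaluru", "Bengaluru"],
    "temp_c": [24.0, 25.0, 25.0, np.nan, 250.0, 26.0],
})

clean = raw.drop_duplicates().copy()                    # remove unwanted (repeated) observations
clean["city"] = clean["city"].str.strip().str.title()   # fix structural errors in labels
clean.loc[clean["temp_c"] > 60, "temp_c"] = np.nan      # assumed threshold: treat outlier as missing
clean["temp_c"] = clean["temp_c"].ffill()               # fill gaps from the previous observation
clean = clean.reset_index(drop=True)

print(clean)
```

Forward filling is only one way to handle missing values; whether it is appropriate depends on the data set.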
Diabetes Dataset
For the second assignment of the internship, I worked on the diabetes dataset provided by
scikit-learn. Diabetes is a chronic disease, caused by an increase in the level of blood
glucose, with the potential to cause a worldwide health care crisis. However, early
prediction of diabetes is quite a challenging task for medical practitioners due to
various factors. Data science methods have the potential to benefit other scientific fields by
shedding new light on common questions. One such task is to help make predictions on
medical data.
The ten attributes present in this dataset are age, sex, body mass index, average blood
pressure, and six blood serum measurements, recorded for 442 diabetes patients. The target
is a quantitative measure of disease progression one year after the baseline. Pre-processing
of the data has already been done. Using linear regression, a line is plotted which
indicates the predicted value after one year. Then, the result is compared with a scatter plot of
the real values.
In Linear Regression,
y = mx + c
y : The variable to be predicted (aka the dependent variable). It is of numerical
continuous datatype.
m : The coefficient ‘m’ is the slope of the line.
x : The variable which is called the independent variable.
c : A constant value, aka the y-intercept (the value of ‘y’ when ‘x’ is
zero). It is the point at which the line crosses the vertical axis.
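As a small worked example of how m and c can be obtained from data, the sketch below uses the standard least-squares formulas for a single independent variable. The function name `fit_line` and the four sample points are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


# Sample points that lie exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, c = fit_line(xs, ys)
print(m, c)
```

For points that lie exactly on a line, the fit recovers that line's slope and intercept; for noisy data it gives the line with the smallest sum of squared vertical errors.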
Program Code

Fig.2.1. Program code for analysing diabetes dataset
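
A minimal sketch of an analysis along these lines is given here for reference. It is illustrative and not necessarily the listing shown in Fig. 2.1: the choice of body mass index as the single independent variable, the 80/20 split and the `random_state` value are all assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the diabetes dataset bundled with scikit-learn (442 patients, 10 attributes).
diabetes = load_diabetes()

# Assumption: use only body mass index (column 2) so the model has the form y = m*x + c.
X = diabetes.data[:, [2]]
y = diabetes.target

# Split the dataset into training and test parts (assumed 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train the model and predict disease progression for the test patients.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

slope = model.coef_[0]                 # m in y = m*x + c
intercept = model.intercept_           # c in y = m*x + c
r2 = model.score(X_test, y_test)       # coefficient of determination on the test set

print(slope, intercept, r2)
```

The predicted line and the real values could then be compared visually, for example with Matplotlib's `plt.scatter(X_test, y_test)` and `plt.plot(X_test, y_pred)`.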


Obtained Graph
Fig.2.2. Output graph

As can be seen from the graph above, the points show no clear tendency to cluster
around a line, which suggests that linear regression is not a good model for this data.

Chapter-3
CONCLUSION AND FUTURE SCOPE
In this internship work, I have studied the basics of Python programming and Data
Science. I applied each topic studied in a piece of project work. The aim
of the study in data science was to learn the fundamentals of machine learning. Through the
data visualisation part, I learnt how to make data more understandable by using different
Python libraries. This internship helped me to learn how to analyse
pre-existing datasets using Python. The practical programming exercises enhanced my
interest in solving problems in Data Analytics using Python.
Along with Python, there are many other languages used for data science and
machine learning, like Java and C++. However, most developers prefer Python to Java and
C++ due to its easy syntax, secure coding, and simplicity. When it comes to robustness and
performance, developers choose Python.
With respect to future work, there is still huge scope for this language to serve other
upcoming research areas because of features like its simplicity, extensive library, and inbuilt
and extensible modules. In future, we will propose Python as a powerful tool for use by
many research communities.
