An Internship Program-I
by
V Ankith Reddy
ID number: 20STUCHH010202
at
IIIT Hyderabad
Faculty of Science & Technology, IFHE Hyderabad
PREFACE
The B-Tech program is a well-structured and comprehensive technical studies program.
The primary goal of practical training at the B-Tech level is to help students acquire
abilities by supplementing the academic study of technical courses in general. The
Internship Program assists in gaining real-world knowledge about an IT company and
provides a good introduction to the modern IT world. The B-Tech degree offers a broad
curriculum divided into specialties that give practical understanding of engineering
principles. Training is an essential component of every professional course. In
classroom training, professors provide us with theoretical knowledge of many areas;
only via practical training did I learn what a corporation is and how it operates, and
about the activities carried out in the organization's other departments, which will
benefit me in the future when I enter the practical sector. The Internship Program is a
vital element of B-Tech, and each student is required to train at a firm for two months
and then write a project report on the same while training. Throughout the course, I
gained a lot of experience and learned about real-life management methods and how
they differ from theoretical knowledge. Theoretical knowledge is insufficient in today's
globalized world, where market rivalry is fierce. Aside from that, one must have
practical knowledge that will aid an individual in his or her work activities, and it is
true that "Experience is the best teacher".
ACKNOWLEDGMENT:
I would like to offer my heartfelt gratitude to Dr. Bapi Raju and express my
appreciation for his timely comments during my assignment. I also offer my gratitude
to Mr. Madhukar Dwivedi and Mr. Krishna, who deserve my respect and admiration;
they were always there to assist me during the course of my internship. I would like to
express my heartfelt gratitude to Dr. D.V Nair (faculty mentor) for his encouragement
and valuable assistance in shaping the current form of my work, as well as to those who
extended their kind assistance in completing my internship. I appreciate everyone's
assistance and direction in finishing my assignment. This report required a great deal
of guidance and cooperation from many people, and I consider myself quite fortunate
to have received it throughout the whole process. Only with such guidance and support
have I accomplished everything.
ABSTRACT
The first project assigned was a moderately difficult data set from a Kaggle competition,
in which our team put to the test the skills learned during the training phase prior to the
project. The prerequisites were a Python course (Pandas & NumPy), an introduction to
machine learning, feature engineering, and the core algorithms with their mathematical
intuition. The problem statement was to create an ML model to predict the survivability
of a passenger in the Titanic shipwreck. The aforementioned techniques were put to use
to import the data set, clean corrupt data, preprocess it, and train an efficient prediction
model, which yielded good results on the competition leaderboard, placing in the top 5
percent.
The second assigned project was based on a time series data set, in which data recorded
over a period of time is used to predict the future with forecasting methods. The dataset
worked upon was 'Delhi Climate Time Series', and the goal was to build a forecasting
model. Two approaches were followed: one with simple regression techniques and the
other using SARIMA, which is a more complex approach. The model with simple
regression techniques is ready, while the SARIMA model has yet to be finalized.
TABLE OF CONTENTS
Preface
Acknowledgment
Abstract
Introduction
Project 1
1.1 Methodology
1.2 Results
Project 2
2.0 Introduction
2.2 Methodology
Conclusion
References
INTRODUCTION
The internship at IIIT was mainly focused on learning and implementing machine
learning.
Introduction to the Machine Learning Project: Machine learning is a subfield of
computer science that arose from the study of pattern detection in data and the
prediction of future or unknown events. In this process we primarily worked on
datasets on Kaggle and implemented various machine learning models to try out and
test our skills in machine learning.
There were a few prerequisites for machine learning which were assigned to us in the
first few weeks of the internship. The prerequisites for creating efficient machine
learning models used for predictions are as follows:
1) Python – The basic and most popular programming language used for machine
learning, along with the libraries created for machine learning algorithms.
2) NumPy and Pandas – NumPy is a Python library used to handle numerical
data; Pandas is a Python library with many functions used to work on datasets.
3) Data Visualisation – Data visualisation mainly focuses on plotting the
information in datasets for a better understanding of the given data. Some of the
most commonly used plots in the projects were bar and line plots, scatter plots,
histograms and box plots.
4) Introduction to machine learning – In the intro to machine learning, all the basic
topics and algorithms used to build models were covered. Learning the
algorithms and the mathematical intuition behind them was assigned to us for
better implementation of the algorithms in our models. The various algorithms
learnt during the course as prerequisites for machine learning are as follows:
- Linear Regression
- Logistic Regression
- Decision trees and Random forests
- K-means clustering algorithms
- K nearest neighbour
- Support vector machines
The above are the machine learning algorithms used for classification, regression
and clustering of the given data to predict outcomes.
5) Feature Engineering – Feature engineering is a data pre-processing course
which is helpful in cleaning the data before training and testing machine
learning models for efficient and reliable predictions. Most data in its raw form
is impure and corrupted, and requires a lot of pre-processing before it can be
loaded into machine learning models efficiently. This is done with the help of
various feature engineering techniques, such as converting data into an accepted
format, assigning values based on the mean, median or mode for NaN values,
and using various encoding techniques for strings in the datasets.
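As a minimal sketch of the first of these techniques, converting a column into an accepted format (the column and date format here are hypothetical, not taken from the project data):

```python
import pandas as pd

# Hypothetical raw column of dates stored as plain strings
df = pd.DataFrame({"date": ["01/01/2013", "02/01/2013", "03/01/2013"]})

# Convert the strings into a proper datetime dtype (a day-first format is assumed)
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
```

After the conversion, time-based operations such as sorting, resampling or extracting the month become available on the column.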
Project – 1
Titanic machine learning project
Our first project concerns the well-known, unfortunate sinking of the Titanic,
when the starboard side of the ship struck an iceberg, making a cavity
underneath the waterline. The chunk of ice did not puncture the frame;
rather, the hull's seams buckled and separated, permitting water to
surge in, leading to the tragic deaths of the individuals who were present on
the ship.
OBJECTIVE OF THE STUDY
A. Primary objective:
The report aims to provide information about the data and the mathematical
models used for predicting the survivability of Titanic passengers and
forecasting the Delhi climate time series.
B. Secondary objective:
SCOPE OF STUDY
Main text
Case study - 1
Here the training data was used to train our machine learning models, and we
constructed models based on features such as:
Pclass (ticket class):
a) 1st = Upper
b) 2nd = Middle
c) 3rd = Lower
sibsp: the number of siblings / spouses aboard the Titanic. The dataset
defines family relations in this way...
parch: the number of parents / children aboard the Titanic. Some children
travelled only with a nanny, therefore parch = 0 for them.
embarked: the port of embarkation:
a) C = Cherbourg
b) Q = Queenstown
c) S = Southampton
Methodology
First, we inspected the data set manually based on our understanding of the
data. After that, to reduce the noise in the data, a few columns were dropped,
namely "PassengerId", "Name", "Ticket" and "Cabin", from both the test and
train data sets. "PassengerId", "Name" and "Ticket" were dropped as they did
not significantly affect the data and had low correlation; in the case of
"Cabin", it had several missing values.
In the next step we plotted bar graphs, pie charts, histograms, heatmaps, line
graphs etc. using Seaborn, a Python visualisation library. This gave us better
insight into the data and helped us decide on our next steps based on
intuitions from the data visualisation tools.
1.3 Pre-processing
Data preprocessing is the process of preparing raw data for use with a
machine learning model. It is the first and most important stage in
developing a machine learning model.
· Sklearn provides 'label encoding' for encoding the levels of
categorical features into numeric values. LabelEncoder encodes the
labels in the "Sex" column with the values 0 and 1, converting the
strings 'male' and 'female' into numbers on which we train our
model. An alternative, one-hot encoding, creates separate columns
for the Male and Female labels: whenever the passenger is male, the
value in the Male column is 1 and 0 in the Female column, and vice
versa.
The "Age" and "Cabin" columns of the Titanic dataset contain missing
values. In pandas, missing values are usually represented by NaN, which
stands for Not a Number.
The most common way to replace missing values in a numeric column is to
replace them with the mean. If there are outliers, the average is not
appropriate; in such cases, the outliers need to be dealt with first.
The "Embarked" column had 3 missing values that were replaced by the
value 'S', as 'S' was the most frequent value in the column. Similarly, the
missing values in the "Age" column were filled with the mean value of age
in the given data set.
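A sketch of these two imputations on a small hypothetical slice of the data (not the actual Titanic file):

```python
import pandas as pd

# Hypothetical slice of the Titanic training data with missing values
df = pd.DataFrame({
    "Age": [22.0, None, 35.0, 28.0, None],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Fill missing Embarked values with the most frequent value (the mode, 'S' here)
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Fill missing Age values with the mean of the observed ages
df["Age"] = df["Age"].fillna(df["Age"].mean())
```

After these two steps the frame contains no NaN values and can be passed to a model.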
The idea behind the StandardScaler is to transform the data so that the
mean of the distribution is 0 and the standard deviation is 1.
Given the distribution of the data, the mean is subtracted from each value
in the dataset and the result is divided by the standard deviation (computed
per feature in the multivariate case).
The features of the input data set vary substantially over their ranges: the
binary values in encoded columns such as "Sex" are considerably smaller
than some values in the "Age" column. StandardScaler estimates the mean
of each feature and scales the values to unit variance.
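A minimal sketch of StandardScaler on a hypothetical feature matrix whose columns have very different ranges (a binary-coded column next to an age-like column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: column 0 is binary-coded, column 1 is age-like
X = np.array([[0.0, 22.0],
              [1.0, 38.0],
              [0.0, 26.0],
              [1.0, 35.0]])

# Transform each column to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

After scaling, both columns contribute on a comparable scale, which matters for distance- and gradient-based algorithms.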
1.5 Training the model
It is important that the correlation between the models is low. Just as low-
correlated assets come together to form a portfolio that is larger than the
sum of its parts, uncorrelated models can generate more accurate aggregate
forecasts than individual forecasts. The reason for this effect is that
the trees protect each other from their individual mistakes (unless they all
always make mistakes in the same direction). Some trees may be wrong,
but many others will be right, which allows the trees to move in the right
direction as a group. Therefore, the requirements for good performance in
a Random Forest are:
First, for a model built using these features to be better than a random
guess, the features must carry an actual signal. Second, the predictions
(and therefore the errors) of the individual trees should be only slightly
correlated with each other.
Now we train our machine learning model on the train data set and test it
on our test data set. We used a random forest to train our data, as it had
better accuracy in comparison with algorithms like decision trees and
logistic regression, which were tried and tested before the random forest.
After training our model, we predicted which passengers survived and
stored the result in a new CSV file.
Hyperparameters used: n_estimators = 1000, max_depth = 5.
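A sketch of this training step with the stated hyperparameters; the features and labels below are synthetic stand-ins, not the actual preprocessed Titanic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed Titanic features and survival labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(20, 5))

# Random forest with the hyperparameters stated in the report
model = RandomForestClassifier(n_estimators=1000, max_depth=5, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

In the actual project the predictions would then be written out with `to_csv` for submission to the competition.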
RESULTS:
Our first model was ready, and we were placed at 9,200/14,500 on the
leaderboard. The model was complete and working, but this was not the
end: in machine learning, a model can be improved substantially by using
different approaches and following optimization methods to improve its
accuracy. Our team did not stop with the initial result and constantly tried
to improve our score by exploring new methods and alternatives. We tried
different combinations in the preprocessing, scaling and modelling steps,
and achieved our best result with random forests and a standard scaler with
mean imputation, which placed us at rank 850 out of 14,500+ submissions.
Project 2
Delhi Climate Time Series
2.0 Introduction
In New Delhi, the capital of India, the climate is subtropical, with a very mild
and sunny winter and a very hot season from mid-March to mid-June. The air
quality in Delhi, the capital territory of India, is, according to a WHO survey of
1,650 world cities, the worst of any major city in the world. It also affects the
districts around Delhi. Air pollution in India is estimated to kill about 2 million
people every year; it is the fifth largest killer in India. This dataset provides data
from 1st January 2013 to 24th April 2017 for the city of Delhi, India. The 4
parameters are meantemp, humidity, wind_speed and meanpressure.
Column Descriptions:
● date : Date of format YYYY-MM-DD
● meantemp : Mean temperature averaged out from multiple 3-hour intervals in a
day.
● humidity : Humidity value for the day (units are grams of water vapor per cubic
meter volume of air).
● wind_speed : Wind speed measured in kmph.
● meanpressure : Pressure reading of the weather (measured in atm).
2.2 Methodology
2.2.1 Data Visualisation
We plotted various graphs and charts using Seaborn and Matplotlib to get an overview
of the information in the dataset and to look at the various features over the given time
period. In each plot, the X axis is time and the Y axis is the respective feature.
Fig 3: Wind speed over the time period
Observations Made
● From the above plots, it is clearly evident that meantemp, the
mean temperature recorded for each day, follows seasonality.
Seasonality is a regular, periodic change in the mean of the
series. It is a fact that temperature changes from season to
season, and the results from the graphs are unsurprising.
● Similar to meantemp, the graphs of the other features, wind_speed
and humidity, show seasonality.
- The NaN values were checked for, but none were found in the dataset
- The data was clean
- No corrupt or duplicate data was found
2.2.2 Data Modelling
Linear Regression
In this method, simple linear regression is performed on the dataset. The values are
predicted as a linear combination of the previous 80 days' values.
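A sketch of this lagged-regression idea on a synthetic temperature-like series (the project worked on the real meantemp column; the series and seasonal shape here are artificial):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

LAGS = 80  # predict each value from the previous 80 days

# Synthetic daily temperature-like series with yearly seasonality
t = np.arange(500)
series = 25 + 10 * np.sin(2 * np.pi * t / 365)

# Lagged feature matrix: row i holds the 80 values preceding day i
X = np.array([series[i - LAGS:i] for i in range(LAGS, len(series))])
y = series[LAGS:]

# Fit a linear model and forecast the next day from the last 80 observations
model = LinearRegression().fit(X, y)
next_day = model.predict(series[-LAGS:].reshape(1, -1))
```

Each forecast is thus a learned linear combination of the preceding 80 observations, which is the "simple regression" approach described above.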
Stationarity:
Before applying any statistical model to a time series, the series has to be stationary,
which means that, over different time periods,
a) it should have a constant mean,
b) it should have a constant variance or standard deviation, and
c) the auto-covariance should not depend on time.
ARIMA model:
ARIMA (Auto Regressive Integrated Moving Average) is a combination of two models,
AR (Auto Regressive) and MA (Moving Average), joined by an integration (differencing) step.
Rolling Statistics:
Plot the moving average or moving standard deviation to see if it varies with time.
It is a visual technique.
ADF Test:
The ADF (Augmented Dickey-Fuller) test gives us various values that can help in
identifying stationarity.
2.3 RESULTS
Forecasting graph
2.4 Conclusion
This report explains what I learnt about machine learning throughout my internship.
I am confident in saying that my internship at Cognitive Science gave me a lot of
valuable learning experience. I gained knowledge of machine learning technology and
how to use it effectively. In the projects, I learned how to use data pre-processing, data
visualisation, and machine learning. In order to fully comprehend the algorithms, we
also delved deeply into their mathematics, and we undertook the complex process of
building the algorithms from scratch. In addition, we worked on several datasets as our
main projects over the course of two months. All these methods have enabled us to
dive into the beginning phase of the machine learning world.
REFERENCES
1) https://www.kaggle.com/learn/python
2) https://www.kaggle.com/learn/pandas
3) https://www.kaggle.com/learn/data-visualization
4) https://www.kaggle.com/learn/intro-to-machine-learning
5) https://www.kaggle.com/datasets/sumanthvrao/daily-climate-time-series-data
6) https://www.kaggle.com/c/titanic
7) https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
8) https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
9) https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
10) https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html