
LAB MANUAL

SE E&TC
SUBJECT: DATA ANALYTICS LAB
Assignment No. 1
TITLE: Data analysis and python fundamentals.
THEORY:

Python is a high-level, interpreted, interactive and object-oriented scripting language. Python is
designed to be highly readable.
Popular Python libraries for data science include:

• TensorFlow
• NumPy
• SciPy
• Pandas
• Matplotlib
• Keras
• SciKit-Learn
• PyTorch
• Scrapy
• BeautifulSoup
Python variables do not need explicit declaration to reserve memory space. The declaration
happens automatically when you assign a value to a variable. The equal sign (=) is used to
assign values to variables.

Python has five standard data types −

• Numbers
• String
• List
• Tuple
• Dictionary

Number
Number data types store numeric values. Number objects are created when you assign
a value to them. For example −
var1 = 1
var2 = 10

Strings
Strings in Python are contiguous sequences of characters enclosed in quotation marks
(single or double).
Example

str = 'Hello World!'

Python Lists
Lists are the most versatile of Python's compound data types. A list contains items
separated by commas and enclosed within square brackets ([]).
The items belonging to a list can be of different data type.
list = [ 'abcd', 786 , 2.23, 'john', 70.2 ]

Python Tuples
A tuple is another sequence data type that is similar to the list. A tuple consists of a
number of values separated by commas. Unlike lists, however, tuples are enclosed
within parentheses and cannot be changed after they are created.
tuple = ( 'abcd', 786 , 2.23, 'john', 70.2 )

Python Dictionary
Python's dictionaries are a kind of hash table: they store key-value pairs.
Dictionaries are enclosed by curly braces ({ }) and values can be assigned and accessed
using square brackets ([]). For example −
dict = {'name': 'john','code':6734, 'dept': 'sales'}
print(dict.keys()) # Prints all the keys
print(dict.values()) # Prints all the values
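
The five data types described above can be tried together in one short script. This is only a sketch: the variable names and values are hypothetical, and descriptive names are used instead of str, list, tuple and dict so that Python's built-in types are not overwritten.

```python
# Hypothetical values, chosen only to illustrate the five standard types.
count = 10                                   # number (int)
price = 2.23                                 # number (float)
greeting = 'Hello World!'                    # string
items = ['abcd', 786, 2.23, 'john', 70.2]    # list (mutable)
record = ('abcd', 786, 2.23, 'john', 70.2)   # tuple (immutable)
employee = {'name': 'john', 'code': 6734, 'dept': 'sales'}  # dictionary

# Indexing and slicing work the same way on strings, lists and tuples.
first_word = greeting[0:5]     # characters at positions 0 to 4
first_item = items[0]
last_item = record[-1]

# Lists can be changed in place; tuples cannot.
items[1] = 1000
try:
    record[1] = 1000
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Dictionary values are read and written through their keys.
employee['dept'] = 'marketing'
keys = list(employee.keys())
print(first_word, first_item, last_item, tuple_changed, keys)
```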

CONCLUSION: Thus we have successfully studied the fundamentals of Python.


Assignment No. 2
TITLE: Data visualization in python using matplotlib.
THEORY:
Pyplot is a Matplotlib module that provides a MATLAB-like interface. Matplotlib is designed to
be as usable as MATLAB, with the ability to use Python and the advantage of being free and open-
source. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a
plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
The various plots we can utilize using Pyplot are Line Plot, Histogram, Scatter, 3D Plot, Image,
Contour, and Polar.
Syntax to import pyplot
import matplotlib.pyplot as plt

Different types of Matplotlib Plots


Line Chart
Line chart is one of the basic plots and can be created using the plot() function.

matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)
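
A minimal line chart might look like the sketch below. The data is hypothetical, and the "Agg" backend line is an assumption made so that the script also runs on a machine without a display; it can be dropped in a notebook or desktop session.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical data: x values and their squares.
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
lines = ax.plot(x, y, marker="o", label="y = x^2")  # plot() returns a list of Line2D objects
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Line chart")
ax.legend()
# plt.show()  # uncomment in an interactive session to display the figure
```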


Bar Chart
A bar chart describes the comparisons between the discrete categories. It can be created using
the bar() method.
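
As a sketch, a bar chart comparing a few categories could be drawn as follows. The branch names and student counts are made up for illustration, and the "Agg" backend is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical category counts.
branches = ["E&TC", "Computer", "Mechanical"]
students = [60, 72, 55]

fig, ax = plt.subplots()
bars = ax.bar(branches, students, color="steelblue")  # one rectangle per category
ax.set_xlabel("Branch")
ax.set_ylabel("Number of students")
ax.set_title("Bar chart")
heights = [bar.get_height() for bar in bars]
# plt.show()  # uncomment in an interactive session to display the figure
```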

Histogram
A histogram is basically used to represent data provided in a form of some groups. It is a type of
bar plot where the X-axis represents the bin ranges while the Y-axis gives information about
frequency. The hist() function is used to compute and create histogram of x.
Syntax:

matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None,
cumulative=False, bottom=None, histtype='bar', align='mid',
orientation='vertical', rwidth=None, log=False, color=None, label=None,
stacked=False, *, data=None, **kwargs)
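
The sketch below shows hist() on a small set of hypothetical exam marks, using only the x, bins and range parameters from the syntax above. hist() returns the count in each bin together with the bin edges, so the frequencies can be inspected as well as plotted. The "Agg" backend is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical exam marks out of 100.
marks = [35, 42, 48, 55, 58, 61, 64, 67, 72, 75, 78, 81, 88, 93]

fig, ax = plt.subplots()
# Five equal-width bins covering 0-100: 0-20, 20-40, 40-60, 60-80, 80-100.
counts, bin_edges, patches = ax.hist(marks, bins=5, range=(0, 100), edgecolor="black")
ax.set_xlabel("Marks")
ax.set_ylabel("Frequency")
ax.set_title("Histogram")
# plt.show()  # uncomment in an interactive session to display the figure
```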
Scatter Plot
Scatter plots are used to observe relationships between variables. The scatter() method in the
matplotlib library is used to draw a scatter plot.

Syntax:
matplotlib.pyplot.scatter(x_axis_data, y_axis_data, s=None, c=None,
marker=None, cmap=None, vmin=None, vmax=None, alpha=None,
linewidths=None, edgecolors=None)
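
A scatter plot of two related variables could be sketched like this. The study-hours and score values are hypothetical, and the "Agg" backend is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical data: hours studied versus exam score.
hours = [1, 2, 3, 4, 5, 6]
scores = [35, 45, 50, 62, 70, 80]

fig, ax = plt.subplots()
# s sets the marker size, c the colour, alpha the transparency.
points = ax.scatter(hours, scores, s=40, c="tab:red", marker="o", alpha=0.8)
ax.set_xlabel("Hours studied")
ax.set_ylabel("Score")
ax.set_title("Scatter plot")
# plt.show()  # uncomment in an interactive session to display the figure
```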

CONCLUSION: Thus we have successfully studied and executed visualization of data using
Matplotlib.
Assignment No. 3
TITLE: Handling missing values in data in python

Theory:

Missing data occurs when no information is provided for one or more items or for a whole
unit. It is a very common problem in real-life scenarios. In pandas, missing data is also referred to
as NA (Not Available) values. Many datasets simply arrive with missing data, either because the
information exists but was not collected or because it never existed.

For example, some users being surveyed may choose not to share their income, and
others may choose not to share their address; in this way values go missing from many datasets.

There are several useful functions for detecting, removing, and replacing null values in a Pandas
DataFrame:

• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
Checking for missing values using isnull() and notnull()
To check for missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both
functions help in checking whether a value is NaN or not. These functions can also be used on a Pandas Series
to find null values in the series.

Checking for missing values using isnull()

To check for null values in a Pandas DataFrame, we use the isnull() function. It returns a DataFrame
of Boolean values which are True for NaN values.
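
The following sketch applies isnull() and notnull() to a small survey table. The column names and values are hypothetical and only serve to show what the Boolean result looks like and how it can be used to count or select missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; np.nan and None mark answers that were not given.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "John"],
    "income": [52000.0, np.nan, 61000.0, np.nan],
    "city": ["Pune", "Mumbai", None, "Nashik"],
})

missing_mask = df.isnull()              # True where a value is missing
present_mask = df.notnull()             # the exact opposite of isnull()
missing_per_column = df.isnull().sum()  # number of missing values in each column

# A Boolean Series can be used to select the rows with a missing income.
rows_without_income = df[df["income"].isnull()]
print(missing_per_column)
```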

Filling missing values using fillna(), replace() and interpolate()


To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions. These functions
replace NaN values with a substitute value, and all of them help in filling the null values
of a DataFrame.

The interpolate() function also fills NA values in the DataFrame, but it uses an interpolation
technique to estimate the missing values rather than a hard-coded value.
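
The difference between these functions is easiest to see on a short Series with gaps. The readings below are hypothetical; dropna() is included only for comparison, since it removes the missing entries instead of filling them.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with two gaps.
temps = pd.Series([20.0, np.nan, 24.0, np.nan, 28.0])

filled_constant = temps.fillna(0)         # replace NaN with a fixed value
filled_mean = temps.fillna(temps.mean())  # replace NaN with the mean of the known values
replaced = temps.replace(np.nan, -1)      # replace() can also target NaN
interpolated = temps.interpolate()        # default: linear interpolation between neighbours
dropped = temps.dropna()                  # discard the missing entries instead of filling

print(interpolated.tolist())
```

Which strategy is appropriate depends on the data: a constant or mean is simple but flattens the series, while interpolation assumes neighbouring values are related, as in a time series.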

CONCLUSION: Thus, we have successfully studied and implemented handling missing values in data
using Python.
Assignment No. 4
TITLE: Converting categorical variables into quantitative variables in Python
THEORY:
Machine learning is good at dealing with numeric values, but most machine
learning models cannot use text categories directly. So, to make predictive models we have to
convert categorical data into numeric form.
Method 1: Using get_dummies()
Replacing the values by hand is not the most efficient way to convert them. Pandas provides
a function called get_dummies() which returns the dummy variable columns, one 0/1 column per category.
Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_',
dummy_na=False, columns=None, sparse=False, drop_first=False,
dtype=None)
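
As an illustration, get_dummies() can be applied to one categorical column of a small table. The "colour" and "price" columns are hypothetical; dtype=int is passed so that the dummy columns hold 0/1 rather than True/False.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame({
    "colour": ["red", "green", "blue", "green"],
    "price": [10, 12, 9, 11],
})

# One new 0/1 column is created for every distinct value of "colour".
dummies = pd.get_dummies(df, columns=["colour"], dtype=int)

# drop_first=True drops the first category, which avoids a redundant column.
dummies_reduced = pd.get_dummies(df, columns=["colour"], drop_first=True, dtype=int)
print(dummies)
```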

Method 2: Scikit-Learn
Method 3: Using replace() method
Replacing is one of the methods to convert categorical terms into numeric ones. For
example, we will take a dataset of people's salaries based on their level of
education. This is an ordinal type of categorical variable, so we will convert the
education levels into numbers that preserve their order.
Syntax:
replace(to_replace=None, value=None, inplace=False, limit=None, regex=False,
method='pad')
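
A sketch of this conversion is given below. The education levels, salaries and the numeric codes assigned to each level are all hypothetical; the only requirement for an ordinal variable is that the codes follow the natural order of the categories.

```python
import pandas as pd

# Hypothetical salary data by education level.
df = pd.DataFrame({
    "education": ["High School", "Bachelor", "Master", "Bachelor", "PhD"],
    "salary": [25000, 40000, 55000, 42000, 70000],
})

# Assumed ordering of the levels, lowest to highest.
education_order = {"High School": 0, "Bachelor": 1, "Master": 2, "PhD": 3}

# A dictionary passed to replace() maps each old value to its new value.
df["education_level"] = df["education"].replace(education_order)
print(df)
```

When every category appears in the dictionary, df["education"].map(education_order) produces the same codes.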

CONCLUSION:
Thus we have successfully converted categorical variables to quantitative variables
Assignment No. 5
TITLE: Statistical Hypothesis Testing with Python

THEORY:

Hypothesis testing is the analysis of assumptions on a population sample. In other
words, it involves checking whether a hypothesis should be accepted or not.

Hypothesis testing has improved decision-making in different sectors including
business. Today, organizations rely on hypothesis testing because of the enormous
amount of data generated across the globe.

Null Hypothesis and Alternate Hypothesis

The Null Hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected.

H0 is the symbol for it, and it is pronounced H-naught.

The Alternate Hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null
hypothesis. H1 is the symbol for it.

Example:

A sanitizer manufacturer claims that its product kills 95 percent of germs on
average.
To put this company's claim to the test, create a null and alternate hypothesis.
H0 (Null Hypothesis): Average = 95%.
H1 (Alternate Hypothesis): Average is less than 95%.
Another straightforward example to understand this concept is determining
whether or not a coin is fair and balanced. The null hypothesis states that the
probability of a show of heads is equal to the probability of a show of tails. In
contrast, the alternate hypothesis states that the probabilities of a show of heads and
a show of tails are different.
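
The coin example can be turned into an actual test. The sketch below computes an exact two-sided binomial p-value using only the standard library. The function name, the outcome of 62 heads in 100 tosses and the significance level of 0.05 are all assumptions made for illustration.

```python
from math import comb


def binomial_two_sided_p(heads, tosses, p=0.5):
    """Exact two-sided p-value for `heads` in `tosses` under H0: P(heads) = p."""
    probs = [comb(tosses, k) * p**k * (1 - p)**(tosses - k) for k in range(tosses + 1)]
    observed = probs[heads]
    # Add up every outcome that is at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(pr for pr in probs if pr <= observed * (1 + 1e-9))


alpha = 0.05                                           # assumed significance level
p_value = binomial_two_sided_p(heads=62, tosses=100)   # hypothetical experiment
reject_h0 = p_value < alpha                            # True means the coin looks unfair
print(p_value, reject_h0)
```

If the p-value is below the chosen significance level, H0 (the coin is fair) is rejected in favour of H1; otherwise the data gives no reason to reject H0.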
Simple and Composite Hypothesis Testing

Depending on the population distribution, you can classify the statistical hypothesis
into two types.
Simple Hypothesis: A simple hypothesis specifies an exact value for the parameter.
Composite Hypothesis: A composite hypothesis specifies a range of values.
Example:
A company is claiming that their average sales for this quarter are 1000 units. This
is an example of a simple hypothesis.
Suppose the company claims that the sales are in the range of 900 to 1000 units.
Then this is a case of a composite hypothesis.
Conclusion: Thus we have studied the Hypothesis testing.
Assignment No. 6
TITLE: Exploratory data analysis: Group by in Python
Theory:
Python is a great language for doing data analysis, primarily because of the fantastic
ecosystem of data-centric python packages. Pandas is one of those packages and
makes importing and analyzing data much easier.
Pandas groupby is used for grouping the data according to the categories and apply
a function to the categories. It also helps to aggregate data efficiently.
Pandas dataframe.groupby() function is used to split the data into groups based on
some criteria. pandas objects can be split on any of their axes. The abstract
definition of grouping is to provide a mapping of labels to group names.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True,
sort=True, group_keys=True, squeeze=False, **kwargs)
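
A short sketch of the split, apply and combine steps is shown below, using a hypothetical table of departments and salaries.

```python
import pandas as pd

# Hypothetical employee data.
df = pd.DataFrame({
    "dept": ["sales", "sales", "hr", "hr", "it"],
    "salary": [30000, 34000, 28000, 32000, 45000],
})

# Split the rows by department, then apply mean() to each group's salaries.
mean_salary = df.groupby("dept")["salary"].mean()

# agg() applies several functions to every group at once.
summary = df.groupby("dept")["salary"].agg(["count", "min", "max"])
print(mean_salary)
print(summary)
```

With the default sort=True, the group keys appear in sorted order in the result.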

Conclusion: Thus we have successfully implemented the group by function in
Python.
Assignment No. 7
TITLE: Exploratory data analysis: Analysis of Variance (ANOVA)
THEORY:
Analysis of variance (ANOVA) is a statistical technique that is used to check if the
means of two or more groups are significantly different from each other. ANOVA
checks the impact of one or more factors by comparing the means of different
samples.
For example, we can use ANOVA to test whether several medication treatments were equally
effective or not.
Another measure to compare the samples is called a t-test. When we have only two
samples, t-test and ANOVA give the same results. However, using a t-test would
not be reliable in cases where there are more than 2 samples. If we conduct
multiple t-tests for comparing more than two samples, it will have a compounded
effect on the error rate of the result.
Terminologies used in the technique.
Means (Grand and Sample)
A sample mean is the average value for a group, whereas the grand mean is the
average of sample means from various groups or the mean of all observations
combined.
F-Statistics
F-statistic or F-ratio is a statistical measure that tells us about the extent of
difference between the means of different samples. It is the ratio of between-group
variability to within-group variability. The lower the F-ratio, the closer
the sample means.
Sum of Squares
The sum of squares is a technique used in regression analysis to determine the
dispersion of data points. It is used in the ANOVA test to compute the value of F.
Mean Squared Error (MSE)
In ANOVA, the Mean Squared Error is the within-group sum of squares divided by its
degrees of freedom. It estimates the variance of the observations inside each group.
Hypothesis
In ANOVA, we have a Null Hypothesis and an Alternative Hypothesis. The Null
hypothesis is valid when all the sample means are equal, or they don't have any
major difference.
The Alternate Hypothesis is valid when at least one of the sample means is different
from the others.
Group Variability
In ANOVA, a group is a set of samples within the independent variable.
Between-group variability occurs when there is a significant variation in the sample
distributions of individual groups.
Within-group variability occurs when there are variations in the sample distribution
within a single group.
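
The terms above fit together in the calculation of the F-ratio. The following sketch computes it by hand for a one-way ANOVA, using only the standard library. The function name and the three treatment groups are hypothetical.

```python
def one_way_anova_f(groups):
    """Return the F-ratio for a list of groups (each a list of numbers)."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)             # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: how far each observation is from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)        # mean square between groups
    ms_within = ss_within / (n_total - k)    # mean squared error (within groups)
    return ms_between / ms_within


# Hypothetical recovery times (in days) under three treatments.
treatments = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f_ratio = one_way_anova_f(treatments)
print(f_ratio)
```

A large F-ratio suggests that at least one group mean differs from the others. In practice scipy.stats.f_oneway performs the same computation and also returns the p-value.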

A one-way ANOVA is used to see whether there are any significant differences between the
means of the groups defined by a single independent variable. When we know how each
group's mean differs from the others, we can figure out which groups are linked
to the dependent variable and start to figure out what is driving that behaviour.
The two-way analysis of variance is a variation of the one-way analysis. There are
two independent variables in this equation (hence the name two-way). Factors are
the two independent variables in a two-way ANOVA. The concept is that the
dependent variable is influenced by two variables, or factors.
CONCLUSION: Thus we have studied ANOVA successfully.
Assignment No. 8
TITLE: Model development using linear and multiple linear regression
THEORY:
Regression analysis is a statistical method that helps us to understand the
relationship between a dependent variable and one or more independent variables.
Dependent Variable
This is the Main Factor that we are trying to predict.
Independent Variable
These are the variables that have a relationship with the dependent variable.
Types of Regression Analysis
There are many types of regression analysis, but in this assignment we will deal with:
1. Simple Linear Regression
2. Multiple Linear Regression
Linear Regression:
In machine learning terms, Linear Regression (LR) means finding the best-fitting
line that explains the relationship between the dependent and independent
features; in other words, it describes the linear relationship between them.
In linear regression, the algorithm predicts continuous values (e.g. salary, price)
rather than categorical labels (e.g. cat, dog).
Simple Linear Regression
Simple Linear Regression uses the slope-intercept (weight-bias) form, where our
model needs to find the optimal value for both slope and intercept. So with the
optimal values, the model can capture the relationship between the independent and
dependent features and produce accurate results. In simple linear regression, the
model takes a single independent variable and a single dependent variable.
There are many equations to represent a straight line; we will stick with the
common equation,
y = b0 + b1*x (equivalently y = m*x + c)
Here, y and x are the dependent and independent variables respectively.
b1 (m) and b0 (c) are the slope and y-intercept respectively.
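
One common way to find the optimal slope and intercept is ordinary least squares, sketched below for the usual slope-intercept form y = b0 + b1*x. The function name and the experience/salary data are hypothetical, and least squares is an assumption here, since the text does not say which fitting method is used.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical data: years of experience and salary in thousands.
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

b0, b1 = fit_simple_linear_regression(years, salary)
predicted = b0 + b1 * 6   # predicted salary for 6 years of experience
print(b0, b1, predicted)
```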
Multiple Linear Regression
In multiple linear regression, our model will apply the same steps. In multiple linear
regression instead of having a single independent variable, the model has multiple
independent variables to predict the dependent variable.

y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn

where b0 is the y-intercept, b1,b2,b3,b4…,bn are slopes of the independent
variables x1,x2,x3,x4…,xn and y is the dependent variable.
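
With several independent variables, the intercept and slopes of y = b0 + b1*x1 + ... + bn*xn can be estimated by least squares in one step. The sketch below uses NumPy's numpy.linalg.lstsq with two independent variables. The data is hypothetical and was generated from known coefficients (1, 2 and 3) so that the fit can be checked.

```python
import numpy as np

# Hypothetical data: two independent variables per observation.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
# Dependent variable built from known coefficients: y = 1 + 2*x1 + 3*x2.
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# A column of ones lets the model learn the intercept b0.
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of design @ coeffs = y.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coeffs

prediction = b0 + b1 * 6.0 + b2 * 2.0   # predict y for x1 = 6, x2 = 2
print(b0, b1, b2, prediction)
```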
CONCLUSION:
Thus we have studied and implemented linear regression successfully.
