SE E&TC
SUBJECT: DATA ANALYTICS LAB
Assignment No. 1
TITLE: Data analysis and Python fundamentals.
THEORY:
• TensorFlow
• NumPy
• SciPy
• Pandas
• Matplotlib
• Keras
• SciKit-Learn
• PyTorch
• Scrapy
• BeautifulSoup
Python variables do not need explicit declaration to reserve memory space. The declaration
happens automatically when you assign a value to a variable. The equal sign (=) is used to
assign values to variables.
• Numbers
• String
• List
• Tuple
• Dictionary
Number
Number data types store numeric values. Number objects are created when you assign
a value to them. For example −
var1 = 1
var2 = 10
Strings
Strings in Python are contiguous sequences of characters enclosed in quotation marks
(single or double).
Example
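A minimal sketch of common string operations follows. The variable name greeting and its value are chosen here only for illustration and do not come from the assignment.

```python
# Hypothetical sample string used only for illustration.
greeting = "Hello World!"

print(greeting)             # the complete string
print(greeting[0])          # the first character
print(greeting[2:5])        # the characters at index 2, 3 and 4
print(greeting * 2)         # the string repeated twice
print(greeting + " TEST")   # concatenation with another string
```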
Python Lists
Lists are the most versatile of Python's compound data types. A list contains items
separated by commas and enclosed within square brackets ([]).
The items belonging to a list can be of different data types.
my_list = ['abcd', 786, 2.23, 'john', 70.2]  # avoid the name "list", which would shadow the built-in type
Python Tuples
A tuple is another sequence data type that is similar to the list. A tuple consists of a
number of values separated by commas. Unlike lists, however, tuples are enclosed
within parentheses and cannot be modified after they are created.
my_tuple = ('abcd', 786, 2.23, 'john', 70.2)  # avoid the name "tuple", which would shadow the built-in type
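The difference between the two sequence types can be sketched as follows. The names my_list, my_tuple and tuple_error are illustrative; the values are the same sample values used above.

```python
my_list = ['abcd', 786, 2.23, 'john', 70.2]
my_tuple = ('abcd', 786, 2.23, 'john', 70.2)

print(my_list[0])      # first element of the list
print(my_tuple[1:3])   # elements at index 1 and 2 of the tuple

my_list[2] = 1000      # allowed: a list can be changed in place

tuple_error = None
try:
    my_tuple[2] = 1000  # not allowed: a tuple cannot be changed
except TypeError as error:
    tuple_error = str(error)
    print("Tuple update failed:", tuple_error)
```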
Python Dictionary
Python's dictionaries are a kind of hash table. They store data as key-value pairs.
Dictionaries are enclosed in curly braces ({ }), and values can be assigned and accessed
using square brackets ([]). For example:
my_dict = {'name': 'john', 'code': 6734, 'dept': 'sales'}
print(my_dict['name'])   # Prints the value stored under the key 'name'
print(my_dict.keys())    # Prints all the keys
print(my_dict.values())  # Prints all the values
Histogram
A histogram represents numerical data that has been grouped into ranges (bins). It is a type of
bar plot where the X-axis represents the bin ranges while the Y-axis gives the frequency of the
values falling in each bin. The hist() function of matplotlib.pyplot computes and draws the histogram of x.
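A minimal sketch of hist() is given below. The marks data is made up for illustration, and the Agg backend is selected only so that the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view the plot with plt.show()
import matplotlib.pyplot as plt

# Hypothetical sample data: marks of 15 students.
marks = [35, 42, 47, 51, 55, 58, 61, 63, 66, 70, 72, 75, 78, 84, 91]

plt.figure()
# hist() returns the frequency of each bin, the bin edges, and the drawn bars.
counts, bin_edges, _ = plt.hist(marks, bins=5, edgecolor="black")
plt.xlabel("Marks (bin ranges)")
plt.ylabel("Frequency")
plt.title("Histogram of marks")

print("Frequencies:", counts)
print("Bin edges:", bin_edges)
```

With bins=5 the data range is divided into five equal-width bins, so six bin edges are returned.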
Syntax (scatter plot):
matplotlib.pyplot.scatter(x_axis_data, y_axis_data, s=None, c=None,
marker=None, cmap=None, vmin=None, vmax=None, alpha=None,
linewidths=None, edgecolors=None)
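The scatter() call can be sketched as follows. The hours and scores values are hypothetical, and only a few of the optional parameters (s, c, marker, alpha) are used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view the plot with plt.show()
import matplotlib.pyplot as plt

# Hypothetical sample data: hours studied and the score obtained.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [35, 41, 50, 54, 62, 68, 77, 85]

plt.figure()
# s sets the marker size, c the colour, marker the shape, alpha the transparency.
points = plt.scatter(hours, scores, s=40, c="blue", marker="o", alpha=0.7)
plt.xlabel("Hours studied")
plt.ylabel("Score")
plt.title("Score against hours studied")
```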
CONCLUSION: Thus we have successfully studied and executed visualization of data using
Matplotlib.
Assignment No. 3
TITLE: Handling missing values in data in Python
Theory:
Missing data occurs when no information is provided for one or more items or for a whole
unit. It is a significant problem in real-life scenarios. In pandas, missing data is also
referred to as NA (Not Available) values. Many datasets arrive with missing data, either
because the data exists but was not collected or because it never existed.
For example, some users in a survey may choose not to share their income and others may
choose not to share their address, so those values are missing from the dataset.
There are several useful functions for detecting, removing, and replacing null values in a pandas
DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
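A minimal sketch of the removing and replacing functions from the list above (dropna(), fillna() and replace()) is given below. The survey DataFrame and the fill values 0 and -1 are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with some missing values.
survey = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "John"],
    "income": [52000.0, np.nan, 61000.0, np.nan],
    "city": ["Pune", "Mumbai", None, "Nashik"],
})

# dropna(): remove every row that contains at least one missing value.
rows_without_gaps = survey.dropna()

# fillna(): fill the missing incomes with a fixed value.
income_filled = survey["income"].fillna(0)

# replace(): substitute NaN with a marker value of our choice.
income_replaced = survey["income"].replace(np.nan, -1)

print(rows_without_gaps)
print(income_filled)
print(income_replaced)
```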
Checking for missing values using isnull() and notnull()
To check for missing values in a pandas DataFrame, we use the functions isnull() and notnull(). Both
functions check whether a value is NaN or not, and both can also be used on a pandas Series to find
null values. isnull() returns a DataFrame of Boolean values that are True for NaN values; notnull()
returns the opposite, with True for values that are not NaN.
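The two checks can be sketched as follows, using a small hypothetical DataFrame named scores.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
scores = pd.DataFrame({
    "first_score": [100, 90, np.nan, 95],
    "second_score": [30, np.nan, 45, 56],
})

missing_mask = scores.isnull()            # True where a value is NaN
present_mask = scores.notnull()           # True where a value is not NaN
missing_per_column = scores.isnull().sum()  # number of NaN values in each column

print(missing_mask)
print(present_mask)
print(missing_per_column)
```

Summing the Boolean mask is a common way to count the missing values per column, because True counts as 1.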
The interpolate() function also fills NA values in a DataFrame, but instead of hard-coding a
replacement value it estimates the missing values using an interpolation technique.
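A minimal sketch of linear interpolation on a hypothetical Series named readings: each missing value is estimated from its neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two gaps.
readings = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# Linear interpolation places each missing value on the straight line
# between the known values on either side of it.
filled = readings.interpolate(method="linear")

print(filled)
```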
CONCLUSION: Thus, we have successfully studied and implemented the handling of missing values in data
using Python.
Assignment No. 4
TITLE: Converting categorical variables into quantitative variables in Python
THEORY:
Machine learning models are good at dealing with numeric values, but most of them
cannot use text data directly. So, to build predictive models we have to
convert categorical data into numeric form.
Method 1: Using get_dummies()
Replacing the values is not the most efficient way to convert them. Pandas provides
a function called get_dummies(), which returns the dummy variable columns.
Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_',
dummy_na=False, columns=None, sparse=False, drop_first=False,
dtype=None)
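A minimal sketch of get_dummies() on a hypothetical DataFrame named people. The prefix "edu" and dtype=int are choices made here so that the output columns are easy to read.

```python
import pandas as pd

# Hypothetical data: education level of four people.
people = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "John"],
    "education": ["Graduate", "Masters", "Graduate", "PhD"],
})

# One 0/1 column is created for each distinct education level.
dummies = pd.get_dummies(people, columns=["education"], prefix="edu", dtype=int)

print(dummies)
```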
Method 2: Scikit-Learn
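The assignment names scikit-learn without giving details. One common approach, assumed here rather than taken from the assignment, is sklearn.preprocessing.LabelEncoder, which assigns an integer code to each distinct category. The education_levels list is hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values.
education_levels = ["low", "high", "medium", "low"]

encoder = LabelEncoder()
# fit_transform() learns the distinct categories and returns their integer codes.
encoded = encoder.fit_transform(education_levels)

print(list(encoder.classes_))  # the categories in sorted order
print(list(encoded))           # the integer code of each input value
```

The codes follow the sorted order of the category names, not any natural ranking, so this encoding does not by itself preserve an ordinal meaning.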
Method 3: Using replace() method
Replacing is one of the methods to convert categorical terms into numeric ones. For
example, we will take a dataset of people's salaries based on their level of
education. This is an ordinal type of categorical variable. We will convert the
education levels into numeric terms.
Syntax:
DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False,
method='pad')
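A minimal sketch of the replace() approach follows. The salaries DataFrame, the three education levels and their ranking 1 to 3 are hypothetical; the assignment's own dataset is not shown in the text.

```python
import pandas as pd

# Hypothetical salary data with an ordinal education column.
salaries = pd.DataFrame({
    "education": ["Under-Graduate", "Diploma", "Graduate", "Under-Graduate"],
    "salary": [15000, 20000, 30000, 16000],
})

# Assumed ranking of the education levels, lowest to highest.
education_order = {"Under-Graduate": 1, "Diploma": 2, "Graduate": 3}

# replace() swaps each category name for its numeric rank.
salaries["education_level"] = salaries["education"].replace(education_order).astype(int)

print(salaries)
```

Because the mapping is written by hand, the numeric values keep the intended order of the categories.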
CONCLUSION:
Thus we have successfully converted categorical variables to quantitative variables.
Assignment No. 5
TITLE: Statistical Hypothesis Testing with Python
THEORY:
The Null Hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected. H0 is the
symbol for it.
The Alternative Hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null
hypothesis. H1 is the symbol for it.
Example:
Depending on the population distribution, you can classify the statistical hypothesis
into two types.
Simple Hypothesis: A simple hypothesis specifies an exact value for the parameter.
Composite Hypothesis: A composite hypothesis specifies a range of values.
Example:
A company is claiming that their average sales for this quarter are 1000 units. This
is an example of a simple hypothesis.
Suppose the company claims that the sales are in the range of 900 to 1000 units.
Then this is a case of a composite hypothesis.
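The simple hypothesis above (average sales of 1000 units) can be tested with a one-sample t-test. The sketch below uses scipy.stats.ttest_1samp; the sales figures and the 0.05 significance level are assumptions made for illustration, not data from the assignment.

```python
from scipy import stats

# Hypothetical sales figures observed in ten periods.
sales = [985, 1010, 992, 1003, 978, 1021, 995, 1008, 990, 1002]

claimed_mean = 1000   # H0: the average sales are 1000 units
alpha = 0.05          # assumed significance level

# One-sample t-test of the sample mean against the claimed mean.
t_statistic, p_value = stats.ttest_1samp(sales, popmean=claimed_mean)

# H0 is rejected only when the p-value falls below the significance level.
reject_null = p_value < alpha

print("t statistic:", t_statistic)
print("p-value:", p_value)
print("Reject H0:", reject_null)
```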
Conclusion: Thus we have studied hypothesis testing.
Assignment No. 6
TITLE: Exploratory data analysis: Group by in Python
Theory:
Python is a great language for doing data analysis, primarily because of the fantastic
ecosystem of data-centric Python packages. Pandas is one of those packages and
makes importing and analyzing data much easier.
Pandas groupby is used for grouping the data according to categories and applying
a function to each category. It also helps to aggregate data efficiently.
The pandas DataFrame.groupby() function is used to split the data into groups based on
some criteria. Pandas objects can be split on any of their axes. The abstract
definition of grouping is to provide a mapping of labels to group names.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True,
sort=True, group_keys=True, squeeze=False, **kwargs)
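A minimal sketch of groupby() follows: the rows are split by a category column and an aggregate function is applied to each group. The team_stats DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical data: points scored by players of three teams.
team_stats = pd.DataFrame({
    "team": ["A", "B", "A", "B", "C"],
    "points": [10, 20, 30, 40, 50],
})

# Split the rows by team, then compute the mean of the points in each group.
mean_points = team_stats.groupby("team")["points"].mean()

# Several aggregates can be computed in one step with agg().
summary = team_stats.groupby("team")["points"].agg(["count", "sum", "mean"])

print(mean_points)
print(summary)
```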
A one-way ANOVA tests whether the mean of the dependent variable differs
significantly between the groups (levels) of a single independent variable. When we
know how the group means differ from one another, we can figure out which groups are
linked to the dependent variable and start to figure out what is driving that behaviour.
The two-way analysis of variance is an extension of the one-way analysis. It uses
two independent variables (hence the name two-way), which are called factors.
The concept is that the dependent variable is influenced by two variables, or factors.
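A one-way ANOVA can be sketched with scipy.stats.f_oneway, as below. The three groups of scores and the 0.05 significance level are assumptions made for illustration. This sketch covers the one-way case only.

```python
from scipy import stats

# Hypothetical scores of students taught by three different methods.
method_a = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
method_b = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
method_c = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

alpha = 0.05  # assumed significance level

# H0: all three group means are equal.
f_statistic, p_value = stats.f_oneway(method_a, method_b, method_c)

# H0 is rejected only when the p-value falls below the significance level.
reject_null = p_value < alpha

print("F statistic:", f_statistic)
print("p-value:", p_value)
print("Reject H0:", reject_null)
```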
CONCLUSION: Thus we have studied ANOVA successfully.
Assignment No. 8
TITLE: Model development using linear and multiple linear regression
THEORY:
Regression analysis is a statistical method that helps us to understand the
relationship between a dependent variable and one or more independent variables.
Dependent Variable
This is the main factor that we are trying to predict.
Independent Variable
These are the variables that have a relationship with the dependent variable.
Types of Regression Analysis
There are many types of regression analysis, but in this assignment we will deal with:
1. Simple Linear Regression
2. Multiple Linear Regression
Linear Regression:
In machine learning terms, Linear Regression (LR) means finding the best-fitting
line that explains the variability between the dependent and independent
features. In other words, it describes the linear relationship between the
independent and dependent features. A linear regression algorithm
predicts continuous values (e.g. salary, price) rather than
categorical values (e.g. cat, dog).
Simple Linear Regression
Simple Linear Regression uses the slope-intercept (weight-bias) form, in which the
model must find the optimal values of both the slope and the intercept. With these
optimal values, the model can capture the relationship between the independent and
dependent features and produce accurate predictions. In simple linear regression, the
model takes a single independent variable and a single dependent variable.
There are many equations that represent a straight line; we will use the
common one:
y = b0 + b1*x    (equivalently, y = m*x + c)
Here, y and x are the dependent and independent variables respectively, and
b1 (m) and b0 (c) are the slope and y-intercept respectively.
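Finding the slope b1 and the intercept b0 can be sketched with a least-squares fit. The sketch below uses numpy.polyfit, which is one of several ways to do this and is not necessarily the method used in the lab. The x and y values are hypothetical.

```python
import numpy as np

# Hypothetical data: one independent variable x and one dependent variable y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# A degree-1 polynomial fit returns the least-squares slope and intercept.
b1, b0 = np.polyfit(x, y, deg=1)

# Use the fitted line y = b0 + b1*x to predict y for a new x value.
new_x = 6.0
predicted_y = b0 + b1 * new_x

print("slope b1:", b1)
print("intercept b0:", b0)
print("prediction for x = 6:", predicted_y)
```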
Multiple Linear Regression
In multiple linear regression, the model follows the same steps, but instead of
a single independent variable it uses several independent variables to predict
the dependent variable.
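With two independent variables the model takes the form y = b0 + b1*x1 + b2*x2. A minimal sketch using numpy.linalg.lstsq is given below. The data is hypothetical and was constructed to follow y = 1 + 2*x1 + 3*x2 exactly, so the fitted coefficients can be checked by eye.

```python
import numpy as np

# Hypothetical data with two independent variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 1.0, 2.0, 3.0, 5.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2   # dependent variable

# Design matrix: a column of ones for the intercept, then one column per variable.
design = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of design @ coefficients = y.
coefficients, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coefficients

print("intercept b0:", b0)
print("coefficient b1:", b1)
print("coefficient b2:", b2)
```

The same design matrix extends to any number of independent variables by adding one column for each.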