
MODULE-1: PYTHON PROGRAMMING

What is Python:

Python is a popular programming language. It was created by Guido van Rossum and released in 1991.

It is used for:

 web development (server-side),
 software development,
 mathematics,
 system scripting.

What can Python do:

 Python can be used on a server to create web applications.
 Python can be used alongside software to create workflows.
 Python can connect to database systems. It can also read and modify files.
 Python can be used to handle big data and perform complex mathematics.
 Python can be used for rapid prototyping, or for production-ready
software development.

What is a statement in Python:

A statement is an instruction that the Python interpreter can execute. So, in simple words, we can say that anything written in Python is a statement.

A Python statement ends with a NEWLINE token, which means each line in a Python script is a statement.

For example, a = 10 is an assignment statement, where a is a variable name and 10 is its value. There are other kinds of statements such as the if statement, for statement, and while statement; we will learn about them in the following lessons.

There are mainly four types of statements in Python: print statements, assignment statements, conditional statements, and looping statements.

The print and assignment statements are commonly used. The result of a print statement is the value displayed on the screen. An assignment statement doesn't produce a visible result; it just assigns a value to the variable on its left side.
A Python script usually contains a sequence of statements. If there is more than one statement, the results appear one at a time as the statements execute.
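For example, the following two-line script contains an assignment statement followed by a print statement; running it displays the assigned value:

a = 10      # assignment statement: binds the value 10 to the name a
print(a)    # print statement: displays 10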

The if-else statement

The if-else statement provides an else block combined with the if statement
which is executed in the false case of the condition.

If the condition is true, then the if-block is executed. Otherwise, the else-block
is executed.
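A minimal sketch (the variable num and its value are assumed here for illustration):

num = 7
if num % 2 == 0:
    print('Even number')    # if-block: runs when the condition is True
else:
    print('Odd number')     # else-block: runs when the condition is False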


for Loop:

In Python, the for loop is used to run a block of code a certain number of times. It is used to iterate over any sequence such as a list, tuple, or string.
The syntax of the for loop is:
for val in sequence:
    # statement(s)
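For instance, the loop below iterates over a list (the values are assumed for illustration):

languages = ['Python', 'C', 'Java']
for lang in languages:
    print(lang)    # prints each element of the list in turn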
while Loop

The Python while loop is used to repeat a block of code as long as a certain condition is true.
The syntax of while loop is:
while condition:
    # body of while loop
Here,

1. A while loop evaluates the condition.
2. If the condition evaluates to True, the code inside the while loop is executed.
3. The condition is evaluated again.
4. This process continues until the condition is False.
5. When the condition evaluates to False, the loop stops.
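A minimal sketch of a counting loop:

count = 1
while count <= 5:
    print(count)    # prints 1, 2, 3, 4, 5
    count += 1      # update the variable so the condition eventually becomes False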

Introduction to Python Functions:

A function is a block of code that performs a specific task.

Suppose, you need to create a program to create a circle and color it. You can
create two functions to solve this problem:

 create a circle function

 create a color function

Dividing a complex problem into smaller chunks makes our program easy to
understand and reuse.

Types of function:

There are two types of function in Python programming:

 Standard library functions - These are built-in functions in Python that are available to use.
 User-defined functions - We can create our own functions based on our requirements.
Python Function Declaration:
The syntax to declare a function:

def function_name(arguments):
    # function body
    return

Here,

 def - keyword used to declare a function
 function_name - any name given to the function
 arguments - any value passed to function
 return (optional) - returns value from a function

Let's see an example,

def greet():
    print('Hello world!')

Here, we have created a function named greet(). It simply prints the text Hello world!.

This function doesn't have any arguments and doesn't return any values.
We will learn about arguments and return statements later in this tutorial.
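To execute the body of the function, it has to be called by its name:

greet()    # prints: Hello world!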
MODULE-2: NECESSARY MODULES FOR DATA SCIENCE

NumPy

NumPy (Numerical Python) is the fundamental package for numerical computation in Python; it contains a powerful N-dimensional array object. It has
around 18,000 comments on GitHub and an active community of 700
contributors. It’s a general-purpose array-processing package that provides
high-performance multidimensional objects called arrays and tools for working
with them. NumPy also addresses the slowness problem partly by providing
these multidimensional arrays as well as providing functions and operators that
operate efficiently on these arrays.

Features:

 Provides fast, precompiled functions for numerical routines

 Array-oriented computing for better efficiency

 Supports an object-oriented approach

 Compact and faster computations with vectorization

Applications:

 Extensively used in data analysis

 Creates powerful N-dimensional array

 Forms the base of other libraries, such as SciPy and scikit-learn

 Replacement of MATLAB when used with SciPy and matplotlib
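A minimal sketch of array-oriented (vectorized) computation with NumPy:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # a 2 x 3 N-dimensional array
print(a.shape)                          # (2, 3)
print(a * 2)                            # element-wise multiplication, no explicit loop
print(a.mean(axis=0))                   # column-wise mean: [2.5 3.5 4.5]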


Pandas

Pandas (Python data analysis) is a must in the data science life cycle. It is the most popular and widely used Python library for data science, along with NumPy and matplotlib. With around 17,000 comments on GitHub and an active community of 1,200 contributors, it is heavily used for data analysis and cleaning. Pandas provides fast, flexible data structures, such as the DataFrame, which are designed to make working with structured data easy and intuitive.

Features:

 Eloquent syntax and rich functionalities that give you the freedom to deal with missing data

 Enables you to create your own function and run it across a series of
data

 High-level abstraction

 Contains high-level data structures and manipulation tools

Applications:

 General data wrangling and data cleaning

 ETL (extract, transform, load) jobs for data transformation and data
storage, as it has excellent support for loading CSV files into its data
frame format

 Used in a variety of academic and commercial areas, including statistics, finance and neuroscience

 Time-series-specific functionality, such as date range generation, moving window, linear regression and date shifting.
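A minimal sketch of typical Pandas data wrangling; the file name and column names below are assumed purely for illustration:

import pandas as pd

df = pd.read_csv('sales.csv')                    # hypothetical CSV file loaded into a DataFrame
print(df.head())                                  # inspect the first five rows
print(df.isnull().sum())                          # count missing values per column
df['revenue'] = df['revenue'].fillna(0)           # fill missing values in an assumed 'revenue' column
monthly = df.groupby('month')['revenue'].sum()    # simple aggregation over an assumed 'month' column
print(monthly)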
Matplotlib

Matplotlib has powerful yet beautiful visualizations. It’s a plotting library for
Python with around 26,000 comments on GitHub and a very vibrant community
of about 700 contributors. Because of the graphs and plots that it produces, it’s
extensively used for data visualization. It also provides an object-oriented API,
which can be used to embed those plots into applications.

Features:

 Usable as a MATLAB replacement, with the advantage of being free and open source

 Supports dozens of backends and output types, which means you can
use it regardless of which operating system you’re using or which
output format you wish to use

 Its pyplot interface can be used as a MATLAB-like API, so plots can be driven with familiar, cleaner commands

 Low memory consumption and better runtime behavior

Applications:

 Correlation analysis of variables

 Visualize 95 percent confidence intervals of the models

 Outlier detection using a scatter plot etc.

 Visualize the distribution of data to gain instant insights.
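A minimal sketch using the object-oriented API, with made-up data:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]        # made-up data for illustration
y = [2, 4, 1, 6, 3]

fig, ax = plt.subplots()   # object-oriented API: a figure and an axes object
ax.scatter(x, y)           # scatter plot, e.g. for spotting outliers
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('A simple scatter plot')
plt.show()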


Scikit-learn

Next in the list of the top Python libraries for data science comes Scikit-learn, a machine learning library that provides almost all the machine learning algorithms you might need. Scikit-learn is designed to interoperate with NumPy and SciPy.

Applications:

 clustering

 classification

 regression

 model selection

 dimensionality reduction
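A minimal clustering sketch with scikit-learn, using a handful of made-up 2-D points:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])   # made-up points forming two obvious groups

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)             # cluster assignment for each point
print(kmeans.cluster_centers_)    # coordinates of the two cluster centres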
MODULE-3: DATA PREPROCESSING:
Introduction:

Data Cleaning is the process of finding and correcting the inaccurate or incorrect data present in a dataset. One necessary step is dealing with the values that are missing in the dataset. In real life, many datasets have a lot of missing values, so handling them is an important step.

Why do you need to fill in the missing data:

Because most of the machine learning models that you want to use will raise an error if you pass NaN values to them. The easiest way is to just fill them with 0, but this can reduce your model's accuracy significantly.

For filling missing values, there are many methods available. For choosing the
best method, you need to understand the type of missing value and its
significance, before you start filling/deleting the data to completely understand
how to handle missing values in Python.

First Look at the Dataset

In this article, I will be working with the Titanic Dataset from Kaggle.

 Import the required libraries that you will be using – numpy and pandas – and load the data (a minimal sketch follows below). See that the dataset contains many columns like PassengerId, Name, Age, etc. We won't be working with all the columns in the dataset, so I am going to delete the columns I don't need.
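A minimal sketch of loading the data, assuming the Kaggle Titanic training file is saved as train.csv (the file name is an assumption):

import numpy as np
import pandas as pd

df = pd.read_csv('train.csv')   # hypothetical path to the Titanic training data
df.head()                        # shows columns such as PassengerId, Name, Age, ...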

df.drop("Name",axis=1,inplace=True)
df.drop("Ticket",axis=1,inplace=True)
df.drop("PassengerId",axis=1,inplace=True)
df.drop("Cabin",axis=1,inplace=True)
df.drop("Embarked",axis=1,inplace=True)

See that there are also categorical values in the dataset, for this, you need to use
Label Encoding or One Hot Encoding.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])   # encode the categorical 'Sex' column as integers
newdf = df
# splitting the data into X and y
y = df['Survived']
df.drop("Survived", axis=1, inplace=True)

How to know whether the data has missing values?

Missing Value Treatment in Python – Missing values are usually represented in the form of NaN, null, or None in the dataset.

The df.info() function can be used to get information about the dataset. It will provide you with the column names along with the number of non-null values in each column.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null int64
2 Age 714 non-null float64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Fare 891 non-null float64
dtypes: float64(2), int64(4)
memory usage: 41.9 KB

See that there are null values in the column Age.

The second way of finding whether we have null values in the data is by using
the isnull() function.

print(df.isnull().sum())
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
dtype: int64

See that all the null values in the dataset are in the column – Age.
Let’s try fitting the data using logistic regression.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
---------------------------------------------------------------------------

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

See that the logistic regression model does not work, as we have NaN values in the dataset. Only some machine learning algorithms can work with missing data; for example, some k-NN implementations can simply ignore the missing values.

Now let’s look at the different methods that you can use to deal with the
missing data.

The methods I will be discussing are

1. Deleting the columns with missing data
2. Deleting the rows with missing data
3. Filling the missing data with a value – Imputation
4. Imputation with an additional column
5. Filling with a Regression Model

1. Deleting the column with missing data

In this case, let's delete the column Age, then fit the model and check its accuracy.

But this is an extreme case and should only be used when there are many null
values in the column.

updated_df = df.dropna(axis=1)
updated_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null int64
2 SibSp 891 non-null int64
3 Parch 891 non-null int64
4 Fare 891 non-null float64
dtypes: float64(1), int64(4)
memory usage: 34.9 KB
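To check the accuracy after dropping the column, a minimal sketch continuing the code above (the exact score will vary between runs):

X_train, X_test, y_train, y_test = train_test_split(updated_df, y, test_size=0.3)
lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))   # accuracy on the held-out test set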
MODULE-4: EDA (Exploratory Data Analysis)

Exploratory Data Analysis (EDA) is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations.

Dataset Used

For the simplicity of the article, we will use a single dataset. We will use the
employee data for this. It contains 8 columns namely – First Name, Gender,
Start Date, Last Login, Salary, Bonus%, Senior Management, and Team.
Dataset Used: Employees.csv
Let's read the dataset using the Pandas module and print the first five rows. To print the first five rows, we will use the head() function.

Example:

import pandas as pd
import numpy as np

df = pd.read_csv('employees.csv')
df.head()
Output: the first five rows of the employee DataFrame are displayed.

Handling Missing Values

You all must be wondering why a dataset would contain missing values. They can occur when no information is provided for one or more items or for a whole unit. For example, different users being surveyed may choose not to share their income, and some may choose not to share their address; in this way, many datasets end up with missing values. Missing data is a very big problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in pandas. There are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:
Example:

df["Gender"].fillna("No Gender", inplace = True)

df.isnull().sum()
Output: the number of null values remaining in each column; Gender now shows zero.

Data visualization
Data Visualization is the process of analyzing data in the form of graphs or
maps, making it a lot easier to understand the trends or patterns in the data.
There are various types of visualizations –
 Univariate analysis: This type of data consists of only one variable.
The analysis of univariate data is thus the simplest form of analysis
since the information deals with only one quantity that changes. It
does not deal with causes or relationships and the main purpose of
the analysis is to describe the data and find patterns that exist within
it.
 Bi-Variate analysis: This type of data involves two different
variables. The analysis of this type of data deals with causes and
relationships and the analysis is done to find out the relationship
among the two variables.
 Multi-Variate analysis: When the data involves three or more
variables, it is categorized under multivariate.
Let’s see some commonly used graphs –
We will use the Matplotlib and Seaborn libraries for data visualization. If you want to know more about these modules, refer to the articles –
 Matplotlib Tutorial
 Python Seaborn Tutorial

Histogram

It can be used for both univariate and bivariate analysis.


Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(x='Salary', data=df)
plt.show()

Output: a histogram of the Salary column.
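For a bivariate view of the same column, one option (a small sketch using the dataset's Gender column) is to split the histogram by a second variable:

sns.histplot(x='Salary', hue='Gender', data=df, multiple='stack')
plt.show()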

What Is Statistical Analysis:


Statistical analysis is the process of collecting and analyzing data in order to
discern patterns and trends. It is a method for removing bias from evaluating
data by employing numerical analysis. This technique is useful for collecting
the interpretations of research, developing statistical models, and planning
surveys and studies.

Statistical analysis is a scientific tool that helps collect and analyze large
amounts of data to identify common patterns and trends to convert them into
meaningful information. In simple words, statistical analysis is a data analysis
tool that helps draw meaningful conclusions from raw and unstructured data.

The conclusions drawn using statistical analysis facilitate decision-making and help businesses make future predictions on the basis of past trends. Statistical analysis can be defined as a science of collecting and analyzing data to identify trends and patterns and present them. It involves working with numbers and is used by businesses and other institutions to derive meaningful information from data.

Types of Statistical Analysis

Given below are the 6 types of statistical analysis:

 Descriptive Analysis

Descriptive statistical analysis involves collecting, interpreting, analyzing, and summarizing data to present them in the form of charts, graphs, and tables.
Rather than drawing conclusions, it simply makes the complex data easy to read
and understand.

 Inferential Analysis

The inferential statistical analysis focuses on drawing meaningful conclusions on the basis of the data analyzed. It studies the relationship between different
variables or makes predictions for the whole population.
 Predictive Analysis

Predictive statistical analysis is a type of statistical analysis that analyzes data to derive past trends and predict future events on the basis of them. It
uses machine learning algorithms, data mining, data modelling, and artificial
intelligence to conduct the statistical analysis of data.

 Prescriptive Analysis

The prescriptive analysis conducts the analysis of data and prescribes the best
course of action based on the results. It is a type of statistical analysis that helps
you make an informed decision.

 Exploratory Data Analysis

Exploratory analysis is similar to inferential analysis, but the difference is that it involves exploring the unknown data associations. It analyzes the potential
relationships within the data.

 Causal Analysis

The causal statistical analysis focuses on determining the cause and effect
relationship between different variables within the raw data. In simple words, it
determines why something happens and its effect on other variables. This
methodology can be used by businesses to determine the reason for failure.

Benefits of Statistical Analysis

Statistical analysis can be called a boon to mankind and has many benefits for
both individuals and organizations. Given below are some of the reasons why
you should consider investing in statistical analysis:

 It can help you determine the monthly, quarterly, and yearly figures of sales, profits, and costs, making it easier to make decisions.

 It can help you make informed and correct decisions.


 It can help you identify the problem or cause of the failure and make
corrections. For example, it can identify the reason for an increase in
total costs and help you cut the wasteful expenses.

 It can help you conduct market analysis and make an effective marketing and sales strategy.

 It helps improve the efficiency of different processes.


Correlation Analysis
Methods of correlation and regression can be used in order to analyze the extent
and the nature of relationships between different variables. Correlation analysis
is used to understand the nature of relationships between two individual
variables. For example, if we aim to study the impact of foreign direct
investment (FDI) on the level of economic growth in Vietnam, then two
variables can be specified as the amounts of FDI and GDP for the same period.

The correlation coefficient ‘r’ is calculated through the following formula:

r = [ nΣxy − (Σx)(Σy) ] / √( [ nΣx² − (Σx)² ] [ nΣy² − (Σy)² ] )

where x and y are the values of the variables, and n is the size of the sample.

The value of correlation coefficient can be interpreted in the following manner:

If ‘r’ is equal to 1, then there is perfect positive correlation between two values;

If ‘r’ is equal to -1, then there is perfect negative correlation between two
values;

If ‘r’ is equal to zero, then there is no correlation between the two values.

In practical terms, the closer the value of ‘r’ is to 1, the higher the positive impact of FDI on GDP growth in Vietnam. Similarly, if the value of ‘r’ is less than 0, the closer it is to -1, the greater the negative impact of FDI on GDP growth in Vietnam. If ‘r’ is equal to zero, then FDI is perceived to have no impact on GDP change in Vietnam within the given sample.

The most popular forms of correlation analysis used in business studies include Pearson product-moment correlation, Spearman Rank correlation and Autocorrelation.
The Pearson product-moment correlation is calculated by taking the ratio of the sample covariance of the two variables to the product of their standard deviations, and it illustrates the strength of linear relationships. The Pearson correlation coefficient is not robust, because strong non-linear relationships between the variables are not recognized. The correlation coefficient is also sensitive to outlying points, and therefore it is not resistant.

Spearman Rank correlation requires the data to be sorted and each value to be assigned a specific rank, with 1 assigned to the lowest value. Moreover, if a data value appears more than once, the equal values are assigned their average rank.

Autocorrelation (serial correlation) implies the correlation among the values of the same variable but at different times. The autocorrelation coefficient is calculated by applying the Pearson product-moment formula to the series and a lagged copy of itself. Also, because an unshifted series is perfectly correlated with itself, the autocorrelation function begins with a coefficient of 1 at lag zero.
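A minimal sketch in Python, using two small made-up series, of how these coefficients can be computed:

import pandas as pd
from scipy.stats import pearsonr, spearmanr

fdi = pd.Series([1.2, 1.5, 1.9, 2.4, 2.8, 3.1])   # made-up FDI figures
gdp = pd.Series([4.0, 4.3, 4.9, 5.6, 6.1, 6.5])   # made-up GDP figures

r, p = pearsonr(fdi, gdp)          # Pearson product-moment correlation
rho, p_s = spearmanr(fdi, gdp)     # Spearman rank correlation
lag1 = gdp.autocorr(lag=1)         # autocorrelation of GDP at lag 1

print(r, rho, lag1)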

The correlation coefficient ‘r’ illustrated above is just a mathematical formula, and you don't have to calculate the correlation coefficient manually. For a bachelor's degree dissertation most supervisors accept correlation tests that have been run on a simple Excel spreadsheet. For master's or PhD level studies, on the other hand, you will have to use more advanced statistical software such as SPSS or NCSS for your correlation analysis.

Correlation analysis as a research method offers a range of advantages. This method allows data analysis from many subjects simultaneously. Moreover, correlation analysis can study a wide range of variables and their interrelations. On the negative side, findings of correlation do not indicate causation, i.e. cause and effect relationships.
MODULE-5: MODELING (MACHINE LEARNING MODELS):

Logistic Regression:

Definition: Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.

Advantages: Logistic regression is designed for this purpose (classification), and is most useful for understanding the influence of several independent variables on a single outcome variable.

Disadvantages: Works only when the predicted variable is binary, assumes all
predictors are independent of each other and assumes data is free of missing
values.
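A minimal scikit-learn sketch on synthetic binary data (the data here is generated only for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))        # mean accuracy on the test set
print(clf.predict_proba(X_test[:1]))    # modelled probabilities for one sample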

Naïve Bayes:

Definition: The Naive Bayes algorithm is based on Bayes' theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.

Advantages: This algorithm requires a small amount of training data to estimate the necessary parameters. Naive Bayes classifiers are extremely fast compared to more sophisticated methods.

Disadvantages: Naive Bayes is known to be a bad estimator.
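A minimal sketch, reusing the synthetic train/test split from the logistic regression example above:

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()                # Gaussian variant, suited to continuous features
gnb.fit(X_train, y_train)
print(gnb.score(X_test, y_test))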

K-Nearest Neighbours:
Definition: Neighbours based classification is a type of lazy learning as it does
not attempt to construct a general internal model, but simply stores instances of
the training data. Classification is computed from a simple majority vote of the
k nearest neighbours of each point.

Advantages: This algorithm is simple to implement, robust to noisy training data, and effective if training data is large.

Disadvantages: Need to determine the value of K, and the computation cost is high as it needs to compute the distance of each instance to all the training samples.
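A minimal sketch on the same synthetic split (K = 5 is an assumed choice):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)   # the value of K has to be chosen by the user
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))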

Decision Tree:

Definition: Given data of attributes together with their classes, a decision tree produces a sequence of rules that can be used to classify the data.

Advantages: Decision Tree is simple to understand and visualise, requires little data preparation, and can handle both numerical and categorical data.

Disadvantages: Decision tree can create complex trees that do not generalise
well, and decision trees can be unstable because small variations in the data
might result in a completely different tree being generated.
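A minimal sketch on the same synthetic split; export_text prints the learned rule sequence:

from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # limiting depth keeps the tree from over-growing
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
print(export_text(tree))   # the sequence of if/else rules the tree has learned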

Random Forest:

Definition: A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy of the model and to control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement.

Advantages: Reduction in over-fitting, and the random forest classifier is more accurate than decision trees in most cases.

Disadvantages: Slow real-time prediction, difficult to implement, and a complex algorithm.
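A minimal sketch on the same synthetic split (100 trees is an assumed setting):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0)   # an ensemble of 100 bootstrapped trees
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))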

Support Vector Machine:

Definition: Support vector machine is a representation of the training data as points in space separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

Advantages: Effective in high dimensional spaces and uses a subset of training points in the decision function, so it is also memory efficient.

Disadvantages: The algorithm does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
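A minimal sketch on the same synthetic split; probability=True triggers the expensive cross-validated probability estimates mentioned above:

from sklearn.svm import SVC

svm = SVC(kernel='rbf', probability=True)   # enable (internally cross-validated) probability estimates
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
print(svm.predict_proba(X_test[:1]))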
MODULE-6: EVALUATION:

Model evaluation

You may wonder why we have to spend effort on model evaluation 🤔? The problem is that if we re-run the ANN, the model produces a different accuracy every time, on both the training set and the test set. So evaluating the model's performance on a single test set is not the most reliable approach.

A typical method is to use K-fold cross-validation. Figure 1 shows how it works.

Fig.1 K-fold cross-validation diagram (Img created by Author)

We split the training set into K folds (e.g. K=10). Then we train the model on 9 folds and test it on the remaining fold. With 10 folds, we make 10 different combinations of 9 training folds and 1 test fold, and train/test the model 10 times. After that, we take the average of the 10 evaluations and compute the standard deviation. With that, we can determine which category the model belongs to, as in Figure 2.
Fig.2 Bias-variance trade-off diagram (Img created by Author)

To implement K-fold cross-validation, we use a scikit_learn wrapper in Keras: KerasClassifier. Specifically, we use Keras to build the model and scikit_learn for cross-validation. The first thing is to build a function for the model architecture, as the function is a required argument for the Keras wrapper. As you can see below, this is the same ANN architecture as we built before.
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense

def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier

With the classifier built using the above function, we create a KerasClassifier object. Below we specify a batch size of 10 and 100 epochs for the model to be trained on.
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs =
100)

Now, let's apply K-fold cross-validation on the classifier using the cross_val_score() method. This function returns a list of training accuracies. The parameter cv is the number of folds we use for cross-validation. Here the classifier will be trained on 10 different training sets, split from the initial training set.
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv =
10)
mean = accuracies.mean()
std = accuracies.std()

Here, cross-validation may take a while. In the end, we get an accuracy for each of the 10 evaluations, as shown in Figure 3. The average accuracy is 0.843 and the standard deviation is 1.60% 🤪.

Fig.3 Accuracy from cross-validation.

Parameter tuning

With reliable model accuracy, let’s try to improve it using two techniques.

Dropout regularization

One technique we have not mentioned is dropout regularization. This is a solution for over-fitting, which is related to high variance. How does dropout work? At each training iteration, some neurons are randomly disabled to prevent them from depending too much on each other. Because a different set of neurons is dropped each time, the neural network learns a configuration that does not rely on any particular neuron, which makes it less likely to over-fit.
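A minimal sketch of how dropout layers could be added to the classifier built above (the 0.1 dropout rate is an assumed value, not taken from the original model):

from keras.layers import Dropout

def build_classifier_with_dropout():
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dropout(rate = 0.1))    # randomly disable 10% of this layer's neurons at each iteration (assumed rate)
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dropout(rate = 0.1))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier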
MODULE-7: DASHBOARD STORY TELLING, MODEL DEPLOYMENT

What is data storytelling ?

Data storytelling is an act of humanizing data by turning it into a narrative or a story that creates actionable insights, frequently supported by visualizations.

It isn’t just about making visuals and presentations. It’s about communicating
insights that deliver real value.

Why stories are needed ?

It is important to understand and translate the data that matters into meaningful insights and connect that data / information / insight to your audience, which will motivate them to act on it and make data-driven business decisions that drive change. This is where data storytelling comes in.

Stories assist in boosting recall, increasing user engagement and empathy, and improving comprehension. They help the user make sense of the numbers showcased on the dashboard.

Before starting the design of any visualizations on the dashboard using various
BI tools such as Tableau, PowerBI, etc., it is important to understand the data
and what are the data points that are important to the audience who will be using
the dashboard.
Basic data story design checklist

1. Who: Identify the intended audience that will be using the dashboard, and understand their goals, requirements and understanding of data.
2. What: Identify the source, quality and timeliness of the data used.
3. Why: Understand the purpose of creating the dashboard – is it for a management purpose or for an operational purpose, i.e. understand the business use case. Identify the intended outcome and the actions that might be taken based on the viz. shown.
4. How: Decide how the data story will be presented, i.e. what the structure, format and tone of the story will be.

Consider your story points

The story points in your dashboards will be your different visualizations that you
will create using the data that will convey the objective of your dashboard or
storyboard.

Data visualizations support and enhance data stories, helping you communicate
your findings elegantly and effectively.
Different visual analysis — story points

Create graphs that make sense and weave them into compelling, action-inspiring
stories. Some of the visual analysis that can be your story points are :

1. Trends
2. Rank Ordering
3. Comparisons
4. Relationships

Framing your story

This will dictate what the users will be viewing most of the time and which elements they will be interacting with. Frame or structure your story within the design considerations. Below are some of the points that can be considered while framing one's story:

1. Domain Context
2. Appropriate Tone
3. What to include / exclude
4. Granularity Level

Another important point to consider before diving into creating visualizations is


to design a flow in mind of how you are going to showcase your data to your
audience i.e. we need to design an intended trajectory for the audience that
will be looking at the dashboard, that they will follow which will guide them to
get better insights from the data.

Flow and Layout

A good data storyboard contains both text and charts, and we can use pre-attentive attributes to highlight specific sections, make larger chunks of text more digestible, and direct the user's attention.
