
Name: Vijay Patel

Class: SYBSC-IT
Div: B Roll no.: 4163
Assignment Questions
Data Analysis and Visualization using Power BI and Tableau

1. Data Cleaning and Preprocessing:


Question 1:
Discuss the importance of handling missing data in a dataset.
Provide three methods for handling missing data and explain
each with an example.
Answer:
Handling missing data in a dataset is crucial for ensuring the accuracy and
reliability of data analysis and machine learning models. Missing data can lead to
biased results, reduced statistical power, and inaccurate predictions. Therefore, it's
essential to address missing values appropriately. Here are three common methods
for handling missing data, along with examples:
1. Deletion:
Description: In this method, rows or columns containing missing values are
entirely removed from the dataset.
Example: Consider a dataset of student exam scores where some students have
missing scores for a particular subject. If we choose deletion, we would remove the
rows corresponding to those students with missing scores.
2. Imputation:
Description: Imputation involves replacing missing values with estimated or
calculated values based on the available data.
Example: For the same student exam scores dataset, instead of removing the rows
with missing values, we can impute the missing scores by calculating the mean or
median of the scores in that subject and replacing the missing values with the
calculated mean or median.
3. Prediction Models:
Description: More advanced techniques involve using prediction models to
estimate missing values based on other features in the dataset.
Example: In a dataset of customer information, if there are missing values for
income, we can use a regression model with features such as age, education level,
and occupation to predict the missing income values.
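As a hedged sketch of this approach (the column names and model choice here are illustrative assumptions, not a prescribed method), a simple scikit-learn regression could fill missing income values from the other, already-numeric features:
Python:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical customer data in which 'income' has gaps;
# 'age', 'education_years' and 'occupation_code' are assumed to be numeric
data = pd.read_csv("customer_info.csv")
features = ["age", "education_years", "occupation_code"]

known = data[data["income"].notna()]    # rows where income is present
unknown = data[data["income"].isna()]   # rows to be imputed

model = LinearRegression()
model.fit(known[features], known["income"])

# Replace the missing incomes with the model's predictions
data.loc[data["income"].isna(), "income"] = model.predict(unknown[features])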

Question 2:
Explain the concept of outliers in data. How can outliers
impact statistical analysis? Discuss two techniques for
detecting outliers.
Answer:
Outliers in Data:
In statistics, an outlier is a data point that deviates significantly from the rest of the
data set. Imagine a class photo where one person stands out by wearing a bright
orange costume. That person, in data terms, is an outlier.
These outliers can be caused by:
● Measurement errors: Mistakes during data collection or recording.
● Natural variations: Rare but legitimate occurrences within the data.
● Data entry errors: Typos or accidental misplacements of values.

The Impact of Outliers:


Outliers can have a significant impact on statistical analysis:
● Skewing results: Outliers can pull averages and standard deviations towards
their extreme values, misrepresenting the "typical" data point.
● Distorting relationships: Outliers in a scatter plot can obscure underlying
trends or correlations between variables.
● Biasing models: Machine learning algorithms can prioritize outliers, leading
to models that perform poorly on the majority of the data.
However, outliers can also be valuable:
● Revealing anomalies: They can signal unusual events or errors that require
investigation.
● Uncovering new insights: Outliers might represent rare but important
phenomena that deserve further exploration.

Detecting Outliers: Finding the Odd Ones Out


There are two main techniques for detecting outliers:
1. Statistical methods:
● Z-score: This measures how many standard deviations a data point is away
from the mean. Points beyond a certain threshold (e.g., +/- 3 standard
deviations) are considered outliers.
● Interquartile Range (IQR): This uses the quartiles of the data to define the range of "typical" values. Points outside that range (below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR) are outliers; see the code sketch at the end of this answer.
2. Visualization:
● Box plots: These display the distribution of data with a box covering the middle half (Q1 to Q3) and whiskers extending to the most extreme points within 1.5 × IQR of the box. Points beyond the whiskers are potential outliers.
● Scatter plots: Visual inspection can reveal points that fall far away from the
main trend of the data.
Conclusion: Outliers are complex, but their impact and detection are crucial for accurate data analysis. Understanding when they are friends or foes is key to extracting valuable insights from your data.
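A minimal sketch of the two statistical methods above, assuming a small hypothetical pandas Series; the threshold of 2 is only an illustration (3 is the more common cutoff, but small samples rarely reach it):
Python:
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95, 11, 14])  # hypothetical data with one extreme point

# Z-score rule: flag points far from the mean (measured in standard deviations)
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 2]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)
print(iqr_outliers)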
2. Data Normalization and Standardization:
Question 1:
Define data normalization and standardization. Provide a step-by-step explanation of Min-Max normalization and Z-score normalization with numerical examples.
Answer:
Data Normalization and Standardization:
Data preprocessing is crucial for accurate analysis, and two key techniques are
normalization and standardization. Both aim to scale different features in your
data to a common range, but they achieve this differently.
Normalization:
● Maps values to a specific range, typically between 0 and 1 (Min-Max) or -1
and 1.
● Useful for algorithms sensitive to absolute values or ranges.

● Not as sensitive to outliers as standardization.


Standardization:
● Transforms data to have a mean of 0 and a standard deviation of 1.

● Assumes the data follows a normal distribution (bell-shaped curve).

● Useful for algorithms sensitive to distance and relative values.

● More sensitive to outliers than normalization.

Min-Max Normalization Step-by-Step:


1. Define the range: Choose your target range, usually 0 to 1 or -1 to 1.
2. Find minimum and maximum values: Identify the minimum and
maximum values for each feature in your dataset.
3. Apply the formula: For each data point x, calculate (x - min) / (max - min). For a target range other than 0 to 1, multiply the result by (new_max - new_min) and add new_min.
Example:
Consider a dataset with ages ranging from 20 to 60. Normalize to 0-1:
● Age = 35

● (35 - 20) / (60 - 20) = 15 / 40 = 0.375


Z-Score Standardization Step-by-Step:
1. Calculate the mean: Find the average value of each feature in your dataset.
2. Calculate the standard deviation: Determine the standard deviation for
each feature.
3. Apply the formula: For each data point x, calculate (x - mean) / standard
deviation.
Example:
Consider the same dataset with ages:
● Age = 35, Mean = 40, Standard Deviation = 10

● (35 - 40) / 10 = -0.5


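A small pandas sketch of both calculations, using a hypothetical 'age' series (the values are illustrative only):
Python:
import pandas as pd

ages = pd.Series([20, 25, 35, 40, 50, 60])  # hypothetical ages

# Min-Max normalization to the 0-1 range
ages_minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Z-score standardization (mean 0, standard deviation 1)
ages_zscore = (ages - ages.mean()) / ages.std()

print(ages_minmax)
print(ages_zscore)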
Key Differences:
● Normalization uses the min/max values, while standardization uses
mean/standard deviation.
● Normalization has a fixed range, while standardization's range depends on
the data.
● Normalization is less sensitive to outliers, while standardization is more
affected.

Question 2:
In what scenarios would you prefer normalization over
standardization or vice versa? Justify your answer.
Answer:
The choice between normalization and standardization depends on several factors,
and there's no single "one size fits all" answer. Here are some scenarios where you
might prefer one over the other:
Prefer Normalization:
● Non-Gaussian data: If your data doesn't follow a normal distribution (bell-
shaped curve), standardization can distort important information by
assuming normality. Normalization, being independent of distribution,
avoids this issue.
● Bounded algorithms: Some algorithms are sensitive to the absolute values
of features and perform best when data falls within a specific range.
Normalization ensures this by mapping all features to a pre-defined range such as 0 to 1 or -1 to 1.
● Data with outliers: Normalization is less sensitive to outliers compared to
standardization, as it only considers the overall minimum and maximum
values. This can be beneficial if your data has a few extreme points that
might unduly influence the standardization process.
● Interpretability: Normalization results in values within a known range,
which can sometimes be easier to interpret in terms of their relative
magnitude within the data.
Prefer Standardization:
● Euclidean distance-based algorithms: Algorithms like k-Nearest
Neighbors or Support Vector Machines (SVMs) rely on Euclidean distances
between data points for decision making. Standardization ensures features
contribute equally to these distances, as they all have the same standard
deviation.
● Algorithms sensitive to relative differences: Some algorithms, like linear
regression, focus on the relative differences between features rather than
absolute values. Standardization ensures all features have equal influence on
the model by having the same unit variance.
● Data with Gaussian distribution: If you're confident your data follows a
normal distribution, standardization takes advantage of this information by
transforming data into a standard normal distribution, leveraging its
statistical properties for analysis.

Additional Considerations:
● Algorithm requirements: Check the specific documentation of your chosen
algorithm to see if it has any specific requirements regarding feature scaling.
● Experimentation: It's often recommended to try both normalization and
standardization and compare their impact on your model's performance to
make an informed decision.
The best approach depends on your specific data and modeling goals. By
understanding the strengths and weaknesses of both techniques, you can choose
the one that leads to the most accurate and meaningful results.
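In practice both transformations are available as ready-made scalers; the following is a sketch using scikit-learn, where the feature matrix X is a made-up example:
Python:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[20.0, 1500.0],
              [35.0, 3200.0],
              [60.0, 2800.0]])  # hypothetical features on very different scales

# Normalization: each column rescaled to the 0-1 range
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: each column rescaled to mean 0, standard deviation 1
X_std = StandardScaler().fit_transform(X)

print(X_norm)
print(X_std)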
3. Descriptive Statistics:
Question 1:
Calculate the mean, median, and mode for the following
dataset: [10, 15, 20, 25, 30, 35, 40].
Answer:
Calculation for the dataset [10, 15, 20, 25, 30, 35, 40]:
1. Mean:
Mean = (10 + 15 + 20 + 25 + 30 + 35 + 40) / 7 = 175 / 7 = 25
2. Median:
Since there are 7 values, the median is the 4th value (when arranged
in ascending order).
Median=25
3. Mode:
There is no value that appears more than once, so there's no mode.
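The same figures can be reproduced with Python's built-in statistics module (a small sketch; it assumes Python 3.8+ for multimode, which returns every value when, as here, nothing repeats):
Python:
import statistics

data = [10, 15, 20, 25, 30, 35, 40]

print(statistics.mean(data))       # 25
print(statistics.median(data))     # 25
print(statistics.multimode(data))  # all values tie, so there is no meaningful mode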

Question 2:
Discuss the use of the interquartile range (IQR) as a measure
of variability. Calculate the IQR for the dataset: [22, 25, 28, 30,
35, 40, 45].
Answer:
Use of interquartile range (IQR) as a measure of variability:
The Interquartile Range (IQR) is a measure of statistical dispersion, or
variability, based on dividing the dataset into quartiles.
It's calculated as the difference between the third quartile (Q3) and the first
quartile (Q1), thus capturing the spread of the middle 50% of the data.

Calculation for the dataset [22, 25, 28, 30, 35, 40, 45]:


1. Arrange the data in ascending order:
22,25,28,30,35,40,45
2. Calculate Quartiles:
● Q1: 25 (the median of the lower half: 22, 25, 28)

● Q3: 40 (the median of the upper half: 35, 40, 45)


3. Calculate IQR:
IQR = Q3 - Q1 = 40 - 25 = 15
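As a quick cross-check with NumPy (a sketch; note that NumPy's default linear interpolation uses a different quartile convention than the split-halves method above, so its quartiles differ slightly):
Python:
import numpy as np

data = [22, 25, 28, 30, 35, 40, 45]

q1, q3 = np.percentile(data, [25, 75])
print(q1, q3, q3 - q1)
# Linear interpolation gives Q1 = 26.5, Q3 = 37.5, IQR = 11.0;
# the split-halves convention used above gives Q1 = 25, Q3 = 40, IQR = 15.
# Both are legitimate quartile definitions.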

4. Data Visualization Basics:


Question 1:
Create a bar chart to represent the sales of three products (A,
B, C) in a given month. Provide appropriate labels and titles.
Answer:
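A minimal matplotlib sketch (the sales figures below are hypothetical placeholders):
Python:
import matplotlib.pyplot as plt

products = ['A', 'B', 'C']
sales = [150, 230, 180]  # hypothetical units sold in the month

plt.bar(products, sales, color=['steelblue', 'seagreen', 'indianred'])
plt.xlabel('Product')
plt.ylabel('Units Sold')
plt.title('Product Sales for the Month')
plt.show()
The resulting chart shows one labelled bar per product, making it easy to compare the three sales totals at a glance.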
Question 2:
Explain the significance of color theory in data visualization.
How can color be effectively used to convey information in a
chart?
Answer:
Color Theory:
Color is much more than just aesthetics in data visualization. It's a powerful tool
that can:
1. Enhance data understanding:
● Highlight key findings: Use contrasting colors to draw attention to
important data points or trends.
● Differentiate categories: Employ distinct colors to represent different
categories or groups within your data.
● Show relationships: Apply color gradients to visualize changes in values or
correlations between variables.
2. Enhance visual presentation:
● Create visual hierarchy: Use color saturation or brightness to emphasize
important elements and guide the viewer's eye.
● Improve readability: Choose contrasting colors for text and background to
ensure clear legibility.
● Appeal to the audience: Utilize color palettes that are culturally relevant or
evoke specific emotions.
3. Ensure accessibility:
● Consider color blindness: Always check your color choices for
accessibility, using tools like ColorBrewer or APCA checker.
● Provide sufficient contrast: Ensure enough contrast between colors and
background to aid perception for everyone.

Putting Color into Practice:


Here are some tips for using color effectively in your charts:
● Limit the number of colors: Stick to 6-10 distinct colors to avoid
overwhelming the viewer.
● Choose meaningful colors: Use colors that naturally represent the data
(e.g., green for growth, red for decline).
● Follow consistent color palettes: Maintain the same color scheme across
related charts for clarity.
● Use color intentionally: Don't color-code arbitrarily; have a clear purpose
for each color choice.
● Test your color choices: Get feedback from diverse audiences to ensure
accessibility and understanding.
By understanding the power of color theory and applying these techniques, you can create data visualizations that are informative, engaging, and accessible to all. Color is just one element of effective data visualization: combine it with clear labeling, informative titles, and well-chosen chart types to tell the story your data wants to share.
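As a small illustration of these ideas (a sketch with made-up numbers, not a prescription), a sequential colormap can encode a third variable on an ordinary scatter plot:
Python:
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: marketing spend vs. sales, coloured by profit margin
spend = np.array([10, 15, 20, 25, 30, 35])
sales = np.array([40, 55, 62, 70, 85, 90])
margin = np.array([5, 8, 12, 15, 18, 22])

# 'viridis' is a perceptually uniform, colour-blind-friendly sequential colormap
sc = plt.scatter(spend, sales, c=margin, cmap='viridis', s=80)
plt.colorbar(sc, label='Profit margin (%)')
plt.xlabel('Marketing spend')
plt.ylabel('Sales')
plt.title('Colour encoding a third variable')
plt.show()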
5. Exploratory Data Analysis:
Question 1:
Given a dataset of customer purchases, filter the data to show
purchases made in the last quarter of the year. Provide a brief
interpretation of the results.
Answer:
Filtering purchases for the last quarter of the year:
import pandas as pd

# Load the dataset; 'purchase_date' is the column containing purchase dates
data = pd.read_csv('customer_purchases.csv')

# Convert 'purchase_date' to datetime format
data['purchase_date'] = pd.to_datetime(data['purchase_date'])

# Filter purchases made in the last quarter of the year (October to December)
last_quarter_purchases = data[(data['purchase_date'].dt.month >= 10) &
                              (data['purchase_date'].dt.month <= 12)]

# Display the filtered data
print(last_quarter_purchases)
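Equivalently, pandas exposes a quarter accessor, so the same filter can be written more compactly as:
Python:
last_quarter_purchases = data[data['purchase_date'].dt.quarter == 4]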

Interpretation:
The filtered dataset last_quarter_purchases contains only the purchases made in
the last quarter of the year (October to December). Analyzing this subset of data
can provide insights into seasonal trends, end-of-year spending patterns, and the
popularity of certain products or promotions during the holiday season.

Question 2:
Sort a dataset of employee salaries in ascending order.
Discuss the potential insights gained from examining the
sorted data.
Answer:
Sorting employee salaries in ascending order:
import pandas as pd
# Assuming 'salary' is the column containing employee salaries
# Load the dataset
data = pd.read_csv('employee_salaries.csv')
# Sort the salaries in ascending order
sorted_salaries = data['salary'].sort_values()
# Display the sorted data
print(sorted_salaries)
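If each salary should stay attached to the rest of the employee's record, the whole DataFrame can be sorted instead, e.g. data.sort_values(by='salary'), which orders every row by its salary value.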

Potential insights:

● Identifying Salary Distribution: Sorting employee salaries allows us to see the range and distribution of salaries within the organization.
● Detecting Outliers: Extreme values at either end of the sorted list may indicate outliers or anomalies in salary data that warrant further investigation.
● Comparing Salaries: By examining the sorted data, we can compare individual salaries and identify discrepancies or disparities that may require attention, such as wage gaps or inconsistencies across departments or job roles.
● Understanding Pay Structure: Observing the sorted salaries can provide insights into the organization's pay structure, including salary tiers, increments, and potential patterns in compensation based on experience or seniority.
● Benchmarking: Comparing sorted salaries with industry standards or competitor data can help assess the organization's competitiveness in terms of employee compensation and benefits.

NUMERICAL AND PRACTICAL BASED QUESTIONS:


1. HANDLING MISSING DATA:

Question 1:
Consider a dataset with missing values. Use pandas in Python
to handle missing data through imputation. Provide code
snippets and explain the rationale behind your choices.
Answer:
Here's how you can handle missing data through imputation in pandas:
1. Import libraries and load data:
Python:
import pandas as pd

# Replace "your_data.csv" with your actual file path


data = pd.read_csv("your_data.csv")
2. Identify missing values:
Python:
# Check for missing values using isnull()
missing_values = data.isnull().sum()
print(missing_values)

3. Choose imputation method:


The best imputation method depends on the data type, distribution, and the reason
for missing values. Here are some common options:
● Mean/Median: Suitable for numerical data with a normal distribution.

● Mode: Good for categorical data.

● Forward Fill/Backward Fill: Fills missing values with the previous/next non-missing value. Useful for sequential data.
● Interpolation: Uses surrounding data points to estimate missing values.
Applicable to numerical data with a clear trend.
4. Impute missing values:
Python:
# Example: Impute missing numerical values with the mean
data["numerical_column"] = data["numerical_column"].fillna(data["numerical_column"].mean())

# Example: Impute missing categorical values with the mode
data["categorical_column"] = data["categorical_column"].fillna(data["categorical_column"].mode()[0])

5. Evaluate the results:


Compare the imputed data with the original data to ensure the imputation
doesn't introduce biases or artifacts. You can also use statistical tests to
evaluate the effectiveness of imputation.
Rationale:
● Choosing the most appropriate method depends on the specific data
and context.
● Mean/Median imputation assumes a normal distribution, which
might not be ideal for skewed data.
● Mode imputation might not be suitable if there are many unique
categories.
● Forward/Backward Fill can propagate errors if missing values are
clustered.
● Interpolation can create unrealistic values if the trend is not linear.
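For completeness, a small sketch of the fill- and interpolation-based options mentioned above, using made-up column names and values:
Python:
import pandas as pd
import numpy as np

# Hypothetical sequential data with gaps
df = pd.DataFrame({"sensor_reading": [1.0, np.nan, 1.4, np.nan, 2.0],
                   "daily_price": [10.0, np.nan, np.nan, 13.0, 14.0]})

# Forward fill: carry the previous observation forward (suited to sequential data)
df["sensor_reading"] = df["sensor_reading"].ffill()

# Linear interpolation: estimate each gap from the surrounding points
df["daily_price"] = df["daily_price"].interpolate(method="linear")

print(df)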
2. Data Normalization:

Question 1:

Given a dataset of exam scores ranging from 60 to 90, perform Min-Max normalization to scale the values between 0 and 1. Provide the normalized dataset.
Answer:
The formula for Min-Max normalization is:
x_norm = (x - x_min) / (x_max - x_min)
x_min = 60
x_max = 90

For example, for x = 75:
x_norm = (75 - 60) / (90 - 60)
x_norm = 15 / 30
x_norm = 0.5

Original Score | Normalized Score
60 | 0
65 | 0.1667
70 | 0.3333
75 | 0.5
80 | 0.6667
85 | 0.8333
90 | 1
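The table above can be reproduced with a few lines of plain Python (a sketch using the same scores):
Python:
scores = [60, 65, 70, 75, 80, 85, 90]
x_min, x_max = min(scores), max(scores)

for x in scores:
    x_norm = (x - x_min) / (x_max - x_min)
    print(x, round(x_norm, 4))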

3. Descriptive Statistics:

Question 1:

Using Python's statistical libraries, calculate the standard deviation for a dataset representing daily temperatures over a month.

Answer:
Here's how you can calculate the standard deviation for a dataset representing daily
temperatures over a month using Python's statistical libraries:
Method 1: Using pandas library
Python
import pandas as pd

# Sample daily temperatures


temperatures = [15, 18, 20, 22, 25, 23, 21, 19, 17, 16, 14, 12, 10, 11, 13, 15, 17, 19,
21, 22, 20, 18, 16, 14, 12, 10, 11, 13, 15]

# Create a pandas Series


temps_series = pd.Series(temperatures)

# Calculate the standard deviation


standard_deviation = temps_series.std()
# Print the standard deviation
print("Standard deviation of daily temperatures:", standard_deviation)
Method 2: Using statistics library
Python
import statistics

# Sample daily temperatures


temperatures = [15, 18, 20, 22, 25, 23, 21, 19, 17, 16, 14, 12, 10, 11, 13, 15, 17, 19,
21, 22, 20, 18, 16, 14, 12, 10, 11, 13, 15]

# Calculate the standard deviation


standard_deviation = statistics.stdev(temperatures)

# Print the standard deviation


print("Standard deviation of daily temperatures:", standard_deviation)
Both methods achieve the same result: calculating the standard deviation of the
given temperature data.
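One small caveat: pandas' .std() and statistics.stdev() both use the sample formula (dividing by n - 1), which is why they agree; NumPy's np.std() defaults to the population formula (dividing by n) and would return a slightly smaller value unless ddof=1 is passed.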

Explanation:
● We first import the necessary libraries: pandas for data manipulation and
statistics for statistical calculations.
● We define a list of sample daily temperatures.

● In the pandas method, we create a pandas Series from the temperature list.
The .std() method calculates the standard deviation.
● In the statistics library method, we directly use the statistics.stdev() function
on the temperature list.
● Finally, we print the calculated standard deviation.
Remember to replace the sample data with your actual temperature data to get the
standard deviation for your specific case.
4. Data Visualization:
Question 1:
Using Matplotlib or any other suitable library, create a line chart to
visualize the trend in stock prices over the last six months. Include
appropriate labels and a legend.

Answer:

First, let's assume that the stock prices for the last six months are stored in two lists, stock_price_A and stock_price_B. We can use the matplotlib library to create a line chart as follows:

Code:
import matplotlib.pyplot as plt

# Sample data representing stock prices over the last six months
months = ['Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan']
stock_price_A = [100, 110, 105, 120, 125, 130]
stock_price_B = [90, 95, 100, 110, 115, 120]

# Create line chart
plt.plot(months, stock_price_A, label='Stock A', marker='o')
plt.plot(months, stock_price_B, label='Stock B', marker='o')

# Add labels and title
plt.xlabel('Months')
plt.ylabel('Stock Price')
plt.title('Stock Prices Over the Last Six Months')

# Add legend
plt.legend()

# Display the chart
plt.grid(True)
plt.show()

Output: a line chart with one marked line per stock, showing Stock A rising from 100 to 130 and Stock B rising from 90 to 120 between August and January, with axis labels, a title, and a legend identifying the two series.
