Class: SYBSC-IT
Div: B Roll no.: 4163
Assignment Questions
Data Analysis and Visualization using Power BI and Tableau
Question 2:
Explain the concept of outliers in data. How can outliers
impact statistical analysis? Discuss two techniques for
detecting outliers.
Answer:
Outliers in Data:
In statistics, an outlier is a data point that deviates significantly from the rest of the
data set. Imagine a class photo where one person stands out by wearing a bright
orange costume. That person, in data terms, is an outlier.
These outliers can be caused by:
● Measurement errors: Mistakes during data collection or recording.
● Natural variations: Rare but legitimate occurrences within the data.
● Data entry errors: Typos or accidental misplacements of values.
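The two detection techniques the question asks about can be illustrated concretely. A minimal sketch of the IQR rule and the z-score rule, using a small hypothetical sample (the thresholds 1.5 × IQR and |z| > 2 are conventional choices, not fixed rules):

```python
import statistics

data = [10, 12, 12, 13, 12, 11, 95]  # 95 is the suspicious point

# Technique 1: IQR rule -- flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Technique 2: z-score rule -- flag points far from the mean,
# measured in standard-deviation units
mu, sigma = statistics.mean(data), statistics.stdev(data)
z_outliers = [x for x in data if abs((x - mu) / sigma) > 2]

print(iqr_outliers, z_outliers)  # both flag 95
```

Both rules flag 95 here; on real data the two techniques can disagree, since the IQR rule is itself robust to outliers while the z-score rule uses a mean and standard deviation that the outlier has already distorted.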
Question 2:
In what scenarios would you prefer normalization over
standardization or vice versa? Justify your answer.
Answer:
The choice between normalization and standardization depends on several factors,
and there's no single "one size fits all" answer. Here are some scenarios where you
might prefer one over the other:
Prefer Normalization:
● Non-Gaussian data: If your data doesn't follow a normal distribution (bell-
shaped curve), standardization can distort important information by
assuming normality. Normalization, being independent of distribution,
avoids this issue.
● Bounded algorithms: Some algorithms are sensitive to the absolute values
of features and perform best when data falls within a specific range.
Normalization ensures this by mapping all features to a pre-defined range
such as [0, 1] or [−1, 1].
● Data with outliers (a caution): min-max normalization uses the minimum
and maximum values directly, so a single extreme point can compress the
rest of the data into a narrow band. When outliers are present,
standardization, or a robust scaler based on the median and IQR, is
usually the safer choice.
● Interpretability: Normalization results in values within a known range,
which can sometimes be easier to interpret in terms of their relative
magnitude within the data.
Prefer Standardization:
● Euclidean distance-based algorithms: Algorithms like k-Nearest
Neighbors or Support Vector Machines (SVMs) rely on Euclidean distances
between data points for decision making. Standardization ensures features
contribute equally to these distances, as they all have the same standard
deviation.
● Algorithms sensitive to relative differences: For linear models fitted
with gradient descent, optimization converges faster when features share
a common scale, and regularized regression (ridge, lasso) penalizes
coefficients fairly only when features have comparable variance.
Standardization gives every feature unit variance, so no single feature
dominates the model.
● Data with Gaussian distribution: If you're confident your data follows a
normal distribution, standardization takes advantage of this information by
transforming data into a standard normal distribution, leveraging its
statistical properties for analysis.
Additional Considerations:
● Algorithm requirements: Check the specific documentation of your chosen
algorithm to see if it has any specific requirements regarding feature scaling.
● Experimentation: It's often recommended to try both normalization and
standardization and compare their impact on your model's performance to
make an informed decision.
The best approach depends on your specific data and modeling goals. By
understanding the strengths and weaknesses of both techniques, you can choose
the one that leads to the most accurate and meaningful results.
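The two transformations discussed above can be made concrete with a small sketch in plain Python (the sample values are hypothetical):

```python
import statistics

data = [60, 65, 70, 75, 80, 85, 90]

# Normalization (min-max): rescales values to the range [0, 1]
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]

# Standardization (z-score): rescales to zero mean and unit variance
mu = statistics.mean(data)
sigma = statistics.pstdev(data)  # population standard deviation
standardized = [(x - mu) / sigma for x in data]

print(normalized)
print(standardized)
```

Note how the normalized values stay inside [0, 1] regardless of the original units, while the standardized values are centred on 0 with unit spread.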
3. Descriptive Statistics:
Question 1:
Calculate the mean, median, and mode for the following
dataset: [10, 15, 20, 25, 30, 35, 40].
Answer:
Calculation for the dataset [10, 15, 20, 25, 30, 35, 40]:
1. Mean:
Mean = (10+15+20+25+30+35+40) / 7
=175 / 7
=25
2. Median:
Since there are 7 values, the median is the 4th value (when arranged
in ascending order).
Median=25
3. Mode:
There is no value that appears more than once, so there's no mode.
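The same results can be verified with Python's built-in statistics module:

```python
import statistics

data = [10, 15, 20, 25, 30, 35, 40]

print(statistics.mean(data))    # 25
print(statistics.median(data))  # 25
# multimode() returns every value when all occur equally often,
# which indicates the dataset has no meaningful mode
print(statistics.multimode(data))
```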
Question 2:
Discuss the use of the interquartile range (IQR) as a measure
of variability. Calculate the IQR for the dataset: [22, 25, 28, 30,
35, 40, 45].
Answer:
Use of interquartile range (IQR) as a measure of variability:
The Interquartile Range (IQR) is a measure of statistical dispersion, or
variability, based on dividing the dataset into quartiles.
It's calculated as the difference between the third quartile (Q3) and the first
quartile (Q1), thus capturing the spread of the middle 50% of the data. Because
it ignores the lowest and highest 25% of values, the IQR is robust to outliers,
unlike the range or the standard deviation.
Calculation for the dataset [22, 25, 28, 30, 35, 40, 45]:
1. Median (Q2) = 30 (the 4th of the 7 sorted values).
2. Q1 = median of the lower half {22, 25, 28} = 25.
3. Q3 = median of the upper half {35, 40, 45} = 40.
4. IQR = Q3 − Q1 = 40 − 25 = 15.
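The quartiles for this dataset can be checked with statistics.quantiles. Note that different quartile conventions can give slightly different values; the default "exclusive" method matches the hand calculation of taking the medians of the lower and upper halves:

```python
import statistics

data = [22, 25, 28, 30, 35, 40, 45]

# n=4 splits the data into quartiles; default method is 'exclusive'
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
print(q1, q3, iqr)  # 25.0 40.0 15.0
```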
Interpretation:
The filtered dataset last_quarter_purchases contains only the purchases made in
the last quarter of the year (October to December). Analyzing this subset of data
can provide insights into seasonal trends, end-of-year spending patterns, and the
popularity of certain products or promotions during the holiday season.
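The filtering step this interpretation describes is not shown above. A minimal sketch, assuming a hypothetical purchases DataFrame with a datetime column named purchase_date (both names are assumptions):

```python
import pandas as pd

# Hypothetical purchases table; column names are assumptions
purchases = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'purchase_date': pd.to_datetime(
        ['2023-03-15', '2023-10-02', '2023-11-20', '2023-12-31']),
    'amount': [120.0, 80.0, 45.0, 200.0],
})

# Keep only purchases made in October through December
last_quarter_purchases = purchases[purchases['purchase_date'].dt.month >= 10]
print(last_quarter_purchases)
```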
Question 2:
Sort a dataset of employee salaries in ascending order.
Discuss the potential insights gained from examining the
sorted data.
Answer:
Sorting employee salaries in ascending order:
import pandas as pd
# Assuming 'salary' is the column containing employee salaries
# Load the dataset
data = pd.read_csv('employee_salaries.csv')
# Sort the salaries in ascending order
sorted_salaries = data['salary'].sort_values()
# Display the sorted data
print(sorted_salaries)
Potential insights:
● Pay range: the minimum, maximum, and overall spread of salaries become
immediately visible at the two ends of the sorted data.
● Distribution shape: if the mean sits well above the median, the
distribution is right-skewed, typically by a few very high earners.
● Outliers: unusually low or high salaries stand out at the extremes and
may indicate data entry errors or special cases.
● Pay bands: clusters of similar values in the sorted data can reveal
salary bands or grade levels within the organization.
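Several of these insights can be read directly from summary statistics. A sketch using a small hypothetical salary series in place of the CSV file:

```python
import pandas as pd

# Hypothetical salaries; replace with the 'salary' column of your dataset
salaries = pd.Series([32000, 35000, 38000, 40000, 42000, 45000, 120000])
sorted_salaries = salaries.sort_values()

print(sorted_salaries.describe())        # count, mean, std, min, quartiles, max
print('Median:', sorted_salaries.median())
print('Skewness:', sorted_salaries.skew())  # > 0 suggests a right-skewed distribution
```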
Question 1:
Consider a dataset with missing values. Use pandas in Python
to handle missing data through imputation. Provide code
snippets and explain the rationale behind your choices.
Answer:
Here's how you can handle missing data through imputation in pandas:
1. Import libraries and load data:
Python:
import pandas as pd
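The snippet above is truncated. A minimal sketch of median and mode imputation, using a small hypothetical DataFrame in place of a CSV file (column names and values are assumptions):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    'age': [25, np.nan, 32, 41, np.nan, 29],
    'city': ['Pune', 'Mumbai', None, 'Delhi', 'Mumbai', 'Pune'],
})

# Numeric column: impute with the median, which is robust to outliers
df['age'] = df['age'].fillna(df['age'].median())

# Categorical column: impute with the mode (most frequent value)
df['city'] = df['city'].fillna(df['city'].mode()[0])

print(df)
```

Median imputation is preferred over mean imputation when the column is skewed or contains outliers; mode imputation is the usual default for categorical columns, since averages are undefined there.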
Question 1:
Eg:- Min-max normalization uses the formula x_norm = (x − min) / (max − min).
For x = 75 with min = 60 and max = 90:
∴ x_norm = (75 − 60) / (90 − 60)
= 15 / 30
= 0.5
Question 1:
Answer:
Here's how you can calculate the standard deviation for a dataset representing daily
temperatures over a month using Python's statistical libraries:
Method 1: Using pandas library
Python
import pandas as pd
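The body of the snippet above is truncated. Reconstructed to match the explanation that follows, with hypothetical sample temperatures (both methods compute the sample standard deviation):

```python
import pandas as pd
import statistics

# Hypothetical daily temperatures for one month; replace with real data
temperatures = [22, 24, 21, 25, 23, 26, 24, 22, 25, 23]

# Method 1: pandas -- .std() uses the sample formula (ddof=1) by default
std_pandas = pd.Series(temperatures).std()

# Method 2: statistics library -- stdev() also uses the sample formula
std_stats = statistics.stdev(temperatures)

print(std_pandas, std_stats)  # both agree
```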
Explanation:
● We first import the necessary libraries: pandas for data manipulation and
statistics for statistical calculations.
● We define a list of sample daily temperatures.
● In the pandas method, we create a pandas Series from the temperature list.
The .std() method calculates the standard deviation.
● In the statistics library method, we directly use the statistics.stdev() function
on the temperature list.
● Finally, we print the calculated standard deviation.
Remember to replace the sample data with your actual temperature data to get the
standard deviation for your specific case.
4. Data Visualization:
Question 1:
Using Matplotlib or any other suitable library, create a line chart to
visualize the trend in stock prices over the last six months. Include
appropriate labels and a legend.
Answer:
First, let's assume that we have the stock prices for the last six months
stored in the lists stock_price_A and stock_price_B. We can use the
matplotlib library to create a line chart as follows:
Code:
import matplotlib.pyplot as plt
# Sample data representing stock prices over the last six months
months = ['Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan']
stock_price_A = [100, 110, 105, 120, 125, 130]
stock_price_B = [90, 95, 100, 110, 115, 120]
# Plot both price series, with labels used by the legend
plt.plot(months, stock_price_A, label='Stock A')
plt.plot(months, stock_price_B, label='Stock B')
plt.xlabel('Month')
plt.ylabel('Price')
plt.title('Stock Prices Over the Last Six Months')
plt.legend()
plt.show()
o/p: A line chart with months on the x-axis and price on the y-axis,
showing the upward trend of Stock A (100 to 130) and Stock B (90 to 120),
with a legend identifying the two series.