0% found this document useful (0 votes)

46 views8 pages

Data Analysis Exploration

Uploaded by

Humayun Raza

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

46 views8 pages

Data Analysis Exploration

Uploaded by

Humayun Raza

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Data Analysis/Exploration

Notes by Mannan Ul Haq (BDS-3C)

Data analysis is the process of examining and understanding data to gain insights,
make informed decisions, and solve problems.

Data Quality
Data quality is a measure of how good or reliable data is. It considers factors like
accuracy, completeness, consistency, reliability, and up-to-date. High-quality data is
essential for digital businesses.

Dimensions of Data Quality

1. Accuracy (Is the information correct?)

Data should be error-free and reflect real-world scenarios.

Errors can lead to significant problems, such as unauthorized access to bank

accounts.

2. Completeness (How comprehensive is the information?)

Completeness measures how exhaustive a dataset is.

It ensures that all required values are available, making the information usable.

3. Consistency/Reliability (Does the information contradict other trusted resources?)

Consistency refers to data uniformity across networks and applications.

Data in different locations should not conflict with each other.

4. Relevance/Timeliness (Is the information needed, and is it up-to-date?)

Relevance checks if data fulfills its intended purpose.

Timeliness ensures data is available when required, preventing wrong

decisions.

5. Interpretability (How easy is it to understand the data?)

Data Analysis/Exploration 1
Interpretability reflects how easily data can be understood.

Measuring these data quality dimensions helps organizations identify and resolve data
errors, ensuring that their data is fit for its intended purpose. High-quality data is the
cornerstone of effective digital businesses.

Exploring the Dataset

Once you have the data, the next step is to explore it. This preliminary investigation
helps you understand the specific characteristics of the data. It can answer questions
like:

Is the data balanced, or are there more instances of one category than another
(class balance)?

Do certain attributes vary together, indicating potential correlations?

How are the values of data attributes spread out (dispersion)?

Skewness

Are there missing values in the dataset?

Are there any extreme values (outliers) that need attention?

Class Balance
Class Balance or Data Balance refers to the distribution of data points among different
categories or classes within a dataset.

Imagine you have a dataset of customer reviews for a product, and you want to classify
these reviews as either "positive" or "negative". If you have 90% positive reviews
and only 10% negative reviews, you have an imbalance because one class (positive)
dominates the dataset, while the other class (negative) is underrepresented.

Here's a simple explanation:

Balanced Data: When you have roughly an equal number of data points for each
class or category, it's called balanced data. For example, if you have 50 positive
reviews and 50 negative reviews in your dataset, it's balanced.

Imbalanced Data: When one class has significantly more data points than the other
class, it's called imbalanced data. For example, if you have 90 positive reviews and

Data Analysis/Exploration 2
only 10 negative reviews, it's imbalanced.

Why is Class Balance Important?

Class balance is essential because it can impact the performance of machine learning
models. In an imbalanced dataset, models may become biased towards the majority
class (the one with more data points) because they see more examples of it. This can
lead to poor predictions for the minority class.

Attributes Correlation
Correlation is a statistical measure that helps us understand the relationship or
association between two or more variables in a dataset. It tells us how these variables
change in relation to each other. Correlation is often used to determine whether there's
a connection between variables and, if so, the strength and direction of that connection.
Here are some key points about correlation:

1. Positive Correlation: When two variables have a positive correlation, it means that
as one variable increases, the other tends to increase as well.

2. Negative Correlation: Conversely, a negative correlation indicates that as one

variable increases, the other tends to decrease. They move in opposite directions.

3. No Correlation: If there's no apparent pattern or relationship between two

variables, they are said to have no correlation. Changes in one variable do not have
a consistent effect on the other.

Correlation Coefficient:

To quantify the strength and direction of the correlation between two variables, we use a
number called the correlation coefficient. This number ranges between -1 and 1.

A correlation coefficient of 1 indicates a perfect positive correlation.

A correlation coefficient of -1 indicates a perfect negative correlation.

A correlation coefficient close to 0 suggests little to no correlation.

Descriptive Statistics
In data analysis, statistics are important for summarizing and understanding the data.
Depending on whether the data attributes are discrete or continuous, different statistics

Data Analysis/Exploration 3
are calculated:

Measures of Central Tendency

In data analysis, measures of central tendency help us understand the central or typical
value within a dataset. They provide insights into where the data tends to cluster. There
are three primary measures of central tendency: the mean, the median, and the mode.

Mean:
The mean is often referred to as the average. It's calculated by adding up all the values
in a dataset and then dividing by the number of data points. Here's the advantage and
limitation of using the mean:

Advantage of the Mean:

The mean can be used for both continuous and discrete numeric data.

Limitations of the Mean:

The mean cannot be calculated for categorical data because the values cannot be
summed.

The mean is not robust against outliers, meaning that a single large value (an
outlier) can significantly skew the average.

Median:
The median is the middle value in a dataset when it's ordered from smallest to largest.
The median is less affected by outliers and skewed data, making it a preferred measure
of central tendency when the data distribution is not symmetrical.

2256789

Mode:
The mode is the value that occurs most frequently in a dataset.
2256789

Advantage of the Mode:

Unlike the mean and median, which are mainly used for numeric data, the mode
can be found for both numerical and categorical (non-numeric) data.

Data Analysis/Exploration 4
Limitation of the Mode:

In some distributions, the mode may not reflect the center of the distribution very
well, especially if the data is multimodal (has multiple modes) or if all values occur
with similar frequencies.

Measures of Dispersion
Measures of dispersion provide insights into how data values are spread out or vary
within a dataset. Common measures of dispersion include the range, variance, standard
deviation.

Range:
The range is the simplest measure of dispersion. It's calculated by subtracting the
minimum value from the maximum value in the dataset. While it's easy to compute, it's
sensitive to extreme values (outliers) and may not provide a complete picture of data
variability.

Variance and Standard Deviation:

Variance is a measure that quantifies how far each data point is from the mean.

Standard deviation is the square root of the variance and provides a measure of
dispersion in the same units as the data.

These measures give us a more detailed understanding of how data points deviate from
the mean. Higher variance and standard deviation values indicate greater data spread.

Data Distribution
The shape of a data distribution can significantly affect the choice of measures of
central tendency:

Normal or Symmetrical Distribution:

A normal distribution is a specific type of distribution where data tends to cluster
around a central value with no bias to the left or right. It is often represented as a bell
curve.

Data Analysis/Exploration 5
In a normal or symmetrical distribution, the mean, median, and mode are all centered at
the same point. They are approximately equal, making any of these measures a
suitable representation of central tendency.

Skewness:
Skewness measures the degree of asymmetry in a distribution. When a distribution is
skewed, it deviates from the symmetrical bell curve of a normal distribution. In skewed
distributions:

Positive Skewness (Right Skewed): The tail on the right side is longer, and the
mean is pulled toward the right tail. In such cases, the median is often preferred
as a measure of central tendency because it's less affected by extreme values.

Data Analysis/Exploration 6
Negative Skewness (Left Skewed): The tail on the left side is longer, and the
mean is pulled toward the left tail. Again, the median is often a better choice in
this situation.

Standard Normal Distribution

The Standard Normal Distribution, also known as the Z-Distribution, is a specific
type of probability distribution. It is a continuous probability distribution that is
symmetrically shaped like a bell curve. This distribution has a mean (average) of 0 and
a standard deviation of 1.

Data Analysis/Exploration 7
Z-Score:
A Z-Score, also known as a standard score, measures how many standard deviations a
particular data point is away from the mean of a distribution. It's a way to standardize
data and compare it to the standard normal distribution.
Z=x*μ/σ
x = data point

μ= mean
σ = standard deviation

The Z-Score tells you how many standard deviations a data point is above or below the
mean. A positive Z-Score indicates that the data point is above the mean, while a
negative Z-Score indicates that it's below the mean.

The Z-Score is useful for comparing data points from different distributions, identifying
outliers, and making statistical inferences. In particular, it helps determine how extreme
or unusual a data point is in the context of its distribution.

Data Analysis/Exploration 8

Chemistry (Annual Reports - Vol.59-1962)
100% (8)
Chemistry (Annual Reports - Vol.59-1962)
576 pages
Live Tracker - Free PakData Sim Database On-Line 2025
No ratings yet
Live Tracker - Free PakData Sim Database On-Line 2025
9 pages
NADANPENKODI - Malayalam Kambi Kathakal
60% (10)
NADANPENKODI - Malayalam Kambi Kathakal
8 pages
Corel Draw X7 Serial Number & Activation Code
60% (42)
Corel Draw X7 Serial Number & Activation Code
1 page
It - Stephen King's PDF
80% (10)
It - Stephen King's PDF
588 pages
Sim Owner Details - Pakistan No #1 Number Information System 2025
56% (16)
Sim Owner Details - Pakistan No #1 Number Information System 2025
3 pages
Dating Format
89% (195)
Dating Format
23 pages
XXX Archita Phukan Viral Video Original XXX VIDEOS
9% (11)
XXX Archita Phukan Viral Video Original XXX VIDEOS
4 pages
Telugu Family Sex Stories Collection
67% (102)
Telugu Family Sex Stories Collection
157 pages
Science 10 3rd Quarter Exam
84% (103)
Science 10 3rd Quarter Exam
2 pages
ACLS Pretest Answers 2024
92% (26)
ACLS Pretest Answers 2024
29 pages
Microsoft Office 2007 Activation Keys
85% (34)
Microsoft Office 2007 Activation Keys
2 pages
Carbon and Its Compound (Prashant Kirad)
91% (267)
Carbon and Its Compound (Prashant Kirad)
21 pages
Telugu Boothu Kathala 5
67% (18)
Telugu Boothu Kathala 5
33 pages
Telugu Boothu Kathala 24
60% (10)
Telugu Boothu Kathala 24
27 pages
Free Fire Headshot Config File
82% (34)
Free Fire Headshot Config File
1 page
Uveit Foster
50% (6)
Uveit Foster
954 pages
Mineral and Energy Resources (Prashant Kirad)
92% (247)
Mineral and Energy Resources (Prashant Kirad)
20 pages
VIVA QUESTIONS FOR PHYSICS PRACTICALS For Class 12 With Answers.
92% (13)
VIVA QUESTIONS FOR PHYSICS PRACTICALS For Class 12 With Answers.
8 pages
Viva Questions Class 12 Chemistry
74% (19)
Viva Questions Class 12 Chemistry
17 pages
Nodia Sample Paper Class 10th Science 01
86% (14)
Nodia Sample Paper Class 10th Science 01
8 pages
BCS302 Previous Year Question Paper With Answer
73% (11)
BCS302 Previous Year Question Paper With Answer
25 pages
R. D. Sharma Class 9th Book PDF - Unlocked
83% (69)
R. D. Sharma Class 9th Book PDF - Unlocked
464 pages
Chemistry Notes Class 11 Chapter 6 Thermodynamics
73% (30)
Chemistry Notes Class 11 Chapter 6 Thermodynamics
16 pages
Science-Reviewer-3rd-Quarter For Grade 10 Biology
93% (120)
Science-Reviewer-3rd-Quarter For Grade 10 Biology
19 pages
Purposive Communication Module 1 Guide
89% (62)
Purposive Communication Module 1 Guide
91 pages
CH13 Hydrocarbons Shobhit Nirwan
76% (25)
CH13 Hydrocarbons Shobhit Nirwan
58 pages
List of Mobile Network Prefixes in The Philippines
50% (2)
List of Mobile Network Prefixes in The Philippines
4 pages
12th Physics Practical Solutions
80% (71)
12th Physics Practical Solutions
80 pages
Light (Prashant Kirad)
93% (280)
Light (Prashant Kirad)
21 pages

Data Analysis Exploration

Uploaded by

Data Analysis Exploration

Uploaded by

Data Analysis/Exploration

Notes by Mannan Ul Haq (BDS-3C)

Dimensions of Data Quality

Data should be error-free and reflect real-world scenarios.

Errors can lead to significant problems, such as unauthorized access to bank

2. Completeness (How comprehensive is the information?)

Completeness measures how exhaustive a dataset is.

3. Consistency/Reliability (Does the information contradict other trusted resources?)

Consistency refers to data uniformity across networks and applications.

Data in different locations should not conflict with each other.

4. Relevance/Timeliness (Is the information needed, and is it up-to-date?)

Relevance checks if data fulfills its intended purpose.

Timeliness ensures data is available when required, preventing wrong

5. Interpretability (How easy is it to understand the data?)

Exploring the Dataset

Do certain attributes vary together, indicating potential correlations?

How are the values of data attributes spread out (dispersion)?

Are there missing values in the dataset?

Are there any extreme values (outliers) that need attention?

Here's a simple explanation:

Why is Class Balance Important?

2. Negative Correlation: Conversely, a negative correlation indicates that as one

3. No Correlation: If there's no apparent pattern or relationship between two

A correlation coefficient of 1 indicates a perfect positive correlation.

A correlation coefficient of -1 indicates a perfect negative correlation.

A correlation coefficient close to 0 suggests little to no correlation.

Measures of Central Tendency

Advantage of the Mean:

Limitations of the Mean:

Advantage of the Mode:

Variance and Standard Deviation:

Normal or Symmetrical Distribution:

Standard Normal Distribution

You might also like