
Day 17: Statistical Analysis with NumPy

Computing descriptive statistics on arrays


Performing statistical tests using NumPy functions

On Day 17, we will focus on Statistical Analysis with NumPy. NumPy provides functions for computing
descriptive statistics on arrays, and together with SciPy it can be used to perform various statistical
tests. Let's explore how to compute descriptive statistics and perform statistical tests:

Computing Descriptive Statistics on Arrays:


Example 1: Mean and Median

In [1]:

import numpy as np

# Create a 1D NumPy array
arr = np.array([10, 20, 30, 40, 50])

# Calculate the mean and median of the array
mean_value = np.mean(arr)
median_value = np.median(arr)

print("Mean:", mean_value)
print("Median:", median_value)

Mean: 30.0
Median: 30.0

Example 2: Variance and Standard Deviation

In [2]:

import numpy as np

# Create a 1D NumPy array
arr = np.array([10, 20, 30, 40, 50])

# Calculate the variance and standard deviation of the array
variance_value = np.var(arr)
std_deviation_value = np.std(arr)

print("Variance:", variance_value)
print("Standard Deviation:", std_deviation_value)

Variance: 200.0
Standard Deviation: 14.142135623730951

Performing Statistical Tests using NumPy Functions:


Example 3: Pearson Correlation Coefficient
In [4]:

import numpy as np

# Create two 1D NumPy arrays
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Calculate the Pearson correlation coefficient
correlation_coefficient = np.corrcoef(x, y)[0, 1]

print("Pearson Correlation Coefficient:", correlation_coefficient)

Pearson Correlation Coefficient: -0.9999999999999999

Example 4: t-test

In [6]:

import numpy as np
from scipy.stats import ttest_ind

# Create two 1D NumPy arrays representing two groups
group1 = np.array([32, 34, 33, 36, 31])
group2 = np.array([28, 30, 29, 27, 32])

# Perform an independent t-test
t_statistic, p_value = ttest_ind(group1, group2)

print("t-statistic:", t_statistic)
print("p-value:", p_value)

t-statistic: 3.2879797461071485
p-value: 0.011055318298291785

In these examples:

We used np.mean() and np.median() to calculate the mean and median of an array.
We used np.var() and np.std() to calculate the variance and standard deviation of an array.
We used np.corrcoef() to compute the Pearson correlation coefficient between two arrays.
We performed an independent t-test using the ttest_ind() function from the scipy.stats module.

Statistical analysis is essential for drawing insights from data and making informed
decisions. NumPy's statistical functions and compatibility with external libraries
like scipy.stats enable powerful analysis capabilities.

When you perform statistical analysis using NumPy, you often get numerical results such
as mean, median, standard deviation, p-values, etc. These results provide valuable insights
into your data. Here's a general guideline on how to conclude and draw insights from
statistical analysis results:
1. Know Your Data: Before drawing any conclusions, it's essential to understand your data and the
context of your analysis. Consider the nature of your dataset, the variables you're analyzing, and the
research questions you're trying to answer.
2. Summary Statistics:

Mean: The mean (average) gives you an idea of the central tendency of your data. If your mean
value is significantly different from what you expected, it might indicate issues with your data or
reveal a meaningful pattern.
Median: The median is the middle value in your data when it's sorted. It's less affected by extreme
values (outliers) and can help you understand the data's distribution.
Standard Deviation: The standard deviation measures the data's spread or dispersion. A high
standard deviation suggests that the data points are scattered widely from the mean, while a low
standard deviation indicates that they are closer to the mean.
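
As a quick sketch of the point above about outliers, the mean and median of the same array can differ sharply when one extreme value is present (the numbers below are made up purely for illustration):

import numpy as np

# Hypothetical data containing one extreme outlier (1000)
data = np.array([10, 12, 11, 13, 12, 1000])

print("Mean:", np.mean(data))      # pulled far above the typical values by the outlier
print("Median:", np.median(data))  # stays at 12.0, close to the bulk of the data
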
3. Statistical Tests:

When you perform statistical tests (e.g., t-tests, ANOVA, chi-square tests), you're comparing
groups or variables to see if there are significant differences or relationships. Look at the p-values
associated with these tests:
A low p-value (typically < 0.05) suggests that there is strong evidence against the null
hypothesis, indicating a significant result.
A high p-value suggests that there is not enough evidence to reject the null hypothesis.
Consider the practical significance in addition to statistical significance. A small p-value might
indicate a significant difference, but it might not be practically meaningful.
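
Since chi-square tests are mentioned above alongside t-tests, here is a minimal sketch of the same p-value decision rule using scipy.stats.chisquare; the observed counts are invented purely for illustration:

import numpy as np
from scipy.stats import chisquare

# Hypothetical observed counts for a six-sided die rolled 60 times
observed = np.array([8, 9, 12, 11, 10, 10])

# By default, chisquare() assumes all categories are equally likely (10 per face here)
chi2_statistic, p_value = chisquare(observed)

alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: the die appears to be biased.")
else:
    print("Not enough evidence to reject the null hypothesis.")
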
4. Visualizations:

Visualizations are powerful tools for interpreting data. Create plots and charts to illustrate your
findings. For example, histograms, box plots, scatter plots, and line charts can reveal patterns and
trends.
Visualizations help you communicate your results effectively to others.
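
For instance, a histogram of a NumPy array takes only a few lines with matplotlib; the sample data below is randomly generated, so this is an illustrative sketch rather than analysis of a real dataset:

import numpy as np
import matplotlib.pyplot as plt

# Generate hypothetical sample data for demonstration
rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=200)

# Histogram showing how the values are distributed
plt.hist(data, bins=20, edgecolor="black")
plt.title("Distribution of Sample Values")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
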
5. Context and Domain Knowledge:

Always interpret your results in the context of your research or the field you're working in. Domain-
specific knowledge can provide valuable insights.
Consider the implications of your findings. How do they impact your research question or the
problem you're trying to solve?
6. Further Analysis:

Statistical analysis is often an iterative process. If your initial analysis raises more questions or
uncertainties, consider conducting additional analyses or collecting more data.
Be open to exploring different statistical techniques and models that might better fit your data or
research objectives.
7. Peer Review:

If your analysis is part of a research project, consider peer review. Getting feedback from
colleagues or experts in your field can help validate your findings and improve the quality of your work.

In summary, interpreting statistical analysis results requires a combination of
statistical knowledge, domain expertise, and critical thinking. It's important to
consider both the numerical outputs and the broader context to draw meaningful
conclusions and insights from your data.

🌐 Real-World Scenarios:
1. Market Research and Consumer Behavior Analysis:

Use Case: Analyzing customer survey data to understand consumer preferences.


Example: A retail company conducts a survey to determine customer satisfaction with its
products. They use NumPy to compute the mean, median, and mode of customer ratings. This
analysis helps the company identify the most popular products and areas where improvements
are needed.

Scenario: Market Research and Consumer Behavior Analysis

In this scenario, we have survey data from a retail company that collected customer ratings for its products.
We want to analyze this data to understand consumer preferences and identify popular products.

We'll perform the following tasks using NumPy:

1. Calculate the mean, median, and mode of customer ratings.
2. Identify the most popular products based on their ratings.

Here's a Python code example to illustrate this:

In [7]:

import numpy as np

# Sample customer ratings data for different products (on a scale of 1 to 5)
product_ratings = np.array([4, 5, 4, 3, 5, 2, 4, 5, 3, 4, 5, 5, 3, 4, 4, 5, 4, 3, 2, 5])

# Calculate the mean, median, and mode of ratings
mean_rating = np.mean(product_ratings)
median_rating = np.median(product_ratings)
mode_rating = np.argmax(np.bincount(product_ratings))

# Identify the most popular products (those with the highest ratings)
popular_products = np.where(product_ratings == 5)[0] + 1  # Adding 1 to match product numbering

# Print the results
print("Mean Rating:", mean_rating)
print("Median Rating:", median_rating)
print("Mode Rating:", mode_rating)
print("Popular Products (rated 5):", popular_products)

Mean Rating: 3.95
Median Rating: 4.0
Mode Rating: 4
Popular Products (rated 5): [ 2 5 8 11 12 16 20]

Explanation:

1. We start by importing NumPy and creating a NumPy array called product_ratings containing
customer ratings.
2. We use NumPy functions to calculate the mean, median, and mode of these ratings. The np.mean()
function calculates the mean and np.median() computes the median. NumPy has no built-in mode
function, so np.argmax(np.bincount()) finds the most frequent rating by counting occurrences with
np.bincount() and taking the index of the largest count.
3. To identify the most popular products (those with a rating of 5), we use NumPy's boolean indexing. The
expression np.where(product_ratings == 5) returns the indices of products with a rating of 5.
We add 1 to these indices to match product numbers (assuming the products are numbered starting
from 1).
4. Finally, we print the results, including the mean, median, and mode of the ratings, and the list of popular products.

Let's conclude the results and draw insights from them for the scenario of market research and
consumer behavior analysis:

1. Mean Rating: The mean rating of the products is approximately 3.95 out of 5. This suggests that, on
average, customers have rated the products favorably, indicating overall satisfaction with the product
offerings.
2. Median Rating: The median rating is 4.0, which aligns with the mean rating. The median is a measure
of central tendency, and its proximity to the mean indicates that the ratings are evenly distributed
around the central value. This suggests a relatively consistent level of satisfaction among customers.
3. Mode Rating: The mode rating is 4, meaning that 4 is among the most frequently given ratings (in this
sample it is tied with 5, and np.argmax returns the smaller value). This indicates that a large portion of
customers rated the products a 4 or higher, reflecting a high level of satisfaction.
4. Popular Products (rated 5): Products with a rating of 5 are considered the most popular. Based on
the analysis, the products with ratings of 5 are: [2, 5, 8, 11, 12, 16, 20]. This information is valuable for
the retail company as it identifies specific products that have received consistently high ratings from
customers. These products can be promoted more prominently, and the company can consider
expanding its product offerings in similar categories to meet consumer demand.

In summary, the analysis of customer ratings indicates that the majority of customers
are satisfied with the products, with many products receiving high ratings. The
retail company can use this information to focus its marketing efforts on the most
popular products and explore opportunities for product improvement in areas where
ratings are lower. Additionally, this analysis can guide inventory management and
product development decisions.

2. Clinical Trials and Medical Research: In the context of clinical trials and medical research, let's
explore how NumPy can be used to perform a t-test to evaluate the effectiveness of a new drug.

Use Case: Evaluating the Effectiveness of a New Drug

Scenario: A pharmaceutical company has developed a new drug to treat a specific medical condition. To
assess its efficacy, they conduct a clinical trial. Patients participating in the trial are randomly divided into
two groups: one group receives the new drug, and the other group receives a placebo (inactive substance).
After the trial, the company wants to determine if there is a statistically significant difference in the
outcomes between the two groups.

Here's how you can use NumPy to perform a t-test and draw conclusions from the results:
In [8]:

import numpy as np
from scipy import stats

# Simulated data for illustration (replace with actual trial data)

# Group A: Patients receiving the new drug
group_a_scores = np.array([85, 88, 92, 78, 95, 91, 89, 82, 88, 87])

# Group B: Patients receiving the placebo
group_b_scores = np.array([75, 78, 80, 68, 72, 74, 76, 73, 79, 70])

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(group_a_scores, group_b_scores)

# Define the significance level (alpha)
alpha = 0.05

# Check if the p-value is less than alpha (significance level)
if p_value < alpha:
    print("The difference in outcomes between the two groups is statistically significant.")
else:
    print("There is no statistically significant difference in outcomes between the two groups.")

The difference in outcomes between the two groups is statistically significant.

In this example:

- We have two arrays, `group_a_scores` representing the outcomes for patients who
received the new drug and `group_b_scores` for patients who received the placebo.
- We use the `ttest_ind` function from SciPy to perform a two-sample t-test, which
compares the means of the two groups.
- The `p_value` represents the probability of observing such extreme results if there
were no difference between the groups.
- We define a significance level `alpha` (commonly set to 0.05), which represents the
threshold for statistical significance.
- If the `p_value` is less than `alpha`, we conclude that there is a statistically
significant difference in outcomes between the two groups, suggesting that the new
drug is effective.

This analysis helps medical researchers make informed decisions about the efficacy of
the new drug and whether it should proceed to further testing or clinical use.

3. Quality Control in Manufacturing:

Use Case: Ensuring product quality in manufacturing processes.

Example: A factory produces electronic components, and each component's resistance value is measured.
NumPy is used to calculate the standard deviation of resistance values. If the standard deviation exceeds a
threshold, it indicates that the manufacturing process is not consistent, and adjustments are needed.
In [9]:

import numpy as np

# Sample resistance values of electronic components
resistance_values = np.array([102.5, 103.2, 101.8, 104.0, 102.7, 103.5, 101.0, 104.2])

# Calculate the standard deviation of resistance values
std_deviation = np.std(resistance_values)

# Set a threshold for acceptable standard deviation
threshold = 0.5  # This threshold depends on the manufacturing process and product specifications

# Check if the standard deviation exceeds the threshold
if std_deviation > threshold:
    print("Manufacturing process needs adjustment. Standard deviation is too high.")
else:
    print("Manufacturing process is consistent. Standard deviation is within acceptable limits.")

Manufacturing process needs adjustment. Standard deviation is too high.

Explanation:

In this example, we have a list of resistance values measured from electronic components produced in a
manufacturing process. We use NumPy to calculate the standard deviation of these resistance values,
which is a measure of how much the resistance values vary from the mean (average).

If the calculated standard deviation exceeds a predefined threshold (in this case, 0.5), it indicates that
the manufacturing process is not consistent, and the resistance values have too much variability.
Conversely, if the standard deviation is within the acceptable threshold, it suggests that the
manufacturing process is consistent, and the resistance values are within the desired range.

This type of quality control analysis is crucial in manufacturing to ensure that
products meet the specified quality standards. If adjustments are needed, the
manufacturing process can be fine-tuned to reduce variability and improve product
quality.
