Data are actual pieces of information that you collect through your study.
Not all data are numbers. For example, if you also record the gender of each of your friends, you get
data such as: male, female.
Types of data
Numerical data
Discrete data = data that can only take certain values [mostly whole numbers or clearly distinct
values]. For example, the number of students in a class (you can't have half a student), or the
results of rolling 2 dice, which can only take the values 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12.
Continuous data = data that can take any value within a range [including all decimal values within
that range]. For example, people's heights, which could be any value within the range of human heights.
Categorical data
These are data sets where the data consist of categories or descriptive terms.
Nominal = a type of categorical data in which objects fall into unordered categories.
Examples: 1] hair color (blonde, brown, red, black, etc.) 2] race (Caucasian, African, Asian, etc.)
3] smoking status (smoker, non-smoker).
Ordinal = a type of categorical data in which objects fall into ordered categories.
Examples: 1] class (year 7, year 8, year 9, etc.) 2] degree of illness (none, moderate, severe)
3] opinion of students about homework (ticked off, neutral, enjoy it).
Binary data = a type of categorical data in which there are only two categories. Binary data can
be either nominal or ordinal.
Ordinal scale
The ordinal scale indicates a ranking or order among the data points. It allows for relative
comparison of values but does not provide precise measurements of the differences between them.
Examples of ordinal scale data include:
Ranks in a competition [e.g. first place, second place, third place]
Educational levels [e.g. elementary, middle school, high school, college]
Ratings or Likert scales [e.g. strongly agree, agree, neutral, disagree, strongly disagree]
Interval scale
The interval scale represents data with meaningful numerical intervals between values. It has
ordered categories, and the difference between any two adjacent points is constant, but it lacks a
true zero point. Examples of interval scale data include:
Temperature in Celsius or Fahrenheit [e.g. 20°C, 30°C]
Years on the calendar [e.g. 2000, 2010, 2020]
IQ scores [e.g. 100, 120, 140]
Ratio scale
The ratio scale is the highest level of measurement, possessing all the characteristics of the previous
scales (nominal, ordinal, and interval) along with a true zero point. It has ordered categories, equal
intervals, and meaningful ratios. Examples of ratio scale data include:
Weight [e.g. 50 kg, 75 kg]
Height [e.g. 150 cm, 180 cm]
Income [e.g. $500, $1000]
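As a hedged illustration (assuming pandas; the category values are taken from the examples above), nominal and ordinal data can be stored explicitly so that order-aware operations behave correctly:
import pandas as pd

# Nominal data: unordered categories
hair = pd.Series(pd.Categorical(['blonde', 'brown', 'red', 'brown']))

# Ordinal data: ordered categories (none < moderate < severe)
illness = pd.Series(pd.Categorical(['none', 'severe', 'moderate', 'none'],
                                   categories=['none', 'moderate', 'severe'],
                                   ordered=True))
print(illness.min(), illness.max())  # none severe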
NumPy is a powerful Python library for numerical computing that provides support for handling large,
multidimensional arrays and matrices, along with a wide range of mathematical functions to operate
on them efficiently. It is widely used in scientific computing, data analysis, and machine learning.
Here are some examples of how to use NumPy.
1] Creating NumPy arrays = NumPy arrays are the fundamental data structure provided by the library.
You can create arrays using various methods, such as from a list or with built-in functions.
Here's an example:
import numpy as np

# Create an array from a Python list
arr = np.array([1, 2, 3, 4, 5])

# Create an array with a built-in function
zeros = np.zeros((2, 3))
Here are three commonly used NumPy features for working with n-dimensional arrays.
1] numpy.ndarray.shape = this attribute returns the shape of an n-dimensional array. It provides
information about the size of each dimension of the array. Here's an example:
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
shape = arr.shape
print(shape)  # Output: (2, 3)
2] numpy.ndarray.reshape = the reshape function allows you to change the shape of an n-dimensional
array without changing its data. It returns a new array with the specified shape. Here's an example:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])
reshaped = arr.reshape(2, 3)
print(reshaped)  # a 2x3 array: [[1 2 3], [4 5 6]]
3] numpy.ndarray.transpose = the transpose function returns a new array with the
axes transposed, which means the rows become columns and the columns become
rows. This function is particularly useful for n-dimensional arrays. Here's an example:
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
transposed = arr.transpose()
print(transposed.shape)  # Output: (3, 2)
To read data from a text file and a spreadsheet file [e.g. CSV], you can use the
pandas library in Python. pandas provides powerful tools for data manipulation and
analysis. Here's an example program that reads data from both file formats (the file
names, and the tab delimiter for the text file, are assumptions for illustration):
import pandas as pd

# Read a CSV file into a DataFrame
df_csv = pd.read_csv("data.csv")

# Read a tab-delimited text file into a DataFrame
df_txt = pd.read_csv("data.txt", sep="\t")
Here are two common types of charts for data visualization, comparison
charts (including bar charts and line charts) and distribution charts (including
histograms and box plots), along with their respective applications:
1. Comparison Charts:
Comparison charts are used to compare different categories or variables to identify
patterns, trends, or relationships. They are useful when you want to visualize and
compare values across different groups. Here are two commonly used comparison
charts:
- Bar Chart: A bar chart uses rectangular bars of varying lengths or heights to
represent the values of different categories. It is suitable for comparing discrete
categories or groups. For example, you can use a bar chart to compare sales
performance across different months or compare the popularity of different
programming languages.
- Line Chart: A line chart connects data points with lines to show the trend or
progression of a variable over time or across continuous categories. It is suitable for
displaying trends, patterns, or changes over a continuous scale. For example, you
can use a line chart to visualize the stock market trends over a period or to compare
the temperature variations throughout the year.
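The sketch below shows both comparison charts in Matplotlib, using hypothetical monthly sales figures:
```python
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
sales = [120, 150, 90, 180, 160]  # hypothetical values

# Bar chart: compare values across discrete categories
plt.bar(months, sales)
plt.title('Monthly Sales (bar chart)')
plt.show()

# Line chart: show the trend of the same variable over time
plt.plot(months, sales, marker='o')
plt.title('Monthly Sales (line chart)')
plt.show()
```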
2. Distribution Charts:
Distribution charts are used to visualize the distribution and characteristics of a
dataset. They help in understanding the spread, central tendency, and shape of the
data. Here are two commonly used distribution charts:
- Histogram: A histogram groups numeric values into bins and uses bars to show how
many values fall within each bin, revealing the shape of the distribution (a full
example appears in the histogram question later in these notes).
- Box Plot: A box plot (also known as a box-and-whisker plot) displays the
summary statistics of a dataset, including the median, quartiles, and outliers. It
provides a visual representation of the spread, central tendency, and skewness of
the data. Box plots are particularly useful for comparing distributions and identifying
outliers or anomalies. For example, you can use a box plot to compare the
distribution of exam scores across different classes or to identify any outliers in a
dataset.
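A minimal Matplotlib sketch of a box plot, using made-up exam scores for two classes:
```python
import matplotlib.pyplot as plt

# Hypothetical exam scores for two classes
class_a = [55, 62, 68, 70, 71, 74, 78, 80, 85, 98]
class_b = [45, 58, 60, 63, 65, 66, 70, 72, 75, 90]

# Each box shows the median, quartiles, and any outliers
plt.boxplot([class_a, class_b], labels=['Class A', 'Class B'])
plt.ylabel('Exam Score')
plt.title('Distribution of Exam Scores by Class')
plt.show()
```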
Remember, the choice of chart depends on the nature of your data, the type of
analysis you want to perform, and the insights you want to convey. Understanding
the characteristics and applications of different chart types will help you choose the
most effective one for your data and goals.
Q. How can pie charts and donut charts be used to analyze how various parts
comprise the whole?
To analyze how various parts comprise the whole using pie charts and donut charts,
you can focus on the proportions and percentages represented by each slice or
segment. Here's how you can perform the analysis:
1. Pie Chart:
A pie chart is a circular chart divided into sectors, with each sector representing a
proportionate part of the whole. To analyze the composition of the whole using a pie
chart, follow these steps:
- Identify the category each slice represents from its label.
- Read the value or percentage associated with each slice to determine its share of the whole.
- Compare the sizes of the slices to identify the largest and smallest contributors.
2. Donut Chart:
A donut chart is similar to a pie chart, but it has a hole or an empty space in the
center. It is useful when you want to represent the composition of the whole while
also providing additional information or categorization. To analyze the composition of
the whole using a donut chart, follow these steps:
- Follow the same steps as mentioned for a pie chart to analyze the individual
slices and their proportions.
- Pay attention to any additional rings or concentric circles within the donut chart.
These can represent further categorization or subgroups within the main categories.
- Analyze the sizes and proportions of the different rings or concentric circles to
understand how the composition is further broken down.
- Compare the sizes of the rings and their percentages to identify any significant
variations or patterns.
In both pie charts and donut charts, it is important to ensure that the chart is properly
labeled and that the values or percentages associated with each slice are clearly
visible. This allows for accurate analysis and interpretation of the composition of the
whole.
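A minimal sketch of a donut chart in Matplotlib (category names and values are hypothetical); a donut is drawn as a pie chart whose wedges have a width smaller than the full radius:
```python
import matplotlib.pyplot as plt

categories = ['Rent', 'Food', 'Transport', 'Other']
amounts = [500, 300, 150, 50]

# wedgeprops={'width': 0.4} hollows out the center, turning the pie into a donut
plt.pie(amounts, labels=categories, autopct='%1.1f%%',
        wedgeprops={'width': 0.4})
plt.title('Monthly Expenses (donut chart)')
plt.show()
```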
To create and visualize a data frame in Python, you can use the `pandas` library
along with the `matplotlib` or `seaborn` libraries for visualization. Here's an example
program that demonstrates the process:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data (names and ages are illustrative)
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
        'Age': [24, 30, 22, 27]}

# Create a data frame from the dictionary
df = pd.DataFrame(data)
print(df)

# Visualize the age distribution as a bar chart
plt.bar(df['Name'], df['Age'])
plt.xlabel('Name')
plt.ylabel('Age')
plt.title('Age Distribution')
plt.show()
```
In this program, we first create a dictionary `data` with sample data. Each key in the
dictionary corresponds to a column name, and the associated values form the data
for that column.
Next, we use the `pd.DataFrame()` function from the `pandas` library to create a
data frame `df` from the dictionary.
We then print the data frame using `print(df)` to see the tabular representation of the
data.
Finally, we use `matplotlib` to visualize the data frame. In this example, we create a
bar chart to show the age distribution using `plt.bar()`. We specify the 'Name' column
as the x-axis and the 'Age' column as the y-axis. We also set labels for the axes and
provide a title for the chart. Lastly, we call `plt.show()` to display the chart.
You can modify this example to suit your specific data and visualization
requirements.
Q. Explain a scatter chart with a suitable example?
A scatter chart, also known as a scatter plot, is a type of chart that uses Cartesian
coordinates to display the values of two variables as points on a graph. It is useful for
visualizing the relationship or correlation between two numeric variables. Each point
on the chart represents an observation or data point.
Suppose you have a dataset that contains information about the height and weight of
individuals. You want to analyze the relationship between height and weight to
determine if there is any correlation.
```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = {
    'Height': [165, 170, 155, 180, 160, 175],
    'Weight': [60, 65, 55, 70, 62, 68]
}

# Create a data frame from the dictionary
df = pd.DataFrame(data)

# Create the scatter chart
plt.scatter(df['Height'], df['Weight'])
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Height vs. Weight')
plt.show()
```
In this example, we have a dictionary `data` that contains two lists, 'Height' and
'Weight', representing the respective values for each individual. We create a data
frame `df` from the dictionary using `pd.DataFrame()`.
To create the scatter chart, we use `plt.scatter()` from the `matplotlib.pyplot` library.
We pass the 'Height' column as the x-axis values and the 'Weight' column as the y-
axis values. We also provide labels for the x-axis and y-axis using `plt.xlabel()` and
`plt.ylabel()`, respectively. Finally, we set the chart title using `plt.title()`.
When we run this code, it will generate a scatter chart where each point represents
an individual's height and weight. By analyzing the scatter chart, we can observe the
general relationship between height and weight. If the points are clustered in a
specific pattern, it suggests a correlation between the two variables. For example, if
the points form an upward-sloping trend, it indicates a positive correlation.
Q. Explain a pie chart with a suitable example?
A pie chart is a circular statistical chart that is divided into slices to represent the
proportionate contribution of different categories to a whole. Each slice's size is
proportional to the quantity it represents, allowing for easy visual comparison
between categories. Pie charts are commonly used to display data where the
individual categories' values make up parts of a whole.
Suppose you have data on the sales distribution of different products in a store for a
particular month. You want to visualize the proportionate contribution of each product
to the total sales.
```python
import matplotlib.pyplot as plt

# Sample data
products = ['Product A', 'Product B', 'Product C', 'Product D']
sales = [12000, 8000, 5000, 3000]

# Create the pie chart with percentage labels on each slice
plt.pie(sales, labels=products, autopct='%1.1f%%')
plt.title('Sales Distribution by Product')
plt.show()
```
In this example, we have a list of product names stored in the `products` variable
and a corresponding list of sales values stored in the `sales` variable.
To create the pie chart, we use `plt.pie()` from the `matplotlib.pyplot` library. We pass
the sales values as the first argument and the product names as the `labels`
parameter. The `autopct` parameter is used to display the percentage value for each
slice.
When we run this code, it will generate a pie chart where each slice represents the
proportionate contribution of a product to the total sales. The size of each slice
corresponds to the relative value of the sales for that product. The labels on the
slices show the product names, and the percentage values inside the slices
represent the percentage contribution of each category to the total.
Python offers several powerful visualization libraries that provide a wide range of
tools and capabilities for creating visually appealing and informative charts, graphs,
and plots. Here are some popular visualization libraries in Python:
1. Matplotlib:
Matplotlib is a versatile and widely-used library for creating static, animated, and
interactive visualizations in Python. It provides a wide range of plotting functions and
supports various chart types, including line plots, scatter plots, bar plots, histograms,
pie charts, and more. Matplotlib offers fine-grained control over plot customization,
allowing users to customize colors, labels, axes, titles, and other visual elements.
2. Seaborn:
Seaborn is built on top of Matplotlib and provides a higher-level interface for
creating attractive statistical visualizations. It simplifies the creation of complex
visualizations such as heatmaps, pair plots, violin plots, and time series plots.
Seaborn offers built-in themes and styles to enhance the aesthetics of plots and
provides additional statistical functionalities.
3. Plotly:
Plotly is a library that offers interactive and web-based visualizations. It supports a
wide range of chart types, including line plots, scatter plots, bar plots, heatmaps, 3D
plots, and more. Plotly allows users to create interactive plots that can be embedded
in web applications or viewed in web browsers. It also provides advanced features
like animation, interactivity, and sharing capabilities.
4. Bokeh:
Bokeh is another library for creating interactive visualizations in Python. It
emphasizes interactive and web-based plots, allowing users to create interactive
dashboards, applications, and data-driven stories. Bokeh supports a variety of chart
types, including line plots, scatter plots, bar plots, area plots, and more. It provides
interactivity through tools like zooming, panning, and hovering.
5. Altair:
Altair is a declarative statistical visualization library that allows users to create
interactive and exploratory visualizations using a concise and intuitive syntax. It
leverages the Vega-Lite specification and provides a high-level API for creating
visualizations with minimal code. Altair supports a variety of chart types and can
handle large datasets efficiently.
These are just a few examples of the many visualization libraries available in Python.
Each library has its strengths and unique features, allowing users to choose the one
that best suits their requirements and preferences. Python's rich ecosystem of
visualization libraries provides flexibility, versatility, and creativity in creating stunning
and informative visualizations for data analysis and presentation.
Data visualization offers several benefits across various fields and industries. Here
are some key benefits:
Increased Efficiency: Visualizing data reduces the time required to understand and
communicate information. Instead of analyzing raw data or lengthy reports,
visualizations present key findings and insights at a glance. This efficiency allows
users to focus on interpreting the data rather than spending excessive time processing it.
Q. Explain a histogram with a suitable example?
To create a histogram in Python, we can use the popular data visualization library
called Matplotlib. Here's an example program that generates a histogram:
```python
import matplotlib.pyplot as plt

# Sample dataset
data = [1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 10]

# Plot the histogram with 5 bins and black bar borders
plt.hist(data, bins=5, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of Sample Data')
plt.show()
```
In this program, we start by importing the `pyplot` module from Matplotlib. We then
define a sample dataset called `data`, which consists of some integer values.
Next, we use the `plt.hist()` function to generate the histogram. The `hist()` function
takes two main arguments: the dataset (`data`) and the number of bins (`bins`) to
divide the data into. In this case, we have set the number of bins to 5.
We also provide the `edgecolor` parameter to add black borders around the bars in
the histogram, making them more visible.
After plotting the histogram, we add labels to the x-axis and y-axis using `plt.xlabel()`
and `plt.ylabel()`, respectively. Additionally, we set the title of the histogram using
`plt.title()`.
When you run this program, it will generate a histogram with bars representing the
frequency or count of values falling within each bin. The x-axis represents the values
in the dataset, and the y-axis represents the frequency.
Unit - 3
Data cleaning refers to the process of identifying and correcting or removing errors,
inconsistencies, and inaccuracies in a dataset. It involves handling missing values,
handling duplicate records, correcting data formats, resolving inconsistencies, and
addressing outliers. The goal of data cleaning is to improve the quality and reliability
of the dataset before performing any analysis or using it for further processing.
Data transformation, on the other hand, involves converting or modifying the data in
order to make it suitable for a specific purpose or analysis. It can include various
operations such as scaling, normalization, aggregation, encoding, and feature
engineering.
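As a hedged sketch (the column name and values are made up), min-max normalization is one common transformation, rescaling a numeric column into the range [0, 1]:
```python
import pandas as pd

# Hypothetical feature column
df = pd.DataFrame({'Value': [10, 20, 30, 40, 50]})

# Min-max normalization: (x - min) / (max - min)
df['Normalized'] = ((df['Value'] - df['Value'].min())
                    / (df['Value'].max() - df['Value'].min()))
print(df)
```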
```python
import pandas as pd

# Example dataset
scores = [82, 68, 91, 75, 60, 95, 88, 72, 78, 85]

# Create a DataFrame
df = pd.DataFrame({'Score': scores})

# Bin the scores into categories (pd.cut is the usual pandas tool for this)
df['Category'] = pd.cut(df['Score'], bins=[0, 60, 80, 100],
                        labels=['Low', 'Medium', 'High'])
print(df)
```
In this example, we create a DataFrame `df` with a column 'Score' containing the
students' exam scores. We define the bin edges `[0, 60, 80, 100]` to represent the
ranges for the low, medium, and high categories. We also define the bin labels 'Low',
'Medium', and 'High'.
Example:
Consider a population of 10 individuals numbered from 1 to 10. We want to randomly
select a sample of 4 individuals from this population.
Let's say we perform random sampling with replacement four times, one draw per
sample member. Here's one possible outcome:
- First draw: 3
- Second draw: 7
- Third draw: 3
- Fourth draw: 1
In this case, individual 3 was selected twice, since it was returned to the population
after the first draw. The selected sample is [3, 7, 3, 1].
Now let's perform random sampling without replacement. Here's one possible outcome:
- First draw: 2
- Second draw: 6
- Third draw: 9
- Fourth draw: 1
In this case, each selection is unique, and no individual is selected more than once.
The selected sample is [2, 6, 9, 1].
The key difference between random sampling with and without replacement lies in
the potential for duplication in the sample. With replacement, it is possible to select
the same element multiple times, while without replacement, each selection is
unique.
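A minimal sketch of both sampling schemes with NumPy (the seed is arbitrary, chosen only for reproducibility):
```python
import numpy as np

population = np.arange(1, 11)        # individuals numbered 1 to 10
rng = np.random.default_rng(seed=0)  # seeded for reproducibility

# With replacement: the same individual may be drawn more than once
sample_with = rng.choice(population, size=4, replace=True)

# Without replacement: every selection is unique
sample_without = rng.choice(population, size=4, replace=False)

print("With replacement:   ", sample_with)
print("Without replacement:", sample_without)
```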
a) Dropping NaN values: This approach involves removing rows or columns that
contain NaN values. The `dropna()` function in pandas can be used to drop NaN
values. By specifying the `axis` parameter, you can choose to drop rows (`axis=0`) or
columns (`axis=1`) that contain NaN values.
b) Filling NaN values: Instead of removing NaN values, you can choose to fill them
with appropriate values. The `fillna()` function in pandas allows you to fill NaN values
with a specific value or with methods like mean, median, or forward/backward filling.
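A brief sketch of both approaches (the DataFrame below is hypothetical):
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})

# a) Drop rows (axis=0) or columns (axis=1) containing NaN
rows_kept = df.dropna(axis=0)
cols_kept = df.dropna(axis=1)

# b) Fill NaN with a constant, the column mean, or by forward filling
filled_const = df.fillna(0)
filled_mean = df.fillna(df.mean())
filled_ffill = df.ffill()
```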
There are several ways to calculate central values (mean, median, mode) in Python:
1. Using NumPy:
NumPy provides functions for the mean and median. It has no built-in mode function;
for non-negative integers, a common workaround combines `bincount` and `argmax`:
```python
import numpy as np
# Example data
data = [1, 2, 3, 4, 5, 5, 6, 7, 8, 9]
# Calculate mean
mean = np.mean(data)
# Calculate median
median = np.median(data)
# Calculate mode (NumPy has no np.mode; bincount + argmax works for non-negative ints)
mode = np.bincount(data).argmax()
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
```
2. Using pandas:
Pandas is a popular library for data manipulation and analysis in Python. It provides
convenient functions to calculate central values:
```python
import pandas as pd
# Example data
data = [1, 2, 3, 4, 5, 5, 6, 7, 8, 9]
# Create a DataFrame
df = pd.DataFrame(data, columns=['Values'])
# Calculate mean
mean = df['Values'].mean()
# Calculate median
median = df['Values'].median()
# Calculate mode
mode = df['Values'].mode()
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
```
3. Using statistics:
The statistics module in Python provides functions to calculate various statistical
measures, including central values:
```python
import statistics
# Example data
data = [1, 2, 3, 4, 5, 5, 6, 7, 8, 9]
# Calculate mean
mean = statistics.mean(data)
# Calculate median
median = statistics.median(data)
# Calculate mode
mode = statistics.mode(data)
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
```
These are just a few examples of how you can calculate central values in Python.
Depending on your specific requirements and the nature of your data, you can
choose the appropriate library and functions to calculate the central values that best
suit your needs.
Q. Calculate the median of a specific column in a DataFrame?
To calculate the median of a specific column in a DataFrame, you can use the
`median()` function provided by the pandas library in Python. Here's an example:
```python
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, 8, 9, 10],
                   'C': [11, 12, 13, 14, 15]})

# Calculate the median of column 'B'
median_B = df['B'].median()
print("Median of column B:", median_B)
```
Output:
```
Median of column B: 8.0
```
In this example, we have a DataFrame `df` with three columns: 'A', 'B', and 'C'. To
calculate the median of column 'B', we access the column using `df['B']` and then
apply the `median()` function to it. The resulting median value is stored in the
variable `median_B` and printed.
You can replace `'B'` with the name of the specific column you want to calculate the
median for in your DataFrame.
Q. State and explain percentiles and quartiles with an example?
Percentiles and quartiles are statistical measures used to divide a dataset into equal
parts or segments based on ranking. They provide valuable insights into the
distribution and spread of data. Let's explain percentiles and quartiles with examples:
1. Percentiles:
Percentiles divide a dataset into 100 equal parts, representing the position of a
particular value in relation to the entire dataset. For instance, the 25th percentile
represents the value below which 25% of the data falls.
Example:
Consider a dataset of exam scores: [65, 70, 72, 75, 78, 80, 82, 85, 88, 90]. To find
the 75th percentile, we calculate the value below which 75% of the data falls. Using
the median-of-halves method (the same method used for the quartiles below), the
75th percentile of this dataset is 85.
2. Quartiles:
Quartiles divide a dataset into four equal parts, representing three points that divide
the dataset into quarters. The three quartiles are denoted as Q1, Q2 (also known as
the median), and Q3.
Example:
Using the same dataset of exam scores: [65, 70, 72, 75, 78, 80, 82, 85, 88, 90]. The
quartiles can be calculated as follows:
- Q1: The first quartile, or the 25th percentile, represents the value below which 25%
of the data falls. In this case, the value is 72.
- Q2: The second quartile, or the 50th percentile, is the median, representing the
value below which 50% of the data falls. In this case, the median is 79.
- Q3: The third quartile, or the 75th percentile, represents the value below which 75%
of the data falls. In this case, the value is 85.
Quartiles provide insights into the spread of data and are commonly used to define
the interquartile range (IQR), which measures the dispersion between the first and
third quartiles.
By using percentiles and quartiles, we can understand how the dataset is distributed,
identify outliers, and compare specific values with respect to the overall dataset.
These measures are particularly useful in exploratory data analysis, statistical
analysis, and visualizations.
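As a hedged sketch using the same exam scores, NumPy's `percentile` function computes these measures directly (note that its default linear interpolation can give slightly different values than the median-of-halves method used above):
```python
import numpy as np

scores = [65, 70, 72, 75, 78, 80, 82, 85, 88, 90]

q1 = np.percentile(scores, 25)
q2 = np.percentile(scores, 50)  # the median
q3 = np.percentile(scores, 75)
iqr = q3 - q1                   # interquartile range

print("Q1:", q1, "Median:", q2, "Q3:", q3, "IQR:", iqr)
```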
1. Pearson Correlation:
Pearson correlation measures the linear relationship between two continuous
variables. It assumes that the variables are normally distributed and have a linear
association. The Pearson correlation coefficient ranges from -1 to 1, where 0
indicates no linear correlation, -1 indicates a perfect negative linear correlation, and
1 indicates a perfect positive linear correlation.
Example:
Consider a dataset that contains the heights and weights of individuals. You can
calculate the Pearson correlation coefficient to measure the relationship between
these two variables. A positive coefficient close to 1 would indicate that as height
increases, weight also tends to increase.
2. Spearman Correlation:
Spearman correlation measures the monotonic relationship between two variables. It
is based on the ranks of the data rather than the actual values. Spearman correlation
is appropriate for variables that may not have a linear relationship but have a
consistent monotonic pattern.
Example:
Suppose you have a dataset containing the rankings of students in two different
exams. By calculating the Spearman correlation coefficient, you can determine
whether there is a consistent pattern in the rankings across the two exams,
regardless of the specific values.
3. Kendall Correlation:
Kendall correlation is another measure of rank correlation, similar to Spearman
correlation. It quantifies the strength and direction of the monotonic relationship
between two variables. Kendall correlation is commonly used when dealing with
ranked or ordinal data.
Example:
Consider a survey where respondents are asked to rank their preferences for
different brands. By calculating the Kendall correlation coefficient, you can assess
the level of agreement or disagreement in the rankings between different individuals.
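All three coefficients can be computed in pandas via the `method` argument of `corr()`; the height and weight values below are hypothetical:
```python
import pandas as pd

df = pd.DataFrame({'Height': [165, 170, 155, 180, 160, 175],
                   'Weight': [60, 65, 55, 70, 62, 68]})

# Pairwise correlation between two columns, one method at a time
for method in ['pearson', 'spearman', 'kendall']:
    coef = df['Height'].corr(df['Weight'], method=method)
    print(f"{method.capitalize()} correlation: {coef:.3f}")
```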
Grouping and aggregation are common operations in data analysis and involve
grouping data based on one or multiple fields and performing calculations on those
groups. Let's explore grouping and aggregation by single or multiple fields:
1. Grouping by a single field:
Example:
Consider a dataset of sales transactions with fields like "Product Category" and
"Sales Amount". To calculate the total sales amount for each product category, you
can group the data by the "Product Category" field and apply an aggregation function
like sum():
```python
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'Product Category': ['Electronics', 'Clothing', 'Electronics',
                                        'Clothing'],
                   'Sales Amount': [500, 300, 700, 400]})

# Group by a single field and sum the sales in each group
grouped = df.groupby('Product Category').sum()
print(grouped)
```
Output:
```
Sales Amount
Product Category
Clothing 700
Electronics 1200
```
In this example, the DataFrame is grouped by the "Product Category" field using the
`groupby()` function, and the `sum()` aggregation function is applied to the "Sales
Amount" column for each group. The resulting DataFrame shows the total sales
amount for each product category.
2. Grouping by multiple fields:
Example:
Consider a dataset of sales transactions with fields like "Product Category",
"Region", and "Sales Amount". To calculate the total sales amount for each
combination of product category and region, you can group the data by the "Product
Category" and "Region" fields and apply an aggregation function like sum():
```python
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'Product Category': ['Electronics', 'Clothing', 'Electronics',
                                        'Clothing'],
                   'Region': ['North', 'South', 'North', 'South'],
                   'Sales Amount': [500, 300, 700, 400]})

# Group by multiple fields and sum the sales in each group
grouped = df.groupby(['Product Category', 'Region']).sum()
print(grouped)
```
Output:
```
Sales Amount
Product Category Region
Clothing South 700
Electronics North 1200
```