PRESIDENT’S COLLEGE
DEPARTMENT OF MATHEMATICS
Integrated Mathematics
Module 2 – Statistics
Handout 1: Data Collection & Presentation
Week 3
Organization of Data
The organization of data refers to the structured arrangement and management of information or data elements to
make them accessible, understandable, and useful for various purposes. It involves the design and implementation
of a systematic framework or structure that allows data to be stored, retrieved, processed, and analyzed efficiently.
Organizing data is crucial for decision-making, information retrieval, data analysis, and overall data management.
Several key aspects of data organization include:
1. Data Structure: Data can be organized using various structures such as tables, lists, trees, graphs,
databases, and more, depending on the nature and requirements of the data. These structures help
establish relationships and hierarchies among data elements.
2. Data Classification: Data is categorized into different groups or classes based on its characteristics and
attributes. Classification helps in organizing data into meaningful categories, making it easier to manage
and retrieve.
3. Data Naming and Labeling: Properly naming and labeling data elements, fields, or variables is essential for
clarity and consistency. Clear and meaningful names make it easier to understand and work with the data.
4. Data Hierarchies: Data often has hierarchical relationships, where some data elements are more granular
or detailed, while others are higher-level summaries or aggregations. Organizing data hierarchically can
facilitate navigation and analysis.
5. Data Indexing: Indexing involves creating data structures (e.g., indexes) to quickly locate specific data
records or entries within a large dataset. Indexing improves data retrieval performance, especially in
databases.
6. Data Relationships: Establishing relationships between different data elements or entities is crucial for
relational databases. Relationships define how data elements are related to each other and are key for
maintaining data integrity.
7. Data Storage: Choosing appropriate storage mechanisms and technologies to store data efficiently, whether
it's in traditional databases, data warehouses, data lakes, or cloud-based storage solutions.
8. Data Security: Implementing measures to protect data from unauthorized access or breaches is a
fundamental aspect of data organization. This includes user authentication, encryption, access controls, and
data auditing.
9. Data Documentation: Properly documenting data, including metadata (data about data), is essential for
understanding its context, source, quality, and usage. Documentation is crucial for data governance and
compliance.
10. Data Retrieval and Querying: Creating mechanisms for retrieving and querying data is vital for extracting
meaningful insights from it. This may involve the use of query languages, search algorithms, or data
analysis tools.
11. Data Maintenance: Regularly updating, cleaning, and maintaining data to ensure its accuracy, consistency,
and reliability over time.
12. Data Coding: Data coding involves assigning numerical or categorical codes to represent specific categories,
attributes, or values within a dataset. This coding process simplifies the data, making it more manageable
and facilitating analysis. For example, in a survey, you might code responses like "Male" as 1 and "Female"
as 2. Coding is essential for quantitative analysis and statistical operations.
13. Data Entry: Data entry is the process of manually inputting data into a digital system or database. This can
involve entering data from paper forms, surveys, or other sources into a computerized system. Accurate
and consistent data entry is critical for maintaining data quality and integrity. Errors in data entry can lead
to problems down the line when analyzing or using the data.
14. Data Validation: During the data entry process, data validation checks can be implemented to ensure that
the entered data adheres to predefined rules or standards. This helps identify and correct errors or
inconsistencies in real-time, contributing to data quality.
15. Data Transformation: In some cases, data entry and coding may also involve transforming data from one
format or structure to another. For example, converting dates from different formats into a standardized
date format for consistency.
16. Data Cleaning: Data cleaning often follows data entry, and it involves identifying and correcting errors,
missing values, duplicates, or outliers in the dataset. Cleaning is essential to ensure data accuracy and
reliability.
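The coding, validation and cleaning steps above (points 12, 14 and 16) can be sketched in a few lines of Python. This is a minimal illustration: the response list is hypothetical, and the codes follow the "Male" = 1, "Female" = 2 example given under Data Coding.

```python
# Code categorical survey responses as numbers (point 12),
# validating (point 14) and cleaning (point 16) along the way.
sex_codes = {"Male": 1, "Female": 2}  # coding scheme from the example above

raw_responses = ["Male", "Female", "female", "Male", "", "Female"]  # hypothetical data

coded = []
for r in raw_responses:
    r = r.strip().capitalize()   # cleaning: fix case and stray spaces
    if r in sex_codes:           # validation: accept only known categories
        coded.append(sex_codes[r])
    else:
        coded.append(None)       # flag invalid or missing entries for review

print(coded)  # [1, 2, 2, 1, None, 2]
```

Flagging invalid entries with None, rather than silently dropping them, keeps the record count intact so the errors can be reviewed later.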
Prepared by Henry Brandon (23/09/2023)
Instruments Used to Collect Data
1. Surveys: Surveys involve asking structured questions to a sample of individuals or organizations. They can
be conducted through various means, including paper surveys, online surveys, telephone interviews, or
face-to-face interviews.
2. Questionnaires: Questionnaires are a type of survey instrument that typically consists of a set of
standardized questions. Respondents provide written or verbal responses to these questions.
3. Interviews: Interviews involve one-on-one or group interactions with participants to gather information.
They can be structured (with predetermined questions) or unstructured (more open-ended and
conversational).
4. Observations: Observational data is collected by directly watching and recording events, behaviors, or
phenomena. It can be done in a controlled setting (controlled observations) or in natural environments
(naturalistic observations).
5. Experiments: Experiments involve manipulating one or more variables to observe their effects on other
variables. Experiments are commonly used in scientific research to establish cause-and-effect relationships.
6. Focus Groups: Focus group discussions involve a small group of participants who engage in a facilitated
discussion about a specific topic. They are often used to gather qualitative insights and opinions.
7. Content Analysis: Content analysis involves systematically analyzing written, visual, or audio materials
(e.g., texts, videos, images) to extract and code relevant information.
8. Case Studies: Case studies involve an in-depth examination of a single case or a few cases. Researchers
collect detailed information about the case(s) to gain insights into specific phenomena.
9. Secondary Data: Secondary data is data that has been previously collected by someone else for a different
purpose. Researchers use existing datasets, documents, or records for their analysis.
10. Sensor Data: Sensors and instruments like GPS devices, temperature sensors, accelerometers, and more can
collect data automatically in real-time, often used in scientific research and environmental monitoring.
11. Web Scraping: Web scraping involves extracting data from websites or online sources. It is commonly used
in data collection for web-based research.
12. Diaries and Journals: Participants keep records of their activities, thoughts, or experiences in diaries or
journals. These can provide valuable insights into daily life and attitudes.
13. Photography and Videography: Images and videos can capture visual data, which is especially useful in
fields like anthropology, ecology, and art analysis.
14. Biometric Data: Biometric instruments can collect physiological data like heart rate, EEG
(electroencephalogram), or eye-tracking data to study human behavior and physiological responses.
15. Social Media Data: Data from social media platforms can be collected for various purposes, such as
sentiment analysis, trend tracking, or studying online behavior.
16. Geographic Information Systems (GIS): GIS tools collect and analyze spatial data, including maps,
geographic coordinates, and geographic features.
17. Telemetry: Telemetry involves remotely collecting data from sensors or instruments and transmitting it to
a central location for monitoring or analysis. It's commonly used in fields like environmental science and
engineering.
18. Biological Samples: In fields like biology and medicine, researchers collect biological samples (e.g., blood,
tissue, DNA) for laboratory analysis.
19. Economic Indicators: Economic data, such as GDP, unemployment rates, and inflation, are collected by
government agencies and organizations to monitor economic conditions.
20. Psychometric Tests: Psychometric instruments are used in psychology and education to measure cognitive
abilities, personality traits, and other psychological constructs.
Frequency Table
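A frequency table tallies how often each value occurs in a data set. A minimal sketch in Python, using collections.Counter on a small hypothetical data set:

```python
from collections import Counter

# Hypothetical data: shoe sizes of 10 students
sizes = [6, 7, 7, 8, 6, 9, 7, 8, 7, 6]

freq = Counter(sizes)  # tally the frequency of each value

print("Size | Frequency")
for size in sorted(freq):
    print(f"{size:>4} | {freq[size]}")
# Size | Frequency
#    6 | 3
#    7 | 4
#    8 | 2
#    9 | 1
```

The frequencies always sum to the number of observations, which is a quick check on the tally.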
Presentation of Data
The presentation of data refers to the process of visually and graphically representing information, facts, or
findings in a clear, understandable, and often engaging manner. The primary objective of data presentation is to
communicate data-driven insights effectively to an audience, making it easier for them to grasp the meaning,
patterns, and implications of the data. Effective data presentation enhances data interpretation and aids decision-
making. Common methods of presenting data include charts, graphs, tables, maps, infographics, and narratives,
among others. The choice of presentation format depends on the nature of the data, the audience, and the specific
objectives of conveying the information.
1. Bar Chart:
Definition: A bar chart represents data using rectangular bars of varying lengths. The length of each bar
corresponds to the value it represents.
When to Use: Use a bar chart to compare discrete categories or data points. For example, you can use a
bar chart to compare the sales performance of different products in a store over a month.
2. Pie Chart:
Definition: A pie chart is a circular graph divided into slices, with each slice representing a portion of a
whole. Each slice's size represents the proportion or percentage of a category.
When to Use: Use a pie chart to show the composition of a whole when you want to emphasize relative
proportions. For instance, you can use a pie chart to display the distribution of expenses in a budget.
3. Line Graph:
Definition: A line graph displays data points as a series of connected dots, forming a line. It is suitable
for showing trends and changes over time or a continuous variable.
When to Use: Use a line graph to illustrate the stock price of a company over several years, highlighting
the trend in its value.
4. Histogram:
Definition: A histogram displays the distribution of numerical data using vertical bars or bins. Each bar
represents a range of values, and the height indicates the frequency of data points in that range.
When to Use: Use a histogram to visualize the distribution of exam scores in a classroom, showing how
many students scored within each score range.
5. Frequency Polygon:
Definition: A frequency polygon is a line graph that represents data frequencies using line segments
connected to data points or bins.
When to Use: Use a frequency polygon in conjunction with a histogram to provide a smoother
representation of the data's distribution, making it easier to identify trends.
6. Ogive:
Definition: An ogive, or cumulative frequency curve, displays cumulative frequencies or percentages of
data values.
When to Use: Use an ogive to visualize the cumulative distribution of data, such as the cumulative
number of customers who have purchased a product at different price points.
7. Box-and-Whisker Plot:
Definition: A box-and-whisker plot summarizes the distribution of numerical data by displaying the
median, quartiles, range, and outliers in a graphical format.
When to Use: Use a box-and-whisker plot to compare the salary distributions of employees in different
departments of a company.
8. Stem-and-Leaf Plot:
Definition: A stem-and-leaf plot organizes and displays numerical data by separating each data point
into a "stem" (leading digits) and a "leaf" (trailing digits).
When to Use: Use a stem-and-leaf plot to visualize the distribution of ages of participants in a survey.
9. Scatter Plots:
Definition: A scatter plot shows individual data points as dots on a two-dimensional plane. It is used to
visualize relationships between two continuous variables.
When to Use: Use a scatter plot to explore the correlation between the number of hours spent studying
and exam scores for a group of students.
10. Cross Tabulation of Nominal or Categorical Data:
Definition: Cross tabulation, or a contingency table, summarizes the relationships between two or more
categorical variables by presenting the frequency or counts in a table format.
When to Use: Use cross tabulation to analyze the relationship between gender (male or female) and the
preference for a particular type of smartphone (iPhone or Android) among survey respondents.
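The stem-and-leaf plot described in point 8 can be built directly: for two-digit data, split each value into a tens-digit stem and a units-digit leaf. A sketch with hypothetical survey ages (assumes all values are two-digit):

```python
# Build a stem-and-leaf plot for two-digit data:
# stem = tens digit, leaf = units digit.
ages = [23, 25, 27, 31, 34, 34, 38, 42, 45, 45, 47, 51]  # hypothetical ages

stems = {}
for x in sorted(ages):
    stems.setdefault(x // 10, []).append(x % 10)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
# 2 | 3 5 7
# 3 | 1 4 4 8
# 4 | 2 5 5 7
# 5 | 1
```

Sorting the data first means the leaves on each row come out in order, which is the conventional presentation.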
A Bar Chart, Pie Chart & Line Graph
Frequency Polygon
Histogram
Ogive
Box-and-Whisker Plot
Stem-and-Leaf Plot
Scatter Plot
Bimodal Distributions
Uniform Distribution
Multimodal Distribution
Examples:
Stem-and-Leaf Plots
Stem | Leaves
   0 |
   1 | 8
   2 | 0 2 2 3 4 7 8 8 9 9
   3 | 0 0 0 0 1 1 1 2 3 4 4 4 5 6 6 6 6 7 7 7 8 9 9 9 9
   4 | 0 0 0 0 1 1 1 1 1 2 2 4 5 6 6 7 7 9
   5 | 0 0 1 1 2 2 3 3 3 5 5 8 9 9
   6 | 0 1 3 5 8 8
   7 | 3 4
   8 |
   9 | 4
Key: 2 | 0 represents 20.
Line Graph
Bar Charts – Simple: Vertical & Horizontal, Stacked: Horizontal & Vertical, Comparative: Vertical &
Horizontal
Interpretation:
The tallest bar corresponds to the 13-25 age group, while the shortest bar corresponds to the 45-64 age group.
Histograms
Interpretation:
A histogram shows where the frequencies increase and decrease across the bins, and it also reveals the shape of the distribution (here, approximately normal).
Frequency Polygons
Interpretation:
A frequency polygon is read in the same way as a simple line graph.
Pie Charts:
Interpretation:
a. Bicycle = 125
b. Does not walk = 415
c. Bus & Car = 290
Students who come to school by car are the most numerous, while students who walk to school are the fewest. The majority of the students (58%) use a form of transportation that does not require any physical effort.
Box-and-Whisker Plots:
Interpretation: A box-and-whisker plot displays the highest observation, the lowest observation, the median Q2, the first quartile Q1 and the third quartile Q3. The first whisker is bounded by the lowest observation and the first quartile. The rectangular box is bounded by the first and third quartiles. The second whisker is bounded by the third quartile and the highest observation. Finally, the median is represented by a stroke through the rectangular box. These five values are known as the five-number summary.
The shape of the distribution can be determined from the box-and-whisker plot. To decide whether the distribution is positively or negatively skewed, we compare the mean and the median.
Measuring Skewness:
If the mean = median, then we have a normal distribution.
If the mean > median, then we have a positively skewed distribution.
If the mean < median, then we have a negatively skewed distribution.
Further, we can use the box-and-whisker plot to determine the proportion of the data that lies above or below a given point (percentile).
Range = highest observation - lowest observation.
The interquartile range is IQR = Q3 - Q1, and the semi-interquartile range is SIQR = (Q3 - Q1)/2.
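The five-number summary, range, IQR, SIQR and the mean-median skewness check above can all be computed with Python's statistics module. A sketch on a small hypothetical data set (the quartile method is a choice; textbooks vary):

```python
import statistics

# Hypothetical data set (sorted for readability)
data = [2, 3, 4, 4, 5, 6, 6, 7, 9]

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
mean = statistics.mean(data)

data_range = max(data) - min(data)  # range = highest - lowest
iqr = q3 - q1                       # interquartile range
siqr = iqr / 2                      # semi-interquartile range

# Skewness check: compare the mean with the median (Q2)
if mean > q2:
    skew = "positively skewed"
elif mean < q2:
    skew = "negatively skewed"
else:
    skew = "approximately normal (symmetric)"

print(q1, q2, q3)             # 4.0 5.0 6.0
print(data_range, iqr, siqr)  # 7 2.0 1.0
print(skew)                   # positively skewed (mean is about 5.11 > median 5)
```

Here the mean exceeds the median, so the rule above classifies the distribution as positively skewed, consistent with the long tail towards the value 9.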
Ogive:
Interpretation:
An ogive supports much the same interpretation as a box-and-whisker plot: the median, quartiles and other percentiles can all be read from the cumulative frequency curve.