
Course: Educational Statistics (8614) Semester: Spring, 2021

Level: B.Ed

ASSIGNMENT No. 1

Q.1 What do you understand by statistics? What are the characteristics of statistics? Explain in detail.
Statistics
Statistics is a branch of applied mathematics that involves the collection, description,
analysis, and inference of conclusions from quantitative data. The mathematical theories behind statistics rely
heavily on differential and integral calculus, linear algebra, and probability theory. Statisticians, people who do
statistics, are particularly concerned with determining how to draw reliable conclusions about large groups and
general phenomena from the observable characteristics of small samples that represent only a small portion of
the large group or a limited number of instances of a general phenomenon.

The two major areas of statistics are known as descriptive statistics, which describes the
properties of sample and population data, and inferential statistics, which uses those properties to test
hypotheses and draw conclusions.

Understanding Statistics
Statistics are used in virtually all scientific disciplines such as the physical and social
sciences, as well as in business, the humanities, government, and manufacturing. Statistics is fundamentally a
branch of applied mathematics that developed from the application of mathematical tools including calculus and
linear algebra to probability theory.

In practice, statistics is the idea that we can learn about the properties of large sets of objects or
events (a population) by studying the characteristics of a smaller number of similar objects or events (a sample).
Because gathering comprehensive data about an entire population is often too costly, difficult, or outright
impossible, statisticians start with a sample that can conveniently or affordably be observed.

Two types of statistical methods are used in analyzing data: descriptive statistics and


inferential statistics. Statisticians measure and gather data about the individuals or elements of a sample, then
analyze this data to generate descriptive statistics. They can then use these observed characteristics of the

sample data, which are properly called "statistics," to make inferences or educated guesses about the
unmeasured (or immeasurable) characteristics of the broader population, known as the parameters.

Descriptive Statistics
Descriptive statistics mostly focus on the central tendency, variability, and distribution of
sample data. Central tendency refers to the estimate of a typical element of a sample or
population, and includes descriptive statistics such as the mean, median, and mode. Variability refers to a set of
statistics that show how much difference there is among the elements of a sample or population along the
characteristics measured, and includes metrics such as range, variance, and standard deviation.

The distribution refers to the overall "shape" of the data, which can be depicted on a chart
such as a histogram or dot plot, and includes properties such as the probability distribution function, skewness,
and kurtosis. Descriptive statistics can also describe differences between observed characteristics of the
elements of a data set. Descriptive statistics help us understand the collective properties of the elements of a
data sample and form the basis for testing hypotheses and making predictions using inferential statistics.
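
As an illustration, the common descriptive statistics named above can be computed directly with Python's
standard library. The sketch below uses a small, purely hypothetical sample of exam scores; the numbers are
only for demonstration.

    # A minimal sketch of computing descriptive statistics in Python
    # using the standard-library "statistics" module and a hypothetical sample.
    import statistics

    scores = [52, 61, 61, 67, 70, 74, 78, 85, 90]  # hypothetical exam scores

    print("Mean:", statistics.mean(scores))             # central tendency
    print("Median:", statistics.median(scores))         # middle value
    print("Mode:", statistics.mode(scores))             # most frequent value
    print("Range:", max(scores) - min(scores))          # spread: max - min
    print("Variance:", statistics.variance(scores))     # sample variance
    print("Std. deviation:", statistics.stdev(scores))  # sample standard deviation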

Inferential Statistics
Inferential statistics are tools that statisticians use to draw conclusions about the
characteristics of a population from the characteristics of a sample and to decide how certain they can be of the
reliability of those conclusions. Based on the sample size and distribution of the sample data statisticians can
calculate the probability that statistics, which measure the central tendency, variability, distribution, and
relationships between characteristics within a data sample, provide an accurate picture of the corresponding
parameters of the whole population from which the sample is drawn.

Inferential statistics are used to make generalizations about large groups, such as
estimating average demand for a product by surveying a sample of consumers' buying habits, or to attempt to
predict future events, such as projecting the future return of a security or asset class based on returns in a sample
period.

Regression analysis is a common method of statistical inference that attempts to


determine the strength and character of the relationship (or correlation) between one dependent variable (usually
denoted by Y) and a series of other variables (known as independent variables). The output of a regression
model can be analyzed for statistical significance, which refers to the claim that a result from findings generated

by testing or experimentation is not likely to have occurred randomly or by chance but is instead likely to be
attributable to a specific cause elucidated by the data. Having statistical significance is important for academic
disciplines or practitioners that rely heavily on analyzing data and research.
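
As a minimal sketch of the idea, simple linear regression and its significance test can be run with SciPy's
linregress function. The data here (hours studied versus test score) are invented purely for illustration; the
p-value reported for the slope is what an analyst would inspect when judging statistical significance.

    # A minimal sketch of simple linear regression with a significance test,
    # using scipy.stats.linregress on hypothetical data.
    from scipy import stats

    hours = [1, 2, 3, 4, 5, 6, 7, 8]          # independent variable (X)
    score = [52, 55, 61, 60, 68, 72, 75, 80]  # dependent variable (Y)

    result = stats.linregress(hours, score)
    print("Slope:", result.slope)        # strength/direction of the relationship
    print("Intercept:", result.intercept)
    print("r:", result.rvalue)           # correlation coefficient
    print("p-value:", result.pvalue)     # a small p-value suggests statistical significance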

Main Characteristics of Statistics

1. It consists of aggregates of facts:


In the plural sense, statistics refers to data, but data, to be called statistics, must consist of an
aggregate of certain facts.

A single and isolated fact or figure, such as the 60 kg weight of a student or the death of a
particular person on a given day, does not amount to statistics.

For data to amount to statistics, they must be in the form of a set or aggregate of facts,
e.g. the weights of 50, 65 and 70 kg of students in a class, or the profits of a firm over different periods.

2. It is affected by many causes:


It is not easy to study the effect of one factor alone while ignoring the effects of other factors.
Here we have to study the effects of all the factors on the phenomenon, separately as well as collectively,
because the effects of the factors can change with a change of place, time or situation.

Here, the overall effect is taken into account, not that of a single factor only, as in other natural sciences. For
example, the result of class XII in the board examination does not depend on any single factor but
collectively on the standard of teachers, teaching methods, teaching aids, the practical performance of students,
the standard of question papers and the standard of evaluation.

3. It should be numerically expressed:


Data, to be called statistics, should be numerically expressed so that counting or
measurement of the data is possible. It means that the facts that constitute statistics must be
capable of being expressed in some quantitative form, such as weights of 60, 70, 100 and 90 kg or profits of Rs.
10,000, Rs. 20,000, etc. Thus the data must contain numerical figures so that they may be called numerical
statements of facts.

4. It must be enumerated or estimated accurately:
As stated above, the statements should be precise and meaningful. To obtain a
reasonable standard of accuracy, the field of enquiry should not be very large. If it is infinite or very large, even
enumeration of the data is impossible and a reasonable standard of accuracy may not be achieved. To achieve it we
have to make an estimate according to a reasonable standard of accuracy, depending upon the nature and purpose
of the collection of data; e.g. we may measure the height of buildings in metres, but we cannot measure the length of
small things like bricks in the same unit.

5. It should be collected in a systematic manner:


Another characteristic of statistics is that the data should be collected in a systematic
manner. Data collected in a haphazard manner will lead to difficulties in the process of analysis and to wrong
conclusions. A proper plan should be made, and trained investigators should be used to collect the data. If this is
not done, the reliability of the data decreases. So, to get correct results, the data must be collected in a precise
manner.

6. It should be collected for a predetermined purpose:


Before we start the collection of data, we must be clear about the purpose for which we are
collecting it. If we have no information about its purpose, we may not collect data according to our
needs; more relevant data needed to achieve the required purpose may be missed if that purpose is not known.

Suppose we want data on imports and exports; we then have to know about various
segments such as electronics, consumer articles, grains and other such categories. If a person on
government duty counts the vehicles passing along a road in a unit of time, that is statistics, but the same work
done by another person not related to this field is not statistics, because the former is doing it for the
government, which wants to widen the road to four lanes if needed.

7. It should be capable of being placed in relation to each other:


This is the last, but not the least important, of the characteristics of statistics. The collection of
data is generally done with the motive of comparison. If the figures collected are not comparable, they
lose a large part of their significance.

It means that the figures collected should be homogeneous for comparison, not
heterogeneous. For example, heterogeneous data such as a sale of Rs. 20,000, a result of 80% of cases and a mileage of 80
km can never be placed in relation to each other and compared for analysis and interpretation, which is the
ultimate motive of the science of statistics. It can be concluded that all statistics are numerical data, but all
numerical data are not statistics unless they satisfy all the essential characteristics of statistics described
above.

Q.2 What do you understand by the term “data”? Write in detail the types of data.
Data is a thorny subject. For a start, we are not sure how we are supposed to refer to it;
strictly speaking, data is the plural of datum, so we should talk about data that 'are', not 'is', available to
support a theory, etc. The Guardian newspaper has discussed this debate and appeared to suggest that (split
infinitives and nuances of idiomatic Latin notwithstanding) our day-to-day usage of the term is allowed to
remain conveniently grammatically incorrect.

So of the many different instances of individual datum (sorry, data) that exist, can we
group them into distinct types, categories, varieties and classifications? In this world of so-called digital
transformation and cloud computing that drives our always-on über-connected lifestyles, surely it would be
useful to understand the what, when, where and why of data on our journey to then starting to appreciate the
how factor.

1 - Big data

A core favorite, big data has come to be defined as something like: an amount of data that
will not practically fit into a standard (relational) database for analysis and processing, owing to the huge
volumes of information being created by human and machine-generated processes.

Thomas suggests that big data is a big deal because it’s the fuel that drives things like
machine learning, which form the building blocks of artificial intelligence (AI). He says that by digging into
(and analyzing) big data, people are able to discover patterns to better understand why things happened. They
can also then use AI to predict how they may happen in the future and prescribe strategic directions based on
these insights.

2 - Structured, unstructured, semi-structured data

All data has structure of some sort. Delineating between structured and unstructured data
comes down to whether the data has a pre-defined data model and whether it’s organized in a pre-defined way.

Mat Keep is senior director of products and solutions at MongoDB. Keep explains that, in
the past, data structures were pretty simple and often known ahead of data model design -- and so data was
typically stored in the tabular row and column format of relational databases.

As a result of today's far more polymorphic data structures, many software developers are looking towards
more flexible alternatives to relational databases to accommodate data of any structure.

3 - Time-stamped data

Time-stamped data is a dataset which has a concept of time ordering, defining the sequence
in which each data point was either captured (event time) or collected (processed time).

“This type of data is typically used when collecting behavioral data (for example, user
actions on a website) and thus is a true representation of actions over time. Having a dataset such as this is
invaluable to data scientists who are working on systems that are tasked with predicting or estimating next best
action style models, or performing journey analysis as it is possible to replay a user's steps through a system,
learn from changes over time and respond,” said Alex Olivier, product manager at marketing personalization
software platform company Qubit.
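
A minimal sketch of what such time-stamped behavioral data might look like is given below. The field names
(user, action, event_time, processed_time) are illustrative assumptions, not any particular product's schema;
the point is that keeping both timestamps lets an analyst replay a user's steps in the order they actually
happened.

    # A sketch of time-stamped behavioral events; field names are hypothetical.
    from datetime import datetime

    events = [
        {"user": "u123", "action": "page_view",
         "event_time": datetime(2021, 5, 1, 9, 15, 2),      # when the user acted
         "processed_time": datetime(2021, 5, 1, 9, 15, 5)}, # when it was collected
        {"user": "u123", "action": "add_to_cart",
         "event_time": datetime(2021, 5, 1, 9, 16, 40),
         "processed_time": datetime(2021, 5, 1, 9, 16, 41)},
    ]

    # Replay the journey in the order the actions actually happened.
    for e in sorted(events, key=lambda e: e["event_time"]):
        print(e["event_time"], e["user"], e["action"])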

4 - Machine data

Simply put, machine data is the digital exhaust created by the systems, technologies and
infrastructure powering modern businesses.

Matt Davies, head of EMEA marketing at Splunk, asks us to paint a picture and imagine
your typical day at work, driving to the office in your connected car, logging on to your computer, making
phone calls, responding to emails, accessing applications. Davies explains that all this activity creates a wealth
of machine data in an array of unpredictable formats that is often ignored.

If made accessible and usable, machine data is argued to be able to help organizations
troubleshoot problems, identify threats and use machine learning to help predict future issues.

5 - Spatiotemporal data

Spatiotemporal data describes both location and time for the same event -- and it can show
us how phenomena in a physical location change over time.

Temporal data contains date and time information in a time stamp. Valid Time is the time
period covered in the real world. Transaction Time is the time when a fact stored in the database was known.

“Examples of how analysts can visualize and interact with spatiotemporal data include: tracking moving
vehicles, describing the change in populations over time, or identifying anomalies in a telecommunications
network. Decision-makers can also run backend database calculations to find distances between objects or
summary statistics on objects contained within specified locations,” said MapD’s Mostak.
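
A single spatiotemporal record might look like the sketch below. The field names are illustrative assumptions
only; the record carries a location plus the two notions of time mentioned above, valid time and transaction
time.

    # A hypothetical spatiotemporal record: location plus valid time and transaction time.
    vehicle_position = {
        "vehicle_id": "bus-42",
        "lat": 33.6844,
        "lon": 73.0479,
        "valid_time": "2021-05-01T09:15:00",        # when the vehicle was actually here
        "transaction_time": "2021-05-01T09:15:03",  # when the database recorded the fact
    }
    print(vehicle_position["vehicle_id"], vehicle_position["lat"], vehicle_position["lon"])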

6 - Open data

Open data is data that is freely available to anyone in terms of its use (the chance to
apply analytics to it) and rights to republish without restrictions from copyright, patents or other mechanisms of
control.  The Open Data Institute states that open data is only useful if it’s shared in ways that people can
actually understand. It needs to be shared in a standardized format and easily traced back to where it came from.

Bursell explains that these are still academic techniques at the moment, but over the
next ten years he says that people will be thinking about what we mean by open data in different ways. The
open source world understands some of those questions and can lead the pack. The Red Hat security man says
that it can be difficult for organizations that have built their business around keeping secrets. They now have to
look at how they open that up to create opportunities for wealth creation and innovation.

7 - Dark data

Dark data is digital information that is not being used and lies dormant in some form.

Analyst house Gartner Inc. describes dark data as, "Information assets that an
organization collects, processes and stores in the course of its regular business activity, but generally fails to use
for other purposes."

8 - Real time data 


One of the most explosive trends in analytics is the ability to stream and act around
real time data. Some people argue that the term itself is something of a misnomer i.e. data can only travel as fast
as the speed of communications, which isn’t faster than time itself… so, logically, even real time data is slightly
behind the actual passage of time in the real world. However, we can still use the term to refer to instantaneous
computing that happens about as fast as a human can perceive.

Newman says that real time data can help with everything from deploying emergency
resources in a road crash to helping traffic flow more smoothly during a citywide event. He says that real time
data can also provide a better link between consumers and brands allowing the most relevant offers to be
delivered at precise moments based upon location and preferences. “Real time data is a real powerhouse and its
potential will be fully realized in the near term,” added Newman.

9 - Genomics data

Bharath Gowda, vice president for product marketing at Databricks, points to
genomics data as another area that needs specialist understanding. Genomics data involves analysing the DNA
of patients to identify new drugs and improve care with personalized treatments.

“It requires significant data processing and needs to be blended with data from
hundreds of thousands of patients to generate insights. Furthermore, you need to look at how you can unify
analytics workflows across all teams - from the bioinformatics professional prepping data to the clinical
specialist treating patients - in order to maximize its value,” said Gowda.

10 - Operational data

Colin Fernandes is product marketing director for EMEA region at Sumo Logic.
Fernandes says that companies have big data, they have application logs and metrics, they have event data, and
they have information from microservices applications and third parties.

The question is: how can they turn this data into business insights that decision makers and non-technical teams
can use, in addition to data scientists and IT specialists?

Fernandes points out that in practice, this means looking at new applications and
business goals together to reverse engineer what your operational data metrics should be. New customer-facing

services can be developed on microservices, but how do we make sure we extract the right data from the start?
By putting this ‘operational data’ mindset in place, we can arguably look at getting the right information to the
right people as they need it.

11 - High-dimensional data

High-dimensional data is a term being popularized in relation to facial recognition


technologies. Due to the massively complex number of contours on a human face, we need new expressions of
data that are multi-faceted enough to handle computations capable of describing all the nuances and
individualities that exist across our facial physiognomies. Related to this is the concept of
eigenfaces, the name given to a set of eigenvectors when they are used in computing to process human face
recognition.

12 - Unverified outdated data

The previously quoted Mike Bursell of Red Hat also points to what he calls unverified
outdated data. This is data that has been collected, but nobody has any idea whether it's relevant, accurate or
even of the right type. We can suggest that in business terms, if you're trusting data that you haven't verified,
then you shouldn't be trusting any decisions that are made on its basis. Bursell says that Garbage In, Garbage
Out still holds… and without verification, data is just that: garbage.

“Arguably even worse than unverified data, which may at least have some validity and
which you should at least know you shouldn't trust, is data which is out-of-date and used to be relevant. But
much of the real-world evidence from which we derive our data changes, and if the data doesn't change to
reflect that, then it is positively dangerous to use it in many cases,” said Bursell.

13 - Translytic Data

An amalgam of ‘transact’ and ‘analyze’, translytic data is argued to enable on-demand


real-time processing and reporting with new metrics not previously available at the point of action. This is the
opinion of Mark Darbyshire, CTO for data and database management at SAP UK.

Darbyshire says that traditionally, analysis has been done on a copy of transactional
data. But today, with the availability of in-memory computing, companies can perform ‘transaction window’

analytics. This he says supports tasks that increase business value like intelligent targeting, curated
recommendations, alternative diagnosis and instant fraud detection as well as providing subtle but valuable
business insights.

Q.3 What types of characteristics a pictogram should have to successfully convey the meaning? Write
down the advantages and drawbacks of using pictograms.
Characteristics of a pictogram

A pictogram is a picture that represents a word or an idea by illustration; picturing data graphically is
one of the most meaningful ways of presenting it. Pictograms are a visual way of displaying statistical
data. They are also known as pictorial unit charts, pictographs and pictorial unit bar charts. For example, I will
draw a pictogram to compare the countries that have the most tanks. First I need a table of statistical data
showing the countries with the most tanks, and I will give it an appropriate title. I will then use an appropriate
symbol to represent the tanks. Each tank pictured in the graph will represent 1,000 tanks. Now let's draw the
pictograph. Russia has 22,950 tanks, so I will round the figure and draw 23 tanks. For the United States, I will
draw nine tanks. For China I will draw seven. For North Korea and Pakistan I will draw five each. And here's our
completed pictograph.
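
The same pictograph can be sketched as a simple text chart, with one symbol standing in for each 1,000 tanks.
Only Russia's figure (22,950) is quoted exactly above; the other counts below are rough assumptions chosen so
that the rounded symbol counts match the nine, seven and five tanks described in the text.

    # A text-based pictogram: one symbol per 1,000 tanks, counts rounded to the nearest thousand.
    counts = {
        "Russia": 22950,          # figure quoted in the text
        "United States": 8850,    # assumption: rounds to nine symbols, as in the text
        "China": 7000,            # assumption
        "North Korea": 5000,      # assumption
        "Pakistan": 5000,         # assumption
    }

    SYMBOL = "#"   # stands in for the tank picture
    UNIT = 1000    # each symbol represents 1,000 tanks

    for country, tanks in counts.items():
        symbols = SYMBOL * round(tanks / UNIT)
        print(f"{country:<15} {symbols}  ({tanks})")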

A common deceptive graph involves using a vertical scale that starts at some
value greater than zero to exaggerate differences between groups. Here's an example of two graphs that depict
the same information. The graph on the left doesn't have a zero starting point; it actually starts at ten per cent.
The graph on the right does have a zero starting point. The graphs represent the same information, although the
graph on the left would make you believe that there is a bigger difference in the experience of nausea between
those using OxyContin and those using the placebo.

In reality there is not that big a difference. So always examine a graph
carefully to see whether the vertical axis begins at some point other than zero, in which case differences may be
exaggerated. Pictographs: drawings of objects, called pictographs, are often misleading. Data that are one-
dimensional in nature, such as budget amounts, are often depicted with two-dimensional objects such as dollar
bills or three-dimensional objects such as stacks of coins, homes or barrels.

By using pictographs, artists can create false impressions that grossly distort
differences, exploiting simple principles of basic geometry. When you double each side of a square, its area
doesn't merely double; it increases by a factor of four. When you double each side of a cube, its volume doesn't
merely double; it increases by a factor of eight. Here's an example of a pictogram representation of the decrease
in smoking from 1970 to 2013. These are three-dimensional objects, basically cylindrical in shape. It looks
like the cigarette on the right is much smaller than the cigarette on the left - less than half the size for sure.

But notice the percentages: in 1970, 37 per cent of adults
smoked, while in 2013, 18 per cent of adults smoked. If this were an accurate pictorial representation of the
relationship between the 1970 percentage and the 2013 percentage of smokers, then the
picture on the right would only be about half as big as the picture on the left. But it appears to be much smaller than that.
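
The distortion follows from elementary geometry, which a quick numeric check makes concrete: scaling every
side of a shape changes its area by the square and its volume by the cube of the scale factor.

    # Numeric check of the geometry behind pictograph distortion.
    side = 1.0
    print("Area ratio when sides double:", (2 * side) ** 2 / side ** 2)      # 4.0
    print("Volume ratio when sides double:", (2 * side) ** 3 / side ** 3)    # 8.0
    print("Apparent volume when sides halve:", (0.5 * side) ** 3 / side ** 3)  # 0.125, not 0.5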
Some concluding thoughts: in addition to the graphs we've discussed in this section, there are many other useful
graphs, some of which may not yet have been created. The world needs more people who can create original
graphs that enlighten us about the nature of data. In The Visual Display of Quantitative Information, Edward Tufte
offers these principles: for small datasets of twenty or fewer values, use a table instead of a graph.

A graph of data should make us focus on the true nature of the data, not on other
elements such as eye-catching but distracting design features. Do not distort data; construct a graph to reveal the
true nature of the data. Almost all of the ink in a graph should be used for the data, not for the other design
elements. Graphs can be so much more than this brief introduction may lead you to believe. Check out the
graphs and related information found on such websites. For example, a monthly feature in the Country Times
includes the posting of a graph and then asks readers to write comments about what they think appears in the
graph. There are some very interesting graphs used there; you may want to take some time to study them and
then read the interpretations of others as they looked at the graphs.

Q.4 Define normal curve. Write down the properties of normal curve.
A normal distribution, sometimes called the bell curve, is a distribution that
occurs naturally in many situations. For example, the bell curve is seen in tests like the SAT and GRE. The bulk
of students will score the average (C), while smaller numbers of students will score a B or D. An even smaller
percentage of students score an F or an A. This creates a distribution that resembles a bell (hence the nickname).
The bell curve is symmetrical. Half of the data will fall to the left of the mean; half will fall to the right.
Many groups follow this type of pattern. That’s why it’s widely used in business, statistics and in government
bodies like the FDA:
 Heights of people.
 Measurement errors.
 Blood pressure.
 Points on a test.
 IQ scores.
 Salaries.
The empirical rule tells you what percentage of your data falls within a certain number of standard
deviations from the mean:
• 68% of the data falls within one standard deviation of the mean.
• 95% of the data falls within two standard deviations of the mean.
• 99.7% of the data falls within three standard deviations of the mean.

The standard deviation controls the spread of the distribution. A smaller standard deviation indicates that the
data is tightly clustered around the mean; the normal distribution will be taller. A larger standard deviation
indicates that the data is spread out around the mean; the normal distribution will be flatter and wider.
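
A quick way to see the empirical rule at work is to draw a large normal sample and measure how much of it lies
within one, two and three standard deviations of the mean. The mean of 100 and standard deviation of 15 below
are arbitrary, IQ-like choices for illustration.

    # A minimal sketch: check the empirical rule on a large simulated normal sample.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=100, scale=15, size=1_000_000)  # illustrative IQ-like scores

    mean, sd = data.mean(), data.std()
    for k in (1, 2, 3):
        within = np.mean(np.abs(data - mean) <= k * sd)
        print(f"Within {k} SD: {within:.1%}")   # roughly 68%, 95%, 99.7%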
Properties of a normal distribution

 The mean, mode and median are all equal.
 The curve is symmetric about the center (i.e. around the mean, μ).
 Exactly half of the values are to the left of center and exactly half the values are to the right.
 The total area under the curve is 1.
The Standard Normal Model
A standard normal model is a normal distribution with a mean of 0 and a standard deviation of 1.

Standard Normal Model: Distribution of Data


One way of figuring out how data are distributed is to plot them in a graph. If the
data are roughly normally distributed, you will come up with a bell curve. A bell curve has a small percentage of the points
in both tails and the bigger percentage on the inner part of the curve. In the standard normal model, about 5
percent of your data would fall into the "tails" and about 95 percent will be in between. For example, for test scores of
students, the normal distribution would show 2.5 percent of students getting very low scores and 2.5 percent
getting very high scores. The rest will be in the middle, neither too high nor too low.
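
The tail areas of the standard normal model can be checked with SciPy. In this sketch, roughly 2.5 percent of
values lie in each tail beyond about 1.96 standard deviations, leaving about 95 percent in between.

    # A sketch of the standard normal model (mean 0, SD 1) using scipy.stats.norm.
    from scipy.stats import norm

    z = norm(loc=0, scale=1)  # the standard normal model
    print("Lower tail P(Z < -1.96):", z.cdf(-1.96))                      # ~0.025
    print("Upper tail P(Z >  1.96):", 1 - z.cdf(1.96))                   # ~0.025
    print("Middle P(-1.96 < Z < 1.96):", z.cdf(1.96) - z.cdf(-1.96))     # ~0.95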
Properties
All forms of the normal distribution share the following characteristics:

1. It is symmetric
A normal distribution comes with a perfectly symmetrical shape. This means that the
distribution curve can be divided in the middle to produce two equal halves. The symmetric shape occurs when
one-half of the observations fall on each side of the curve.

2. The mean, median, and mode are equal


The middle point of a normal distribution is the point with the maximum frequency, which
means that it possesses the most observations of the variable. The midpoint is also the point where these three
measures fall. The three measures are equal in a perfectly normal distribution.

3. Empirical rule
In normally distributed data, there is a constant proportion of the area lying under the curve
between the mean and a specific number of standard deviations from the mean. For example, approximately 68% of all cases
fall within +/- one standard deviation of the mean, 95% of all cases fall within +/- two standard deviations
of the mean, and 99.7% of all cases fall within +/- three standard deviations of the mean.

4. Skewness and kurtosis


Skewness and kurtosis are coefficients that measure how different a distribution is from a
normal distribution. Skewness measures the symmetry of a distribution, while kurtosis measures the
thickness of its tails relative to the tails of a normal distribution.
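
As a small illustration, SciPy provides skew and kurtosis functions; for a large sample drawn from a normal
distribution both coefficients should come out close to zero (SciPy reports excess kurtosis by default).

    # A sketch of measuring skewness and excess kurtosis with SciPy on a simulated normal sample.
    import numpy as np
    from scipy.stats import skew, kurtosis

    rng = np.random.default_rng(1)
    sample = rng.normal(size=100_000)

    print("Skewness:", skew(sample))             # ~0 for a symmetric distribution
    print("Excess kurtosis:", kurtosis(sample))  # ~0 for normal tails (Fisher definition)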

History of Normal Distribution


Most statisticians give credit to French scientist Abraham de Moivre for the discovery of
normal distributions. In the second edition of “The Doctrine of Chances,” de Moivre noted that probabilities
associated with discretely generated random variables could be approximated by measuring the area under the
graph of an exponential function.

De Moivre’s theory was expanded by another French scientist, Pierre-Simon Laplace, in


“Analytic Theory of Probability.” Laplace’s work introduced the central limit theorem that proved that
probabilities of independent random variables converge rapidly to the areas under an exponential function.
Q.5 Explain procedure for determining median, with one example each at least, if:
i. The number of scores is even
ii. The number of scores is odd.

Median, in statistics, is the middle value of the given list of data, when arranged in an order. The arrangement of
data or observations can be done either in ascending order or descending order. 
Example: The median of 2,3,4 is 3.

In maths, the median is also a type of average, which is used to find the center
value. Therefore, it is also called a measure of central tendency.

Apart from the median, the other two measures of central tendency are the mean and the mode. The mean
is the ratio of the sum of all observations to the total number of observations. The mode is the value in the given
data set that is repeated most often.


In geometry, a median of a triangle is the line segment joining a vertex of the triangle to the
midpoint of the opposite side. Therefore, a median bisects the opposite side of the triangle.

Median in Statistics

The median of a set of data is the middlemost number or center value in the set. The
median is also the number that is halfway into the set.

To find the median, the data should first be arranged in order, from least to greatest or from
greatest to least. The median is the number that separates the higher half of a data sample, a
population or a probability distribution from the lower half. The median is different for different types of
distribution.

Median Formula

The formula to calculate the median of a finite data set is given here. The median
formula is different for even and odd numbers of observations. Therefore, it is necessary to recognise
first whether we have an odd number or an even number of values in a given data set.

The formula to calculate the median of the data set is given as follows.

Even Number of Observations

If the total number of observations is even, then the median formula is:

Median = [(n/2)th term + {(n/2)+1}th term]/2

where n is the number of observations

Odd Number of Observations

If the total number of observations given is odd, then the formula to calculate the median is:

Median = {(n+1)/2}th term

where n is the number of observations
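
Both rules can be captured in a short function: sort the data, then take the middle value when n is odd, or the
average of the two middle values when n is even. This is a minimal sketch; the even-case data below are
hypothetical.

    # A sketch of the median rules above for odd and even numbers of observations.
    def median(values):
        data = sorted(values)
        n = len(data)
        mid = n // 2
        if n % 2 == 1:                            # odd number of observations
            return data[mid]                      # the {(n+1)/2}th term
        return (data[mid - 1] + data[mid]) / 2    # average of the (n/2)th and (n/2 + 1)th terms

    print(median([14, 63, 55]))        # odd case  -> 55
    print(median([14, 63, 55, 20]))    # even case -> (20 + 55) / 2 = 37.5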

Example 1:

Find the Median of 14, 63 and 55

Solution:

Put them in ascending order: 14, 55, 63

The middle number is 55, so the median is 55.

Example 2:

Find the median of the following:

4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29

Solution:

When we put those numbers in order, we have:

4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92

There are fifteen numbers. The middle one is the eighth number:

The median value of this set of numbers is 24.

