Unit 1

Random Number Generation and Statistical Data

This unit introduces the general concepts of Statistics. After going through it, the reader
will understand the importance of random numbers and the algorithmic procedures used to
generate them. The unit also describes how to draw a frequency distribution for a given set
of observations and defines some measures associated with such distributions.


In this section, the basic terminology related to this course is explained. One classification tells us that mathematical models are of two types, namely, deterministic models and probabilistic models. Probabilistic models are also sometimes called non-deterministic models or stochastic models.

By a deterministic model, we mean a model which stipulates that the conditions under which an experiment is performed determine the outcome of the experiment. For example, the current flowing in a simple circuit when a battery is inserted into it can be found using Ohm's law. Similarly, the law of gravitation describes accurately the fall of a body from a given height. By a probabilistic model, we mean a model in which the conditions under which an experiment is performed do not determine the outcome, and the outcome may change even under similar conditions. Examples are the number of alpha-particles emitted by a material in a specified time, the execution time of a computer program, the response time of a request by a user etc.

In a deterministic model, it is supposed that the actual outcome of the experiment is determined from the conditions under which the experiment is carried out. In a probabilistic model, the conditions determine only the probabilistic behavior of the observable outcome. In a deterministic model, physical considerations are used to predict the outcome, while in a probabilistic model the same kinds of considerations are used to specify a probability distribution.

Random Experiment (R): In real-life situations, the experiments that we perform and for which the outcome is not certain are called random experiments. For example, the process of noting whether a component is functioning or has failed is such an experiment. In every such case we have to perform an experiment, called a random experiment, and the result of any such observation is called the outcome of the experiment.

Sample Space (S): The set of all possible outcomes of a random experiment is called the sample space (S) associated with that experiment. The outcome of a random experiment corresponds to an element in the sample space S. For example, while throwing a die, the sample space is S = {1, 2, 3, 4, 5, 6}.

It is worth noting that the sample space is not completely determined by the experiment, and we should consider the largest possible set of outcomes that can be associated with the experiment. For example, while throwing a die, we may get only the numbers 2, 3, and 4 in the first 15 throws. This does not lead us to conclude that the sample space is {2, 3, 4}. We shall be able to represent the elements of a sample space as points in a space of one or more dimensions, e.g., the points on the real line, points on the X-Y plane, points in 3 dimensions etc.

Finite Sample Space: If the set of all possible outcomes of the experiment is finite, then the associated sample space is called a finite sample space. Otherwise, the sample space is called an infinite sample space. Infinite sample spaces are further classified as countably infinite and uncountable spaces.
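The remark about the first 15 throws of a die can be tried out with a small simulation. The following is only a sketch: the seed value 2024 and the helper name `throw_die` are arbitrary choices made for illustration. It shows that the set of values observed in a short run is always a subset of S = {1, 2, 3, 4, 5, 6} and may well be a proper subset.

```python
import random

# Sample space for one throw of a six-sided die.
SAMPLE_SPACE = {1, 2, 3, 4, 5, 6}


def throw_die(n_throws, rng):
    """Return the outcomes of n_throws independent throws of a fair die."""
    return [rng.randint(1, 6) for _ in range(n_throws)]


rng = random.Random(2024)      # fixed seed (arbitrary) so the run can be repeated
outcomes = throw_die(15, rng)
observed = set(outcomes)

print("outcomes        :", outcomes)
print("observed values :", sorted(observed))
print("never observed  :", sorted(SAMPLE_SPACE - observed))
```

Whatever values happen to be missing from a short run, the sample space itself stays the same; only the observed subset changes from run to run.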

Finite sample space and countably infinite sample space together are called countable or discrete sample space.

1.2 BASIC CONCEPTS IN RANDOM NUMBER GENERATION

Generation of random numbers is an important topic in computer based simulation studies. In a number of situations, we need to generate random numbers that follow a general principle. In our case, this principle is generally governed by the probability distribution that a given variable follows. We shall study different distributions in the next units of this course.

Random Number: A random number is a number that is generated by a process whose outcome is unpredictable and cannot subsequently be reliably reproduced.

Random Number Generator: The above definition of a random number is fine if we have a kind of black box that gives us these random numbers. Such a black box is usually called a random number generator. However, if one were to be given a single number, it is simply impossible to verify whether it was produced by a random number generator or not. In order to study the randomness of the output of such a generator, it is essential to consider sequences of numbers.

A random number generator is a computational or physical device designed to generate a sequence of numbers or symbols that lack any pattern, i.e., appear random. Random numbers have their applications in gambling, statistical sampling, computer simulation, cryptography, completely randomized designs, and other areas where producing an unpredictable result is desirable. Previously, random numbers were generated by human beings, but nowadays we employ machines to generate them. Almost all computer languages support a random number generator as an inbuilt function. Random numbers generated by computational devices are often called pseudo-random numbers, because the algorithm that produces them is deterministic. We should also employ techniques that establish whether a given generated sequence is really random or not.

1.3 METHODS FOR GENERATING RANDOM NUMBERS

There are two types of methods that are used to generate random numbers, namely, physical methods and computational methods.

The earliest physical methods for generating random numbers include dice, coin flipping and roulette wheels. These techniques are still used in games and gambling, but they are too slow for most applications in statistics.

Computational Methods

(i) Mid-Square Method

In this method, we take a 4-digit number as the initial seed. We square this number and extract the middle 4 digits of the square. If we divide this 4-digit number by 9999, we get a random number, say $r_1$, between 0 and 1. The 4-digit number extracted from the square then acts as the seed for generating the next random number. This process is repeated until we get the desired number of random numbers.

Example: Let us take the initial seed $x_0$ as 6553. Its square is 42941809, and extracting the middle 4 digits gives $x_1 = 9418$. Dividing by 9999, we get the first random number between 0 and 1 as $r_1 = 0.9419$. Now, we again extract the middle 4 digits from the square of 9418, which is 88698724. This gives $x_2 = 6987$, and thus the second random number $r_2$ is 0.6988. One more iteration of this process gives $x_3 = 8181$ and $r_3 = 0.8182$. We can repeat this process to get $r_4 = 0.9288$ and $r_5 = 0.2483$. The sequence of random numbers obtained by this process is: 0.9419, 0.6988, 0.8182, 0.9288, 0.2483, …

(ii) Mid-Product Method

This method is similar to the mid-square method except that, instead of squaring the number taken as the initial seed, we multiply it by a suitably large number and then extract the middle digits from the product. The extracted digits are again multiplied by the large number, instead of being squared, and the process can be repeated the desired number of times.

Example: Let us again take the initial seed as $x_0 = 6553$ and a large number M = 9876573. We extract the middle 4 digits from the product of $x_0$ and M as $x_1 = 1182$, so $r_1 = 0.1182$. We again extract the middle 4 digits from the product of $x_1$ and M as $x_2 = 4109$, and so $r_2 = 0.4109$.
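The mid-square and mid-product procedures described above can be sketched in a few lines of Python. This is an illustrative sketch, not a definitive implementation: the function names are made up here, squares are zero-padded to 8 digits, and for the 11-digit products of the mid-product method the "middle" four digits are taken as digits 5 to 8. That last rule is an assumption, chosen because it reproduces the worked values $x_1 = 1182$ and $x_2 = 4109$.

```python
def middle_four(value, width):
    """Zero-pad value to `width` digits and return its middle four digits."""
    digits = str(value).zfill(width)
    # For an odd number of digits the extra digit is left on the left side
    # (assumption, chosen to match the worked mid-product example).
    start = (len(digits) - 4 + 1) // 2
    return int(digits[start:start + 4])


def mid_square(seed, count):
    """Return `count` successive 4-digit states of the mid-square method."""
    states, x = [], seed
    for _ in range(count):
        x = middle_four(x * x, 8)          # square of a 4-digit number: up to 8 digits
        states.append(x)
    return states


def mid_product(seed, multiplier, count):
    """Return `count` successive 4-digit states of the mid-product method."""
    width = len(str(9999 * multiplier))    # widest product a 4-digit state can give
    states, x = [], seed
    for _ in range(count):
        x = middle_four(x * multiplier, width)
        states.append(x)
    return states


def to_unit_interval(state):
    """Scale a 4-digit state to a number between 0 and 1, as in the text."""
    return round(state / 9999, 4)


print([to_unit_interval(x) for x in mid_square(6553, 5)])
print([to_unit_interval(x) for x in mid_product(6553, 9876573, 5)])
```

Both generators are fully determined by the seed. If a state ever becomes 0 it stays 0 forever, which is one reason such methods are used for teaching rather than for serious simulation.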

Continuing in this way, the sequence of random numbers obtained by the mid-product method is: 0.1182, 0.4109, 0.2838, 0.9715, 0.1030, …

(iii) Linear Congruential Method (LCM)

This is a widely used method and has been implemented with different compilers for generating random numbers. It is also one of the oldest and best known methods for this purpose. The theory behind it is simple and easy to understand, and it can be implemented very easily using a programming language. The method is defined by the following recurrence relation:

$$x_{n+1} = (a x_n + c) \bmod m$$

where $x_n$, n = 1, 2, 3, …, is the sequence of generated random numbers using the seed value $x_0$, and m (> 0), a (0 < a < m), and c (0 ≤ c < m) are the integer constants that specify the generator.

The period of the LCM is at most m. This method has a full period for all seed values if and only if (a) c and m are relatively prime, (b) a - 1 is divisible by all prime factors of m, and (c) a - 1 is a multiple of 4 if m is a multiple of 4. The method is extremely sensitive to the choice of the parameters c, m, and a. Poor choices of these parameters lead to an ineffective implementation of the method. ANSI C uses $m = 2^{32}$, a = 1103515245 and c = 12345. You may like to explore the combinations of these parameters used by other compilers. The sequence of random numbers obtained by this method, using the ANSI C parameters with the seed 6553, contains the values 0.8498, 0.2824, 0.3063, 0.1166, 0.1746, 0.8048, 0.4236, 0.2870, …

This method is very fast and requires minimal memory, and thus can be used for simulating multiple independent streams. It should not, however, be used for applications where high quality randomness is critical. In those situations we can use other random number generation methods, such as the inversive congruential method, the multiply-with-carry method and the Park-Miller method.

1.4 EFFICIENCY OF RANDOM NUMBER GENERATORS

The random numbers generated by a random number generator should possess some inherent characteristics. This leads to the requirement that the generator should be an efficient one. In this section, we shall study some of these characteristics of random numbers.
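Returning to the linear congruential method of Section 1.3 (iii), the recurrence and the three full-period conditions can be sketched as below. The function names are hypothetical, and scaling a state to (0, 1) by dividing by m is an assumption made here, so the printed numbers need not reproduce the values quoted for the ANSI C parameters. The period is measured by brute force, which is only practical for a small modulus such as m = 16.

```python
from math import gcd


def lcg(seed, a=1103515245, c=12345, m=2**32):
    """Yield successive states x_{n+1} = (a*x_n + c) mod m, starting after the seed."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def prime_factors(n):
    """Return the set of prime factors of n by trial division."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors


def has_full_period(a, c, m):
    """Check conditions (a), (b) and (c) for a full period of length m."""
    cond_a = gcd(c, m) == 1
    cond_b = all((a - 1) % p == 0 for p in prime_factors(m))
    cond_c = (m % 4 != 0) or ((a - 1) % 4 == 0)
    return cond_a and cond_b and cond_c


def observed_period(seed, a, c, m):
    """Length of the cycle the generator eventually enters (small m only)."""
    seen, x, step = {}, seed, 0
    while x not in seen:
        seen[x] = step
        x = (a * x + c) % m
        step += 1
    return step - seen[x]


gen = lcg(6553)
print([round(next(gen) / 2**32, 4) for _ in range(3)])   # states scaled to (0, 1)
print(has_full_period(1103515245, 12345, 2**32))          # ANSI C parameters
print(observed_period(7, a=5, c=3, m=16))                 # good small parameters
print(observed_period(7, a=4, c=0, m=16))                 # poor parameters collapse
```

The last two lines contrast a parameter set that satisfies all three conditions with one that does not, which is the sensitivity to a, c and m mentioned in the text.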

You will appreciate now that every random number generator needs a seed value to start, and once this seed value is fixed, the recurrence relation will produce the same sequence of random numbers. This situation can be avoided by modifying the seed value with the help of a time function.

The numbers generated by the generator should be uniform when a large amount of numbers is generated. They should not lie in a specific region of the interval under consideration. The successive values generated by the generator should not exhibit a correlation between them; they should be generated as independent values.

One important characteristic of the generator is its periodicity. The maximum length of the sequence of random numbers before it begins to repeat is called the periodicity. The generator should be able to produce a sequence with a fairly large period. The other important point is that this period might depend upon the initial seed value. A seed value that leads to a short period is called a weak seed value.

The important point is that we should not use generators without establishing their efficiency. Some of the tests that can be performed on random number generators in order to establish their efficiency are:
(i) Frequency Test
(ii) Serial Test
(iii) Poker Test
(iv) Gap Test
(v) Run Test
(vi) The chi-square test
These tests are important in order to establish the efficiency of a random number generator. It is worth mentioning that this list is not exhaustive, and there are other tests that are also used to establish the efficiency of a given generator. The readers are advised to browse the literature for the tests at serial (i) to (v) and for other tests that are not mentioned here. We shall here understand the concepts used in the chi-square test.

Chi-square test

The chi-square test is used for establishing the fact that the numbers generated by a random number generator are uniformly distributed over a given range. Let us consider that we are going to generate n random numbers over the interval (0, 1), and we have to make sure that these numbers are uniformly generated over this range. Let us divide the interval (0, 1) into 10 equal sub-intervals. These sub-intervals shall be (0, 0.1], (0.1, 0.2], …, (0.9, 1.0).

If we consider that our generator is giving us uniformly distributed random numbers, then we should, ideally, have n/10 numbers in each of these intervals. This value is called the expected frequency ($f_e$) associated with that interval. We can also find the actual number of random numbers that lie in a given sub-interval. This number is called the observed frequency ($f_o$) associated with the interval. We will thus have 10 expected frequencies and 10 observed frequencies once we divide the given interval (0, 1) into 10 equal sub-intervals. It is worth noting that these frequencies will change if we take 9 or 11 (for example) sub-intervals. Once we know these expected and observed frequencies, we can calculate the chi-square statistic as:

$$\chi^2 = \sum_{i=1}^{10} \frac{(f_{o,i} - f_{e,i})^2}{f_{e,i}} \qquad (1.1)$$

We have to decide whether the generated random numbers are uniformly distributed over the given range or not. Here, we assume, in the form of the null hypothesis, that these are uniformly distributed. The alternative hypothesis shall be that the numbers are not uniformly distributed. We shall have to test this hypothesis at a certain level of significance, say 1% or 5%.

From the table of the chi-square distribution, we can find the tabulated value of chi-square for a given degree of freedom and a given level of significance. This value defines two regions on the x-axis, namely, the accepted region and the rejected (or critical) region. If the calculated value of chi-square is in the accepted region, we accept the null hypothesis and reject the alternative hypothesis. If the calculated value of chi-square is in the rejected region, we accept the alternative hypothesis and reject the null hypothesis. Here, accepting the null hypothesis means that the generated random numbers are uniformly distributed.

Example: Suppose that we obtain the following frequencies while generating 1000 random numbers over the interval (0, 1) using a random number generator.

Interval    Observed Frequency (fo)    Expected Frequency (fe)
0.0-0.1              85                        100
0.1-0.2             107                        100
0.2-0.3             110                        100
0.3-0.4              92                        100
0.4-0.5              90                        100
0.5-0.6              91                        100
0.6-0.7              86                        100
0.7-0.8              89                        100
0.8-0.9             120                        100
0.9-1.0             130                        100

Using (1.1), we get the calculated value of chi-square as 2236/100 = 22.3600. Let us now find the tabulated value of chi-square. For this we need to know the level of significance and the degrees of freedom. The degrees of freedom will be the number of intervals minus 1, and the level of significance will be specified. Suppose we test the null hypothesis (that the generated random numbers are uniformly distributed over the given range) at the 5% level of significance. In the present problem, we then have to find the value of chi-square for 9 degrees of freedom at the 5% level of significance from the chi-square table. The value obtained from the table is 16.9190. Thus the region on the x-axis with x < 16.9190 defines the accepted region and x ≥ 16.9190 defines the rejected region. The calculated value 22.3600 falls in the critical region, indicating that we have to reject the null hypothesis. As such, the random number generator from which we obtained the above data may not be giving us uniformly distributed random values. This decision is supported by at least one instance, namely the run from which the observed frequencies in this example were obtained.

1.5 METHODS FOR GENERATING RANDOM NUMBERS FOR PROBABILITY DISTRIBUTIONS

We shall be studying probability distributions, in detail, in the next units. In this section, however, we give methods that are used to generate random numbers for a given distribution. It is recommended that you go through this section again after going through Unit 2. Let us assume that we have a good random number generator that generates independent and identically distributed uniform (0, 1) random numbers. We shall discuss two methods in this section, namely, the inverse method and the acceptance-rejection method. Suppose that we have to generate a random number from the probability function f(x). This function can be defined for either a discrete variable or a continuous variable. We call it a probability mass function in the case of a discrete variable and a probability density function in the case of a continuous variable.
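Returning to the chi-square example of Section 1.4, the calculation can be checked with a few lines of Python. This is a minimal sketch: the critical value 16.919 is simply the tabulated figure quoted in the text for 9 degrees of freedom at the 5% level, entered by hand rather than computed, and the variable names are chosen here for illustration.

```python
# Observed frequencies for the ten sub-intervals of (0, 1), from the example.
observed = [85, 107, 110, 92, 90, 91, 86, 89, 120, 130]

n = sum(observed)                 # total number of random numbers (1000)
k = len(observed)                 # number of sub-intervals (10)
expected = n / k                  # expected frequency per sub-interval (100)

# Chi-square statistic: sum of (fo - fe)^2 / fe over all sub-intervals.
chi_square = sum((fo - expected) ** 2 / expected for fo in observed)

degrees_of_freedom = k - 1
CRITICAL_VALUE = 16.919           # tabulated value, 9 d.o.f., 5% level (from the text)

reject_null = chi_square >= CRITICAL_VALUE

print(f"chi-square = {chi_square:.4f} with {degrees_of_freedom} degrees of freedom")
print("reject uniformity" if reject_null else "accept uniformity")
```

Changing the number of sub-intervals changes both the expected frequencies and the degrees of freedom, so the tabulated value has to be looked up again in that case.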

Inverse Method

In the inverse method, we first find a closed form expression for the cumulative distribution function (CDF) of the probability function f(x) associated with a distribution. For a random variable X, we define the CDF as $F(x) = P(X \le x)$ for all defined values of x. We will learn in further units that 0 ≤ F(x) ≤ 1. If we are given a random number R uniformly distributed over (0, 1), and taking $F^{-1}$ as the inverse of F, the following steps give a random number that is sampled for a variable following the probability function f(x).
(i) Generate a random number R uniform over (0, 1).
(ii) Compute $x = F^{-1}(R)$.
Then x is a random number that follows the probability function f(x).

Example: Suppose that the probability density function for a random variable X is given by $f(x) = \alpha e^{-\alpha x}$, x > 0, α > 0. Let us find a random number from f(x). The CDF is determined as

$$F(x) = \int_0^x \alpha e^{-\alpha t}\, dt = 1 - e^{-\alpha x}.$$

Taking $R = 1 - e^{-\alpha x}$ and then solving for x, we get $x = -\frac{1}{\alpha}\ln(1 - R)$. Since 1 - R is also uniform over (0, 1), we can equivalently use $x = -\frac{1}{\alpha}\ln R$. For example, if α = 4 and R is 0.4236, then a random number following the probability distribution $f(x) = 4e^{-4x}$ shall be x = -(1/4) ln(0.4236) = 0.2147. This is a random number following the probability function f(x).

Acceptance-Rejection Method

Using the acceptance-rejection method, we can deal with complex probability functions. In this method, we replace the complex probability function f(x) by a more analytically manageable function h(x), and the random numbers from h(x) are then used to generate the random numbers from f(x). Let us define a function g(x) such that $g(x) \ge f(x)$ for all x. Now we normalize g(x) in order to define h(x) as follows:

$$h(x) = \frac{g(x)}{\int g(y)\, dy}.$$

Using this function, the steps of the acceptance-rejection method can be given as:
(i) Generate a random number $x_1$ from h(x) using the inverse method.
(ii) Generate a random number R uniform over (0, 1).
(iii) If $R \le f(x_1) / g(x_1)$, accept $x_1$ as a random number from f(x); otherwise discard $x_1$ and return to step (i).
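Both methods of this section can be sketched in Python. The inverse-method part follows the exponential example, using x = -(1/α) ln R. For the acceptance-rejection part the target density f(x) = 2x on [0, 1], the bounding function g(x) = 2 and the resulting uniform h(x) are hypothetical choices made only for illustration; they are not taken from the text. Function names and seeds are likewise arbitrary.

```python
import math
import random


def exponential_inverse(r, alpha):
    """Inverse method for f(x) = alpha*exp(-alpha*x): x = -(1/alpha) * ln(r)."""
    return -math.log(r) / alpha


def sample_exponential(alpha, rng):
    """Draw one exponential random number by the inverse method."""
    r = 1.0 - rng.random()        # random() lies in [0, 1), so r lies in (0, 1]
    return exponential_inverse(r, alpha)


def accept_reject(f, g, sample_h, rng):
    """Acceptance-rejection: steps (i)-(iii), repeated until a value is accepted."""
    while True:
        x1 = sample_h(rng)        # (i) candidate drawn from h(x)
        r = rng.random()          # (ii) uniform random number R
        if r <= f(x1) / g(x1):    # (iii) accept, otherwise try again
            return x1


# Worked value from the text: alpha = 4 and R = 0.4236.
print(round(exponential_inverse(0.4236, 4), 4))

# Illustrative (assumed) target: f(x) = 2x on [0, 1], bounded by g(x) = 2,
# so that h(x) is the uniform density on [0, 1].
rng = random.Random(1)
draws = [accept_reject(lambda x: 2 * x, lambda x: 2.0, lambda r: r.random(), rng)
         for _ in range(5)]
print([round(x, 4) for x in draws])
```

The closer g(x) hugs f(x), the fewer candidates are discarded in step (iii); with the choices above, half of the candidates are accepted on average.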

1.6 FREQUENCY DISTRIBUTION

The data that we collect in an experiment are generally complex and are in the form of a sequence of numbers. It is very difficult to understand and interpret such data when it is in the form of a sequence. As such, we need to organize the data in such a way that it is easy to understand its features and, based on that, to take appropriate decisions. In order to apply analysis tools, we need to put the data in a definite format. The process of putting the data in a definite format is called classification and tabulation. We can tabulate the data using different methods. One of these methods gives us the classified data in the form of a frequency distribution. Frequency distributions are of two types: ungrouped frequency distribution and grouped frequency distribution. Both of these distributions are obtained with the help of the method using tally marks. The following examples illustrate the point.

Example: Let us consider the marks obtained by 25 students in the course of Statistics and Combinatorics. These can be put in sequential form as 23, 45, 54, 56, 78, 76, 43, 67, 65, 68, 78, 56, 92, 89, 61, 76, 23, 54, 34, 65, 54, 43, 45, 98, 73. Taking a meaningful decision from this data is not easy, but if we put it in the form of a frequency distribution then we can make some inferences.

Ungrouped Frequency Distribution

The ungrouped frequency distribution of the above data takes the following form.

Marks    Frequency
23       2
34       1
43       2
45       2
54       3
56       2
61       1
65       2
67       1
68       1
73       1
76       2
78       2
89       1
92       1
98       1

Grouped Frequency Distribution

The above scheme of putting the data is also not that productive in many real life situations. In this case itself, if we were dealing with the marks of 100 students, it is quite possible that we would get ninety rows in the table instead of the sixteen rows that are there now. We would again not be able to take any decision from a frequency distribution containing 90 rows. This difficulty is overcome by putting the data in grouped format and thus reducing the number of rows. The grouped frequency distribution may look as follows.

Marks between (and including)    Frequency
20 - 29                          2
30 - 39                          1
40 - 49                          4
50 - 59                          5
60 - 69                          5
70 - 79                          5
80 - 89                          1
90 - 100                         2

We call the elements in the first column class intervals. One can note that we have a choice here of defining the number of rows in this frequency distribution. There is no hard and fast rule that gives us the number of class intervals for a given data set. We can, however, argue that this number should be neither too small nor too big.
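The two frequency distributions of this section can be reproduced with a short script. This is a sketch only: the order in which the 25 marks are listed is immaterial for the counts, and the class boundaries are the ones used in the grouped table above.

```python
from collections import Counter

# Marks of the 25 students from the example (order does not affect the counts).
marks = [23, 45, 54, 56, 78, 76, 43, 67, 65, 68, 78, 56, 92,
         89, 61, 76, 23, 54, 34, 65, 54, 43, 45, 98, 73]

# Ungrouped frequency distribution: one row per distinct mark.
ungrouped = sorted(Counter(marks).items())

# Grouped frequency distribution: class intervals include both end points.
classes = [(20, 29), (30, 39), (40, 49), (50, 59),
           (60, 69), (70, 79), (80, 89), (90, 100)]
grouped = [(low, high, sum(low <= m <= high for m in marks))
           for low, high in classes]

print("Marks  Frequency")
for mark, frequency in ungrouped:
    print(f"{mark:>5}  {frequency}")

print("\nClass     Frequency")
for low, high, frequency in grouped:
    print(f"{low:>3}-{high:<4}  {frequency}")
```

Changing the list `classes` is all that is needed to experiment with fewer or more class intervals.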

In one extreme situation, a single class interval can be as wide as the range (largest - smallest) of the data. Then all elements fall in the same interval, and the frequency of that interval equals the number of observations. As such, in this extreme situation, all the data points are represented by a single point, resulting in the loss of all the information contained in the given data. In the other extreme situation, we have the ungrouped frequency distribution, which will again be complex in many situations. This will again lead to poor inferences from the given data. The approximate number of class intervals can be obtained by Sturges' formula

$$n = 1 + 3.322 \log_{10} N$$

where n, rounded to the next whole number, is the number of class intervals and N is the total number of observations. As such, if N is 100, we will have n = 8. We shall see that the mid-point of a class interval is the representative of the interval.

1.7 GRAPHICAL REPRESENTATION

Frequency distributions are easier to visualize if they are represented in the form of a graph. A graph called a histogram is generally used to depict a frequency distribution. In a histogram, class intervals are taken on the x-axis, and the frequency corresponding to a class interval is represented by the height of the rectangle whose base is the interval under consideration. For example, the histogram corresponding to the above-mentioned grouped frequency distribution is depicted in Figure 1.1. In this figure, the x-axis corresponds to the class intervals and the y-axis depicts the corresponding frequencies.

[Figure 1.1: Distribution of marks of 25 students (histogram; x-axis: class intervals 20-29 to 90-100, y-axis: frequency from 0 to 6)]
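As a rough illustration of the last two ideas, the sketch below computes the number of class intervals suggested by Sturges' rule and prints a text version of the histogram of the grouped marks. The rule is taken in its usual form n = 1 + 3.322 log10(N), rounded up, which agrees with the value n = 8 for N = 100 given above; the function name and the use of `#` characters for the bars are choices made here.

```python
import math


def sturges_classes(n_observations):
    """Number of class intervals by Sturges' rule, rounded up to a whole number."""
    return math.ceil(1 + 3.322 * math.log10(n_observations))


# Grouped frequency distribution of the marks of 25 students (Section 1.6).
grouped = {
    "20-29": 2, "30-39": 1, "40-49": 4, "50-59": 5,
    "60-69": 5, "70-79": 5, "80-89": 1, "90-100": 2,
}

print("classes suggested for N = 100:", sturges_classes(100))
print("classes suggested for N = 25 :", sturges_classes(25))

# Text histogram: one bar per class interval, bar length equal to the frequency.
for label, frequency in grouped.items():
    print(f"{label:>6} | {'#' * frequency} {frequency}")
```

The rule gives only an approximate guide; the grouped table of Section 1.6 uses eight classes of width ten because those boundaries are convenient for marks.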

Frequency Curve

A frequency curve can also be obtained for a frequency distribution. If we plot the frequencies as a function of the class intervals, the resultant plot is called a frequency curve. The frequency curve for the above-mentioned grouped frequency distribution is given in Figure 1.2.

[Figure 1.2: Frequency curve of marks of 25 students (x-axis: class intervals 20-29 to 90-100, y-axis: frequency from 0 to 6)]

1.8 MEASURES OF CENTRAL TENDENCY

The important point behind classifying data and drawing its frequency distribution is to find the nature of the distribution that the data is following. Although a histogram gives a good amount of information about the data, it is good to associate some more precise and useful information with the data. This information is in the form of measures of central tendency and measures of dispersion. One of the definitions of central tendency is “A measure of central tendency is a typical value around which other figures congregate”. In this section, we will understand three measures of central tendency, namely, mean, median and mode.

Mean

Let us take n observations, $x_1, x_2, \ldots, x_n$, as the outcome of an experiment. Then the mean of these observations is defined as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

The mean is also called the arithmetic mean associated with these observations.

Example: Let us take the marks of 25 students considered in section 1.6. Here the mean of these marks is 1516 / 25 = 60.64.

Properties of Mean
(i) The sum of deviations of the observations from their mean is always zero.
(ii) The sum of squares of deviations of the observations is minimum when taken from their mean.
(iii) If a constant is added to the observations, the mean of these observations also gets increased by this constant.
(iv) If every observation is multiplied by a constant, the mean of these observations also gets multiplied by this constant.

The mean is a very widely used measure of central tendency. However, it is very much affected by extreme observations, and as such it is not a good representation for data containing extreme values.

Median

The median of a distribution is the value that divides it into two equal parts. In terms of the frequency curve, the ordinate drawn at the median divides the area under the curve into two equal parts. Let us again take n observations, $x_1, x_2, \ldots, x_n$, as the outcome of an experiment. In order to calculate the median, we first arrange these observations in either ascending or descending order. The median is given by the size of the [(n + 1)/2]th observation if n is odd. The median is given by the mean of the sizes of the [n/2]th and [n/2 + 1]th observations if n is even.

Example: Let us take the marks of 25 students considered in section 1.6. The median of these marks is 61, which is nothing but the mark at the [(25 + 1)/2]th = 13th position when the marks are put in ascending order.
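The mean and median of the marks, together with property (i) of the mean, can be verified with Python's standard `statistics` module. This is a small sketch; the variable names are chosen here, and the marks are those of the 25 students of section 1.6.

```python
import statistics

# Marks of the 25 students considered in section 1.6.
marks = [23, 45, 54, 56, 78, 76, 43, 67, 65, 68, 78, 56, 92,
         89, 61, 76, 23, 54, 34, 65, 54, 43, 45, 98, 73]

mean = statistics.mean(marks)        # sum of marks divided by their number
median = statistics.median(marks)    # 13th value of the sorted list (n = 25 is odd)

# Property (i): deviations from the mean sum to zero (up to rounding error).
deviation_sum = sum(m - mean for m in marks)

print("total  :", sum(marks))
print("mean   :", mean)
print("median :", median)
print("sum of deviations from the mean:", round(deviation_sum, 10))
```

Replacing one mark by a very large value moves the mean noticeably while leaving the median almost unchanged, which is the sensitivity to extreme observations noted above.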

Mode

The mode is the observation that occurs the maximum number of times in a given distribution, with the other observations densely distributed around it. One definition of mode tells us that “mode is the value which has the greatest frequency density in its immediate neighborhood”.

Example: Let us again consider the marks of 25 students considered in section 1.6. The mode of these marks is 54, as this mark has the maximum frequency (3) in the distribution considered by us.

1.9 MEASURES OF DISPERSION

Dispersion is the term associated with the variability in the data. According to A.L. Bowley, “Dispersion is the measure of variation of the items” and according to Spiegel, “The degree to which numerical data tend to spread about an average value is called variation or dispersion of data”. This variability is measured in terms of the deviations from a measure of central tendency, and a suitable average of these deviations is called the measure of dispersion. Some important measures of dispersion are range, inter-quartile range, mean deviation and standard deviation. Out of these, standard deviation is an important measure, and we shall discuss it here in this unit.

The standard deviation

Let us consider two sets of observations: 5, 6, 7, 8, 9 and 3, 5, 7, 9, 11. The mean of each of these sets is 7. However, the difference between two consecutive values in the first set is 1, and this difference for the second set is 2. So we may agree upon the simple fact that the observations in the second set have a higher variability in comparison with the observations in the first set.

The simplest measure of the variation of a set of observations is the range of the given set, that is, the difference between the largest and smallest observations. The range of values of the second set (which is 8) is twice the range of values of the first set (which is 4). This measure, however, is not very useful, as it has some undesirable properties. As such, other measures have been introduced. We have noted that the mean is the desired measure of location of a set of observations. A measure of variation should measure the extent to which the observations deviate from this mean.
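The mode of the marks and the comparison of the two sets of observations can be checked as follows. This sketch uses the sample standard deviation, which divides by n - 1 as the rest of this section goes on to define; `statistics.multimode` is used so that ties, if any, would all be reported. Variable names are chosen here for illustration.

```python
import statistics

# Marks of the 25 students considered in section 1.6.
marks = [23, 45, 54, 56, 78, 76, 43, 67, 65, 68, 78, 56, 92,
         89, 61, 76, 23, 54, 34, 65, 54, 43, 45, 98, 73]

# Most frequent value(s) among the marks.
modes = statistics.multimode(marks)

first_set = [5, 6, 7, 8, 9]
second_set = [3, 5, 7, 9, 11]

for name, data in (("first set", first_set), ("second set", second_set)):
    data_range = max(data) - min(data)          # largest minus smallest
    s = statistics.stdev(data)                  # sample standard deviation (n - 1)
    print(f"{name}: mean = {statistics.mean(data)}, "
          f"range = {data_range}, s = {s:.4f}")

print("mode(s) of the marks:", modes)
```

Both sets have the same mean, so only a measure of dispersion distinguishes them; the range and the standard deviation each report the second set as twice as spread out as the first.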

Let us be given n observations, $x_1, x_2, \ldots, x_n$, as the outcome of an experiment, and let their mean be $\bar{x}$. The variations of the observations from this mean will be:

$$x_1 - \bar{x},\; x_2 - \bar{x},\; \ldots,\; x_n - \bar{x}$$

Some of these variations shall be positive and others may be negative. In order to have the effect of all observations on the variability, we square these deviations and then find the mean of these squares. Thus, a measurement of the variation of these observations about the mean of the set of observations is given by:

$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

We shall study in subsequent units that, while dealing with sampling distributions, we need to modify this expression a bit so that it has some more desirable properties. This modification is that we divide by (n - 1) instead of n. The resulting expression is then called the sample variance. We thus have a measure of dispersion in the form of the sample variance as:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

The positive square root s of this quantity is called the standard deviation of the observations.

Let us again consider the two sets of observations: 5, 6, 7, 8, 9 and 3, 5, 7, 9, 11. The standard deviations of these sets using the above formula are $s_1 = 1.5811$ and $s_2 = 3.1623$. This measure reflects our thought process that the second set of observations has a higher variability.

Problems:
1. Simulate the experiments of throwing two coins and throwing two dice.
2. Generate a set of random values (x, y) containing 10 elements and satisfying $x^2 + y^2 \le 4$.
3. A data base administrator (DBA) knows that the duration of processing a query is between 2 seconds and 2 minutes. Write the sequence of steps to help him generate 20 random query processing times for implementing the newly designed DBMS.
4. Apply the acceptance-rejection method to obtain a random number from the distribution that follows the probability function f(x) = 6x(1 - x), 0 ≤ x ≤ 1.

5. Construct a data set consisting of 50 sample values for a situation in which we have to perform an experiment to get the execution times of 50 C-programs. Use this data to construct a frequency distribution and draw the corresponding histogram and cumulative frequency curve. Also, find the mean and variance of this distribution.
6. The IQs of 20 applicants to an undergraduate program are 109, 104, 109, 106, 112, 115, 109, 128, 111, 108, 111, 110, 107, 110, 108, 106, 112, 107, 125, and 108. Find the mean and standard deviation of this data.
7. Show that the sum of deviates of values $x_i$, i = 1, 2, …, n, from their mean is zero, $f_i$ being the frequency of $x_i$.
8. Show that the sum of squares of deviations of a set of values is minimum when taken about the mean.
9. Find the mean deviation from the mean and the standard deviation of the values a, a + d, a + 2d, …, a + 2nd.
10. If the data are coded so that $x_i = c\,u_i + \alpha$, then show that $\bar{x} = c\,\bar{u} + \alpha$ and $s_x = |c|\, s_u$.
11. The means of two samples of size 50 and 100, respectively, are 54.4 and 50.3 and the standard deviations are 8 and 7. Obtain the mean and standard deviation of the sample of size 150 obtained by combining the two samples.