Introduction to Methods for Research in Official Statistics
An Overview
Chapter 1. Research, Researchers and Readers
1.1. Doing Research
1.2. The Need for a Research Report
1.3. Connecting with the Reader
1.4. The Research Project
1.5. Example of a Research Proposal
Chapter 2. Formulating the Research Problem and the Research Plan
2.1. Research Problem
2.1.1. From Topic to Questions
2.1.2. Research Objectives
2.2. Using Sources
2.3. Stating the Research Hypotheses (and Assumptions)
2.4. Putting together the Research Plan
2.4.1. Formulating the Conceptual and Operational Framework
2.4.2. Using a model
2.4.3. Selecting methodologies
Chapter 3. Case Studies from First Research-based Regional Course
3.1. Summaries of Research Proposals
3.2. Executive Summaries of Selected Final Reports
3.2.1. Statistics for Local Development Planning (Try Sothearith, Cambodia)
3.2.2. Access and Quality of Basic Education in Myanmar (Khin Khin Moe, Myanmar)
3.2.3. Determinants of Poverty in Nepal (Shib Nandan Prasad Shah, Nepal)
Chapter 4. Data Processing & Analyses
4.1. Organizing and Summarizing Data
4.1.1. Getting Started with STATA
4.1.2. Obtaining Tables and Summary Measures with STATA
4.1.3. Managing Data with STATA
4.1.4. Obtaining Graphs and Charts with STATA
4.1.5. Correlation Analysis with STATA
4.2. Performing Statistical Inference
This training manual provides an introduction to methods for research in official statistics, and is based on the vast literature available on how to go about conducting research effectively. The manual is meant as learning material for the Research-based Training Program (RbTP) Regional Course of the United Nations Statistical Institute for Asia and the Pacific (UNSIAP), which aims to help official statisticians in developing countries in Asia and the Pacific enhance their capability in undertaking independent research.
The RbTP responds to a training gap in most statistical systems of the developing countries of Asia and the Pacific. Official statisticians, especially in National Statistical Offices (NSOs), are typically not much involved in in-depth analysis of survey and census results and administrative-based data. The official statistician's role in analysis for guiding policy is typically that of a passive data provider rather than one of active involvement in an interdisciplinary approach to analysis. While this situation may have been acceptable through the years, data analysis for the purpose of improving data quality and statistical processes clearly has to be undertaken by NSOs. Statistical offices also need capability in survey design, in index computations, in developing measurement frameworks for new areas in statistics, and in assisting policy makers and program formulators in the use of statistical models.
The training manual covers basic research principles and a number of statistical methods for research. The topics included in this manual are: Issues on Doing Research; Formulating the Research Problem and Research Design; Statistical Analyses; and Writing and Presenting the Research Paper. Some examples of executive summaries, especially from the First RbTP Regional Course, conducted by UNSIAP and its local partner, the Philippine Statistical Research and Training Center, are also provided. The training manual also provides extensive discussions of the use of the statistical software STATA (release 10) for basic data processing and analysis.
This manual has been prepared under the general direction of Davaasuren Chultemjamts, Director of UNSIAP, with the assistance of Jose Ramon G. Albert. We hope that the course materials presented here are useful for self-learning. Questions, comments and suggestions are most welcome, and should be directed to UNSIAP Director Davaasuren. UNSIAP is committed to strengthening research capability in official statistics in the Asia and Pacific Region, and it is hoped that this manual represents a useful step in this direction.
Very often, we are in need of solutions to a problem. Research (from the French rechercher, "to find out") involves gathering information through the use of scientific methods of inquiry for purposes of finding answers to questions, thereby solving a problem. Research is aimed at the discovery and interpretation of facts, the revision of accepted theories or laws in the light of new facts, or the practical application of such new or revised theories or laws. Descriptive research involves a thorough, scientific discovery of answers to 'what' questions.
Beyond answering these questions, research must also find explanations to 'why' questions. Scientific understanding, it should be noted, has nothing to do with right or wrong: much of what we understand is transient, i.e., never fully completed, evolving through shifting paradigms (Kuhn, 1980). Scientific endeavors are consensual; they are the result of shared experiences as part of human endeavors to become better than who we are. They are done in a context of openness, of possibility of scrutiny.
There are various typologies of research. We may, for instance, classify research into basic and applied research. Basic research is concerned with organizing data into the most general yet parsimonious laws; its emphasis is on formulating coherent descriptions and explanations. Applied research, on the other hand, is concerned with the discovery of solutions to practical problems; it places emphasis on factual data which have more immediate utility or application. In basic research, the rationale defines what we want to know about the phenomenon, while in applied research, the rationale defines what we want to do. In carrying out a research inquiry, we can work with qualitative and/or quantitative data.
Exploratory research is undertaken when tackling a new problem, issue or topic about which little is known. At the beginning of an exploratory research undertaking, the research idea cannot be formulated very well. The problem may come from any part of the discipline; it may be a theoretical research puzzle or have an empirical basis. The research work will need to examine what theories and concepts are appropriate, to develop new ones if necessary, and to assess whether existing methodologies can be used. It obviously involves pushing out the frontiers of knowledge in the discipline. Research may also be of the testing-out type, where we try to find the limits of previously proposed generalizations: Does the theory apply in new technology industries? With working-class parents? Before globalization was introduced? In the wake of a global financial crisis? The amount of testing out to be done is endless and continuous. This is the way a particular discipline is improved.
Research may also be of the problem-solving type, where we start from a particular
problem ‘in the real world’, and bring together all the intellectual resources that can be brought
to bear on its solution. The problem has to be defined and the method of solution has to be
discovered. The person working in this way may have to create and identify original problem
solutions every step of the way. This will usually involve a variety of theories and methods,
often ranging across more than one discipline since real-world problems are likely to be
‘messy’ and not solvable within the narrow confines of an academic discipline (Phillips and Pugh, 2000).
Although research may be variegated in purpose, nature of work, and general type, we can identify the characteristics of good research:
• Research is based on an open system of thought. One can ask any question, and even challenge the established results of good research.
• Researchers examine data critically. Are the facts correct? Can we get better data? Can the results be interpreted differently?
As research goes about seeking explanations, the researcher needs to identify causal relationships among the variables being investigated. The logical criteria for causal relationships involve: (a) an operator existing which links cause to effect; (b) cause always preceding the effect in time; and (c) cause always implying effect. However, the nature of the design may sometimes not allow for a direct establishment of cause-and-effect relationships. Research may employ deductive reasoning, inductive reasoning, or a combination of the two. In deduction, laws are stated in universal terms, and reasoning goes from the general to the particular. Induction involves a systematic observation of specific empirical phenomena, with the laws stated in terms of probability. The research process typically cycles between deduction, where we reason toward observations, and induction, where we reason from observations (cf. Figure 1-1). Thus, research is a cyclical process rather than a linear event with a beginning and an end.
The research process is not actually rigid: it involves the identification and formulation of a research problem, the collection and interpretation of data, drawing conclusions from the data analysis, and reporting the research findings. Research is based on the testing of ideas that may start with a theory about some phenomenon. That theory might rely on the literature in the discipline. Or, more simply, the researcher might be curious about some particular behavior occurring in a particular observation of the phenomenon under investigation, and wonder why this is so. The researcher thinks about the problem, looks at relevant studies and theories in the literature, and tests the explanation by conducting the research. The researcher focuses the question(s) of interest into specific testable hypotheses. A research design is made to ensure that data are gathered in a systematic and unbiased manner to answer clearly the research hypotheses that have been posed. After designing the research, data are collected and subsequently analyzed. Explanations of the findings are then proposed, and these may lead to further data generation and analysis.
One research problem can spark further examination into the phenomenon under investigation, and as more data and analysis accumulate, we gain a deeper understanding of the phenomenon. Of course, within a single research project, time constraints impose a roughly linear sequence of events with a delineated beginning and end, but the project may well set off further research.
“Research is hard work, but like any challenging job well done, both the
process and the results bring immense personal satisfaction. But research and
its reporting are also social acts that require you to think steadily about how
your work relates to your readers, about the responsibility you have not just
toward your subject and yourself, but toward them as well, especially when you
believe that you have something to say that is important enough to cause
readers to change their lives by changing how and what they think.” – Booth et
al. (1995)
Research in official statistics takes a very special track. Within the context of the statistical production process, the research needs of official statistics can be viewed in terms of process quality. There may be needs for resolving data quality issues, and research needs also extend beyond the production side toward concerns from the demand side. For instance, NSOs now have to address data demands for small area statistics, especially on matters of public policy such as poverty, in the light of monitoring the progress in meeting the Millennium Development Goals (MDGs). A number of official statistics are direct survey estimates that can only be released for rather large areas, yet users may require small area statistics for local development planning purposes and, particularly, for localizing the monitoring of the MDGs at the district or sub-district levels. NSOs are trying to address these user needs by experimenting with statistical methods and models that allow survey data to borrow information over space and time from other data sources, such as censuses and administrative reporting systems. Toward improving the timeliness of statistics, NSOs are also investigating the possibility of generating flash estimates, especially to enable policy planners to have rapid assessments of trends. The use of quality management frameworks in the statistical production process is also getting some attention as a way of enhancing public trust in official statistics. Traditionally, data analysis and statistical modeling were not viewed as part of the function of statistical offices, but there is a growing recognition of the need to use many data analytic tools and methods in the statistical production process, whether for data validation purposes, for time-series analysis, or for modeling. Some statistical models do not directly answer research questions, but may be used as intermediate tools for research: say, to reduce a dataset into fewer variables (e.g., principal components analysis and factor analysis), to group observations (e.g., cluster analysis), or to relate sets of variables (e.g., canonical correlation analysis); the results of these methods may then be used for subsequent analysis. More and more, statistical research is considered an essential and integral part of the entire production process of official statistics. The research issues enumerated here are certainly not exhaustive of the issues that ought to be addressed, but they serve as examples of some challenges in improving the statistical production process.
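To make the idea of an intermediate data-reduction tool concrete, the following is a minimal Python sketch of principal components analysis via NumPy's singular value decomposition on synthetic data. The data and all names here are hypothetical illustrations (the manual's own worked examples use STATA):

```python
# Illustrative sketch: principal components analysis as an intermediate
# data-reduction tool, computed with NumPy's SVD on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))                      # one latent factor
# 50 observations of 4 highly correlated indicator variables
X = np.hstack([base + 0.1 * rng.normal(size=(50, 1)) for _ in range(4)])

Xc = X - X.mean(axis=0)                              # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / (s**2).sum()                      # variance share per component
scores = Xc @ Vt.T                                   # component scores for later analysis

print(explained.round(3))                            # the first component dominates
```

Since the four variables share one latent factor, nearly all the variance loads on the first component, whose scores could then replace the original variables in subsequent analysis.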
Of course, the common person may also distrust statistics (given the many misuses and abuses of statistics). It may also be the case that official statisticians are thought of as those who diligently collect irrelevant facts and figures and use them to manipulate society. No less than Mark Twain stated that “There are three kinds of lies: lies, damned lies, and statistics.” (See Figure 1-2.) A book was even written by Darrell Huff (with illustrations by Irving Geis) in 1954 on “How to Lie with Statistics.” Despite the view that statistics may be twisted, there is also a sense of the overwhelming importance of statistics, both the figures and the science itself. Florence Nightingale is quoted to have pointed out that Statistical Science is “...the most important science in the whole world: for upon it depends the practical application of every other science and of every art: the one science essential to all political and social administration, all education ... for it only gives the results of our experience.”
In evidence-based decision making for policy and program formulation and implementation, it is crucial to have official statistics and other information on which to base decisions. Without timely, accurate, reliable and credible official statistics, public policy and the entire development process are blind: policy makers cannot learn from their mistakes, and the public cannot hold them to account.
Many of us may be conducting research, but few of us write up our research, in part because our research results are usually solely for our own purposes. Documenting a research project is nevertheless important: we may forget or misremember our results unless we write them up. This is why many researchers find it useful to start writing the research report from the very beginning of the research process. Writing up research work also enables the researcher to see more clearly relationships, connections and contrasts among various topics and ideas. Writing thus enables a researcher to organize thoughts more coherently, and to reflect on the ideas to be presented in the report. Also, the better we write, the better we can read into what others have written.
Why should researchers write up their reports as formal papers, especially when this task may be quite demanding? Such concerns are reasonable. However, writing research reports into formal research papers enables the research findings to be part of the body of knowledge that people can have access to, especially as a public good. This, in turn, enables others to learn from our research results, including the mistakes we have made. Formal research reports may be published, and consequently disseminated to a wider readership. They may also be disseminated in other formats.
Writing research into a formal report also helps a researcher arrange the research findings in ways that readers will clearly understand, especially since it takes several drafts to come up with a final report. Researchers will need to communicate research results with arguments, including claims, evidence, qualifications and warrants. All statements in a research report have to be structured and linked in a form that anticipates readers' views, positions and interests, as well as what readers will question: moving from evidence to argument. In short, thinking in written form (especially a formal paper) allows researchers to be more cautious, more organized, and more attuned to their readers.
In his “Devotions Upon Emergent Occasions,” John Donne wrote that “no man is an island, entire of itself; every man is a piece of the continent.” While it may seem as though the research process, especially the writing part, is done in solitude, research is a social activity. It involves interconnections between the researcher and other researchers, the readers of the research report, and the wider world. Every time a research project is started, there is an attempt to look through what others have done, in order not to re-invent the wheel and to see what further steps can be taken based on that work. Thus, research builds on the research of others.
When the research report is being written, a researcher must be thinking of his/her potential readers. The research findings have to be communicated effectively to readers of the research report. Toward this end, it is important for a researcher to make a judgment about the readers' knowledge, understanding and interest. A researcher's judgments about the readers will undoubtedly influence the writing. Roles will be created for both the researcher and the reader.
All readers bring to a research report their own predispositions, experiences, and
concerns. So before a report is written, it is important for a researcher to think about the
standpoint of his/her readers, and where the researcher stands as regards the question being
answered. There is undoubtedly going to be variability among readers of the research report.
Some readers may have no interest in the research problem so they may not be concerned with
the research finding. Some may be open to the problem because the finding may help them
understand their own problems. Others may have long-held beliefs that may interfere with the
research findings. Some readers may expect the research paper to help them solve their own
problems, or understand better the field of inquiry. Some readers may recognize the research
problem being investigated, and some may not be aware of this concern. Some may take the
research problem seriously, and some may need to be persuaded that the problem matters.
Despite this variability, readers share one interest: they all want to read reports that can be
understood.
A researcher should also consider how readers are expected to change their ways of thinking and doing things. Are they expected to accept new information, change certain beliefs or take some action as a result of reading the research report? Will the research finding contradict what readers believe, and how? Will readers have some standard arguments against the proposed solution espoused in the research report? Will the solution stand alone, or will readers want to see the details behind it? And how is the research report to be disseminated? Will the report be published in a reputable journal? All these questions require the researcher to think seriously about the process of writing and communicating with a reader. The research report is the product of many struggles pertaining to the uncertainties, confusions, and complexities in the conduct of the research and in coming up with the final draft. As a researcher engages in the writing process, it is important to be aware of all these struggles, and to confront them directly.
While there is no single formula to guide all researchers, a research project typically involves much planning and writing: from the choice of a problem, to the construction of a research design, to collecting and analyzing the data. Most of the writing in the research may involve simple note-taking. Among the steps of a typical research project are:
• Evaluation of the researchability of the problem and the value of the research
• Formulation of the research design, including the sampling design
• Data collection
Certainly, the steps above are not intended to be prescriptive, nor are they to be undertaken in a strictly linear, sequential fashion. But it helps to see these steps, as well as to look through a few common faults in conducting research, which Asis (2002) lists:
• Collecting data without a well-defined plan, hoping to make sense of it afterward;
• Taking a batch of data that already exists and attempting to fit a meaningful research question to it;
• Failure to provide a theoretical framework that would tie together the divergent masses of research into a systematic and coherent whole;
• Failure to make explicit and clear the underlying assumptions within your research, so that readers can assess how these place restrictions on the conclusions and how they apply to other situations;
• Failure to anticipate alternative rival hypotheses that would also account for a given set of findings and which challenge the interpretations and conclusions reached.
Many of the issues above could probably be ironed out if, before the project commences, a research proposal were drafted to enable the researcher to plan out the course of investigation, i.e., what is to be done, and what and how to measure. The research implementation ought to be guided by the research proposal. Researchers typically also write research proposals to enable them to apply for funding, or to receive institutional review and ethics approval. When writing up a research proposal, it has been suggested (see, e.g., the Seminar in Research Methods at the University of Southern California, as quoted by Asis, 2002) that the following elements be addressed:
1. Basic difficulty: What is it that has caught your interest or raised the problem in your mind?
2. Rationale and theoretical base: Can this be fitted into a conceptual framework that gives a structured point of view? In other words, can you begin from a position of established theory and knowledge in this area? Can you build a conceptual framework into which your ideas can be placed?
3. Statement of the purpose or problem: Define the problem. What is it that you plan to investigate? What is the context of the proposed research? What are the hypotheses you will test or the specific objectives at which the research is aimed? Be concrete and clear, making sure that each hypothesis or objective is stated in operational terms.
6. Design and procedure: State who your subjects will be, how they will be selected, the conditions under which the data will be collected, and the treatment variables to be introduced.
7. Assumptions: What assumptions have you made about the nature of the behavior you are investigating, about the conditions under which the behavior occurs, about your methods and measurements, or about the relationship of this study to other studies?
8. Limitations: What are the limitations surrounding your study, and within which limitations will conclusions be drawn?
9. Delimitations: How have you arbitrarily narrowed the scope of the study? Did you focus on selected aspects of the problem, certain areas of interest, or a limited range of subjects?
10. Definition of terms: Limit and define the principal terms you will use, particularly where terms have different meanings to different people. Emphasis should be placed on operational definitions.
After the research proposal is crafted and the research commences, a research report is drafted, revised, and finalized. Doing research thus involves a number of skills: formulating a worthwhile problem, gathering and evaluating information, organizing thoughts into coherent claims and arguments, making sound analyses and interpretations, and treating research as a social enterprise. In addition, it is helpful to know how the research report will be assessed by readers. Ultimately, the findings of a research project and the contents of a research report are the responsibility of the researcher.
Introduction
The Thai government has promoted voluntary family planning since 1970. Economic and social development have contributed to lower fertility and longer life expectancy of the Thai population. Thus, the ratio of elderly people to the total population is increasing.
Objectives
3. To provide basic information on the Thai elderly that might be used in planning elderly-help projects in the future.
Framework 1

Independent variables:
• Individual characteristics: sex, educational attainment, marital status, religion, total living children, living condition, health status
• Household characteristics: household headship status, owner of dwelling, housing quality index, family support
• Geographic characteristics: region, area

Covariate variable: age

Dependent variable: employment status (working / not working)

Framework 2

Independent variables:
• Individual characteristics: sex, educational attainment, marital status, religion, total living children, living condition, health status
• Household characteristics: household headship status, owner of dwelling, housing quality index, family support
• Geographic characteristics: region

Covariate variables: age, area

Dependent variable: industry group of the working elderly (industry / agriculture / services)

Measurement of variables:
• Marital status, religion, household headship status, living condition, owner of dwelling, financial support, region and area: categorical measurement (nominal scale)
• Household size (0-16 persons), recoded as: 1 = 0-3 persons, 2 = 4-5 persons, 3 = 6-16 persons
• Employment status: 1 = working, 0 = not working
Methodology
• Descriptive statistics and cross-tabulation used to examine demographic and
socio-economic characteristics, and living conditions of the elderly
• GLM univariate analysis and the chi-square test used to analyze factors influencing the employment status of the elderly, and the industry group of the working elderly
Source of Data
• The 2002 Survey of the Elderly in Thailand
• Selected only population age 60 years and over
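The cross-tabulation and chi-square testing named in the Methodology can be sketched in a few lines. The contingency table below and the helper function are hypothetical illustrations, not the 2002 survey data or the study's actual code:

```python
# Illustrative sketch: cross-tabulation and Pearson's chi-square test of
# independence, as in the Methodology above (hypothetical data).

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical cross-tab of employment status (rows: working, not working)
# by sex (columns: male, female)
table = [[120, 80],
         [60, 140]]
stat, df = chi_square_independence(table)
print(round(stat, 2), df)  # 36.36 1
```

The statistic is then compared against the chi-square critical value with the given degrees of freedom to decide whether employment status and sex are independent.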
• Framework 1

1. H0: μ_r1 = μ_r2 = μ_r3 = μ_r4 = μ_r5
   (Means of elderly employment status in the different regions are equal)
   H1: μ_ri ≠ μ_rj, for some i ≠ j
   (At least one of the means differs from the rest)

2. H0: μ_f1 = μ_f2
   (Means of elderly employment status in the different financial support categories are equal)
   H1: μ_f1 ≠ μ_f2
   (The means are unequal)

3. H0: μ_h1 = μ_h2 = μ_h3 = μ_h4 = μ_h5
   (Means of elderly employment status across the different health statuses are equal)
   H1: μ_hi ≠ μ_hj, for some i ≠ j
   (At least one of the means differs from the rest)

4. H0: μ_e1 = μ_e2 = μ_e3 = μ_e4
   (Means of elderly employment status across the different educational attainment categories are equal)
   H1: μ_ei ≠ μ_ej, for some i ≠ j
   (At least one of the means differs from the rest)

5. H0: μ_h1 = μ_h2
   (Means of elderly employment status between the two household headship statuses are equal)
   H1: μ_h1 ≠ μ_h2
   (The means are unequal)
Hypotheses of the study

• Framework 2
1. H0: μ_r1 = μ_r2 = μ_r3 = μ_r4 = μ_r5
   (Means of industry group in the different regions are equal)
   H1: μ_ri ≠ μ_rj, for some i ≠ j
   (At least one of the means differs from the rest)

2. H0: μ_m1 = μ_m2 = μ_m3
   (Means of industry group across the different marital statuses are equal)
   H1: μ_mi ≠ μ_mj, for some i ≠ j
   (At least one of the means differs from the rest)

3. H0: μ_o1 = μ_o2
   (Means of industry group between the two types of dwelling ownership are equal)
   H1: μ_o1 ≠ μ_o2
   (The means are unequal)

4. H0: μ_hi1 = μ_hi2 = μ_hi3 = μ_hi4
   (Means of industry group among the different housing quality index categories are equal)
   H1: μ_hii ≠ μ_hij, for some i ≠ j
   (At least one of the means differs from the rest)
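The hypotheses above are tests of equality of group means, which a one-way analysis of variance addresses. As a minimal sketch (the groups below are hypothetical, not the survey's data), the F statistic can be computed as:

```python
# Illustrative sketch: the one-way ANOVA F statistic behind hypotheses such as
# H0: mu_r1 = ... = mu_r5 (equal mean employment status across regions).

def one_way_f(groups):
    """F statistic with its degrees of freedom for a one-way analysis of variance."""
    k = len(groups)                                  # number of groups
    n = sum(len(g) for g in groups)                  # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # between-group and within-group sums of squares
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Employment status coded 1 = working, 0 = not working, in three hypothetical regions
regions = [[1, 1, 0, 1], [0, 0, 1, 0], [1, 0, 1, 1]]
f, df1, df2 = one_way_f(regions)
print(round(f, 3), df1, df2)  # 1.333 2 9
```

A large F relative to the F distribution with (df1, df2) degrees of freedom would lead to rejecting the null hypothesis of equal group means.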
The pursuit of research begins with a problem. A research problem may be viewed as a general statement of why the research is to be undertaken. The research problem is not necessarily a social or practical problem, but practical or social problems can motivate a research problem. Both practical and research problems consist of two essential elements: a particular situation or condition, and the consequences or costs of that condition. In a research problem, the condition is some question or circumstance under investigation; for instance, "What role does family size play in monetary poverty?" The cost of the condition is some ignorance or misunderstanding that is more consequential than the ignorance or misunderstanding that defined the condition. In the example above, not understanding the relationship between poverty and family size may lead a government to ignore setting up a public policy on population within the framework of poverty reduction.
A research problem thus starts from an insufficiency of the established body of knowledge (and possibly contradictory results from different researchers). It may result from some perceived difficulty, i.e., a feeling of discomfort with the way things are, or a discrepancy between what we believe should be and what is. A good research problem is one that is worth spending time investigating (and this includes issues of the "solvability" of the problem), and one whose report is worth spending time reading. A research problem ought to meet the research standards of the institution represented by the researcher or of the funding agency of the research project. A well-defined research problem leads naturally to the statement of the research objectives, to research hypotheses, to the key variables, and to a selection of methodology for measuring the variables. A poorly defined research problem leads to confusion: if a researcher is uncertain about the research problem, we can be sure that others will be as well.
The research problem might first be described as a broad topic, a general area of inquiry, which ideally ought to be of interest to the researcher. The topic may have been suggested by another researcher, especially an expert in a field, or it may have been the result of the researcher's own interest, arising as a reaction to some readings or writings. A topic is probably too broad if it can be stated in fewer than five words. The topic is typically written as the title of the work. It may be important to narrow down the general topic by adding modifying words and phrases. The advantage of having a specific topic is that one can easily turn the topic into questions that need to be answered. It may be helpful to look at the topic as part of a larger system, or to analyze the topic into component parts and evaluate the relationships among these components. The topic's range of variation, categories, characteristics and values may lead us to more focused questions.
Most research topics generate several questions. Questions are generally of two
types: ‘what’ or ‘where’ questions and ‘why’ questions. The grammatical constraints of the
direct question force a researcher to be precise and particular. In the short term, a researcher will
find out that the question gives the researcher (and readers of the report) a fairly clear idea of
the type of research needed to answer it: quantitative or qualitative; experimental or empirical.
Does the research require a survey? Would a desk study be appropriate, or is there a need for
field work?
A researcher must progress not only from topic to questions, but from questions to their
rationale or significance. Once questions are formulated, a researcher has to ask and try to
answer one further question: So what? That is, a researcher must be able to state the value of
the research, not only to the researcher, but also to others. In doing so, Booth et al (1995)
suggest that a researcher:
a. name the topic, i.e. state what he/she is studying:
I am studying ___________
b. imply a question, i.e. state what he/she does not know about the topic:
because I want to find out who/how/why ___________
c. state a rationale for the question, i.e. state why he/she wants to know about it:
in order to understand ___________
Such steps may seem, at first glance, to be quite academic, and not of any use in the
real world, particularly in the world of official statistics. However, research problems in official
statistics are actually structured exactly as they are in the academic world. No skill is more
important than the ability to recognize a problem important to clients and stakeholders of
official statistics, and the public at large, as well as to pose that problem to readers of the
research report. Readers must be convinced that the research result is important to them and
worth their reading time.
A researcher ought to define the research problem very carefully, and to have some idea
of why he/she chose it (rather than another problem). Simply identifying a topic which interests
a researcher will not help him/her to choose a suitable method of researching it, nor to schedule
the research work and report writing. If we make the mistake of not distinguishing between the
topic to be investigated and the research problem to be solved, we run the risk of lacking the
focus needed to plan and carry out the research.
A researcher may read rather indiscriminately if he/she does not know quite what
he/she is looking for, and will probably also make far more notes than is required. The
researcher may keep gathering more and more data, and not know when to stop; if he/she does
stop, he/she may have difficulties identifying what to include in a research report, and what not
to, and worse, decide to put everything into the report, leading readers of the research report to
confusion.
“To raise new questions and new possibilities, to regard old problems from a new angle
requires creative imagination and marks real advance in science” (Einstein, A. and L. Infeld,
The Evolution of Physics, Simon and Schuster, 1938). Thus, in identifying and conceptualizing
a research problem, it is important to consider that the formulation of a problem is often
more essential than its solution, which may be merely a matter of mathematical or
experimental skill. As was pointed out by Van Dalen, D.B. in “Understanding Educational
Research” (as quoted by Asis, 2002), a researcher may want to ask himself/herself the
following:
• Will the solution of this research problem be of interest and importance to me and to
others?
• Am I genuinely interested in this research problem but free from strong biases?
• Does the research problem meet the scope, significance, and topical requirements
of the institution or funding agency?
• What can I learn from what others have already done on this research problem?
Regardless of the specific topic and the approach chosen for investigating the problem,
the research problem cannot be left open-ended, because finances and time for undertaking a
research project are limited. At some
point, the research has to end. A researcher needs to have some idea of what he/she is looking
for, in order to figure out how to go about looking for it and roughly how long it will take to
find it. And, of course, so that the researcher will know when he/she has found it!
It is not enough to have a research topic; one must also have statements of what is expected as
the output of the research. Each of the research objectives identified ought to be at least partially
met by the end of the research undertaking. There is usually a main objective, which is not
quite operational, and which is broken down into specific objectives. The specific objectives
translate the main objective into operational, achievable statements.
It is suggested that the general and specific objectives should be specific, measurable,
action-oriented, realistic and time-related (SMART), and where appropriate, a hypothesis must
be stated for each objective. The research objectives ought to detail the scope of research work
to a rather manageable task. Poorly formulated objectives lead to unnecessary data collection
and undoubtedly make the research project difficult to handle. Final outputs and results should
be itemized and linked to the objectives. The research objectives may be stated in the form of
questions in order to help formulate research hypotheses and methods for meeting the
objectives.
While identifying a problem is the first essential step in the conduct of research, it must
be followed by a search for sources, including a review of relevant literature, and a consultation
of persons/experts concerned with the problem. A literature review is a hunt for sources of
information on related topics and/or problems either in the library and/or the virtual library
(e.g., the World Wide Web and the internet), or even other computer sources (e.g., specialized
databases, CDs). Internet research can be quite rewarding. Just going through one search
engine, e.g., Google (www.google.com), will take us through the available hyperlinks on the
internet. However, internet researching can be time consuming and there are certainly a lot of
resources on the internet that are not valuable for the research at hand. A consultation with
experts may take the form of an interview, whether face-to-face or via email, phone conversation
or short message service. It helps to have a research supervisor to direct research work
especially on reviewing literature, but a supervisor cannot be expected to perform routine work
for the researcher. For instance, if a researcher is not comfortable with the rigors of performing
a regression model, the researcher can be directed to read through some lecture notes or text on
regression modeling, but the researcher must review the modeling process, run the regression
with the appropriate statistical software, and perform model diagnostics to assess the adequacy
of the model employed. The researcher cannot expect the supervisor to tell him or her what to
do. A supervisor can point out dead ends or issues on inappropriate use of tools, suggest
references and the like, but the researcher is responsible for the research undertaking and the
writing of the proposal and report. During a presentation of the findings, a researcher should
never say something like “I did it this way because my research supervisor told me to do so”.
Instead, a more proper statement is: “I did it this way because my research supervisor suggested
it; I then compared it with other methods and found it the most appropriate.”
There are various functions of a review of related literature. A major purpose of using
sources is to establish the originality of the research; that is, that the proposed research has not
been undertaken before. Almost always something related has been done, so a review of the literature
organizes various sources, discusses them, and points out their limitations. A literature review
helps avoid unintentional duplication of previous researches. Of course, there are also
researches that repeat what others have done (or what was done earlier by the researcher) with
some slight change in the methodology, as in how constructs were operationalized, or how data
were analyzed. In the absence of a research problem, a literature search may also help guide in
selecting a topic. It also provides the source of the theoretical and conceptual frameworks of
the planned research. A literature search provides ideas on research methodology and on how
to interpret research findings. It may help in indicating how the research will refine, extend or
transcend what is now known in the body of knowledge about the subject matter. For instance,
Mercado (2002) illustrates that when performing an epidemiological research, a careful review
of the literature may reveal, among others:
• geographic areas affected - are there particular areas where the problem occurs?
• changes across population groups - are there special population groups affected
by the problem?
• gaps in current knowledge - what aspects of the problem require
further research?
The epidemiological research problem identified must now be defined in terms of its
occurrence, intensity, distribution and other measures for which the data are already available.
The aim is to determine all that is currently known about the problem and why it exists.
Sources of information may include people and publications; the latter include general
reference works (such as general and specialized encyclopedias, statistical abstracts, facts on
file), chapters of books, books, journal articles, technical reports, conference papers, and even
electronic sources (both online – from the internet – and offline – from CD ROMs).
Encyclopedias, books and textbooks are typical sources of the conceptual literature; abstracts
of journals of various scientific disciplines, actual articles, research reports, theses and
dissertations serve as sources of research literature. Sources of both conceptual and research
literature include general indices, disciplinal indices, and bibliographies (Meliton, 2002).
• primary sources: “raw” materials of the research, i.e. first-hand documents and/or data;
• secondary sources: books, articles, published reports in which other researchers report
the results of research based on primary sources.
In official statistics, the primary sources may be actual data from a survey, and secondary
sources may be published reports, and compilations of data from primary sources. Booth et al.
(1995) provide sound advice for using secondary and tertiary sources:
“Here are the first two principles for using sources: One good source is worth
more than a score of mediocre ones, and one accurate summary of a good
source is sometimes worth more than the source itself.”
A secondary source may misrepresent the original source (perhaps inadvertently, and even
occasionally, deliberately). This is especially true when we look through information on the
internet, which has become a source of a lot of information, both good and unhelpful. Thus,
it is important to also evaluate sources.
Sources may be ranked in roughly decreasing order of reliability as follows: (a) peer-reviewed
articles in international journals; (b) book chapters in edited collections, textbooks by
established authors (i.e. with a track record of publications in peer-reviewed journals), and
edited proceedings from international conferences; (c) technical reports and electronic
documents with no printed equivalent from well-regarded institutions; (d) peer-reviewed
articles in national or regional journals, and textbooks by lesser-known authors; (e) unedited
conference proceedings and edited proceedings from local conferences; (f) technical reports and
electronic documents from less-established sources.
Taking full notes can provide some guidance toward evaluation of the information and
sources. It may be helpful to firstly have bibliographical notes of the reference materials, and
H61
Such notes should include the author, editor (if any), title (including subtitle), edition, volume,
publisher, date and place published of the publication, and the page numbers for an article in a
journal. It may also be helpful to record the library call number of the reference material
(although this is no longer cited in the final research report). The latter is meant to help trace
back steps in case there is a need to recheck the source. For internet sources, information on the
URL and the date of access should be recorded. Content notes may take one of three forms:
• Quotation – The researcher copies the author’s exact words, and indicates the
exact page reference so that quotations can be properly attributed.
• Paraphrase – The researcher restates in his or her own words the author’s
thoughts.
• Summary – The researcher states in condensed form the contents of the article.
These notes state the author (and title), and a page number on the upper left hand portion of the
card (see Figure 2-2 and Figure 2-3). At the upper right hand are keywords that enable a
researcher to sort the cards into different categories. The body of the card provides the notes.
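The note card just described is essentially a small data structure: bibliographic fields, keywords for sorting, and a body. As a hypothetical sketch (the class name, field names, and sample cards below are all invented for illustration), it might look like:

```python
from dataclasses import dataclass, field

@dataclass
class NoteCard:
    author: str                      # upper left: author (and title) with page
    title: str
    page: int
    keywords: list = field(default_factory=list)  # upper right: sorting keys
    body: str = ""                   # the note: quotation, paraphrase or summary

cards = [
    NoteCard("Booth et al.", "The Craft of Research", 35, ["sources"],
             "One good source is worth more than a score of mediocre ones."),
    NoteCard("Kuhn", "The Structure of Scientific Revolutions", 77, ["theory"],
             "A new theory replaces an old one only when serious faults exist."),
]

# Sorting the cards into categories by keyword, as the text suggests:
by_keyword = {}
for card in cards:
    for kw in card.keywords:
        by_keyword.setdefault(kw, []).append(card)

print(sorted(by_keyword))   # ['sources', 'theory']
```

The same sorting-by-keyword idea carries over directly to physical index cards or to reference-management software.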
As the research evolves, it may be possible to have the rate of gathering information go
faster than the rate at which the information can be handled, even if the researcher is doing
speed reading. And a researcher may experience information overload when the related
literature is overwhelming (see Figure 2-4). A researcher need not include every available
publication related to the research topic or hypothesis. However, it may be important to keep
everything one finds in the literature. Even publications that look quite remote from a research
project are likely to be useful in the review section of the research report, where readers may
get a sense of the extent to which a researcher knows the topic. If a publication will not be cited
eventually in the research report, it may even be helpful in a future research undertaking.
It may be important to seek help in planning out the research, including the review of
literature, especially from those who can provide both understanding and constructive
criticisms. A research supervisor could provide guidance on the length of the literature review
section of the research proposal and the research paper. A researcher could use such advice to
select sources, working backwards from the publications most closely related to the research
problem. The selection of sources involves judgments of importance and value, and so includes
a large element of subjectivity.
The order and prominence of individual sources vary, depending on the researcher’s
perspective; i.e. according to the question or hypothesis he/she is going to work on. There is
also some general preference to use recently published sources, and to use sources that have
undergone refereeing.
While there may be some apprehension about the extent of subjectivity employed in
selecting related literature, this need not worry a novice in research. Since a literature review
serves definite functions (e.g., to provide frameworks and methods, and to help
identify research problems), the selection of related literature may be guided by asking the
following broad questions: (a) What exactly has this source contributed? (b) How does that
contribution relate to the research at hand? (c) Is the source valid, and are there a variety of
other sources? Meliton (2002) even lists a few more specific questions:
• Who were the respondents? How many? How were they selected? What were they
made to do?
• What was the research design? What is the experimental treatment used?
• What data gathering instruments were used? Were they reliable and valid tools?
• What were the major findings? Were they logical? How were the data interpreted?
Was the interpretation appropriate and accurate?
The answers to the broad and specific questions above may be subsequently used when writing
up the final form of the research proposal as well as in the drafts of the research report.
Ordinarily, a research report will contain a section on the literature review after the
statement of the problem and its introductory components but before the theoretical framework.
The review must be systematically organized. Similarities and differences among the literature
cited should be pointed out. Parts of the cited literature may also be included in the
write-up of the review of research literature, depending on their relevance to the
research.
research. For conceptual literature, hypotheses, theories and opinions are usually discussed.
(Meliton, 2002) After all the readings and interviews done, a researcher will have to ask
whether he/she now has a better idea of what the result of the research will be.
A research hypothesis is an assertion or proposition about a research question whose validity
shall be tested in the course of the research. For each research question, the researcher must
have a hypothesis, i.e., a tentative answer to that question.
Usually the literature review has given background material that justifies the particular
hypotheses to be tested. The research will confirm, contradict or cause a modification of the
research hypothesis. A hypothesis will not merely be rejected simply because it cannot explain
certain facts. In his account of the history of scientific thought, Thomas Kuhn (1970) explains
that for a new theory to replace an old theory,
two conditions must be satisfied: (1) there must be serious faults in the old theory, i.e. the old
theory does not explain important facts; (2) a new theory must be available. Why should a
theory, even one that has evident flaws, be rejected unless something better is available with
which to replace it? Even Albert Einstein is reported to have uttered irritation with the then
current scientific paradigms when he remarked “The only time science progresses is when old
professors die.”
A research hypothesis is typically grounded on previous researches or first principles, and the
research is to be designed to challenge the hypothesis against fact. A good research hypothesis
ought to:
• be testable against empirical fact;
• be as simple as possible.
Research Question: Are households with large family sizes more likely to be poor than
households with small family sizes?
Research Hypothesis: (Yes,) Households with large family sizes are more likely to be poor
than households with small family sizes.
Notice that this research hypothesis specifies a direction in that it predicts that a large
household will be poorer than a small household. This is not necessarily the form of a research
hypothesis, which can also specify a difference without saying which group will be better than
the other. However, in general, it is considered a better hypothesis if the direction is specified.
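As a sketch of how such a directional hypothesis might eventually be tested, the following applies a one-sided two-proportion z-test to invented counts. The figures are purely illustrative, not survey results; an actual test would use properly weighted survey estimates.

```python
from statistics import NormalDist

# Illustrative counts only -- these numbers are invented for the sketch.
poor_large, n_large = 180, 400   # poor among households with 6+ members
poor_small, n_small = 120, 400   # poor among smaller households

p1, p2 = poor_large / n_large, poor_small / n_small
p_pool = (poor_large + poor_small) / (n_large + n_small)

# One-sided two-proportion z-test; the directional hypothesis predicts p1 > p2.
se = (p_pool * (1 - p_pool) * (1 / n_large + 1 / n_small)) ** 0.5
z = (p1 - p2) / se
p_value = 1 - NormalDist().cdf(z)   # one-sided, matching the stated direction

print(f"z = {z:.2f}, one-sided p-value = {p_value:.5f}")
```

Because the hypothesis specifies a direction, the rejection region lies in one tail only; a non-directional hypothesis would use a two-sided test instead.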
Also, note the deductive reasoning principle behind testing the research hypotheses.
If a theory is true, then controlled experiments could be devised and evidence found to
support the theory. If instead data were gathered first and then, an attempt is made to work out
what happened through inductive reasoning, then a large number of competing theories could
explain the result. This is called post-hoc theorizing; in this case, there is no way of knowing
for certain which theory is correct, and there may be no way of ruling out the competing
explanations with the result that we end up choosing the theory that fits best our existing
biases. Inductive reasoning has a role in exploratory research in order to develop initial ideas
and hypotheses, but in the end the hypotheses have to be tested before they can have scientific
credibility. It is important for the researcher not merely to come up with research hypotheses,
but also to be able to make sure that findings will manage to either reject or not reject these
hypotheses.
Some schools of thought would advocate hypothesis-free research, since just by stating
a hypothesis we are actually constructing a context for the research and limiting its outcomes.
This is often advocated especially in the social sciences, where researchers may immerse
themselves in their observations and allow the theory to follow the observations. However,
there is some concern about this philosophy of research, as no person can escape their life
experiences, which form an implicit set of hypotheses; consciously or not, the researcher will
tend to design the research to test these hypotheses. In addition, the total ignorance philosophy
can be rather inefficient.
Research hypotheses, unlike assumptions, can be tested. Assumptions are often taken
for granted, and not even mentioned, especially established laws. However, there may be
assumptions in a research that may be made for convenience or those that cannot be tested
within time, budget, or design constraints of the research. These assumptions need to be
explicitly stated, and they certainly put a scope to the research and its results.
The research procedures used should be described in sufficient detail to permit another
researcher to replicate the research. The procedures must be well-documented and the
methodology/design must be transparent.
After careful, logical and critical evaluation of the research problem, it is important to
formulate the research plan that will guide the execution of the research to explore or test
specific ideas. This involves identifying the conceptual foundation of the research, as well as its
operationalization. The research design also features a planning of issues pertaining to the
types of data to be collected and analyzed, their scales of measurement, and the methods of
data collection. What is important is that, in principle, a research design, when executed, will be
able to answer the research questions. The plan for a research investigation in official statistics
involves:
• Outlining summary tables, graphs, charts, pictures, and figures designed for the
investigation
• Planning for the generation of a complete, accurate, and reliable report of the
investigation.
A fundamental question that must be dealt with in a research is: How do I measure the
concepts the research is committed to? This involves taking the constructs or concepts to be
tackled within the research and translating them into measurable terms.
Conceptual definitions are definitions that describe concepts using other concepts.
Ordinary dictionary definitions will not be enough; it is important to come up with clear and
precise definitions. Often there are common definitions of concepts used in the area of
research. If the researcher opts to
adopt a conceptual definition that is novel, he/she must have intimate familiarity with the
research topic. Aside from conceptual definitions, the conceptual foundation of a research also
includes established laws (e.g., law of diminishing returns, law of gravity, weak law of large
numbers), which can be viewed as theories whose falsification, within its background, is
almost unthinkable.
Concepts allow scientists to classify and generalize; they are components of theories
and define a theory’s content and attributes (e.g., a central concept of the theory of governance
is “power”). Care is needed when using certain established concepts. Some of these concepts
may be variables, i.e.,
concepts or constructs that can vary or have more than one value. Some variables can be very
concrete such as gender and height but others can be quite abstract, such as poverty and well
being, or even IQ. In contrast, concepts that do not vary are constants.
Example 2.4.1
Laws are generalizations that have been established from repeated observations and are
supported by empirical data; they are not established deductively, however. A theory is a
set of propositions in which the concepts of a theoretical system are interrelated in a way
that permits some to be derived from the others.
A theoretical or conceptual framework is used to describe the elements in the system being
investigated and to explain phenomena. It is important to use clear terms (e.g., point form or a
diagram) to present the theoretical/conceptual framework and explain the state of the system
being investigated. For example, Figure 2-4 illustrates the
conceptual framework of political systems. Also, a researcher needs to specify how the
program of research is relevant to examining and extending that theory/framework, and show
clearly in the research proposal how each part of the proposal addresses a specific feature of
that theory/framework. There ought to be some indication about how the theory/framework
will change as a result of the research, or alternatively, what question/s about the
theory/framework the research will answer.
Figure 2-4. The political system: inputs (demands and support) enter the political system,
which converts them into outputs (decisions and actions).
It is all well and fine to hypothesize relationships and state them in the research’s
conceptual framework, but providing evidence for the research hypotheses will remain at a
standstill until the constructs and concepts involved are transformed into concrete indicators or
operational definitions that are specific, concrete, and observable. Such operational definitions
usually involve numbers in them that reflect empirical or observable reality. For example, one
may operationally define a household as poor if its per capita income is less than some
threshold (say $1 a day in purchasing power parity terms). Most often, operational definitions
are also borrowed from the work of others.
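The operational definition just given translates directly into a computable indicator. A minimal sketch, with an illustrative threshold and invented income figures (an actual study would use official poverty lines and survey incomes):

```python
# Hypothetical operationalization of "poor": per capita daily income below
# a threshold, here $1/day in PPP terms as in the text.
POVERTY_LINE = 1.0   # dollars per person per day, PPP

def is_poor(daily_income: float, household_size: int) -> bool:
    """Household-level indicator: poor if per capita income is below the line."""
    return daily_income / household_size < POVERTY_LINE

print(is_poor(5.0, 6))   # True:  per capita income is about $0.83/day
print(is_poor(5.0, 4))   # False: per capita income is $1.25/day
```

Note how the unit of analysis (the household) is built into the definition; choosing the individual instead would change both the function and the resulting statistics.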
The most important thing to remember is that there is a unit of analysis: household, individual,
societal, regional, provincial/state, to name a few. Thus, aside from formulating a conceptual
framework, a researcher must also formulate an operational framework describing the procedures
to be followed in order to establish the existence of the phenomenon described by the concepts.
The operational framework follows the conceptual framework, except that the concepts in the
conceptual framework are replaced by the variables that operationalize those
concepts. There are no hard and fast rules about how to operationalize constructs. In some
cases, constructing the operational framework is rather simple; consequently, there is no
difficulty in choosing variables that are familiar to a researcher or those which can be readily
accessed. The operational definitions allow for
concepts in the conceptual framework to be specified into the variables or for operational
measurements of these concepts. Some operational definitions can easily produce data. It may,
for instance, be fairly easy to get someone to fill out a questionnaire if there is no ambiguity in
the questions. It may also be important to use operational definitions that are convincing to
readers, especially those who will be evaluating the research. It may be important to seek help,
i.e., to have someone criticize the operationalization of the framework and definitions. It is thus
often a good idea to use definitions that were used by other researchers (especially noted ones)
in order to invoke precedents in case someone criticizes the research. (Of course, the past work
must itself be methodologically sound.) It is also important to consider the level of
measurement, i.e., that there is varying precision by which a variable is measured. Following
the classic typology of Stevens (1951), anything that can be measured falls into one of four
types; the higher the type, the more precision in measurement; and every level of measurement
possesses the properties of the levels below it:
• nominal: describes variables that are categorical, i.e., the values of the data being
collected fall into mutually exclusive and exhaustive categories with no intrinsic order,
such as sex, marital status, religion, language spoken at home and race.
• ordinal: describes variables that are categorical, and also can be ordered or
ranked in some order of importance. Most opinion and attitudinal scales in the
social sciences are ordinal.
• interval: describes variables that can be ordered and whose adjacent values are an
equal distance apart, but which have no fixed zero point. If we
ask individuals if they were first, second, or third generation immigrants, the
assumption here is that the number of years between each generation is the same.
• ratio: describes variables that have equal intervals and a fixed zero (or
reference) point. Weight and age have a non-arbitrary zero point and are thus ratio
variables. Attitudes, in contrast, do not have a true zero: one cannot have zero
attitudes on things, although qualifications such as “not at all”, “often”, and
“very much” can be used.
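The practical consequence of Stevens’s typology — that the scale of measurement limits which operations are meaningful — can be illustrated with a short sketch. All values below are invented for illustration.

```python
# How the four scales constrain valid operations (all data invented):
religion = ["Catholic", "Muslim", "Buddhist", "Catholic"]   # nominal: count, mode
satisfaction = ["high", "low", "medium", "medium"]          # ordinal: rank, median
temperature_c = [10.0, 20.0, 30.0]                          # interval: differences
weight_kg = [50.0, 75.0, 100.0]                             # ratio: ratios valid

# Ordinal values can be ranked through an explicit ordering, but the distance
# between adjacent ranks carries no meaning.
order = {"low": 0, "medium": 1, "high": 2}
ranked = sorted(satisfaction, key=order.__getitem__)
print(ranked)                          # ['low', 'medium', 'medium', 'high']

# Ratio statements are valid only on a ratio scale: 100 kg is twice 50 kg,
print(weight_kg[2] / weight_kg[0])     # 2.0
# but 30 degrees C is not "three times as hot" as 10 degrees C, because the
# Celsius zero point is arbitrary (an interval scale).
```

Analysis software makes the same distinctions: taking the mean of a nominal code, for instance, produces a number but not a meaningful statistic.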
The kind of information a measurement gives determines the kind of measurement scale
that will be used; that is, the kind of analysis that one can carry out depends on the scale of
the data at hand. Models are another means of representing the system under study, whether
inside or outside the theoretical framework. In the natural sciences, such as chemistry, models
are often expressed in terms of inputs and outputs. Models help to understand the research
undertaking. They guide the entire
research process and help explain the research idea to others. A good model captures those
aspects of a phenomenon that are relevant for the research investigation; clearly, though, a
model that is too simple may oversimplify the problem and may not be well suited to
application. The reward for ignoring what is irrelevant for present purposes is a simplified
understanding of the phenomenon.
Models go beyond mathematical models. Medical researchers typically use some
aspect of the physiology of a mouse as a model for physiology of humans. Of course, medical
models of persons based on animals can be misleading. These models provide clues that must
be tested out in direct investigation with human subjects. In the social sciences, models may
consist of symbols rather than physical representations: i.e., the characteristics of some
empirical phenomenon, including its components and the relationship between the components,
are represented as logical arrangements among concepts. The elements for a model can be
drawn from related literature, personal experiences, consulting with experts, existing data sets,
and pilot studies. Figure 2-5 illustrates a model for the policy implementation process; it
illustrates the dynamics of interaction with environmental factors through specific transactions
and the reactions to those transactions. A model, however simplified, must share with the
system that it represents enough features to be useful for the research investigation.
There are two strategies that can be adopted: either the theory comes before the research (with
a deductive approach employed), or the research comes before the theory (with an inductive
approach employed). In the theory-before-research strategy, the researcher would:
• Select a proposition derived from the theory or model for empirical investigation.
• If the proposition derived from the theory is rejected by the empirical data, make
the appropriate modifications to the theory or model.
• If the proposition is not rejected, select other propositions for testing or attempt to
improve the theory.
The alternative strategy puts research before theory. As shown in Figure 1-2, the research
process is cyclical. This iteration between induction and deduction continues until the
researcher is satisfied (and can satisfy readers of the research report) that the theory is
complete within its assumptions.
Methods and procedures are chosen in order to answer the research questions identified.
This is why research questions need to be specific. For each research method selected, a
discussion must be given in the final research report regarding: (a) the name and description of
the method chosen, possibly with reference to some literature that describes it, if it is not
popularly known; (b) why this method was chosen: the applicability of the method to the
research, and its preference over other methods; (c) what are the assumptions for
applying this method or procedure, and how are they met in the research.
Some research involves design, and design requires creativity. That is, the research must create
something really new, or at least a new synthesis; it
must result in a design that is better than the currently existing alternatives, and the research
report must both define and demonstrate this advantage. And to this end, criteria have to be
identified for assessing the design. An example of a “design” research is Sothearith’s proposal
for a framework for local development in Cambodia during the first RbTP Research-based
Regional Course. (cf. Chapter 3). The research report had to establish that there is a demand for
local development planning; review existing designs in other countries, such as the Philippines
and identify their shortcomings; and show the proposed development framework. Ideally, it
should have shown an actual test of the framework at the field level. In Chapter 4 of this manual,
we look into some mechanical processes and tools of reporting and summarizing data that
enable a researcher to report research results effectively. The objectives of Sothearith’s
proposal were:
1. To analyze the present capacity of the elected commune council or local government in
systemic statistical management for local development planning;
2. To identify how the National Institute of Statistics can play a more active role in
supporting the commune council; and
3. To draw a policy recommendation and framework to establish or strengthen the local
database system.
One of the critical challenges of development in Cambodia is to improve the capacity of the
commune council in data collection and management. There are three concerns that need to be
addressed by the framework: 1) identify changes in the data collection instruments used, in
order to address emerging issues in development planning; 2) identify ways of presenting data
and analytical techniques for policy decision making; 3) provide an opportunity for the local
government to
discuss and exchange information on commune issues related to the collection, processing and
utilization of data.
Risk Coping and Starvation in Rural China (Yu Xinhua, China)
Even though it is widely recognized that giving farmers more secure land rights may
increase agricultural investment, such a policy might undermine the function of land as a
social safety net and, as a consequence, might not enjoy sustained broad support. This research
explores the role of land as a safety net that helps rural households smooth consumption in case
of shocks. Likewise, it will explore the impact of different land tenure arrangements on
investment and productive efficiency. Therefore, the main objective of this research is to
examine how rural households are able to cope with idiosyncratic shocks to their income,
using panel data from Southern China.
Combined panel data from the annual household survey and cross-section data from a field
survey will be used to construct an econometric model. This model can be used to evaluate the
effect of different land policies.
Small Area Estimation for Kakheti Region of Georgia (Mamuka Nadareishvili,
Georgia)
In Georgia, various microeconomic indicators below the regional level have not yet been
explored. The research proposes to obtain estimates of various indicators at the district level by
using the small area estimation technique. The primary data for the research would be the data
The Mortality Situation of Districts in Central Province of Papua New Guinea (Roko
Koloma, Papua New Guinea)
All the mortality indices in Papua New Guinea (PNG) have been estimated indirectly
using population census and demographic health survey data. This is done at the national,
regional and provincial levels. Attempts to derive sub-provincial (district) estimates have not
yet been made because the smaller sample sizes catered only for national and regional estimates,
and the design of the questionnaire did not allow such estimates to be made.
With the current law in PNG on provincial and local administrative setup, the direction
is now more at the district level. The medium term development plan also clearly states the
need for district planning as the basis for development planning and for meeting policy goals.
Therefore, the research proposal intends to: 1) indirectly estimate selected demographic indices
(IMR, CMR, life expectancy, CDR, ASDR) for the four districts of Central Province; 2)
analyze these indices, comparing the trends and patterns with current and past regional/provincial
level demographic indices for Central Province; 3) identify the need for district level
demographic indicators in planning and decision making based on this study; and 4) support
through this study the establishment of provincial data systems to cater for data collection and
analysis at the district level, whether through censuses, surveys or administrative records.
Basic Needs Poverty Line Analysis (Benjamin Sebastian Sila, Samoa)
With insufficient information/data on poverty in Samoa, an assessment of the quality of
information provided in the Household Income and Expenditure Surveys of 1997 and 2000
shall be undertaken to analyze poverty indicators.
Integrated Fisheries Survey Based on Master Sample Methods: An Alternative to
Monitor Food Security in the Philippines (Reynaldo Q. Valllesteros, Jr., Philippines)
One of the emerging social concerns under the United Nations Millennium
Development Goals is food security, which the Philippine government has prioritized. As
such, the Bureau of Agricultural Statistics of the Department of Agriculture gathers
information through agricultural and fisheries surveys conducted regularly to monitor
the level of food sufficiency among Filipino farmers, fishermen, and the general public.
However, the Bureau's appropriated annual budget cannot cover regular
updating and maintenance of the sampling frame, especially for aquaculture, which requires a
list of aquaculture farm operators by type of aquafarm. Thus, the need to design and construct
a master sample frame (MSF) to reduce the costs of updating and maintenance for all fisheries
surveys is of utmost concern.
The general objective of the study is to conduct desk research toward the
development of an integrated fisheries survey design based on a master sample, using the 2002
Census of Fisheries Evaluation Survey (CFES) databases. Specific objectives are to: 1) design
and construct a prototype master sample frame (PMSF) for fisheries surveys; 2) conduct
correlation and regression analysis to determine the association of the different characteristics
of the operators of municipal/commercial fishing and aquaculture households; 3) identify
indicators/auxiliary variables needed in the development of a prototype Integrated Fisheries
Survey (IFS) design with different modules (aquaculture, commercial fishing, marine
municipal and inland municipal fishing); and 4) develop a prototype IFS design and
recommend it for pilot testing in the forthcoming special projects of the Bureau.
Developing a Poverty Index Using the APIS, MBN and MDG (Aurora T. Reolalas,
Philippines)
The current measurement of poverty indexes in the country, such as the poverty incidence,
poverty and income gaps, and severity of poverty, makes use of the Family Income and
Introduction to Methods for Research in Official Statistics 45 of 179
Expenditure Survey (FIES), which is conducted every three years. In between FIES years, these
measures are not available; instead, the annual food and poverty thresholds are estimated using
a raising factor derived from the latest FIES. Since the Annual Poverty Indicators Survey (APIS) is
conducted in between FIES years, the indicators in the APIS could be used to estimate the
poverty indexes in the country. Thus, the main objective of this study is to develop a poverty
index using the APIS, MBN and MDG.
Since the 1990s, the Royal Government of Cambodia has actively pushed for a number
of reforms toward development, social justice and poverty reduction. One key area identified to help expedite the
reforms is the decentralization and deconcentration of power from the central government to
the local government. It is an instrument meant to bring the government closer to the people
and help further democratize the country and improve service delivery.
The commune, as the lowest level in Cambodia’s four-tier government structure, was
formed after the February 2002 local election to assume the key role of implementor and
executor of the national government development program at the local level. It has since
played a very important role in local development planning. As such, the commune represents
Good commune development planning, however, requires having a set of clear and
accurate data that could present the problems, priority needs, interests and potentials at the
commune level, to be used by local governments like districts, provinces and municipalities as well as by the national government.
It also requires having a well-organized and well-trained commune council that would
be able to build a good data management system at the local level, especially in terms of data
collection and management. This study thus proposes a development framework meant to build and strengthen the present capacity of the relatively
“young and inexperienced” commune council system in Cambodia to enable it to handle and manage commune-level data.
The framework being proposed is largely modeled after the Philippines’ experience in
processing, analysing, formulating and monitoring data and statistical indicators for planning at
the community and local levels. Unfortunately, in view of time and funding
limitations, this study simply presents the framework as developed and does not conduct an
actual field test of it.
Rationale. In order to put the need for the proposed development framework for
strengthening the commune council system in the proper perspective, it is best to present
a brief review/overview of Cambodia’s present statistical system and its manner of collecting data.
The National Institute of Statistics (NIS) under the Ministry of Planning (MOP) of
Cambodia conducts and produces database information at the national level. At the same time, it is tasked
to compile and consolidate statistics on various activities collected by the concerned statistics
and planning units of decentralized offices and other ministries as well as the data sets
differences in definition and coverage, lack of coordination and, at times, poor inter-ministerial cooperation.
With the current decentralization thrust in Cambodia where the commune takes
center stage, commune-based data that are accurate, timely and relevant become very critical.
And while a lot of information is actually already being collected at present at the
commune level, such as the data sets collected and monitored by the various international
organizations and NGOs earlier mentioned, said databases are used only by, and made
available mostly to, these same organizations for their own needs and purposes. There is hardly
any linkage between them and other relevant end users, in particular the commune councils
and the communes themselves as well as other government agencies involved in local planning.
The SEILA program – a key program in the decentralization reform experiment – by its very nature and mandate also
collects commune-related data and information. However, because its main concerns relate to
the provision of public services and investments, funded through the local development fund
(one of SEILA’s core financial transfer mechanisms to the communes), SEILA’s key
information needs focus more on data related to performance evaluation and monitoring,
financial status, and project progress and capacity rather than on data about the
socioeconomic conditions of the commune.
Moreover, the data and information collected by the SEILA program are sent to the
provincial level for entry into one of the databases of the SEILA program and are not kept,
managed and maintained at the commune level. Whatever data are collected are provided only
in report forms to the communes, without turning over the databases to them.
In addition, because the SEILA program is a national program that encompasses all
levels of government administration, not all of the information and indicators that it maintains
are meaningful and useful at each level. For instance, reduction in the level of poverty is a
methodology. Given the above, this study proposes an alternative system of commune-based
data and information called Commune Database Information System (CDIS) whose use will
not be limited to a few specific end users, whose information is not narrowed to
monitoring and evaluation activities at the province, district or even commune level, and
whose information and indicators are specific to the needs of the commune for local planning.
The proposed CDIS is a bottom-up system with data generated to depict local
conditions. It is meant to complement available local information system data with primary data.
Taking off from the Micro Impacts of Macroeconomic Adjustment Policies project in the Philippines,
the CDIS will provide the commune with data that can serve as basis in identifying local needs
and problems, prioritizing them, and developing programs and projects that can improve the
socioeconomic conditions of the community and alleviate the situation of its neediest members.
The CDIS consists of three major phases, namely: (a) pre-implementation which
involves the setting up of the CDIS system, determination of agencies involved, identification
of the indicators needed, orientation of the commune council and community, organization of
the CDIS working group, and preparation of the questionnaire form for the survey; (b)
List of indicators. The data requirements identified for the CDIS framework are
basically those needed for local situation analysis/development planning, broadly classified into
nine categories: (1) demography, (2) housing, (3) education, (4) health/nutrition, (5) peace and
order/public safety, (6) income and livelihood, (7) land use, (8) agriculture, and (9)
transportation. From these, 49 indicators meant to meet the minimum basic needs at the
commune level have been identified. The indicators serve as basic information about the
commune and the families therein, thus enabling action to be taken to address improvements in their condition.
Implications and conclusion. With the key role that the commune councils play in the
current Cambodian decentralization thrust, the development of a CDIS becomes all the more
important. It is, after all, proposed to be a sangkat1-based information system for gathering,
1 Cambodian term for commune
analysing and utilizing data regarding the basic needs of local residents. Moreover, it is a
system where certain members of the commune council and the community themselves are
proposed to be members of the CDIS team which will gather and handle the raw data and take
charge of data updating on a regular basis. The data are also to be processed, analysed and
maintained at the commune level with copies submitted to the district and provincial planning
and statistics levels for integration into their development plans. It thus provides a functional
organization at the commune level that becomes part of the overall information system and at
the same time generates information that enables the community and other agencies involved to
To institutionalize the CDIS would require the participation of all levels of
government. The bottom-up approach has greater chances of succeeding, for instance, if agencies
like the NIS and the MOP take part at the very outset of the CDIS’ development, because
these are the two agencies that will eventually take over the role of the SEILA program once it ends.
These two agencies can thus develop the CDIS to become the national standard
database system for all agencies involved in development planning for communes, districts, provinces and municipalities.
3.2.2. Access and Quality of Basic Education in Myanmar (Khin Khin Moe,
Myanmar)
Good quality basic education is fundamental to acquiring relevant knowledge, life skills
and understanding of one’s environment. The good personal qualities acquired would have a
lasting impact; as such, basic education may be considered the foundation for national development and
economic growth.
Recognizing this, the Myanmar government has set basic education as the center and
focus of its Education For All (EFA) program. Its EFA National Action Plan is aimed toward
the improvement, especially in terms of access, quality, relevance and management, of its
primary and lower secondary levels which are the heart of basic education.
The program envisions a system that can generate a learning society capable of facing the challenges of the knowledge
age. In particular, the EFA program aims to: (1) ensure that significant progress is achieved so
that all school-aged children shall have access to complete, free and compulsory basic
education of good quality by 2015; (2) improve all aspects relating to the quality of basic
education such as teachers, personnel and curriculum; and (3) achieve significant improvement
in the levels of functional literacy and continuing education for all by 2015.
Are these goals being achieved? To answer this, it is important to see if access to and
the quality of basic education in Myanmar are improving, and to determine the trends of certain
education indicators.
Objectives. This paper therefore aims to respond to this concern by assessing the status
of basic education in Myanmar and by examining the trends of 19 education indicators from
1991 to 2001. It likewise endeavors to show the education indicators that would experience
increases with an increase in the government expenditures on education and thereupon presents
projections on them.
Description, data and methodology. Basic education is composed of the primary and secondary school levels. In Myanmar, the basic education
system is referred to as 5-4-2, meaning that it consists of 5 years of primary school, 4 years of
middle school (lower secondary level) and 2 years of high school (upper secondary level). The
entry age for the school system is 5 years old. All schools and universities in Myanmar are run
by the government, although monasteries also offer primary and middle education with the
same curriculum.
Nineteen (19) education indicators are used in the assessment, namely: (1) public
expenditure by social sector, (2) public expenditure on education as a percentage of GDP, (3)
number of villages and percentage of villages with schools, (5) number of schools in basic
education, (6) pupil-teacher ratio, (7) ratio of students to schools, (8) gross enrolment ratio, (9)
net enrolment ratio, (10) percentage of female students, (11) number of students in monastery
education, (12) transition rate, (13) retention rate, (14) repetition rate, (15) promotion rate, (16)
Data are taken from the Myanmar Education Research Bureau, the
Administration Department, and the Department for the Promotion and Propagation of the
Sasana.
Descriptive analysis is used to study trends and patterns over time, while correlation
analysis and regression modeling are used to look into the relationship between the education
indicators and government expenditure and to determine how government spending on
education is expected to affect them.
Results/findings. Based on the trends established for certain education indicators from
1991 to 2001, the study asserts that Myanmar has achieved the goals stated in its EFA action
plan.
More specifically, for goal one, where access to complete, free and compulsory basic
education of good quality is targeted for all school-aged children by 2015, the trends for
all indicators, except for the repetition and dropout rates, are shown to be increasing,
suggesting an improvement in the access to and quality of basic education for all school-aged
children. The increase in education expenditure as a
percentage of both the GDP and total government expenditure is especially noteworthy since it
shows the priority given by the government to education. The same holds for the overall increases in
the trends in the number of schools in basic education as well as in the percentage of villages
with schools. The upward trend indicates the provision of more and better social services for
the sector. The increases in the retention, promotion and internal efficiency rates, accompanied
by decreases in the trends of the repetition and dropout rates, meanwhile, support this observation.
For goal two in terms of improving all aspects of the quality of basic education, the
pupil-teacher ratio trend line as shown in the analysis indicates an improvement in the access of
pupils to teachers, thereby supporting the observation that teachers – who play a key role in
shaping students’ attitudes and outlooks – have become more available. Unfortunately, however, there are no data to show the quality of
the curriculum. Nonetheless, it should be noted that administrators and managers of the
Department of Basic Education are said to regularly conduct curriculum evaluation and revise
the curriculum accordingly.
And for goal three, the continuing increase in the trend line for adult literacy rate is one
indicator of how Myanmar tries to effect improvements in the levels of functional literacy and
continuing education for all its citizens. Myanmar’s efforts in this aspect have, in fact, been
recognized by UNESCO, which has awarded Myanmar international literacy prizes twice.
The regression analysis of the indicators against government expenditures shows the indicators that are expected to increase given an increase in
government expenditure on education. As such, the results help the government identify which
among the education indicators it needs to closely monitor in order to see whether or not it is
attaining its EFA goals.
For instance, the analysis shows that, among others, the gross enrolment
ratio and net enrolment ratio have a high positive relationship with expenditures on education.
This means that for every increase in the expenditure for education, these two indicators are
expected to register increases. And because access to education is a situation wherein there is
ease of enrolment in the first grade of every school level, the extent of access may be
measured by these indicators. Monitoring these indicators therefore enables the authorities to
assess how the country is faring in its attainment of the EFA goal of providing all school-aged
children with access to basic education of good quality.
Finally, the results of the study’s analysis indicate that on the whole, Myanmar’s basic
education has been managed well. Moreover, the results of the projections suggest that it will
serve Myanmar well to continue investing in education, as this will bring
further improvements in the education indicators and help the country attain its EFA goals
by 2015.
Nepal is considered to be one of the poorest countries in the world. With a per capita
income estimated at US$200 in 1995, it is ranked, based on a 1999 World Bank report, as the
ninth poorest country in the world, the poorest in fact outside of Africa. In terms of the human
development index computed by the United Nations, it is also almost at the bottom at number
143 out of 175 countries as per the 2003 global Human Development Report.
The overall living standard of the Nepalese people has thus remained very poor, with
the majority of the population residing in the rural areas, where a big segment is poor. To help
improve living standards, most of the government’s economic development plans and
programs have been directed at poverty reduction.
In order to determine, however, whether or not such plans and programs are succeeding
and meeting the United Nations’ Millennium Development Goal (MDG) of reducing extreme
poverty and hunger by 2015, it is important to have solid and reliable information on the causes
of poverty, where the poor are located and what their means of livelihood are. Said information
is helpful in monitoring the progress of poverty reduction efforts and in designing better
programs.
Objectives and rationale. In view of the above, this study aims to identify the poor in
Nepal, examine the relationship of household and household head characteristics to poverty,
identify the correlates of poverty and assess the extent of their effect on the probability
of a household being poor.
Poverty alleviation has always been the overriding thrust in Nepal’s development
efforts. Yet, despite noticeable progress, with the poverty rate declining from 42 to 38 percent
over the past decade, widespread poverty still remains. In this regard, the government has
made a renewed commitment to bring down the poverty level in Nepal from the baseline figure
of 38 percent at the beginning of the renewed commitment plan period (2002) to 30 percent by
Such a goal is a daunting task by itself because the problem of poverty has persisted for
decades. Being a deeply rooted and complex phenomenon, poverty cannot easily be
eliminated; it is thus important to identify the factors that
are associated with being poor, the so-called determinants of poverty in Nepal.
Description, data and methodology. Nepal is nestled between two populous countries.
To its east, south and west is India, while to its north is China. It is a landlocked country.
Geographically, Nepal is divided into 3 regions, namely, (a) Mountain, (b) Hill, and (c)
Terai. Seven percent (7%) of the population live in the Mountain area while 44 and 49 percent
live in the Hill and Terai areas, respectively. There are 5 development regions and 75
administrative districts, the latter being further divided into smaller units called village
development committees.
Poverty varies between urban and rural areas, and across geographical regions, with
rural poverty (44 percent) almost twice as high as urban poverty (23 percent), and
with poverty more pronounced in the Mountain areas (56 percent) than in either the Hill (41
percent) or Terai areas.
Basic data for this study are taken from the Nepal Living Standard Survey (NLSS) of
1995-96 which consists of a sample of 3,388 households. The sample is distributed among the
Mountain area (424 households), Urban Hills (604 households), Rural Hills (1,136 households)
For its framework, the study draws from the various household and household head
characteristics to determine the correlates of poverty and assess how these influence the probability of a
household’s being poor. At the same time, the study examines how the probability of being
poor responds to changes in these characteristics.
The study develops a poverty profile of Nepal with the use of the three main indices of
poverty, namely, the headcount index, poverty gap index and poverty severity index. Based on
the profile, some household and household head characteristics are initially hypothesized as
factors affecting a household’s per capita consumption. With the help of a multiple regression
model, the correlates of poverty are then identified. Going beyond, the study then carries out a
simulation exercise.
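The three poverty indices used by the study (headcount, poverty gap and poverty severity) are commonly computed as members of the Foster-Greer-Thorbecke (FGT) family of measures. A minimal sketch follows, in Python purely for illustration; the per capita consumption figures and the poverty line are hypothetical, not NLSS data.

```python
# FGT poverty measures: alpha = 0 gives the headcount index, alpha = 1 the
# poverty gap index, and alpha = 2 the poverty severity index.

def fgt_index(consumption, poverty_line, alpha):
    """Average of ((z - y) / z)^alpha over households below the line z."""
    total = 0.0
    for y in consumption:
        if y < poverty_line:
            total += ((poverty_line - y) / poverty_line) ** alpha
    return total / len(consumption)

percapita = [30, 45, 80, 120, 55, 20, 150, 90]  # hypothetical per capita consumption
z = 60                                          # hypothetical poverty line

headcount = fgt_index(percapita, z, 0)    # share of households below the line
poverty_gap = fgt_index(percapita, z, 1)  # average relative shortfall
severity = fgt_index(percapita, z, 2)     # squared shortfall, weights the poorest more
print(headcount, poverty_gap, severity)   # → 0.5 0.1875 0.0954...
```

Note how the severity index rises fastest for the household farthest below the line, which is why it is used to capture the intensity, not just the extent, of poverty.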
Certain caveats, though, have to be taken into consideration. For one, there appears to
be some degree of measurement error in some variables in the NLSS data, which
compromises the scope and quality of this study’s analysis. Also, some other
potential determinants of poverty are not included in the data collected and information on
some variables like ethnicity, religion and household head are not complete for all households.
Results/findings. The profile drawn by the study of Nepal’s poor yields, among others, the following findings.
Geographically, the poor are shown to reside mostly in the rural areas, with the
poverty rate there (44 percent) double that in the urban areas (23 percent). Poverty is also more
severe in rural than in urban areas. Across regions, the mid-western and far western regions
appear to have higher poverty incidence as well as more intense and severe poverty than other
regions.
All three poverty indices (the headcount index, poverty gap index and poverty severity index) are shown to be higher as household size
increases. Thus, it may be surmised that households with a large household size are more
likely to be poor than those with a smaller size. As to ethnicity and religion, the analysis
indicates that the occupational caste groups such as Kami, Magar and Damai are poorer than
other castes, and that Muslims are more likely to be poor than those of other religions.
Household head characteristics such as literacy, sex and employment status are shown to have significant effects on poverty status, with literate and
female heads, for instance, being able to manage their incomes better and therefore having less
chance of being poor. For social amenities/facilities, on the other hand, households with no
Meanwhile, the results of the modeling and simulation exercises seem to strengthen the
picture drawn from the profile. At the same time, they indicate the variables with the highest
impact on the program of poverty alleviation. The major conclusions indicate that the
of one Nepalese rupee in support of their investment, lead to large improvements in the living
2) Increase in some education variables such as adult female literacy, maximum level
schooling of a parent;
3) Reduced time to access facilities such as banks, health posts, roads and the like as well as access to safe drinking water.
Government efforts must therefore be geared toward lowering dependency within households,
reducing household size, improving the literacy of adult females, providing more opportunities
for employment to the working age population, reducing the mean time to access social and
general infrastructure in equal manner in all regions and the rural sector, and avoiding gender
discrimination in the case of education. These efforts all have large impacts in improving the
living standards of the poor.
In sum, one of the most important determinants of poverty, as shown from the results,
is education or its lack thereof. More education, particularly in the higher or tertiary level of
education, for the people enlarges the likelihood of their finding work or employment
opportunities and thereby earning more money. As such, government must invest more in
education programs. And while investments in education inherently have long gestation periods, the
simulations nonetheless show that they can be a powerful instrument in the long-term fight
against poverty.
In terms of reducing household size and the number of dependent household members,
meanwhile, the government must promote national awareness programs for reducing fertility.
Finally, because the model used in the study is of a static nature, the simulations
provide no indication of the time frame for changes and improvements to take place and they
might only be felt after a long gestation period. Nonetheless, the analysis provides Nepalese
policy planners with objective measures of the potential poverty reduction impacts of the
interventions considered. Planners must therefore view the results of this study as a possible
guide in allocating resources.
As we go about our research, our conceptual framework will lead us to process our data
into interpretable form and translate it back into meaningful information that is more orderly
and understandable than prior to the data analysis. This allows us to either confirm or revise
our research hypotheses.
Various univariate and multivariate statistical methods (cf. Tables 4-1 and 4-2) can be
used to answer research questions such as: Can the patterns observed in sample data be
generalized to the entire population from which the sample was drawn? How is a dependent
variable related to a set of independent variables?
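The first kind of question, generalizing from a sample to a population, is what tests of statistical inference answer. A minimal sketch of a one-sample t-test, in Python purely for illustration and on made-up measurements (the critical value 2.131 is the standard two-sided 5% value of Student's t with 15 degrees of freedom):

```python
import math

# Hypothetical sample of 16 measurements; H0: population mean = 100.
sample = [102.1, 99.8, 104.5, 101.2, 98.7, 103.3, 100.9, 102.8,
          99.5, 101.7, 103.9, 100.2, 102.4, 98.9, 101.1, 103.0]
mu0 = 100.0

n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
t = (mean - mu0) / math.sqrt(var / n)                 # t statistic

t_crit = 2.131  # Student's t, df = 15, two-sided 5% level
verdict = "reject H0" if abs(t) > t_crit else "cannot reject H0"
print(f"t = {t:.2f}: {verdict} at the 5% level")
```

Here the observed pattern (a sample mean above 100) is strong enough, relative to the sampling variability, to be generalized to the population.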
Some statistical methods of analysis do not necessarily answer questions directly, but
may be used to reduce a dataset into fewer variables (including principal components analysis
and canonical correlation analysis). Other statistical tools are used for specific purposes or types of data,
including survival analysis models (for analysis of failure or lifetime data), artificial neural
networks and time series models (for forecasting) and geostatistical models (for analyzing
spatial data). All the statistical methods of analyses to be discussed in this chapter help in
answering some research questions and in transforming raw data into meaningful information.
Various types of data require various types of analyses, even when data are merely
being summarized. For instance, averaging data of a categorical nature may not be
meaningful. Cross-section regressions, time series models and panel data analyses need to
undergo validation to ensure that the models will provide useful insights into and forecasts of a
phenomenon.
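The point about categorical data can be made concrete with a small sketch (the status codes below are hypothetical, and Python is used purely for illustration):

```python
from collections import Counter

# Hypothetical coding: 1 = employed, 2 = unemployed, 3 = not in labor force.
status = [1, 1, 3, 2, 1, 3, 3, 1, 2, 1]

bad_summary = sum(status) / len(status)   # 1.8 -- corresponds to no category
good_summary = Counter(status)            # frequencies are the right summary
mode = good_summary.most_common(1)[0][0]  # most frequent category

print(bad_summary)          # → 1.8, meaningless as a "status"
print(dict(good_summary))   # counts per category
print(mode)                 # → 1 (employed), a defensible summary
```

The "average status" of 1.8 falls between codes and describes nobody; frequency counts or the mode are the appropriate summaries for such data.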
In this day and age, statistical analysis can readily be performed with the aid of
computer packages. Some basic statistical methods, including correlation analysis and simple
linear regression, can even be performed on electronic spreadsheets such as Microsoft Excel,
which were developed as interactive calculators for organizing information in columns
and rows (see Figure 4-2). Be aware, though, that Excel’s statistical features have limitations.
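The spreadsheet calculations involved are simple enough to spell out. The following sketch reproduces what Excel's CORREL, SLOPE and INTERCEPT functions compute, on made-up (x, y) data and in Python purely for illustration:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # cross-products
sxx = sum((xi - mx) ** 2 for xi in x)                     # sum of squares of x
syy = sum((yi - my) ** 2 for yi in y)                     # sum of squares of y

r = sxy / math.sqrt(sxx * syy)  # Pearson correlation (Excel: CORREL)
b = sxy / sxx                   # least-squares slope (Excel: SLOPE)
a = my - b * mx                 # intercept (Excel: INTERCEPT)
print(r, b, a)
```

On these data the fitted line is y = 0.15 + 1.95x with a correlation very close to 1, which is exactly what entering the same columns into a spreadsheet would return.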
Dedicated statistical packages offer far more extensive capabilities. There are also a number of
statistical software packages that can be used for special purposes, including the US Bureau of
the Census’s two data processing software packages, IMPS and CS-PRO, and Epi Info from the
US Centers for Disease Control. Other special purpose software include the seasonal adjustment
program X12, the software WINBUGS for Bayesian analysis with the Gibbs sampler, and the
econometric modeling package Eviews (http://www.eviews.com/eviews4/eviews4/eviews4.html).
Commercial software typically requires a license that needs to be paid for either yearly or as a
one-time cost; these costs currently range from about 500 to 5,000 US dollars. Freeware, on the
other hand, can be readily downloaded over the web without cost, but usually offers less support.
                                  Purpose
                      General              Special
Cost   Commercial     SAS, SPSS, STATA     Eviews (Econometric Modeling)
       Freeware       R                    WINBUGS (Bayesian Modeling);
                                           X12 (Seasonal Adjustment);
                                           IMPS and CENVAR (Survey Data
                                           Processing and Variance
                                           Estimation); Epi Info
The popular commercial statistical software include SAS, SPSS, STATA, and Eviews,
while Epi Info, R, IMPS, CENVAR, TRAMO-SEATS, X12 and WINBUGS are free statistical
software. Commercial software may be advantageous to use over freeware because of support
and training.
Software documentation is often not suited for use as a training manual: typically, documentation
is only a reference manual, and the software user must learn the package from a formal training
course or from someone familiar with the software. In this training manual, we focus on the use
of the commercial statistical software STATA (pronounced stay-tah) version 10 since, among the
popular packages, it is especially designed for research on data generated from complex survey
designs (as is done in many NSOs).
A STATA session is initiated with the appearance of four main windows (cf. Figure 4-
3) consisting of the Variables window (on the lower left hand corner) which displays the
variable list of the data set in the active memory, the Review window (on the upper left hand
corner) which provides a list of past commands, the Results window (in the upper middle area),
and the Command window (in the lower middle area) where commands are typed.
Figure 4-3. Four main STATA windows: Review window (upper left hand corner),
Variables window (lower left hand corner); Results window (upper middle corner);
Command window (lower middle corner)
A variable name in the Variables window can be pasted into the Command window
or an active dialog field by merely clicking on the variable name. Also, if you click a
command in the Review window, it gets pasted into the Command window, where you can edit
and execute it. You may also scroll through past commands in the Review window. The
contents of the Review window may be saved into a “do-file” by clicking the upper left Review
window button and selecting Save Review Contents.
These windows may be resized, reshaped and moved about on the screen by left-clicking and dragging with the mouse. It is useful, however, to have the Results window be the largest in order to see as much as possible of your STATA commands and output on the screen. If you are creating a log file (see below for more details), its contents can also be displayed on the screen; this is sometimes useful if one needs to back up to see earlier results from the current session. The font or font size may be changed in each window by clicking the upper left window button (or right-clicking on the window) and then choosing Font in the resulting pop-up window; select the desired font for that type of window, preferably a fixed-width font, e.g., Courier New 12 pt. If the settings are lost, you can easily reconstruct them in the same way.
By default the Results window displays a lot of colors, but you can readily choose a set of predefined colors, or customize the colors. Henceforth, we use the white background color scheme.
In addition to these four major windows, the STATA computing environment has a
menu and a toolbar at the top (to perform STATA operations) and a directory status bar at the
bottom (that shows the current directory). The menu and toolbar can be used to issue different
STATA commands (like opening and saving data files), although most of the time it is more
convenient to use the STATA Command window to perform those tasks. Since version 8, STATA has had a graphical user interface (GUI) with point-and-click features that allows users to directly perform data management (with the Data menu), obtain some data analyses (with the Statistics menu) or generate all sorts of graphs on a data set (with the Graphics menu). In addition, note that STATA opens a Viewer window when online help is requested.
The syntax of STATA commands can be readily learned, perhaps more easily than other statistical packages' syntax. With the command driven interface, actions, e.g.,
summary tables and graphs, are set off by commands which can be issued either in interactive
or batch mode. With the interactive mode, users can issue commands line by line and yield
results after each command line is issued. Commands are also retrieved with the Page-Up and
Page-Down keys or retrieved from the list of previous commands. With the batch mode, users
can run a set of commands and have the results all saved into a file (or shown on screen).
Documentation of the data management and analysis is easy to do, in contrast to working with spreadsheets (where we can do many things, but where the documentation of what we did is difficult to keep). The methods and tools provided in the software allow us to perform descriptive and/or exploratory analyses as well as formal statistical modeling.
The statistical methods that can be run in STATA include classical and logistic
regression, simultaneous equation methods, nonparametric curve fitting, statistical models for
analyzing ordinal, count, binary and categorical outcomes, a number of multivariate routines
such as cluster analysis and factor analysis, various econometric methods for analyzing time
series data, panel data and multivariate time series, including ARIMA, ARCH/GARCH, and
VAR models. STATA also allows users to add their own procedures. New commands and
procedures are regularly distributed through the net and/or the Stata Journal. STATA will thus
perform virtually all of the usual statistical procedures found in other comprehensive statistical
software, and its strength lies in its commands for survival analysis, panel data analysis, and the analysis of survey data.
STATA readily accommodates survey data with designs more complex than SRS, such as stratified designs and cluster sampling designs. Unlike the specialized survey software SUDAAN, STATA is a general-purpose package with built-in commands (called svy commands) for various analyses of survey data, including
svy:mean, svy:total, svy:ratio, and svy:prop for estimating means, totals, ratios, and
proportions. Point estimates, associated standard errors, confidence intervals, and design effects
for the full population or subpopulations can be generated with these commands. Statistical methods for estimating population parameters and their associated variances must take the sample design into account. The default methods in most general-purpose statistical software assume that the data meet certain assumptions, especially that the observations are generated through a simple random sample (SRS) design. Data collected through sample surveys, however, often have sampling schemes that deviate from these assumptions. Not accounting for the impact of the complex sample design can lead to a serious underestimate of the sampling variance associated with an
estimate of a parameter. The primary method used for variance estimation for survey estimates
in STATA is the Taylor-series linearization method. There are, however, also STATA
commands for jackknife and bootstrap variance estimation, although these are not specifically
oriented to survey data. Note that several software packages specifically for analyzing survey data exist (such as SUDAAN), and general-purpose statistical packages, viz., SAS and SPSS, have recently developed the capacity to analyze such data as well.
There is an ongoing debate as to whether the sample design must be considered when deriving statistical models, e.g., linear regression, logistic regression, probit regression, tobit, interval, censored, instrumental variables, multinomial logit, ordered logit and probit, and Poisson regression, based on sample survey data. Analysts interested in using statistical techniques such as linear regression, logistic regression, or survival analysis on survey data are thus faced with a choice. The position adopted in STATA is to incorporate into the model the variables that were used to define the sample design; survey versions of estimation commands such as svy:regress, svy:logit, and svy:probit are available for the regression, logistic regression, and probit analysis procedures.
The command svydes allows the user to describe the specific sample design and should
be used prior to any of the above commands. Note that although STATA can calculate standard errors based on bootstrap replication with its bstrap command, its bootstrapping procedure assumes the original sample was selected using simple random sampling; it is therefore not suited to complex survey data. Aside from panel data estimation techniques (in the family of xt commands such as xtdes, xtreg), there are now a growing set of sophisticated methods for analyzing time-series data, including ARIMA models; autoregressive conditional heteroskedasticity (ARCH) models; and vector autoregressions (VARs) and structural VARs, impulse response
functions (IRFs), and a suite of graphics capabilities for the presentation of IRFs. Both user–
specified nonlinear least squares and maximum likelihood estimation capabilities are provided
in STATA, although there is currently no support for nonlinear systems estimation (e.g.,
FIML).
Whether for time–series or panel data analysis, STATA allows time–series operators,
including L. for lag, F. for lead, and D. for difference. They may be combined, and may take a
numeric argument or range: for example, L2D3.y would refer to the second lag of the third
difference of the variable y. Aside from ARIMA and ARCH estimation, the prais command is
available for estimating regressions with AR(1) errors via the Prais–Winsten or Cochrane–
Orcutt methods, while the newey command estimates regression models with Newey–West
(HAC) standard errors. The STATA arima command is actually capable of estimating a variety
of models beyond the standard Box–Jenkins approach, such as a regression model with ARMA
disturbances. You may specify a particular optimization method. The arch command likewise
goes beyond ARCH, and actually includes GARCH, ARCH–in–mean, Nelson’s EGARCH,
threshold ARCH, and several forms of nonlinear ARCH. Diagnostic tools available for univariate time series include a number of unit root tests, several frequency domain measures, and smoothers (both seasonal and nonseasonal). Nonlinear filters are also provided. Multivariate tools include VAR estimation, structural VAR estimation, and the generation of diagnostic tests (such as Granger causality tests).
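The time-series operators just described are simple arithmetic on a series. As an aside, they can be sketched in plain Python (this is an illustration of the arithmetic only, not STATA code; the data and function names are our own):

```python
# Illustration (in Python, not STATA) of the time-series operators
# L. (lag), D. (difference), and combinations such as L2D3.y,
# the second lag of the third difference of y.

def lag(y, k=1):
    """k-th lag of a series: shift values down by k positions."""
    return [None] * k + y[:-k]

def diff(y, d=1):
    """d-th difference: apply first differencing d times."""
    for _ in range(d):
        lagged = lag(y, 1)
        y = [None if (a is None or b is None) else a - b
             for a, b in zip(y, lagged)]
    return y

y = [1, 4, 9, 16, 25, 36, 49]   # y_t = t^2, so the 3rd difference is zero
l2d3 = lag(diff(y, 3), 2)       # the analogue of STATA's L2D3.y
```

Leading observations are lost to lagging and differencing (shown as None above), which mirrors the missing values STATA generates at the start of a series.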
STATA allows users to write their own add-ons and code, and to reuse and share these codes and extensions with other users. STATA thus provides for a lively exchange of ideas and experiences among its users, who are largely from academic, research and government institutions, as well as international organizations, while other statistical software such as SPSS increasingly targets the business world. One big advantage of STATA over other statistical software is that it can incorporate a sample survey design into the estimation process. It also provides a very powerful set of facilities for handling panel/longitudinal data.
STATA is offered in four versions: Small, Intercooled Stata, Special Edition (SE), and Multi-processing (MP). Small is a student version, limited in the number of variables (99) and observations (1,000), but otherwise complete in functionality. The Intercooled version is the standard version. It supports up to 2,047 variables in a data set, with the number of observations limited by available RAM (technically, as large as 2.147 billion), as the entire data set is held in memory. Intercooled allows matrices of up to 800 rows or columns. The SE version arose during the life of release 7.0 in response to users' needs for analyzing much larger data sets. Thus, SE allows significantly more variables in a data set (32,767) than Intercooled or Small, and supports larger matrices (up to 11,000 rows or columns). The MP version additionally exploits multiple processors for faster computation on large problems.
People often wonder how Stata compares with other statistical packages. Such comparisons depend on the functions and versions of the software compared. For a comparison of Stata release 9 with SAS 9.1.3 and SPSS 14, see, e.g., the following link: 1_0207.pdf.
In Windows 95, STATA can be run on any 386 or higher PC with at least 8 MB of
RAM and a math co-processor. STATA actually has cross platform compatibility: the software
can run efficiently on Windows (all current versions), Power Macintosh (OS 8.6, 9.X, or OS
X), Alpha AXP running Digital Unix, HP-9000 with HP-UX, Intel Pentium with Linux,
RS/6000 running AIX, SGI running Irix 6.5, and SPARC running Solaris. A dataset, graph, or
add-on program created using STATA on one operating system, e.g., Windows, may be read
without translation by STATA running on a different platform, such as MacOS. If you change
your platform, all your STATA data, commands and graphs will work on your new platform.
There is no need for a data set translation. Stata Corporation also resells a third-party program that allows data interchange between several versions of STATA and the binary formats of many other statistical packages.
STATA provides many facilities to summarize results of the analyses and to assist in diagnostic checking in the use of standard statistical models. Results, including graphics, can be readily copied across to other programs, such as word processors, spreadsheets and presentation software. STATA also provides a powerful graphics language: graphs are highly customizable, and plots of various sorts may be superimposed and juxtaposed. Graphics files
are produced in a native format (with extension name .gph) and may be translated within the software into other formats, such as Windows Metafile, Windows Enhanced Metafile, and Portable Document File, depending on the platform. At present, STATA graphics commands support only two-dimensional plots and do not include contour plots, but the new graphics language enables the development of three-dimensional plots, surface graphs, and the like, and STATA users might expect those capabilities in future releases. In addition, Stata 10 now has the added feature of enabling graphs to be edited (similar to capabilities found in other competitors, such as SPSS for Windows Interactive Graphics).
This range of capabilities makes the STATA software an excellent tool for research in official statistics. Although some modules, e.g., multivariate analysis, still need considerable improvement while others, e.g., time series analysis, are still under some development, STATA meets a very wide range of research needs. Its graphics makes it possible to depend on STATA rather than having to export data to other graphics production packages, and the introduction of a GUI, i.e., a menu-driven interface, which started in version 8, makes STATA accessible to novice researchers with no prior computing experience. Although no single statistical package can serve all needs, STATA's developers clearly are responding to many research needs. Coupled with STATA's cost effectiveness, this package will undoubtedly become more and more valuable and useful to researchers in official statistics.
Data can be comprehensively managed with STATA's versatile commands. There are various data files that can be read into STATA, including worksheets, ASCII files and STATA datasets. The latter have the extension name ".dta"; i.e., a STATA datafile called "hh" has the full filename "hh.dta".
By default, files are to be read by STATA from the folder c:\data (see the directory
status bar in Figure 4-3). If you wish to open a pre-existing STATA dataset in a different
folder, you have to tell STATA to change directory. If you wish to read a file called hh found in the directory c:\intropov\data, you merely have to enter the following commands in the Command window:
cd c:\intropov\data
use hh
The cd command tells STATA to change the working directory, which by default is
“c:\data”, to “c:\intropov\data” while the use command tells STATA to read (and employ) the
specified file. These two commands yield the Results window shown in Figure 4-4. Notice that the first and third lines repeat the commands you entered; the second line reports a successful change of working directory, while the fourth line implies that the use command has been successfully executed.
Figure 4-4. Results window after changing the working directory and reading a preexisting STATA datafile
Instead of first issuing a cd command followed by a use command, you could also read the file hh in folder c:\intropov\data by simply entering the full path in the use command, as in:
use "c:\intropov\data\hh"
Alternatively, you can open a data file by selecting
File ► Open
in the Menu bar, or by clicking on the leftmost icon, an opening file folder, in the Tool bar.
Note that the file hh.dta and other Stata data files used in this manual accompany the exercises for poverty analysis. The data are a subsample of the microdata from the Household Survey 1998-99 that was conducted jointly by the Bangladesh Institute of Development Studies and the World Bank.
If a file to be read in were too large for the default memory capacity of STATA, then an
error message (as in the Results Window of Figure 4-5) would be seen.
To cure such a problem, you would have to change the memory allocated by issuing a command such as:
set memory 20m
Then you can reissue the use command. If the file opens successfully, then the allocated memory is sufficient. If you continue to obtain an error message, you can try 40m or 60m. But be careful not to specify too much memory for STATA; otherwise, the computer will resort to slow virtual memory.
All STATA error messages are short messages which include a code. The code is a link, and by clicking on it, you get more clues for curing the error. Some error messages, however, are less informative.
Note also that the memory allocation command works only if no data set is open. To clear the data set in active memory, issue the command:
clear
In the use command, if your computer is connected to the internet, you can also specify the URL of a data file, e.g.:
use http://courses.washington.edu/b517/data/ozone.dta
Remember that STATA will not read a new file while another dataset is open; if you then wish to open the hh dataset, you will be replacing the ozone file in memory with it by entering:
clear
use hh
which first clears the active memory of data, and then reads the new data set.
You can use the count command to tally the number of observations in the data set.
count
The count command can be used with if-conditions. For the data set hh.dta, you can issue the following command to give the number of households whose head is older than 50:
count if agehead > 50
If you want to see a brief description of the dataset in active memory, you can enter
the describe command as follows:
describe
from the command window, or alternatively, you can select the corresponding item in the Data menu of the Menu bar and then click OK after receiving the pop-up window displayed in Figure 4-6.
Piles of raw data, by themselves, may neither be interesting nor informative. However, when the data are presented in summary form, they may be much more meaningful to us. One important class of summary measures describes location. Among the most useful measures of location are measures of central tendency, such as the mean, which represent typical values, or values where the middle of the data tends to lie. Measures of location other than centrality are also calculated to describe the data distribution, i.e., how the data are "distributed" on the real line. For instance, percentiles are values that
separate the sorted data into 100 equal groups. The fiftieth percentile (also called the median)
is another measure of central location. Several lists of data may have the same mean, but the
spread of the lists may be different. Hence, calculating other features of the data such as a
measure of spread or variation, such as the standard deviation, may also be important. The Russian mathematician Pafnuty Chebychev explained why the standard deviation and the mean are considered very important summary measures: Chebychev's Inequality indicates that for all data sets, (a) at least three fourths of the data are within two standard deviations from the mean; and (b) at least eight ninths of the data are within three standard deviations from the mean. Thus, the default output for summary statistics in most statistical software includes the mean and the standard deviation.
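Chebychev's Inequality can be checked numerically on any data set. The following sketch (plain Python, not STATA; the data are an arbitrary illustration, not the hh dataset) verifies both bounds:

```python
import statistics

# Verify Chebychev's Inequality on a sample: at least 3/4 of the data
# lie within 2 standard deviations of the mean, and at least 8/9
# within 3 standard deviations. (Illustrative data only.)
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9, 12, 15]
mean = statistics.mean(data)
sd = statistics.pstdev(data)          # population standard deviation

def within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

assert within(2) >= 3 / 4             # Chebychev bound for k = 2
assert within(3) >= 8 / 9             # Chebychev bound for k = 3
```

The inequality holds for any data set whatsoever, which is why the mean and standard deviation together already say a great deal about where the data lie.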
If you wish STATA to list some summary statistics, you can issue the command
summarize
from the command window, or alternatively, you can select
Data ► Describe Data ► Summary statistics
and then click OK after receiving a pop-up window. If you wish to obtain a few more summary statistics, such as skewness, kurtosis, and the four smallest and four largest values, along with the mean and standard deviation, you can issue the command:
summarize, detail
If you would like to generate a table of frequencies by some subpopulation, you will have to use the tab(ulate) command. For instance, for the STATA data set hh.dta, if we wish to get the frequencies of sampled households across the regions, we enter the following command:
tab region
Alternatively, you can use the corresponding menu item and use the drop-down to select region (or type the word) in the space for the categorical variable.
If, in addition, you wish to obtain summary statistics for one variable by region, such as the variable distance, then you use the summarize option:
tab region, summarize(distance)
Options are specific to a command. A comma precedes the option list. Note that omitting the comma before the option list results in an error.
Another convenient command is the table command, which combines features of the sum and tab commands and the by option. In addition, it displays the results in a more presentable form. If you want to obtain the mean distance of the nearest paved road and the mean distance of the bank to the dwelling by region (and across the database), then you issue the command:
table region, contents(mean distance mean d_bank) row format(%9.2f)
The format() option changes only the display, not the internal representation of the variable in memory. Note also that the table command can display up to five statistics, and not just the mean.
You can also display two-way, three-way or even higher dimensional tables in STATA. The tab command may be used to generate contingency tables. For instance, if you want to obtain a cross tabulation of households with a family size greater than 3, by region and by sex, you issue the following command and generate the result given below:
tab region sex if famsize > 3
If you would like to drop some variables (or a list of variables) in a dataset, you merely use the drop command. For instance,
drop agehead
will delete the variable agehead from the database. We can also delete some observations. For instance,
drop in 2/3
deletes the second and third observations from the data set in memory. The related list command displays observations; for example,
list in 2/4
lists the second to fourth observations. The command
sample 20
drops everything in memory except for a 20% random sample of the database. The command
sort vill
sorts the observations according to the values of the variable vill.
You can also establish new variables in STATA. If you want to compute means (or other statistics) by groups that we construct, for example, the mean household total assets across four age groups of household head -- under 20, 21-40, 41-60, and over 60 -- then we first need to construct these age groups with the generate and replace commands, sort the data by the grouping variable, and then summarize by group. To employ the by command, it is assumed that a sorting was first initiated. The label command can be used not only to label variables but also to label the values of a variable.
After examining and making changes to a dataset, you may want to save those changes.
You can do that by using the STATA save command. For example, the following command
saves the changes on the hh.dta file:
save hh, replace
You can optionally omit the filename above (that is, save, replace is sufficient). If you do not use the replace option, STATA does not save the data but issues an error message instead.
The replace option tells STATA to overwrite the pre-existing original version with the new version. If you do NOT want to lose the original version, you have to specify a different filename in the save command, say, save the file with the name hhnew:
save hhnew
file hhnew.dta saved
If you attempt to exit STATA while unsaved changes to the data remain in memory, STATA will refuse to exit. To deal with this problem, you will have to first save the data file and then inform STATA that you want to exit. If you want to exit STATA without saving the data file, you will have to instead clear the memory (using the clear command or the drop _all command) before informing STATA that you really want to exit. Alternatively, you can enter these two commands together:
exit, clear
As a researcher, it is important to document all your work for subsequent use, whether by you or by another researcher. You can readily keep track of work in STATA by creating a log file, which lists all the commands entered as well as their outputs. Note, however, that graphical output is not saved in a log file; graphs will have to be saved independently. More about graphs will be discussed in subsequent sections of this manual.
You can use the Open Log button (the fourth button from the left on the toolbar) to establish a log file. This opens a dialogue box that requests a name for the file. The default extension name is ".smcl", which stands for a formatted log file, although you may opt to use an ordinary log file (which can be read and edited by any text editor or word processor, such as Notepad, WordPad, or Microsoft Word); formatted logs can only be read within the STATA software. You can give the log file a name such as log1, change the default extension to an ordinary ".log" file by clicking SAVE AS TYPE, and also change the default folder to an appropriate folder, such as c:\intropov.
Alternatively, you may also open a log by entering in the command window:
log using log1
Once a log file is created, all the commands and subsequent output (except for graphical
output) are logged into the log file until you ask STATA to stop writing into the log file. After
running some STATA commands, you may decide to close or suspend the log by pressing the button for opening a log (which also closes or suspends the log); this will result in a dialogue box asking you whether you wish to view a snapshot of the log file, close it, or suspend it. In future STATA sessions, you can decide to overwrite or append to an existing log file.
Graphs not only summarize the data compactly but also allow for a close scrutiny of the data. You can readily generate a number of graphs in
STATA that will illustrate the distribution of a variable or suggest relationships among
variables. A very simple but useful pictorial representation of the data distribution is a box-and-
whiskers plot, which displays a box that extends from the lower quartile to the upper quartile
(with the median shown in the middle of the box). The lower quartile, median and upper
quartiles are the 25th, 50th and 75th percentiles of the distribution. If the values in a distribution
are sorted from lowest to highest, the quartiles divide the distribution into four parts, with the
lower quartile being the value for which 25% of the data are below it, the median being the
value for which half of the data are below it (and half are above it), and the upper quartile being
the value for which 75% of the data are below it. Whiskers are drawn to represent the smallest
and largest data within 1.5 IQR from the lower and upper quartiles, respectively, where IQR
(the so-called “inter-quartile range”) is the difference between the upper and lower quartiles.
Points beyond the 1.5 IQR limits from the quartiles are considered outliers, i.e., extreme points
in the distribution.
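The quartile, IQR, and whisker computations just described are easy to make concrete. The sketch below (plain Python, not STATA; the data are a toy illustration, and quartile conventions differ slightly across packages, so the values may not match STATA's exactly):

```python
import statistics

# Sketch of the box-and-whiskers computations described above:
# quartiles, inter-quartile range, whisker limits, and outliers.
# Quartile conventions vary; Python's "inclusive" method is one of them.
data = [2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 20]

q1, median, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1                                     # inter-quartile range
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr           # whisker limits
outliers = [x for x in data if x < lo or x > hi]  # points beyond the limits

print(q1, median, q3, iqr, outliers)
```

For these toy data the extreme value 20 falls beyond the upper whisker limit and would be drawn as a separate point, just as in the figure discussed next.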
A box-and-whiskers plot can be obtained in STATA with the graph box command. To generate a box (and whiskers) plot of the family size variable in the hh dataset, either enter graph box members in the command window, or choose the box plot item in the Graphics menu and specify the family size variable members in the resulting dialogue box. Either way, this will generate the box-and-whiskers plot in Figure 4-8. This figure shows that the lower quartile for total members in the family is 4, i.e., 25% of all the families sampled have family sizes of 4 or below. Also, 50% of all sampled families have family sizes of 5 or below, and 75% of all sampled families have family sizes of 6 or below. A family size distribution typically has extremes on the upper tail of the distribution, as there are a few families (typically, poor families) with extremely large sizes. For this database from Bangladesh, we readily observe that the usual pattern holds: there are several family sizes beyond the upper whisker; these latter values are called outliers and are shown as circles in the figure.
[Figure 4-8. Box-and-whiskers plot of household size]
If you wish to incorporate weights in the box-and-whiskers plot, you have to specify them in the command entered in the command window. Note that we can also put data restrictions, i.e., select cases satisfying "if" conditions, or specify weights, in most graph commands.
If the data are grouped into intervals (called bins), and each interval is represented by a rectangle whose base is the interval and whose area is the relative frequency count within the interval, we obtain a histogram. Note that in constructing the histogram, attention must be given to selecting the number of bins. In practice, we want neither too few nor too many bins.
There are, however, no exact rules for determining the number of bins. One suggestion in the literature is to use Sturges' rule, where the number k of bins is the smallest integer greater than or equal to 1 + log(n)/log(2) ≈ 1 + 3.322 log10(n), and n represents the number of data points. The STATA default follows a different rule. For the famsize variable in the hh data set, the default histogram generated by STATA (shown in Figure 4-9) has k = 36 bins, which can be obtained by entering:
histogram famsize
[Figure 4-9. Histogram (density scale) of famsize]
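Sturges' rule is a one-line calculation. A quick Python check (an illustration only, not STATA; recall that STATA's own default bin count follows a different rule):

```python
import math

def sturges_bins(n):
    """Sturges' rule: smallest integer >= 1 + log2(n)."""
    return math.ceil(1 + math.log2(n))

# With n = 1000 observations, 1 + log2(1000) is about 10.97,
# so Sturges' rule suggests 11 bins.
print(sturges_bins(1000))   # prints 11
```

The rule grows only logarithmically in n, which is why even very large data sets are summarized with a modest number of bins.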
Alternatively, you can generate the histogram in Figure 4-9 by selecting from the Menu bar:
Graphics ► Histogram
and specifying the famsize variable in the resulting dialogue box, putting a tick on the discrete option rather than the default (continuous) option. Here, we had a unit width for the intervals. Note that you can also specify the width of the intervals to be some value, say 2:
histogram famsize, discrete width(2)
If the discrete option is not specified, you can instead suggest how many bins should be generated with the bin() option.
You can also generate histograms disaggregated by some subpopulations with the by() option. For instance,
histogram agehead, by(region, total)
yields a histogram of the age of the household heads by region (and across the country).
The shape of the histogram is typically influenced by the number of bins and the choice of where to start the bins. Thus, you may want to get a nonparametric density estimate of the distribution that does not depend on such choices. One such estimate is the kernel density estimate. In STATA, you can generate Figure 4-10 with:
kdensity agehead
[Figure 4-10. Kernel density estimate of age in years (kernel = epanechnikov, bandwidth = 3.6385)]
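For readers curious about what kdensity computes, here is a minimal Python sketch of a kernel density estimate with the Epanechnikov kernel (our own toy implementation and data, not STATA's code or the hh dataset):

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) for |u| <= 1, else 0."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def kde(x, data, bandwidth):
    """Kernel density estimate at point x: average of scaled kernels
    centered at each observation."""
    n = len(data)
    return sum(epanechnikov((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

ages = [22, 25, 31, 35, 40, 44, 52, 60]   # toy ages, for illustration only
density = kde(40, ages, bandwidth=10)      # estimated density at age 40
```

The bandwidth plays the role that bin width plays for a histogram: a larger bandwidth yields a smoother, flatter estimate, and STATA reports the bandwidth it used (as in the figure above).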
In some cases, you may want to determine whether the data distribution can be approximated by a normal curve, so you may want to add the normal option:
kdensity agehead, normal
If you are still unconvinced that you can fit a normal curve through the age distribution, you may generate a normal quantile plot:
qnorm agehead
with the resulting plot interpreted to mean that the normal distribution can be fit provided that the points (representing the observed quantiles and the expected normal quantiles) lie close to a straight line.
The command
graph bar distance d_bank, over(region)
produces the bar graph in Figure 4-11 of the average distance to the nearest paved road and the average distance of the nearest bank, by region.
[Figure 4-11. Bar graph of average distances (nearest paved road; nearest bank), by region]
A scatterplot of two variables, on the other hand, can be obtained by entering a command such as:
scatter distance d_bank
in the command window. Alternatively, we can use the Menu bar to obtain this scatterplot by selecting
Graphics ► Twoway graph
and then choosing Create Plot, Basic Plot (Scatter) in the resulting dialogue boxes, and identifying the x variable as d_bank and the y variable as distance in the resulting pop-up window.
[Scatterplot of distance to nearest paved road against distance of nearest bank from household]
A fitted line may be superimposed on the scatterplot by entering
twoway (scatter distance d_bank) (lfit distance d_bank)
in the command window, or choosing in the Graphical User Interface to create a second plot, a Fit plot (Linear prediction).
A scatterplot will help us assess whether or not variables are correlated. Two variables are said to be correlated if knowing the value of one of them tells us something about the value of the other. The correlation measure itself can be readily obtained with the STATA command:
correlate distance d_bank
The correlation coefficient measures the clustering of the variables around a line; the correlation of any two variables is always between –1 and +1. The
closer the value of the correlation to either extreme, the stronger is the evidence of a linear
relationship between the variables. The closer it is to 0, the weaker the linear relationship. If the
correlation is positive, this means that there is a tendency for the two variables to increase
together. If the correlation is negative, one variable tends to decrease as the other increases.
Note that the correlation between two variables is not sensitive to the order of the
variables. That is, interchanging the variables would still yield the same value for the
correlation between the variables. Adding a constant to all the values of one variable will also
not change the value of the correlation coefficient. That is, adding 0.3 to all values of distance
will not change the correlation coefficient between distance and d_bank. Neither does
multiplying one variable by a positive constant affect the correlation. In particular, if the
distance to paved roads were half of their current values (which is equivalent to multiplying the
current values by 0.5), then the correlation coefficient we calculated would still remain
unchanged.
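These invariance properties are easy to verify directly from the definition of the Pearson correlation. The Python sketch below (an illustration only, not STATA; the values for distance and d_bank are made up, not the hh data) implements the coefficient and checks each property:

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

distance = [1.0, 2.0, 4.0, 5.0, 9.0]   # toy values, not the hh data
d_bank   = [2.0, 3.0, 3.5, 6.0, 8.0]

r = pearson(distance, d_bank)
# Order, shifts, and positive rescaling leave the correlation unchanged:
assert abs(pearson(d_bank, distance) - r) < 1e-12
assert abs(pearson([x + 0.3 for x in distance], d_bank) - r) < 1e-12
assert abs(pearson([0.5 * x for x in distance], d_bank) - r) < 1e-12
```

Each assertion mirrors one of the statements in the text: symmetry in the two variables, invariance to adding a constant, and invariance to multiplying by a positive constant.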
The correlation can change, however, under a nonlinear transformation of at least one of the variables. For instance, the correlation of the log of distance and the log of d_bank generally differs from the correlation of distance and d_bank.
Although there are actually no hard rules for determining the strength of the linear relationship based on the correlation coefficient, we may want to use a rough guide in order to interpret the correlation. Bear in mind, though, that a correlation of 70% does not mean that 70% of the points are clustered around a line. Nor should we claim that such a correlation represents twice as much linear association as a set of points which has a correlation of 35%. Furthermore, a correlation analysis does not imply that the variable
X causes the variable Y. That is, association is not necessarily causation (although it may be
indicative of cause and effect relationships). Even if polio incidence correlates strongly with
soda consumption, this need not mean that soda consumption causes polio. If the population of
ants increases (in time) with the population of persons, (and thus these numbers strongly
correlate), we cannot adopt a population control program for people based on controlling the
number of ants! Also, while the direction of causation is sometimes obvious, e.g., it is rain that
causes the rice to grow and not the growth of rice that causes the rain, the direction of causation
may not always be clear: what is the relationship between macro economic growth and job
creation? Does economic growth come first, yielding more sectors to create more jobs? Or does
job creation come first? Often, both variables are driven by a third variable. The weight and
height of a person are certainly strongly correlated, but does it make sense to claim that one causes the other?
Finally, note that the presence of outliers easily affects the correlation of a set of data so
it is important to take the correlation figure with a grain of salt if we detect one or more outliers
in the data. In some situations, we ought to remove these outliers from the data set and re-do
the correlation analysis. In other instances, these outliers ought not to be removed. In any scatterplot, there will be some points more or less detached from the main bulk of the data, and judgment is needed in deciding whether these constitute genuine outliers.
A measure less sensitive to outliers is the “rank correlation,” the correlation of the ranks
of the x-values with the ranks of the y-values. This may be computed instead of the usual
(Pearson product moment) correlation coefficient, especially if there are outliers in the data.
In STATA, the spearman command computes the Spearman rank correlation, for instance between
distance to the nearest paved road and distance to the nearest bank.
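Since the STATA session is not reproduced here, the contrast between the two coefficients can be sketched in Python on made-up numbers (the data below are hypothetical, purely for illustration):

```python
from statistics import mean, pstdev

def pearson(x, y):
    # Pearson product-moment correlation, computed from its definition.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

def ranks(v):
    # Rank each value (1 = smallest); ties are not handled in this sketch.
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # Rank correlation: the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y = [2 * v for v in x]
y[-1] = -100                      # a single gross outlier

print(round(pearson(x, y), 3))    # badly distorted by the outlier
print(round(spearman(x, y), 3))   # much less affected
```

A single outlier drags the Pearson coefficient from +1 to a negative value, while the rank correlation stays positive, which is exactly why the rank version is preferred when outliers are present.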
Sir Francis Galton was the first to consider an investigation of correlations within the context
of studying family resemblances, particularly the degree to which children resemble their
parents. Galton’s disciple Karl Pearson further worked on this topic through an extensive study
on family resemblances. Part of this study involved gathering the heights of 1,078 fathers and those of
their respective first-born sons at maturity. A plot of these data is shown in Figure 4-13 with
each dot representing a father’s height paired with his son’s height.
It may be important to mention that there exist spurious correlations involving time. In
such cases, it is important to remove the time trends from such data before correlating them.
The selection of individuals can also have a dramatic effect on the correlation. In addition, it
may be misleading to compute correlations from averages, or when the sample comprises
different subgroups.
A lot more graphs can be generated with STATA. For instance, we can obtain a
“pairwise scatterplot” (also called a “matrix plot”) that can be helpful in correlation analysis
and in determining whether it makes sense to include certain variables in a multiple regression
model. You can generate a pairwise scatterplot with the graph matrix command.
As was earlier pointed out, graphs generated in STATA can be saved into a STATA
readable graphics file with an extension name .gph. These graphs can also be saved into
various graphical formats, including Windows Metafile (.wmf), Windows Enhanced Metafile
(.emf), Portable Network Graphics (.png), PostScript (.ps), or Encapsulated PostScript (.eps).
Either right-click on the graph and choose the desired graphical format to be saved, or use the
File menu of the Graph window.
Often, it is important to generalize findings beyond the units about which
information is available. Provided that the units form a probability sample, we may be able to
make inferences about the population, which is the set of objects about which we want to make
statements. Most of these objects have not been observed, yet we would like to make some
statement about them, and we can if those that we did observe were selected with a chance
process, and we know this sampling design. If some objects were purposely not sampled, it is
difficult to argue that they be included in the population about which statements can be made.
For instance, an opinion poll conducted only in a country's urban areas cannot support claims
about its rural residents.
In going about statistical inference, the crucial point is to recognize that statistics, such
as a sample mean, vary from sample to sample if the sampling process were to be repeated. We
can obtain values of the standard error of an estimate that will provide us a measure of the
precision of the estimate, and consequently allow us to construct confidence intervals and/or
perform hypothesis tests about the parameter of interest. By virtue of the Central Limit
Theorem, we are able to say that we are 95% confident that the population mean is within two
standard errors from the sample mean. Thus, if our estimate of the average monthly income
based on a probability sample of households is 18,576 pesos with a standard error of 250 pesos,
then we are 95% confident that the true average income is between 18,076 pesos and 19,076
pesos. If we would like to determine whether it is plausible to assume some hypothesized value
for the average, such as 20,000 pesos (attributing the difference between the hypothesized
value and the sample average to mere chance), then the confidence interval (which does not
contain the hypothesized value) suggests that the data are not consistent with an average of
20,000 pesos.
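The interval arithmetic in this paragraph is easy to verify; the following Python snippet reproduces it using two standard errors, as in the text:

```python
# 95% confidence interval as "estimate +/- 2 standard errors",
# using the figures quoted in the text.
estimate, se = 18576, 250
lower, upper = estimate - 2 * se, estimate + 2 * se
print(lower, upper)            # the interval quoted in the text

contains_20000 = lower <= 20000 <= upper
print(contains_20000)          # the hypothesized 20,000 lies outside the interval
```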
Statistical hypothesis testing is about making decisions in the face of uncertainty regarding a
parameter of interest. This involves stating what we expect, i.e., our null hypothesis, and
determining whether what we observed from the sample is consistent with this hypothesis. The
statement that we consider correct if the null hypothesis is considered false is called the
alternative hypothesis. As sample information becomes available, we decide between the null
and alternative hypotheses.
Agreement between the null hypothesis and the data strengthens our belief that the null
hypothesis is true, but it does not actually prove it. The whole process of hypothesis testing is
actually subject to errors, the chances of which we would like to be small. Even if our decisions
are evidence-based, these decisions themselves may not be perfect as there is inherent
uncertainty regarding the evidence. Although we would like to make perfect decisions, we
cannot do so; we have to nonetheless decide in the face of uncertainty, and errors in
decision-making cannot be ruled out entirely. Conventionally, we are guided to reject the null
hypothesis if the chance of obtaining what we observed, or something more extreme, is small
under the null hypothesis.
To illustrate how to perform a hypothesis test, specifically a regular t-test with STATA,
cd c:\intropov
use hh
sum famsize
with the last command generating the (unweighted) average family size 5.23.
You may hypothesize that the average family size is 5, and assess whether the
difference between the observed value (of 5.23) and what we expect under the null hypothesis,
i.e., 5, may be explained by chance. Intuitively, the smaller the observed difference (here, 0.23)
is, the more likely that we can account for this difference as being merely due to chance. So,
we desire to find the chance of getting an observed difference of 0.23 or a larger one. If
this chance is small, then we have evidence to suggest that the mean is not 5.
To run a t-test of the hypothesis that the average family size for the Bangladesh household
data file hh.dta is 5 in STATA, you have to enter the following and obtain:
ttest famsize=5
One-sample t test
Variable Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
famsize 519 5.22736 .092115 2.098525 5.046395 5.408325
Notice that the right-sided p-value is small (much smaller than the traditional 5% level of
significance used). This suggests that the difference between 5.23 (what we observe) and 5
(what we expect) is too large to attribute to chance. Thus, we conclude that the average family
size for the Bangladesh data is larger than 5.
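The t-statistic behind this conclusion is simply the observed difference divided by the standard error of the mean. Using the numbers in the STATA output above, a quick Python check (illustrative only):

```python
import math

# One-sample t statistic: (sample mean - hypothesized mean) / standard error,
# with figures taken from the ttest output above.
xbar, mu0, sd, n = 5.22736, 5, 2.098525, 519
se = sd / math.sqrt(n)            # standard error of the mean
t = (xbar - mu0) / se
print(round(se, 6), round(t, 2))
```

The computed standard error matches the 0.092115 reported by STATA, and the resulting t-statistic is well beyond the usual two-standard-error benchmark.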
STATA can also perform other hypothesis tests, such as nonparametric counterparts of the
one-sample t-test. One of these, the sign test on the median, is implemented by the signtest
command:
Two-sided test:
Ho: median of members - 5 = 0 vs.
Ha: median of members - 5 != 0
Pr(#positive >= 217 or #negative >= 217) = min(1,
2*Binomial(n = 419, x >= 217, p = 0.5)) = 0.4941
The results suggest that there is no reason to believe that the median family size is not
five, since the p-value for the two-sided test is rather large, viz., 0.4941 (compared to the
typical 0.05 level of significance). Another nonparametric alternative to the one-sample t-test is the
Wilcoxon signed rank test (that gives a test on the median). Here, we use the signrank
command:
signrank famsize=5
Wilcoxon signed-rank test
Sign          obs    sum ranks     expected
Positive      202      68225.5        64945
Negative      217      61664.5        64945
Zero          100         5050         5050
All           519       134940       134940
Ho: famsize = 5
z = 0.972
Prob > z = 0.3312
which again suggests that there is no reason to believe that the median family size is not five,
since the p-value for the two-sided test is rather large, viz., 0.3312 (compared to, say, a 0.05
level of significance).
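The sign-test p-value quoted above comes straight from the binomial distribution: under the null hypothesis, positive and negative differences from the median are equally likely, and zeros are discarded (519 observations minus 100 zeros leaves 419). A stdlib-Python check of the quoted figure:

```python
from math import comb

# Two-sided sign-test p-value: min(1, 2 * P(Binomial(419, 0.5) >= 217)),
# matching the formula printed in the STATA output above.
n, k = 419, 217
upper_tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
p_two_sided = min(1.0, 2 * upper_tail)
print(round(p_two_sided, 4))
```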
Beyond the one-sample t-test, the ttest command in STATA can also be used for two-
sample and paired t tests on the equality of means. For instance, entering ttest with the by()
option in the command window performs a two-sample test of the difference between mean
distance to a paved road by sex of the household head. This tests the null hypothesis that the
difference between the two means is zero; here, we do not have evidence to suggest a
difference between male- and female-headed households. If you believe that there is also an
inherent difference in variability between the two subpopulations, you can make the
comparison assuming unequal variances (with the unequal option); we again see no
statistically significant difference in distance to a paved road.
STATA has a number of commands for performing various statistical hypothesis tests,
especially for comparisons of two or more groups. The classical one-way analysis of variance,
which tests that the mean is the same across groups against the alternative that at least one of
the groups has a different mean from the rest, can be implemented in STATA with the oneway
command. You can also add the scheffe option to generate post-hoc comparisons, the results
of which suggest that distance to a paved road in Dhaka is significantly different from that in
the other regions.
The nonparametric analysis of variance can also be implemented with the kwallis
command:
which also suggests that the differences in distance to a paved road across regions are
statistically significant.
You can also obtain a boxplot of the distribution of distance to a paved road across
regions (with the graph box command), but due to outliers and the scaling of the values of
distance, we may not see the difference very clearly in the resulting plot. A number of other
comparison tests can be performed in STATA, including the binomial test (through the bitest
command), the Wilcoxon-Mann-Whitney test (with the ranksum command), and even a
chi-square goodness-of-fit test.
With the anova command, we can also implement an N-way analysis of variance
(including a factorial ANOVA with or without interactions), an analysis of covariance, and
even a one-way repeated measures analysis of variance. In the latter case, we have one
categorical independent variable and a normally distributed interval dependent variable that is
measured repeatedly on the same subjects.
Perhaps you have noticed in some cross tabulation of two variables X and Y that as one
variable increased or decreased, the other variable in the cross tab decreased or increased.
While the naked eye can be good at noticing these relationships, it is important to test
statistically whether the variables in the cross-tab are independent or not. By independent, we
mean that movements of X one way or another are completely unrelated to movements of Y.
The tabulate command with the chi2 option implements a chi-square test of independence; for
the two variables toilet and hhelec, which indicate, respectively, whether (1) or not (0) the
household has access to sanitary toilet facilities, and whether (1) or not (0) the household has
access to electricity, the small p-value (Pr = 0.000) tells us that there is evidence to suggest
that the two variables are related.
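The chi-square test compares observed cell counts with the counts expected if the two variables were unrelated. A self-contained Python sketch on a hypothetical 2×2 table (the counts below are made up for illustration, not the hh.dta figures):

```python
# Chi-square statistic for a 2x2 table: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
table = [[50, 10],
         [10, 50]]               # hypothetical counts, e.g., toilet (rows) vs electricity (cols)
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
total = sum(row)
chi2 = sum((table[i][j] - row[i] * col[j] / total) ** 2 / (row[i] * col[j] / total)
           for i in range(2) for j in range(2))
print(round(chi2, 2))
# Compare with 3.84, the 5% critical value of a chi-square with 1 degree of freedom:
print(chi2 > 3.84)               # evidence of association in this made-up table
```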
The hypothesis tests discussed thus far do not incorporate the proper analytic/probability
weights from the survey data into the results. When handling survey data with complex survey
designs, you need to incorporate the survey design into the analysis. STATA provides a scheme for
incorporating the survey design variables by way of the family of commands that begin with
“svy”. For you to be able to use these commands, you have to first identify the weight, strata
and PSU identifier variables; for hh.dta, this is done with the svyset command. If you then
wish to obtain estimates of average distance to a paved road, average distance to a bank, and
average family size, the svy estimation command for means provides the results displayed in
Figure 4-14. Here, we see not only the point estimates
for the variables but also the standard error of the mean, and the 95% confidence interval for
the mean. A confidence interval is essentially a range of values where we are confident that
the true value of the parameter lies. In particular, while our best estimate of family size is 5.19,
we are 95% confident that the average family size is no lower than 4.86 and no higher than 5.52.
Beyond estimation, it may be important to actually develop a model that accurately describes
the data and helps us understand what exactly the data say. In certain instances, the modeling
may even be aimed at prediction.
One very widely used model is the regression model, which provides us a way of
explaining how a change in some exogenous (or input) variable X, say, results in a change in
the value of some endogenous (or response) variable Y. If X and Y correlate and we wish to
explain Y with X, we can do so with the simple linear
Introduction to Methods for Research in Official Statistics 94 of 179
regression model. Regression models have two components: a deterministic law-like behavior
called a “signal” in engineering parlance, and statistical variation or “noise”. The simple linear
regression model is written as
y_i = β_0 + β_1 x_i + ε_i
which shows that the signal β_0 + β_1 X is being contaminated by the noise variable ε. The magnitude
of the output variable Y is dependent on the magnitude of the input variable X. A household’s
assets, for instance, functionally depend on the years of schooling of the household head. This
does not, however, suggest that educational attainment of the household head is the only factor
that is responsible for the level of assets of the household, but that it is one possible
determinant. All the other variables that may possibly influence the output variable (but which
we do not account for) are thought of as lumping into the noise term. To make the model
tractable, we assume that the noise is a random variable with zero mean.
For each value x_i of the input variable X and, correspondingly, each value y_i of Y, the model states that
y_i = β_0 + β_1 x_i + ε_i,   i = 1, 2, …, n
where the noise terms ε_1, ε_2, …, ε_n form a random sample from a normal distribution with zero
mean and constant variance. In consequence, this will mean that the points of X and Y will be
more or less evenly scattered around a line. The parameters β_0 and β_1 in the regression model,
respectively referred to as the intercept and slope, have to be estimated; the classical estimates,
also called the least squares estimates, of these parameters can be readily obtained by STATA
through the regress command. The estimated regression line ought to be viewed as a “sample”
regression line since we are only working with sample data. This line is the “best fitting” line
for predicting Y for any value of X, in the sense of minimizing the distance between the data
and the fitted line. By distance here, we mean the sum of the squares of the vertical distances
of the points to the line. Thus, the resulting coefficients, slope and intercept, in the sample
regression line are also called the least squares estimates (of the corresponding parameters of
the population regression line). Suppose you wish to regress household assets (hassetg)
against the years of schooling of the household head (educhead); you would then enter
regress hassetg educhead
Notice that in the regress command, the first variable identified is the y variable
followed by the explanatory variable. Also, we can use weights in the regress command
(whether analytic, importance, frequency or probability weights). The result of the last
command is shown in Figure 4-15, which lists an analysis of variance
(ANOVA) table, information on the model fit statistics, and a table of the estimated regression
coefficients.
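The least squares estimates that regress reports have a simple closed form in the one-variable case: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept makes the line pass through the point of means. A small Python sketch on made-up data (hypothetical numbers, not the hh.dta variables):

```python
from statistics import mean

def least_squares(x, y):
    # Closed-form simple linear regression:
    # slope = Sxy / Sxx, intercept = ybar - slope * xbar.
    xbar, ybar = mean(x), mean(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
b0, b1 = least_squares(x, y)
print(b0, b1)
```

Because the toy data lie exactly on a line, the fitted intercept and slope recover it exactly; with real data the line is instead the best compromise through the scatter.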
The ANOVA table decomposes total variability into the variability explained by the regression
model and the residual variability, i.e., everything the model cannot explain. The ratio of the
mean squared variation due to the model to the residual mean square forms the F-statistic (here
36.14); this F-statistic and its associated p-value can be used for testing overall model
adequacy. The overall model fit test involves the null hypothesis that X does not help in
explaining Y against the alternative that the regression model is adequate; here, the small
p-value (which is practically zero) suggests that we ought to reject the null hypothesis. That
is, there is strong evidence to suggest that the overall fit is appropriate. Note, however, that
in practice we obtain such results in most instances; this is merely the first of many tests that
need to be done to assess the adequacy of the model.
The estimated slope is positive, which suggests that as the number of years of schooling of the
head increases, we expect the household's assets to increase as well.
The utility of the estimated regression line is not merely in explaining the relationship
between two variables X and Y (in the illustration above, household assets and years of
schooling of the household head, respectively) but also in making
predictions on the variable Y given the value of X. Suppose, that you wish to pick one of the
households at random, and you wish to guess its assets. In the absence of any information, the
best guess would naturally be the average of household assets in the database, viz., 188,391.5
taka. If, in addition, you are provided information about the educational attainment (i.e., years
of schooling) of the household head, say, 3, then according to our estimated regression line, we
would instead guess
display 113716+29974.14*3
which yields 203,638.42 taka.
To obtain predicted y-values for regressions on x, issue a predict command after the regress
command. In the case of the regression of hassetg on educhead, enter
predict assethat
label variable assethat "predicted mean asset score"
These commands respectively generate a new variable from the earlier regress command
that regresses hassetg on educhead, and label the resulting variable assethat as the
predicted mean asset score.
To obtain a scatterplot of the variables hassetg and educhead along with the estimated
regression line, you can overlay a scatter with a linear fit (e.g., with the twoway command).
In theory, the intercept in the regression line represents the value of Y when X is zero,
but in practice, we may not necessarily have this interpretation. If the explanatory variable
were family size, say, then zero members in the household would mean no household! The
intercept here merely represents the value of Y for the estimated regression line if the line
were extended to X = 0. The test of overall model fit is equivalent to the test of the null
hypothesis that the slope is zero, which is, in turn, also equivalent to the test of the null
hypothesis that the population correlation coefficient is zero. In fact, for simple linear
regression the F statistic for overall model fit equals the square of the t-statistic for the
slope, and the two tests yield the same p-value: the STATA output in Figure 4-16 lists a
two-sided p-value for the t-statistic of the educhead coefficient, and the p-value associated
with the F-statistic coincides with it. Of course, both values here are practically zero and
both suggest that the regression fit is a good one.
You can also choose to suppress the constant in the regression model, i.e. perform
regression through the origin, and for this, you enter the nocons option:
to inform STATA that you wish the intercept to take a value of zero.
You can also choose to incorporate the survey design into the regression by way of the
svyreg command:
Here, the resulting estimated coefficients are no different, but the standard errors for the
estimated coefficients are adjusted upward from the earlier standard errors calculated from the
unweighted regression.
So far, we have related the response variable Y to the value of one independent
(explanatory) variable X. (Later, we will extend this to the case of the multiple regression
model.) The relationship between the variables is described by a linear function; a change in
one variable is associated with a change in the other variable.
Moving from correlation (and regression) to causation is often problematic as there are
several possible explanations for a correlation between X and Y (excluding the possibility of
chance): it may be that X influences (or causes) Y; Y influences X; or both X and Y are
influenced by some other variable. When performing correlation analysis of variables where
there is no background knowledge or theory, inferring a causal link may not be justifiable.
For instance, while there is a plausible causal story behind the correlation between alcohol
consumption and liver cirrhosis deaths, it may be difficult to make anything of the high
correlation between pork consumption and cirrhosis mortality. Arm length and leg length are
correlated but certainly not functionally dependent, in the sense that increasing arm length
would not have an effect on leg length. In such instances, correlation can be calculated but
should not be interpreted causally.
In many cases, obtaining a regression fit gives a sensible way of estimating the y-value.
If, however, there are nonlinearities in the relationship between the variables, one may have to
transform the variables, say, generate firstly the square root or logarithms of the X and/or Y
variables, and then perform a regression model on the transformed variables. When using
transformed variables, however, one will eventually have to re-express the generated analyses
in terms of the original units rather than the transformed data. In practice, we may use a power
transformation indexed by θ:
y(θ) = (y^θ − 1)/θ  for θ ≠ 0,   y(θ) = ln y  for θ = 0.
It is desired to choose an optimal value of θ; this can be readily worked out in STATA with
the boxcox command, which here suggests that the variable hassetg should be raised to +0.03
to improve the model fit, or even transformed to logarithms (i.e., the case when the parameter
is estimated to be zero). You can also choose to use the logarithmic transformation directly;
the log transformation is variance stabilizing, i.e., big values are made into small numbers
and small numbers remain as small numbers, so that the range of values gets bunched up. If the
dependent variable Y is
transformed to the log scale, the coefficient b1 of the independent variable can be interpreted,
approximately (after multiplying by 100), as the percentage change in the average of Y for a
unit change in X. If the independent variable X is logged, then b1/100 estimates the change in
the average of Y for a one percent change in X. If both the dependent and independent variables
are logged, the coefficient b1 of the independent variable X can be interpreted as the estimated
percentage change in the average of Y for a one percent change in X (an elasticity).
If both X and Y are first standardized (expressed in standard deviation units), the resulting
slope b1 is called the standardized coefficient. It indicates the estimated number of standard
deviations by which Y will change, on average, when X changes by one standard deviation (of X).
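For a single regressor, the standardized coefficient can be obtained by rescaling the raw slope by the ratio of standard deviations, and it then coincides with the Pearson correlation coefficient. A Python check on hypothetical data:

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]            # hypothetical data

xbar, ybar = mean(x), mean(y)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
slope = sxy / sum((a - xbar) ** 2 for a in x)

# Standardized coefficient: raw slope times sd(X)/sd(Y).
b_std = slope * stdev(x) / stdev(y)

# For one regressor this equals the Pearson correlation of X and Y.
r = sxy / ((len(x) - 1) * stdev(x) * stdev(y))
print(round(b_std, 4), round(r, 4))
```

This identity is why, in simple regression, reporting the standardized coefficient adds nothing beyond the correlation; it becomes genuinely informative only with several regressors on different scales.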
The assumptions of the simple linear regression model include (a) the value of the Y variable
is composed of a linear function of X plus a noise variable; (b) the noise terms form a random
sample with constant variance; (c) the noise terms are themselves normally distributed for
each value of X, as shown in Figure 4-18. These
assumptions all pertain to the behavior of the noise values, which are unknown but can be
estimated by the residuals, the difference between the observed Y- values and their predicted
values. An analysis of the residuals will help us ascertain whether the assumptions of the
After regressing the variable hassetg on the variable educhead and generating a predicted value
assethat, you could also obtain the residual values (with the predict command):
predict r, resid
label variable r "residual"
Residuals serve as estimates of the noise, and so, the residuals contain information on whether
the regression model assumptions are valid for the data being analyzed, i.e., whether the model
fit is adequate.
To validate the assumption of a normal distribution for the noise, you could look into
the distribution of the residuals and see whether it is sensible to fit a normal curve through the
distribution. You could either look at the kernel density estimate of the distribution with a
fitted normal curve, or examine a quantile-normal plot of the residuals:
kdensity r, normal
qnorm r
Here, we ought to see the values of the observed quantiles of the data being more or less
similar to the expected quantiles of the normal distribution, or equivalently, the points
formed from the observed and expected quantiles falling along the 45-degree line. You could
further validate the results of the quantile-normal plot with a test of normality, such as the
Shapiro-Wilk test, which has for its null hypothesis that the data come from a normal
distribution:
swilk r
The results of all these STATA commands on the residuals of the resulting regression model
are shown in Figure 4-19.
Figure 4-19. Testing residual normality through (a) a quantile-normal plot;
(b) a kernel density estimate with a fitted curve; (c) the Shapiro-Wilk test.
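swilk implements the Shapiro-Wilk test, whose coefficients are tabulated and not convenient to reproduce here. The same idea, comparing a statistic of the residuals against its behavior under normality, can be sketched with the simpler Jarque-Bera statistic, which combines skewness and kurtosis (a different test from Shapiro-Wilk, shown only for intuition):

```python
def jarque_bera(resid):
    # JB = n/6 * (S^2 + (K - 3)^2 / 4), with S the skewness and K the kurtosis.
    # Under normality, JB is approximately chi-square with 2 df
    # (5% critical value 5.99); large values signal non-normal residuals.
    n = len(resid)
    m = sum(resid) / n
    m2 = sum((e - m) ** 2 for e in resid) / n
    m3 = sum((e - m) ** 3 for e in resid) / n
    m4 = sum((e - m) ** 4 for e in resid) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

# Perfectly symmetric toy residuals: skewness is exactly zero,
# so the statistic stays small.
resid = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(round(jarque_bera(resid), 3))
```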
Plots of residuals against predicted values, residuals against explanatory variables, and
residuals against time are also useful diagnostic plots for assessing regression model adequacy.
The command
rvfplot
yields a plot of the residuals versus the fitted values, while
rvpplot educhead
yields a plot of the residuals versus the values of the educhead variable.
A good guide for assessing residual plots is shown in Figure 4-20, which suggests that these
plots ought not to display any patterns for the regression model to be considered appropriate.
Figure 4-20. Guide for analyzing residual plots to detect
(a) linearity from nonlinearity; (b) homoscedasticity from heteroscedasticity.
If the residuals from a regression involving time series data are not independent, they
are said to be autocorrelated. The well-known test for autocorrelation is the Durbin-Watson
test, which makes sense only if the data have some inherent order, as in a time series. Let us
illustrate the mechanics. STATA has a system variable called _n, which we will need to copy
into a new variable, say, obsnum, with the generate command. We will also need to let STATA
know that obsnum is the time variable with the tsset command. The postestimation command
estat dwatson then performs a Durbin-Watson test.
gen obsnum=_n
tsset obsnum
reg hassetg educhead
estat dwatson
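The Durbin-Watson statistic itself is simple to compute from the residuals: d = Σ(e_t − e_{t−1})² / Σe_t². Values near 2 indicate no autocorrelation, values near 0 strong positive autocorrelation, and values near 4 strong negative autocorrelation. A Python sketch on toy residual series:

```python
def durbin_watson(e):
    # d = sum of squared successive differences over sum of squared residuals.
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(v ** 2 for v in e)
    return num / den

smooth = [1.0] * 10               # perfectly positively autocorrelated residuals
jagged = [1.0, -1.0] * 5          # alternating signs: negative autocorrelation
print(durbin_watson(smooth))      # 0: strong positive autocorrelation
print(durbin_watson(jagged))      # near 4: strong negative autocorrelation
```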
Unawareness of the assumptions underlying least-squares regression will lead to ready
acceptance of any regression model (even those which do not have a good fit to the data). This
inherent lack of awareness may involve not knowing how to evaluate the assumptions and not
knowing that there are alternatives to ordinary least squares.
There are some suggestions on how to properly perform a regression analysis. Firstly,
start with a scatterplot to examine the relationship between the variables. After running the
regression, perform a residual analysis to check the model assumptions; for instance, examine a
normal probability plot of the residuals to uncover possible non-normality. If there is violation
of any assumption, use alternative methods, e.g., robust regression, or transform the X-variable,
the Y-variable, or both. You can use weights (analytic or probability) to combat
heteroscedasticity. If there is no evidence of assumption violation, then you can test for the
significance of the regression coefficients and, consequently, construct confidence intervals
and make predictions.
In the simple linear regression model, we only related one variable Y to some
explanatory variable X. We may want instead to relate the response variable Y to a number of
explanatory variables X_1, X_2, …, X_p. The multiple regression model
Y = β_0 + β_1 X_1 + β_2 X_2 + … + β_p X_p + ε,
where ε is the noise, serves this purpose. Least squares estimates b_0, b_1, …, b_p of the
parameters yield the fitted equation
Ŷ = b_0 + b_1 X_1 + b_2 X_2 + … + b_p X_p,
and just as in the simple linear regression model, we can represent the value of Y as the fitted
value plus a residual,
Y = Ŷ + e,
and perform a residual analysis to assess the adequacy of the multiple regression model.
The least squares estimates are quite difficult to calculate by hand, but they take a closed form
solution and can be obtained very readily with the use of a software, such as STATA (cf.
Figure 4-21).
With STATA, we merely list the explanatory variables after the response variable. For
instance, in the hh data set, if you wish to regress the asset variable hassetg on years of
schooling of the head (educhead), family size (famsize) and the binary indicator variable
representing whether (1) or not (0) the household head is male (sexhead), you enter
regress hassetg educhead famsize sexhead
and obtain the result. The output gives the estimated regression model for predicting assets
from the years of schooling of the head, family size and the sex of the head.
Thus, assets are expected to increase by an estimated 30,000 for each additional year of
schooling of the head, everything else held constant. Also, assets are an estimated 26,000
higher for each additional family member, ceteris paribus. The regression model can be used to
predict the value of assets given the family size, years of schooling and information on
whether or not the head is male.
The proportion of variation in the assets variable explained by the estimated regression
model is 8.26%. Adjusted for the number of explanatory variables used in the model, the
proportion of variation explained is 7.73%. The adjusted R-squared reflects both the number of
explanatory variables and the sample size in its calculation and is smaller than the
(unadjusted) multiple R-squared; it penalizes the excessive use of independent variables.
Since the unadjusted figure automatically increases as more independent variables are added,
even ones that contribute little to explaining the dependent variable, the adjusted multiple
R-squared is the more useful criterion for comparing models with different numbers of
explanatory variables.
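The adjusted figure can be verified from the standard formula adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1), with n = 519 observations and k = 3 explanatory variables:

```python
# Adjusted R-squared from the unadjusted value, using the figures in the text:
# R^2 = 8.26%, n = 519 observations, k = 3 explanatory variables.
r2, n, k = 0.0826, 519, 3
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(100 * adj_r2, 2))   # percentage, matching the 7.73% quoted above
```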
The overall test of significance of the regression model involves the null hypothesis
H0: β1 = β2 = β3 = 0
versus the alternative
H1: β1 ≠ 0 or β2 ≠ 0 or β3 ≠ 0.
The result here
suggests that the model is adequate since the p-value is rather small (in comparison with either
a 0.05 or even a 0.01 level of significance). That is, there is evidence that at least one
explanatory variable affects the response variable. The associated statistic for this test is an F-
statistic with numerator degrees of freedom equal to the number of parameters minus one (here
4 − 1 = 3), and denominator degrees of freedom equal to the number of observations minus the
number of parameters (here, 519 − 4 = 515). Actually, we will often obtain a rejection of this
null hypothesis, since the null hypothesis (that all the regression coefficients associated
with the explanatory variables are zero) is such a strong statement (especially when we are
considering a number of
explanatory variables).
Note that individual t-tests on the significance of the variables used in the regression
suggest that, with educhead and famsize both in the regression model, we cannot remove either
of them, as both are conditionally important (given the presence of the other). In particular,
testing the null hypothesis β3 = 0 associated with the sexhead variable versus the alternative
β3 ≠ 0 with the use of the t-statistic, here taking the value −1.01, suggests we cannot reject
the null hypothesis since the associated p-value (31.5%) is rather large (compared to 5%).
This means that there is no evidence of a linear relationship between assets and sex of head,
all other things being equal. For such
individual tests of significance, the t-statistic is the ratio of the estimate of the regression
coefficient to its standard error. The larger the magnitude of the t-statistic, the more
convincing the evidence that the true value of the parameter is not zero (as this means the
numerator, i.e., the estimated coefficient, is large relative to its standard error).
Note that aside from the overall F test and individual t tests, you could also perform
other hypothesis tests. You could, for instance, perform F tests for user-specified hypotheses.
Suppose that from the region variable representing the major region where the household
resides (1 for Dhaka, 2 for Chittagong, 3 for Khulna and 4 for Rajshahi), we create four binary
indicator variables, and we decide to add the regn1 variable (representing whether or not the
household resides in Dhaka) to the earlier regression model and jointly test the importance of
the indicators derived from the variable region. The second command performs the regression
without showing the output, since this regression model is merely an intermediate step toward
the final result, i.e., the joint test.
STATA’s estimates command makes it easier to store and present different sets of estimation
results, for instance when comparing alternative model specifications such as
ln Y_i = β_0 + β_1 X_1i + β_2 X_2i + ε_i
When we may not have an idea about what transformation we may need, it may be helpful to
use the Box-Cox transformation as illustrated earlier. Log transformations are useful, as
small changes in the log of a variable correspond roughly to percentage changes in the variable
(a change of 0.01 in the log is roughly a one percent change).
Sometimes, our explanatory variables may have very large correlations. This is known
as multicollinearity. This means that little or no new information is provided by some variables
and this leads to unstable coefficients as some variables are becoming proxy indicators of other
variables. When we add a new variable into the regression that is strongly related to the
explanatory variables already in the model, multicollinearity is suggested when we obtain (a)
large changes in the previously estimated coefficients; (b) estimated coefficients with
implausible magnitudes or signs; (c) nonsignificant individual t-tests despite a significant
overall model F test.
Following http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm ,
we can measure multicollinearity by way of calculating the variance inflation factors (VIFs):
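As background to the Stata output, the VIF of regressor j equals 1/(1 − Rj²), where Rj² is the R² from regressing that variable on the other regressors. A minimal sketch with two made-up regressors (with only two regressors, Rj² reduces to their squared correlation):

```python
import math

# Hypothetical data: x2 is nearly a linear copy of x1, so collinearity is severe.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.0, 2.9, 4.1, 5.0, 6.1]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

# With only two regressors, R_j^2 is the squared correlation between them.
r2 = corr(x1, x2) ** 2
vif = 1.0 / (1.0 - r2)
print(vif)   # a very large VIF, flagging the redundancy
```

Values of the VIF far above 10 are the conventional warning sign of serious multicollinearity.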
All of these variables measure the education of the parents, and the very high VIF values
indicate that these variables are possibly redundant. For example, after you know grad_sch and
col_grad, you can predict avg_ed very well.
Now, let us try to re-do the analysis but with some variables removed:
Notice that the VIF values are now better and the standard errors have been reduced. There are
other ways of addressing multicollinearity apart from dropping variables in the regression. The
researcher may even have the option of not addressing it at all if the regression model is to be
used for prediction purposes only. The interested reader may refer to Draper and Smith.
Another approach to model building is to examine all possible regressions, but as the
number of explanatory variables grows, the number of models to compare will also grow
considerably. For 10 explanatory variables, we have to compare 1023 models. Instead of
choosing the best model among all possible regressions, we may adopt one of the following
procedures:
• Forward Inclusion
• Backward Elimination
• Forward Stepwise
• Backward Stepwise
In STATA, these procedures can be implemented with the stepwise command, specifying the pr and pe options:
Note: pr and pe are options giving the significance levels (probabilities) for removal from and
entry into the model, respectively; the forward option must also be specified, or else backward
stepwise is implemented. The above choice of pr and pe makes it more difficult for a regressor
to enter the model than to be removed from the model. Forward stepwise is a modification of
forward inclusion. It starts with no regressors in the model and proceeds as in forward
inclusion, entering first the explanatory variable with the largest simple correlation with the
dependent variable. The variable is entered if the F statistic for the estimated regression model
including it exceeds the F value corresponding to pe=0.05. Then, it enters a second regressor,
the one with the largest correlation with the dependent variable after adjusting for the effect of
the first regressor on the dependent variable. The procedure then reassesses included regressors
based on their partial F-statistics for exclusion. It drops the least significant variable from the
model if its partial F-statistic is less than the pre-selected F corresponding to the pr value.
Then, it proceeds as in forward inclusion and enters the next candidate regressor.
Introduction to Methods for Research in Official Statistics 110 of 179
The computational algorithms (except for all possible regressions) do not necessarily
produce the best regression model, and the various algorithms need not yield exactly the same
result. Other software, e.g. R and SAS, use other criteria, e.g. the adjusted R2, Mallows’ Cp,
Akaike’s Information Criterion (AIC), or the Schwarz Bayesian Information Criterion (BIC),
rather than probabilities (or F values) to remove or enter variables in the model.
Note that Stata allows a hierarchical option for a hierarchical selection of variables, and
a lock option that forces the first (set of) regressor variable/s listed to be included in the model.
Aside from multicollinearity, the need to use transformations on the variables, and the
use of stepwise techniques, there are other issues in the use of regression models, e.g. the need
for outlier detection and influence diagnostics, and testing for misspecification. We provide in
the next sub-sections a discussion of logistic regression, and of common multivariate tools that
may be used in analyzing data.
Logistic regression, also called logit analysis, is essentially regression with a dependent
variable that is binary {0,1}, i.e., categorical and dichotomous, and with explanatory variables
that can be either continuous or categorical. In other words, the interest is in predicting which
of two possible categories an observation belongs to.
To illustrate the use of these commands, consider again the hh.dta database, and
suppose that we define households as poor or non-poor through the variable povind depending
on whether their household assets per capita are less than 15,000. Now let us generate a per
capita asset variable, which we will call assetpc, and the povind variable; let us also run a
logistic regression of povind on educhead, the years of schooling of the household head.
The results indicate that years of schooling of the head (educhead) is a statistically significant
predictor of povind (i.e., of a household being poor), with the Wald statistic taking a value
z = -5.47 and a p-value near 0. Likewise, the test of the overall model is statistically
significant, with a likelihood ratio chi-squared statistic of 32.47 and a corresponding negligible
p-value of 0.00.
The iteration log reports successive values of the (natural) log of the likelihood function.
Maximizing this nonlinear function involves numerical methods. At iteration 0 of the
algorithm, the log likelihood describes the fit of a model including only the constant. The final
log likelihood value describes the fit of the final model, where L represents the predicted logit
(or log odds of a household being poor).
An overall likelihood ratio χ2 test (with degrees of freedom equal to one less than the no. of
parameters = 2 − 1 = 1) evaluates the null hypothesis that all coefficients in the model, except
the constant, are zero:

χ2 = −2 ( L i − L f )
   = −2 [ −357.0325 − (−340.79956) ]
   = 32.47

where L i is the initial log likelihood and L f is the final iteration’s log likelihood. A convenient
but less accurate test is given by the asymptotic Z statistic associated with a logistic regression
coefficient. Note that the logit z and the likelihood ratio χ2 tests sometimes disagree, though
they usually lead to the same conclusion. McFadden’s pseudo R2 is defined as

pseudo R2 = 1 − ( L f / L i )
          = 1 − (−340.79956)/(−357.0325)
          = .04546628
which provides a quick way to describe or compare the fit of different models for the same
dependent variable. Note that unlike the R2 in classical regression, pseudo R2 statistics lack a
direct interpretation as a proportion of variance explained.
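Both fit statistics can be recomputed directly from the two log likelihoods reported by Stata (the arithmetic is shown in Python for illustration):

```python
# Reproducing the two fit statistics from the reported log likelihoods.
L_i = -357.0325    # initial log likelihood (constant-only model)
L_f = -340.79956   # final log likelihood (fitted model)

lr_chi2 = -2 * (L_i - L_f)     # likelihood ratio chi-squared statistic
pseudo_r2 = 1 - (L_f / L_i)    # McFadden's pseudo R-squared

print(round(lr_chi2, 2))       # 32.47
print(round(pseudo_r2, 5))     # 0.04547
```

These match the values in the Stata output above.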
A number of add-on programs in Stata, such as the package spostado, have been
developed for ease in the running and interpretation of logistic regression models. To
download the complete spostado package, use Stata’s search facilities (e.g., the findit
command). Once the package is installed, you can list the estimated coefficients, or the
percent changes in the odds, with:
listcoef
listcoef, percent
or request the listing of some model fit statistics with the fitstat command:
fitstat
Another model fit statistic is the Hosmer & Lemeshow goodness-of-fit indicator given by the
lfit command:
lfit
After logit, (as in regress), we can issue the predict command. Here, we generate predicted
probabilities.
predict phat
label variable phat "predicted p(poor) "
graph twoway connected phat educhead, sort
This generates the logistic curve shown in Figure 4-22.
The goal in logistic regression is to estimate the probability that an observation belongs
to one of the two categories, say the probability that y = 1 given the explanatory variables x:

p( y = 1 | x) = exp( β 0 + β 1t x) / [ 1 + exp( β 0 + β 1t x) ]

The logit, or log odds, is defined as the log of the odds ratio, i.e. the ratio of the probability
that y = 1 to the probability that y = 0:

g (x) = log [ p( y = 1 | x) / (1 − p( y = 1 | x)) ] = β 0 + β 1t x
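The probability and the logit are inverse transformations of one another, which a short numeric sketch makes concrete (shown in Python; the logit value used is hypothetical):

```python
import math

def logistic(g):
    """Convert a logit (log odds) into a probability."""
    return math.exp(g) / (1 + math.exp(g))

def logit(p):
    """Convert a probability back into a logit."""
    return math.log(p / (1 - p))

g = -0.7                      # a hypothetical value of b0 + b1*x
p = logistic(g)
print(round(p, 4))            # 0.3318
print(round(logit(p), 4))     # -0.7, recovering the original logit
```

Any logit in (−∞, ∞) maps to a probability in (0, 1), which is why the model is estimated on the logit scale.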
Thus, as the explanatory variable increases by one unit, the log of the odds changes by β
units. Equivalently, exp(β) indicates the multiplicative change in the odds resulting from a unit
change in the explanatory variable. In particular, the coefficient on educhead (given by the
logit command) describes the effect of the number of years of schooling of the head on the
logit (or log odds of a household being poor). Each additional year of schooling of the head
decreases the predicted log odds of being poor by 0.1489314. Equivalently, each additional
year of schooling of the head multiplies the predicted odds of being poor by exp(-0.1489314) =
0.86162822; an increase of two years of schooling multiplies the predicted odds by
exp(-0.1489314*2) = 0.7424, and, in general, an increase of N years multiplies the predicted
odds by exp(-0.1489314*N).
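The odds multipliers implied by the reported coefficient can be verified with a few lines (in Python for illustration):

```python
import math

# Effect of educhead on the odds of being poor, using the reported coefficient.
b = -0.1489314

one_year = math.exp(b)        # odds multiplier for one extra year of schooling
two_years = math.exp(2 * b)   # odds multiplier for two extra years

print(round(one_year, 4))     # 0.8616
print(round(two_years, 4))    # 0.7424
# exp(2b) equals exp(b) squared: effects multiply across years.
```

This multiplicative behavior is the reason coefficients are often reported as odds ratios.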
Interestingly, STATA allows us to obtain the predicted probability of being poor at each
value of the variable educhead with the prtab command:
prtab educhead
After fitting a logistic regression model, we can obtain a classification table with
estat classification
The output, shown in Figure 4-23, suggests that the use of the logistic regression model results
in an overall correct classification rate of 62%. Of the 286 poor households, we have a
sensitivity rate of 78% (i.e., 224 were correctly classified as poor), but of the 233 nonpoor
households, we have a specificity rate of only 41% (i.e., only 96 of the nonpoor were correctly
classified). The classification predicts that an observation is in sub-population 1 (poor) when
its predicted probability exceeds some probability cut-off, say 0.50. The plot of sensitivity and
specificity versus values of the probability cut-off can be obtained with:
lsens
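The classification rates quoted above follow directly from the counts in Figure 4-23 (recomputed here in Python for illustration):

```python
# Recomputing the classification rates reported in Figure 4-23.
poor_correct, poor_total = 224, 286        # poor households correctly classified
nonpoor_correct, nonpoor_total = 96, 233   # nonpoor households correctly classified

sensitivity = poor_correct / poor_total
specificity = nonpoor_correct / nonpoor_total
overall = (poor_correct + nonpoor_correct) / (poor_total + nonpoor_total)

print(round(100 * sensitivity))   # 78
print(round(100 * specificity))   # 41
print(round(100 * overall))       # 62
```

Raising the probability cut-off trades sensitivity for specificity, which is what the lsens plot displays.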
Note that when using survey data, we should incorporate the probability weights
[pw=weight] within the logit or logistic command, or use the svy:logit command. Also, you
could add some more variables to improve the model fit, and/or perform model diagnostics.
Various prediction and diagnostic statistics can be obtained with the predict command
options. To perform some logistic regression model diagnostics with a model that has more
explanatory variables (and that incorporates the use of the xi command for automatically
generating indicator variables from categorical ones), we may enter the following:
[Figures omitted: a plot of the Pearson residuals, and a plot of the diagnostics against the
predicted probabilities Pr(povind).]
STATA's mlogit command can be used when the dependent variable takes on more than two
outcomes and the outcomes have no natural ordering. The svy:logit command is used for
survey data, especially when the design is a complex survey design. The clogit command
performs conditional logistic regression. Conditional logistic analysis differs from regular
logistic regression in that the data are stratified and the likelihoods are computed relative to
each stratum. The form of the likelihood function is similar, but not identical, to that of
multinomial logistic regression. In econometrics, conditional logistic regression is called
McFadden's choice model. STATA's ologit command fits an ordered logit model for a
dependent variable that is categorical and in which the categories can be ordered from low to
high. A closely related model is fit by the command probit. If we observe covariates
X 1 , X 2 ,…, X p of some latent variable Z for which:

Z = b 0 + b 1 X 1 + b 2 X 2 + …+ b p X p

and with Z being observed only in that each observation is known to be in one category (high
Z values) or another (low Z values), then if we assume Z has a normal distribution, the model
is called a probit model; if Z has a logistic distribution, the model is a logit model.
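The two models differ only in the assumed distribution of the latent variable, i.e., in the link between the latent index and the probability. A small sketch comparing the two cumulative distribution functions (in Python, using only the standard library):

```python
import math

def logistic_cdf(z):
    """CDF of the logistic distribution: the logit link."""
    return 1 / (1 + math.exp(-z))

def normal_cdf(z):
    """Standard normal CDF via the error function: the probit link."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Both links map the latent index to a probability in (0, 1) and agree at 0.
print(logistic_cdf(0.0), normal_cdf(0.0))    # 0.5 0.5
print(round(logistic_cdf(1.0), 3))           # 0.731
print(round(normal_cdf(1.0), 3))             # 0.841
```

In practice the two models usually give very similar fitted probabilities once the difference in coefficient scaling is taken into account.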
Discriminant analysis is a statistical technique (first introduced by Sir Ronald Fisher)
that identifies variables important for distinguishing among mutually exclusive groups and
predicting group membership for new cases among a list of proposed (explanatory) variables.
The concept behind discriminant analysis is to form linear combinations of the explanatory
variables that are used for classifying cases into the groups. Like logistic regression,
discriminant analysis predicts group membership and, in practice, requires an estimation
(training) and a validation sample. Group differences may be examined through tests on the
explanatory2 variables. If the latter were assumed to follow a multivariate normal distribution,
we can use the F test. A discriminant analysis (implicitly) assumes that all relationships are
linear. In addition, linear discriminant analysis3 assumes unknown, but equal dispersion or
covariance matrices for the groups. This is necessary for the maximization of the ratio of
variance between groups to the variance within groups. Equality can be assessed by Box’s M
test for homogeneity of dispersion matrices. Remedies for violations include increasing sample
size and using nonlinear classification techniques. Multicollinearity among the explanatory
variables may impede inclusion of variables in a stepwise algorithm. Also, there ought not to
be any outliers; the presence of outliers adversely impacts the classification accuracy of
discriminant analysis 3. Diagnostics for influential observations and outliers should be done.
The choice of explanatory variables ought to be guided by prior research, a theoretical model,
or intuition. It is suggested that for every explanatory variable
2 Independence among explanatory variables is not assumed in Fisher’s DA. The variables in
Fisher’s classic example, the iris data set, are highly correlated.
3 Linear discriminant analysis assumes equal covariance matrices for the groups. Quadratic
discriminant analysis allows for unequal covariance matrices.
there be about 20 observations; at a minimum, the sample size should be five times the number
of explanatory variables, with each group having at least 20 observations.
To understand the idea behind discriminant analysis, consider Figure 4-27. Suppose
that each case (say, a household or an establishment) can be classified into one of two groups
on the basis of a measurement of one characteristic, say Z. Then we need to find some optimal
“cut-off” value C, which divides the entire dataset in such a way that high values of Z are
assumed to indicate that the observation comes from the first group, and low values indicate
that the observation comes from the second group.
Figure 4-27. Discriminant analysis for two groups (given one variable Z) is to obtain
an optimal rule which divides the data into two and allows us to classify the data to
belong to one group or the other.
The classification rule will lead to misclassifications if an observation from the second
group actually has a high value of Z, but it is classified by the rule as coming from the first
group. Also, misclassification results when cases from the first group with low values of Z are
classified by the rule as coming from the second group. In practice, though, we don’t usually
classify on the basis of just one variable. We may have a number of variables at our disposal.
Suppose though, that all the variables can be put together into one linear function Z. This is
illustrated in Figure 4-28 for two variables for the two-group case.
The characteristic Z used for the classification process is called a linear discriminant
function:

Z = b 0 + b 1 X 1 + b 2 X 2 +… + b p X p

where the bj’s are chosen so that the values of the discriminant function differ as much as
possible between the groups. A discriminant analysis can be readily implemented in STATA
with the discrim command. We illustrate the use of this command once again on hh.dta:
Figure 4-29. Results of discriminant analysis: (a) classification matrix;
(b) estimated discriminant function.
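To make the cut-off idea of Figure 4-27 concrete, here is a minimal sketch with made-up discriminant scores; the midpoint of the two group means serves as a simple stand-in for the optimal cut-off C:

```python
# Hypothetical discriminant scores Z for two known groups.
group1 = [5.1, 6.0, 5.5, 6.3]   # group with high Z values
group2 = [2.0, 2.8, 3.1, 2.4]   # group with low Z values

mean1 = sum(group1) / len(group1)
mean2 = sum(group2) / len(group2)
cutoff = (mean1 + mean2) / 2     # simple midpoint rule

def classify(z):
    """Assign a new case to group 1 if its score exceeds the cut-off."""
    return 1 if z > cutoff else 2

print(classify(5.0))   # 1 (high score -> first group)
print(classify(2.5))   # 2 (low score -> second group)
```

Cases near the cut-off are the ones most at risk of misclassification, which is why the groups should differ as much as possible on Z.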
Given a large number of variables in a database, you may want to examine the
underlying dimensions of these variables. In this case, the statistical tool needed is called
factor analysis, and the few underlying dimensions that are developed are called factors. The
factors are themselves new, derived variables.
Factor analysis is thus a data reduction tool that removes redundancy or duplication
from a set of correlated variables. It involves representing correlated variables with a smaller
set of “derived” variables. Factors may be formed so that they are uncorrelated with one
another. Consider, for instance, Figure 4-30 representing nine variables v 1 , v 2 , … ,v 9 that are
clustered into three groupings. These groups of variables are the factors.
In practice, the process for obtaining the derived factors is not so clear cut. There is
usually some amount of overlap between the factors since each of the observed variables
defining a factor has some degree of correlation with the other variables in the other factors.
Unlike in regression methods, when performing a factor analysis, it is important for
multicollinearity to be present. In fact, the rule of thumb is that there should be a sufficient
number of correlations greater than 0.30. The factor analysis model expresses the variables as

Z j = a j1 F 1 + a j2 F 2 + … + a jm F m + U j ,  j = 1, 2, …, k
where the variables are written in standardized form Z, F are the factors (with the coefficients
called “factor loadings”) and U represent a factor unique to a certain variable. Note that we
would prefer to have the number m of factors much less than the number k of variables. While
the factor analysis model looks similar to a regression model, we have no “dependent variable” in the model.
The variables being used in a factor analysis ought to be metric (i.e., numerical). The
variables need not be normally distributed, but if the variables are multivariate normal, the
solution is enhanced. There should be enough variables to represent each proposed factor (i.e.,
five or more). Note that sample size is an important consideration: preferably, the sample size
should be 100 or larger. Between 50 and 100 cases may still be analyzed with factor analysis
but with caution. Note that it is suggested that the ratio of observations to variables should be
at least 5 to 1 in a factor analysis.
To perform a factor analysis, you ought to perform the following six steps:
STEP 1: Collect data and perform a correlation analysis;
STEP 2: Extract initial factors (using some extraction method) ;
STEP 3: Choose number of factors to retain;
STEP 4: Rotate and interpret;
STEP 5: Decide if changes need to be made (e.g. drop item(s), include item(s)); and if
changes are made, repeat STEPS 2-4;
STEP 6: Construct scales and use them in a further analysis.
After organizing the data into the array shown in Figure 4-31(a), you need to obtain the
correlation matrix of the variables (as shown in Figure 4-31b), from which we can obtain
measures of sampling adequacy (MSA) for each variable, as well as the overall
Kaiser-Meyer-Olkin (KMO) measure:
Figure 4-31. (a) Data Matrix; (b) Correlation Matrix.
MSA i = Σ j≠i r ij² / ( Σ j≠i r ij² + Σ j≠i a ij² )

KMO = ΣΣ i≠j r ij² / ( ΣΣ i≠j r ij² + ΣΣ i≠j a ij² )

where r ij denotes the correlation between variables i and j, and a ij the corresponding partial
correlation.
The KMO can be helpful in determining whether or not it will be helpful to perform a factor
analysis on the variables. In particular, you can interpret this index against the usual guideline
that values of 0.50 or higher indicate that a factor analysis is appropriate.
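The KMO arithmetic itself is simple once the two sums are available, as the following sketch shows (in Python, with hypothetical sums; in practice both come from the correlation and partial correlation matrices):

```python
# A sketch of the KMO computation with hypothetical sums of squared
# correlations (r) and squared partial correlations (a), taken over all i != j.
sum_r2 = 12.4   # hypothetical sum of squared correlations
sum_a2 = 2.1    # hypothetical sum of squared partial correlations

kmo = sum_r2 / (sum_r2 + sum_a2)
print(round(kmo, 3))   # 0.855, a value that would favor factor analysis
```

A large KMO arises when squared partial correlations are small relative to squared correlations, i.e., when the shared variation is genuinely common rather than pairwise-specific.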
To illustrate how to perform a factor analysis with STATA, consider the nations dataset and let
us obtain the correlation matrix and the KMO index with the STATA commands corr and
factortest, respectively. These commands yield the results shown in Figure 4-32 and Figure
4-33. It can be observed that a number of variables are highly correlated; the KMO index
indicates that the data are suitable for factor analysis.
Figure 4-32. The correlation matrix for variables in the nations dataset.
STATA’s factor command allows several extraction methods, among them:
• pf (principal factor)
• ipf (iterated principal factor)
• ml (maximum likelihood)
• pc (principal components)
For instance, a principal factor extraction on the variables pop through school3 can be
requested with:
factor pop-school3, pf
unique component. Thus, for instance, according to the STATA output and the factor
while social scientists prefer principal factor methods. For the latter, factors are inferred.
Principal components factor (PCF) analysis is a factor analysis extraction method and is
not equivalent to principal components analysis (PCA). The method called singular value
decomposition (SVD) that is used to extract principal components in PCA is used to extract
factors in principal components for factor analysis. SVD is applied to the correlation matrix in
both PCA and PCF. But the two procedures differ in succeeding steps. In PCA, all the
eigenvectors extracted by SVD are used in estimating the loadings of all p variables on all p
principal components. In PCF, only eigenvectors corresponding to m << p factors are used in
estimating the loadings. The procedure is such that the choice of m affects the estimates.
Principal factor (PF) analysis extracts the eigenvectors from the reduced correlation
matrix. The reduced correlation matrix is like the correlation matrix, but instead of 1’s on the
main diagonal, has estimates of the communalities on the main diagonal. Communalities refer
to the variances that the X variables have in common. If the X variables have large
communalities, a factor analysis model is appropriate. The estimate commonly used for the
communality of the ith variable is the R2 obtained by regressing it on the other X variables.
Factor loadings, which vary from –1.00 to +1.00, represent the degree to which each
of the variables correlates with each of the factors. In fact, these factor loadings are the
correlation coefficients of the original variables with the newly derived factors (which are
themselves variables). Inspection reveals the extent to which each of the variables contributes
to the meaning of each of the factors. You can use the following guide to interpret factor
loadings:
• ± 0.40 important
• ± 0.50 practically significant
Note that: (a) an increase in the number of variables decreases the loading level necessary to
consider a loading significant; (b) an increase in sample size decreases the level necessary to
consider a loading significant; (c) an increase in the number of factors extracted increases the
level necessary to consider a loading significant.
To determine the number of factors that we can use, you can be guided by:
• the a priori criterion (that is, you can choose the number of factors prior to the analysis);
• the latent root criterion (retain factors whose eigenvalues exceed 1);
• the percentage of variation explained; and
• the scree plot (discussed below).
In the output of the command earlier, we see that the latent root criterion suggests the use of
two factors. If we would be satisfied with a ninety percent cut-off for the proportion of
variation explained, then likewise, we would select two factors.
The scree plot (or elbow rule) is a plot of the eigenvalues against the number of factors.
It is suggested that we find the value where the smooth decrease of eigenvalues appears to
level off; this is the number of factors to be used. You can obtain the plot with:
greigen, yline(1)
which yields the plot shown in Figure 4-34, and this plot suggests the use of two or three
factors.
If the resulting factors are difficult to interpret and name, you may try rotating the
factor solution. A rotational method distributes the variance from earlier factors to later factors
by turning the axes about the origin until a new position is reached. The purpose of a rotated
factor solution is to achieve a simple structure. In STATA, a rotated factor solution can be
obtained (after generating the factor solution) if you enter the command
rotate
The default rotation is a varimax rotation, which involves an orthogonal rotation of the factors;
Once you are comfortable with the factors, you ought to generate the factor scores
(linear composites of the variables). Several methods can be employed for obtaining the
resulting factor score coefficients. The default method in STATA, as in most software, is the
regression method.
The following commands generate the factor scores for three factors formed by
standardizing all variables, and then weighting with factor score coefficients and summing for
each factor; they also list the nations (and their factor scores), and generate a scatterplot of the
first two factors. The scatterplot groups the nations by economic development status:
developing countries are on the lower portion of the scatterplot, while developed countries are
on the upper portion.
[Figure: scatterplot of scores for factor 1 (horizontal axis) against the development factor
(vertical axis); developed countries such as the U.S.A., Norway, Canada and Sweden appear at
the top, while developing countries such as China and Burma appear at the bottom.]
In the previous sections, we looked into the analysis of cross section data. In the current
section, we consider time series analysis. Time series modeling involves the analysis of a
sequence of numerical data obtained at regular time intervals. Unlike the analyses of a random
sample of data that are discussed in the context of most other statistics, the analysis of a time
series is based on the assumption that successive values in the data file represent consecutive
measurements taken at equally spaced time intervals. The following yearly production data
is a time series. The daily market prices of a certain stock at the market closing over a 6-month
period constitute a time series. Time series that are of interest primarily for economic analysis
include the Gross Domestic Product, the Gross National Product, the Balance of Payments, the
Unemployment levels and rates, the Consumer Price Index and the Exchange Rate Data.
Time series can either be continuous or discrete. Frequently, continuous time series are
discretized by recording them at regular intervals. The series itself is the endogenous
dependent variable, while time is an ordered variable.
time periods by successive integers. We may also consider a continuous time index but here,
we anchor ourselves at some point t=0 in time, and we have a sense of past, present and future.
The future cannot influence the past but the past can, and often does, influence the future.
Time series analysis involves discovering patterns, identifying cycles (and their turning
points), forecasting and control. Analysts are sometimes interested in long term cycles, which
are indicators of growth measures, and so it is helpful if shorter length cycles are removed.
Out-of-sample forecasts may be important to set policy targets (e.g. next year’s inflation rate).
Time series models involve identifying patterns in the time series, formally describing
these patterns and utilizing the behavior of past (and present) trends and fluctuations in the time
series in order to extrapolate future values of the series. In model building, it is helpful to
remember the words of time series guru George E.P. Box: “All models are wrong, but some are
useful.” Several models can be used to describe various features of a time series.
A key feature of time series data is autocorrelation (i.e., correlation between measurements
on the same object over time). Estimated autocorrelations can be exploited to obtain a first
impression of possible useful models to describe and forecast the time series.
To use the time series commands in STATA, it is important to let STATA know that a
time index is found in the data set. Suppose we want to read the Excel file dole.xls (partially
shown in the figure), which contains monthly data on the number of people receiving
unemployment benefits in Australia from January 1956 to July 1992. To prepare the file,
remove the second column information, and save the file in csv format, say into the
c:\intropov\data folder.
After converting the file into CSV format, you can then read the file, generate a time
variable called n (based on the “internal counter” _n ), employ the tsset command (to let
STATA know that n is the time index), and get a time plot of the data with graph twoway line
as shown :
As was earlier pointed out, a time series may be thought of as a set of data obtained at
regular time intervals. Since a time series is a description of the past, forecasting makes use of
these historical data. If the past provides patterns, then it can provide indications of what we
can expect in the future. The key is to model the process that generated the time series, and
this model can then be used for generating forecasts. Thus, while complete knowledge of the
exact form of the model that generated the series is hardly possible, we choose an approximate
model that captures its main features.
We may investigate the behavior of a times series either in the time domain or in the
frequency domain. An analysis of the time domain involves modeling a time series as a process
generated by a sequence of random errors. In the frequency domain analysis of a time series,
cyclical features of a time series are studied through a regression on explanatory variables
involving sines and cosines that isolate the frequencies of the cyclic behavior. Here, we will
consider only an analysis in the time domain. Thus, we assume that a time series consists of
a systematic pattern contaminated by noise (random error) that makes the pattern (also called a
signal) difficult to identify and/or extract. Noise represents the components of a time series
representing erratic, nonsystematic, irregular random fluctuations that are short in duration and
non-repeating. Most time series analysis tools to be explored here involve some form of
filtering out of the noise in order to make the signal more salient and thus amenable to
extraction.
Decomposing a time series highlights important features of the data. This can help with
the monitoring of a series over time, especially with respect to the making of policy decisions.
Decomposition of a time series into a model is not unique. The problem is to define and
estimate components that fit a reasonable theoretical framework as well as being useful for
interpretation and policy formulation. When choosing a decomposition model, the aim is to
choose a model that yields the most stable seasonal component, especially at the end of the
series.
The systematic pattern of a time series may consist of trends, seasonal patterns, and cycles.
Trends are overall long-term tendencies of a time series to move upward
(or downward) fairly steadily. The trend can be steep or not. It can be linear, exponential or
even less smooth displaying changing tendencies which once in a while change directions.
Notice that there is no real definition of a trend, but merely a general description of what it
means, i.e., a long term movement of the time series. The trend is a reflection of the underlying
level of the series. In economic time series, this is typically due to influences such as
population growth or price inflation. Seasonality refers to periodic
patterns, i.e., regular upward or downward swings observed within a year. This is also not
quite a definition but it conveys the idea that seasonality is observed when data in certain
“seasons” display striking differences to those in other seasons, and similarities to those in the
same seasons. Every February, for instance, we expect sales of roses and chocolates to go
upward (as a result of Valentine’s day). Toy sales rise in December. Quarterly unemployment
rates regularly go up at the end of the first quarter due to the effect of graduation. In Figure 4-
38, we see 216 observations of a monthly series (representing a monthly Index from January
1976 – December 1993 of French Total Industry Production excluding Construction) that
display a clear seasonal pattern.
When a calendar year is considered the benchmark, the number of seasons is four for
quarterly data and twelve for monthly data.
Cycles in a time series, like seasonal components, are also upward or downward swings
but, unlike seasonal patterns, cycles are observed over a period of time beyond one year, such
as 2-10 years, and their swings vary in length. For instance, cyclic oscillations in an economic
time series may be due to changes in the overall economic environment not caused by seasonal
effects, such as recession-and-expansion. In Figure 4-39, we see a monthly time series of 110
observations (an index of Construction Products from January 1985 to February 1994). Notice
a downward trend in the series. Together, the trend, seasonal, cyclical and random components
combine to yield the time series. Aside from trends, seasonal effects, cycles and random components, a
time series may have extreme values/outliers, i.e., values which are unusually higher or lower
than the preceding and succeeding values. Time series encountered in business, finance and
economics may also have Easter / Moving Holiday effects, and Trading Day Variation. A
description of these components can be helpful to account for the variability in the time series,
and for predicting more precisely future values of the time series.
In describing a time series, one should also ask whether the data fluctuations exhibit a
constant size (i.e., nonvolatile, constant variance,
homoskedastic pattern) or are the sizes of the fluctuations not constant (i.e., volatile,
heteroskedastic)?
It may be also helpful to see if there are any changes in the trend or shifts in the
seasonal patterns. Such changes in patterns may be brought about by some triggering event
For such descriptive analysis of a time series, it is useful to employ the following tools:
• Historical plot – a plot with time (months or quarters or years) on the x axis and the
series values on the y axis;
• Multiple time chart (for series observed monthly or quarterly or semestral or for time
periods within the year) – superimposed plots of the series values against time, with one
curve for each year.
Note that, in practice, the series to be analyzed might have undergone some transformation of
an original time series, say, a log transformation in order to stabilize the variance.
A key feature of time series data (in contrast with other data) is that the data are
observed in sequence, i.e., at time t, observations also of previous times 1, 2, …, t-1 are known
but not those of the “future”. Also, because of autocorrelation, we can exploit information in
the current and past to explain or predict the value of the future.
Time series analysis often involves lagged variables, i.e., values of the same variable
but from previous times, and leads, i.e., forward values of the series. That is, for a given time
series Y t , the first order lag is

LY t = Y t−1
To get a first order lag on the time series ts in STATA, either enter
gen tsl1=ts[_n-1]
or alternatively, enter
gen tsl1=L.ts
in the command window. To get the lag 2 (also called the second order lag) variable, enter
gen tsl2=L2.ts
Now, to see the original time series together with the first order and second order lags,
list the three variables side by side. Similarly, the first order lead

FY t = Y t+1

of a time series Y t represents the original series but one period ahead.
To get a first order lead on the time series ts in STATA, either enter
gen tsf1=ts[_n+1]
or alternatively, enter
gen tsf1=F.ts
in the command window.
The first difference of the series is obtained with
gen tsd1=D.ts
while the second difference (the difference of the first difference) is obtained with
gen tsd2=D2.ts
The seasonal difference at lag 12 is obtained with
gen tss12=S12.ts
Note that this is not a twelfth difference but rather a first difference at lag 12. This would yield
differences, say, between Dec 2003 and Dec 2002 values, Nov 2003 and Nov 2002 values, and
so forth.
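Outside STATA, these operators are easy to emulate. The following is a minimal Python sketch (our own illustration, not part of the manual's STATA session) of lags, leads, and lag-k first differences, with None marking the missing values STATA would generate:

```python
# A minimal sketch of lag, lead, and difference operators on a list,
# mirroring STATA's L.ts, F.ts, D.ts and S#.ts (None marks missing values).
def lag(series, k=1):
    return [None] * k + series[:-k]

def lead(series, k=1):
    return series[k:] + [None] * k

def diff(series, k=1):
    # first difference at lag k: Y_t - Y_{t-k} (this is S#. for k > 1,
    # not the k-th repeated difference that D#. computes)
    return [None] * k + [series[t] - series[t - k] for t in range(k, len(series))]

ts = [10, 12, 15, 11, 14]
print(lag(ts))    # [None, 10, 12, 15, 11]
print(lead(ts))   # [12, 15, 11, 14, None]
print(diff(ts))   # [None, 2, 3, -4, 3]
```

The distinction in the comment matters: diff(ts, 12) reproduces the seasonal difference S12.ts described above, while STATA's D2.ts differences the series twice.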
The autocorrelation coefficients are the correlation coefficients between a variable and itself at
particular lags. For example, the first-order autocorrelation is the correlation of Yt and Yt−1;
the second-order autocorrelation is the correlation of Yt and Yt−2; and so on.
A correlogram is a graph of the autocorrelations (at various lags) against the lags. With
STATA, we can generate the correlogram with the corrgram command; for instance, enter corrgram ts for the ts
series. The output also carries Box-Pierce Q statistics
for testing the hypothesis that all autocorrelations up to and including each lag are zero. Partial
autocorrelations are correlations between the variable and a lag of itself with the linear effects of the intervening
elements (those within the lag) removed. For the Box-Pierce Q statistics, small p values
suggest that at least one of the autocorrelations up to that lag is nonzero.
To plot the autocorrelation function, enter
ac ts, lags(80)
which is shown in Figure 4-40.
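For intuition, the lag-k autocorrelation can be computed directly from its definition. This small Python sketch (an illustration of the sample formula, not STATA's corrgram output) uses the overall mean and variance of the series:

```python
# A minimal sketch of the sample autocorrelation at lag k:
# the correlation between Y_t and Y_{t-k}, centered on the overall mean.
def autocorr(series, k):
    n = len(series)
    mean = sum(series) / n
    var = sum((y - mean) ** 2 for y in series)
    cov = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
    return cov / var

print(autocorr([1, 2, 3, 4], 1))  # 0.25
```

A strongly trending series gives autocorrelations near 1 at short lags, which is one reason ACF plots of nonstationary series decay so slowly.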
Autoregressive moving average, ARMA(p,q), models can be fit to a time series. ARMA(p,q) processes should be stationary, i.e., there should be no
trends, and there should be a long-run mean and a constant variance observed in the time
series. To help guide us in identifying a tentative ARMA model to be fit on a stationary series,
we can examine the patterns of the autocorrelation function (ACF) and the partial autocorrelation function (PACF):
• AR(1) : ACF has an exponential decay; PACF has a spike at lag 1, and no correlation for
other lags.
• AR(2): ACF shows a sine-wave shape pattern or a set of exponential decays; PACF has
spikes at lags 1 and 2, no correlation for other lags.
• MA(1): ACF has spike at lag 1, no correlation for other lags; PACF damps out
exponentially.
• MA(2): ACF has spikes at lags 1 and 2, no correlation for other lags; PACF - a sine-wave
shape pattern or a set of exponential decays.
• ARMA(1,1): ACF decays exponentially starting at lag 1; PACF also decays exponentially
starting at lag 1.
In other words, we could summarize the signatures of AR, MA, and ARMA models on the
ACF and PACF as follows:

Model          ACF                     PACF
AR(p)          Decay                   Cutoff after lag p
MA(q)          Cutoff after lag q      Decay
ARMA(p,q)      Decay                   Decay
This summary table can be helpful in identifying the candidate models. If we find three spikes
on the partial autocorrelation plot (which may also look like a decay), then we could try to
fit an AR(3) model.
For instance, the differenced series of ts shows no trend, but the variability is
smaller for the first half of the series than for the second half of the series,
so we instead look into analyzing only the second half of the series. There are decays in both the
autocorrelation and the partial autocorrelation, so we may want to fit an ARMA(1,1) on the
differenced series:
The underlying idea in Box-Jenkins models is that only past observations on the variable being
investigated are used to attempt to describe the behavior of the time series. The
concept of correlation is used to measure the relationships between observations within the
series.
Often we would like to smooth a time series to break down the data into a smooth and
a rough component:
data = smooth + rough
Then, we may want to remove the “obscurities” and subject the smooth series to trend analysis.
In some situations, rough parts are largely due to seasonal effects and removal of the seasonal
effects (also called deseasonalization) may help in obtaining trends. For instance, the time series
shown in Figure 4-42, representing deseasonalized quarterly Gross Domestic
Product (GDP) and Gross National Product (GNP) in the Philippines from the first quarter of
1990 to the first quarter of 2000, show a steady increase in the economy except for the slump
due to the Asian financial crisis. The effect of the crisis on GNP, though, is not as severe as its
effect on GDP.
Figure 4-42. Gross Domestic Product (i) and Gross National Product (ii), first
quarter 1990 to first quarter 2000.
Smoothing helps separate the underlying pattern in a series from the
noise. Smoothing methods can also provide a scheme for short-term forecasting. The
smoothing done above is rather elaborate; we discuss simpler smoothing techniques here.
The most common smoothing technique is the moving average smoother, also called
running averages. This smoother involves obtaining a new time series derived from the simple
average of the original time series within some "window" or band. The result is dependent
upon the choice of the length of the period (or window) for computing means. For instance, a
three-year moving average is obtained by first obtaining the average of the first three years,
then the average of the second to the fourth years, then the average of the third to the fifth
years, and so on.
Consider the ts time series from the dole.xls data set. To obtain a 7-month moving
average, use the tssmooth command, specify that it is a moving average with all
seven points having the same weight, and then graph it. This results in Figure 4-43, which shows the rough parts of the original time series being
smoothed out. Alternatively, you may use the menus
and consequently enter in the resulting popup window the information shown in Figure 4-44
regarding the new name of the variable (ts7) based on the original variable (or expression to
smooth) ts, with equal weights for the current (middle) portion of the series and the lagged and lead
portions.
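The arithmetic of the equal-weight moving average is simple. Here is a Python sketch of a centered 7-point moving average (our own illustration of the computation, not the tssmooth command itself):

```python
# A minimal sketch of a centered moving average with equal weights;
# endpoints without a full window are left as None (missing).
def moving_average(series, window=7):
    half = window // 2
    smooth = [None] * len(series)
    for t in range(half, len(series) - half):
        smooth[t] = sum(series[t - half:t + half + 1]) / window
    return smooth

ts = [3, 5, 4, 6, 8, 7, 9, 10, 8, 11, 12, 10]
print(moving_average(ts))
```

Note that a centered window of length 7 loses three observations at each end of the series, which is why the smoothed plot in Figure 4-43 is shorter than the original.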
An alternative to moving averages (which give equal weight to all the data in a
window) is exponential smoothing, which gives the most weight to the most recent observation.
Simple exponential smoothing uses weights that decline exponentially; the smoothed series St is given by
St = αXt + (1 − α)St−1
where α (0 < α < 1) is the smoothing parameter.
You can ask STATA to obtain an exponentially smoothed series from the ts time series through the
tssmooth exponential dialog in the menus. This will generate a pop-up window; replies to the window are shown in Figure 4-45.
The default option here is to choose the best value for the smoothing parameter α based on
minimizing the in-sample sum of squared forecast errors.
Exponential smoothing was actually developed by Brown and Holt for forecasting the
demand for spare parts. The simple exponential smoother assumes no trend and no seasonality.
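The recursion St = αXt + (1 − α)St−1 can be sketched directly. The initialization below (S0 = X0) is one common convention, not necessarily the one STATA uses:

```python
# A minimal sketch of simple exponential smoothing:
# S_t = alpha * X_t + (1 - alpha) * S_{t-1}, initialized with S_0 = X_0.
def exp_smooth(series, alpha=0.3):
    smooth = [series[0]]
    for x in series[1:]:
        smooth.append(alpha * x + (1 - alpha) * smooth[-1])
    return smooth

print(exp_smooth([0, 1, 1, 1], alpha=0.5))  # [0, 0.5, 0.75, 0.875]
```

The example shows the defining behavior: after a jump in the series, the smoothed values approach the new level geometrically, with larger α reacting faster.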
Sometimes our goal in time series analysis is simply to look for patterns in smoothed plots; in
that case, plotting the smoothed series against time suffices.
The last command above generates the time plot shown in Figure 4-46.
Typically, the goal in time series analysis involves yielding forecasts. Smoothing
techniques can help yield short-term forecasts. A very simple forecasting method entails
extrapolating from recent growth rates.
Consider the second half of the time series (since the residuals show a change in the variability
between the first and second halves of the series, it may be best to only consider the second
half of the time series). Month-on-month growth rates on the ts variable can be readily obtained
from the first differences of the series. Similarly, you can get year-on-year growth rates from the seasonal differences. A
forecast based on the growth rates can be calculated by carrying the latest value forward at the
computed growth rate, so that you can then obtain forecasts from both the month-on-month and year-on-year growth
rates.
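As a concrete illustration of the idea (the exact equation used in the manual is not reproduced here), a growth-rate forecast can carry the last observation forward by an average recent growth rate. A Python sketch under that assumption:

```python
# A minimal sketch of growth-rate forecasting: extrapolate the last value
# by the average of the most recent period-on-period growth rates.
def growth_forecast(series, periods=3):
    rates = [series[t] / series[t - 1] - 1 for t in range(1, len(series))]
    recent = rates[-periods:]
    g = sum(recent) / len(recent)
    return series[-1] * (1 + g)

print(growth_forecast([100, 110, 121, 133.1]))  # ~146.41 under steady 10% growth
```

Using seasonal (lag-12) ratios instead of consecutive ratios would give the year-on-year variant described above.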
Another option is to fit a linear trend on the time index n:
reg ts n if n>220
predict forecast3 if n>220
To account for nonlinearity in the trend, you can try a quadratic trend model by entering
gen nsq=n^2
reg ts n nsq if n>220
predict forecast4 if n>220
If you believe that there is a strong seasonal effect that can be accounted for by monthly
additive effects in addition to the quadratic effects, you can run the following:
gen mnth=mod(n,12)
tab mnth, gen(month)
reg ts n nsq month1-month12 if n>220
predict forecast5 if n>220
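The trend regressions above are ordinary least squares fits. For the simplest case, a linear trend fit and one-step forecast can be sketched in closed form (an illustration of the computation, not the reg command):

```python
# A minimal sketch of fitting y = a + b*t by least squares on t = 0, 1, ...
# and forecasting the next period, as "reg ts n" followed by "predict" would.
def linear_trend(series):
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    b = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
         / sum((t - t_mean) ** 2 for t in range(n)))
    a = y_mean - b * t_mean
    return a, b

a, b = linear_trend([3, 5, 7, 9])
print(a + b * 4)  # one-step-ahead forecast: 11.0
```

The quadratic and seasonal-dummy models extend this with extra columns in the design matrix, which is exactly what the reg commands above do.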
Assessment of the forecasts may be done by inspecting the within-sample mean absolute
percentage errors:
gen pe1=100*abs(ts-forecast1)/ts
gen pe2=100*abs(ts-forecast2)/ts
gen pe3=100*abs(ts-forecast3)/ts
gen pe4=100*abs(ts-forecast4)/ts
gen pe5=100*abs(ts-forecast5)/ts
mean pe1 pe2 pe3 pe4 pe5
The results suggest that the forecasts using month-on-month growth rates give the best
forecasting performance.
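The pe1–pe5 variables above implement the mean absolute percentage error (MAPE); the same summary can be sketched in Python:

```python
# A minimal sketch of the within-sample mean absolute percentage error:
# the mean of 100 * |actual - forecast| / actual over the sample.
def mape(actual, forecast):
    return sum(100 * abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

print(mape([100, 200], [90, 210]))  # (10 + 5) / 2 = 7.5
```

Because the denominator is the actual value, MAPE is undefined when the series passes through zero, which is worth checking before using it to rank forecasts.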
In addition to past values of a time series and past errors, we can also model the time
series using the current and past values of other time series, called input series. Several
different names are used to describe ARIMA models with input series: transfer function
model, intervention model, interrupted time series model, regression model with ARMA errors,
and ARIMAX model are all different names for ARIMA models with input series. Here, we
consider a general structural model relating the dependent variable yt with a vector of covariates Xt:
yt = Xt β + µt
where the disturbance µt is allowed to follow an ARMA process.
You can still use the arima command for this model.
How can you tell if it might be helpful to add a regressor to an ARIMA model? After
fitting an ARIMA model, you should save the residuals of the ARIMA model and then look at
their cross-correlations with other potential explanatory variables. If we have two time series,
we can also explore relationships between the series with the cross-correlogram command
xcorr.
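The cross-correlations that xcorr plots are ordinary correlations between one series and a lag of another. A Python sketch of the computation for lag k ≥ 0 (our own illustration; means and scales are taken over the full series, a common simplification):

```python
# A minimal sketch of the cross-correlation at lag k between an input
# series x and an output series y: the correlation of x_{t-k} with y_t.
def crosscorr(x, y, k):
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((v - mx) ** 2 for v in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    cov = sum((x[t - k] - mx) * (y[t] - my) for t in range(k, n))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]       # y moves exactly with x
print(crosscorr(x, y, 0))  # close to 1.0
```

A large cross-correlation between the ARIMA residuals and a candidate input series at some lag k suggests that the input, lagged k periods, may be worth adding as a regressor.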
When modeling, you must be guided by the principle of parsimony: that is, you ought
to prefer a simple over a complex model (all things being equal). We must also realize that the
goodness of fit often depends on the specification, i.e., what variables are used as explanatory
variables, what functional form is used for the variable being predicted, etc. Thus, what is
crucial is to work on an acceptable framework for analysis before one attempts to use any
particular technique.
It has been stressed throughout this training manual that it is important to plan the
research undertaking by coming up with the research proposal, even though there may actually
be a need to change some elements of the research plan as the research gets going. Drafting a
research report is certainly no different as it will entail planning, and revising of plans as a
researcher realizes something that causes him/her to rethink the research or the research results.
Therefore, it requires good skills in both structuring and phrasing the discoveries and thoughts.
These skills are acquired through experience, but can also be taught. Unlike a newspaper
article, a novel or an essay, a research report is a technically rigid text with a structure, but it
also carries some subjective intellectual style that reflects some personal opinions and beliefs
of the researcher. In coming up with the report, a draft (and perhaps several drafts) will have to
be written. The draft report is likely to be revised, perhaps even several times, before a final
version of the report ensues. Thus, the plan for the draft need not be extremely detailed.
The research report ought to explain concisely and clearly what was done, why it was
done, how it was done, what was discovered, and what conclusions can be drawn from the
research results. It should introduce the research topic and emphasize why the research is
important. The report should describe the data, as well as explain the research methods and
tools used; it ought to give enough detail for other researchers to repeat the research. Simple
examples ought to be given to explain complex methodologies that may have been used. The
clarity of the report rests on its composition, reasoning, grammar and vocabulary.
Booth et al. (1995) point out that “writing is a way not to report what is in that pile of notes,
but to discover what you can make of it;” they suggest that regardless of the complexity of the
task of coming up with the research report, the plan for a draft report should consist of (a) a
picture of who your readers are, including what they expect, what they know, what opinions
they have; (b) a sense of the character you would like to project to your readers; (c) the
research objectives (written in such a way as to also suggest gaps in the body of knowledge);
(d) an outline of the main points and sub-points; (e) the parts of the research paper. When should the report be started?
Some researchers prefer to come up with some major results before commencing the writing
process; others start it even immediately after the research proposal is finalized.
Starting the writing process can be quite frightening, even to most researchers. Laws of
physics suggest that when a body is at rest, it takes force to move the body. Similarly, it takes
some energy to begin writing up a research report, but once the writing starts there is also some
momentum that keeps it moving unless we get distracted. There are different ways of getting
started in writing and in continuing to write. Booth et al. (1995) suggest the importance of
developing a routine for writing: everyone has a certain time of day or days of the week when
they are in their most creative and productive state; thus, you ought to schedule the creative
phase of your writing for these times. Other times may be more suitable for detailed work such
as checking spelling and details of the argument. Microsoft Word can also be used for spell
checking but beware that a correctly spelled word is not necessarily the right word to use.
Booth et al. (1995) also stress the need to avoid plagiarism, which they define as:
“You plagiarize, when intentionally or not, you use someone else’s words or
ideas but fail to credit that person. You plagiarize even when you do credit the
author but use his (or her) exact words without so indicating with quotation
marks or block indentation. You also plagiarize when you use words so close to
those in your source, that if you placed your work next to the source, you would
see that you could not have written what you did without the source at your
elbow. When accused of plagiarism, some writers claim ‘I must have somehow
memorized the passage. When I wrote it, I certainly thought it was my own.’
That excuse convinces very few.”
Some researchers make great progress in writing up the report with the aid of a
structured technique. In this case, an outline is developed that identifies an overall structure of
the draft research report in a hierarchical manner to arrive at the specifics of the report. Some
people, however, find that outlining stifles their creativity. An advantage to outlining is that the
pieces of the report are conceived of before the report is written, and this may help give focus
to drafting it.
The currently most popular word processing software package, Microsoft Word,
supports outlining. Merely click on ‘View’, then select ‘Outline’. The Outline View displays
the Word document in outline form. It uses heading styles to organize the outline; headings can
be promoted or demoted across levels. In the
outline view, you can look at the structure of the document as well as move, copy, and
reorganize texts in the document by dragging headings. You can also collapse a document to
see only the main headings, or expand it to see all headings and even the body text.
Typically, a research report would carry the following general topic outline:
1. Introduction
2. Review of Related Literature
3. Data & Methodology
4. Results & Discussion
5. Summary, Conclusions & Recommendations
Such a skeleton may help in the early stage of a researcher’s thinking process. These
topics could be further subdivided into sub-topics, and sub-sub-topics, and so forth. Elements
of the research proposal, such as the objective, hypothesis, conceptual and operational
framework, relevance of the research, may comprise the introduction section of the draft report,
together with some background on the study. That is, the first section on ‘Introduction’ could
contain:
1.1. Background
1.2. Research Objectives
1.3. Research Hypotheses
1.4. Conceptual and Operational Framework
1.5. Significance of Study
The third section on ‘Data & Methodology’ might include a description of the data and a
discussion of the data analysis method, including its appropriateness for arriving at a solution
to the research problem:
3.1. Introduction
3.2. Data
3.2.1. Sampling design
3.2.2. Data processing
3.2.3. Data limitations
3.3. Data Analysis Methods
Data can challenge the theory that guided their collection, but any analysis on poorly generated
data will not be valid, thus the manner in which data was collected and processed has to be
described. Statistical models help in distinguishing genuine patterns from random variation, but
various models and methods can be used. It is important to discuss the analytical methods
used and why they were chosen. At the start of drafting, a researcher may
not necessarily have a clear sense of the results. The problem may not even be too clear, and in
this case the act of drafting will itself help in the analysis, interpretation and evaluation. There
may be a lot of moments of uncertainty and confusion. A review of the research questions and
identification of key points may help put a structure to the report. It may also be important to
combine a topic outline with a point-based outline that identifies points within each topic. Note
that there may be more than one way to arrange elements and points of the report, but
preferably go from short and simple to long and more complex, from more familiar to less
familiar. Consider the following example of an outline of part of a research report that analyses
results of a demographic survey:
VI. Results
a. Contraceptive Practices of Women:
i. trends in contraceptive use;
ii. differentials in contraceptive prevalence;
iii. determinants of contraceptive use;
iv. contraceptive discontinuation;
v. contraceptive switching behavior
An outline like this helps establish a hierarchy among claims. In addition, here interrelations among the parts of the research report are
conceptualized, especially in coming up with points, claims and organization of the arguments.
Also, it is easy to observe whether issues need connectives that will bring together parts of the
report.
After coming up with the outline, it is now time to go to the level of the paragraph. The
boundaries between a paragraph and a sub-section are not sharp. Sub-sections allow you to go
deeper into a topic, and require several related paragraphs. Within each sub-section, you may
want to open each paragraph with a point sentence that introduces the paragraph, and leave the details of that paragraph for later. This expanded
form of the outline may constitute the skeleton of the initial draft of the report.
Outlining need not only be part of the pre-drafting stage. Sometimes, one comes up
with a long draft report based on a general outline, and then a fresh outline may help further
improve the draft by cutting out much of what was written (and even throwing away the entire
draft). There may even be several iterations of this drafting and outlining before the research
report is finalized. The point is to recognize that there will be changes in the outline, but it may
be much more satisfactory and productive to modify a well-considered outline than to start
without one.
After planning, outlining and organizing arguments into paragraphs, now comes the
writing of the draft report, which ought to be readable and coherent. The technical report is the
story of the research undertaking; it ought to be readable and coherent so that readers (and
other researchers) can determine what was done and how well it was done. Clear, concise,
direct, unequivocal and persuasive writing is important in scientific technical reports. A good
communicator must always understand that communication involves three components: the
communicator, the message and the audience. Thus a good criterion for selecting a writing
style is for the writer to think of the intended reader. The writer is challenged to “keep it short
and simple (KISS).” Conciseness should, however, not mean that clarity is sacrificed. Booth et
al. (1995) also suggest supporting the report with visual aids, such as tables, graphs, charts and diagrams. Visuals facilitate understanding of
the research and the results; they can convince readers of your point, as well as help you
discover patterns and relationships that you may have missed.
use in a report is crucial. You must keep the end result in mind – the research report and the key point
you wish to communicate – as some visuals are more appropriate than others in communicating a point. Be aware that a visual is not always the best
way to explain a point to the readers. Visuals must also be labeled appropriately. The title of a figure is
put below the figure, while the title of a table is put above the table. These visuals must be
located as close as possible to the textual description, which must make proper references to the
visual. Readers ought to be told what you want them to see explicitly from a table or figure.
Whether the draft (and the final report) will contain visual or textual evidence (or both) will
depend on how readers can understand information and how you want your readers to respond
to the information you present. Beware that most software packages, especially spreadsheets,
may generate visuals that are aesthetically good but do not necessarily communicate good
information. For instance, pie charts, though favorites in non-technical print media, are
generally not recommended for use in research reports. Some psychological research into
perceptions of graphs (see, e.g., Cleveland and McGill, 1984) suggests that pie charts present
the weakest display among various graphs; they are quite crude in presenting information
especially for more than four segments. Thus, it is recommended that pie charts be avoided in
research reports although they may be used in presentations if you wish to have your audience
see a few rough comparisons where differences are rather distinctive. Booth et al. (1995)
suggest that a researcher ask the following questions when using visual aids:
• How precise will your readers want the data to be? Tables are more precise
than charts and graphs (but may not carry the same impact)
• What kind of rhetorical and visual impact do you want readers to feel? Charts
and graphs are more visually striking (than tables); they communicate ideas
with more clarity and efficiency. Charts suggest comparisons; graphs indicate
a story.
• Do you want readers to see a point in the data? Tables encourage readers to
examine the data for themselves; charts and graphs direct readers to a point.
Writers differ in how they produce a draft: some write quickly and loosely, leaving problems to be fixed later,
i.e., the “quick and dirty” way; others write carefully without leaving any problems, i.e., the
“slow and clean” way; and others, perhaps most people, may be somewhere in between
these extremes. In the latter case, it may be important to keep notes of things that can be
checked on later. Booth et al. (1995) also provide suggestions that a researcher may use in
drafting the report:
• Determine where to locate your point: express the main claim especially at the last
part of the introduction, where you state the problem and even an idea of the solution. Most readers will not bother
with a report whose main point they cannot find early.
• Sketch out what readers must know, understand or believe; spell out the problem in more detail,
together with its scope and delimitations, locating the problem in a larger context.
• Make your points more organized: review what is familiar to readers, then move to
the unfamiliar; start with short, simple material before getting into long, more
complex material; start with uncontested issues before moving to more contested issues.
• Select and shape your material: research is like gold mining: a lot of raw
material may be dug; a little is picked out and the rest is discarded. “You know
you have constructed a convincing argument when you find yourself discarding
material that looks good—but not as good as what you keep.” Ensure that your
claims are warranted by your data.
The overall plan is to have a research paper with a flow in its structure involving its
various sections, viz., the introduction, data and methods section, results and discussions,
conclusions and recommendations. The section on results and discussions is the main portion
of the research report. It ought to contain answers to the research questions, as well as the
requisite support and defense of these answers with arguments and points. Conflicting results,
unexpected findings and discrepancies with other literature ought to be explained. The
importance of the results should be stated, as well as the limitations of the research
undertaking. Directions for further research may also be identified. A good research report
transfers original knowledge on a research topic; it is clear, coherent, focused, well argued and
uses language without ambiguities. The research report must have a well-defined structure and
a coherent flow.
In achieving this final form of the research report, you will first have to come up with
the draft. The first draft of the technical report need not necessarily use correct English or a
consistent style. These matters, including correct spelling, grammar and style can be worked
out in succeeding drafts, especially the final draft. One must merely start working on the draft,
and keep moving until the draft is completed. It may even be possible that what is written in a
draft may not end up in the final form of the report. Thus, it is important to start drafting as
soon as possible, and it is generally helpful not to be too much of a perfectionist on the first
draft. Booth et al. (1995) suggest that some details on the draft be left for revisions:
“if you stumble over a sentence, mark it but keep moving. If paragraphs sound
disconnected, add a transition if one comes quickly or mark it for later. If points
don’t seem to follow, note where you became aware of the problem and move on.
Unless you are a compulsive editor, do not bother getting every sentence perfect,
every word right. You may be making so many changes down the road that at this
stage, there is no point wasting time on small matters of style, unless, perhaps, you are
revising as a way to help you think more clearly. Once you have a clean copy with the
problems flagged, you have a revisable draft.”
The draft of the report will typically have to undergo several iterations of diagnoses and
revisions before the final form of the report is produced. After coming up with a first draft, it
may even be better to put it aside for a while, say a day, and then re-read the draft; distance
may reveal problems that went unnoticed earlier. When reading on a
word processor, it may sometimes be difficult to catch mistakes; printing a hard copy of the
draft may be useful in revising the draft. Revisions involve some level of planning and
diagnosing. The process involves identifying the frame of the report: the introduction and
conclusion (each of which should carry a sentence that states the main claim and solution to the
problem), as well as the main sections of the body of the paper, including the beginning and final
parts of each of these sections. In addition, it is important to analyze the continuity of the
themes in the paper, as well as the overall shape and structure of the paper. Repetitions of
words and ideas ought to be avoided. Verbosity, i.e., wordiness, and the use of redundant words ought also to be avoided.
Be careful, too, not to present mere correlation as causality: the variables investigated may be driven by other variables that confound
the correlation between them. Remove excessively detailed technical
information and the details of computer output and put them instead in an appendix.
It may be helpful to read and re-read your draft as if it were written by someone else.
The key is to put oneself into the shoes of a reader, i.e. to see the draft from and through the
eyes of a reader “imagining how they will understand it, what they will object to, what they
need to know early so they can understand something later” (Booth et al., 1995). This will
guide you in revising the text to make it more readable. It may help to ask yourself a few
questions:
• Are the text structure, arguments, grammar and vocabulary used in the draft
clear?
• Does the text communicate to its readers what you want it to? That is, can the
reader find what you wanted to say in the draft? Are the visuals, e.g., graphs,
charts, and diagrams communicating the story of the research effectively?
• Does the draft read smoothly? Are there “connectors” among the various ideas
and paragraphs? Is there a flow to the thoughts? Are they coherent?
• Is the draft as concise as possible? Are there redundant thoughts? Can the text
be shortened?
Booth et al. (1995, p. 232-233) provide some concrete and quick suggestions for
coming up with revisions to the writing style in the draft. They are presented below in toto:
“If you don’t have time to scrutinize every sentence, start with passages where you
remember having the hardest time explaining your ideas. Whenever you struggle
with content, you are likely to tangle up your prose as well. With mature writers that
tangle usually reflects itself in a too complex, ‘nominalized’ style.
FOR CLARITY:
Diagnose
1. Quickly underline the first five or six words in every sentence. Ignore short
introductory phrases such as At first, For the most part, etc.
2. Now run your eye down the page, looking only at the sequence of underlines to see
whether they pick out a consistent set of related words. The words that begin a
series of sentences need not be identical, but they should name people or concepts
that your readers will see are clearly related. If not, you must revise.
Revise
1. Identify your main characters, real or conceptual. They will be the set of named
concepts that appear most often in a passage. Make them the subject of verbs.
2. Look for words ending in –tion, -ment, -ence, etc. If they appear at the beginning
of sentences, turn them into verbs.
FOR EMPHASIS:
Diagnose
2. In each sentence, identify the words that communicate the newest, the most
complex, the most rhetorically emphatic information; technical-sounding words
that you are using for the first time; or concepts that the next several sentences will
develop.”
A thesaurus may also be consulted, especially if there is one in your word processor. This may help in remedying repetition of
words in a paragraph. Spell-checks also ought to be used, but be careful that these may not
always yield the correct words you want. When introducing technical and key terms in the
research report, it is preferred to structure the sentence so that the term appears among the last
few words. The same may be true of a complex set of ideas, which ought to be made as readable as
possible.
Non-native writers of English have the tendency to write in their native language and
then translate their thoughts into English. This tends to be too much work unless only notes are
made rather than full sentences and texts. Even native speakers and writers of English are not
always comfortable with the grammar of English. Grammar essentially involves linking, listing and nesting words together to
communicate ideas. Sentences consist of coordinated words, and clauses embedded and glued
together. Booth et al. (1995) stress the importance of grammar in a research report: “Readers
will judge your sentences to be clear and readable to the degree that you can make the subjects
of your verbs name the main characters in your story.” To assist in the revision of the structure
and style of the draft research report, we provide in the Appendix of this manual a review of
English grammar.
After a draft of the report has been written, a researcher is challenged to come up with
an introduction, conclusion and even an executive summary either while a revised draft is
being written or after the revised draft has been generated. The executive summary,
introduction, and conclusion ought to be clear, logical, coherent, and focused. The executive
summary should be coupled with good arguments and well-structured writing to enable the
paper to get widely read, and possibly published. The introduction and the conclusion likewise
ought to be well written as well as coherent in content. The key points stressed in the
introduction should not conflict with those in the conclusion. The introduction may promise
that a solution will be presented in the concluding section. Because of the importance of the
introduction and the conclusion, some researchers prefer to write
these last. Even in this case, a working introduction and working conclusion will still have to
be initially drafted.
After the title 4, the introduction will be read. Thus, the introduction ought to give
readers a sense of where they are being led to; otherwise, a reader may not proceed to reading
the entire report. The introduction ought to have a discussion from the broad topic towards
more specific and narrowly defined research problems and questions. It ought to state and
discuss the research objectives in clear language. The introduction should place the research
question in its scientific context, and state clearly what is new in this research and how it
relates to other researches, that is, the introduction should provide a sense of the study’s
significance, and even an overview of related literature. The introduction uses the simple
present tense for referring to established knowledge and the past tense for the literature review.
Common structures of introductions include at least these two elements, in this
predictable order:
□ a statement of the research problem, including something we do not
know or fully understand, and the consequences we face if we leave
that gap in knowledge or understanding unresolved;
□ a statement of the response to that problem, either as the gist of its
solution or as a sentence or two that promises one to come.
And depending on how familiar readers are with the problem, they may also expect
to see, before these two elements, one more: a statement of common ground establishing
the context the readers already share.
4 It is suggested that the title contain only seven to ten words and avoid complex grammar. Preferably,
the title should intrigue readers and attract interest and attention. Booth et al. (1995) suggest that the
title be the "last thing you should write … a title can be more useful if it creates the right expectations,
deadly if it doesn't … Your title should introduce your key concepts... If your point sentence is vague,
you are likely to end up with a vague title. If so, you will have failed twice: You will have offered
readers both a useless title and useless point sentences. But you will also have discovered something more
important: your paper needs more work"
Introduction to Methods for Research in Official Statistics 159 of 179
They also provide an example of an introduction that states the context, the research problem,
and a sense of the outcome:
First born middle-class native Caucasian males are said to earn more, stay employed
longer and report more job satisfaction. 5
But no studies have looked at recent immigrants to find out whether they repeat that
pattern. If it doesn’t hold, we have to understand whether another does, why it is
different, and what its effects are, because only then can we understand patterns of
success and failure in ethnic communities.6
The predicted connection between success and birth order seems to cut across ethnic
groups, particularly those from South-east Asia. But there are complications in
regard to different ethnic groups, how long a family has been here, and their
economic level before they came. 7
The following example is taken from an introductory section of a draft working paper
by Nimfa Ogena and Aurora Reolalas on "Poverty, Fertility and Family Planning in the
Philippines":
“With the ‘politicization’ of fertility in many countries of the world, the impact of
population research has expanded beyond mere demographics towards the wider
socio-cultural and political realm. … The raging debates during the past year in
various parts of the country on the population and development nexus, at the macro
level, and why fertility continues to be high and its attendant consequences, at the
individual and household levels, created higher visibility for this ever important issue
and legitimized certain influential groups. Nevertheless, with the dearth of
demographic research in the country over the past decade, debates have very limited
findings and the empirical evidence to draw from to substantiate basic arguments …
Correlations… are not sufficient basis for arguing that fertility induces poverty or
poverty creates conditions for higher fertility. Better specified models are needed to
examine causal linkages between fertility and poverty. This study aims to … :
(1) analyze regional trends, patterns and differentials in fertility and poverty;
(2) identify factors influencing Filipino women's poverty status through their recent
fertility and contraceptive protection; and
(3) illustrate changes in the expected fertility and poverty status of women when
policy-related variables are modified based on fitted structural equations models
(SEM).
Expected to be clarified in this study are many enduring questions such as:
5 Shows the context.
6 States the research problem.
7 Indicates a sense of the outcome.
As fertility declines, how much reduction in the poverty incidence is
expected?
If unmet need for family planning is fully addressed, by how much would
fertility fall?
Methodologically, this is the first time that the fertility-poverty status linkage is
examined at the individual level with a carefully specified recursive structural
equations model (SEM) that accounts for the required elements to infer causality in
the observed relationships. This paper also hopes to contribute to current policy
debates by providing different scenarios to illustrate possible shifts in women’s
fertility and poverty status as selected policy-related variables are modified.”
Notice that the authors make available the background of the research problem, the research
objectives, and a sense of the study's significance.
The concluding section may offer a summary of the research findings. It ought to show
that the key objectives and research questions were addressed. All conclusions stated ought to
be based on research findings; some suggestions for further research may be stated, recognizing
what is still not known. A closing quote or fact may be presented.
Booth et al. (1995, pp. 250-254) offer some quick tips on dos and don'ts in the
choice and use of first and last words:
Your FIRST FEW WORDS:
□ Open with a striking fact or quotation, but only if its language leads
naturally into the language of the rest of your introduction.
Your LAST FEW WORDS:
□ Close with your main point, especially if you ended your introduction not
with your main point but with a launching point. If you end your
introduction with your main point, restate it more fully in your
conclusion.
The executive summary of the report may be viewed as a miniature version of the
report. Typically, the executive summary will be up to 500 words. (See examples in Chapter 3
of Executive Summaries from the First Research-based Regional Training course). Many
people may have time only to look through an executive summary, and this part of the research
paper may even be the only part of the report available to the general public. The
executive summary will help readers decide whether to delve any further into the research
paper as a whole. Thus, everything that is important in the research report ought to be reflected
and summed up in the executive summary. It ought to contain an introductory statement of the
rationale and research objectives, perhaps even the research hypotheses, a discussion of the
importance of the study, the data and methodology, the key themes and research results, and
the main conclusions. The executive summary is distinct from an abstract, which typically is
not more than 100 to 150 words, containing the context, the problem, and a statement of the
main point, including the gist of the procedures and methods used.
Often there is a need to report the research results other than in the form of a research
report. Research reports have to be widely distributed for them to be read by a wide readership.
Orally presenting the research results in a public forum, e.g., a conference or a seminar, is
probably the most common and most rapid way to disseminate new findings. Current standards
of public speaking in many scientific conferences, however, are relatively low, perhaps
because many scientists, official statisticians included, are not trained in the art of marketing
their products and services. A good presentation can undoubtedly be memorable to the
audience, and there are techniques that can effectively transmit information about the research
and the research results. Some fairly recent literature, e.g., Alley (2003), provides discussions
of effective presentation techniques, especially for scientific talks. What should be stressed is
that you must do justice to your topic and your work. If the research was worth doing, then its
results are worth presenting well.
Certainly, it can be noted that some people are very effective public speakers (and can
probably sell any idea). However, effective presentation involves talents and competencies that
can be developed. Some hard work and practice can help us develop good, if not very good,
presentation skills.
The reality is that not all people who attend a dissemination forum of your
research will read your report. Thus, your work and your research report will likely be judged
by how well your presentation provides information to the audience. If your presentation is
poorly expressed, then your research results will be poorly understood. In this case, your
research will be in danger of being ignored, and your effort working on your research may go
to waste. Too many details will likely not be remembered by the audience, who might just go
to sleep, and worse, snore during your presentation!
In writing the research report, it is important to take the standpoint of your reader, and to
plan what you write. Similarly, during a presentation, you ought to be considerate of your
audience, and realistically plan your talk. You ought to know your audience, identify areas of
common interest with them, and anticipate potential questions that they may raise during the
presentation. Neither underestimate nor overestimate your audience. Note, however, that you
need not plan your presentation in the same order as you organized your research paper. While
you may have used a logical plan for your research report, your presentation will involve not
only an overall plan but a didactic plan, one that takes into account the specific type of
audience you have, the type of talk you are to give, and the duration of your talk.
Unlike the written research report, a presentation is a one-shot attempt to make a point
or a few points. Thus, it is vital that your presentation be well constructed and organized, with
your points delivered in a logical, clear sequence. Often, your presentation may have to fit into
only ten to fifteen minutes. The shorter the talk, the more difficult it will generally be
to cover all the matters you wish to talk about. You cannot discuss everything, only some
highlights; thus you need to be rather strict about including only essential points in the
presentation, and removing all non-essential ones. Try to come up with a simple presentation.
Of course, planning the content of a talk in relation to its actual length can be rather
difficult. It may be helpful to note that if you speak at about 100 words per minute, each
sentence covers about 15 words, and each point takes about 4 sentences, then a 15-minute
talk will roughly entail 25 concepts (and 25 slides, at 1 concept per slide). If you are given
more than 30 minutes for your presentation, you have more time to cover material, but you
have to find ways of enlivening your presentation, as people's attention spans are typically
short-lived. Also, in this case, there is a danger that your audience may become more attentive
to your assumptions and question them. Whether your talk is short or long, as a rule of thumb,
people can only remember about five or fewer new things from a talk.
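The rule-of-thumb arithmetic above can be written out as a small calculation. This is only an illustrative sketch: the function name is hypothetical, and the default rates (about 100 words per minute, 15 words per sentence, 4 sentences per point) are the manual's rough assumptions, not fixed constants.

```python
# Illustrative sketch of the talk-planning rule of thumb above.
# The default rates are the manual's rough assumptions, not fixed constants.

def concepts_for_talk(minutes, words_per_min=100,
                      words_per_sentence=15, sentences_per_point=4):
    """Estimate how many concepts (and slides, at one concept per slide)
    a talk of the given length can roughly carry."""
    total_words = minutes * words_per_min               # ~1500 words in 15 minutes
    total_sentences = total_words / words_per_sentence  # ~100 sentences
    return total_sentences / sentences_per_point        # ~25 concepts

print(concepts_for_talk(15))  # -> 25.0 concepts, i.e., about 25 slides
```

Running the estimate for a few talk lengths makes the trade-off concrete: a 15-minute slot leaves room for roughly 25 concepts, so anything beyond the essentials must go.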
You also need to plan how to say what you want to say. It should be clear to you whether you
are expected to present new concepts to your audience, or build upon their knowledge. That is,
have people not read your paper (in which case, how can you persuade them to do so?) or have
they read it (in which case, what specific point do you wish them to appreciate?). Either way,
the basics of your talk – the topic, concepts and key results – ought to be delivered clearly, and
early on during the talk, to avoid losing the attention of the audience. You ought to identify the
key concepts and points you plan to make. Determine which of these concepts and points will
require visuals, e.g., graphs, charts, tables, diagrams and photos that are consistent with your
message. Preferably, these visuals should be big, simple and clear. The intention of using such
visuals is to provide insights and promote discussion. Although presenting data in tables can
be effective, avoid large tables; use only small tables with data pertinent to the point you want
to deliver. Figures may often be more effective during presentations. All preparations for such
visuals ought to be done well in advance.
If you will be presenting your report with the aid of computer-based presentation
software, such as PowerPoint, make sure to invest some time learning how to use this software.
PowerPoint is an excellent tool for organizing your presentation into slides and transparencies.
Inspect the various templates, slide transitions and custom animations available. While colorful
templates may have their uses, you ought to be prudent in your choice of templates,
backgrounds and colors. Artwork, animation and slide transitions may improve the
presentation, but don't go overboard. Too much animation and too many transitions may
reduce a visual's effectiveness and distract your audience from your message. Using cartoons
excessively may make your presentation appear rather superficial. Thus, make sure to be
"artistic" only in moderation. Do not lose sight of the fact that you have an overall message to
deliver to your audience – the key point of your research. You are selling your research!
Artwork will not substitute for content. The earlier you start on your slides, the better they will
be, especially as you fine-tune them. However, avoid excessive fine-tuning as well.
Be conscious that your slides are supports and guides for your talk. Your talk should
also be coordinated with the content of your slides. Your first slide or two should give an
overview (or outline) of the talk. Your slides ought to be simple and informative, with only one
basic point in each slide.
point in each slide. If you intend to give a series of points, it may be helpful to organize them
from the most to the least important. This way, your audience is more likely to remember the
important points later. Establish logical transitions from one point to another that link these
issues. You may, for instance, pose a question to introduce the next point.
Make sure to put the title, location, and date of the talk on each slide; run a spelling
check on the words in your slides. It can be embarrassing if people pay more attention to
mistakes in spelling in the slides than to the issues you are attempting to communicate to your
audience.
• Don’t’ use small fonts. Make sure that your slides are readable. As a rule of
thumb, if it looks just about right on the computer screen, then it is likely
to be too small during the presentation. If it looks big, it may still be small
during the slide view. You should intend to see outrageously large fonts (and
that goes for figures, charts, etc). To do a simulation, put your slides on slide
show, and step back six feet away from your computer. You ought to be
able to read all the text very easily in your presentation; if not you ought to
resize your fonts. You shouldn’t write too much on each slide as the fonts get
automatically smaller if there is too much written. Try having only four or
five lines per slide with not more than six words per line. Preferably have the
• Don’t use pale colors, e.g. yellow (about one in ten people are said to be color
blind, these people cannot see yellow). Use highly saturated colors instead of
pastels, and complementary colors for your text and background to increase
the visibility of the text in the slide. Color increases visual impact.
• Don’t use only upper case (except for the presentation title). A mixture of
lower and upper cases tends to be easier to read than purely upper or purely
lower case.
If you use sentences, use only short sentences with simple constructions.
• Don’t overcrowd slides, e.g., do not use very large tables copied directly from
an MS word document; don’t put too many figures or charts on one slide.
• Don’t change formats. Once you have made your choice on colors, fonts, font
size, etc., you ought to stick to them. Consistency in format will allow your
• Don’t display slides too quickly. But don’t spend too much time on one slide
either. The audience can scan the content of a slide (including visuals) within
the first three seconds after it appears. If you say nothing during this
period, you allow your audience to absorb the information; then, when you
have their attention, you can expand upon what the slide has to say.
• Don’t ever read off everything directly from your slides. The audience can
read! If you were to read off directly everything from your slides, then
people could just be given copies of the slides without you needing to give
the presentation. You need to say some words not stated in your slides, e.g.,
elaborations of your points, anecdotes, and the like. You ought to maintain
Finally, stage your presentation – run through your talk at least once. A practice talk is
likely to be about 25% faster than the actual presentation. Re-think the sequencing of issues
and reorganize them to make the talk run more evenly. Rephrase your statements as needed.
Delete words, phrases, statements and issues that may be considered non-essential, bearing in
mind time constraints as well as the flow of ideas in your talk. Try to leave some time
for questions at the end of the talk. You may want to prepare a script or notes for your
presentation to put your slides in sync with what you want to say. The script or notes can be
quite useful, especially if you go astray during the presentation. It puts order to what you will
say.
Practice your presentation, perhaps first in private. Listen to the words you use as well
as how you say them, and not to what you may think you are saying. Rehearsing your talk in
front of a mirror (making eye contact with an imaginary audience) can be a painful and
humbling experience, but also rather helpful as you observe your idiosyncrasies and
mannerisms, some of which may be distracting to your audience. After your private rehearsal,
you can try your presentation out in front of some colleagues, preferably some of whom should
not know too much about your topic and research. These colleagues can provide constructive
feedback, both on the content and style of presentation. Make the changes. Rehearse some
more (as the saying goes “practice makes perfect”) perhaps around five times, then let it rest
(and even sleep on it). The night before the talk, sleep well.
Right before the talk, try to relax and take it easy: dress comfortably and take a deep
breath before you start. Arrive at the venue at least ten minutes before your talk. Scan the
environment for possible problems in the audience's line of sight. Try to be familiar with the
equipment. Don't wait until the last few minutes before your presentation before you run to the
toilet; check your appearance - including zippers, buttons and the like - before you reappear. It
can be rather embarrassing if people's attention is directed not at your talk but at open
zippers, buttons, etc. If you prepared PowerPoint slides, make sure to have several soft copies
and test-run a copy before your presentation. Check that the colors, fonts, bullets, formulas,
etc. appear as intended on the presentation equipment.
During the presentation, straighten up and face the audience. Be prepared for your
presentation. Express confidence: smile, speak slowly, clearly and loudly (and use gestures).
Be balanced in your tone of voice. If a microphone is available, do not say “does everybody
hear me?”; do not tap on the microphone; do not say “hello, hello, mike test” or “one-two-
three” on the microphone. Just speak up and begin your talk. If the microphone does not work,
it may simply be switched off; if there is a problem beyond the switch, just start your talk
without it. Be as articulate as possible. You may want to mention that "more details are
available in the research report."
Introduce your ending by making the title of your last slide or last few slides
"Conclusion." Here, you may want to offer a summary of the background (and thus the
significance of your research), as well as a recapitulation of the findings. Discuss directions for
further research. A summary slide may be helpful either at the very end or at the front end of
the talk.
If you are interrupted during a conference or seminar, you can answer without delay but
don’t lose control over the flow of your presentation. You ought to avoid being sidetracked or
even worse, being taken over by the distraction. You can opt to defer your reaction to the
interruption by mentioning that this will be discussed later or at the end of the talk.
If there are questions raised during your presentation, it is good practice to repeat the
question not only for your benefit and that of the person raising the question, but also for the
sake of others. Take some time to reflect on the question. If there are questions about the
assumptions in your research, you will have to answer them in detail; thus be
prepared and anticipate the audience's reactions. If there are questions about a point you made,
you will have to discuss this point and re-express it clearly. If the audience is largely composed
of non-specialists, you may want to delay your discussion until the end of your presentation or
have a private discussion. Technical questions will have to be given technical answers. If
you don’t understand the question raised or if the question is rather challenging, you may
honestly say so (but no need to apologize). Here, you may want to offer to do further research
on the issue, ask suggestions from the floor or have further discussions after your presentation.
If you sense that your answer and discussion are getting prolonged, make efforts to extricate
yourself tactfully.
If you run out of time in your presentation, you may finish your current point, refer the
interested reader to further details in the research paper and jump to the concluding section of
your presentation.
I. Tense of a Verb: the tense of a verb indicates the time to which the verb refers. This ought to
present few difficulties if you place yourself in the position of the reader and consider the time
to which the statement refers at the time it is written. Within a paragraph, changes in time
frame are indicated by changing tense relative to the primary tense, which is usually either
simple past or simple present. As a general guideline, however, writers are advised not to shift
from one tense to another without good reason.
• Future: used for events in the future at the time the text was written. It is often
used in plans and proposals, e.g., for a survey that will cover some 14,000 female
respondents aged 15-49 and some 5,000 male respondents, as described in a
research report.
• Past: used for events already in the past when the text was written, e.g., data
extracted from the 2003 women data file using the Dynpak software package
of programs. It is also used for a result that is specific to the research
undertaking. (If the statement is widely applicable, the present tense is used
instead.)
• Present: used for statements that are always true, according to the researcher, for
some continuing time period; the statement may have been false before, but it
is true now. It is also used for a result that is widely applicable, not just to the
research.
• Past Perfect: used for events already in the past when another event in the
past occurred.
• Future Perfect: used for future events that will have been completed when
another future event occurs.
II. Voice of a Verb: the voice of a verb refers to whether the subject of the sentence performs
the action expressed (the "active voice") or the subject is acted upon (the so-called "passive
voice"), i.e., whether the agent of the action is the grammatical subject.
In the passive voice, sometimes there is no explicit agent (no "by the" phrase), only an
implied one; for instance, a sentence such as "The respondents were interviewed" is complete
without the phrase "by the researcher." Writers are advised to avoid dangling modifiers caused
by the use of the passive voice. A dangling modifier is a word or phrase that modifies a word
not clearly stated in the sentence. Instead of writing "To cut down on costs, the survey
consisted of 1000 respondents," one should write "To cut down on costs, the team sampled
1000 respondents." In the first case, the construction of the sentence implies that the survey
itself decided to cut down costs.
There is wide acceptance of the active voice over the passive voice in non-scientific
writing, as the active voice yields clearer and more direct statements. The passive voice ought
to be used, however, when the object is much more important than the subject, especially when
the agent is unknown or unimportant. The passive voice may also help avoid calling attention
to oneself. Another alternative is to use a third-person style, i.e., to refer to oneself as "the
researcher" rather than "I."
III. Punctuation: punctuation marks signal to readers that another set of thoughts is to be
introduced. These punctuation marks make it easier for the reader to understand the flow of a
writer's thoughts, and to emphasize and clarify what the writer means. The rules for the use of
a number of punctuation marks are fairly standard, although they are not static, i.e., they have
changed through the years. These conventions and rules are created and maintained by writers
to help readers.
The period (".") completes a sentence, while the semicolon (";") joins two complete
sentences where the second sentence is closely related to the first. An exclamation
point (!) also ends a sentence, but, unlike the period, it indicates surprise or emphasis.
A question mark (?) also ends a sentence, but it proposes a question which could be
answered, either by the reader or the researcher/writer. (Note that rhetorical questions
need no answer.)
The colon (":") is placed at the end of a complete sentence to introduce a word, a phrase, or a
clause. The most common way of using a colon, however, is to introduce a list of items.
Note that the colon should not be placed after the verb in a sentence, even when you
are introducing something, because the verb itself introduces the list and thus, the
colon would be redundant. If you are not sure whether you need a colon in a particular
sentence, try reading the sentence, and when you reach the colon, substitute the word
“namely”. If the sentence reads well, then you may need the colon. (Of course, there
are no guarantees!).
The comma (",") can be used more variedly: it can separate elements of a list, or it can
join an introductory clause to the main part of a sentence. Some writers can tell where
a comma is needed by merely reading their text aloud and inserting a comma where
there is a need for a clear pause in the sentence. When a reader encounters a comma,
the comma tells the reader to pause (as in this sentence). In the preceding sentence, the
comma is used to join the introductory clause of the sentence ("When a reader
encounters a comma") to the major part of the sentence ("the comma tells the reader to
pause"). Another reason for the pause could be that the words form a list, and the reader
must understand that the items in the list are separate. What is often unclear is whether
to include the comma between the last and second-to-last items in a list. In the past, it
was not considered proper to omit the final comma in a series; modern writers, however,
believe that conjunctions such as "and", "but", and "or" do the same thing as a comma,
and they argue that a sentence is more economical without it. Thus, you
actually have the option to choose whether or not to include the final comma. Many
writers, however, still follow the old rule and expect to see the final comma. Note also
that while we can use a semicolon to connect two sentences, more often we glue two
sentences together with a comma followed by a conjunction.
If your sentence is rather short (perhaps between five and ten words), you may opt to omit the
comma after the introductory clause.
The comma may also be used to attach one or more words to the front or back of the
core sentence, or when you insert a group of words into the middle of a sentence. If the
group of words can be viewed as non-essential, commas will have to be put on both
sides of the group, as in: "The poverty data, sourced from the 2003 Family Income and
Expenditure Survey, were used in the analysis."
For more grammar bits and tips, you may want to look over various sources from the internet,
such as http://owl.english.purdue.edu/handouts/grammar/
Alley, Michael (2003). The Craft of Scientific Presentations. New York: Springer.
Asis, Maruja M.B. (2002). Formulating the Research Problem, Reference Material in the University of
the Philippines Summer Workshops on Research Proposal Writing.
Babbie, Earl (1992). The Practice of Social Research (6th ed.), California: Wadsworth Publishing Co.
Booth, Wayne C., Gregory G. Colomb, and Joseph M. Williams (1995). The Craft of Research. Chicago:
University of Chicago Press.
Draper, Norman and Harry Smith (1998). Applied Regression Analysis. New York: John Wiley.
Gall, Meredith D., Walter R. Borg, and Joyce P. Gall (2003). Educational Research: An Introduction, 7th
edition. New York: Pearson.
Haughton, Jonathan and Shahidur R. Khandker (2009). Handbook on Poverty and Inequality.
Washington, DC: World Bank.
Huff, Darrell (1954). How to Lie with Statistics. New York: W.W. Norton.
Hosmer, David, W., Jr., and Stanley Lemeshow. (2000). Applied Logistic Regression, Second Edition.
New York: John Wiley & Sons.
Kuhn, Thomas S. (1970). The Structure of Scientific Revolutions, Second Edition. Chicago: University
of Chicago Press.
Meliton, Juanico B.(2002). Reviewing Related Literature, Reference Material, University of the
Philippines Summer Workshops on Research Proposal Writing.
Mercado, Cesar M.B. (2002). Overview of the Research Process, Reference Material, University of the
Philippines Summer Workshops on Research Proposal Writing.
Mercado, Cesar M.B. (2002). Proposal and Report Writing, Reference Material, University of the
Philippines Summer Workshops on Research Proposal Writing.
Nachmias, Chava F. and David Nachmias (1996). Research Methods in the Social Sciences (5th ed.),
New York: St. Martin’s Press.
Phillips, Estelle M. and Derek S. Pugh (2000). How to Get a Ph.D.: A Handbook for Students and Their
Supervisors (3rd ed), Philadelphia: Open University Press.
Statistical Research and Training Center (SRTC) Training Manual on Research Methods.
University of the Philippines Statistical Center Research Foundation Training Manual on Statistical
Methods for the Social Sciences.
World Bank (2000) Integrating Quantitative and Qualitative Research in Development Projects. Michael
Bamberger (ed).