
CHAPTER TWO

Introduction to Statistical Package


•A statistical package is a suite of computer
programs that are specialised for statistical analysis.
•It enables people to obtain the results of standard
statistical procedures and statistical significance tests,
without requiring low-level numerical programming.
•Most statistical packages also provide facilities for
data management.
Econometric software
• Econometric software is statistical software that is specialised for econometric analysis.
List of statistical packages used mainly for econometric analysis
•This is an incomplete list of software that is designed mainly for the purpose of
performing econometric analyses.
EViews - Econometric Views
GAUSS (software)
gretl - Gnu Regression, Econometrics and Time-series Library
OxMetrics - Ox Based Econometrics
R (programming language)
RATS - Regression Analysis of Time Series
SHAZAM
SPSS - Statistical Package for the Social Sciences
Stata - all-purpose package
SAS System - all-purpose tool
TSP - Time Series Processor, an econometric package and programming language
and others.
STATA software
•Stata is a statistical package for managing, analyzing, and graphing
econometric data.

•Stata is powerful statistical software that enables users to analyze,
manage, and produce graphical visualizations of data.

•It is primarily used by researchers in the fields of economics, biomedicine,
and political science to examine data patterns.

•It has both a command-line and a graphical user interface, making the use of
the software more intuitive.

•Stata is available for a variety of platforms. It may be used either as a
point-and-click application or as a command-driven package.
•Stata provides an easy interface for those new to Stata and for experienced
Stata users who wish to execute a command that they seldom use.

•The command language provides a fast way to communicate with Stata and to
communicate more complex ideas.

•Stata works on a copy of the data in memory, so the only way to harm the
permanent copy of your data on disk is to explicitly save over it.

•Having the data in memory means that the dataset size is limited by the
amount of computer memory.

•Stata stores the data in memory in an efficient format; you will be
surprised how much data can fit.

•Nevertheless, if you work with extremely large datasets, you may run into
memory constraints.
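
The in-memory model described above can be illustrated outside Stata. The following is a minimal Python sketch, not Stata code; the file name and the values are hypothetical and serve only to show that changing the copy in memory leaves the file on disk untouched until it is explicitly saved.

```python
import csv
import os
import tempfile

# Hypothetical dataset written to disk (stands in for a saved data file).
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([["id", "income"], ["1", "100"], ["2", "200"]])

# Load a copy of the data into memory.
with open(path, newline="") as f:
    in_memory = list(csv.reader(f))

# Change the in-memory copy only.
in_memory[1][1] = "999"

# Re-read the file: it is unchanged because nothing was saved over it.
with open(path, newline="") as f:
    on_disk = list(csv.reader(f))

print(in_memory[1], on_disk[1])
```

Only an explicit write back to the same path would alter the permanent copy, which is the safety property the bullets describe.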
Basic Features of Stata
•STATA statistical software is a complete, integrated statistical software
package that provides everything you need for data analysis, data
management, and graphics. STATA is not sold in modules, which means
you get everything you need in one package.  Following are some
features.
Fast, accurate, and easy to use- Stata's speed is due partly to careful
programming, and partly to the fact that Stata keeps the data in memory.
Broad suite of statistical features
Complete data-management facilities
Publication-quality graphics
Responsive and extensible
Matrix programming—Mata
Cross-platform compatible
Complete documentation and other publications
Reproducibility- Ideally, anyone with your programs and data should
be able to reproduce your results without your assistance.
Transportability-Stata binary files may be easily transformed into SPSS or
SAS files with the third-party application Stat/Transfer.
Stat/Transfer can also transfer SAS, SPSS and many other file
formats into Stata format, without loss of variable labels, value
labels, and the like
Programmability of tasks- Stata may be used in an interactive mode, and those learning the package may wish to make use of the menu system. But when you execute a command from a pull-down menu, Stata records the command that you could have typed in the Review window, and thus you may learn that with experience you could type that command (or modify it and resubmit it) more quickly than by using the menus.
•Stata’s file model is that of a word processor: a dataset may exist on disk, but
the dataset in memory is a copy.
•Datasets are loaded into memory, where they are worked on, analyzed, changed,
and then perhaps stored back on disk.
•Working on a copy of the data in memory makes Stata safe for interactive use.
•Stata is very well supported by telephone and email technical support,
as well as the more informal support provided by other users on
StataList, the Stata listserv.
•Stata also makes it very easy to perform an alternative analysis of a particular
model.
Plus points of Stata
• It has the best range of survey software that is available in a standard
commercial package. It has a very wide range of high quality and up-to-
date statistical software.
• Its output is clear and easy to read. Its statistical procedures are very
well documented in manuals.
• It can produce high quality graphics

Returned results- Stata commands are either r-class commands, like
summarize, that return results, or e-class commands, that return
estimates. You may examine the set of results from an r-class
command with the command return list.

Minus points of Stata
•Its main interface, with a single command line, seems a bit clumsy, but running
from do-files works well.
•Stata is available in three editions, although perhaps sizes would
be a better word.
•The editions are, from largest to smallest, Stata/MP, Stata/SE,
and Stata/BE. (Prior to Stata 17, the various editions of Stata were
called flavors, and Stata/BE was called Stata/IC.)
•Stata/MP is the multiprocessor version of Stata. It runs on
multiple CPUs or on multiple cores, from 2 to 64.
•Stata/MP uses however many cores you tell it to use (even one),
up to the number of cores for which you are licensed.
•Stata/MP is the fastest version of Stata. Even so, all the details of
parallelization are handled internally and you use Stata/MP just
like you use any other editions of Stata.
•In addition to being the fastest version of Stata, Stata/MP is also
the largest. Stata/MP allows up to 1,099,511,627,775
observations in theory, but you will undoubtedly run out of
memory first.
•You may have up to 120,000 variables with Stata/MP.
Statistical models may have up to 11,000 variables.
•Stata/SE is like Stata/MP, but for single CPUs.
Stata/SE will run on multiple CPUs or multiple-core
computers, but it will use only one CPU or core. SE
stands for standard edition.
•Stata/SE allows up to 2,147,583,647 observations, assuming
you have enough memory.
•You may have up to 32,767 variables, and, like Stata/MP,
statistical models may have up to 11,000 variables.
•Stata/BE is the basic version of Stata. Up to 2,147,583,647
observations and 2,048 variables are allowed, depending on
memory. Statistical models may have up to 800 variables.
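
The phrase "assuming you have enough memory" can be made concrete with a little arithmetic. The sketch below is a rough Python estimate, not a statement about how Stata lays out data: it assumes every value is stored as an 8-byte double and ignores any overhead. The observation limits are the ones quoted above.

```python
# Observation limits quoted for the editions.
MP_MAX_OBS = 1_099_511_627_775
SE_BE_MAX_OBS = 2_147_583_647

# The Stata/MP limit is exactly 2**40 - 1.
mp_limit_is_2_to_40_minus_1 = MP_MAX_OBS == 2**40 - 1

BYTES_PER_DOUBLE = 8  # assumption: one 8-byte double per stored value


def memory_gib(n_obs, n_vars, bytes_per_value=BYTES_PER_DOUBLE):
    """Approximate memory in GiB needed for n_obs x n_vars values."""
    return n_obs * n_vars * bytes_per_value / 2**30


# Memory for a single variable at each observation limit.
one_var_at_mp_limit = memory_gib(MP_MAX_OBS, 1)     # just under 8192 GiB
one_var_at_se_limit = memory_gib(SE_BE_MAX_OBS, 1)  # a little over 16 GiB

print(mp_limit_is_2_to_40_minus_1, one_var_at_mp_limit, one_var_at_se_limit)
```

Under these assumptions a single variable at the Stata/MP limit would need about 8 TiB of memory, which is why available memory is normally the binding constraint.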
Size Comparison of Stata/MP, SE, and BE
•Stata/MP allows more variables and observations, longer
macros, and a longer command line than Stata/SE.
•Stata/SE allows more variables, larger models, longer
macros, and a longer command line than Stata/BE.
•The longer command line and macro length are required
because of the greater number of variables allowed.
•The larger model means that Stata/MP and Stata/SE can fit
statistical models with more independent variables.
Speed comparison of Stata/MP, SE, and BE
•StataCorp has published a white paper comparing the performance
of Stata/MP with that of Stata/SE.
•The white paper includes command-by-command
performance measurements.
• In summary, on a dual-core computer, Stata/MP will run
commands in 71% of the time required by Stata/SE.
• There is variation; some commands run in half the time and
others are not sped up at all. Statistical estimation
commands run in 59% of the time. Numbers quoted are
medians.
• Average performance gains are higher because commands
that take longer to execute are generally sped up more.
• Stata/MP running on four cores runs in 50% (all commands)
and 35% (estimation commands) of the time required by
Stata/SE. Both numbers are median measures.
• Stata/MP supports up to 64 cores.
• Stata/BE is slower than Stata/SE, but those
differences emerge only when processing datasets
that are pushing the limits of Stata/BE.
• Stata/SE has a larger memory footprint and uses
that extra memory for larger look-aside tables to
process large datasets more efficiently. The real
benefits of the larger tables become apparent only
after exceeding the limits of Stata/BE. Stata/SE was
designed for processing large datasets.
• The differences are all technical and internal. From
the user’s point of view, Stata/MP, Stata/SE, and
Stata/BE work the same way.
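
The run-time percentages quoted above can be converted into speedup factors, since a speedup is simply the reciprocal of the relative run time. This is a small Python sketch of that arithmetic; the only inputs are the median figures from the text.

```python
# Median run time of Stata/MP as a fraction of the Stata/SE run time,
# taken from the figures quoted above.
relative_time = {
    ("2 cores", "all commands"): 0.71,
    ("2 cores", "estimation commands"): 0.59,
    ("4 cores", "all commands"): 0.50,
    ("4 cores", "estimation commands"): 0.35,
}

# Speedup factor = 1 / relative run time.
speedup = {key: 1 / fraction for key, fraction in relative_time.items()}

for (cores, kind), factor in speedup.items():
    print(f"{cores}, {kind}: {factor:.2f}x faster")
```

On these medians, four cores give roughly a 2x speedup across all commands rather than 4x, so the gain from extra cores is less than proportional.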
Feature comparison of Stata/MP, SE, and BE
• The features of all editions of Stata on
all platforms are the same.

• The differences among the three
editions of Stata are in speed and in
limits, as discussed above.
SPSS PACKAGES
•SPSS stands for “Statistical Package for the Social
Sciences”. It is an IBM product, first launched in
1968. It is a single software package used
mainly for statistical analysis of data.

•SPSS is mainly used in areas such as healthcare,
marketing, and educational research. Its users include
market researchers, health researchers, survey
companies, education researchers, government,
marketing organizations, data miners, and many
others.
•It provides data analysis for descriptive statistics,
numerical outcome prediction, and identifying groups.
The software also offers data transformation, graphing,
and direct marketing features to manage data
smoothly.

•The SPSS Statistics program provides a large amount of
basic statistical functionality, including
frequencies, cross-tabulation, bivariate statistics, etc.
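
To make the terms concrete, the sketch below shows what a frequency table and a cross-tabulation compute. It is plain Python, not SPSS syntax, and the survey responses are hypothetical.

```python
from collections import Counter

# Hypothetical survey responses: (gender, answer).
responses = [
    ("F", "yes"), ("M", "no"), ("F", "yes"),
    ("M", "yes"), ("F", "no"), ("M", "no"),
]

# Frequencies: how often each answer occurs.
frequencies = Counter(answer for _, answer in responses)

# Cross-tabulation: counts for every (gender, answer) combination.
crosstab = Counter(responses)

print(frequencies)
print(crosstab)
```

A frequency table summarises one variable at a time, while a cross-tabulation counts the joint occurrences of two variables.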
Features of SPSS
• Data from any survey collected via SurveyGizmo
can easily be exported to SPSS for detailed
analysis.
• In SPSS, data is stored in .sav format. This data
mostly comes from surveys. The format makes the process
of manipulating, analyzing, and extracting data very
simple.
• SPSS gives easy access to data with
different variable types, and this variable data is easy
to understand. SPSS helps researchers set up
models easily because most of the process is
automated.
• Once the data is in SPSS, the analysis can begin, and
there is a great deal that can be done with it.
• SPSS can also extract information from critical data.
Trend analysis, assumptions, and predictive
models are some of the characteristics of SPSS.
• SPSS is easy to learn, use, and apply.
• It keeps data management and
editing tools handy.
• SPSS offers in-depth statistical capabilities for
analyzing outcomes precisely.
• SPSS provides design, plotting, reporting, and
presentation features for more clarity.
Statistical Methods of SPSS
Many statistical methods can be used in SPSS:
• Prediction for identifying groups, including
methodologies such as cluster analysis and factor
analysis.
• Descriptive statistics, including frequencies,
cross-tabulation, and descriptive ratio
statistics.
• Bivariate statistics, including
analysis of variance (ANOVA), means, correlation, and
nonparametric tests.
• Numerical outcome prediction, such as linear regression.
• SPSS is largely self-guiding: on start-up it assumes
that you want to open an existing file and
opens a dialog box asking which file you would like to
open.
• This approach makes it very easy to navigate the
interface and windows in SPSS once a file is open.

• Besides the statistical analysis of data, the SPSS software
also provides data management features; these allow the
user to select cases, create derived data, and perform file
reshaping, etc. Another feature is data documentation,
which stores a metadata dictionary along with the
data file.
Views in SPSS
• SPSS has two views: Variable View and Data
View.
Variable View
• Name: This column holds the unique name that identifies each variable.
For example, demographic variables such as name, gender, age,
and educational qualification each get their own variable
name.
• The only restriction is that spaces and most special characters are not
allowed in a variable name.
Label: A descriptive label for the variable. Unlike the name, the label
may contain spaces and special characters.
Type: Specifies the kind of data the variable holds (for example numeric,
string, or date), which is useful when different kinds of data are entered.
Width: Sets the width of the variable's values in characters.
Decimals: Sets the number of digits shown after the decimal point,
for example when entering percentage values.
Values: Lets the user assign value labels to coded values.
Missing: Lets the user declare which values should be treated as missing,
so that they are excluded from analysis.
Align: Sets the alignment of the values in Data View, for example left or
right.
Measure: Sets the level of measurement of the variable:
scale, ordinal, or nominal.
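
The effect of the Missing attribute can be shown with a short sketch: values flagged as missing are left out of the analysis instead of being treated as real data. This is plain Python rather than SPSS syntax, and both the ages and the code 999 are hypothetical.

```python
from statistics import mean

# Hypothetical ages; 999 is a code the user has declared as "missing".
ages = [23, 35, 999, 41, 999, 29]
MISSING_CODES = {999}

# Keep only the values that are not declared missing.
valid_ages = [a for a in ages if a not in MISSING_CODES]

mean_with_codes = mean(ages)    # distorted by the 999 codes
mean_valid = mean(valid_ages)   # uses only the real observations

print(mean_with_codes, mean_valid)
```

Without the missing-value declaration, the placeholder codes would be averaged in as if they were genuine ages.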
Variable View
• It allows us to customize each variable's attributes as required for
the analysis.
To analyze the data, one needs to fill in the different
column headings: Name, Label, Type, Width, Decimals,
Values, Missing, Columns, Align, and Measure.
These headings are the attributes that
characterize the data.
Data View
The data view is structured as rows and columns. By
importing a file or adding data manually, we can work with
SPSS.
Following are the steps for importing an Excel file into SPSS.
Click on File
=> Open
=> Data
=> in the dialog box, choose “Files of type”
=> Excel (.xls), then select the .xls file.
COMPARISON OF STATA VS SPSS
COMPLEXITY
• SPSS can be used for modeling very complex data,
while Stata can be used for complex analysis.
UTILITY
• SPSS is mainly used for multivariate analysis
procedures on large amounts of data, while Stata
provides standard analysis procedures.
APPLICATIONS
• SPSS can be used in medical and social science
areas, while Stata is mostly used in econometrics.
BENEFITS
• SPSS can turn outputs directly into reports,
while Stata has command-line and documentation
features that are highly useful.
ANALYSIS
• SPSS can be used for complex data management,
such as Excel spreadsheets, and for multivariate analysis
procedures on large amounts of data, while Stata is
useful in cutting-edge research and ideal for
developers.
STATISTICAL ANALYSIS
• SPSS is a bit stronger in this area, while Stata is relatively
weak in this area.
CHAPTER THREE
Data and Data Management in the Packages
• Data is the facts that can be analyzed or used in an effort to gain knowledge or
make decisions; in other words, information.
• It is statistics or other information represented in a form suitable for processing
by computer.
• In scientific writing, data is usually treated as a singular in much the same way
as the word information is.
• Data is a powerful tool and is the fuel behind the most efficient and effective
decisions. In general:
• Data is a representation of facts, concepts, or instructions in a formalized
manner suitable for communication, interpretation, or processing by humans or
by automatic means.
• It is factual information (such as measurements or statistics) used as a basis
for reasoning, discussion, or calculation.
• It is information in digital form that can be transmitted or processed.
• It is information output by a sensing device or organ that includes both
useful and irrelevant or redundant information and must be processed to be
meaningful.
Types of Statistical data
Cross-sectional Data
• These are observations that come from
different groups or individuals at a single point in
time.
• The underlying population should have
members with similar characteristics.
• Cross-sectional data is collected from all
the participants at the same time.
• Time is not considered a study variable in
cross-sectional research.
• In practice, though, the participants in a
cross-sectional study do not all give their
information at exactly the same moment.
• Cross-sectional data is collected from the
participants within a short time frame, known
as the field period. Time only
adds variance to the results; it does not
bias them.
• With cross-sectional data, the ordering of the data
does not matter. In other words, we can sort the
data in ascending, descending, or even random
order, and this will not affect our modeling results.
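
The claim that row order does not affect cross-sectional results can be checked numerically. The sketch below is Python with hypothetical schooling and wage figures; it fits a simple least-squares line before and after shuffling the rows and compares the slopes.

```python
import math
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical cross-section: (years of schooling, wage) for six people.
people = [(8, 10.0), (10, 12.5), (12, 15.0), (14, 18.5), (16, 21.0), (18, 26.0)]

slope_original = ols_slope(*zip(*people))

shuffled = people[:]
random.Random(0).shuffle(shuffled)  # reorder the rows
slope_shuffled = ols_slope(*zip(*shuffled))

# Compare with a tolerance, since floating-point sums depend on order.
same_result = math.isclose(slope_original, slope_shuffled)
print(slope_original, slope_shuffled, same_result)
```

Each row keeps its own pair of values when shuffled, so the sums that define the slope are unchanged apart from floating-point rounding.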
Merits and Demerits of Cross-sectional Data
Time-series data
• These are observations collected at
equally spaced time intervals. For example,
consider the daily closing price of a particular stock
recorded over the past four weeks.
• Note that a time span that is too short or
too long can lead to time bias.
• When data is collected for the same variable
over time, such as over months or years, it
is called time-series data. Virtually any
time interval can be used.
• Data collected at a number of specific
points in time is called time-series data.
Examples include stock prices, interest
rates, exchange rates, product
prices, and GDP.
• Time-series data can be observed at many
different frequencies (hourly, daily, weekly,
monthly, quarterly, annually, etc.).
• Time-series data can be used to predict the
future values of a given financial vehicle.
• A time-series dataset consists of observations
of one individual at multiple time intervals.
• Unlike cross-sectional data, the ordering of the
data is important in time-series data.
• Each point represents the value at a specific
point in time.
• As such, time-series data are typically presented
in chronological order.
• Changing the order of the data ignores the time
dimension of the data.
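
A quick way to see why ordering matters for time-series data is to compute period-to-period changes before and after reordering. The prices in this Python sketch are hypothetical.

```python
# Hypothetical daily closing prices in chronological order.
prices = [100.0, 102.0, 101.0, 105.0, 107.0]


def first_differences(series):
    """Change from each observation to the next one."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


changes_chronological = first_differences(prices)

# Sorting the same values destroys the time ordering.
changes_sorted = first_differences(sorted(prices))

print(changes_chronological)
print(changes_sorted)
```

The values are identical in both cases, but the day-to-day changes differ, so any analysis built on lags or changes depends on keeping the chronological order.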
Panel Data
• A panel dataset consists of observations of
multiple individuals obtained at multiple time
intervals.
• Panel data is a combination of a time-series dataset (such
as a stock price by date) and a cross-sectional dataset
(such as the population of a city in a particular year).
•When cross-sectional data is repeated in the dataset for multiple years or
timestamps, it is known as panel data.
•Panel data is also known as cross-sectional time-series data
or longitudinal data. It is generally obtained by
observing a large number of cross-sectional units
over time.
Comparison of Time Series and Panel Data
• Time-series data focuses on a single individual, while panel data
focuses on multiple individuals.

• As an application of both types of data, the profit of an
individual over a period of ten years is an example of time-series
data, while the profit of a set of individuals over a period of
ten years is an example of panel data.

• The key difference between time-series and panel data is
that time-series data focuses on a single individual at multiple
time intervals, while panel data (or longitudinal data)
focuses on multiple individuals at multiple time intervals.
• Time-series data focuses on observations of a
single individual at different times, usually at
uniform intervals. One example is the income
of an organization calculated at the end of
each year for a period of 5 years.
• Panel data is also called longitudinal data.

• This type of data focuses on multiple
individuals at multiple time periods.
Panel data has the form X_it, where i denotes
the individual and t denotes the time period.
One example is the
Gross Domestic Product (GDP) of five countries
over a period of ten years, such as 2001 to
2010.
• In this scenario, there is a total of 5 × 10 = 50
observations.
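
The GDP example can be laid out as a small data structure with one row per country-year pair. This Python sketch uses hypothetical country labels and leaves the GDP values empty, since the text gives no figures; it only shows how the i and t indices produce the observations.

```python
from itertools import product

countries = ["A", "B", "C", "D", "E"]  # hypothetical country labels (index i)
years = range(2001, 2011)               # 2001 to 2010 inclusive (index t)

# One row per (i, t) pair; gdp stays None because no figures are given.
panel = [
    {"country": i, "year": t, "gdp": None}
    for i, t in product(countries, years)
]

n_observations = len(panel)

# The rows for a single country form a time series.
series_for_a = [row for row in panel if row["country"] == "A"]

# The rows for a single year form a cross-section.
cross_section_2005 = [row for row in panel if row["year"] == 2005]

print(n_observations, len(series_for_a), len(cross_section_2005))
```

Fixing i gives a ten-point time series and fixing t gives a five-unit cross-section, which is why panel data is described as combining the two.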
TIME SERIES Vs PANEL
Panel data is generally divided into two categories:
1. Balanced Panel Data
When the same cross-sectional units are observed in every time
period, the data is known as balanced panel data.
• Here we have the same set of cross-sections for every time
period.
Example:
We have a dataset for 5 cities for the year 2001 and
the same dataset for the same 5 cities for the year 2002.
2. Unbalanced Panel Data
When the cross-sectional units are not all observed in every time
period, the data is known as unbalanced panel data.
• In unbalanced panel data, some cross-sectional observations are
missing for some time periods, so different periods contain
different sets of cross-sections.
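
The distinction between the two categories can be expressed as a simple check: a panel is balanced when every unit appears in every period. The function name and city labels in this Python sketch are hypothetical.

```python
def is_balanced(rows):
    """True if every unit is observed in every time period."""
    units = {unit for unit, _ in rows}
    periods = {period for _, period in rows}
    return len(set(rows)) == len(units) * len(periods)


cities = ["C1", "C2", "C3", "C4", "C5"]  # hypothetical city labels

# Every city observed in both 2001 and 2002.
balanced = [(c, y) for c in cities for y in (2001, 2002)]

# Drop one city-year to make the panel unbalanced.
unbalanced = [row for row in balanced if row != ("C3", 2002)]

print(is_balanced(balanced), is_balanced(unbalanced))
```

The check compares the number of distinct unit-period rows with the number that a complete grid of units and periods would contain.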
Advantages of Panel Data Analysis
The advantages are given below:
Panel data captures general, common, and individual behaviors of groups.
Panel data contains more information, more variability, and more properties than
pure time-series data or cross-sectional data.
Panel data can detect and measure statistical effects that pure time-series or
cross-sectional data cannot.
Panel data minimizes estimation biases that may arise from aggregating groups into
a single time series.
After extracting the data from different sources, the first step that researchers
follow is to clean the data and check the quality of the panel data.
Panel data should already be well arranged by both
cross-sectional and time-series variables, and should show the presence of fixed and/or random
effects. Otherwise, the data are simply (or physically) arranged in the panel data
format but will not be considered panel data in an econometric sense.
The most important aspect is consistency in the unit of analysis or measurement,
which means that each observation in a dataset is treated and weighted
equally.
This requirement seems self-evident, but it is often overlooked in analytical research. If
the observations are not equivalent in many respects, any analysis based on such data may not
be adequate or reliable.
Introducing Data into the Package
Into Stata
Importing an Excel or Text Data File into Stata: To import an Excel file (e.g.
“Example_Dataset.xlsx”) click on File, then on Import, then on Excel spreadsheet. A
new window will open. Click Browse and navigate to the folder where the data file you
want to use is stored, and then click on Open. You will see a preview of the data file in
the “Import Excel” window. If the first row of your data file contains the variable
names, as it does for the “Example_Dataset” data file, check the box next to “Import
first row as variable names”.
Saving a Dataset in Stata Format: If you make modifications to an original dataset (say
by recoding variables, or creating new ones), it is best practice to save the modified
dataset as a new data file, instead of overwriting the original file. That way if there
turn out to be errors in the modified file, you can always start afresh with the original
dataset.
Recoding and Labeling Variables: Recoding categorical or
quantitative variables can be useful in a number of
circumstances. For example, you might want to use fewer, more
aggregated categories than those used in collecting the data,
change the ordering of a variable’s categories for some reason, or
recode a quantitative variable as a categorical variable.
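
The last case, recoding a quantitative variable as a categorical one, can be sketched as follows. This is Python that illustrates the logic only, not Stata syntax, and the age cut points and category names are hypothetical.

```python
def recode_age(age):
    """Recode a quantitative age into an aggregated category."""
    if age < 18:
        return "under 18"
    if age < 65:
        return "18-64"
    return "65 and over"


ages = [12, 18, 40, 64, 65, 80]
age_groups = [recode_age(a) for a in ages]

print(age_groups)
```

Keeping the original variable alongside the recoded one makes it possible to check the recoding and to choose different cut points later.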
Creating a “Do” File in Stata: A do file lists and executes Stata
commands. It is a convenient and efficient alternative to typing
commands in the Stata command box. By storing commands for a
particular analysis in a do file, you can easily replicate your
results, re-run your analysis with modifications and elaborations,
or repeat it after correcting errors. A do file is a separate file that
has a “.do” extension

https://sociology.fas.harvard.edu/need-help-basic-stata
Stata YouTube: https://www.youtube.com/user/statacorp
F-Test: The Basics
An F-test is used to test whether two population variances are equal. 
T-Test: The Basics
A two sample t-test is used to test whether or not the means of two
populations are equal.
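
Both test statistics can be computed by hand from two samples. The Python sketch below uses hypothetical data and the standard library only. The t statistic shown is the pooled-variance form, which assumes equal population variances, and obtaining p-values would additionally require the F and t distributions, which are not included here.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical samples from two populations.
sample_a = [12.0, 15.0, 11.0, 14.0, 13.0]
sample_b = [10.0, 18.0, 9.0, 17.0, 16.0]

n_a, n_b = len(sample_a), len(sample_b)
var_a, var_b = variance(sample_a), variance(sample_b)  # sample variances

# F statistic: ratio of the two sample variances.
f_stat = var_a / var_b
f_df = (n_a - 1, n_b - 1)  # numerator and denominator degrees of freedom

# Two-sample t statistic with a pooled variance
# (this form assumes the two population variances are equal).
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_df = n_a + n_b - 2

print(f_stat, f_df, t_stat, t_df)
```

The F statistic compares spread and the t statistic compares location, which matches the division of labour between the two tests.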
F-Test vs. T-Test: When to Use Each
We typically use an F-test to answer:
Do two samples come from populations with equal variances?
Does a new treatment or process reduce the variability of some
current treatment or process?
And we typically use a t-test to answer:
Are two population means equal? (We use a two-sample t-test to
answer this.)
Is one population mean equal to a certain value? (We use a
one-sample t-test to answer this.)
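
The one-sample case can be sketched in the same way: the statistic is the distance between the sample mean and the hypothesized value, measured in standard errors. The sample and the hypothesized mean in this Python sketch are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample and hypothesized population mean.
sample = [52.0, 48.0, 51.0, 49.0, 55.0]
mu_0 = 50.0

n = len(sample)
standard_error = stdev(sample) / sqrt(n)  # sample sd divided by sqrt(n)
t_stat = (mean(sample) - mu_0) / standard_error
df = n - 1  # degrees of freedom

print(t_stat, df)
```

The statistic would then be compared with Student's t-distribution on n - 1 degrees of freedom to judge whether the mean differs from the hypothesized value.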
Key Differences Between T-test and F-test
• The t-test is a univariate hypothesis test that is applied when
the standard deviation is not known and the sample
size is small. The F-test, on the other hand, is a statistical
test that determines whether the variances
of two normal populations are equal.
• The t-test is based on the t-statistic, which follows Student's
t-distribution under the null hypothesis. Conversely,
the F-test is based on the F-statistic, which follows Snedecor's
F-distribution under the null hypothesis.
• The t-test is used to compare the means of two
populations. In contrast, the F-test is used to compare
two population variances.
