
FOCUS for IBM Mainframe

Statistical Analysis Users Manual

DN1000019.1091

Table of Contents

1. Introduction ................................................... 1-1
   1.1 Sources of Data ............................................ 1-2
       1.1.1 External Files ....................................... 1-2
       1.1.2 FOCUS Files .......................................... 1-2
   1.2 Preparing Data for Analysis ................................ 1-2
       1.2.1 Direct Analysis ...................................... 1-3
       1.2.2 Selective Data Analysis .............................. 1-3
       1.2.3 Specifying Variables ................................. 1-3
       1.2.4 Selecting Data ....................................... 1-3
   1.3 Environmental Considerations ............................... 1-5
   1.4 The Examples in this Manual ................................ 1-7
2. Preparing for ANALYSE Sessions ................................. 2-1
   2.1 Entering the Environment ................................... 2-1
   2.2 Statistical Operations Summary ............................. 2-2
   2.3 Preparing the Environment: STATSET ......................... 2-2
   2.4 Limiting the Sample Size: FILESIZE ......................... 2-6
   2.5 Reviewing Online Documentation: EXPLAIN .................... 2-7
3. The Statistical Operations ..................................... 3-1
   3.1 Analysis of Variance: ANOVA ................................ 3-1
   3.2 Correlation Analysis: CORRE ................................ 3-2
   3.3 Exponential Smoothing: EXSMO ............................... 3-4
   3.4 Factor Analysis: FACTO ..................................... 3-10
   3.5 Discriminant Analysis: MDISC ............................... 3-12
   3.6 Multiple Linear Regression: MULTR .......................... 3-15
   3.7 Polynomial Regression: POLRG ............................... 3-19
   3.8 Descriptive Statistics: STATS .............................. 3-22
       3.8.1 Control Options ...................................... 3-28
   3.9 Stepwise Multiple Regression: STEPR ........................ 3-31
   3.10 Time-Series Analysis: TIMESER ............................. 3-35
       3.10.1 The Time Variable ................................... 3-36
       3.10.2 Commands that Create New Variables .................. 3-36
       3.10.3 Other TIMESER Commands .............................. 3-42
       3.10.4 Saving Forecast Equations ........................... 3-43
       3.10.5 Ease-of-Use Features ................................ 3-43
   3.11 Crosstabulations: XTABS ................................... 3-46
       3.11.1 Specifying Variables ................................ 3-47
       3.11.2 Specifying Options .................................. 3-48
       3.11.3 Summary Statistics .................................. 3-50
       3.11.4 Cell Statistics ..................................... 3-54
       3.11.5 Ease-of-Use Features ................................ 3-55
       3.11.6 Control Options ..................................... 3-56
       3.11.7 Specifying Columns and Rows ......................... 3-56
       3.11.8 General Notes ....................................... 3-57
A. Error Messages ................................................. A-1
B. Master File Descriptions and Diagrams .......................... B-1

1 Introduction
FOCUS provides a full range of statistical tools. These facilities have been designed for ease of use in a conversational environment. The spectrum of facilities covers two types of statistical operations:

Simple functions supplied by the FOCUS report request language as part of the regular report writer, including:

   Minimums.
   Maximums.
   Averages.
   Average sums of squares.
   Percentage counts.
   User-supplied functions.

Complex statistical operations with interactive prompts to determine what to perform next (based on results obtained so far). These include:

   Descriptive Statistics.
   Correlations.
   Multiple Linear Regressions.
   Stepwise Regressions.
   Polynomial Regressions.
   Analyses of Variance.
   Discriminant Analyses.
   Factor Analyses.
   Exponential Smoothing.
   Time Series Analyses and Forecasting.

Statistical operations are performed on selected sets of variables and their values (observations) collected from one or more FOCUS or external files by the FOCUS report writer. Up to 64 variables are permitted. The number of observations is limited only by the amount of storage available to the user (e.g., 20,000 observations with 20 numeric variables would need 1.6 million bytes of virtual storage).

The steps for performing statistical analysis of data in FOCUS databases are:

1. Create a HOLD file of selected and/or redefined data using the report writer.
2. Issue the ANALYSE command.
3. Respond to the prompts.

The steps for performing statistical analysis of data in external files are:

1. Create a Master File Description describing the data to be analyzed (attributes of the variables: names, formats and lengths, etc. -- described in the FOCUS Users Manual), or create a HOLD file of selected and redefined data using the report generation language (only necessary if the external file is not a fixed-format file).
2. Issue the ANALYSE command.
3. Respond to the prompts.

1.1 Sources of Data


Data for most statistical analyses may be drawn from two sources: external files and FOCUS files.

1.1.1 External Files


Any user-supplied fixed-format file can be used. Each such file must be accompanied by a Master File Description in which the fieldnames and formats of data are stated. The facilities for describing them are contained in the FOCUS Users Manual. When using external files:

   Create and maintain them with your normal system facilities.
   Fixed-format data files can be analyzed directly (you do not need to create an extract file).

1.1.2 FOCUS Files


Any FOCUS file or set of interconnected files may supply data for statistical analysis. FOCUS files are recommended for complex data structures. Advantages of using FOCUS files:

   More efficient storage due to elimination of redundant data.
   Faster data retrieval for analysis than external files when data selection is involved.
   The FOCUS data validation, maintenance, and correction facilities are available.
   Facilitates the creation of control variables in ANOVA and MDISC.

Data from any combination of these sources may be collected for analysis through the facilities of the FOCUS request language. Documentation of the procedures for describing and building FOCUS files appears in the FOCUS Users Manual.

The numeric data fields in either FOCUS or external files may be treated as any one of the following four numeric representations: binary integer, floating point, double-precision floating point, or packed decimal. It is acceptable to have mixed formats in a file. Alphanumeric fields are supported (used to order groups as in STATS, XTABS).


1.2 Preparing Data for Analysis


The steps taken to prepare data for analysis depend on the location and condition of the data to be analyzed. The subsections that follow briefly outline the preparations.


Information Builders

1.2.1 Direct Analysis


When using all of the data in a fixed-format external file for analysis (with no selection or modifications), simply issue the ANALYSE command:
ANALYSE FILE filename

1.2.2 Selective Data Analysis


Subsets of the data, actual or modified, may be statistically analyzed by first extracting the data with a TABLE command. Before the analysis process begins, issue a normal FOCUS report request naming the file to be accessed (FOCUS or external) and selecting the fields for analysis:
TABLE FILE filename
PRINT field1 field2...fieldn
ON TABLE HOLD
END
ANALYSE FILE HOLD...

1.2.3 Specifying Variables


Variables to be analyzed are specified as FOCUS verb objects:
PRINT POPULATION MILLIONAIRES

Variables can be created or transformed directly with the DEFINE or COMPUTE facilities, which are described in the FOCUS Users Manual.

1.2.4 Selecting Data


Use FOCUS screening statements to specify the data you wish to analyze:
IF STATE EQ TEXAS

The prepared data is placed in a HOLD file. All analysis is then performed on the data in the HOLD file. Such procedures may be run reiteratively since FOCUS treats HOLD files as external files (with their own descriptions), which can be overwritten by subsequent requests. An alternative method is to create subsets of data using the SELECT operand in a STATSET operation (see Section 2.3 Preparing the Environment: STATSET on page 2-2).


Figure 1-1. Extracting Data for Analysis

Several sets of extracted data may be held for subsequent analysis with the HOLD AS feature.

Example 1: Selected Variables
TABLE FILE SMSA
PRINT POPULATION
ON TABLE HOLD AS POP
END
ANALYSE FILE POP
.
.
.

Example 2: Transformed Variables


DEFINE FILE SMSA
BIGSMSA/I2=IF POPULATION GT 500 THEN 1 ELSE 0;
END
TABLEF FILE SMSA
PRINT POPULATION BIGSMSA
AND COMPUTE LOGPOP=LOG(POPULATION);
ON TABLE HOLD AS NEWPOP
END
ANALYSE FILE NEWPOP
.
.
.

Example 3: Selected Data


TABLE FILE PROPERTY
PRINT REGION CTYPE AND PROPERTY
ON TABLE HOLD AS PROPHOLD
IF REGION EQ "NORTHEAST"
END
ANALYSE FILE PROPHOLD
.
.
.

1.3 Environmental Considerations


ANALYSE operations may be performed live, in batches, one response at a time, or from FOCEXECs in the same manner as other FOCUS commands (TABLE, GRAPH, etc.). For example, an ANALYSE session that performs a correlation analysis followed by a multiple linear regression analysis may be stored in a FOCEXEC as shown (the contents of the FOCEXEC are shown in the column on the right):

   Enter ANALYSE.
   Select correlation analysis.
   Select multiple linear regression.
   Specify the dependent variable.
   Specify two independent variables.
   Specify the independent variables.
   Request a regression equation.
   Omit table of residuals.
   End the ANALYSE session.

Figure 1-2. Sample ANALYSE Session

This sample 10-line FOCEXEC could be stored in a file and executed by entering EX focexecname in FOCUS (focexecname is a name of up to eight characters that you assign when you create the file). It should be noted that ANALYSE automatically inhibits all prompt messages when used from Dialogue Manager and that all message facilities of the Dialogue Manager are available (see the FOCUS Users Manual). To facilitate operation by experienced users, typing ahead is supported. You can use either comma- or blank-delimited type-ahead responses. For example, the FOCEXEC in Figure 1-2 could also be written as follows:
ANALYSE FILE HOLD
CORRE,MULTR,&1,2,DIST,DEPTH,YES,NO,QUIT

or, blank-delimited:

ANALYSE FILE HOLD
CORRE MULTR TEMP 2 DIST DEPTH YES NO QUIT

in which case you would supply a name for variable &1 on the command line when you executed the FOCEXEC (see the FOCUS Users Manual). You can also type ahead in live sessions:


ANALYSE FILE HOLD CORRE,MULTR,TEMP,2,DIST,DEPTH,YES,NO,QUIT

To direct ANALYSE output to an offline device (usually a line printer), set PRINT=OFFLINE in FOCUS before entering the ANALYSE environment, or select STATSET option ONLINE=OFF. The FOCUS LET facility is also supported in ANALYSE, and offers an easy way to store row and column labels in crosstabs before starting your ANALYSE session (see Section 3.11.7 Specifying Columns and Rows on page 3-56). The following example illustrates the combined use of ANALYSE, GRAPH, and TABLE facilities.

   Enter ANALYSE.
   Set up EQFILE TEMP1.
   Specify a regression.
   Set up EQFILE TEMP2.
   Specify a second regression.
   End the ANALYSE session.
   Execute TEMP1 FOCEXEC (DEFINEs).
   Execute TEMP2 FOCEXEC (DEFINEs).
   Graph TEMP1, TEMP2 and the difference across DIST.
ANALYSE FILE HOLD
STATSET EQVAR=TEMP1,EQFILE=TEMP1
MULTR,TEMP,2,DIST,DEPTH,YES,NO
STATSET EQVAR=TEMP2,EQFILE=TEMP2
MULTR,TEMP,3,DIST,DEPTH,BACTERIA,YES,NO
QUIT
EX TEMP1 HOLD
EX TEMP2 HOLD ADD
GRAPH FILE HOLD
WRITE TEMP1 AND TEMP2
AND COMPUTE DIFF=TEMP1-TEMP2
ACROSS DIST
END
RETYPE

Create a TABLE report of the graph.

Figure 1-3. Using TABLE, ANALYSE, and GRAPH in a Session

Note the use of the ANALYSE option STATSET at the start of the FOCEXEC in Figure 1-3. STATSET provides facilities for altering the default parameters (called STATSET flags) that control the ANALYSE environment during your session. Use these facilities to:

   Set up special HOLD files for ANALYSE output, variables, and equation files.
   Specify handling of records with missing data.
   Specify the print destination.
   Set the print width for your terminal.

If you are a new user, you may wish to review online descriptions of each of the operations. You can summon these by selecting EXPLAIN in response to the ANALYSE prompt for a statistical operation.


1.4 The Examples in this Manual


Each of the ANALYSE operation summaries in this manual concludes with a sample terminal session illustrating the actual function(s) with live data. The examples used to illustrate the statistical operations are drawn from the AIRLINE, EMPLOYEE, PROPERTY, and Standard Metropolitan Statistical Area (SMSA) data files. EMPLOYEE is a FOCUS file; before you can use EMPLOYEE data in statistical operations you must first extract it into a flat file using a TABLE request. The extract step is required for all data residing in non-sequential files. (ANOVA and MDISC usually require that the data originate in a hierarchical structure with the control variables as keys.) In FOCUS, descriptions of the files holding the data are stored in separate files called Master File Descriptions. The Master File Descriptions and the data used in the examples for ANALYSE operations appear in Appendix B of this manual. The FOCUS Users Manual contains instructions for creating Master File Descriptions.


2 Preparing for ANALYSE Sessions


2.1 Entering the Environment
You enter the Statistical Analysis (or ANALYSE) environment by issuing the command ANALYSE:
ANALYSE FILE HOLD

FOCUS responds by displaying the ANALYSE options:


STATISTICAL OPERATIONS...
ANOVA     POLRG
CORRE     STATS
EXPLAIN   STATSET
EXSMO     STEPR
FACTO     TIMESER
MDISC     XTABS
MULTR     FILESIZE

You select an operation by typing in a selection. The analysis chosen then proceeds, issuing prompts for any required information, and concludes with a request for another operation:
ENTER STATISTICAL OPERATION DESIRED -

Thus, after reading a file just once you can perform unlimited analyses on the data. Sections 2.3 and 2.4 of this chapter discuss ANALYSE facilities for preparing the environment and limiting the size of the sample file. Section 2.5 describes the online ANALYSE documentation facilities that explain the various statistical facilities.


2.2 Statistical Operations Summary


The statistical subcommands and operations are summarized in Figure 2-1.

Subcommand  Statistical Operation
----------  -----------------------------------------------------------------
ANOVA       Analysis of variance (Section 3.1).
CORRE       Correlation coefficient, standard deviation, mean, and number of
            observations (Section 3.2).
EXPLAIN     Documentation of ANALYSE functions (Section 2.5).
EXSMO       Triple exponential smoothing and forecasting (Section 3.3).
FACTO       Factor analysis (Section 3.4).
MDISC       Discriminant analysis (Section 3.5).
MULTR       Multiple linear regression (Section 3.6).
POLRG       Polynomial regression (Section 3.7).
STATS       Descriptive statistics (modes, variances, deciles, quartiles,
            etc.) (Section 3.8).
STATSET     Select processing of missing data, create an equation file, and
            prepare the environment (Section 2.3).
STEPR       Stepwise linear regression (Section 3.9).
TIMESER     Time series analysis (Section 3.10).
XTABS       Cross tabulation and summary statistics (Section 3.11).

Figure 2-1. Summary of Statistical Operations NOTE: Each operation is initiated by a mnemonic subcommand and may be repeated in any order, as often as needed. Whenever possible, analyses are integrated and use the same input as the previous steps. For example, all regression analyses use the results of a single set of CORRE calculations instead of reproducing the input data for each analysis.
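The note's point about reuse can be made concrete for the simplest case: simple linear regression coefficients fall directly out of the cross-product deviations and means that CORRE saves, with no second pass over the data. The following Python sketch is an illustration of that idea only, not FOCUS internals:

```python
def regression_from_corre(s_xy, s_xx, mean_x, mean_y):
    """Recover simple linear regression coefficients from quantities a
    CORRE-style pass already saves: cross-product deviations and means."""
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# y = 2x + 1 sampled at x = 0, 1, 2: one pass collects the sums,
# and the regression is then pure arithmetic on those sums.
xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]
mx, my = sum(xs) / 3, sum(ys) / 3
s_xx = sum((x - mx) ** 2 for x in xs)
s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
slope, intercept = regression_from_corre(s_xy, s_xx, mx, my)
```

This is why, in ANALYSE, running MULTR after CORRE does not reread the input file.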

2.3 Preparing the Environment: STATSET


The ANALYSE option STATSET is used to set the parameters (or flags) that govern the FOCUS ANALYSE environment. Each of the flags (e.g., missing data, equation file processing, data selection, etc.) can be set individually or they can be set in groups. You can enter STATSET in three ways: Type STATSET and press ENTER. This initiates automatic prompting for all STATSET flags. Each prompt begins with the name of the flag (e.g., HOLD: ENTER THE NAME--).


Type ? STATSET and press ENTER. This initiates a display of the current STATSET flag settings.

Type STATSET and make your own assignments for STATSET flags before pressing ENTER (e.g., STATSET MISSING=ON, MISSVAL=-999, HOLD=STUDY1).

To issue a status request for the current settings of the STATSET flags, issue the following command:
? STATSET

This is illustrated in the STATSET terminal session at the end of this section. The STATSET flags are summarized below.

Flag     Value     Description
-------  --------  ------------------------------------------------------------
EQFILE   name      Names the FOCEXEC that will hold the regression equation as
                   a defined variable. Set to NONE or not issued if no FOCEXEC
                   is required.
EQVAR    name      The dependent variable name used in the regression equation
                   (stored in the EQFILE FOCEXEC).
HOLD     name      The name of the HOLD file being created, similar to the
                   usage HOLD AS HOLD1 in the FOCUS TABLE command. The default
                   is STATHOLD.
MISSING  ON/OFF    An ON or OFF setting specifies whether records with missing
                   data will be processed. The default is OFF.
MISSVAL  value     The value used to identify missing data fields. All
                   variables are compared to this value when MISSING=ON. The
                   default is -999.
ONLINE   ON/OFF    An ON or OFF setting specifies where to route output
                   (ONLINE/OFFLINE) from within ANALYSE. The default is ON.
PRINT    ON/OFF    An ON or OFF setting routes the output from analyses to
                   print (useful when HOLD files are created and analyzed in
                   subsequent iterations). The default is ON.
SELECT   criteria  The selection criteria used to select records for analysis.
WIDTH    n         Specifies the FOCUS terminal print width setting. Note that
                   this affects report formatting by XTABS. The default is 80.
Figure 2-2. STATSET Flags that Control the ANALYSE Environment

EQFILE
The EQFILE flag names the equation FOCEXEC (maximum eight characters) being created. Equation file processing creates a FOCEXEC containing the regression equation, with EQVAR as the left-hand side of the equation. The command syntax is as follows:
STATSET EQFILE=FIT1


EQVAR
The EQVAR flag sets the variable name (maximum 12 characters) that appears on the left-hand side of the EQFILE. Using STATSET EQFILE=TFIT1, EQVAR=TEMP1, the regression fitting TEMP by DIST and DEPTH would produce an EQFILE that looks like the following:
-DEFAULT &2="";
DEFINE FILE &1 &2
TEMP1 = 91.04427 +( -.0027106*DIST +( -1.008607*DEPTH ) );
END

In this form &1 may be used as needed (e.g., any HOLD file, etc.); &2 can be ADD or a blank (see the FOCUS Users Manual). Note that you may create as many equation files as needed in addition to the original data for subsequent use in comparative graphing and analysis reporting. The command syntax is as follows:
STATSET EQVAR=FIT1
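For readers unfamiliar with DEFINE syntax, the defined variable in the sample EQFILE shown earlier is plain arithmetic. This Python fragment (used here only as a cross-check; the coefficients are taken from that sample regression) evaluates the same fitted equation:

```python
def temp1(dist, depth):
    """Evaluate the sample EQFILE regression equation:
    TEMP1 = 91.04427 + (-.0027106*DIST + (-1.008607*DEPTH))."""
    return 91.04427 + (-.0027106 * dist + (-1.008607 * depth))

# With DIST=0 and DEPTH=0 the prediction is just the intercept.
baseline = temp1(0.0, 0.0)
```

Applying the EQFILE FOCEXEC to a HOLD file performs exactly this calculation for every record.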

HOLD
The HOLD flag sets the name of the HOLD file to be created. A number of statistical operations (e.g., STATS, TIMESER) provide the means for holding varying types of data. This data is written into a HOLD file. A Master File Description is also created. Such files may then be used directly by FOCUS. The default name for the HOLD flag is STATHOLD. The command for setting this flag is
STATSET HOLD=mychoice

where:
mychoice

Is any name of up to eight characters. The HOLD flag may be set as often as needed. In this way, many different HOLD files can be created in one ANALYSE session. A given file may be replaced (overwritten) as necessary. When the file created has the same name as the ANALYSE subject, the following message is displayed
(FOC133) WARNING: ANALYSE FILE LOST...HAD SAME NAME AS HOLD FILE. (FOC134) RE-READING DATA UNDER NEW SELECTION CRITERIA IS INHIBITED.

followed by the normal output:


HOLDING...

The file being analyzed remains available for further ANALYSE processing throughout the current ANALYSE session. Processing with new selection criteria (STATSET flag SELECT) is inhibited (because the disk file is now different) and the following error message is displayed:
(FOC134) RE-READING DATA UNDER NEW SELECTION CRITERIA IS INHIBITED.


MISSING
The MISSING flag specifies whether special processing of records with missing data is performed (ON) or not (OFF). The default value is OFF. When ON, individual variables within records are ignored if they are numerically equal to the value in the STATSET flag MISSVAL (see below). This special processing of records with missing data is supported in the following operations: CORRE, MULTR, POLRG, STEPR, STATS, and XTABS. To turn on missing-data processing, issue the following command:
STATSET MISSING=ON

MISSVAL
The MISSVAL flag sets the value that identifies missing data. The default value is -999. It is recommended that you use only whole numbers. The command syntax is as follows:
STATSET MISSVAL=-999999

ONLINE
The ONLINE flag determines the destination of statistical output. When set ON (the default value), statistical output is sent to the terminal (SYSPRINT). When set OFF, statistical output is sent to the offline device (usually a line printer). Error messages and prompts are directed to the terminal. Use the following command to turn off this flag:
STATSET ONLINE=OFF

PRINT
The PRINT flag controls the statistical output of the regression analyses (MULTR, POLRG, STEPR). When ON (the default value), statistical output is sent to the terminal or printer, depending on the setting for ONLINE. When OFF, statistical output is inhibited. This is useful when the purpose of a regression, or series of regressions, is to produce an equation file for future use. Error messages and prompts are directed to the terminal. To inhibit the output, use the following STATSET command:
STATSET PRINT=OFF

SELECT
The SELECT flag sets the record selection criteria for subsequent statistical analyses. The default value (OFF) specifies no screening. SELECT may be set to include a series of valid FOCUS screening conditions of the following form:
FIELDNAME TEST RELATION LITERAL (or LITERAL or...)

Any number of screening tests may be used. Separate multiple tests with commas, and terminate the last with a comma/dollar sign (,$), as shown below:
STATSET SELECT=CITY EQ STAMFORD, PROD_CODE EQ B10, UNIT SOLD GT 25,$

Note that multiple fields may be tested on one or more lines; however, a test for a single field may not exceed one line.


WIDTH
The WIDTH flag sets the number of characters in the print line (the default is 80 characters). In XTABS (the ANALYSE cross tabulation facility) WIDTH is used to control automatic paneling. The command syntax is as follows:
STATSET WIDTH=130

Sample STATSET Terminal Session


Data extraction FOCEXEC
ANALYSE FILE AIRLINE
STATSET MISSING=ON, MISSVAL=0, HOLD=LAGGED
TIMESER, TDATE, 1 MONTH
LAG LOGRIDER AS PLAG1 1
LAG LOGRIDER AS PLAG12 12
HOLD
END

Analysis preparation FOCEXEC


ANALYSE FILE LAGGED
STATSET MISSING=ON, HOLD=REGRESS, EQFILE=REGRESS
STATSET SELECT=PLAG1 NE 0, PLAG12 NE 0,$
MULTR LOGRIDER, 3, T.INDEX, PLAG1, PLAG12
YES
STAT
END

Sample STATSET dialogue


ENTER STATISTICAL OPERATION DESIRED - statset
HOLD    : ENTER THE NAME OF THE HOLD FILE FOR STATISTICAL OUTPUT - sessout
ONLINE  : ENTER "ON" FOR ONLINE PRINTING OR "OFF" FOR OFFLINE PRINTING - on
PRINT   : ENTER (ON OR OFF) THE STATUS OF THE PRINT CONTROL FLAG - off
WIDTH   : ENTER NUMBER OF CHARACTERS TO PRINT ON A LINE - 80
EQFILE  : ENTER "NONE" OR THE REGRESSION EQUATION FILE NAME - none
MISSING : ENTER (ON OR OFF) THE STATUS OF MISSING DATA PROCESSING - off
SELECT  : ANY SELECTION TESTS (YES OR NO) - no
ENTER STATISTICAL OPERATION DESIRED - stats

2.4 Limiting the Sample Size: FILESIZE


At the point of selecting a statistical operation, you can also specify the number of observations to be used for your sample. To do so, enter FILESIZE in response to the ANALYSE prompt for a statistical operation.
STATISTICAL OPERATIONS...
ANOVA     POLRG
CORRE     STATS
EXPLAIN   STATSET
EXSMO     STEPR
FACTO     TIMESER
MDISC     XTABS
MULTR     FILESIZE
ENTER STATISTICAL OPERATION DESIRED - filesize
ENTER NUMBER OF RECORDS TO PROCESS - 10


2.5 Reviewing Online Documentation: EXPLAIN


There is online documentation available for each ANALYSE statistical operation. You can summon it either by selecting the EXPLAIN option as your STATISTICAL OPERATION or, if you know the name of the particular analysis, by entering the name of the operation followed by a space and a question mark (?). Thus
ANOVA ?

summons a display of the online documentation for ANOVA. FOCUS command level help is also available with:
>>HELP ANALYSE

EXPLAIN
Enter EXPLAIN in response to the ANALYSE prompt for a statistical operation to invoke the following display
ENTER STATISTICAL OPERATION DESIRED - explain
ENTER COMMAND (ALL, LIST FOR SUMMARY, OR NAME OF SPECIFIC OPERATION) -

where:
ALL

Prints the entire online file of documentation for ANALYSE operations.


LIST

Lists and briefly describes the statistical operations.


operation

Prints the documentation for a specified operation.


Sample EXPLAIN Terminal Session


3 The Statistical Operations


This chapter describes each of the statistical operations that you can run within ANALYSE. The analyses are presented in alphabetical sequence.

3.1 Analysis of Variance: ANOVA


The ANOVA statistical operation performs an analysis of variance for an equal-cell factorial design with up to five factors. (Note that the number of observations in each cell must be the same.) To initiate it, respond ANOVA to the ANALYSE prompt for a statistical operation:
ENTER STATISTICAL OPERATION DESIRED - anova

After selecting ANOVA, you are prompted for the name of each factor desired. All other required information is taken from the input file, including the number of factors and the number of groups or levels in each factor (in a factorial manner). An example follows:

TABLE FILE PROPERTY                       Initiate the request.
WRITE CNT.REGION                          Retrieve the number of REGIONs (factor 1).
WRITE CNT.CTYPE BY REGION                 Retrieve the number of city types (factor 2) within factor 1.
WRITE CNT.PROPERTY BY REGION BY CTYPE     Property by region (factor 3).
PRINT PROPERTY                            The actual variable PROPERTY.
BY REGION NOPRINT BY CTYPE NOPRINT        Specify the sort sequence.
ON TABLE HOLD                             The file for analysis.
END
ANALYSE FILE HOLD                         The ANALYSE request.
ANOVA                                     The statistical function.
A,B,C                                     The factors.
END

Figure 3-1. Sample ANOVA Session
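For orientation, the variance partition that underlies an equal-cell design can be sketched for the one-factor case. This Python fragment is an illustration under simplified assumptions only, not the FOCUS algorithm (which handles factorial designs of up to five factors):

```python
def oneway_anova(groups):
    """Equal cell size one-way analysis of variance: split the total
    sum of squares into between-group and within-group components."""
    k = len(groups)                      # number of levels
    n = len(groups[0])                   # observations per cell (equal cells)
    grand = sum(v for g in groups for v in g) / (k * n)
    means = [sum(g) / n for g in groups]
    ss_between = sum(n * (m - grand) ** 2 for m in means)
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    # F ratio: between-group mean square over within-group mean square
    f_ratio = (ss_between / (k - 1)) / (ss_within / (k * n - k))
    return ss_between, ss_within, f_ratio

# Two levels, three observations per cell (the equal-cell requirement).
ssb, ssw, f = oneway_anova([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

The equal-cell restriction mentioned above is what allows the cell count n to be read once from the file rather than per cell.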

References
The ANOVA analysis is done as a factorial design using a three-operator method. A. Ralston (1967), Mathematical Methods for Digital Computers, Analysis of Variance (Chapter 20), New York: John Wiley and Sons.


Sample ANOVA Terminal Session

3.2 Correlation Analysis: CORRE


The CORRE analysis calculates and displays means, standard deviations, and correlation matrices for all numeric variables in the file named. There are no prompts for additional information. To initiate it, enter CORRE in response to the ANALYSE prompt for a statistical operation:
ENTER STATISTICAL OPERATION DESIRED - corre


The results of these calculations and the sums of cross-product deviations (not displayed) are saved for use in subsequent regression analyses. The method used calculates product-moment correlation coefficients. If you choose to process records with missing data (STATSET MISSING=ON, MISSVAL=some value), pair-wise deletion is performed. (In pair-wise deletion, cases are omitted from the computation when either of the two variables under consideration is missing.)

With the input data X(i,j), where i=1, 2,..., n (observations) and j=1, 2,..., m (variables), the following equations are used:

Means

    X-bar(j) = SUM[i=1..n] X(i,j) / n                              j=1, 2,..., m

Correlation Coefficients

    r(j,k) = S(j,k) / SQRT( S(j,j) * S(k,k) )                      j,k=1, 2,..., m

Standard Deviation

    S(j) = SQRT( S(j,j) / (n-1) )
         = SQRT( SUM[i=1..n] ( X(i,j) - X-bar(j) )**2 / (n-1) )    j=1, 2,..., m

Sums of Cross-Product Deviations

    S(j,k) = SUM[i=1..n] ( X(i,j) - T(j) ) * ( X(i,k) - T(k) )     j,k=1, 2,..., m

where the temporary means

    T(j) = SUM[i=1..n] X(i,j) / n

are used for computational accuracy.

NOTE: Any alphanumeric data is detected and treated as numeric zeroes. When processing records with missing data (MISSING=ON), the CORRE output includes the number of observations and pairs present (i.e., not missing). CORRE requires a greater number of observations than variables (by at least 1).
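The pair-wise deletion rule is easy to misread, so here is a minimal sketch of it in Python (illustration only, not FOCUS code; the -999 sentinel mirrors the default MISSVAL):

```python
import math

def pairwise_corr(x, y, missval=-999):
    """Product-moment correlation with pair-wise deletion: a case is
    dropped only when either of the two values equals the missing code."""
    pairs = [(a, b) for a, b in zip(x, y) if a != missval and b != missval]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    s_xy = sum((a - mx) * (b - my) for a, b in pairs)
    s_xx = sum((a - mx) ** 2 for a, _ in pairs)
    s_yy = sum((b - my) ** 2 for _, b in pairs)
    return s_xy / math.sqrt(s_xx * s_yy), n

# Two of the five cases carry the missing-data code; only the three
# complete pairs (which happen to lie on a straight line) are used.
r, n_pairs = pairwise_corr([1.0, 2.0, 3.0, -999, 5.0],
                           [2.0, 4.0, 6.0, 8.0, -999])
```

Note that under pair-wise deletion each entry of the correlation matrix may be based on a different number of pairs, which is why CORRE reports the pair counts when MISSING=ON.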

Sample CORRE Terminal Session

3.3 Exponential Smoothing: EXSMO


The EXSMO analysis performs triple exponential smoothing for a time series of numeric variables in the file. Many smoothed series for the same or different variables can be displayed, retained in core, recalculated, forecasted, and/or written to a HOLD file.


To initiate this analysis, enter EXSMO in response to the ANALYSE prompt for a statistical operation:
ENTER STATISTICAL OPERATION DESIRED - exsmo

EXSMO then prompts for the name of the field to be smoothed.


ENTER NAME OF VARIABLE TO BE SMOOTHED - sales

The variable (in the above example, sales) may be specified by its fieldname, alias, position number in the file, or any unique truncation of the alias or fieldname. EXSMO names smoothed variables (those created for examination, holding, or further smoothing) by prefixing the first 10 characters of the names or aliases (aliases are assigned if HOLD files are created) with an S. For example, if RIDER is the second variable in the file AIRLINE, EXSMO creates a smoothed variable with the following attributes:
FIELDNAME=S.RIDER,ALIAS=S.E02, FORMAT=D15.3,$

If you prefer to name the smoothed variable yourself, use the AS phrase. For example:
EXSMO RIDER AS SMOOTH_RIDER

These alternate names may contain up to 12 characters and must be enclosed in single quotation marks if they contain any embedded blanks. The values of variables that exist when you enter EXSMO are protected and cannot be overwritten by EXSMO. For example, if NRIDER is the name of a field in the file being analyzed, the expression
EXSMO RIDER AS NRIDER

results in an error message (and a repeat prompt). However, you can overwrite fields created by EXSMO during the current session. EXSMO prompts the user for a smoothing constant, alpha (where 0.0 < alpha < 1.0).
ENTER SMOOTHING CONSTANT (E.G. 0.1) -

The smaller the value of the constant, the greater the influence of previous data on the next smoothed point. EXSMO then prompts the user for the source of the three additional constants (A, B, and C) required for the smoothing:
DO YOU WISH TO SPECIFY INITIAL COEFFICIENTS (YES/NO)

If you specify YES, the following three prompts appear:


ENTER COEFFICIENT A -
ENTER COEFFICIENT B -
ENTER COEFFICIENT C -

If you specify NO, EXSMO uses the default method to calculate starting values for A, B, and C. The initial default values of A, B, and C are determined as follows:

    C = X_1 - 2*X_2 + X_3
    B = X_2 - X_1 - 1.5*C
    A = X_1 - B - 0.5*C

where X_1, X_2, and X_3 are the first three input time series data points, and the calculation is done in the order shown (C, B, and then A).


Starting with initial values of A, B, and C, EXSMO calculates the first smoothed series data point S_1, updates coefficients A, B, and C, and then proceeds step by step through the time series values, generating the smoothed series. The formulas used at each step are defined below:

    S_i = A + B + 0.5*C

where:

S_i
    Is the smoothed data value for the next time period (the ith period).

A, B, and C
    Are the values that exist at the (i-1)th time period.

After S_i is calculated for one time period ahead, the A, B, and C coefficients are updated with the following formulas:

    A = X_i + (1 - alpha)**3 * (S_i - X_i)
    B = B + C - 1.5 * alpha**2 * (2 - alpha) * (S_i - X_i)
    C = C - alpha**3 * (S_i - X_i)

where:

X_i
    Is the input time series data point for the ith time period.

alpha
    Is the smoothing constant.

The calculation is done in the order shown (A, B, and then C), with the B and C on the right-hand side of the equations taking on their previous values. EXSMO then proceeds, one time period at a time, until an entire smoothed time series (S_i) is calculated covering the same extent as the input time series. At the end, EXSMO provides final values for the coefficients A, B, and C in the expression:
    F_T = A + B*T + (C/2)*T**2

where:

F_T
    Is the forecasted value of the smoothed series for the Tth time period ahead.

T = 1, 2, 3,...
    Are the first, second, third, etc. forecasted time period(s). This expression is used to find estimates (or forecasts) for the specified number of time periods ahead (T).
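The recurrence and forecast described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the EXSMO code itself; the function names and list-based layout are assumptions:

```python
def exsmo(x, alpha, coeffs=None):
    """Triple exponential smoothing as described above.
    Returns the smoothed series and the final (A, B, C) coefficients."""
    if coeffs is None:
        # Default starting values from the first three data points.
        c = x[0] - 2 * x[1] + x[2]
        b = x[1] - x[0] - 1.5 * c
        a = x[0] - b - 0.5 * c
    else:
        a, b, c = coeffs
    smoothed = []
    for xi in x:
        si = a + b + 0.5 * c              # one-step-ahead smoothed value
        err = si - xi
        # Update order: A, then B (using the old C), then C.
        a = xi + (1 - alpha) ** 3 * err
        b = b + c - 1.5 * alpha ** 2 * (2 - alpha) * err
        c = c - alpha ** 3 * err
        smoothed.append(si)
    return smoothed, (a, b, c)

def forecast(coeffs, periods):
    """F_T = A + B*T + (C/2)*T**2 for T = 1, 2, ..., periods."""
    a, b, c = coeffs
    return [a + b * t + 0.5 * c * t * t for t in range(1, periods + 1)]
```

A useful sanity check: for an exactly linear input series the default initialization reproduces the series without error, and the forecast continues the line.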


When the smoothing calculations are complete, EXSMO prints the initial and final values of the coefficients used (see the example below). The user then receives a series of prompts calling for dispositions for the input and smoothed data for the actual and forecasted time periods (display, hold, etc.). EXSMO requests the data with the following prompt:
ENTER COMMAND (E.G. PRINT,LAST,FORECAST,SHOW,EXSMO,KEEP,HOLD,QUIT)

Note that at any point before a new smoothed series is computed, a response of EXSMO restarts the analysis at the first prompt.

EXSMO Control Commands


COMMAND       DESCRIPTION

PRINT p q     A table of data values and smoothed-series values is
              displayed for data points numbered from p to q, where p
              and q are positive integers. For values greater than N
              (the number of observations), only predicted smoothed
              values can be displayed.

LAST p        Displays the last p data values, along with their
              corresponding smoothed-series values.

FORECAST p    Displays the predicted smoothed-series values for the
              first p data points starting with N + 1.

SHOW p        Displays information (for point p only) on one line.

EXSMO         Allows the user to specify another variable for smoothing,
              or to resmooth the present variable using different
              parameters.

HOLD          The data for all variables currently defined is written to
              a HOLD file.

KEEP          Exit from EXSMO, while retaining the data for all created
              variables in core.

QUIT          Exit from EXSMO, discarding all variables created in the
              session.

Figure 3-2. EXSMO Control Commands

Each command may be specified by name or with a unique truncation. The PRINT, LAST, and FORECAST commands each display three columns of output (the data point number, the input data value, and the smoothed data value). Data points beyond the extent of the input file are represented as blanks. The EXSMO control commands support reiterative smoothing and display of variables. The smoothed variables created with EXSMO are saved and are available thereafter as normal variables for use in any subsequent ANALYSE statistical operation (e.g., CORRE, MULTR, etc.). Thus, you can perform multiple analyses using both the original and the smoothed variables.

PRINT
The PRINT command initiates two prompts for the time period to be printed. For example:


print
ENTER FIRST DATA POINT TO OUTPUT - 1
ENTER LAST DATA POINT TO OUTPUT - 3

Since typing ahead is supported, this is equivalent to entering:


print 1 3

LAST
The LAST command initiates the prompt:
ENTER NUMBER OF DATA POINTS -

When smoothing a large file this is a useful technique for examining just the end of the file.

FORECAST
The FORECAST command initiates the following prompt:
ENTER NUMBER OF DATA POINTS -

This command is useful when only the forecasted data points are required, particularly when used in conjunction with the LAST command. For example, LAST 4 FORECAST 5 produces a display of the last four actual time periods together with the first five forecasted time periods. NOTE: PRINT, LAST, and FORECAST may be entered together to produce an integrated display.

SHOW
The SHOW command initiates the following prompt:
ENTER POINT TO DISPLAY -

This command is used to examine a single time period.

HOLD
The HOLD command initiates the following prompts:
ENTER FIRST DATA POINT TO OUTPUT -
ENTER LAST DATA POINT TO OUTPUT -

HOLD creates a raw data file (and Master File Description) that can be used immediately in graphs, reports, etc. The file created, using the name specified by the STATSET HOLD flag, will contain all of the new variables created by the smoothing process in addition to the original fields from the target file. All pre-existing alpha and integer fields (for maintaining identity and date facilities) are preserved in their original formats. All other numeric fields are held in decimal (D) format with the number of decimal places that they originally held. All smoothed fields created by EXSMO are held in D15.3 format. Numeric variables not forecasted are assigned values as set in the STATSET MISSVAL flag (alpha fields are set to blank) for the forecasted time periods.

KEEP
The KEEP command causes an exit from EXSMO, while retaining all of the newly created variables in core. You can then perform other ANALYSE functions and later return to EXSMO for further smoothing.


QUIT
The QUIT command exits EXSMO and returns to ANALYSE, deleting all of the smoothed data created during the EXSMO session.

Sample EXSMO Terminal Session


3.4 Factor Analysis: FACTO


The FACTO option performs factor analysis. The user is prompted for a minimum eigenvalue to limit the number of factors retained. Enter FACTO in response to the ANALYSE prompt for an operation to initiate this analysis:
ENTER STATISTICAL OPERATION DESIRED - facto

The analysis performs a principal components solution and a varimax rotation of the factor matrix. The results of the principal component analysis determine the minimum number of dimensions needed to account for most of the variance in the original variable set. The varimax rotation is used to simplify columns (factors), rather than rows (variables), in the factor matrix. In the extraction, only factors with eigenvalues equal to or greater than the supplied minimum eigenvalue are retained. This is done to minimize the number of factors. FACTO produces:

- Eigenvalues.
- Cumulative percentage of eigenvalues.
- Eigenvectors.
- The factor matrix.
- Variances for each iteration cycle.
- The rotated factor matrix.
- Communalities for the initial extraction and final rotation.

Reference
For information about the principal component and varimax rotation methods, see: W.J. Dixon (1973), Biomedical Computer Programs, Los Angeles: University of California Press.


Sample FACTO Terminal Session


3.5 Discriminant Analysis: MDISC


To perform discriminant analysis enter MDISC in response to the ANALYSE prompt for an operation:
ENTER STATISTICAL OPERATION DESIRED- mdisc


The required information is taken directly from the input file without further prompting. The input file created by the TABLE request must contain the number of groups, the number of observations in each group, and the name of each group, in addition to the variables. For example:
TABLE FILE SMSA                              Initiates the request.
WRITE CNT.STATE                              The number of groups.
WRITE CNT.SMSA BY STATE                      The number of observations (in each group).
PRINT POPULATION MILLIONARES REMARRIAGE      Names three variables.
BY STATE
ON TABLE HOLD
END
ANALYSE FILE HOLD
MDISC

Figure 3-3. Creating an Extract File to Analyze with MDISC

NOTE: The counts must contain the number of unique items, not records. The calculation checks for and requires that the number of variables be equal to or greater than the number of groups. MDISC produces:

- The mean of each variable by group.
- The pooled dispersion matrix.
- The common means for each variable.
- The generalized Mahalanobis D-square.
- Numbered discriminant functions:
  a. Constant.
  b. Coefficients.
- Evaluation of classification functions for each observation:
  a. Observation.
  b. The probability associated with the largest discriminant function.
  c. The largest function number.

References
T.W. Anderson (2nd Edition, 1984), An Introduction to Multivariate Statistical Analysis, (Sections 6.6 through 6.8), New York: John Wiley and Sons.

W.J. Dixon (1973), Biomedical Computer Programs, Los Angeles: University of California Press.


Sample MDISC Terminal Session


3.6 Multiple Linear Regression: MULTR


The MULTR option performs a multiple linear regression for any choice of dependent and independent variables in a specified data file. Initiate MULTR as follows:
ENTER STATISTICAL OPERATION DESIRED- multr

MULTR supports processing of both equation files and records with missing data. A linear relationship of the following form is sought:

    Y = SUM[i=1 to n] C_i * X_i + C_0


where:

Y
    Is the selected dependent variable.

X_i
    Are the n selected independent variables.

C_i
    Are the n regression coefficients relating the change in Y to a change in each X_i; C_0 is the intercept.

Since MULTR searches for a relationship between variables, it automatically calls CORRE to produce the data from which it selects the Y and X_i variables for the regression (if CORRE was not called previously, in which case the data would already be available). After examining the means, standard deviations, and correlation coefficients for all variables, the system prompts the user to determine:

- The dependent variable (name, truncation, or number).
- Each independent variable (name, truncation, or number).
- Whether a table of residuals is desired.
- Whether the regression equation is desired.

To simplify variable selection, especially when many variables are included (a maximum of 64 is permitted), variables are also assigned position numbers which you can use when selecting them. After the analysis is performed and the report appears, you may select as many alternate analyses (other variables) as desired, perform other types of analysis, and return to evaluate other models displayed by MULTR. MULTR displays:

- For each variable:
  -The mean.
  -The standard deviation.
  -The correlation vs. the dependent variable.
  -The regression coefficient.
  -The standard error of the regression coefficient.
  -The computed T value.
- The intercept.
- The multiple correlation coefficient.
- The standard error of estimate.
- An analysis of variance for the regression:
  -The source of variation.
  -The degrees of freedom.
  -The sum of squares.
  -The mean squares.
  -The F values.
- The regression equation.


- A table of residuals and/or the Durbin-Watson statistic, upon prompted request:
  -Observation.
  -Actual dependent variable value.
  -Estimated dependent variable value.
  -Residual (actual - estimate).
  -Durbin-Watson statistic.

If EQFILE processing is on (see Section 2.3 Preparing the Environment: STATSET on page 2-2), the regression equation is stored in an equation file FOCEXEC. This equation, along with the results of other regressions, can be used for graphics, analysis, reporting, etc. To create a regression equation for further processing, set the STATSET PRINT flag OFF, and EQFILE processing ON; this suppresses printed output.
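The References note that the normal equations are solved by the Gauss-Jordan method. As a rough illustration of the fit MULTR performs, the following pure-Python sketch builds the normal equations and reduces them by Gauss-Jordan elimination. The function name and layout are hypothetical, and no pivoting or singularity handling is included:

```python
def multr(y, xcols):
    """Least-squares fit y = C0 + sum(Ci * xi), solved by Gauss-Jordan
    reduction of the normal equations (a sketch, not MULTR itself)."""
    n = len(y)
    # Design matrix rows, with a leading column of 1s for the intercept C0.
    rows = [[1.0] + [col[i] for col in xcols] for i in range(n)]
    m = len(rows[0])
    # Normal equations (X'X)c = X'y, written as the augmented matrix [X'X | X'y].
    aug = [[sum(rows[i][j] * rows[i][k] for i in range(n)) for k in range(m)]
           + [sum(rows[i][j] * y[i] for i in range(n))] for j in range(m)]
    # Gauss-Jordan reduction (no pivoting, for clarity only).
    for p in range(m):
        piv = aug[p][p]
        aug[p] = [v / piv for v in aug[p]]
        for r in range(m):
            if r != p:
                f = aug[r][p]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[p])]
    coeffs = [aug[j][-1] for j in range(m)]
    return coeffs[0], coeffs[1:]     # intercept, regression coefficients
```

Residuals (actual minus estimate) then follow by evaluating the fitted equation at each observation, as in the table of residuals described above.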

References:
The Gauss-Jordan Method is used in the solution of the normal equations.

W.W. Cooley and P.R. Lohnes (1971), Multivariate Procedures for the Behavioral Sciences, (Chapter 3), New York: Krieger.

Bernard Ostle and Richard Mensing (1975), Statistics in Research, (Chapter 8), Ames, Iowa: Iowa State University Press.

Norman Draper and Harry Smith (1981), Applied Regression Analysis, New York: Wiley.


Sample MULTR Terminal Session


3.7 Polynomial Regression: POLRG


The POLRG option performs a polynomial regression for any choice of dependent and independent variables in the specified data. POLRG supports processing of missing data and equation file processing. To initiate this analysis, answer the ANALYSE prompt for an operation as follows:
ENTER STATISTICAL OPERATION DESIRED- polrg

Subsequently, you will be prompted for the following:

- The highest degree polynomial to be used (1 - 10).
- The dependent and independent variables (name, truncation, or number).
- Whether a table of residuals is desired.
- Whether the regression equation is desired.

The analysis calculates powers of the independent variable to form polynomials of increasing degree. The calculation proceeds by degree until there is no further reduction in the residual sum of squares or the maximum degree polynomial specified is reached. At each degree step, POLRG produces:

- The intercept.
- The regression coefficients.
- The analysis of variance for the degree step:
  -The source of variation.
  -The degrees of freedom.
  -The sum of squares.
  -The mean square.
  -The F-value.
  -The sum of squares improvement.
- The regression equation for the degree step.
- A table of residuals and/or the Durbin-Watson statistic, upon request.

POLRG and MULTR share the same methodology. EQFILE and PRINT control are also supported.

Sample POLRG Terminal Session


3.8 Descriptive Statistics: STATS


The STATS analysis calculates and displays a selected set of statistics for all numeric variables in the analyzed file. Respond as follows to select this analysis:
ENTER STATISTICAL ANALYSIS DESIRED- stats

You will be prompted for all required input and may specify any or all statistics (optionally grouped by a sort field). You can hold the statistical output as a HOLD file for use by FOCUS (reporting, graphics, etc.). Processing of missing data is automatically supported. STATS is useful for obtaining a significant quantity of descriptive statistics on up to 64 variables. This information provides both a valuable end-product report (e.g., modes, medians, deciles, quartiles, etc.), as well as a necessary step in the selection of subsequent analyses. Each statistical option has a name or number, as shown in Figure 3-4. When you enter STATS in response to the ANALYSE prompt, you are prompted for a statistical operation:
ENTER STATISTICAL OPERATION DESIRED - stats
ENTER OPTION NAME(S) OR NUMBER(S) DESIRED (E.G. 3,MEAN,VAR,14) - all


Enter the name, number, or unique truncation of the options desired. The selected statistics are displayed for all the numeric variables. A response of ? produces a help display similar to the following one.

Number   Name       Explanation

1        ALL        Options 2 - 14.
2        MEAN       The average value.
3        MEDIAN     The midpoint, or 50th percentile, derived after all
                    values are in order (lowest to highest).
4        RANGE      The maximum minus the minimum value. A measure of
                    dispersion.
5        STDEV      The square root of the variance.
6        MODE       The most frequent value of a variable.
7        MAXIMUM    The highest value.
8        STERROR    The standard error, a measure of sample mean
                    stability, estimated by the standard deviation
                    divided by the square root of the number of
                    observations.
9        SKEWNESS   A measure of the symmetry of a distribution.
10       MINIMUM    The lowest value.
11       VARIANCE   A measure of data dispersion about its mean.
12       KURTOSIS   A measure of a distribution's peakedness or flatness.
13       DECILE     A table of deciles, or variable values at each of 10
                    10% population points.
14       QUARTILE   A table of quartiles, or variable values at each of 4
                    25% population points.
15       NUM_OBS    The number of observations present; placed in a HOLD
                    file if the HOLD option is selected.
40       GROUPS     Data is grouped according to the first field.
41       HOLD       Statistics are written to a HOLD file specified by
                    the STATSET command. Printing of statistics is
                    suppressed.
42       PRINT      Overrides print suppression by HOLD.

Figure 3-4. Table of STATS Statistical Options

The following response to a STATS prompt for an option
2, 3, 4, DEC, Q

produces means, medians, ranges, the 10 deciles, and the four quartiles for each numeric variable in the subject file. Note that you can mix numbers and names in a response and that typing ahead is supported. The options entered above could have been input on the initial response to the ANALYSE prompt for an operation, as shown below:
ENTER STATISTICAL OPERATION DESIRED -- stats 2,3,r,d,q

ALL Option
All of the statistical options selected are calculated for each numeric variable (except for the group field) in the subject file. If you choose to include missing data in the analysis (see Section 2.3 Preparing the Environment: STATSET on page 2-2), the number of observations present for each variable is displayed against the size of the sample. The ALL option produces all of the statistics (options 2 - 14).

MEAN Option
The MEAN option produces the mean, or average value, for the variable. It is a simple measure of the variable's central tendency (the sum of all variable values divided by the number of values). The formula used is as follows:

    XBAR = SUM[i=1 to N] X_i / N

where:

N
    Is the number of observations.

X_i
    Is the ith observation of variable X, i = 1, 2, 3,..., N.

MEDIAN Option
The MEDIAN option calculates the middle case value for each numeric variable. If the variable is ranked from its lowest to highest values, 50% of the cases lie above the median and 50% below it; the median value lies precisely on the 50th percentile. If the number of cases (N) is odd, the median is the (N + 1)/2th value. If N is even, the median is linearly interpolated according to the following formula:

    MEDIAN = X_(N/2) + 0.5 * ( X_(N/2 + 1) - X_(N/2) )

where:

X_(N/2)
    Is the (N/2)th value in a ranked sequence of the variable X.
RANGE Option
The RANGE option calculates the difference between the maximum and minimum value for each numeric variable.

STDEV Option
The STDEV option calculates the standard deviation for each numeric variable. In a normal distribution, it is the value such that about 68% of the cases lie within one standard deviation of the mean. It is a measure of the spread of values and is equal to the square root of the variance (S**2) defined below:

    STDEV = SQRT(VARIANCE)

    VARIANCE = SUM[i=1 to N] (X_i - XBAR)**2 / (N - 1)

MODE Option
The MODE option calculates the value that occurs most frequently for each numeric variable. If more than one value occurs the same number of times, the lowest value is deemed the mode. If the mode cannot be calculated (e.g., all values occur only once), a message to that effect is printed and the mode is set to the missing value as set in STATSET.

MAXIMUM Option
The MAXIMUM option determines the highest value for each numeric variable.


STERROR Option
The STERROR option calculates the standard error for each numeric variable. Given a sample (a given group of cases), the true population mean can be estimated by examining the means for a large number of equal sized samples chosen from that population. The array of these sample means forms a normal distribution. This distribution has a standard deviation which is called the standard error. It is an estimate of the difference between a given sample mean and an estimated population mean. The standard error is determined by dividing the standard deviation by the square root of the number of observations:
    STERROR = S / SQRT(N)

SKEWNESS Option
The SKEWNESS option calculates the skewness for each numeric variable. It is a measure of the deviation from symmetry for a distribution. Since skewness (the third moment) is an odd power of (X_i - XBAR), a value of zero indicates symmetry, and a positive or negative value indicates clustering above or below the mean (XBAR), respectively. The following equation defines skewness:

    SKEWNESS = SUM[i=1 to N] ( (X_i - XBAR) / S )**3 / N

For computational purposes, the following formula is used:

    SKEWNESS = ( SUM X_i**3 - 3*XBAR*SUM X_i**2 + 3*XBAR**2*SUM X_i - N*XBAR**3 )
               / ( N * ( (SUM X_i**2 - N*XBAR**2) / (N - 1) )**(3/2) )

where each sum is taken over i = 1, 2,..., N, and:

X_i
    Is the ith observation.

XBAR
    Is the mean.

N
    Is the number of observations.

S
    Is the standard deviation; the term raised to the 3/2 power is the calculation formula for S**3 (where S**2 is the variance).

If skewness cannot be calculated, a message is printed and the held value is set to the missing value from STATSET.


MINIMUM Option
The MINIMUM option determines the lowest value for each numeric variable.

VARIANCE Option
The VARIANCE option calculates the variance (S**2) for each numeric variable. It is a measure of variation from the sample mean. As an even power (the second moment), positive and negative differences count equally, with large variations counting more than small ones. The variance is literally the average squared deviation from the mean:

    S**2 = SUM[i=1 to N] (X_i - XBAR)**2 / (N - 1)

where N - 1 is generally taken as the denominator (instead of N), assuming sample data rather than the entire population. The difference is negligible for large samples (large N). A small variance occurs when there is little variation in the sample. For computational purposes, the following formula is used:

    S**2 = ( SUM[i=1 to N] X_i**2 - N*XBAR**2 ) / (N - 1)

KURTOSIS Option
The KURTOSIS option calculates the kurtosis for each numeric variable. As a fourth power of the difference from the mean (the fourth moment), it is a measure of the flatness or sharp definition of a sample distribution. The kurtosis of a normal distribution is zero. A positive or negative value indicates, respectively, a distribution narrower or flatter than a normal one. It is defined by:

    KURTOSIS = SUM[i=1 to N] ( (X_i - XBAR) / S )**4 / N - 3

For computational purposes, the following formula is used:

    KURTOSIS = ( SUM X_i**4 - 4*XBAR*SUM X_i**3 + 6*XBAR**2*SUM X_i**2
                 - 4*XBAR**3*SUM X_i + N*XBAR**4 )
               / ( N * ( (SUM X_i**2 - N*XBAR**2) / (N - 1) )**2 ) - 3

where each sum is taken over i = 1, 2,..., N. Note that the denominator contains the square of the variance (S**4) and that the minimum value of the kurtosis is -3. If the kurtosis is not available, a message is printed and the value, if held, is set to the missing value set by STATSET.
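The computational (sum-of-powers) formulas for variance, skewness, and kurtosis can be sketched together; one pass accumulates the power sums and the moments follow directly. This is an illustrative Python sketch with hypothetical names, not the STATS implementation:

```python
def moments(x):
    """Variance, skewness, and kurtosis via the computational
    sum-of-powers formulas shown in the text."""
    n = len(x)
    xbar = sum(x) / n
    s1 = sum(x)                         # SUM X_i
    s2 = sum(v * v for v in x)          # SUM X_i**2
    s3 = sum(v ** 3 for v in x)         # SUM X_i**3
    s4 = sum(v ** 4 for v in x)         # SUM X_i**4
    var = (s2 - n * xbar ** 2) / (n - 1)                       # S**2
    skew = (s3 - 3 * xbar * s2 + 3 * xbar ** 2 * s1
            - n * xbar ** 3) / (n * var ** 1.5)
    kurt = (s4 - 4 * xbar * s3 + 6 * xbar ** 2 * s2
            - 4 * xbar ** 3 * s1 + n * xbar ** 4) / (n * var ** 2) - 3
    return var, skew, kurt
```

For a symmetric sample such as 1, 2, 3, 4, 5 the skewness is zero, as the definitional formula requires.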


DECILES Option
The DECILES option calculates the 10 deciles for each numeric variable. If a variable is sorted from low to high, the 10 deciles are the exact values below which 10%, 20%, 30%,..., 90%, 100% of the values lie. The kth decile is calculated as follows:

    DECILE_k = X_i + ( X_(i+1) - X_i ) * ( (N*k)/10 - INT( (N*k)/10 ) )

where:

N
    Is the number of observations.

k
    Is the decile.

i
    Is the integer portion of (N*k)/10.

X_i
    Is the lower bound of the decile.

QUARTILES Option
The STATS option QUARTILES calculates the four quartiles for each numeric variable. They are calculated similarly to deciles and represent the 25%, 50%, 75%, and 100% population points. The formula for the kth quartile is:

    QUARTILE_k = X_i + ( X_(i+1) - X_i ) * ( (N*k)/4 - INT( (N*k)/4 ) )

where:

    i = INT( (N*k)/4 )
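The decile and quartile formulas share one interpolation rule, differing only in the divisor (10 or 4). A minimal Python sketch (hypothetical function name, 1-based index X_i mapped onto Python's 0-based lists) might read:

```python
def quantile(values, k, q):
    """kth q-tile (q=10 for deciles, q=4 for quartiles) using the
    interpolation formula shown in the text."""
    x = sorted(values)
    n = len(x)
    pos = n * k / q
    i = int(pos)              # INT(N*k/q): 1-based index of the lower bound X_i
    frac = pos - i            # fractional part, the interpolation weight
    if i >= n:                # the 100% point is the maximum value
        return x[-1]
    if i == 0:                # below the first value: return the minimum
        return x[0]
    return x[i - 1] + frac * (x[i] - x[i - 1])
```

For the values 1 through 10, the 5th decile by this rule is 5 (note this differs slightly from the MEDIAN option, which would give 5.5 for an even N).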

NUM_OBS Option
The NUM_OBS option determines the number of values present for each numeric variable. It is always displayed.

3.8.1 Control Options


A series of special control options provides run-time information and allows you to adjust the processing and disposition of your STATS output.

GROUPS Option
The GROUPS option causes all selected options to be calculated for each numeric variable, once for each value of the first analyzed field in the file. The group field is assumed to be sorted in ascending order and may be numeric or alphanumeric. No statistics are calculated for the first field (or variable) if the GROUPS option is specified. The group field is then the first field in the resulting HOLD file and keeps its original field name.


HOLD Option
The HOLD option causes the output of STATS to be held as a HOLD file with the filename set by STATSET. Printing is suppressed unless PRINT is also specified. If STATS ALL, HOLD, and GROUPS are specified, and the STATSET HOLD flag is left as the default value, a Master File Description is created for STATHOLD. The HOLD option generates a HOLD file containing all of the selected STATS options. A record is created for each variable, and for each group field value (if GROUPS is specified). The HOLD file can then be used for reports, graphs, and/or relational matches with other data files. Such matches can be used to create standardized data (data with zero mean and unit deviation).

PRINT Option
The STATS option PRINT sends output to a print queue, and is used to override the normal print suppression of the HOLD option.


Sample STATS Terminal Session


3.9 Stepwise Multiple Regression: STEPR


This ANALYSE option performs a stepwise multiple regression for any choice of dependent and independent variables in the subject file. Missing data and equation file processing are supported. Up to 64 variables may be processed. The procedure prompts for:

- A constant value or proportion of the sum of squares, to limit the variables entering the regression.
- Whether a table of residuals is desired.
- Whether the regression equation is desired.
- The dependent variable (name, truncation, or number).
- The nature of each of the remaining variables (available, forced, or deleted).

Instructions are provided for the last prompt (nature of variables) the first time STEPR is executed during an ANALYSE session, but not thereafter. Each step of the analysis looks at the reduction of the sum of squares for each variable. Each step adds the next independent variable that shares the highest partial correlation with the dependent variable. Forced or deleted designations always take precedence.


STEPR produces:

- The dependent variable.
- The number of forced variables.
- The number of deleted variables.

For each step in the regression:

- The step number.
- The name, number, and (forced or available) designation of the variable entered.
- The sum of squares reduced.
- The cumulative proportion reduced.
- The multiple correlation coefficient.
- The multiple correlation coefficient adjusted for degrees of freedom.
- The F-value for the analysis of variance.
- The standard error of estimate.
- For each variable, the name and number, with:
  -The mean.
  -The standard deviation.
  -The regression coefficient.
  -The standard error of the regression coefficient.
  -The T value.
  -The Beta weight.
- The regression equation at that step.
- A table of residuals and/or the Durbin-Watson statistic, upon prompted request.

The analysis continues step by step and terminates if the proportion reduced is less than the limiting constant specified by the user (0 is acceptable) or when all available variables have entered the regression. As with MULTR, EQFILE and PRINT control are supported. The analysis uses the Abbreviated Doolittle Method to enter variables in the regression and compute their regression coefficients.

Reference
Carl A. Bennett and others, Statistical Analysis in Chemistry and the Chemical Industry, (Appendix 6A), Ann Arbor, Michigan: Books on Demand (313-761-4700).


Sample STEPR Terminal Session


3.10 Time-Series Analysis: TIMESER


The TIMESER command implements a variety of operations on time-series data, including forecasting, leading and lagging, curve fitting, and smoothing. There are facilities for saving newly created variables in core for use with other ANALYSE facilities, for writing data to HOLD files, and for creating FOCEXECs containing equations. TIMESER is initiated by entering TIMESER in response to the ANALYSE prompt for a statistical operation:


ENTER STATISTICAL OPERATION DESIRED - timeser

TIMESER forecasting functions create additional records containing calculated variables along with existing actual variables. The actual variables are marked as missing for these new time series records by the ANALYSE missing value indicator. If MISSING is not set ON, TIMESER assumes a missing value with the default message:
MISSING DATA VALUE UNDEFINED: ASSUMED TO BE: -999.00

TIMESER then prompts for the field to be used as the time series variable and the value of 1 interval. For example:
ENTER NAME OF TIME VARIABLE - QUARTER
ENTER TIME FOR 1 INTERVAL (e.g. "1", "2 DAYS", "1 MONTH") - 3

TIMESER then prompts for a command. For example:


ENTER COMMAND (OR ? FOR HELP)- FIT

The time-variable, time intervals, and TIMESER commands are described in the following sections.

3.10.1 The Time Variable


The user must provide a variable that contains the time values; this field is called the "time variable." The other fields contain the data which are treated as functions of the time-variable, and are called "data-variables." In response to the TIMESER prompt
ENTER NAME OF TIME VARIABLE -

you may enter the fieldname, alias, a position number, or unique truncation of the variable that specifies position of "time" in the series. TIMESER then prompts for the increment between successive time periods. For example:
ENTER ITEM FOR 1 INTERVAL (e.g., "1", "2 DAYS", "1 MONTH") - .25 MONTH

Respond with just a number for a series that is not date-oriented (e.g., 1, 3, .5), or a number and a unit for date-oriented time series (e.g., "1 DAY" for a YMD or MDY formatted time-variable, "1 MONTH" for a YM or MY formatted time-variable, etc.). TIMESER considers a time variable date-oriented if it has a FOCUS date format in the file being analyzed (e.g., I4YM). The only restrictions on the time-variable are that the values must be in ascending order and an integral multiple of the TIME interval. TIMESER supports not-present values, or gaps, in the series being analyzed. Based on the interval specified, the full series is formed by assigning the missing value indicator to records not originally provided. For example, consider a YM (year/month) formatted time variable. Quarterly data (every third month provided) is formed into a monthly series by specifying a 1 MONTH interval. Similarly, monthly data is formed into "weekly" data by specifying a .25 MONTH interval. The series points for missing values may then be valued by either linear interpolation (LINERP) or exponential interpolation (EINTERP) as described below.

3.10.2 Commands that Create New Variables


The TIMESER commands described in the following sections create new data variables that may be manipulated in exactly the same manner as other data variables. When new variables are created, the names are constructed by appending the first nine characters of the name of the referenced data variable to the default prefix of the command that created it. For example, if we compute the moving average of a variable named SALES, the variable MA.SALES is created. Alternate variable names may be specified using AS phrases.

3-36

Information Builders

TIMESER commands may be specified in full or by a unique truncation of the name (e.g., "MTOTAL SALES 4" and "MT SA 4" are equivalent). Fields may be referenced by full fieldname, alias or unique truncation. (The default prefixes are shown in parentheses following the commands. When using them, separate prefixes from fieldnames with periods.)

LEAD (LE) Command


A variable is created which is the value of the specified variable shifted N time periods into the future. The command
LEAD Q N

produces
LE.Q(i) = Q(i+N)

where:
Q(i)

Is the ith value of the variable Q. Note that this example illustrates typing ahead. If you enter only LEAD, FOCUS will prompt you for a variable.

LAG (LG) Command


A variable is created which is the value of the specified variable shifted N time periods into the past. The command
LAG Q N

produces:
LG.Q(i) = Q(i-N)
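The index arithmetic behind LEAD and LAG can be sketched in Python (illustrative only; these helper names are invented, and -999 is the manual's default missing value for positions that have no source point):

```python
MISSING = -999.0   # the manual's default missing-value marker

def lead(q, n):
    """LE.Q(i) = Q(i+n); trailing positions have no source value."""
    return q[n:] + [MISSING] * n

def lag(q, n):
    """LG.Q(i) = Q(i-n); leading positions have no source value."""
    return [MISSING] * n + q[:len(q) - n]

q = [1.0, 2.0, 3.0, 4.0]
print(lead(q, 1))   # [2.0, 3.0, 4.0, -999.0]
print(lag(q, 1))    # [-999.0, 1.0, 2.0, 3.0]
```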

MAVG (MA) Command


A variable is created which is the N period moving average of the specified variable. The command
MAVG Q N

produces the TIMESER prompt:


HOW MANY PERIODS DO YOU WISH TO EXTRAPOLATE --

If "0" is entered, no extrapolation is performed. If you enter N (>0), TIMESER prompts


ENTER STARTING PERIOD FOR EXTRAPOLATION OR "TOP" --

which results in a moving average extrapolation that produces values based on maintaining the last average calculated.

Statistical Analysis Users Manual

3-37

MTOTAL (MT) Command


A variable is created that represents the N period moving total. The command
MTOTAL Q N

results in:
MT.Q(i) = SUM(j = i-N+1 to i) Q(j)
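A minimal Python sketch of the trailing moving total and moving average (for illustration only; TIMESER's own handling of the undefined leading points may differ, so None is used here as an assumption):

```python
def mtotal(q, n):
    """MT.Q(i) = sum of Q(i-n+1)..Q(i); undefined for the first n-1 points."""
    return [sum(q[i - n + 1:i + 1]) if i >= n - 1 else None
            for i in range(len(q))]

def mavg(q, n):
    """MA.Q(i) = n-period trailing moving average, i.e. mtotal / n."""
    return [t / n if t is not None else None for t in mtotal(q, n)]

q = [2.0, 4.0, 6.0, 8.0]
print(mtotal(q, 2))   # [None, 6.0, 10.0, 14.0]
print(mavg(q, 2))     # [None, 3.0, 5.0, 7.0]
```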

CMAVG (CA) Command


A variable is created which is a centered moving average for N periods of the variable specified. The command
CMAVG Q N

results in
CA.Q(i) = (1/N) * SUM(j = i-(N-1)/2 to i+(N-1)/2) Q(j)

when N is odd. If N is even, the interval extends 1 period further forward than backward.
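The odd/even window rule can be checked with a short Python sketch (illustrative only; the None handling of edge points is an assumption, not FOCUS behavior):

```python
def cmavg(q, n):
    """Centered n-period moving average. For even n the window extends
    one period further forward than backward, as the manual states."""
    back = (n - 1) // 2        # periods before point i
    fwd = n - 1 - back         # periods after point i (one more when n is even)
    out = []
    for i in range(len(q)):
        lo, hi = i - back, i + fwd
        if lo < 0 or hi >= len(q):
            out.append(None)   # window falls off the end of the series
        else:
            out.append(sum(q[lo:hi + 1]) / n)
    return out

print(cmavg([1.0, 2.0, 3.0, 4.0, 5.0], 3))   # [None, 2.0, 3.0, 4.0, None]
print(cmavg([1.0, 2.0, 3.0, 4.0, 5.0], 2))   # [1.5, 2.5, 3.5, 4.5, None]
```

Multiplying each value by n instead of dividing gives the corresponding CMTOTAL result.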

CMTOTAL (CT) Command


A variable is created for the centered moving total for N periods. The command
CMTOTAL Q N

produces the following result


CT.Q(i) = SUM(j = i-(N-1)/2 to i+(N-1)/2) Q(j)

when N is odd. If N is even, the result is similar to CMAVG.

LINTERP Command
Substitute values for missing values in the series are determined and supplied using linear interpolation. There is no prefix. "LINTERP Q" results in a linear fit (using Q = a + b*T) between the nearest two present values, which is used to assign values to the missing points.

EINTERP Command
Same as LINTERP but an exponential interpolation is performed by entering
EINTERP Q

where:

3-38

Information Builders

Q = a * e**(b*t)

Is used to find an exponential fit between the two adjacent present values, which is used to assign values to the missing points.
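Both interpolations can be sketched in Python. This is a minimal illustration, not FOCUS code: it assumes gaps are interior (the series begins and ends with present values), uses the manual's -999 missing marker, and implements EINTERP as linear interpolation on the logarithm, consistent with the Q = a * e**(b*t) form above:

```python
import math

MISSING = -999.0

def linterp(q):
    """Fill MISSING points with a straight line (Q = a + b*T) between
    the nearest present neighbours. Assumes interior gaps only."""
    out = list(q)
    i = 0
    while i < len(out):
        if out[i] == MISSING:
            lo = i - 1                       # last present point
            hi = i
            while out[hi] == MISSING:
                hi += 1                      # next present point
            for k in range(lo + 1, hi):
                frac = (k - lo) / (hi - lo)
                out[k] = out[lo] + frac * (out[hi] - out[lo])
            i = hi
        i += 1
    return out

def einterp(q):
    """Same, but fit Q = a*e^(b*t): interpolate log(Q) linearly,
    then exponentiate back (rounding away float noise)."""
    logs = [math.log(v) if v != MISSING else MISSING for v in q]
    return [round(math.exp(v), 10) for v in linterp(logs)]

print(linterp([2.0, MISSING, 6.0]))    # [2.0, 4.0, 6.0]
print(einterp([2.0, MISSING, 8.0]))    # [2.0, 4.0, 8.0]
```

Note the difference: the linear fit splits the gap arithmetically, while the exponential fit splits it geometrically (the geometric mean of 2 and 8 is 4).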

LDDIFF (DF) Command


The variable created is the difference between the N period lead and the current value of the specified variable. The command
LDDIFF Q N

produces:
DF.Q(i) = Q(i+N) - Q(i)

LDPCT (DP) Command


The LDPCT command is the same as LDDIFF but a percentage difference is computed. The command
LDPCT Q N

produces:
DP.Q(i) = 100 * (Q(i+N) - Q(i)) / Q(i)

LGDIFF (GF) Command


The variable created is the difference between the N period lag and the current value of the specified variable. The command
LGDIFF Q N

produces:
GF.Q(i) = Q(i) - Q(i-N)

LGPCT (GP) Command


The LGPCT command is the same as LGDIFF but a percentage difference is computed. The command
LGPCT Q N

produces:
GP.Q(i) = 100 * (Q(i) - Q(i-N)) / Q(i-N)
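The lag difference and percentage formulas can be sketched in Python (illustrative only; the None handling of the first N undefined points is an assumption):

```python
def lgdiff(q, n):
    """GF.Q(i) = Q(i) - Q(i-n); undefined for the first n points."""
    return [q[i] - q[i - n] if i >= n else None for i in range(len(q))]

def lgpct(q, n):
    """GP.Q(i) = 100 * (Q(i) - Q(i-n)) / Q(i-n)."""
    return [100.0 * (q[i] - q[i - n]) / q[i - n] if i >= n else None
            for i in range(len(q))]

q = [100.0, 110.0, 121.0]
print(lgdiff(q, 1))   # [None, 10.0, 11.0]
print(lgpct(q, 1))    # [None, 10.0, 10.0] -- steady 10% period-over-period growth
```

LDDIFF and LDPCT are the same computations with the index offset reversed (comparing each point against the value N periods ahead).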

Statistical Analysis Users Manual

3-39

LEXTRAP (LE) Command


The variable created equals a specified variable up to and including a specified time period, after which values are computed by linear extrapolation. The command
LEXTRAP Q N

results in the following prompts:


TYPE "AHEAD" OR "BACK" FOR EXTRAPOLATION DIRECTION -
ENTER STARTING PERIOD FOR EXTRAPOLATION OR "TOP" -
ENTER LINEAR GROWTH FACTOR OR "LAST" TO USE LAST DIFFERENCE -

New points are calculated by adding either the user-provided growth factor (user) or the difference between the last two points to the last point.
Q(i) = Q(i-1) + user

or
Q(i) = Q(i-1) + [Q(i-1) - Q(i-2)]

CEXTRAP (CE) Command


The CEXTRAP command is the same as LEXTRAP except a compound growth extrapolation is performed by entering
CEXTRAP Q N

resulting in the following prompts:


TYPE "AHEAD" OR "BACK" FOR EXTRAPOLATION DIRECTION -
ENTER STARTING PERIOD FOR EXTRAPOLATION OR "TOP" -
ENTER PERCENTAGE GROWTH FACTOR OR "LAST" TO USE LAST PERCENT DIFFERENCE -

New points are calculated by increasing the last point by either the user-provided percentage growth factor (user) or the percentage difference between the last two points.
Q(i) = Q(i-1) * user

or
Q(i) = Q(i-1) * [Q(i-1) / Q(i-2)]

EXSMO (ES) Command


See Section 3.3 Exponential Smoothing: EXSMO on page 3-4 for a discussion of the EXSMO command.

FIT Command
The FIT command performs regressions to fit a number of equations to the specified variable and optionally forecasts the fitted equation forward. The command
FIT Q

produces the following prompts:

3-40

Information Builders

DO YOU WISH TO KEEP PREDICTED VALUES (TYPE "KEEP" OR "NOKEEP") -
HOW MANY PERIODS DO YOU WISH TO EXTRAPOLATE -
DO YOU WISH TO KEEP RESIDUALS (TYPE "RESID" OR "NORESID") -
ENTER THE TYPE(S) OF EQUATION YOU WISH TO FIT -

If you do not name a variable, FOCUS prompts you for one. If predicted values are kept ("KEEP" entered) and residuals are kept, they then become new variables available for holding (see Section 3.10.3 Other TIMESER Commands on page 3-42) and analysis by any ANALYSE function. The equations that may be fitted are listed below. "Y" represents the variable to be fitted and "T" the time variable. The default prefixes for the predicted-value variables appear in parentheses in the table. The names of the residual variables are constructed by further prefixing the predicted-value variable name with an "R."

Name (Default Prefix)    Type of Function        Equation

LINEAR (LF.)             Linear                  Y = A + B*T
EXP (EF.)                Exponential             Y = A * EXP(B*T)
POWER (PF.)              Power                   Y = A * T**B
HYP1 (H1.)               Simple Hyperbolic       Y = A + B/T
HYP2 (H2.)               Hyperbolic (Type 2)     Y = 1/(A + B*T)
HYP3 (H3.)               Hyperbolic (Type 3)     Y = T/(A + B*T)
HYP4 (H4.)               Hyperbolic (Type 4)     Y = A/(1 + N*B*T)**(1/N)
ALL                      All functions

Figure 3-5. TIMESER Commands that Create New Variables

If the time variable has a date format, FIT treats the data variable as a function not of the time variable, but of a pseudo variable named T.INDEX, which is defined as the integer K for the Kth data point. T.INDEX is not saved in core, but it is included in the HOLD file if one is requested. All of the resulting FIT equations may be saved for further FOCUS use in TABLE and GRAPH (this is controlled by STATSET, and is discussed in Section 3.10.4, "Saving Forecast Equations"). If the STATSET PRINT option is ON, the equation, the Durbin-Watson statistic, and the regression statistics are output for each type of equation selected. The Durbin-Watson statistic is a measure of serial correlation of adjacent residuals in a regression; a value close to 2 indicates a reasonable fit (a value under 2 indicates positive autocorrelation; a value over 2 indicates negative autocorrelation).

NOTE: In each case, if only the TIMESER command is entered (e.g., "LEAD"), you are prompted for the specified variable and a number of periods (if required).
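FIT's regression machinery is internal to FOCUS, but the LINEAR case and the Durbin-Watson statistic can be sketched from scratch in Python (for illustration only; these function names are invented):

```python
def fit_linear(t, y):
    """Ordinary least squares for Y = A + B*T; returns (A, B)."""
    n = len(t)
    mt, my = sum(t) / n, sum(y) / n
    b = sum((ti - mt) * (yi - my) for ti, yi in zip(t, y)) / \
        sum((ti - mt) ** 2 for ti in t)
    a = my - b * mt
    return a, b

def durbin_watson(resid):
    """Sum of squared successive residual differences over the residual
    sum of squares; a value near 2 suggests no serial correlation."""
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    return num / sum(e * e for e in resid)

t = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]        # exactly Y = 1 + 2*T
print(fit_linear(t, y))         # (1.0, 2.0)
```

The other equation types in Figure 3-5 are typically fitted by transforming to this linear case (e.g., taking logs for EXP and POWER).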


3.10.3 Other TIMESER Commands


Command    Description

DISPLAY    Displays values for the time variable (T.INDEX) and up to 4 specified variables for the user-specified portion of the time series. T.INDEX is a sequential index generated by TIMESER. DISPLAY issues the following prompts:

           ENTER UP TO 4 VARIABLES TO DISPLAY -
           ENTER FIRST TIME PERIOD TO DISPLAY OR "BOT" -
           ENTER LAST TIME PERIOD TO DISPLAY OR "TOP" -

           Enter values of the time variable to specify the first and last periods to be displayed.

HOLD       Writes a hold file (default name STATHOLD; see STATSET) containing all variables, original or created.

KEEP       Saves all variables (original and created) for use by other ANALYSE operations and returns you to the ANALYSE prompt.

QUIT       Deletes new variables and returns you to the ANALYSE prompt.

REPLACE    Replaces a value of a specified variable. REPLACE prompts:

           ENTER NAME OF VARIABLE -
           ENTER PERIOD TO REPLACE -
           ENTER NEW VALUE -

           If you type ahead "REPLACE Q N", only the prompt for the new value appears.

SUBSET     Restricts all TIMESER operations to a range of data you specify, with values entered as the first and last values for the time variable. "SUBSET ?" produces the following prompts:

           ENTER SUBSET STATUS: "ON" OR "OFF" -
           ENTER FIRST PERIOD IN SUBSET -
           ENTER LAST PERIOD IN SUBSET -

           Alternatively, you may specify:

           SUBSET ON N1 N2

           where N1 and N2 define the beginning and end of your subset.

Figure 3-6. TIMESER Control Commands


3.10.4 Saving Forecast Equations


FIT can create equation FOCEXECs similar to those created by other ANALYSE commands. FOCEXECs are created if the ANALYSE STATSET parameters EQFILE and EQVAR are enabled (see STATSET). All equation types selected for FIT operations appear in the FOCEXECs, with the variable names constructed by appending the default predicted-variable prefixes to the variable names specified in the EQVAR parameters. All printed output from FIT commands can be suppressed by setting the STATSET PRINT parameter OFF (in which case only the FOCEXECs are created). For example, the following FOCUS procedure produces a FOCEXEC file named FITDEFS:
ANALYSE FILE HOLD
STATSET EQFILE=FITDEFS EQVAR=Y PRINT=OFF MISSING=ON
TIMESER TIMEVAR 1
FIT DATAVAR LINEAR POWER

The resulting FOCEXEC (FITDEFS) will contain DEFINE commands for the fields named LF.Y and PF.Y.

3.10.5 Ease-of-Use Features


To simplify your statistical analysis sessions, there are several shortcuts you can use when you feel comfortable with the basic TIMESER functions.

Typing Ahead
Responses to TIMESER prompts may be stacked (i.e., the specifications for an entire command may be typed on one line). For example, the following series of prompts
ENTER COMMAND (OR ? FOR HELP) -- display
ENTER UP TO 4 VARIABLES TO DISPLAY -- sales lg.sales
ENTER FIRST TIME PERIOD TO DISPLAY OR "BOT" -- 8004
ENTER LAST TIME PERIOD TO DISPLAY OR "TOP" -- 8112

could be avoided by responding as follows:


ENTER COMMAND (OR ? FOR HELP) - display sales lg.sales 8004 8112

Abbreviations
TIMESER accepts the shortest unique truncations of command names in place of the full names. Variable names (fieldnames) and their aliases can also be referenced by unique truncations. Equation names in FIT can be abbreviated (e.g., "LIN" is a suitable replacement for "LINEAR"). All other words must be typed in full.


Alternate Variable Names


It is possible to place the data from TIMESER operations into variables other than those named by the operation's default prefix. These can be existing variables or new variables. Except in the case of interpolation, the variable created cannot be the same as the original variable. In all data-creation commands except FIT, the default name can be replaced with an AS phrase. Thus
LAG SALES 3

places data in the default variable LG.SALES, whereas


LAG SALES AS LAGGEDSALES

places data in a variable named LAGGEDSALES.

Graphing TIMESER Results


When creating TIMESER HOLD files, missing variable values (such as actual variable values in forecasted periods) take on the missing-data value (default -999). The corresponding GRAPH missing-value support, using the GRAPH SET parameters GMISSING=ON (or OFF) and GMISSVAL=-999, can be used to graph actual, forecasted, and fitted data (see the FOCUS Users Manual).


TIMESER Sample Terminal Session


3.11 Crosstabulations: XTABS


The XTABS analysis generates contingency tables in addition to user-selected statistics. These tables, commonly called "crosstabs" or crosstabulations, are joint frequency distributions: counts of observations (or cases) classified by two or more discrete variables. Each classification variable may be either alphanumeric or numeric. Crosstabs represent one type of analysis of the relationship between two or more variables (other types include correlations, regressions, and analyses of variance). Typical uses of crosstabs are surveys and employee and census studies. To select XTABS, enter
XTABS SALARY BY DEPARTMENT

in response to the ANALYSE prompt for a statistical operation.


A simple crosstab, showing the counts for salaries by department (into "Low Paid" vs. "Better Paid"), follows. Each count is a cell, identified by its column and row. (A complete example appears in the "Sample XTABS Terminal Session" at the end of this section.)

In order to clarify such joint distributions, other cell statistics (such as percentages of rows, columns, and totals) are offered as options, along with various overall statistics (such as chi-square, Cramer's V, and contingency coefficients). The widths of these tables (panel sizes) are taken from the STATSET flag WIDTH. The number of rows and columns (and therefore, the number of panels) is limited only by available computer memory. (You should be careful not to use XTABS with continuous variables or those with many values, because tables with empty cells are difficult to interpret and may violate many of the assumptions applied in developing the associated statistics.)

3.11.1 Specifying Variables


When invoked, XTABS initially responds with a prompt for the crosstab, or classifying, variables:
PLEASE ENTER THE NAMES OF THE CROSSTABULATION VARIABLES (E.G. "SEATS AND COUNTRY BY FUEL")

The specification of variables consists of a series of variable names or variable lists, each separated by the word BY:
(variable name or list) BY (variable name or list) BY...

The first variable(s) specified becomes the horizontal classifier(s) and those following the BY are the vertical classifiers, running down the page. For example
SALARY BY DEPARTMENT

produces a single crosstab of cases displayed across SALARY by DEPARTMENT (down the page). Variable names may be fieldnames, aliases or unique truncations of any fields in the file being analyzed. A variable list is a series of variable names separated by the word AND. All words and names must be separated by either blanks or commas. For example
UNIT_SOLD AND RETURNS BY PROD_CODE AND DATE

produces four crosstabs: UNIT_SOLD BY PROD_CODE, RETURNS BY PROD_CODE, UNIT_SOLD BY DATE, and RETURNS BY DATE. Multiple BY phrases may also be used. For example


TEMP BY DIST BY BACTERIA

produces three crosstabs: TEMP BY DIST, TEMP BY BACTERIA, and DIST BY BACTERIA. Since typing ahead is supported, a convenient way to generate the above request is as follows:
ENTER STATISTICAL OPERATION DESIRED - xtabs temp by dist by bacteria

3.11.2 Specifying Options


After the classification variables are specified, XTABS prompts for desired options:
PLEASE ENTER OPTION NAME(S) OR NUMBER(S) DESIRED (E.G. "ALL" OR "ROWPCT COLPCT CHISQ")

You can then enter option numbers, names, or unique truncations, separated by blanks or commas.


All of the XTABS options are shown in the following table (Figure 3-7).

Number  Name      Description

Summary Statistics

1       CHISQ     Chi-square.
2       CRAMV     Cramer's V; phi for 2 by 2 tables.
3       CONT      Contingency coefficient.
4       LAMBDA    Asymmetric lambdas with each variable taken as dependent, and symmetric lambda.
6       TAUB      Kendall's tau b.
7       TAUC      Kendall's tau c.
8       GAMMA     Gamma.
9       SOMERSD   Asymmetric Somers' D with each variable dependent, plus symmetric Somers' D.
20      ALL       All summary statistical options.

Cell Statistics

27      EXPECTED  The frequency expected (assuming independence) is printed in each cell.
28      DEVIATN   The deviation of the observed frequency from the expected frequency is printed in each cell.
29      CELLCHI2  The contribution of the cell to the chi-square is printed in each cell. (The sum of the cell chi-squares is the table's chi-square.)
30      ROWPCT    The cell frequency's percentage of the row total is printed in each cell.
31      COLPCT    The cell frequency's percentage of the column total is printed in each cell.
32      TOTPCT    The cell frequency's percentage of the grand total is printed in each cell.
33      COUNT     The count (frequency) is printed in each cell.
34      WEIGHT    The frequency weighted according to the values of the selected field.

Control Options

35      NOGRID    Suppresses the default vertical grid normally printed on all tables.
36      DASH      Suppresses the dashed lines that are normally printed between horizontal rows of cells.


Number  Name       Description

40      NORANGE    All variables are assumed to be unranged and no prompting is done for ranges or user-defined headings.
42      NOHEAD     Suppresses page headings.
44      NOSTATS    Suppresses all statistics.
45      NORTOTAL   Suppresses row totals (and total percentages).
46      NOCTOTAL   Suppresses column totals (and total percentages).
50      GROUPS     Data grouped by first field; data must be sorted by first field.
?       "XTABS ?"  Generates a display of online information about XTABS operation.

Figure 3-7. XTABS Statistical Options

3.11.3 Summary Statistics


Summary statistics are important for measuring the significance of the relationship(s) between variables in the sample of cases analyzed. When all possible cases involving a set of variables are analyzed, the results actually measure the relationship (excluding measurement error) between the variables of interest. Since most studies use only small samples of the cases potentially available, it is important to measure the significance of the results obtained. In other words, you must determine whether the results could be due partly to chance or whether they actually represent the underlying population. One then speaks of "statistical significance," which is measured by various summary statistics (chi-square, contingency coefficient, Tau, etc.). These summary statistics appear in tables beneath the actual crosstabulation results.

CHISQ Option
The CHISQ option calculates and displays the chi-square statistic which tests the independence of the joint distribution of the variables in the table. The statistic does not measure the strength of the relation, but can be interpreted as a test of whether or not the variables are related. Empty cells and extremely small or large sample values lessen the significance of this statistic. Chi-square is defined by the following formula
chi-square = SUM(i = 1 to N) (fobs(i) - fexp(i))**2 / fexp(i)

where:
fobs(i)

Is the observed frequency (or count) in the ith cell.


fexp(i)

Is the expected frequency in the ith cell (assuming no relationship between the variables).


The larger the difference between the observed and randomly expected frequencies, the larger chi-square becomes. Large values for chi-square thus indicate the presence of a systematic relationship, while small values imply the absence of a relationship or statistical independence. Chi-square varies with the number of rows and columns used to determine the number of degrees of freedom (provided along with chi-square by XTABS) and the sample size.
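The chi-square computation can be sketched in Python (illustrative only; XTABS performs this internally). The function name is invented; the table is a list of rows of observed counts:

```python
def chisq(table):
    """Chi-square and degrees of freedom for a table of observed counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, f_obs in enumerate(row):
            f_exp = rows[i] * cols[j] / n       # expected under independence
            stat += (f_obs - f_exp) ** 2 / f_exp
    df = (len(rows) - 1) * (len(cols) - 1)
    return stat, df

print(chisq([[10, 20], [10, 20]]))   # (0.0, 1) -- identical rows, no association
print(chisq([[10, 0], [0, 10]]))     # (20.0, 1) -- perfect association
```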

CRAMV Option
The CRAMV option calculates and displays Cramer's V (phi for 2 x 2 tables), which corrects for the sample size (chi-square does not). It is defined by the following formula

V = [chi-square / (N * min(r-1, c-1))]**(1/2)

where:
"r" and "c"

Are the number of rows and columns, respectively.

phi

Is the phi statistic (appropriate only for 2 x 2 crosstabs), defined by:

phi = (chi-square / N)**(1/2)

Phi corrects chi-square for the number of cases (N). For 2 x 2 crosstabulations phi ranges from 0 to +1, with +1 indicating a perfect relationship. For crosstabs larger than 2 x 2, phi has no upper limit, so Cramer's V adjusts for the minimum of (rows - 1) and (columns - 1). The lower bound of V is 0, so values near 0 indicate a minimal relationship and large values of Cramer's V indicate a strong association.

CONT Option
The CONT option calculates and displays the contingency coefficient. It is another chi-square-based statistic adjusted for sample size. It is defined by the following formula:

C = [chi-square / (chi-square + N)]**(1/2)

The contingency coefficient, C, runs from 0 to a maximum value dependent on the size of the table. In comparisons it should be used with crosstabs of identical dimensions (same numbers of rows and columns).
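Both chi-square corrections above are one-line computations; a Python sketch (illustrative only, with invented function names) makes the relationship explicit:

```python
def cramers_v(chi2, n, r, c):
    """V = sqrt(chi2 / (N * min(r-1, c-1))); reduces to phi for 2x2 tables."""
    return (chi2 / (n * min(r - 1, c - 1))) ** 0.5

def contingency_c(chi2, n):
    """C = sqrt(chi2 / (chi2 + N))."""
    return (chi2 / (chi2 + n)) ** 0.5

# a perfectly associated 2x2 table has chi2 = N, so V reaches 1
print(cramers_v(40.0, 40, 2, 2))    # 1.0
print(contingency_c(40.0, 40))      # about 0.707 -- C never reaches 1
```

This also illustrates the caveat in the text: even for a perfect 2 x 2 association, C tops out below 1, which is why C should only be compared across tables of identical dimensions.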

LAMBDA Option
The LAMBDA option calculates and outputs three lambda statistics: one symmetric lambda and two asymmetric lambdas, one with an independent row variable and a dependent column variable, and one with the reverse (dependent row variable and independent column variable). Asymmetric lambda is based on the proportional reduction in error in estimating the distribution of the dependent variable when the independent variable is known. In other words, it measures how well you can predict the value of the dependent variable.


When all of the occurrences for any given value of the independent variable fall in a single cell (i.e., the remaining cells for that row or column are all zero), the value of lambda is 1. Asymmetric lambda is defined by the following formula

lambda(asym) = [SUM(j) max(fjk) - max(fk)] / [N - max(fk)]

where:
SUM(j) max(fjk)

Is the sum, over the categories j of the independent variable, of the maximum cell frequency within each category.

max(fk)

Is the maximum category total of the dependent variable.

The symmetric lambda is an average of the two asymmetric lambdas; no assumption of dependency is made. It is defined by the following formula

lambda(sym) = [SUM(j) max(fjk) + SUM(k) max(fjk) - max(fk) - max(fj)] / [2N - max(fk) - max(fj)]

where the terms are defined as in the asymmetric lambdas, with the roles of the row and column variables exchanged for the second set of terms.
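The asymmetric case can be sketched in Python (illustrative only; the function name is invented, and the column variable is taken as dependent):

```python
def lambda_asym(table):
    """Asymmetric lambda with the column variable dependent: proportional
    reduction in error when the row (independent) variable is known."""
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(col_totals)
    best_overall = max(col_totals)               # best guess ignoring the rows
    best_per_row = sum(max(row) for row in table)  # best guess knowing each row
    return (best_per_row - best_overall) / (n - best_overall)

print(lambda_asym([[30, 0], [0, 30]]))    # 1.0 -- the row pins down the column
print(lambda_asym([[15, 15], [15, 15]]))  # 0.0 -- knowing the row helps not at all
```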

TAUB Option
The TAUB option calculates and displays Kendall's Tau b statistic. Tau b, which is most appropriate for a square table (number of rows and columns equal), evaluates all cases pair-wise relative to the ordering (low to high) of each variable. Pairs ordered the same way on both variables are called concordant, and pairs ordered in opposite directions are called discordant. Other cases are considered "tied." Tau b, Tau c, gamma, and Somers' D are all measures of association between two variables and differ mainly in the manner of counting tied pairs. Tau b is defined by the following formula

Taub = (P - Q) / [0.5 * (N(N-1) - SUM(Tri(Tri-1))) * 0.5 * (N(N-1) - SUM(Tci(Tci-1)))]**(1/2)

where:
"P" and "Q"

Are the numbers of concordant and discordant pairs, respectively. Note that if there is a general ordering of pairs in the same direction on both variables, Tau b will be positive.

Tri and Tci

Are the numbers of ties in each category of the row and column variable, respectively. They turn out to be the respective row and column totals.

The formula used for computation is:


Taub = (P - Q) / [0.25 * (N**2 - SUM(Tri**2)) * (N**2 - SUM(Tci**2))]**(1/2)
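The pair counting and the computational form above can be sketched in Python (illustrative only; the names are invented, and x and y are the two variables' values per case rather than a crosstab):

```python
from collections import Counter

def pairs(x, y):
    """Count concordant (P) and discordant (Q) pairs over all pairs of cases."""
    p = q = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                p += 1      # same ordering on both variables
            elif s < 0:
                q += 1      # opposite ordering
    return p, q

def tau_b(x, y):
    """Kendall's tau b via the computational form:
    (P - Q) / [0.25 * (N^2 - sum Tri^2) * (N^2 - sum Tci^2)]^(1/2)."""
    p, q = pairs(x, y)
    n = len(x)
    tr = sum(t * t for t in Counter(x).values())  # squared tie-group sizes, rows
    tc = sum(t * t for t in Counter(y).values())  # squared tie-group sizes, cols
    return (p - q) / (0.25 * (n * n - tr) * (n * n - tc)) ** 0.5

print(tau_b([1, 2, 3, 4], [1, 2, 3, 4]))   # 1.0 for perfect agreement
```

The same P and Q also feed Tau c, gamma, and Somers' D below; only the tie adjustment in the denominator changes.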


TAUC Option
The TAUC option calculates and outputs Kendall's Tau c statistic. It is most appropriate in rectangular cases (the number of rows and columns unequal). Tau c is basically an average value of (P - Q) per pair, where a row- and column-adjusted approximation is taken for the number of pairs. Tau c is defined by the following formula

Tauc = 2m(P - Q) / (N**2 * (m - 1))

where:
P and Q

Are as defined above for Tau b.


N

Is the number of cases.


m

Is the minimum of the number of rows and columns.

GAMMA Option
The GAMMA option calculates and displays the Gamma statistic. It is independent of ties or table size (dimensions) and is defined by the formula:
Gamma = (P - Q) / (P + Q)

Its value is positive, zero, or negative if there are, respectively, more, equally many, or fewer concordant pairs than discordant pairs. For a 2 by 2 table, gamma equals another statistic called "Yule's Q."

SOMERSD Option
The SOMERSD option calculates and outputs the Somers' D statistic. Ties are taken into consideration in a different way than in the Tau statistics. As in the LAMBDA case, three statistics are produced: two asymmetric cases, one with each variable taken as the dependent variable, and the symmetric case. The asymmetric Somers' D is calculated with the following formula

Asymmetric Somers' D = (P - Q) / (P + Q + Ti)

where:
Ti (i = 1, 2)

Are the row and column ties when the dependent variable defines the rows or columns, respectively.

P and Q

Are as defined for the Tau statistics. The symmetric case does not account for which variable is dependent and is defined by the following formula:

Symmetric Somers' D = (P - Q) / [P + Q + 0.5 * (T1 + T2)]

ALL Option
The ALL option calculates and produces all of the available summary statistics.


3.11.4 Cell Statistics


The XTABS cell statistics may be specified selectively for calculation and display in each table cell.

EXPECTED
The EXPECTED cell statistic calculates and displays the expected cell frequency (assuming independence). It is defined by the formula
fexp(i) = (Ci * ri) / N

where:
Ci

Is the column total or marginal for the ith cell.


ri

Is the row total or marginal for the ith cell.


N

Is the total number of observations.

DEVIATN
This cell statistic calculates and displays the deviation between the observed and expected frequency for each cell. It is defined by the formula
DEVIATN(i) = ABS(fobs(i) - fexp(i))

where:
fobs(i)

Is the observed cell frequency for the ith cell.


fexp(i)

Is the expected cell frequency for the ith cell.

CELLCHI2
The CELLCHI2 option calculates and displays the contribution to the chi-square summation for each cell. It is defined by the formula

CELLCHI2(i) = (fobs(i) - fexp(i))**2 / fexp(i)

where:
fobs(i)

Is the observed frequency (or count) in the ith cell.


fexp(i)

Is the expected frequency in the ith cell assuming no relationship between the variables.


ROWPCT
The ROWPCT option calculates and displays the cell frequency's percentage of the row total for each cell. It is defined by the formula

ROWPCT(i) = 100 * fobs(i) / ri

where:
ri

Is the row total for the ith cell.

COLPCT
The COLPCT option calculates and displays the cell frequency's percentage of the column total for each cell. It is defined by the following formula

COLPCT(i) = 100 * fobs(i) / ci

where:
ci

Is the column total for the ith cell.

TOTPCT
The TOTPCT option calculates and displays the cell frequency's percentage of the table grand total for each cell. It is defined by the following formula

TOTPCT(i) = 100 * fobs(i) / N

where:
N

Is the total number of cases.
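The cell statistics above can be sketched together in Python (illustrative only; the function name is invented, and the table is a list of rows of observed counts):

```python
def cell_stats(table):
    """Per-cell expected count, row %, column %, and total % for a crosstab."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    out = []
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            out.append({
                "expected": cols[j] * rows[i] / n,   # independence assumption
                "rowpct": 100.0 * f / rows[i],
                "colpct": 100.0 * f / cols[j],
                "totpct": 100.0 * f / n,
            })
    return out

stats = cell_stats([[10, 30], [20, 40]])
print(stats[0])   # cell (0,0): expected 12.0, rowpct 25.0, totpct 10.0
```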

COUNT
The COUNT statistic calculates and displays the cell frequency (fobs(i)) for each cell.

ALL Option
The ALL option calculates and produces all of the available cell statistics.

3.11.5 Ease-of-Use Features


XTABS includes a number of default ease-of-use features:

Each table is printed with a page heading and a grid.

Complete summary statistics are produced and displayed.


Counts and row, column, and total percentages are displayed in each cell.

Row and column totals and percentages, along with grand totals, are automatically calculated and displayed (whenever cell row and column percentages are produced).

If no variables are specified (e.g., XTABS ALL), the first two variables (or the second and third) in the file being analyzed are used as the ACROSS and BY variables.

Output is automatically paneled in accord with the STATSET PANEL setting.

After you specify the crosstab variables and options, XTABS determines whether the required data is available. If this analysis is the first process for the existing selection criteria (see the STATSET option SELECT in Section 2.3 Preparing the Environment: STATSET on page 2-2) the data will be read in and the number of observations displayed as follows:
NUMBER OF OBSERVATIONS = 48

3.11.6 Control Options


XTABS provides a number of control options for use in producing crosstabulations.

Option     Action

NOGRID     Suppresses the printing of the grid (printed by default).
NORANGE    Suppresses automatic range prompting (see Section 3.11.7 Specifying Columns and Rows on page 3-56). In this case, one row or column is produced for each discrete variable value and appears with the value as the heading.
NOHEAD     Suppresses the printing of a heading in the table of crosstabs (printed by default).
NOSTATS    Suppresses automatic calculation and display of all summary statistics (produced by default).
NORTOTAL   Suppresses automatic calculation of row totals and percentages (produced by default).
NOCTOTAL   Suppresses automatic calculation of column totals and percentages (produced by default).
GROUPS     Produces a full crosstab and summary statistics for each discrete value of the first field. The sorting field is assumed to be in ascending order. This is similar to the GROUPS option in the ANALYSE Statistics (STATS) facility.
"XTABS ?"  Displays the online documentation for XTABS.

Figure 3-8. XTABS Control Options

3.11.7 Specifying Columns and Rows


XTABS prompts for column range categories and heading labels for each classification variable:
PLEASE ENTER RANGE AND/OR HEADING SPECIFICATIONS OR "NONE" FOR EACH VARIABLE (E.G. BOT 12 AS LOW, 12 18, 20 36 AS HIGH)


The response to this prompt determines the extent of values included in each column and row, along with their printed labels. Columns and rows are automatically sorted in ascending order. A response of NONE for a variable (alpha or numeric) causes one column or row to be created for each discrete value of the variable, with the label set to the variable value. A column or row range specifies the column's (or row's) lower inclusive bound (greater than or equal to the value) and the upper non-inclusive bound (less than the value). A lower bound of BOT and an upper bound of TOP may be specified for the first and last row or column. For example
BOT -100, 0 100, 100 TOP

where the range specifiers have the following meanings:


BOT -100

Specifies that the first row or column will include the lowest values through -100 (negative 100).
0 100

Specifies that the next group of rows or columns will contain values in the 0 to 100 range.
100 TOP

Specifies that the final group of rows or columns will contain values over 100. Ranges must be in ascending order. Discrete ranges may be alpha or numeric but bounded ranges must be numeric. Ranges or discrete values may be specified but not mixed for a single variable. The resulting columns and rows will be displayed in the ascending order of the data values regardless of the labels specified. Ranges may have gaps, or different lengths, but they may not overlap. The general form of the range specifications is as follows:
SALARY bot 16000 as "Low Paid", 16000 top as "Better Paid"

The "AS label" phrase is optional and may be used for none, any, or all ranges specified. The label (truncated to 12 characters) becomes the row or column heading. Labels containing embedded blanks must be enclosed in single quotation marks. For example:
90 100 as 'Very Hot'

If no label is specified, the upper and lower boundaries (or the discrete values) become the column or row labels. Each range or discrete value response is checked for ascending order, and another prompt is issued if the values supplied are in error. The input for a variable is generally entered on one line. If additional lines are needed, enter a comma as the last character on each line; this directs XTABS to continue on the following line with another prompt for the same variable. (If you accidentally type a comma at the end of the current line and do not wish to continue, type NEXT on the next line to satisfy the prompt.) Type LAST if you wish to respecify the complete set of ranges and labels for the preceding variable. Prompting for ranges and labels continues until the specification for the variable is syntactically complete.
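The bucketing rules above (inclusive lower bound, exclusive upper bound, BOT and TOP as open endpoints, optional labels) can be sketched in Python. This is an illustrative sketch only, not part of FOCUS; the function and variable names are invented for the example:

```python
# Sketch of XTABS-style range classification (illustrative, not FOCUS code).
# Lower bounds are inclusive, upper bounds exclusive; BOT and TOP stand for
# negative and positive infinity.

def make_ranges(spec):
    """Parse a list of (low, high, label) tuples, where low or high may be
    the strings 'BOT' or 'TOP'."""
    ranges = []
    for low, high, label in spec:
        lo = float('-inf') if low == 'BOT' else float(low)
        hi = float('inf') if high == 'TOP' else float(high)
        ranges.append((lo, hi, label))
    return ranges

def categorize(value, ranges):
    """Return the label of the range containing value: lo <= value < hi.
    Returns None for values falling in a gap between ranges."""
    for lo, hi, label in ranges:
        if lo <= value < hi:
            return label
    return None

# The SALARY example from the text:
salary_ranges = make_ranges([
    ('BOT', 16000, 'Low Paid'),
    (16000, 'TOP', 'Better Paid'),
])
```

Note that a value exactly on a boundary (16000 here) falls into the higher range, because the lower bound is inclusive and the upper bound is not.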

3.11.8 General Notes


XTABS gets its number of lines per page from the FOCUS SET LINES command. This, together with the STATSET WIDTH flag, determines the length and paneling of the report.

Statistical Analysis Users Manual


The FOCUS SET PAUSE=ON facility causes XTABS to wait before printing, to allow time for aligning forms on the output device. When STATSET missing value processing is in effect, the number of missing cases is displayed on the generated crosstabulation; the missing cases themselves are neither displayed nor included in the tabulation. If missing cases are required on the table, the range response "-999 as missing" (included with the other column or row specifications) will generate appropriately labeled columns or rows. Since typing ahead is supported throughout ANALYSE, a complete crosstabulation may be generated by the following ANALYSE prompt and response:
ENTER STATISTICAL OPERATION DESIRED xtabs salary by department all, none, none

References
John Mueller, Karl Schuessler, and Herbert Costner (1970), Statistical Reasoning in Sociology (2nd edition), Boston: Houghton Mifflin.

Sample XTABS Terminal Session


A Error Messages
(FOC108) NUMBER OF ANALYSE VARIABLES EXCEEDS 64

The statistical analysis sub-system cannot handle more than 64 independent variables.
(FOC109) INSUFFICIENT CORE FOR FACTOR ANALYSIS

The statistical operation requested requires more core storage.
(FOC110) INVALID STATSET MODE:

The ANALYSE option STATSET does not have the parameter requested.
(FOC111) SET VALUE MISSING

The value after the STATSET parameter is missing. A valid value must be provided.
(FOC112) VALUE MUST BE NUMERIC:

The value supplied in response to the ANALYSE prompt or STATSET parameter must be numeric.
(FOC113) AN ILLEGAL RESPONSE HAS BEEN ENTERED:

The response to the ANALYSE prompt is not recognized. Type EXPLAIN in ANALYSE mode if assistance is needed.
(FOC114) AN INDEPENDENT VARIABLE IS ALSO A DEPENDENT VARIABLE:

A regression cannot use the same variable as both independent and dependent.
(FOC115) THE NUMBER ENTERED EXCEEDS THE NUMBER OF VARIABLES:

The variable number provided in response to the ANALYSE prompt exceeds the number of variables in the file being analyzed.
(FOC116) FLUSHING TO QUIT OR TO NEXT STATISTICAL OPERATION

Sequential processing of ANALYSE commands cannot be continued because of parameter errors. The stacked commands are ignored until the next valid STATMODE is encountered.
(FOC117) VALID RESPONSES ARE ON OR OFF:

There are only two valid responses to this prompt. These are ON or OFF.
(FOC118) DEPENDENT VARIABLE DOES NOT EXIST:

The dependent variable provided in response to the ANALYSE prompt is not specified in the MASTER description for the data.
(FOC119) INDEPENDENT VARIABLE DOES NOT EXIST:

Same as (FOC118), but applies to the independent variable.


(FOC120) VALUE ENTERED IS INCONSISTENT WITH FILE DESCRIPTION:

The value supplied to the ANALYSE prompt does not match the format described for the field (for example, a non-numeric value for a numeric field).
(FOC121) INVALID LEVEL ENCOUNTERED IN DATA FOR FACTOR:

Data file submitted for FACTOR analysis is not correctly formulated.
(FOC122) NUMBER OF LEVELS NOT CONSISTENT FOR FACTOR:

The factor analysis sub-system has detected an input setup error.


(FOC123) NUMBER OF FACTORS EXCEEDS 6:

The statistical process of factor analysis cannot exceed 6 factors.


(FOC124) NON-NUMERIC DATA IN FILE:

There is non-numeric data for a variable that must be numeric. The observation must be eliminated from the ANALYSE statistical process.
(FOC125) RECAP CALCULATIONS MISSING:

The word RECAP is not followed by a calculation. Either remove the word or provide a calculation.
(FOC126) NUMBER OF DATA VALUES NOT CONSISTENT WITH FACTOR LEVELS:

In the ANALYSE mode the factor analysis procedure FACTO requires at least the same number of fields as levels of factors.
(FOC127) LITERAL TEST VALUE FOR GROUP IS INCORRECT:

In the request statement, a screening phrase against a group field supplies values inconsistent with the fields that make up the group. The size of one or more sections is incorrect, or a section that should be numeric is not.
(FOC128) NOTE..LIMIT USES EQ OR LE TEST CONDITION ONLY:

The RECORDLIMIT and READLIMIT phrases accept only EQ or LE test conditions; the two conditions have the same meaning.
(FOC133) WARNING: ANALYSE FILE LOST...HAD SAME NAME AS HOLD FILE

The prior HOLD file has been overwritten. Use a new name with the EQFILE option.


B Master File Descriptions and Diagrams


This appendix consists of the Master File Descriptions and Diagrams (pictures) of the following files:

AIRLINE
EMPLOYEE
EDUCFILE
JOBFILE
PROPERTY
SMSA

All of these files can be created by FOCEXECs distributed with FOCUS. EMPLOYEE, EDUCFILE, and JOBFILE are created by the EMPTEST (for CMS) and EMPTSO (for TSO) FOCEXECs. To create AIRLINE, PROPERTY, and SMSA, execute the AIRLINE, PROPERTY, and SMSA FOCEXECs.
