
FOCUS for IBM Mainframe

Statistical Analysis User’s Manual

DN1000019.1091

Table of Contents

1 Introduction  1-1
  1.1 Sources of Data  1-2
    1.1.1 External Files  1-2
    1.1.2 FOCUS Files  1-2
  1.2 Preparing Data for Analysis  1-2
    1.2.1 Direct Analysis  1-3
    1.2.2 Selective Data Analysis  1-3
    1.2.3 Specifying Variables  1-3
    1.2.4 Selecting Data  1-3
  1.3 Environmental Considerations  1-5
  1.4 The Examples in this Manual  1-7
2 Preparing for ANALYSE Sessions  2-1
  2.1 Entering the Environment  2-1
  2.2 Statistical Operations Summary  2-2
  2.3 Preparing the Environment: STATSET  2-2
  2.4 Limiting the Sample Size: FILESIZE  2-6
  2.5 Reviewing Online Documentation: EXPLAIN  2-7
3 The Statistical Operations  3-1
  3.1 Analysis of Variance: ANOVA  3-1
  3.2 Correlation Analysis: CORRE  3-2
  3.3 Exponential Smoothing: EXSMO  3-4
  3.4 Factor Analysis: FACTO  3-10
  3.5 Discriminant Analysis: MDISC  3-12
  3.6 Multiple Linear Regression: MULTR  3-15
  3.7 Polynomial Regression: POLRG  3-19
  3.8 Descriptive Statistics: STATS  3-22
    3.8.1 Control Options  3-28
  3.9 Stepwise Multiple Regression: STEPR  3-31
  3.10 Time-Series Analysis: TIMESER  3-35
    3.10.1 The Time Variable  3-36
    3.10.2 Commands that Create New Variables  3-36
    3.10.3 Other TIMESER Commands  3-42
    3.10.4 Saving Forecast Equations  3-43
    3.10.5 Ease-of-Use Features  3-43
  3.11 Crosstabulations: XTABS  3-46
    3.11.1 Specifying Variables  3-47
    3.11.2 Specifying Options  3-48
    3.11.3 Summary Statistics  3-50
    3.11.4 Cell Statistics  3-54
    3.11.5 Ease-of-Use Features  3-55
    3.11.6 Control Options  3-56
    3.11.7 Specifying Columns and Rows  3-56
    3.11.8 General Notes  3-57
A Error Messages  A-1
B Master File Descriptions and Diagrams  B-1

1 Introduction
FOCUS provides a full range of statistical tools. These facilities have been designed for ease of use in a conversational environment. The spectrum of facilities covers two types of statistical operations:

• Simple functions supplied by the FOCUS report request language as part of the regular report writer, including:
  • Minimums.
  • Maximums.
  • Averages.
  • Average sums of squares.
  • Percentage counts.
  • User-supplied functions.

• Complex statistical operations with interactive prompts to determine what to perform next (based on results obtained so far). These include:
  • Descriptive Statistics.
  • Correlations.
  • Multiple Linear Regressions.
  • Stepwise Regressions.
  • Polynomial Regressions.
  • Analyses of Variance.
  • Discriminant Analyses.
  • Factor Analyses.
  • Exponential Smoothing.
  • Time Series Analyses and Forecasting.

Statistical operations are performed on selected sets of variables and their values (observations) collected from one or more FOCUS or external files by the FOCUS report writer. Up to 64 variables are permitted. The number of observations is limited only by the amount of storage available to the user (e.g., 20,000 observations with 20 numeric variables would need 1.6 million bytes of virtual storage).

The steps for performing statistical analysis of data in FOCUS databases are:
1. Create a HOLD file of selected and/or redefined data using the report writer.
2. Issue the ANALYSE command.
3. Respond to the prompts.

The steps for performing statistical analysis of data in external files are:
1. Create a Master File Description describing the data to be analyzed (attributes of the variables: names, formats and lengths, etc. -- described in the FOCUS Users Manual).
2. Create a HOLD file of selected and redefined data using the report generation language (only necessary if the external file is not a fixed-format file).
3. Issue the ANALYSE command.
4. Respond to the prompts.

1.1 Sources of Data
Data for most statistical analyses may be drawn from two sources: external files and FOCUS files. Data from any combination of these sources may be collected for analysis through the facilities of the FOCUS request language.

Usage:
• The numeric data fields in either FOCUS or external files may be treated as any one of the following four numeric representations: binary integer, floating point, double-precision floating point, or packed decimal.
• Alphanumeric fields are supported (used to order groups as in STATS, XTABS).
• It is acceptable to have mixed formats in a file.

1.1.1 External Files
Any user-supplied fixed-format file can be used. Each such file must be accompanied by a Master File Description in which the fieldnames and formats of data are stated. The facilities for describing them are contained in the FOCUS Users Manual. When using external files:
• Create and maintain them with your normal system facilities.
• Fixed-format data files can be analyzed directly (you do not need to create an extract file).

1.1.2 FOCUS Files
Any FOCUS file or set of interconnected files may supply data for statistical analysis. FOCUS files are recommended for complex data structures. Documentation of the procedures for describing and building FOCUS files appears in the FOCUS Users Manual. Advantages of using FOCUS files:
• More efficient storage due to elimination of redundant data.
• Faster data retrieval for analysis than external files when data selection is involved.
• The FOCUS data validation, maintenance, and correction facilities are available.
• Facilitates the creation of control variables in ANOVA and MDISC.

1.2 Preparing Data for Analysis
The steps taken to prepare data for analysis depend on the location and condition of the data to be analyzed. The subsections that follow briefly outline the preparations.

1-2 Information Builders

1.2.1 Direct Analysis
When using all of the data in a fixed-format external file for analysis (with no selection or modifications), simply issue the ANALYSE command:

    ANALYSE FILE filename

1.2.2 Selective Data Analysis
Subsets of the data, actual or modified, may be statistically analyzed by first extracting the data with a TABLE command. Before the analysis process begins, issue a normal FOCUS report request naming the file to be accessed (FOCUS or external) and selecting the fields for analysis:

    TABLE FILE filename
    PRINT field1 field2...fieldn
    ON TABLE HOLD
    END
    ANALYSE FILE HOLD
    .
    .
    .

Such procedures may be run reiteratively since FOCUS treats HOLD files as external files (with their own descriptions). An alternative method is to create subsets of data using the SELECT operand in a STATSET operation (see Section 2.3 Preparing the Environment: STATSET on page 2-2).

1.2.3 Specifying Variables
Variables to be analyzed are specified as FOCUS verb objects:

    PRINT POPULATION MILLIONAIRES

Variables can be created or transformed directly with the DEFINE or COMPUTE facilities, which are described in the FOCUS Users Manual.

1.2.4 Selecting Data
Use FOCUS screening statements to specify the data you wish to analyze:

    IF STATE EQ TEXAS

The prepared data is placed in a HOLD file, which can be overwritten by subsequent requests. All analysis is then performed on the data in the HOLD file.

Several sets of extracted data may be held for subsequent analysis with the HOLD AS feature.

Example 1: Selected Variables

    TABLE FILE SMSA
    PRINT POPULATION
    ON TABLE HOLD AS POP
    END
    ANALYSE FILE POP
    .
    .
    .

Example 2: Transformed Variables

    DEFINE FILE SMSA
    BIGSMSA/I2=IF POPULATION GT 500 THEN 1 ELSE 0;
    END
    TABLEF FILE SMSA
    PRINT POPULATION BIGSMSA AND COMPUTE
    LOGPOP=LOG(POPULATION);
    ON TABLE HOLD AS NEWPOP
    END
    ANALYSE FILE NEWPOP
    .
    .
    .

Example 3: Selected Data

    TABLE FILE PROPERTY
    PRINT REGION CTYPE AND PROPERTY
    ON TABLE HOLD AS PROPHOLD
    IF REGION EQ "NORTHEAST"
    END
    ANALYSE FILE PROPHOLD
    .
    .
    .

Figure 1-1. Extracting Data for Analysis

1.3 Environmental Considerations
ANALYSE operations may be performed live, in batches, or from FOCEXECs in the same manner as other FOCUS commands (TABLE, GRAPH, etc.). For example, an ANALYSE session that performs a correlation analysis followed by a multiple linear regression analysis may be stored in a FOCEXEC as shown (the contents of the FOCEXEC are shown in the column on the right):

    Enter ANALYSE.                        ANALYSE FILE HOLD
    Select correlation analysis.          CORRE
    Select multiple linear regression.    MULTR
    Specify the dependent variable.       TEMP
    Specify two independent variables.    2
    Specify the independent variables.    DIST
                                          DEPTH
    Request a regression equation.        YES
    Omit table of residuals.              NO
    End the ANALYSE session.              QUIT

Figure 1-2. Sample ANALYSE Session

This sample 10-line FOCEXEC could be stored in a file and executed by entering "EX focexecname" in FOCUS ("focexecname" is a name of up to eight characters that you assign when you create the file).

It should be noted that ANALYSE automatically inhibits all prompt messages when used from Dialogue Manager and that all message facilities of the Dialogue Manager are available (see the FOCUS Users Manual).

To facilitate operation by experienced users, typing ahead is supported. You can use either comma- or blank-delimited type-ahead responses, one response at a time. For example, the FOCEXEC in Figure 1-2 could also be written as follows

    ANALYSE FILE HOLD
    CORRE,MULTR,&1,2,DIST,DEPTH,YES,NO,QUIT

in which case you would supply a name for variable &1 on the command line when you executed the FOCEXEC (see the FOCUS Users Manual). You can also type ahead in live sessions:

    ANALYSE FILE HOLD
    CORRE,MULTR,TEMP,2,DIST,DEPTH,YES,NO,QUIT

To direct ANALYSE output to an offline device (usually a line printer), set PRINT=OFFLINE in FOCUS before entering the ANALYSE environment, or select STATSET option ONLINE=OFF.

The FOCUS LET facility is also supported in ANALYSE, and offers an easy way to store row and column labels in crosstabs before starting your ANALYSE session (see Section 3.11.7 Specifying Columns and Rows on page 3-56).

If you are a new user, you may wish to review online descriptions of each of the operations. You can summon these by selecting EXPLAIN in response to the ANALYSE prompt for a statistical operation.

The following example illustrates the combined use of ANALYSE, GRAPH, and TABLE facilities.

    Enter ANALYSE.                         ANALYSE FILE HOLD
    Set up EQFILE TEMP1.                   STATSET EQVAR=TEMP1,EQFILE=TEMP1
    Specify a regression.                  MULTR,TEMP,2,DIST,DEPTH,YES,NO
    Set up EQFILE TEMP2.                   STATSET EQVAR=TEMP2,EQFILE=TEMP2
    Specify a second regression.           MULTR,TEMP,3,DIST,DEPTH,BACTERIA,YES,NO
    End the ANALYSE session.               QUIT
    Execute TEMP1 FOCEXEC (DEFINEs).       EX TEMP1 HOLD
    Execute TEMP2 FOCEXEC (DEFINEs).       EX TEMP2 HOLD ADD
    Graph TEMP1, TEMP2 and the             GRAPH FILE HOLD
    difference across DIST.                WRITE TEMP1 AND TEMP2 AND COMPUTE
                                           DIFF=TEMP1-TEMP2
                                           ACROSS DIST
                                           END
    Create a TABLE report of the graph.    RETYPE

Figure 1-3. Using TABLE, ANALYSE, and GRAPH in a Session

Note the use of the ANALYSE option STATSET at the start of the FOCEXEC in Figure 1-3. STATSET provides facilities for altering the default parameters (called "STATSET flags") that control the ANALYSE environment during your session. Use these facilities to:
• Set up special HOLD files for ANALYSE output, variables, and equation files.
• Specify handling of records with missing data.
• Specify the print destination.
• Set the print width for your terminal.

1.4 The Examples in this Manual
Each of the ANALYSE operation summaries in this manual concludes with a sample terminal session, illustrating the actual function(s) with "live" data. The examples used to illustrate the statistical operations are drawn on the AIRLINE, EMPLOYEE, PROPERTY and Standard Metropolitan Statistical Area (SMSA) data files.

EMPLOYEE is a FOCUS file and before you can use EMPLOYEE data in statistical operations you must first extract it into a flat file using a TABLE request. The extract step is required for all data residing in non-sequential files. (ANOVA and MDISC usually require that the data originate in a hierarchical structure with the "control" variables as keys.)

In FOCUS, descriptions of the files holding the data are stored in separate files called Master File Descriptions. The FOCUS Users Manual contains instructions for creating Master File Descriptions. The Master File Descriptions and the data used in the examples for ANALYSE operations appear in Appendix B of this manual.

2 Preparing for ANALYSE Sessions
Sections 2.3 and 2.4 of this chapter discuss ANALYSE facilities for preparing the environment and limiting the size of the sample file. Section 2.5 describes the online ANALYSE documentation facilities that explain the various statistical facilities.

2.1 Entering the Environment
You enter the Statistical Analysis (or ANALYSE) environment by issuing the command ANALYSE:

    ANALYSE FILE HOLD

FOCUS responds by displaying the ANALYSE options:

    STATISTICAL OPERATIONS...
    ANOVA   CORRE   EXPLAIN   EXSMO   FACTO     MDISC   MULTR
    POLRG   STATS   STATSET   STEPR   TIMESER   XTABS   FILESIZE
    ENTER STATISTICAL OPERATION DESIRED -

You select an operation by typing in a selection. The analysis chosen then proceeds, issuing prompts for any required information, and concludes with a request for another operation:

    ENTER STATISTICAL OPERATION DESIRED -

Thus, after reading a file just once you can perform unlimited analyses on the data.

Each of the flags (e. 2.) can be set individually or they can be set in groups. as often as needed. Figure 2-1.2 Statistical Operations Summary The statistical subcommands and operations are summarized in Figure 2-1. Polynomial regression (Section 3..5). Whenever possible. create an equation file.10).4). deciles. Each prompt begins with the name of the flag (e. Time Series Analysis (Section 3. Subcommand ANOVA CORRE EXPLAIN EXSMO FACTO MDISC MULTR POLRG STATS STATSET STEPR TIMESER XTABS Statistical Operation Analysis of variance (Section 3. analyses are integrated and use the same input as the previous steps.3 Preparing the Environment: STATSET The ANALYSE option STATSET is used to set the parameters (or ‘‘flags’’) that govern the FOCUS ANALYSE environment. missing data. Correlation coefficient. Factor analysis (Section 3.) (Section 3.9).6).g.11). equation file processing. Descriptive statistics (modes. mean. and number of observations (Section 3.5). and prepare the environment (Section 2. You can enter STATSET in three ways: • Type STATSET and press ENTER. variances.g. Multiple linear regression (Section 3. This initiates automatic prompting for all STATSET flags. Cross tabulation and summary statistics (Section 3..1). data selection. standard deviation. Documentation of ANALYSE functions (Section 2.2). Triple exponential smoothing and forecasting (Section 3. quartiles. 2-2 Information Builders .3). Summary of Statistical Operations NOTE: • • Each operation is initiated by a mnemonic subcommand and may be repeated in any order. all regression analyses use the results of a single set of CORRE calculations instead of reproducing the input data for each analysis.8). etc. Select processing of missing data.3). Stepwise linear regression (Section 3.2. etc.7). HOLD: ENTER THE NAME--). For example. Discriminant analysis (Section 3.

Specifies the FOCUS terminal print width setting. STATSET Flags that Control the ANALYSE Environment EQFILE The EQFILE flag names the equation FOCEXEC (maximum eight characters) being created. To issue a status request for the current settings of the STATSET flags. similar to the usage HOLD AS HOLD1 in the FOCUS TABLE command. EQVAR name HOLD name STATHOLD MISSING ON OFF MISSVAL value -999 ONLINE ON OFF PRINT ON OFF SELECT WIDTH criteria n Figure 2-2.• • Type ? STATSET and press ENTER. STATSET MISSING=ON. with EQVAR as the left-hand side of the equation. The selection criteria used to select records for analysis. An ON or OFF setting routes the output from analyses to print (useful when HOLD files are created and analyzed in subsequent iterations). issue the following command: ? STATSET This is illustrated in the STATSET terminal session at the end of this section. The command syntax is as follows: STATSET EQFILE=FIT1 Statistical Analysis Users Manual 2-3 . Flag EQFILE Value name EQFILE Description Names the FOCEXEC that will hold the regression equation as a defined variable. Equation file processing creates a FOCEXEC containing the regression equation.. All variables are compared to this value when MISSING=ON. Type STATSET and make your own assignments for STATSET flags before pressing ENTER (e. The dependent variable name used in the regression equation (stored in the EQFILE FOCEXEC). Set to NONE or not issued if no FOCEXEC is required. The value used to identify missing data fields.g. This initiates a display of the current STATSET flag settings. An ON or OFF setting specifies where to route output (ONLINE/OFFLINE) from within ANALYSE. HOLD=STYDY1). The name of the HOLD file being created. The STATSET flags are summarized on the following page. Note that this affects report formatting by XTABS. MISSVAL=-999. An ON or OFF setting specifies whether records with missing data will be processed.

Using "STATSET EQFILE=TFIT1, EQVAR=TEMP1", the regression fitting TEMP by DIST and DEPTH would produce an EQFILE that looks like the following:

    -DEFAULT &2="";
    DEFINE FILE &1 &2
    TEMP1 = 91.04427 +( -.0027106*DIST +( -1.008607*DEPTH ) );
    END

In this form "&1" may be used as needed (e.g., any HOLD file, etc.). "&2" can be ADD or a blank (see the FOCUS Users Manual). Note that you may create as many equation files as needed in addition to the original data for subsequent use in comparative graphing and analysis reporting.

EQVAR
The EQVAR flag sets the variable name (maximum 12 characters) that appears on the left-hand side of the EQFILE. The command syntax is as follows:

    STATSET EQVAR=FIT1

HOLD
The HOLD flag sets the name of the HOLD file to be created. A number of statistical operations (e.g., STATS, TIMESER) provide the means for holding varying types of data. This data is written into a HOLD file. A Master File Description is also created. Such files may then be used directly by FOCUS. The default name for the HOLD flag is STATHOLD. The command for setting this flag is

    STATSET HOLD=mychoice

where:
mychoice
Is any name of up to eight characters.

The HOLD flag may be set as often as needed. In this way, many different HOLD files can be created in one ANALYSE session. A given file may be replaced (overwritten) as necessary. When the file created has the same name as the ANALYSE subject, the following message is displayed, followed by the normal output:

    (FOC133) WARNING: ANALYSE FILE LOST...HAD SAME NAME AS HOLD FILE.
    HOLDING...

The file being analyzed remains available for further ANALYSE processing throughout the current ANALYSE session. Processing with new selection criteria (STATSET flag SELECT) is inhibited (because the disk file is now different) and the following error message is displayed:

    (FOC134) RE-READING DATA UNDER NEW SELECTION CRITERIA IS INHIBITED.

MISSING
The MISSING flag specifies whether missing data should be included in the analysis (ON) or omitted (OFF). The default value is OFF. When ON, individual variables within records are ignored if they are numerically equal to the value in the STATSET flag MISSVAL (see below). This special processing of records with "missing data" is supported in the following operations: CORRE, MULTR, POLRG, STEPR, STATS and XTABS. To include "missing data" issue the following command:

    STATSET MISSING=ON

MISSVAL
The MISSVAL flag sets the value that identifies missing data. The default value is -999. It is recommended that you use only whole numbers. The command syntax is as follows:

    STATSET MISSVAL=-999999

ONLINE
The ONLINE flag determines the destination of statistical output. When set ON (the default value), statistical output is sent to the terminal (SYSPRINT). When set OFF, statistical output is sent to the offline device (usually a line printer). Error messages and prompts are directed to the terminal. Use the following command to turn off this flag:

    STATSET ONLINE=OFF

PRINT
The PRINT flag controls the statistical output of the regression analyses (MULTR, POLRG, STEPR). When ON (the default value), statistical output is sent to the terminal or printer, depending on the setting for ONLINE. When OFF, statistical output is inhibited. Error messages and prompts are directed to the terminal. This is useful when the purpose of a regression, or series of regressions, is to produce an equation file for future use. To inhibit the output, use the following STATSET command:

    STATSET PRINT=OFF

SELECT
The SELECT flag sets the record selection criteria for subsequent statistical analyses. The default value (OFF) specifies no screening. SELECT may be set to include a series of valid FOCUS screening conditions of the following form:

    FIELDNAME TEST RELATION LITERAL (or LITERAL or...)

Any number of screening tests may be used. Separate multiple tests with commas, and terminate the last with a comma/dollar sign (,$), as shown below:

    STATSET SELECT=CITY EQ STAMFORD, PROD_CODE EQ B10, UNIT SOLD GT 25,$

Note that multiple fields may be tested on one or more lines; however, a test for a single field may not exceed one line.
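As an illustration of the SELECT syntax described above (tests separated by commas, the last one terminated with a comma/dollar sign), the following Python sketch assembles such a command string. The helper name `build_select` and the tuple layout are hypothetical conveniences for this example, not part of FOCUS, and the sketch makes no attempt to validate field names or relations.

```python
def build_select(tests):
    """Assemble a STATSET SELECT command from (fieldname, relation, literal) tuples.

    Hypothetical helper for illustration only: it joins the tests with commas
    and appends the terminating comma/dollar sign.
    """
    if not tests:
        raise ValueError("at least one screening test is required")
    parts = [f"{field} {relation} {literal}" for field, relation, literal in tests]
    return "STATSET SELECT=" + ", ".join(parts) + ",$"


command = build_select([
    ("CITY", "EQ", "STAMFORD"),
    ("PROD_CODE", "EQ", "B10"),
])
print(command)
```

The resulting string could then be placed on a line of a FOCEXEC ahead of the statistical operation it is meant to screen.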

WIDTH
The WIDTH flag sets the number of characters in the print line (the default is 80 characters). In XTABS (the ANALYSE cross tabulation facility) WIDTH is used to control automatic paneling. The command syntax is as follows:

    STATSET WIDTH=130

Sample STATSET Terminal Session

Data extraction FOCEXEC

    ANALYSE FILE AIRLINE
    STATSET MISSING=ON, MISSVAL=0, HOLD=LAGGED
    TIMESER, T, TDATE, 1 MONTH
    LAG LOGRIDER AS PLAG1 1
    LAG LOGRIDER AS PLAG12 12
    HOLD
    END

Analysis preparation FOCEXEC

    ANALYSE FILE LAGGED
    STATSET MISSING=ON, HOLD=REGRESS, EQFILE=REGRESS
    STATSET SELECT=PLAG1 NE 0, PLAG12 NE 0,$
    MULTR   LOGRIDER, 3, INDEX, PLAG1, PLAG12
    YES
    STAT
    END

Sample STATSET dialogue

    ENTER STATISTICAL OPERATION DESIRED - statset
    HOLD : ENTER THE NAME OF THE HOLD FILE FOR STATISTICAL OUTPUT - sessout
    ONLINE : ENTER "ON" FOR ONLINE PRINTING OR "OFF" FOR OFFLINE PRINTING - on
    PRINT : ENTER (ON OR OFF) THE STATUS OF THE PRINT CONTROL FLAG - off
    WIDTH : ENTER NUMBER OF CHARACTERS TO PRINT ON A LINE - 80
    EQFILE : ENTER "NONE" OR THE REGRESSION EQUATION FILE NAME - none
    MISSING : ENTER (ON OR OFF) THE STATUS OF MISSING DATA PROCESSING - off
    SELECT : ANY SELECTION TESTS (YES OR NO) - no
    ENTER STATISTICAL OPERATION DESIRED - stats

2.4 Limiting the Sample Size: FILESIZE
At the point of selecting a statistical operation, you can also specify the number of observations to be used for your sample. To do so, enter "FILESIZE" in response to the ANALYSE prompt for a statistical operation.

    STATISTICAL OPERATIONS...
    ANOVA   CORRE   EXPLAIN   EXSMO   FACTO     MDISC   MULTR
    POLRG   STATS   STATSET   STEPR   TIMESER   XTABS   FILESIZE
    ENTER STATISTICAL OPERATION DESIRED
    filesize
    ENTER NUMBER OF RECORDS TO PROCESS
    10

2.5 Reviewing Online Documentation: EXPLAIN
There is online documentation available for each ANALYSE statistical operation, which you can summon either by selecting the EXPLAIN option as your STATISTICAL OPERATION, or, if you know the name of the particular analysis, by entering the name of the operation followed by a space and a question mark (?). Thus

    ANOVA ?

summons a display of the online documentation for ANOVA. FOCUS command level help is also available with:

    >>HELP ANALYSE

EXPLAIN
Enter EXPLAIN in response to the ANALYSE prompt for a statistical operation to invoke the following display

    ENTER STATISTICAL OPERATION DESIRED-
    explain
    ENTER COMMAND (ALL, LIST, FOR SUMMARY, OR NAME OF SPECIFIC OPERATION)

where:
ALL
Prints the entire online file of documentation for ANALYSE operations.
LIST
Lists and briefly describes the statistical operations.
operation
Prints the documentation for a specified operation.

Sample EXPLAIN Terminal Session

3 The Statistical Operations
This chapter describes each of the statistical operations that you can run within ANALYSE. The analyses are presented in alphabetical sequence.

3.1 Analysis of Variance: ANOVA
The ANOVA statistical operation performs an analysis of variance for an equal cell factorial design for up to 5 factors. (Note that the number of factors at each level must be the same.) To initiate it, respond ANOVA to the ANALYSE prompt for a statistical operation:

    ENTER STATISTICAL OPERATION DESIRED - anova

After selecting ANOVA, you are prompted for the name of each factor desired. All other required information is taken from the input file, including the number of factors and the number of groups or levels in each factor (in a factorial manner). An example follows:

    TABLE FILE PROPERTY                      Initiate the request.
    WRITE CNT.REGION                         Retrieve the number of REGIONS (factor 1).
    WRITE CNT.CTYPE BY REGION                Retrieve number of City types (factor 2) within factor 1.
    WRITE CNT.PROPERTY BY REGION BY CTYPE    Property by region (factor 3).
    PRINT PROPERTY                           The actual variable PROPERTY.
    BY REGION NOPRINT BY CTYPE NOPRINT       Specify the sort sequence.
    ON TABLE HOLD                            The file for analysis.
    END
    ANALYSE FILE HOLD                        The ANALYSE request.
    ANOVA                                    The statistical function.
    A,B,C                                    The factors.
    END
    >

Figure 3-1. Sample ANOVA Session

References
The ANOVA analysis is done as a factorial design using a three-operator method.
A. Ralston (1967), "Analysis of Variance" (Chapter 20), Mathematical Methods for Digital Computers, New York: John Wiley and Sons.

Sample ANOVA Terminal Session

3.2 Correlation Analysis: CORRE
The CORRE analysis calculates and displays means, standard deviations, and correlation matrices for all numeric variables in the file named. To initiate it, enter CORRE, in response to the ANALYSE prompt for a statistical operation:

    ENTER STATISTICAL OPERATION DESIRED - corre

There are no prompts for additional information.
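As a rough illustration of the quantities CORRE reports (means, standard deviations, and a product-moment correlation matrix), the Python sketch below works from a small in-memory table. It is not FOCUS code: the function name `corre` is borrowed only as a label, the input is assumed to be a list of complete numeric rows, and no missing-data handling (MISSING/MISSVAL) is attempted.

```python
import math


def corre(rows):
    """Return (means, standard deviations, correlation matrix) for numeric rows.

    rows is a list of n observations, each a list of m numbers. The column
    means are subtracted before cross products are formed, which keeps the
    sums small and helps numerical accuracy.
    """
    n = len(rows)
    m = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    # Deviations of every value from its column mean.
    dev = [[row[j] - means[j] for j in range(m)] for row in rows]
    # Sums of cross-product deviations S[j][k].
    s = [[0.0] * m for _ in range(m)]
    for j in range(m):
        for k in range(m):
            cross = sum(d[j] * d[k] for d in dev)
            correction = sum(d[j] for d in dev) * sum(d[k] for d in dev) / n
            s[j][k] = cross - correction
    std = [math.sqrt(s[j][j] / (n - 1)) for j in range(m)]
    corr = [[s[j][k] / math.sqrt(s[j][j] * s[k][k]) for k in range(m)]
            for j in range(m)]
    return means, std, corr


means, std, corr = corre([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
print(means, std, corr)
```

In the toy data the second column is exactly twice the first, so the off-diagonal correlation comes out as 1.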

The results of these calculations and the sums of cross-product deviations (not displayed) are saved for use in subsequent regression analyses. The method used calculates product-moment correlation coefficients. If you choose to process records with missing data (SET MISSING=ON, MISSVAL=some value), pair-wise deletion is performed. (In pair-wise deletion, cases are omitted from the computation when either of the two variables under consideration is missing.)

With the input data $X_{ij}$, where $i = 1, 2, \ldots, n$ (observations) and $j = 1, 2, \ldots, m$ (variables), the following equations are used:

Means

$$\bar{X}_j = \frac{\sum_{i=1}^{n} X_{ij}}{n}$$

where $j = 1, 2, \ldots, m$

Standard Deviation

$$S_j = \frac{\sqrt{S_{jj}}}{\sqrt{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2}{n-1}}$$

where $j = 1, 2, \ldots, m$

Correlation Coefficients

$$r_{jk} = \frac{S_{jk}}{\sqrt{S_{jj}} \cdot \sqrt{S_{kk}}}$$

where $j = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, m$

Sums of Cross-Product Deviation

$$S_{jk} = \sum_{i=1}^{n} (X_{ij} - T_j)(X_{ik} - T_k) - \frac{\sum_{i=1}^{n} (X_{ij} - T_j) \cdot \sum_{i=1}^{n} (X_{ik} - T_k)}{n}$$

where $j = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, m$, and the temporary means

$$T_j = \frac{\sum_{i=1}^{n} X_{ij}}{n}$$

are used for computational accuracy.

NOTE:
• Any alphanumeric data is detected and treated as numeric zeroes.
• CORRE requires a greater number of observations than variables (by at least 1).
• When processing records with missing data (MISSING=ON), the CORRE output includes the number of observations and pairs present (i.e., not missing).

Sample CORRE Terminal Session

3.3 Exponential Smoothing: EXSMO
The EXSMO analysis performs triple exponential smoothing for a time series of numeric variables in the file. Many smoothed series for the same or different variables can be displayed, forecasted, recalculated, retained in core, and/or written to a HOLD file.

To initiate this analysis, enter EXSMO, in response to the ANALYSE prompt for a statistical operation.

    ENTER STATISTICAL OPERATION DESIRED - exsmo

EXSMO then prompts for the name of the field to be smoothed.

    ENTER NAME OF VARIABLE TO BE SMOOTHED - sales

The variable (in the above example "sales") may be specified by its fieldname, alias, position number in the file, or any unique truncation of the alias or fieldname.

EXSMO names smoothed variables (those created for examination, holding, or further smoothing) by prefixing the first 10 characters of the names or aliases (aliases are assigned if HOLD files are created) with an "S." For example, RIDER is the second variable in the file, AIRLINE. EXSMO creates a smoothed variable with the following attributes:

    FIELDNAME=S.RIDER,ALIAS=S.E02,FORMAT=D15.3,$

If you prefer to name the smoothed variable yourself, use the AS phrase. For example:

    EXSMO RIDER AS SMOOTH_RIDER

These alternate names may contain up to 12 characters and must be enclosed in single quotation marks if they contain any embedded blanks. The values of variables that exist when you enter EXSMO are protected and cannot be overwritten by EXSMO. For example, if NRIDER is the name of a field in the file being analyzed, the expression

    EXSMO RIDER AS NRIDER

results in an error message (and a repeat prompt). However, you can overwrite fields created by EXSMO during the current session.

EXSMO prompts the user for a smoothing constant α (alpha) (where 0.0 < α < 1.0).

    ENTER SMOOTHING CONSTANT (E.G. 0.1) -

The larger the value of the constant, the larger the influence of previous data on the next smoothed point. EXSMO then prompts the user for the source of the three additional constants (A, B, and C) required for the smoothing:

    DO YOU WISH TO SPECIFY INITIAL COEFFICIENTS (YES/NO)

If you specify YES, the following three prompts appear:

    ENTER COEFFICIENT A -
    ENTER COEFFICIENT B -
    ENTER COEFFICIENT C -

If you specify NO, EXSMO uses the default method to calculate starting values for A, B, and C. The initial default values of A, B, and C are determined as follows:

$$C = X_1 - 2X_2 + X_3$$
$$B = X_2 - X_1 - 1.5C$$
$$A = X_1 - B - 0.5C$$

where $X_1$, $X_2$, and $X_3$ are the first three input time series data points, and the calculation is done in the order shown (C, B, and then A).

Starting with initial values of A, B, and C, EXSMO calculates the first smoothed series data point S_1, updates coefficients A, B, and C, and then proceeds step by step through the time series values generating the smoothed series. The formulas used at each step are defined below:

\[ S_i = A + B + 0.5C \]

where:
Si

Is the smoothed data value for the next time period (the ith period).
A, B, and C

Are the values that exist at the (i-1)th time period. After S_i is calculated for one time period ahead, the A, B, and C coefficients are updated with the following formulas:

\[ A = X_i + (1 - \alpha)^3 (S_i - X_i) \]
\[ B = B + C - 1.5\,\alpha^2 (2 - \alpha)(S_i - X_i) \]
\[ C = C - \alpha^3 (S_i - X_i) \]

where:
Xi

Is the input time series data point for the one time period ahead.
α

Alpha is the smoothing constant. The calculation is done in the order shown (A, B, and then C) with the B and C on the right-hand side of the equations taking on their previous values. EXSMO then proceeds, one time period at a time, until an entire smoothed time series (Si ) is calculated covering the same extent as the input time series. At the end, EXSMO provides final values for the coefficients A, B, and C in the expression
\[ F_i = A + B\,T + \frac{C\,T^2}{2} \]

where:
Fi

Is the ith forecasted value of the smooth series.
T=1,2,3...

Are used for the first, second, third, etc. forecasted time period(s). This expression is used to find estimates (or forecasts) for the specified number of time periods ahead (T).
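The smoothing steps described above can be sketched in Python as follows. This is a minimal illustration, not the EXSMO implementation: it assumes the update equations carry the exponents (1 − α)³, α², and α³, that the default starting value of A uses 0.5C, and that the first smoothed point is compared with the first input point. The function names `exsmo` and `forecast` are hypothetical.

```python
def exsmo(x, alpha, coeffs=None):
    """Triple exponential smoothing sketch following the steps in this section.

    x      -- input time series (at least 3 points if coeffs is None)
    alpha  -- smoothing constant, 0.0 < alpha < 1.0
    coeffs -- optional (A, B, C) starting values; defaults are derived
              from the first three data points
    Returns (smoothed series, final (A, B, C)).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if coeffs is None:
        if len(x) < 3:
            raise ValueError("need at least three data points")
        # Default starting values, computed in the order C, B, A.
        c = x[0] - 2.0 * x[1] + x[2]
        b = x[1] - x[0] - 1.5 * c
        a = x[0] - b - 0.5 * c
    else:
        a, b, c = coeffs

    smoothed = []
    for xi in x:
        s = a + b + 0.5 * c              # S_i, one period ahead
        smoothed.append(s)
        diff = s - xi                    # S_i - X_i
        # Update in the order A, B, C; B uses the previous value of C.
        a = xi + (1.0 - alpha) ** 3 * diff
        b = b + c - 1.5 * alpha ** 2 * (2.0 - alpha) * diff
        c = c - alpha ** 3 * diff
    return smoothed, (a, b, c)


def forecast(coeffs, periods):
    """F = A + B*T + C*T**2/2 for T = 1, 2, ..., periods."""
    a, b, c = coeffs
    return [a + b * t + c * t * t / 2.0 for t in range(1, periods + 1)]
```

With a perfectly linear or quadratic input series the default coefficients reproduce the series exactly, which is a convenient sanity check on the formulas.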


When the smoothing calculations are complete, EXSMO prints the initial and final values of the coefficients used (see the example below). The user then receives a series of prompts calling for dispositions for the input and smoothed data for the actual and forecasted time periods (display, hold, etc.). EXSMO requests the data with the following prompt:
ENTER COMMAND (E.G. PRINT,LAST,FORECAST,SHOW,EXSMO,KEEP,HOLD,QUIT)

Note that at any point before a new smoothed series is computed, a response of EXSMO restarts the analysis at the first prompt.

EXSMO Control Commands

COMMAND        DESCRIPTION
PRINT p q      A table of data values and smoothed-series values is displayed for data points numbered from p to q, where p and q are positive integers. For values greater than N (the number of observations) only predicted smoothed values can be displayed.
LAST p         Displays the last p data values, along with their corresponding smoothed series values.
FORECAST p     Displays the predicted smoothed series values for the first p data points starting with N + 1.
SHOW p         Displays information (for point p only) on one line.
EXSMO          Allows the user to specify another variable for smoothing, or resmooth the present variable using different parameters.
HOLD           The data for all variables currently defined is written to a HOLD file.
KEEP           Exit from EXSMO, while retaining the data for all created variables in core.
QUIT           Exit from EXSMO, discarding all variables created in the session.

Figure 3-2. EXSMO Control Commands

Each command may be specified by name or with a unique truncation. The PRINT, LAST, and FORECAST commands each display three columns of output (the data point number, the input, and the smoothed data values). Data points beyond the extent of the input file are represented as blanks.

The EXSMO control commands support reiterative smoothing and display of variables. The smoothed variables created with EXSMO are saved and are available thereafter as normal variables for use in any subsequent ANALYSE statistical operation (e.g., CORRE, MULTR, etc.). Thus, you can perform multiple analyses using both the original and the smoothed variables.

PRINT
The PRINT command initiates two prompts for the time period to be printed. For example:

print
ENTER FIRST DATA POINT TO OUTPUT - 1
ENTER LAST DATA POINT TO OUTPUT - 3

Since typing ahead is supported, this is equivalent to entering:
print 1 3

LAST
The LAST command initiates the prompt:
ENTER NUMBER OF DATA POINTS -

When smoothing a large file, this is a useful technique for examining just the end of the file.

FORECAST
The FORECAST command initiates the following prompt:
ENTER NUMBER OF DATA POINTS -

This command is useful when only the forecasted data points are required, particularly when used in conjunction with the LAST command. For example, ‘‘LAST 4 FORECAST 5’’ produces a display of the last four actual time periods together with the first five forecasted time periods. NOTE: PRINT, LAST, and FORECAST may be entered together to produce an integrated display.

SHOW
The SHOW command initiates the following prompt:
ENTER POINT TO DISPLAY -

This command is used to examine a single time period (data point); the information for that point is displayed on one line.

HOLD
The HOLD command initiates the following prompts:
ENTER FIRST DATA POINT TO OUTPUT -
ENTER LAST DATA POINT TO OUTPUT -

HOLD creates a raw data file (and Master File Description) that can be used immediately in graphs, reports, etc. The file created, using the name specified by the STATSET HOLD flag, will contain all of the new variables created by the smoothing process in addition to the original fields from the target file. All pre-existing alpha and integer fields (for maintaining identity and date facilities) are preserved in their original formats. All other numeric fields are held in decimal (D) format with the number of decimal places that they originally held. All smoothed fields created by EXSMO are held in D15.3 format. Numeric variables not forecasted are assigned values as set in the STATSET MISSVAL flag (alpha fields are set to blank) for the forecasted time periods.

KEEP
The KEEP command causes an exit from EXSMO, while retaining all of the newly created variables in core. You can then perform other ANALYSE functions and later return to EXSMO for further smoothing.


QUIT
The QUIT command exits EXSMO and returns to ANALYSE, deleting all of the smoothed data created during the EXSMO session.

Sample EXSMO Terminal Session


3.4 Factor Analysis: FACTO
The FACTO option performs factor analysis. Enter FACTO in response to the ANALYSE prompt for an operation to initiate this analysis:
ENTER STATISTICAL OPERATION DESIRED - facto

The analysis performs a principal components solution and a varimax rotation of the factor matrix. The results of the principal component analysis determine the minimum number of dimensions needed to account for most of the original variable set variance. The user is prompted for a minimum eigenvalue to limit the number of factors retained. In the extraction, eigenvalues equal to or greater than the supplied eigenvalue are retained. This is done to minimize the number of factors. The varimax rotation is used to simplify columns (factors), rather than rows (variables), in the factor matrix.

FACTO produces:
• Eigenvalues.
• Cumulative percentage eigenvalues.
• Eigenvectors.
• Factor matrix.
• Variances for each iteration cycle.
• Rotated factor matrix.
• Communalities for initial extraction and final rotation.

Reference
For information about the principal component, varimax rotation method, see:
W.J. Dixon (1973). Biomedical Computer Programs manual. Los Angeles: University of California Press.

Sample FACTO Terminal Session

3.5 Discriminant Analysis: MDISC
To perform discriminant analysis, enter MDISC in response to the ANALYSE prompt for an operation:
ENTER STATISTICAL OPERATION DESIRED - mdisc

The required information is taken directly from the input file without further prompting. The input file created by the TABLE request must contain the number of groups, the number of observations in each group, and the name of each group, in addition to the variables. For example:

TABLE FILE SMSA                                     Initiates the request.
WRITE CNT.STATE                                     The number of groups.
WRITE CNT.SMSA BY STATE                             The number of observations (in each group).
PRINT POPULATION MILLIONARES REMARRIAGE BY STATE    Names three variables.
ON TABLE HOLD
END
ANALYSE FILE HOLD
MDISC

Figure 3-3. Creating an Extract File to Analyze with MDISC

NOTE: The counts must contain the number of unique items, not records. The calculation checks for and requires that the number of variables be equal to or greater than the number of groups.

MDISC produces:
• The mean of each variable by group.
• The pooled dispersion matrix.
• The common means for each variable.
• The generalized Mahalanobis D-square.
• Numbered discriminant functions.
  a. Constant.
  b. Coefficients.
• Evaluation of classification functions for each observation.
  a. Observation.
  b. The probability associated with largest discriminant function.
  c. The largest function number.

References
T.W. Anderson (2nd Edition - 1984). Introduction to Multivariate Statistical Analysis. New York: John Wiley and Sons. (Section 6.6 through 6.8).
W.J. Dixon (1973). Biomedical Computer Programs. Los Angeles: University of California Press.

Sample MDISC Terminal Session

3.6 Multiple Linear Regression: MULTR
The MULTR option performs a multiple linear regression for any choice of dependent and independent variables in a specified data file. Initiate MULTR as follows:
ENTER STATISTICAL OPERATION DESIRED - multr

MULTR supports processing of both equation files and records with missing data. A linear relationship of the following form is sought:

\[ Y = \sum_{i=1}^{N} C_i \cdot x_i + C_0 \]

where:
Y
Is the selected dependent variable.
x_i
Are the N selected independent variables.
C_i
Are the N regression coefficients relating the change in Y caused by a change in each x_i, and C_0 is the intercept.

Since MULTR searches for a relationship between variables, it automatically calls CORRE to produce the data from which it selects the Y and x_i variables for the regression (unless CORRE was called previously, in which case the data is already available). To simplify variable selection, especially when many variables are included (a maximum of 64 is permitted), variable names are also assigned position numbers which you can use when selecting them.

After examining the means, standard deviations, and correlation coefficients for all variables, the system prompts the user to determine:
• The dependent variable (name, truncation or number).
• Each independent variable (name, truncation or number).
• Whether a table of residuals is desired.
• Whether the regression equation is desired.

After the analysis is performed and the report appears, you may select as many alternate analyses (other variables) as desired, or perform other types of analysis and return to evaluate other models displayed by MULTR.

MULTR displays:
• The Variable.
  • Mean.
  • Standard deviation.
  • Correlation vs. the dependent variable.
  • Regression coefficient.
  • Standard error of regression coefficient.
  • Computed T value.
• The Intercept.
• The Multiple Correlation Coefficient.
• The Standard Error of Estimate.
• An Analysis of Variance for the Regression.
  • Source of variation.
  • Degrees of freedom.
  • Sum of squares.
  • Mean squares.
  • F values.
• The Regression Equation listed.
• The Table of Residuals and/or Durbin-Watson upon prompted request.
  • Observation.
  • Actual dependent variable value.
  • Estimated dependent variable value.
  • Residual (actual - estimate).
  • Durbin-Watson statistic.

If EQFILE processing is on (see Section 2.3 Preparing the Environment: STATSET on page 2-2), the regression equation is stored in an equation file FOCEXEC. This equation, along with the results of other regressions, can be used for graphics, reporting, analysis, etc. To create a regression equation for further processing, set the STATSET PRINT flag OFF and EQFILE processing ON; this suppresses printed output.

References
The Gauss-Jordan Method is used in the solution of the normal equation.
W.W. Cooley and P.R. Lohnes (1971). Multivariate Procedures for the Behavioral Sciences. New York: Kreiger. (Chapter 3).
Bernard Ostle and Rick Mensing (1975). Statistics in Research. Ames, Iowa: Iowa State College Press. (Chapter 8).
Norman Draper and Harry Smith (1981). Applied Regression Analysis. NY: Wiley.
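As an illustration of the linear form sought by MULTR, the following Python sketch fits the intercept and regression coefficients by building the normal equations and solving them with Gauss-Jordan elimination. It is only a sketch under stated assumptions: MULTR itself works from the CORRE output, whereas this version forms the normal equations directly from the raw data, and the function name `fit_linear` is hypothetical.

```python
def fit_linear(y, xs):
    """Fit Y = C0 + sum(Ci * xi) by least squares.

    y  -- list of dependent-variable values
    xs -- list of independent variables, each a list as long as y
    Returns (C0, [C1, ..., CN]).
    """
    n = len(y)
    # Design columns: a column of ones for the intercept, then each x_i.
    cols = [[1.0] * n] + [[float(v) for v in x] for x in xs]
    k = len(cols)

    # Augmented normal-equation matrix [X'X | X'y].
    m = []
    for r in range(k):
        row = [sum(p * q for p, q in zip(cols[r], cols[c])) for c in range(k)]
        row.append(sum(p * q for p, q in zip(cols[r], y)))
        m.append(row)

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(k):
            if r != col:
                f = m[r][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]

    coef = [row[k] for row in m]
    return coef[0], coef[1:]
```

For example, data generated from Y = 1 + 2·x1 + 3·x2 should return an intercept close to 1 and coefficients close to 2 and 3.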

Sample MULTR Terminal Session

3.7 Polynomial Regression: POLRG
The POLRG option performs a polynomial regression for any choice of dependent or independent variables in the specified data. POLRG supports processing of missing data and equation file processing. To initiate this analysis, answer the ANALYSE prompt for an operation as follows:
ENTER STATISTICAL OPERATION DESIRED - polrg

Subsequently, you will be prompted for the following:
• The dependent and independent variables (name, truncation or number).
• Highest degree polynomial to be used (1 - 10).
• Is a table of residuals desired?
• Is the regression equation desired?

The analysis calculates powers of the independent variable to calculate polynomials of increasing degree. The calculation proceeds by degree until there is no further reduction in the residual sum of squares or the maximum degree polynomial specified is reached.

POLRG produces the polynomial degree at each step:
• The intercept.
• The regression coefficients.
• The analysis of variance for the degree step.
  • The source of variation.
  • The degree of freedom.
  • The sum of squares.
  • The mean square.
  • The f-value.
  • The sum of squares improvement.
• The regression equation for the degree step.
• A table of residuals and/or Durbin-Watson upon request.

POLRG and MULTR share the same methodology. EQFILE and PRINT control are also supported.

Sample POLRG Terminal Session


3.8 Descriptive Statistics: STATS
The STATS analysis calculates and displays a selected set of statistics for all numeric variables in the analyzed file. STATS is useful for obtaining a significant quantity of descriptive statistics on up to 64 variables. This information provides both a valuable end-product report (e.g., medians, modes, deciles, quartiles, etc.), as well as a necessary step in the selection of subsequent analyses. Respond as follows to select this analysis:
ENTER STATISTICAL ANALYSIS DESIRED - stats

You will be prompted for all required input and may specify any or all statistics (optionally grouped by a sort field). Processing of ‘‘missing data’’ is automatically supported. You can hold the statistical output as a HOLD file for use by FOCUS (reporting, graphics, etc.).

Each statistical option has a name or number, as shown in Figure 3-4. When you enter STATS in response to the ANALYSE prompt, you are prompted for a statistical operation:
ENTER STATISTICAL OPERATION DESIRED - stats
ENTER OPTION NAME(S) OR NUMBER(S) DESIRED (E.G. MEAN,VAR,3,14) - all

Enter the name, number, or unique truncation of the options desired. The selected statistics are displayed for all the numerical variables. A response of ‘‘?’’ produces a help display similar to the following one.

Number  Name      Explanation
1       ALL       Options 2 - 14.
2       MEAN      The ‘‘average’’ value.
3       MEDIAN    The midpoint, or 50th percentile, derived after all values are in order (lowest to highest).
4       RANGE     The maximum minus the minimum value.
5       STDEV     The square root of the variance. A measure of dispersion.
6       MODE      The most frequent value of a variable.
7       MAXIMUM   The highest value.
8       STERROR   The standard error measure of sample mean stability, estimated by the standard deviation divided by the square root of the number of observations.
9       SKEWNESS  A measure of the symmetry of a distribution.
10      MINIMUM   The lowest value.
11      VARIANCE  A measure of data dispersion about its mean.
12      KURTOSIS  A measure of a distribution’s peaks or flatness.
13      DECILE    A table of deciles or variable values at each of 10 10% population points.
14      QUARTILE  A table of quartiles or variable values at each of 4 25% population points.
15      NUM_OBS   Number of observations present is placed in a HOLD file if the HOLD option is selected.
40      GROUPS    Data is grouped according to the first field.
41      HOLD      Statistics are written to a HOLD file specified by the STATSET command. Printing of statistics is suppressed.
42      PRINT     Overrides print suppression by HOLD.

Figure 3-4. Table of STATS Statistical Options

The following response to a STATS prompt for an option
2,3,4,DEC,Q
produces means, medians, ranges, the 10 deciles, and the four quartiles for each numeric variable in the subject file. Note that you can mix numbers and names in a response and that typing ahead is supported. The options entered above could have been input on the initial response to the ANALYSE prompt for an operation, as shown below:
ENTER STATISTICAL OPERATION DESIRED - stats 2,3,r,d,q

ALL Option
The ALL option produces all of the statistics (options 2 - 14). All of the statistical options selected are calculated for each numeric variable (except for the group field) in the subject file. If you choose to include missing data in the analysis (see Section 2.3 Preparing the Environment: STATSET on page 2-2), the number of observations present for each variable is displayed against the size of the sample.

MEAN Option
The MEAN option produces the mean, or average value, for the variable. It is a simple measure of the variable’s central tendency (the sum of all variable values divided by the number of values). The formula used is as follows:

\[ \bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} \]

where:
i = 1, 2, 3, ..., N
N
Is the number of observations.
X_i
Is the ith observation of variable X.

MEDIAN Option
The MEDIAN option calculates the middle case value for each numeric variable. The median value lies precisely on the 50th percentile. If the variable is ranked from its lowest to highest values, 50% of the cases lie above the median and 50% lie below it. If the number of cases (N) is odd, the median is the ((N + 1)/2)th value. If N is even, the median is linearly interpolated according to the following formula:

\[ \text{median} = X_{N/2} + 0.5 \times ( X_{N/2+1} - X_{N/2} ) \]

where:
X_{N/2}
Is the (N/2)th value in a ranked sequence of the variable X.

RANGE Option
The RANGE option calculates the difference between the maximum and minimum value for each numeric variable.

STDEV Option
The STDEV option calculates the standard deviation for each numeric variable. It is a measure of the spread of values and is equal to the square root of the variance (S^2) defined below. In normal distributions, approximately 68% of the cases lie within the mean ± 1 standard deviation.

\[ \text{STDEV} = \sqrt{\text{VARIANCE}} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}} \]

MODE Option
The MODE option calculates the value that occurs most frequently for each numeric variable. If more than one value occurs the same number of times, the lowest value is deemed the mode. If the mode cannot be calculated (e.g., all values occur only once), a message to that effect is printed and the mode is set to the missing value as set in STATSET.

MAXIMUM Option
The MAXIMUM option determines the highest value for each numeric variable.

STERROR Option
The STERROR option calculates the standard error for each numeric variable. Given a sample (a given group of cases), the true population mean can be estimated by examining the means for a large number of equal sized samples chosen from that population. The array of these sample means forms a normal distribution. This distribution has a standard deviation which is called the ‘‘standard error.’’ It is an estimate of the difference between a given sample mean and an estimated population mean. The standard error is determined by dividing the standard deviation by the square root of the number of observations:

\[ \text{STERROR} = \frac{S}{\sqrt{N}} \]

SKEWNESS Option
The SKEWNESS option calculates the skewness for each numeric variable. It is a measure of the deviation from symmetry for a distribution. Since skewness (or third moment) is an odd power of (X_i - \bar{X}), a value of zero indicates symmetry, and a positive or negative value indicates clustering above or below the mean (\bar{X}) respectively. The following equation defines skewness:

\[ \text{SKEWNESS} = \frac{\sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{S} \right)^3}{N} \]

For computational purposes, the following formula is used:

\[ \text{SKEWNESS} = \frac{\dfrac{\sum_{i=1}^{N} X_i^3 - 3\bar{X} \sum_{i=1}^{N} X_i^2 + 3\bar{X}^2 \sum_{i=1}^{N} X_i}{N} - \bar{X}^3}{\left( \dfrac{\sum_{i=1}^{N} X_i^2 - N\bar{X}^2}{N - 1} \right)^{3/2}} \]

where:
X_i
Is the ith observation, i = 1, 2, ..., N.
\bar{X}
Is the mean.
N
Is the number of observations.

The denominator is the calculation formula for S^3 (where S^2 is the variance). If skewness cannot be calculated, a message is printed and the held value is set to the missing value from STATSET.

MINIMUM Option
The MINIMUM option determines the lowest value for each numeric variable.

VARIANCE Option
The VARIANCE option calculates the variance (S^2) for each numeric variable. It is a measure of variation from the sample mean. As an even power (second moment), both positive and negative differences count equally with large variations counting more than small ones. A small variance occurs when there is little variation in the sample. The variance is literally the average squared deviation from the mean:

\[ S^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} \]

where N-1 is generally taken as the denominator (instead of N) assuming sample data rather than the entire population. The difference is negligible for large samples (large N). For computational purposes, the following formula is used:

\[ S^2 = \frac{\sum_{i=1}^{N} X_i^2 - N\bar{X}^2}{N - 1} \]

KURTOSIS Option
The KURTOSIS option calculates the kurtosis for each numeric variable. As a fourth power of the difference from the mean (the fourth moment), it is a measure of the flatness or sharp definition of a sample distribution. The kurtosis for a normal distribution is zero. A positive or negative value indicates, respectively, a distribution narrower or flatter than a normal one. It is defined by:

\[ \text{KURTOSIS} = \frac{\sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{S} \right)^4}{N} - 3 \]

For computational purposes, the following formula is used:

\[ \text{KURTOSIS} = \frac{\dfrac{\sum_{i=1}^{N} X_i^4 - 4\bar{X} \sum_{i=1}^{N} X_i^3 + 6\bar{X}^2 \sum_{i=1}^{N} X_i^2 - 4\bar{X}^3 \sum_{i=1}^{N} X_i}{N} + \bar{X}^4}{\left( \dfrac{\sum_{i=1}^{N} X_i^2 - N\bar{X}^2}{N - 1} \right)^{2}} - 3 \]

Note the denominator is the square of the variance (S^4) and the minimum value of the KURTOSIS is -3. If the kurtosis is not available, a message is printed and the value, if held, is set to the missing value set by STATSET.
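The computational formulas for the variance, standard error, skewness, and kurtosis, together with the median and mode rules described earlier, can be checked with a short Python sketch. This is an illustration only: the function names are hypothetical, `None` stands in for the STATSET missing value, and the kurtosis subtracts 3 so that it agrees with the defining equation.

```python
import math
from collections import Counter


def moments(values):
    """Variance, STDEV, STERROR, skewness and kurtosis from raw power sums."""
    n = len(values)
    sx = sum(values)
    sx2 = sum(v ** 2 for v in values)
    sx3 = sum(v ** 3 for v in values)
    sx4 = sum(v ** 4 for v in values)
    mean = sx / n
    variance = (sx2 - n * mean ** 2) / (n - 1)
    stdev = math.sqrt(variance)
    result = {"mean": mean, "variance": variance, "stdev": stdev,
              "sterror": stdev / math.sqrt(n),
              "skewness": None, "kurtosis": None}
    if variance > 0:
        # Third and fourth central moments, expanded into power sums.
        m3 = (sx3 - 3 * mean * sx2 + 3 * mean ** 2 * sx) / n - mean ** 3
        m4 = (sx4 - 4 * mean * sx3 + 6 * mean ** 2 * sx2
              - 4 * mean ** 3 * sx) / n + mean ** 4
        result["skewness"] = m3 / variance ** 1.5
        result["kurtosis"] = m4 / variance ** 2 - 3
    return result


def median(values):
    """Middle value; for even N, halfway between the two middle values."""
    ranked = sorted(values)
    n = len(ranked)
    if n % 2 == 1:
        return ranked[(n + 1) // 2 - 1]          # the ((N+1)/2)th value
    lower = ranked[n // 2 - 1]                   # X_{N/2}
    return lower + 0.5 * (ranked[n // 2] - lower)


def mode(values):
    """Most frequent value; lowest value wins ties; None if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return None
    return min(v for v, c in counts.items() if c == top)
```

Expanding the central moments into power sums, as the manual's computational formulas do, lets all statistics be accumulated in a single pass over the data, at the cost of some numerical precision when the mean is large relative to the spread.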

DECILES Option
The DECILES option calculates the 10 deciles for each numeric variable. If a variable is sorted from low to high, then the 10 deciles are the exact values for which 10%, 20%, 30%, ..., 90%, 100% of the values lie below. The kth decile is calculated as follows:

\[ \text{DECILE}_k = X_i + ( X_{i+1} - X_i ) \cdot \left( \frac{N \cdot k}{10} - \text{INT}\!\left( \frac{N \cdot k}{10} \right) \right) \]

where:
N
Is the number of observations.
k
Is the decile.
i
Is the integer portion of (N*k)/10.
X_i
Is the lower bound of the decile.

QUARTILES Option
The STATS option QUARTILES calculates the four quartiles for each numeric variable. They are calculated similarly to deciles and represent the 25%, 50%, 75%, 100% population points. The formula for the kth quartile is:

\[ \text{QUARTILE}_k = X_i + ( X_{i+1} - X_i ) \cdot \left( \frac{N \cdot k}{4} - \text{INT}\!\left( \frac{N \cdot k}{4} \right) \right) \]

where:

\[ i = \text{INT}\!\left( \frac{N \times k}{4} \right) \]

NUM_OBS Option
The NUM_OBS option determines the number of values present for each numeric variable. It is always displayed.

3.8.1 Control Options
A series of special control options provides run-time information and allows you to adjust the processing and disposition of your STATS output.

GROUPS Option
The GROUPS option causes all selected options to be calculated for each numeric variable, one for each value of the first analyzed field in the file. The group field is assumed to be sorted in ascending order and may be numeric or alphanumeric. No statistics are calculated for the first field (or variable) if the GROUPS option is specified. The group field is then the first field in the original HOLD file and keeps its original field name.

HOLD Option
The HOLD option causes the output of STATS to be held as a HOLD file with the filename set by STATSET. Printing is suppressed unless PRINT is also specified. The HOLD option generates a HOLD file containing all of the selected STATS options. A record is created for each variable, and for each group field value (if GROUPS is specified). If STATS ALL, HOLD, and GROUPS are specified, and the STATSET HOLD flag is left as the default value, a Master File Description is created for STATHOLD. The HOLD file can then be used for reports, graphs, and/or relational matches with other data files. The latter can be used to create standardized data (data with zero mean and unit deviation).

PRINT Option
The STATS option PRINT sends output to a print queue, and is used to override the normal print suppression of the HOLD option.
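Returning to the DECILES and QUARTILES formulas given earlier in this section, the interpolation between ranked values can be sketched in Python as below. The function name `population_point` is hypothetical, the ranked values are treated as 1-based (X_1 is the lowest value), and the handling of a position below the first ranked value (returning the lowest value) is an assumption, since the manual does not state what happens when INT(N·k/parts) is zero.

```python
def population_point(values, k, parts):
    """kth population point when the data is split into `parts` equal parts.

    parts=10 gives deciles, parts=4 gives quartiles.
    """
    ranked = sorted(values)
    n = len(ranked)
    pos = n * k / parts
    i = int(pos)                 # integer portion of (N*k)/parts
    frac = pos - i               # fractional part used for interpolation
    if i < 1:
        return ranked[0]         # assumption: clamp to the lowest value
    lower = ranked[i - 1]        # X_i, using 1-based ranking
    if i >= n or frac == 0:
        return lower
    return lower + (ranked[i] - lower) * frac


def deciles(values):
    return [population_point(values, k, 10) for k in range(1, 11)]


def quartiles(values):
    return [population_point(values, k, 4) for k in range(1, 5)]
```

With ten observations 1 through 10, the deciles are the observations themselves and the first quartile falls halfway between the second and third ranked values.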

Sample STATS Terminal Session

3.9 Stepwise Multiple Regression: STEPR
This ANALYSE option performs a stepwise multiple regression for any choice of dependent and independent variables in the subject file. Up to 64 variables may be processed. Missing data and equation file processing are supported. Each step of the analysis looks at the reduction of the sum of squares for each variable. Each step adds the next independent variable that shares the highest partial correlation with the dependent variable. ‘‘Forced’’ or ‘‘deleted’’ designations always take precedence.

The procedure prompts for:
• A constant value or proportion of the sum of the squares to limit the variables entering in the regression.
• Is a table of residuals desired?
• Is the regression equation desired?
• The dependent variable (name, truncation, or number).
• The nature of each of the remaining variables (available, forced, or deleted).

Instructions are provided for the last prompt (nature of variables) the first time STEPR is executed during an ANALYSE session, but not thereafter.

STEPR produces:
• The Dependent Variable.
• The number of forced variables.
• The number of deleted variables.
• For each step in the regression:
  • The Step Number.
  • The name, number and (forced or available) designation of the variable entered.
  • The sum of squares reduced.
  • The cumulative proportion reduced.
  • The multiple correlation coefficient.
  • The multiple correlation coefficient adjusted for degrees of freedom.
  • The f-value for analysis of variance.
  • The standard error of estimate.
  • For each variable name and number:
    - The mean.
    - The standard deviation.
    - The regression coefficient.
    - The standard error of the regression coefficient.
    - The T value.
    - The Beta weight.
• The regression equation at that step.
• A table of residuals and/or Durbin-Watson upon prompted request.

The analysis continues step by step and is terminated if the proportion reduced is less than the limiting constant specified by the user (0 is acceptable) or upon completion of the link. As with MULTR, EQFILE and PRINT control is supported.

The analysis uses the Abbreviated Doolittle Method to enter variables in the regression and compute their regression coefficients.

Reference
Carl A. Bennett and others. Statistical Analysis in Chemistry and the Chemistry Industry. (Appendix 6A). Ann Arbor, Michigan: Books on Demand (313-761-4700).

Sample STEPR Terminal Session


3.10 Time-Series Analysis: TIMESER
The TIMESER command implements a variety of operations on time-series data, including forecasting, leading and lagging, curve fitting, and smoothing. There are facilities for saving newly-created variables in core for use with other ANALYSE facilities, for writing data to HOLD files, and for creating FOCEXECs containing equations.

TIMESER is initiated by entering TIMESER in response to the ANALYSE prompt for a statistical operation:
ENTER STATISTICAL OPERATION DESIRED - timeser

TIMESER forecasting functions create additional records containing calculated variables along with existing actual variables. The actual variables are marked as missing for these new time series records by the ANALYSE missing value indicator. If MISSING is not set ON, TIMESER assumes a missing value with the default message:
MISSING DATA VALUE UNDEFINED: ASSUMED TO BE: -999.00

TIMESER then prompts for the field to be used as the time series variable and the value of 1 interval. For example:
ENTER NAME OF TIME VARIABLE - QUARTER
ENTER TIME FOR 1 INTERVAL (e.g. "1", "2 DAYS", "1 MONTH") - .25 MONTH

TIMESER then prompts for a command. For example:
ENTER COMMAND (OR ? FOR HELP) - FIT

The time-variable, time intervals, and TIMESER commands are described in the following sections.

3.10.1 The Time Variable
The user must provide a variable that contains the time values; this field is called the "time variable." The other fields contain the data which are treated as functions of the time-variable, and are called "data-variables." In response to the TIMESER prompt
ENTER NAME OF TIME VARIABLE -
you may enter the fieldname, alias, a position number, or unique truncation of the variable that specifies position of "time" in the series.

TIMESER considers a time variable date-oriented if it has a FOCUS date format in the file being analyzed (e.g., I4YM). The only restrictions on the time-variable are that the values must be in ascending order and an integral multiple of the TIME interval.

TIMESER then prompts for the increment between successive time periods. For example:
ENTER TIME FOR 1 INTERVAL (e.g. "1", "2 DAYS", "1 MONTH") -

Respond with just a number for a series that is not date-oriented (e.g., 1, .5) or a number and a unit for date-oriented time series (e.g., "1 DAY" for a YMD or MDY formatted time-variable, "1 MONTH" for a YM or MY formatted time-variable, etc.).

TIMESER supports not-present values or gaps in the series being analyzed. Based on the interval specified, the full series is formed by assigning the missing value indicator to records not originally provided. For example, consider a YM (year/month) formatted time variable. Quarterly data (every third month provided) is formed into a monthly series by specifying a 1 MONTH interval. Similarly, monthly data is formed into "weekly" data by specifying a .25 MONTH interval. The series points for missing values may then be valued by either linear interpolation (LINTERP) or exponential interpolation (EINTERP) as described below.

3.10.2 Commands that Create New Variables
The TIMESER commands described in the following sections create new data variables that may be manipulated in exactly the same manner as other data variables. When new variables are created, the names are constructed by appending the first nine characters of the name of the referenced data variable to the default prefix of the command that created it. For example, if we compute the moving average of a variable named SALES, the variable MA.SALES is created. Alternate variable names may be specified using AS phrases.

TIMESER commands may be specified in full or by a unique truncation of the name (e.g., "MTOTAL SALES 4" and "MT SA 4" are equivalent). Fields may be referenced by full fieldname, alias or unique truncation. (The default prefixes are shown in parentheses following the commands. When using them, separate prefixes from fieldnames with periods.)

LEAD (LE) Command

A variable is created which is the value of the specified variable shifted N time periods into the future. The command

LEAD Q N

produces

LE.Q_i = Q_{i+N}

where:

Q_i   Is the ith value of the variable Q.

Note that this example illustrates typing ahead. If you enter only LEAD, FOCUS will prompt you for a variable.

LAG (LG) Command

A variable is created which is the value shifted N time periods into the past. The command

LAG Q N

produces:

LG.Q_i = Q_{i-N}

MAVG (MA) Command

A variable is created which is the N period moving average of the specified variable. The command

MAVG Q N

produces the TIMESER prompt:

HOW MANY PERIODS DO YOU WISH TO EXTRAPOLATE --

If "0" is entered, no extrapolation is performed. If you enter N (>0), TIMESER prompts

ENTER STARTING PERIOD FOR EXTRAPOLATION OR "TOP" --

which results in a moving average extrapolation that produces values based on maintaining the last average calculated.
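The shift and moving-average definitions above can be illustrated with a small Python sketch. It is not TIMESER code: the function names are hypothetical, positions with no source value are assumed to receive the missing value -999, and the moving average is assumed to use a trailing window of the N most recent periods.

```python
MISSING = -999.0  # assumed marker for positions that cannot be computed


def lead(q, n):
    """LE.Q_i = Q_{i+N}: the value N periods into the future."""
    return [q[i + n] if i + n < len(q) else MISSING for i in range(len(q))]


def lag(q, n):
    """LG.Q_i = Q_{i-N}: the value N periods into the past."""
    return [q[i - n] if i - n >= 0 else MISSING for i in range(len(q))]


def mavg(q, n):
    """N-period moving average over the window i-N+1 .. i (assumed window)."""
    out = []
    for i in range(len(q)):
        if i - n + 1 < 0:
            out.append(MISSING)
        else:
            out.append(sum(q[i - n + 1 : i + 1]) / n)
    return out


sales = [10.0, 12.0, 14.0, 16.0]
lead_1 = lead(sales, 1)
lag_1 = lag(sales, 1)
ma_2 = mavg(sales, 2)
```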

MTOTAL (MT) Command

A variable is created that represents the N period moving total. The command

MTOTAL Q N

results in:

MT.Q_i = \sum_{j=i-N+1}^{i} Q_j

CMAVG (CA) Command

A variable is created which is a centered moving average for N periods of the variable specified. The command

CMAVG Q N

results in

CA.Q_i = \frac{1}{N} \sum_{j=i-(N-1)/2}^{i+(N-1)/2} Q_j

when N is odd. If N is even, the interval extends 1 period further forward than backward.

CMTOTAL (CT) Command

A variable is created for the centered moving total for N periods. The command

CMTOTAL Q N

produces the following result

CT.Q_i = \sum_{j=i-(N-1)/2}^{i+(N-1)/2} Q_j

when N is odd. If N is even the product is similar to CMAVG.

LINTERP Command

Substitute values for missing values in the series are determined and supplied using linear interpolation. There is no prefix. "LINTERP Q" results in a linear fit (using Q = a + bT) between the nearest two present values to assign values for missing values.

EINTERP Command

Same as LINTERP but an exponential interpolation is performed by entering

EINTERP Q

where:
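A minimal Python sketch of the moving total, the centered moving average, and linear interpolation described above follows. The function names are hypothetical and the edge handling (returning -999 where a full window is not available) is an assumption, as is the requirement that windows contain no missing values.

```python
MISSING = -999.0  # assumed marker for missing or uncomputable points


def mtotal(q, n):
    """MT.Q_i: sum of Q_j for j = i-N+1 .. i."""
    return [sum(q[i - n + 1 : i + 1]) if i - n + 1 >= 0 else MISSING
            for i in range(len(q))]


def cmavg(q, n):
    """CA.Q_i: centered N-period mean; for even N the window reaches
    one period further forward than backward."""
    back = (n - 1) // 2
    fwd = n - 1 - back
    out = []
    for i in range(len(q)):
        lo, hi = i - back, i + fwd
        if lo >= 0 and hi < len(q):
            out.append(sum(q[lo : hi + 1]) / n)
        else:
            out.append(MISSING)
    return out


def linterp(q):
    """Fill MISSING points on the straight line between the nearest present values."""
    out = list(q)
    present = [i for i, v in enumerate(q) if v != MISSING]
    for left, right in zip(present, present[1:]):
        slope = (q[right] - q[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = q[left] + slope * (i - left)
    return out


filled = linterp([10.0, MISSING, MISSING, 16.0])
```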

Q = a e^{bt}

Is used to find an exponential fit between the two adjacent values to assign values for variables with null values.

LDDIFF (DF) Command

The variable created is the difference between the N period lead and the current value of the specified variable. The command

LDDIFF Q N

produces:

DF.Q_i = Q_{i+N} - Q_i

LDPCT (DP) Command

The LDPCT command is the same as LDDIFF but a percentage difference is computed. The command

LDPCT Q N

produces:

DP.Q_i = 100 \cdot \frac{Q_{i+N} - Q_i}{Q_i}

LGDIFF (GF) Command

The variable created is the difference between the current value of the specified variable and the N period lag. The command

LGDIFF Q N

produces:

GF.Q_i = Q_i - Q_{i-N}

LGPCT (GP) Command

The LGPCT command is the same as LGDIFF but a percentage difference is computed. The command

LGPCT Q N

produces:

GP.Q_i = 100 \cdot \frac{Q_i - Q_{i-N}}{Q_{i-N}}
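The four difference commands above reduce to simple index arithmetic. The Python sketch below is illustrative only: function names are hypothetical, positions without a lead or lag value are assumed to be marked -999, and division by a zero base value is not handled.

```python
MISSING = -999.0  # assumed marker where no lead or lag value exists


def lddiff(q, n):
    """Q_{i+N} - Q_i."""
    return [q[i + n] - q[i] if i + n < len(q) else MISSING for i in range(len(q))]


def ldpct(q, n):
    """100 * (Q_{i+N} - Q_i) / Q_i."""
    return [100 * (q[i + n] - q[i]) / q[i] if i + n < len(q) else MISSING
            for i in range(len(q))]


def lgdiff(q, n):
    """Q_i - Q_{i-N}."""
    return [q[i] - q[i - n] if i - n >= 0 else MISSING for i in range(len(q))]


def lgpct(q, n):
    """100 * (Q_i - Q_{i-N}) / Q_{i-N}."""
    return [100 * (q[i] - q[i - n]) / q[i - n] if i - n >= 0 else MISSING
            for i in range(len(q))]


q = [100.0, 110.0, 121.0]
lead_diff = lddiff(q, 1)
lag_pct = lgpct(q, 1)
```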

LEXTRAP (LE) Command

The variable created equals a specified variable up to and including a specified time period, after which values are computed by linear extrapolation. The command

LEXTRAP Q N

results in the following prompts:

TYPE "AHEAD OR BACK" FOR EXTRAPOLATION DIRECTION
ENTER STARTING PERIOD FOR EXTRAPOLATION OR "TOP"
ENTER LINEAR GROWTH FACTOR OR "LAST" TO USE LAST DIFFERENCE -

New points are calculated by adding either the user-provided growth factor (∆user) or the difference between the last two points to the last point.

Q_i = Q_{i-1} + ∆user   or   Q_i = Q_{i-1} + [Q_{i-1} - Q_{i-2}]

CEXTRAP (CE) Command

The CEXTRAP command is the same as LEXTRAP except a compound growth extrapolation is performed by entering

CEXTRAP Q N

resulting in the following prompts:

TYPE "AHEAD" OR "BACK" FOR EXTRAPOLATION DIRECTION
ENTER STARTING PERIOD FOR EXTRAPOLATION OR "TOP"
ENTER PERCENTAGE GROWTH FACTOR OR "LAST" TO USE LAST PERCENT DIFFERENCE -

New points are calculated by increasing the last point by either the user-provided percentage growth factor (∆user) or percentage difference between the last two points.

Q_i = Q_{i-1} ∆user   or   Q_i = Q_{i-1} [Q_{i-1} - Q_{i-2}]

EXSMO (ES) Command

See Section 3.3 Exponential Smoothing: EXSMO on page 3-4 for a discussion of the EXSMO command.

FIT Command

The FIT command performs regressions to fit a number of equations to the specified variable and optionally forecasts the fitted equation forward. The command

FIT Q

produces the following prompts:
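The two extrapolation rules can be sketched as follows. This is a hedged illustration rather than the TIMESER algorithm: it only extrapolates "AHEAD" from the end of the series, the function names and keyword arguments are hypothetical, and the compound rule is assumed to multiply the last point by (1 + percentage/100), which is one reading of "increasing the last point by" a percentage.

```python
def lextrap(q, periods, growth=None):
    """Linear extrapolation; growth=None stands in for answering "LAST"."""
    out = list(q)
    step = out[-1] - out[-2] if growth is None else growth
    for _ in range(periods):
        out.append(out[-1] + step)
    return out


def cextrap(q, periods, pct=None):
    """Compound extrapolation; pct=None uses the last percent difference."""
    out = list(q)
    if pct is None:
        pct = 100 * (out[-1] - out[-2]) / out[-2]
    for _ in range(periods):
        out.append(out[-1] * (1 + pct / 100))
    return out


linear = lextrap([1.0, 3.0], 2)
compound = cextrap([100.0, 110.0], 2)
```

With "LAST", the linear step stays constant because each new difference equals the previous one, so a fixed step gives the same series as the recursive form shown in the text.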

DO YOU WISH TO KEEP PREDICTED VALUES (TYPE "KEEP" OR "NOKEEP")
HOW MANY PERIODS DO YOU WISH TO EXTRAPOLATE
DO YOU WISH TO KEEP RESIDUALS (TYPE "RESID" OR "NORESID")
ENTER THE TYPE(S) OF EQUATION YOU WISH TO FIT -

If you do not name a variable, FOCUS prompts you for one. The equations that may be fitted are listed below. "Y" represents the variable to be fitted and "T" the time variable. The default prefixes for the predicted-value variables appear in parentheses in the table.

Name/(Default Prefix)   Type of Function      Equation
LINEAR (LF.)            Linear                Y = A + B*T
EXP (EF.)               Exponential           Y = A * EXP(B * T)
POWER (PF.)             Power                 Y = A*T**B
HYP1 (H1.)              Simple Hyperbolic     Y = A + B/T
HYP2 (H2.)              Hyperbolic (Type 2)   Y = 1/(A + B*T)
HYP3 (H3.)              Hyperbolic (Type 3)   Y = T/(A + B*T)
HYP4 (H4.)              Hyperbolic (Type 4)   Y = A/(1 + N*B*T)**(1/N)
ALL                     All functions

Figure 3-5. TIMESER Commands that Create New Variables

If the time variable has a date format, FIT will treat the data variable as a function not of the time variable, but of a pseudo variable named T.INDEX, which is defined as the integer K for the Kth data point. T.INDEX is not saved in core, but it is included in the HOLD file if one is requested.

If the STATSET PRINT option is ON, the equation, the Durbin-Watson and the regression statistics are output for each type of equation selected. The Durbin-Watson is a measure of serial correlation of adjacent residuals in a regression; a value close to 2 indicates a reasonable fit (a value under 2 indicates positive autocorrelation, a value over 2 indicates negative autocorrelation).

If predicted values are kept ("KEEP" entered) and residuals are kept, they then become new variables available for holding (see Section 3.10.3 Other TIMESER Commands on page 3-42) and analysis by any ANALYSE function. The names of the residual variables are constructed by further prefixing the predicted-value variable name with an "R."

All of the resulting FIT equations may be saved for further FOCUS use in TABLE and GRAPH (this is discussed as a function of STATSET, and in Section 3.10.4, "Saving Forecast Equations").

NOTE: In each case if only the TIMESER command is entered (e.g., "LEAD"), you are prompted for the specified variable and a number of periods (if required).
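As an illustration of the LINEAR equation and the Durbin-Watson statistic mentioned above, here is a small Python sketch. It assumes ordinary least squares on T = 1..K (the T.INDEX convention) and the textbook Durbin-Watson definition (sum of squared successive residual differences divided by the sum of squared residuals); the manual does not state the exact computation FIT performs, and the function names are hypothetical.

```python
def fit_linear(y):
    """Least-squares estimates (A, B) for Y = A + B*T with T = 1..K."""
    n = len(y)
    t = list(range(1, n + 1))
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    b = sxy / sxx
    a = y_mean - b * t_mean
    return a, b


def durbin_watson(residuals):
    """Textbook Durbin-Watson statistic; values near 2 suggest little
    serial correlation between adjacent residuals."""
    num = sum((e1 - e0) ** 2 for e0, e1 in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den


y = [2.0, 4.1, 5.9, 8.2, 9.8]
a, b = fit_linear(y)
residuals = [yi - (a + b * ti) for ti, yi in enumerate(y, start=1)]
dw = durbin_watson(residuals)
```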

3.10.3 Other TIMESER Commands

Command   Description

DISPLAY   Displays values for the time variable (T.INDEX) and up to 4 specified variables for the user-specified portion of the time series. T.INDEX is a sequential index generated by TIMESER. DISPLAY sends the following prompts:

          ENTER UP TO 4 VARIABLES TO DISPLAY --
          ENTER FIRST TIME PERIOD TO DISPLAY OR "BOT"->
          ENTER LAST TIME PERIOD TO DISPLAY OR "TOP" ->

          Enter values for the time variable to specify the first and last periods to be displayed.

HOLD      Writes a hold file (default name STATHOLD; see STATSET) containing all variables, original or created.

KEEP      Saves all variables (original and created) for use by other ANALYSE operations and returns you to the ANALYSE prompt.

QUIT      Deletes new variables and returns you to the ANALYSE prompt.

REPLACE   Replaces a specified variable. REPLACE prompts:

          ENTER NAME OF VARIABLE --
          ENTER PERIOD TO REPLACE --
          ENTER NEW VALUE --

          If you type ahead "REPLACE Q N" only the prompt for new value appears.

SUBSET    Restricts all TIMESER operations to a range of data you specify with values entered as the first and last values for the time variable. "SUBSET ?" produces the following prompts:

          ENTER SUBSET STATUS: "ON OR OFF" --
          ENTER FIRST PERIOD IN SUBSET --
          ENTER LAST PERIOD IN SUBSET --

          Alternatively, you may specify:

          SUBSET ON N1 N2

          where N1 and N2 define the beginning and end of your subset.

Figure 3-6. TIMESER Control Commands

3.10.4 Saving Forecast Equations

FIT can create equation FOCEXECs similar to those created by other ANALYSE commands. FOCEXECs are created if the ANALYSE STATSET parameters EQFILE and EQVAR are enabled (see STATSET). All equation types selected for FIT operations appear in the FOCEXECs, with the variable names constructed by appending the default predicted-variable prefixes to the variable names specified in the EQVAR parameters. All printed output from FIT commands can be suppressed by setting the STATSET PRINT parameter OFF (in which case only the FOCEXECs are created). For example, the following FOCUS procedure produces a FOCEXEC file named FITDEFS:

ANALYSE FILE HOLD
STATSET EQFILE=FITDEFS EQVAR=Y PRINT=OFF MISSING=ON
TIMESER TIMEVAR 1
FIT DATAVAR LINEAR POWER

The resulting FOCEXEC (FITDEFS) will contain DEFINE commands for the fields named LF.Y and PF.Y.

3.10.5 Ease-of-Use Features

To simplify your statistical analysis sessions, there are several shortcuts you can use when you feel comfortable with the basic TIMESER functions.

Typing Ahead

Responses to TIMESER prompts may be stacked (i.e., the specifications for an entire command may be typed on one line). For example, the following series of prompts

ENTER COMMAND (OR ? FOR HELP) -- display
ENTER UP TO 4 VARIABLES TO DISPLAY -- sales lg.sales
ENTER FIRST TIME PERIOD TO DISPLAY OR "BOT" -- 8004
ENTER LAST TIME PERIOD TO DISPLAY OR "TOP" -- 8112

could be avoided by responding as follows:

ENTER COMMAND (OR ? FOR HELP) - display sales lg.sales 8004 8112

Abbreviations

TIMESER accepts the shortest unique truncations of command names in place of the full names. Variable names (fieldnames) and their aliases can also be referenced by unique truncations. Equation names in FIT can be abbreviated (e.g., "LIN" is a suitable replacement for "LINEAR"). All other words must be typed in full.
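The idea of a "unique truncation" can be made concrete with a short Python sketch. The resolution rule below (an exact match wins, otherwise exactly one name must start with the typed text) is an assumption about how such matching typically works, not a description of TIMESER internals; the function name `resolve` and the command list are illustrative only.

```python
COMMANDS = [
    "LEAD", "LAG", "MAVG", "MTOTAL", "CMAVG", "CMTOTAL", "LINTERP", "EINTERP",
    "LDDIFF", "LDPCT", "LGDIFF", "LGPCT", "LEXTRAP", "CEXTRAP", "EXSMO", "FIT",
    "DISPLAY", "HOLD", "KEEP", "QUIT", "REPLACE", "SUBSET",
]


def resolve(token, names=COMMANDS):
    """Return the single name that `token` abbreviates, or raise ValueError."""
    token = token.upper()
    if token in names:
        return token
    matches = [name for name in names if name.startswith(token)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"{token!r} matches no name")
    raise ValueError(f"{token!r} is ambiguous: {', '.join(matches)}")


command = resolve("mt")                                # "MT SA 4" style input
equation = resolve("LIN", ["LINEAR", "EXP", "POWER"])  # FIT equation names
```

Under this rule "LE" would be rejected as ambiguous because both LEAD and LEXTRAP begin with it.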

Alternate Variable Names

It is possible to place the data from TIMESER operations into variables other than those specified in the operation's default prefix. In all data-creation commands except FIT, the name of the variable can be replaced with an AS phrase. Thus LAG SALES 3 places data in the default variable LG.SALES, whereas LAG SALES AS LAGGEDSALES places data in a variable named LAGGEDSALES. These can be existing variables or new variables. Except in the case of interpolation, the variable created cannot be the same as the original variable.

Graphing TIMESER Results

When creating TIMESER HOLD files, missing variable values (such as actual variable values in forecasted periods) take on the missing data value (default -999). The corresponding GRAPH missing value support, using the GRAPH SET parameters GMISSING=ON (or OFF) and GMISSVAL=-999, can be used to graph actual, forecasted, and fitted data (see the FOCUS Users Manual).

TIMESER Sample Terminal Session

3.11 Crosstabulations: XTABS

The XTABS analysis generates contingency tables in addition to user-selected statistics. Such tables, or crosstabulations, are simply counts of observations classified by two or more discrete variables. These tables, commonly called "crosstabs," are joint frequency distributions of observations (or cases) classified by two variables. Each classification variable may be either alphanumeric or numeric.

Crosstabs represent one type of analysis of the relationship between two or more variables (other types include correlations, regressions, and analyses of variance). Typical uses of crosstabs are for surveys, and employee and census studies.

To select XTABS, enter

XTABS SALARY BY DEPARTMENT

in response to the ANALYSE prompt for a statistical operation.

A simple crosstab, showing the counts for salaries by department (into "Low Paid" vs. "Better Paid"), follows. Each count is a cell, identified by its column and row. In order to clarify such joint distributions, other cell statistics (such as percentages of rows, columns and totals) are offered as options, along with various overall statistics (such as chi-square, Cramer's V, and contingency coefficients). (A complete example appears in the "Sample XTABS Terminal Session" at the end of this section.)

(You should be careful not to use XTABS with continuous variables or those with many values, because tables with empty cells are difficult to interpret and may violate many of the assumptions applied in developing the associated statistics.)

The widths of these tables (panel sizes) are taken from the STATSET flag WIDTH. The number of rows and columns (and therefore, the number of panels) is limited only by available computer memory.

3.11.1 Specifying Variables

When invoked, XTABS initially responds with a prompt for the crosstab or classifying variable:

PLEASE ENTER THE NAMES OF THE CROSSTABULATION VARIABLES (E.G., "SEATS AND COUNTRY BY FUEL")

The specification of variables consists of a series of variable names or variable lists, each separated by the word BY:

(variable name or list) BY (variable name or list) BY...

A variable list is a series of variable names separated by the word AND. All words and names must be separated by either blanks or commas. Variable names may be fieldnames, aliases or unique truncations of any fields in the file being analyzed. The first variable(s) specified becomes the horizontal classifier(s) and those following the BY are the vertical classifiers, running down the page. For example

SALARY BY DEPARTMENT

produces a single crosstab of cases displayed across SALARY by DEPARTMENT (down the page). For example

UNIT_SOLD AND RETURNS BY PROD_CODE AND DATE

produces four crosstabs: UNIT_SOLD BY PROD_CODE, UNIT_SOLD BY DATE, RETURNS BY PROD_CODE, and RETURNS BY DATE.

Multiple BY phrases may also be used. For example

TEMP BY DIST BY BACTERIA

produces three crosstabs: TEMP BY DIST, TEMP BY BACTERIA, and DIST BY BACTERIA, using the word BY to separate the variable names or variable lists. Since typing ahead is supported, a convenient way to generate the above request is as follows:

ENTER STATISTICAL OPERATION DESIRED - xtabs temp by dist by bacteria

3.11.2 Specifying Options

After the classification variables are specified, XTABS prompts for desired options:

PLEASE ENTER OPTION NAME(S) OR NUMBER(S) DESIRED (E.G., "ALL" OR "ROWPCT COLPCT CHISQ")

You can then enter option numbers, names, or unique truncations.
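The way a variable specification with AND lists and multiple BY phrases expands into individual crosstabs (Section 3.11.1) can be sketched in Python. The function name `expand_request` is hypothetical and the pairing rule (every variable in an earlier list is crossed with every variable in each later list) is inferred from the two examples given in the text, not taken from XTABS itself.

```python
from itertools import combinations, product


def expand_request(request):
    """Expand e.g. 'A AND B BY C AND D' into the list of two-way crosstabs."""
    tokens = request.replace(",", " ").upper().split()
    groups, current = [], []
    for tok in tokens:
        if tok == "BY":
            groups.append(current)   # a BY closes the current variable list
            current = []
        elif tok != "AND":
            current.append(tok)      # AND only separates names within a list
    groups.append(current)
    tables = []
    for across, down in combinations(groups, 2):
        tables.extend(f"{a} BY {b}" for a, b in product(across, down))
    return tables


four = expand_request("UNIT_SOLD AND RETURNS BY PROD_CODE AND DATE")
three = expand_request("temp by dist by bacteria")
```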

All of the XTABS options are shown in the following table in Figure 3-7.

Number  Name       Description

Summary Statistics
1       CHISQ      Chi-square.
2       CRAMV      Cramer's V, phi for 2 by 2 tables.
3       CONT       Contingency coefficient.
4       LAMBDA     Asymmetric lambdas with each variable taken as dependent, and symmetric lambda.
6       TAUB       Kendall's tau b.
7       TAUC       Kendall's tau c.
8       GAMMA      Gamma.
9       SOMERSD    Asymmetric Somers' D with each variable dependent plus symmetric Somers' D.
20      ALL        All summary statistical options.

Cell Statistics
27      EXPECTED   The frequency expected (assuming independence) is printed in each cell.
28      DEVIATN    The deviation of the observed frequency from the expected frequency is printed in each cell.
29      CELLCHI2   The contribution of the cell to the chi-square is printed in each cell. (The sum of the cell chi-squares is the table's chi-square.)
30      ROWPCT     The cell frequency's percentage of the row total is printed in each cell.
31      COLPCT     The cell frequency's percentage of the column total is printed in each cell.
32      TOTPCT     The cell frequency's percentage of the grand total is printed in each cell.
33      COUNT      The count (frequency) is printed in each cell.
34      WEIGHT     The frequency weighted according to the values of the selected field.

Control Options
35      NOGRID     Suppresses the default vertical grid normally printed on all tables.
36      DASH       Suppresses the dashed lines that are normally printed between horizontal rows of cells.

40      NORANGE    All variables are assumed to be unranged and no prompting is done for ranges or user-defined.
42      NOHEAD     Suppresses page headings.
44      NOSTATS    Suppresses all statistics.
45      NORTOTAL   Suppresses row totals (and total percentages).
46      NOCTOTAL   Suppresses column totals (and total percentages).
50      GROUPS     Data grouped by first field; data must be sorted by first field.
?       "XTABS ?"  Generates a display of online information about XTABS operation.

Figure 3-7. XTABS Statistical Options

3.11.3 Summary Statistics

Summary statistics are important for measuring the significance of the relationship(s) between variables in the sample of cases analyzed. When all possible cases involving a set of variables are analyzed, the results actually measure the relationship (excluding measurement error) between the variables of interest. Since most studies use only small samples of the cases potentially available, it is important to measure the significance of the results obtained. In other words, you must determine whether the results obtained could be due partly to chance or whether they actually represent the universal population. One then speaks of "statistical significance," which is measured by various summary statistics (chi-square, contingency coefficient, Tau, etc.). These summary statistics appear in tables beneath the actual crosstabulation results.

CHISQ Option

The CHISQ option calculates and displays the chi-square statistic which tests the independence of the joint distribution of the variables in the table. Chi-square is defined by the following formula

\chi^2 = \sum_{i=1}^{N} \frac{(f_{obs}(i) - f_{exp}(i))^2}{f_{exp}(i)}

where:

f_obs(i)   Is the observed frequency (or count) in the ith cell.
f_exp(i)   Is the expected frequency in the ith cell (assuming no relationship between the variables).

The statistic does not measure the strength of the relation, but can be interpreted as a test of whether or not the variables are related. Empty cells and extremely small or large sample values lessen the significance of this statistic.

The larger the difference between the observed and randomly expected frequencies, the larger chi-square becomes. Large values for chi-square thus indicate the presence of a systematic relationship, while small values imply the absence of a relationship or statistical independence. Chi-square varies with the number of rows and columns used to determine the number of degrees of freedom (provided along with chi-square by XTABS) and the sample size.

CRAMV Option

The CRAMV statistic calculates and displays Cramer's V (Φ for 2 x 2 tables) which makes a correction for the sample size (chi-square does not). It is defined by the following formula

V = \left[ \frac{\Phi^2}{\min(r - 1, c - 1)} \right]^{1/2}

where:

"r" and "c"   Are the number of rows and columns, respectively.
Φ             Is the phi statistic (appropriate only for 2x2 crosstabs) defined by:

\Phi = \left[ \frac{\chi^2}{N} \right]^{1/2}

Φ (for 2 x 2 crosstabs) corrects χ² for the number of cases (N). For the 2 x 2 cross-tabulations Φ ranges from 0 to +1 for a perfect relationship. For larger crosstabs (greater than 2 x 2), Φ has no upper limit. Cramer's V is used to adjust Φ for the minimum of the rows and columns, so that values range from 0 (a minimal relationship) to +1. The lower bound of V is 0. Large Cramer's V values indicate a strong association.

CONT Option

The CONT option calculates and displays the contingency coefficient, C, of relationship strength. It is another chi-square-based statistic adjusted for sample size. It is defined by the following formula:

C = \left[ \frac{\chi^2}{\chi^2 + N} \right]^{1/2}

The contingency coefficient runs from 0 to a maximum value dependent on the size of the table. In comparisons it should be used with crosstabs of identical dimensions (same numbers of rows and columns).

LAMBDA Option

The LAMBDA option calculates and outputs the three lambda statistics: one symmetric lambda, and two asymmetric lambdas, one with an independent row variable and a dependent column variable, and one with the reverse (dependent row variable and independent column variable). Asymmetric lambda is based on the proportional reduction in error in estimating the distribution of the dependent variable when the independent variable is known. In other words, it measures how well you can identify the value of the dependent variable.

When all of the occurrences for any given value of the independent variable occur in a single cell (i.e., the remaining cells for that row or column are all zero), then the value of lambda is 1. Asymmetric lambda is defined by the following formula

\lambda_{asym} = \frac{\sum_j \max_k(f_{jk}) - \max(f_k)}{N - \max(f_k)}

where:

\sum_j \max_k(f_{jk})   Is the sum, over the categories j of the independent variable, of the maximum cell frequency in each category.
\max(f_k)               Is the maximum of the totals for the categories k of the dependent variable.

The symmetric lambda is an average of the asymmetric lambdas. It is defined by the following formula

\lambda_{sym} = \frac{\sum_j \max_k(f_{jk}) + \sum_k \max_j(f_{jk}) - \max(f_k) - \max(f_j)}{2N - \max(f_k) - \max(f_j)}

where the terms are defined as in the asymmetric lambdas, with the roles of the two variables exchanged in the second sum and in \max(f_j).

TAUB Option

The TAUB option calculates and displays Kendall's Tau b statistic, which is most appropriate in a square table (number of rows and columns equal). Tau b, Tau c, gamma, and Somers' D are all measures of association between two variables and differ mainly in the manner of counting tied pairs. Each evaluates all cases pair-wise relative to the ordering (low to high) of each variable. Pairs with both variables higher are called concordant pairs and those with both variables reversed are called discordant. Other cases are considered "tied." Note that if there is a general ordering of pairs in the same direction on both variables, Tau b will be positive. No dependency between the variables is assumed. Tau b is defined by the following formula

Tau_b = \frac{P - Q}{\left[ 0.5 \left( N(N-1) - \sum T_{ri}(T_{ri}-1) \right) \cdot 0.5 \left( N(N-1) - \sum T_{ci}(T_{ci}-1) \right) \right]^{1/2}}

where:

"P" and "Q"       Are the number of concordant and discordant pairs respectively.
T_ri and T_ci     Are the number of ties on the row and column variable, respectively. They turn out to also be the respective row and column totals.

The formula used for computation is:

Tau_b = \frac{P - Q}{\left[ 0.25 \cdot \left( N^2 - \sum T_{ri}^2 \right) \cdot \left( N^2 - \sum T_{ci}^2 \right) \right]^{1/2}}
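The lambda statistics described earlier in this section can be sketched in Python using the standard Goodman-Kruskal proportional-reduction-in-error form. The function names are hypothetical, the table is a plain list of rows of counts, and this is an illustration of the formulas rather than the XTABS computation itself.

```python
def lambda_column_dependent(table):
    """Asymmetric lambda with rows independent and columns dependent."""
    n = sum(sum(row) for row in table)
    sum_row_max = sum(max(row) for row in table)
    max_col_total = max(sum(col) for col in zip(*table))
    return (sum_row_max - max_col_total) / (n - max_col_total)


def lambda_row_dependent(table):
    """Asymmetric lambda with columns independent and rows dependent."""
    transposed = [list(col) for col in zip(*table)]
    return lambda_column_dependent(transposed)


def lambda_symmetric(table):
    """Symmetric lambda combining both directions of prediction."""
    n = sum(sum(row) for row in table)
    sum_row_max = sum(max(row) for row in table)
    sum_col_max = sum(max(col) for col in zip(*table))
    max_row_total = max(sum(row) for row in table)
    max_col_total = max(sum(col) for col in zip(*table))
    numerator = sum_row_max + sum_col_max - max_col_total - max_row_total
    return numerator / (2 * n - max_col_total - max_row_total)


counts = [[20, 5, 0],
          [10, 25, 40]]
lam_col = lambda_column_dependent(counts)
lam_row = lambda_row_dependent(counts)
lam_sym = lambda_symmetric(counts)
```

For a table in which every row has all of its cases in a single cell, such as [[5, 0], [0, 7]], the column-dependent lambda evaluates to 1, matching the statement in the text.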

TAUC Option

The TAUC option calculates and outputs Kendall's Tau c statistic. It is most appropriate in rectangular cases (the number of rows and columns unequal). Tau c is basically an average value per pair of (P-Q), where a row and column adjusted approximation is taken for the number of pairs. Tau c is defined by the following formula

Tau_c = \frac{2m(P - Q)}{N^2 (m - 1)}

where:

P and Q   Are as defined above for Tau b.
N         Is the number of cases.
m         Is the minimum of the number of rows and columns.

GAMMA Option

The GAMMA option calculates and displays the Gamma statistic. It is independent of ties or table size (dimensions) and is defined by the formula:

Gamma = \frac{P - Q}{P + Q}

Its value is positive, zero, or negative, if, respectively, there are more, equal, or less concordant pairs than discordant pairs. For a 2 by 2 table, gamma equals another statistic called "Yule's Q."

SOMERSD Option

The SOMERSD option calculates and outputs the Somers' D statistic. As in the LAMBDA case, three statistics are produced: two asymmetric cases, one for each variable taken as the dependent variable, and the symmetric case. Ties are taken into consideration in a different way than in the Tau statistics. The asymmetric Somers' D is calculated with the following formula

Asymmetric Somers' D = \frac{P - Q}{P + Q + T_i}

where:

T_i, i = 1, 2   Are row and column ties when the dependent variable defines the rows and columns respectively.
P and Q         Are as defined for the Tau statistic.

The symmetric case does not account for which variable is dependent, as defined by the following formula:

Symmetric Somers' D = \frac{P - Q}{P + Q + 0.5 (T_1 + T_2)}

ALL Option

The ALL option calculates and produces all of the available summary statistics.
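The pair-counting statistics in this section (Tau b, Tau c, gamma, and Somers' D) can be illustrated with a brute-force Python sketch over an ordered table of counts. Function names are hypothetical; Tau b uses the computational form (row and column totals as the tie counts), and the Somers' D tie counts are taken to be pairs tied on one variable only, which is an interpretation of the definitions in the text.

```python
def pair_counts(table):
    """Return (P, Q, T_row, T_col): concordant pairs, discordant pairs,
    pairs tied on the row variable only, pairs tied on the column variable only."""
    rows, cols = len(table), len(table[0])
    p = q = t_row = t_col = 0
    for i in range(rows):
        for j in range(cols):
            for l in range(j + 1, cols):          # same row, later column
                t_row += table[i][j] * table[i][l]
            for k in range(i + 1, rows):
                t_col += table[i][j] * table[k][j]  # same column, later row
                for l in range(cols):
                    if l > j:
                        p += table[i][j] * table[k][l]
                    elif l < j:
                        q += table[i][j] * table[k][l]
    return p, q, t_row, t_col


def ordinal_measures(table):
    p, q, t_row, t_col = pair_counts(table)
    n = sum(map(sum, table))
    row_sq = sum(sum(r) ** 2 for r in table)
    col_sq = sum(sum(c) ** 2 for c in zip(*table))
    m = min(len(table), len(table[0]))
    return {
        "tau_b": (p - q) / (0.25 * (n**2 - row_sq) * (n**2 - col_sq)) ** 0.5,
        "tau_c": 2 * m * (p - q) / (n**2 * (m - 1)),
        "gamma": (p - q) / (p + q),
        "somers_d_col_dependent": (p - q) / (p + q + t_col),
        "somers_d_row_dependent": (p - q) / (p + q + t_row),
        "somers_d_symmetric": (p - q) / (p + q + 0.5 * (t_row + t_col)),
    }


stats = ordinal_measures([[30, 10],
                          [10, 50]])
```

For this 2 by 2 table gamma equals Yule's Q, (ad - bc)/(ad + bc) = 1400/1600.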

3.11.4 Cell Statistics

The XTABS cell statistics may be specified selectively for calculation and display in each table cell.

EXPECTED

The EXPECTED cell statistic calculates and displays the expected cell frequency (assuming independence). It is defined by the formula

f_{exp}(i) = \frac{c_i r_i}{N}

where:

c_i   Is the column total or marginal for the ith cell.
r_i   Is the row total or marginal for the ith cell.
N     Is the total number of observations.

DEVIATN

This cell statistic calculates and displays the deviation between the observed and expected frequency for each cell. It is defined by the formula

DEVIATN = ABS(f_{obs}(i) - f_{exp}(i))

where:

f_obs(i)   Is the observed cell frequency for the ith cell.
f_exp(i)   Is the expected cell frequency for the ith cell.

CELLCHI2

The CELLCHI2 option calculates and displays the contribution to the chi-square summation for each cell. It is defined by the formula

CELLCHI2_i = \frac{(f_{obs}(i) - f_{exp}(i))^2}{f_{exp}(i)}

where:

f_obs(i)   Is the observed frequency (or count) in the ith cell.
f_exp(i)   Is the expected frequency in the ith cell assuming no relationship between the variables.
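A short Python sketch ties the cell statistics above to the chi-square-based summary statistics of Section 3.11.3. Function names are hypothetical, and the Cramer's V denominator min(r - 1, c - 1) is the standard definition, assumed here to be what the manual intends.

```python
def cell_statistics(table):
    """Return (expected, deviation, cellchi2) as tables shaped like `table`."""
    n = sum(map(sum, table))
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    expected = [[r * c / n for c in col_tot] for r in row_tot]
    deviation = [[abs(o - e) for o, e in zip(orow, erow)]
                 for orow, erow in zip(table, expected)]
    cellchi2 = [[(o - e) ** 2 / e for o, e in zip(orow, erow)]
                for orow, erow in zip(table, expected)]
    return expected, deviation, cellchi2


def summary_statistics(table):
    """Chi-square as the sum of the cell contributions, then phi, V and C."""
    _, _, cellchi2 = cell_statistics(table)
    n = sum(map(sum, table))
    chi2 = sum(map(sum, cellchi2))
    phi = (chi2 / n) ** 0.5
    v = (phi**2 / min(len(table) - 1, len(table[0]) - 1)) ** 0.5
    c = (chi2 / (chi2 + n)) ** 0.5
    return chi2, phi, v, c


counts = [[30, 10],
          [10, 50]]
expected, deviation, cellchi2 = cell_statistics(counts)
chi2, phi, v, c = summary_statistics(counts)
```

For a 2 by 2 table the sketch returns the same value for phi and V, as the CRAMV description states.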

ROWPCT

The ROWPCT option calculates and displays the cell frequency's percentage of the row total for each cell. It is defined by the formula

ROWPCT_i = 100 * \frac{f_{obs}(i)}{r_i}

where:

r_i   Is the row total for the ith cell.

COLPCT

The COLPCT option calculates and displays the cell frequency's percentage of the column total for each cell. It is defined by the following formula

COLPCT_i = 100 * \frac{f_{obs}(i)}{c_i}

where:

c_i   Is the column total for the ith cell.

TOTPCT

The TOTPCT option calculates and displays the cell frequency's percentage of the table grand total for each cell. It is defined with the following formula

TOTPCT_i = 100 * \frac{f_{obs}(i)}{N}

where:

N   Is the total number of cases.

COUNT

The COUNT statistic calculates and displays the cell frequency (f_obs(i)) for each cell.

3.11.5 Ease-of-Use Features

XTABS includes a number of default ease-of-use features:

• Each table is printed with a page heading and a grid.
• Complete summary statistics are produced and displayed.

• Counts and row, column, and total percentages are displayed in each cell.
• Row and column totals and percentages, along with grand totals, are automatically calculated and displayed (whenever cell row and column percentages are produced).
• Output is automatically paneled in accord with the STATSET PANEL setting.
• If no variables are specified (e.g., XTABS ALL), the first two variables (or the second and third) in the file being analyzed are used as the ACROSS and BY variables. In this case, one row or column is produced for each discrete variable value and appears with the value as the heading.

3.11.6 Control Options

XTABS provides a number of control options for use in producing crosstabulations.

Option     Action
NOGRID     Suppresses the printing of the GRID (printed by default).
NORANGE    Suppresses automatic range prompting (see Section 3.11.7 Specifying Columns and Rows on page 3-56).
NOHEAD     Suppresses the printing of a heading in the table of crosstabs (default).
NOSTATS    Suppresses automatic calculation and display of all summary statistics (default).
NORTOTAL   Suppresses automatic calculation of row totals and percentages (default).
NOCTOTAL   Suppresses automatic calculation of column totals and percentages (default).
GROUPS     Produces full crosstab and summary statistics for each discrete value of the first field. The sorting field is assumed to be in ascending order. This is similar to the GROUPS option in the ANALYSE Statistics (STATS) facility.
?          Displays the online documentation for XTABS.

Figure 3-8. XTABS Control Options

After you specify the crosstab variables and options, XTABS determines whether the required data is available. If this analysis is the first process for the existing selection criteria (see the STATSET option SELECT in Section 2.3 Preparing the Environment: STATSET on page 2-2) the data will be read in and the number of observations displayed as follows:

NUMBER OF OBSERVATIONS = 48

3.11.7 Specifying Columns and Rows

XTABS prompts for column range categories and heading labels for each classification variable:

PLEASE ENTER RANGE AND/OR HEADING SPECIFICATIONS OR "NONE" FOR EACH VARIABLE (E.G., BOT 12 AS LOW, 12 18, 20 36 AS HIGH)

The response to this prompt determines the extent of values included in each column and row, along with their printed labels. The general form of the range specifications is as follows:

SALARY bot 16000 as "Low Paid", 16000 top as "Better Paid"

A column or row range specifies the column's (or row's) lower inclusive bound (greater than or equal to value) and the upper non-inclusive bound (less than value). A lower bound of BOT and an upper bound of TOP may be specified for the first and last row or column. For example

BOT -100, 0 100, 100 TOP

where the range specifiers have the following meanings:

BOT -100   Specifies that the first row or column will include the lowest values through -100 (negative 100).
0 100      Specifies that the next group of rows or columns will contain values in the 0 to 100 range.
100 TOP    Specifies that the final group of rows or columns will contain values over 100.

Ranges must be in ascending order. Ranges may have gaps, or different lengths, but they may not overlap. Ranges or discrete values may be specified but not mixed for a single variable. Discrete ranges may be alpha or numeric but bounded ranges must be numeric. Each range or discrete value response is checked for ascending order and another prompt is issued if the values supplied are in error. Columns and rows are automatically sorted in ascending order.

The "AS label" phrase is optional and may be used for none, any, or all ranges specified. The label (truncated to 12 characters) becomes the row or column heading. Labels containing embedded blanks must be enclosed in single quotation marks. For example:

90 100 as "Very Hot"

If no label is specified, then the upper and lower boundaries (or the discrete values) become the column or row labels. The resulting columns and rows will be displayed in the ascending order of the data values regardless of the labels specified.

A response of NONE for a variable (alpha or numeric) causes one column or row to be created for each discrete value of the variable with the label set to the variable value.

(Such input for a variable is generally entered on one line.) If additional lines are needed, enter a comma as the last character on each line, to direct XTABS to continue on the following line with another prompt for the same variable. Note that prompting for ranges (labels, etc.) continues until the variable is syntactically complete. (If you accidentally type a comma at the end of the current line, and do not wish to continue, type NEXT on the next line to satisfy the prompt.) Type LAST, if you wish to respecify the complete ranges (label, etc.) for the preceding variable.

3.11.8 General Notes

XTABS gets its number of lines per page from the FOCUS SET LINES command. This, together with the STATSET WIDTH flag, determines the length and paneling of the report.
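The range rules from Section 3.11.7 (inclusive lower bound, non-inclusive upper bound, BOT and TOP as open ends, ascending non-overlapping ranges, labels truncated to 12 characters) can be sketched in Python. The parser below is a simplified illustration with hypothetical function names; it does not handle discrete values, continuation lines, or commas inside labels, and a value that falls exactly on an upper bound is treated as outside that range, following the stated rule.

```python
import math


def parse_ranges(spec):
    """Parse e.g. 'BOT -100, 0 100, 100 TOP' into [(low, high, label), ...]."""
    ranges = []
    for part in spec.split(","):
        words = part.split()
        low = -math.inf if words[0].upper() == "BOT" else float(words[0])
        high = math.inf if words[1].upper() == "TOP" else float(words[1])
        if len(words) > 3 and words[2].upper() == "AS":
            label = " ".join(words[3:]).strip("'\"")[:12]
        else:
            label = f"{words[0]} {words[1]}"   # bounds serve as the heading
        ranges.append((low, high, label))
    for (_, high1, _), (low2, _, _) in zip(ranges, ranges[1:]):
        if low2 < high1:
            raise ValueError("ranges must be ascending and must not overlap")
    return ranges


def classify(value, ranges):
    """Return the label of the range holding `value`, or None if it falls in a gap."""
    for low, high, label in ranges:
        if low <= value < high:
            return label
    return None


salary = parse_ranges('bot 16000 as "Low Paid", 16000 top as "Better Paid"')
gapped = parse_ranges("BOT -100, 0 100, 100 TOP")
```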

The FOCUS SET PAUSE=ON facility causes XTABS to wait before printing to allow time for aligning forms on the output device.

When STATSET missing value processing is in effect, the number of missing cases is displayed on the generated crosstabulation. Missing cases are not displayed or included in the compilation. If missing cases are required on the table, the range response "-999 as missing" (included with other column or row specifications) will generate appropriately labeled columns or rows.

Since typing ahead is supported throughout ANALYSE, a complete crosstabulation may be generated by the following ANALYSE prompt and response:

ENTER STATISTICAL OPERATION DESIRED
xtabs salary by department all, none, none

References

John Mueller, Karl Schuessler, and Herbert Costner (1970). Statistical Reasoning in Sociology (2nd edition). Boston: Houghton Mifflin.

Sample XTABS Terminal Session


A Error Messages

(FOC108) NUMBER OF ANALYSE VARIABLES EXCEEDS 64
The statistical analysis sub-system cannot handle more than 64 independent variables.

(FOC109) INSUFFICIENT CORE FOR FACTOR ANALYSIS
The statistical operation requested requires more core storage.

(FOC110) INVALID STATSET MODE:
The ANALYSE option STATSET does not have the parameter requested.

(FOC111) SET VALUE MISSING
The value after the STATSET parameter is missing. A valid value must be provided.

(FOC112) VALUE MUST BE NUMERIC:
The value in response to the ANALYSE prompt, or STATSET parameter, must be numerical.

(FOC113) AN ILLEGAL RESPONSE HAS BEEN ENTERED:
The response to the ANALYSE prompt is not recognized. Type EXPLAIN in the ANALYSE mode if assistance is needed.

(FOC114) AN INDEPENDENT VARIABLE IS ALSO A DEPENDENT VARIABLE:
A regression cannot use the same variable as both independent and dependent.

(FOC115) THE NUMBER ENTERED EXCEEDS THE NUMBER OF VARIABLES:
The numerical variable identity provided in the ANALYSE prompt exceeds the number of variables in the file being analyzed.

(FOC116) FLUSHING TO QUIT OR TO NEXT STATISTICAL OPERATION
Sequential processing of ANALYSE commands cannot be continued because of parameter errors. The stacked commands are ignored until the next valid STATMODE is encountered.

(FOC117) VALID RESPONSES ARE ON OR OFF:
There are only two valid responses to this prompt. These are ON or OFF.

(FOC118) DEPENDENT VARIABLE DOES NOT EXIST:
The dependent variable provided in response to the ANALYSE prompt is not specified in the MASTER description for the data.

(FOC119) INDEPENDENT VARIABLE DOES NOT EXIST:
Same as (FOC118) but applies to the independent variable.

(FOC120) VALUE ENTERED IS INCONSISTENT WITH FILE DESCRIPTION:
The data format is different from the value supplied to the ANALYSE prompt, i.e., not numeric, etc.

(FOC121) INVALID LEVEL ENCOUNTERED IN DATA FOR FACTOR:
Data file submitted for FACTOR analysis is not correctly formulated.

(FOC122) NUMBER OF LEVELS NOT CONSISTENT FOR FACTOR:
The factor analysis sub-system has detected an input setup error.

(FOC123) NUMBER OF FACTORS EXCEEDS 6:
The statistical process of factor analysis cannot exceed 6 factors.

(FOC124) NON-NUMERIC DATA IN FILE:
There is non-numerical data for a variable which must be numerical. This observation must be eliminated from the ANALYSE statistical process.

(FOC125) RECAP CALCULATIONS MISSING:
The word RECAP is not followed by a calculation. Either it should be removed, or a calculation provided.

(FOC126) NUMBER OF DATA VALUES NOT CONSISTENT WITH FACTOR LEVELS:
In the ANALYSE mode the factor analysis procedure FACTO requires at least the same number of fields as levels of factors.

(FOC127) LITERAL TEST VALUE FOR GROUP IS INCORRECT:
In the request statement a screening phrase against a group field has inconsistent values based on the contents of the fields in the group. The size of one or more sections is either incorrect or not numerical.

(FOC128) NOTE...LIMIT USES 'EQ' OR 'LE' TEST CONDITION ONLY:
The RECORDLIMIT and READLIMIT phrases test on 'EQ' or 'LE' only. The meaning is the same.

(FOC133) WARNING: ANALYSE FILE LOST...HAD SAME NAME AS HOLD FILE
The prior HOLD file is over-written. Use new names with the EQFILE option.

B Master File Descriptions and Diagrams

This appendix consists of the Master File Descriptions and Diagrams (pictures) of the following files:

• AIRLINE
• EMPLOYEE
• EDUCFILE
• JOBFILE
• PROPERTY
• SMSA

All of these files can be created by FOCEXECs distributed with FOCUS. To create AIRLINE, PROPERTY, and SMSA, execute the AIRLINE, PROPERTY and SMSA FOCEXECs. EMPLOYEE, EDUCFILE and JOBFILE are created by EMPTEST (for CMS) and EMPTSO (for TSO) FOCEXECs.
