
SAP Predictive Analysis User Guide
SAP Predictive Analysis 1.0 Support Package 1

2012-10-04

Copyright

© 2012 SAP AG. All rights reserved.

SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP BusinessObjects Explorer, StreamWork, SAP HANA and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries.

Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other Business Objects products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Business Objects Software Ltd. Business Objects is an SAP company.

Sybase and Adaptive Server, iAnywhere, Sybase 365, SQL Anywhere, and other Sybase products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Sybase, Inc. Sybase is an SAP company.

Crossgate, m@gic EDDY, B2B 360, B2B 360 Services are registered trademarks of Crossgate AG in Germany and other countries. Crossgate is an SAP company.

All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.

These materials are subject to change without notice. These materials are provided by SAP AG and its affiliated companies ("SAP Group") for informational purposes only, without representation or warranty of any kind, and SAP Group shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP Group products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.

Contents

Chapter 1 What's New in the SAP Predictive Analysis 1.0 SP1 release

Chapter 2 About This Guide
  2.1 Who Should Read This Guide
  2.2 What This Guide Contains

Chapter 3 Overview of SAP Predictive Analysis

Chapter 4 Getting Started with SAP Predictive Analysis
  4.1 Launching SAP Predictive Analysis
  4.2 Basics of SAP Predictive Analysis
  4.3 SAP Predictive Analysis at a Glance
  4.4 Understanding SAP Predictive Analysis
    4.4.1 Designer Perspective
    4.4.2 Visualization Perspective
  4.5 Setting Preferences
  4.6 Accessing Keyboard Shortcuts
  4.7 Viewing Security Logs

Chapter 5 Building Analyses
  5.1 Creating a New Folder
  5.2 Creating an Analysis
    5.2.1 Reading Data
    5.2.2 Preparing Data
    5.2.3 Applying Algorithms
    5.2.4 Storing Results of the Analysis
  5.3 Running the Analysis
  5.4 Saving the Analysis
  5.5 Viewing Results

Chapter 6 Analyzing Data
  6.1 Visualization Techniques
    6.1.1 Table
    6.1.2 Scatter Plot
    6.1.3 Statistical Summary
    6.1.4 Parallel Coordinates
    6.1.5 Ad Hoc Chart
    6.1.6 Algorithm Summary
    6.1.7 Time Series Graph
    6.1.8 Regression Graph
    6.1.9 Cluster Graph
    6.1.10 Decision Tree

Chapter 7 Working with Models
  7.1 Creating a Model
  7.2 Viewing Model Information
  7.3 Exporting a Model as PMML
  7.4 Deleting a Model

Chapter 8 Component Properties
  8.1 Data Readers
    8.1.1 BI4 Unv Reader
    8.1.2 CSV Reader
    8.1.3 Excel Reader
    8.1.4 HANA Reader
    8.1.5 JDBC Reader
    8.1.6 XI3 Unv Reader
  8.2 Data Preparation
    8.2.1 Data Type Definition
    8.2.2 Filter
    8.2.3 Formula
    8.2.4 Sample
  8.3 Algorithms
    8.3.1 Association
    8.3.2 Classification
    8.3.3 Clustering
    8.3.4 Decision Trees
    8.3.5 Neural Network
    8.3.6 Outliers
    8.3.7 Regressions
    8.3.8 Time Series
  8.4 Data Writers
    8.4.1 CSV Writer
    8.4.2 HANA Writer
    8.4.3 JDBC Writer
  8.5 Saved Models

Chapter 9 Use Case Scenarios
  9.1 Sales Forecasting
  9.2 Retail Store Segmentation

Chapter 10 Glossary Definitions

Appendix A More Information

Index

What's New in the SAP Predictive Analysis 1.0 SP1 release

The SAP Predictive Analysis 1.0 SP1 release includes the following enhancements:

Change in the product name
In this release, the name of the product is changed from SAP BusinessObjects Predictive Analysis to SAP Predictive Analysis.

Support for additional SAP HANA 1.0 SPS3 PAL algorithms
In this release, SAP Predictive Analysis provides the following new PAL algorithms:
• HANA Apriori
• HANA KNN
• HANA C4.5

New install types in the SAP Predictive Analysis installer
In this release, the SAP Predictive Analysis installer contains the following new install types:
• Express
• Advanced


About This Guide

2.1 Who Should Read This Guide


This guide is intended for professional data analysts, business analysts, and information designers who want to use the SAP Predictive Analysis application to analyze and visualize data using predictive algorithms.

2.2 What This Guide Contains


This guide provides:
• An overview of SAP Predictive Analysis
• Information on the various components and algorithms available in SAP Predictive Analysis
• Information on how to create an analysis
• Information on how to analyze data using predictive analysis visualization techniques


Overview of SAP Predictive Analysis

SAP Predictive Analysis is a statistical analysis and data mining solution that enables you to build predictive models to discover hidden insights and relationships in your data, from which you can make predictions about future events. This application can access data from various data sources, such as flat files, relational databases, in-memory databases, and SAP BusinessObjects universes, and can operate on different volumes of data, from a small matrix of data in a CSV file to a very large dataset in SAP HANA. With SAP Predictive Analysis, you can perform various analyses on the data, including time series forecasting, outlier detection, trend analysis, classification analysis, segmentation analysis, and affinity analysis. This application enables you to analyze data using different visualization techniques, such as scatter plot matrices, parallel coordinates, cluster graphs, decision trees, and ad hoc graphs. SAP Predictive Analysis offers a range of predictive analysis algorithms, supports the use of the R Open Source statistical analysis language, and offers in-memory data mining capabilities for handling large volume data analysis in an efficient manner.


Getting Started with SAP Predictive Analysis

4.1 Launching SAP Predictive Analysis


To launch SAP Predictive Analysis, choose Start > All Programs > SAP Predictive Analysis > SAP Predictive Analysis 1.0 SP1 > Predictive Analysis 1.0. The application opens with the welcome screen, which provides the following options:
• Create New: Enables you to create a new analysis.
• Recently-Used Analyses: Enables you to open recently used analyses.
• Demo Videos: Enables you to view videos of SAP Predictive Analysis use case scenarios.
• Getting Started: Provides access to help so that you can learn how to use the SAP Predictive Analysis application.

4.2 Basics of SAP Predictive Analysis


Component
A component is the basic processing unit of SAP Predictive Analysis. Each component contains input anchors, output anchors, or both. Anchors are connection points: you use them to link components through connectors. When components are connected, data is transmitted from predecessor components to their successor components. SAP Predictive Analysis provides the following types of components:
• Data Readers
• Data Preparation
• Algorithms
• Data Writers


In SAP Predictive Analysis designer, you can access components from the components panel in the analysis editor. The status icon of a component enables you to identify its state. A component can be in one of the following states:
• Not Configured: Displayed when you drag a component onto the analysis editor. It signifies that the component needs to be configured before you run the analysis.
• Configured: Displayed after all the necessary properties are configured for the component.
• Success: Displayed after the successful execution of the analysis.
• Failure: Displayed if this component causes the execution of the analysis to fail.

Analysis
An analysis is a collection of components linked together in a particular sequence that defines the flow of data.

Model
A model is a reusable component created by training an algorithm using historical data.
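The relationship between components, their states, and an analysis can be sketched in a few lines of Python. This is only an illustration of the concepts described above, not the product's API: the `Component` class, the `State` enum, and `run_analysis` are hypothetical names, and a real analysis is not limited to a single linear chain.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class State(Enum):
    """The four component states described in this section."""
    NOT_CONFIGURED = "Not Configured"
    CONFIGURED = "Configured"
    SUCCESS = "Success"
    FAILURE = "Failure"


@dataclass
class Component:
    """Hypothetical stand-in for a component (reader, preparation, algorithm, writer)."""
    name: str
    transform: Optional[Callable[[list], list]] = None
    state: State = State.NOT_CONFIGURED

    def configure(self, transform: Callable[[list], list]) -> None:
        # Setting the required property moves the component to Configured.
        self.transform = transform
        self.state = State.CONFIGURED


def run_analysis(components: List[Component], data: list) -> list:
    """Pass data from each predecessor to its successor, in sequence."""
    for component in components:
        if component.transform is None:
            raise ValueError(f"{component.name} is not configured")
        try:
            data = component.transform(data)
            component.state = State.SUCCESS
        except Exception:
            # The component that breaks the run is marked as failed.
            component.state = State.FAILURE
            raise
    return data


reader = Component("reader")
reader.configure(lambda _: [3, 1, 2])                       # produces data
prep = Component("filter")
prep.configure(lambda rows: [r for r in rows if r > 1])     # keeps values > 1
result = run_analysis([reader, prep], [])
```

Here `result` holds the rows that survive the filter, and both components end in the Success state. Running a chain that contains an unconfigured component raises an error before any data reaches it, which mirrors the requirement to configure components before running the analysis.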

4.3 SAP Predictive Analysis at a Glance


Figure 4-1: SAP Predictive Analysis main application window

4.4 Understanding SAP Predictive Analysis


SAP Predictive Analysis designer consists of the following two perspectives:
• Designer Perspective: Enables you to design and run analyses, and to create predictive models.
• Visualization Perspective: Enables you to understand the data and the results of an analysis by using various visualization techniques and intuitive charts.

4.4.1 Designer Perspective


4.4.1.1 Explorer

The Explorer view contains folders and analyses that you create for specific domains. A folder helps you to organize and manage analyses. In the Explorer view, you can:
• Create folders for multiple domains
• Create analyses for different domains
• Move analyses from one folder to another
• Copy analyses from one folder to another
• Rename folders and analyses
• Filter analyses for faster access
• Delete folders and analyses

4.4.1.2 Analysis Editor


The analysis editor enables you to design, edit, and run analyses. To design an analysis, drag components onto the editor and link them using connectors. In the analysis editor, you can:
• Run an analysis
• Preview component data
• Arrange components either vertically or horizontally for better layout, structure, and representation of an analysis
• Copy, paste, delete, and rename components
• Undo and redo operations
• View model information
• Export saved models as Predictive Model Markup Language (PMML)
• View component results
• View component properties

All of these actions are available from the context menu, the application menu, and the toolbar.

4.4.1.3 Components Panel


The components panel in the analysis editor displays all the components in the metadata repository (MDR). It contains data readers, data preparation components, algorithms, data writers, and saved models. This panel provides options for exploring components in tree and list views. In the components panel, you can:
• Filter components for faster access
• Add components to your favorites, for convenient access in the future
• Delete a saved model

4.4.1.4 Miniature View


This is an optional view. You can enable this view from View > Miniature View. The Miniature View provides a bird's-eye view of an analysis. If the analysis is large, the miniature view helps you to navigate to individual components in the analysis editor easily.

4.4.1.5 Properties View


The Properties view enables you to configure analysis and component properties. To view the properties of a component, select the component. To view the properties of an analysis, open the analysis and move the focus onto the analysis editor.

The component properties view contains the following tabs:
• Settings: Contains configurable properties of a component and provides an option to save an algorithm as a model.
• General: Contains general properties of the component, such as component name, type, and description.
• Data Mappings: This tab is specific to the CSV Reader and Excel Reader components. It provides options to define the name and data type for the source columns.

The analysis properties view contains the following tabs:
• General: Contains general properties of an analysis, such as analysis name and analysis description.
• In-Database Connection: This tab is specific to HANA Reader. It contains in-DB connection details and provides an option to change the in-DB connection.
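To make the idea behind the Data Mappings tab concrete, the following Python sketch renames source columns and converts their values to a chosen data type while reading CSV text. It is an assumption-laden illustration only: the `MAPPINGS` table, the column names, and `apply_mappings` are hypothetical and do not reflect how the CSV Reader stores its settings.

```python
import csv
import io

# Hypothetical mapping: source column -> (display name, data type converter).
MAPPINGS = {
    "Store ID": ("store_id", int),
    "Revenue": ("revenue", float),
    "Region": ("region", str),
}


def apply_mappings(text, mappings):
    """Read CSV text and return rows with renamed, typed columns."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({new: cast(raw[src]) for src, (new, cast) in mappings.items()})
    return rows


sample = "Store ID,Revenue,Region\n7,12.5,West\n"
typed_rows = apply_mappings(sample, MAPPINGS)
```

After the call, `typed_rows` contains one row whose store identifier is an integer and whose revenue is a floating-point number, instead of the plain strings that a CSV file provides.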

4.4.1.6 Message Log


The Message Log view enables you to view warnings, information messages, and error messages from the analysis execution. To view the details of a message, double-click it in the message log. In the message log view, you can:
• Filter messages by selecting only the message types you want to view
• Set the number of messages to be viewed in the message log
• Clear messages from the message log view

4.4.1.7 Preview

The Preview option enables you to view a snapshot of the data in the data reader or data preparation components. In this view, you can filter the data by entering a filter string and sort data by selecting the required sort type from the Sort Direction menu.
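The effect of filtering and sorting a preview can be sketched as follows. This is a minimal Python illustration under stated assumptions: the guide does not specify how the filter string is matched, so the sketch assumes a case-insensitive substring match against every column, and the `preview` function and its parameters are hypothetical.

```python
def preview(rows, filter_text="", sort_key=None, descending=False, limit=5):
    """Return up to `limit` rows that contain `filter_text` in any column.

    Assumption: matching is a case-insensitive substring test.
    """
    needle = filter_text.lower()
    kept = [r for r in rows if any(needle in str(v).lower() for v in r.values())]
    if sort_key is not None:
        # Sort direction corresponds to ascending or descending order.
        kept.sort(key=lambda r: r[sort_key], reverse=descending)
    return kept[:limit]


rows = [
    {"store": "North", "sales": 120},
    {"store": "South", "sales": 95},
    {"store": "Northwest", "sales": 140},
]
snapshot = preview(rows, filter_text="north", sort_key="sales", descending=True)
```

With this sample data, `snapshot` keeps the two stores whose names contain "north" and lists the one with the higher sales value first.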

4.4.2 Visualization Perspective

4.4.2.1 Executed Analyses


The Executed Analyses view displays analyses executed in the current session. Each executed analysis contains visualizations for each component that is part of it except for data writer components.

4.4.2.2 Attribute Selection


The Attribute Selection view displays attributes (columns) from the data source. By default, the first five attributes are selected and visualizations are displayed based on the selected component.

4.4.2.3 Visualization Editor


The Visualization Editor enables you to view and interact with charts and graphs.

4.5 Setting Preferences


SAP Predictive Analysis enables you to personalize some of its settings based on your preferences.


General Settings
Option: Notification Settings
Default: On
Description: Preference for displaying analysis execution status through notifications. When the application is open, notifications appear at the bottom right corner of the application. When the application is minimized, notifications appear from the Windows system tray.
Setting: Turn on notifications

Option: Heap Memory
Default: Off
Description: Preference to display the heap memory status in the application. Heap memory is the runtime memory pool created at application start-up. The heap memory status indicates the following parameters:
1. Heap Size: Current heap memory usage of the application.
2. Total Heap Size: Total heap memory allocated for the application at that particular time. The JVM increases the total heap size based on the memory used by the application.
3. Maximum Heap Size: Maximum heap memory that can be allocated for the application.
Generally, the JVM invokes the garbage collector and releases free memory from time to time. However, you can also free memory yourself by running the garbage collector. If the application is running out of memory, we recommend that you close analyses that are not currently in use.
Setting: Show heap memory status at the bottom of the application window

Analysis Editor Settings


Option: Magnetic Component Association
Default: On
Description: Preference for enabling associations between components.
Setting: Create a connection when the component is dropped on the connector

Option: Connector Style
Default: Manhattan Connector
Description: Preference to select the style of the connector.
• Shortest Path Connector: represents a straight-line connector.
• Manhattan Connector: represents a bend-point connector.

Option: Perspective Switch Settings
Default: On
Description: Preference for prompting while switching between perspectives.
Setting: Prompt to switch to visualization perspective after analysis execution


Connection Management Settings
Connection Management Settings provides options to create, delete, and modify in-database connections.

Add
Creates a new in-database connection.
1. Choose Add. The "New Connection" wizard appears.
2. Enter a name for the connection.
3. Select the required database type.
4. Enter the required parameters for the database.
5. Choose Finish.

Edit
Edits an existing in-database connection.
1. From the connections table, select the connection to edit.
2. Choose Edit. The "Edit Connection" wizard appears.
3. Modify the connection parameters as required.
4. Choose Finish.

Set as Default
Sets an existing in-database connection as the default.
1. From the connections table, select the required connection.
2. Choose Set as Default.
After you set a connection as the default, "(default)" is appended to the connection name in the table.

Delete
Deletes an existing in-database connection.
1. From the connections table, select the required connection.
2. Choose Delete.

4.6 Accessing Keyboard Shortcuts


Many functions of SAP Predictive Analysis can be accessed from the keyboard. Keyboard accessibility is designed for efficient navigation in the software and for making functions accessible to all users.


Shortcuts for Navigating the Main Window


Shortcut Key            Function

F1                      Launch SAP Predictive Analysis context-sensitive help.
F10                     Moves the focus to the application menu bar.
ALT + SHIFT + M         Open the miniature view.
ALT + SHIFT + E         View/Hide the explorer view.
CTRL + F4               Close the current analysis editor.
CTRL + F6               Displays the list of opened analyses and moves focus to the next opened analysis in the editor.
CTRL + F7               Switch between views.
CTRL + F8               Switch between visualization and designer perspectives.
CTRL + TAB              Moves the focus from one control to the next control.
CTRL + SHIFT + TAB      Moves the focus from one control to the previous control.
CTRL + SHIFT + F6       Displays the list of opened analyses and moves focus to the previous opened analysis in the editor.
CTRL + SHIFT + Arrow    Scroll the editor.
ALT + HYPHEN and C      Close the welcome screen.
CTRL + E                Provides option to switch between analysis editors.
CTRL + Y                Redo the last operation.
CTRL + Z                Undo the last operation.
ALT + F and X           Exits the application.

Shortcuts for Accessing Components and Analysis


Key Combination         Function

ALT + SHIFT + A         Create a new analysis.
ALT + SHIFT + F         Create a new folder.
ALT + SHIFT + I         Create a new in-DB connection.
ALT + A                 Add components to favorites.
CTRL + W                Close the active analysis in the analysis editor.
CTRL + SHIFT + W        Close all active analyses in the analysis editor.
CTRL + S                Save the active analysis.
CTRL + SHIFT + S        Save all active analyses.
CTRL + X                Cut the selected analysis.
CTRL + C                Copy the selected analysis to the clipboard.
CTRL + V                Paste the copied analysis from the clipboard.
Delete                  Delete the selected analysis or folder.
CTRL + R                View the result of the analysis.
F2                      Rename the selected analysis or folder.
F3                      Arrange components horizontally in the analysis editor.
CTRL + F3               Arrange components vertically in the analysis editor.
CTRL + F9               Create connection between two components.
F4                      Open the properties view for the selected component.
F5                      Refresh the analysis or folder.
F6                      Preview a snapshot of data from the data source.
F7                      Run until here.
F8                      Run analysis.
F9                      Cancel the analysis execution.
/ or \                  Repeatedly pressing either of these keys selects input and output connectors (if available) associated with the component in focus.
? or |                  Select component along with the associated connectors.
/ + SPACEBAR            Select connector.
ALT + R                 Add a data source object onto the Result Objects pane of the Universe wizard.
ALT + T                 Add a data source object or filter object onto the Filter pane of the Universe wizard.
ALT + Y                 Apply attribute selection in the visualization editor.

Shortcuts related to models


Key Combination         Function

CTRL + T                Export model as PMML.
CTRL + SHIFT + T        View model information.

Example: Building an analysis without using the mouse

This example describes the steps to build an analysis containing a CSV Reader component and a linear regression algorithm without using the mouse.

1. Open the Start menu and type predictive analysis.
2. Press ENTER. The SAP Predictive Analysis application is launched.
3. Press ALT + SHIFT + A to open a new analysis.
4. Press CTRL + F7 to display the list of available views. Hold the CTRL key and select Editor using the Up/Down Arrow keys.
5. Press TAB to move the focus to the components panel.
6. Press ENTER and select Show Filter using the Down Arrow key.
7. Press TAB to move the focus to the component filter text box. Type CSV Reader and press TAB until the focus moves to the CSV Reader component.
8. Press the Down Arrow key to select the CSV Reader component.
9. Press ENTER to add the CSV Reader component onto the editor.
10. Press CTRL + F7 to display the list of available views. Hold the CTRL key and select Properties using the Up/Down Arrow keys.
11. Press TAB until the focus moves to File Name.
12. Press TAB to move the focus to the Browse button.
13. Press SPACEBAR to open the "Choose File" dialog. Select the .csv file and press ENTER.
    Using the TAB key, you can navigate to all the fields in the properties view and enter or select the required values. To select a value from a drop-down list, use the Up/Down Arrow keys. To select an option from a group of radio buttons, use the Left/Right Arrow keys. To select or clear a check box, use the SPACEBAR.
14. Press CTRL + F7 to display the list of available views. Hold the CTRL key and select Editor using the Up/Down Arrow keys.
15. Press TAB to move the focus to the component filter text box. Type Linear Regression and press TAB until the focus moves to the Linear Regression algorithm.
16. Press the Down Arrow key to select the Linear Regression algorithm.
17. Press ENTER to add the Linear Regression algorithm onto the editor.
18. Press CTRL + F7 to display the list of available views. Hold the CTRL key and select Properties using the Up/Down Arrow keys.
19. Using the TAB key, navigate to the fields in the properties view and enter or select the required values.
20. Press CTRL + F7 to display the list of available views. Hold the CTRL key and select Editor using the Up/Down Arrow keys.
21. Press F8 to run the analysis.

4.7 Viewing Security Logs


SAP Predictive Analysis logs security events in the security log file. Information about database connection configuration details and security-related events, such as configuration changes or failure to access external systems such as databases and SAP BusinessObjects BI systems, is logged in the security file. The security log files are located at:

<PA_INST_DIR>\SecurityLogs\PASSecurity_<timestamp>.log
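If you need to locate these files from a script, a minimal Python sketch such as the following lists them, oldest first. It assumes only the folder and file name pattern given above; the install directory is whatever `<PA_INST_DIR>` resolves to on your machine, the function names are hypothetical, and the sketch orders files by modification time because the format of `<timestamp>` is not documented here.

```python
from pathlib import Path


def list_security_logs(install_dir):
    """Return PASSecurity_*.log files under <install_dir>/SecurityLogs, oldest first."""
    log_dir = Path(install_dir) / "SecurityLogs"
    if not log_dir.is_dir():
        return []
    # Order by modification time rather than parsing the file name.
    return sorted(log_dir.glob("PASSecurity_*.log"), key=lambda p: p.stat().st_mtime)


def newest_security_log(install_dir):
    """Return the most recently modified security log, or None if there is none."""
    logs = list_security_logs(install_dir)
    return logs[-1] if logs else None
```

Call `newest_security_log` with your actual installation directory to get the path of the latest log file; the function returns `None` when the folder is missing or empty.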


Building Analyses

5.1 Creating a New Folder


You create folders for saving analyses.
1. Select File > New > Folder.
2. Enter a name for the folder.
3. Enter a description for the folder.
4. Click Finish.

5.2 Creating an Analysis


You create an analysis by assembling different components in a particular sequence, configuring them, and connecting them together with connectors that define the flow of data. To create an analysis, perform the following steps:
1. Read data from data sources.
2. (Optional) Prepare the data for analysis (for example, filter data or sample data).
3. Apply algorithms.
4. (Optional) Store the results of the analysis.
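The four steps can be mirrored outside the application in a short Python sketch. It is an analogy under stated assumptions, not how SAP Predictive Analysis works internally: the column names `x` and `y`, the sample data, and all function names are hypothetical, and the algorithm is an ordinary least-squares line fit written by hand.

```python
import csv
import io


def read_rows(text):
    """Step 1: read data (here, CSV text with numeric columns x and y)."""
    return [{"x": float(r["x"]), "y": float(r["y"])}
            for r in csv.DictReader(io.StringIO(text))]


def prepare(rows):
    """Step 2 (optional): prepare data, for example by filtering rows."""
    return [r for r in rows if r["x"] >= 0]


def fit_line(rows):
    """Step 3: apply an algorithm (least-squares fit of y = slope * x + intercept)."""
    n = len(rows)
    mean_x = sum(r["x"] for r in rows) / n
    mean_y = sum(r["y"] for r in rows) / n
    sxx = sum((r["x"] - mean_x) ** 2 for r in rows)
    sxy = sum((r["x"] - mean_x) * (r["y"] - mean_y) for r in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def store(rows, slope, intercept):
    """Step 4 (optional): store results, here as CSV text with a predicted column."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["x", "y", "predicted"])
    for r in rows:
        writer.writerow([r["x"], r["y"], slope * r["x"] + intercept])
    return out.getvalue()


source = "x,y\n-1,0\n0,1\n1,3\n2,5\n"
prepared = prepare(read_rows(source))
slope, intercept = fit_line(prepared)
report = store(prepared, slope, intercept)
```

With the sample data, the filter drops the row with a negative `x`, and the remaining three points lie on a line with slope 2 and intercept 1. Each function plays the role of one component, and the order of the calls plays the role of the connectors.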

Related Topics: Reading Data, Preparing Data, Applying Algorithms, Storing Results of the Analysis


5.2.1 Reading Data


To analyze data, you need to read data from relational databases, flat files, or SAP BusinessObjects universes using Data Reader components. You can read data from the following data sources:
- Flat files: Excel, CSV, TEXT, and DAT files
- Relational databases: Sybase SQL Anywhere, MySQL, MS SQL Server, Oracle, SAP HANA (both relational database and OLAP cube)
- SAP BusinessObjects Universes: XI 3.1 Universes, BI 4.0 Universes (.unv files)
1. From the File menu, choose New > Analysis. The components panel in the analysis editor displays the list of components available in the Predictive Analysis Service repository.
2. Add the required data reader component to the analysis by double-clicking it in the components panel. Alternatively, you can drag the required data reader component from the components panel onto the analysis editor.
   Note: If you select the HANA Reader component, only in-DB specific data preparation, algorithm, and writer components are displayed in the components panel.
3. Select the data reader component in the analysis. In the properties view, configure component properties:
   a. On the Settings tab, enter the necessary details for the data reader component properties.
   b. On the General tab, enter the display name and description for the data reader component. The component display name is the component name that appears in the analysis editor.
4. To preview the data in the component, choose the preview icon in the properties view.
As you build an analysis, you can check the flow of the data along components by previewing data in each component. Previewing data provides you with a snapshot of the data at each component and helps you confirm that the analysis is progressing in the right direction.
Related Topics: BI4 Unv Reader, CSV Reader, Excel Reader, HANA Reader, JDBC Reader, XI3 Unv Reader


5.2.2 Preparing Data


This is an optional step. In many cases, the raw data from the data source may not be suitable for analysis. For accurate results, you may need to prepare and process the data before analysis. Data preparation involves checking data for accuracy and missing fields, filtering data based on range values, sampling the data to investigate a subset of data, and manipulating data. You can process data using data preparation components.
1. From the components panel, select the Data Preparation tab.
2. Add the required data preparation component to the analysis by double-clicking it in the components panel. The component is automatically added to the analysis and linked to the component that is selected in the analysis editor, provided the components are compatible. Alternatively, you can drag the required data preparation component from the components panel onto the analysis editor. When you drag a component from the components panel very close to another component in the analysis, and the dragged component is compatible with the existing component, the analysis editor automatically detects an association and suggests a connection between the components. You can choose to retain the suggested connection or connect the components manually.
3. Select the data preparation component in the analysis. In the properties view, configure component properties:
   a. On the Settings tab, enter the necessary details for the data preparation component properties.
   b. On the General tab, enter the display name and description for the data preparation component. The component display name is the component name that appears in the analysis.
4. To preview the data in the component, choose the preview icon in the properties view.
Related Topics: Data Type Definition, Filter, Formula, Sample

5.2.3 Applying Algorithms


Once you have the relevant data for analysis, you need to apply appropriate algorithms to determine patterns in the data to be analyzed. Determining an appropriate algorithm to use for a specific purpose is a challenging task. You can use a combination of various algorithms to analyze data. For example, you can first use time series algorithms to smooth data and then use regression algorithms to find trends. The following table provides information on which algorithm to choose for specific purposes:
Purpose | Algorithms to use
Performing time-based predictions | Time Series Algorithms: Triple Exponential Smoothing, R-Single Exponential Smoothing, R-Double Exponential Smoothing, R-Triple Exponential Smoothing
Predicting continuous variables based on other variables in the dataset | Regression Algorithms: Linear Regression, Exponential Regression, Geometric Regression, Logarithmic Regression, HANA Multiple Linear Regression, R-Linear Regression, R-Exponential Regression, R-Geometric Regression, R-Logarithmic Regression, R-Multiple Linear Regression
Finding frequent itemset patterns in large transactional datasets to generate association rules | Association Algorithms: HANA Apriori, R-Apriori
Clustering observations into groups of similar itemsets | Clustering Algorithms: HANA K-Means, K-Means
Classifying and predicting one or more discrete variables based on other variables in the dataset | Classification Algorithms or Decision Trees: HANA C4.5, R-CNR Tree
Detecting outlying values in the dataset | Outlier Detection Algorithms: Inter Quartile Range, Nearest Neighbour Outlier
Forecasting, classification, and statistical pattern recognition | Neural Network Algorithms: R-NNet Neural Network, R-MONMLP Neural Network

1. From the components panel, select the Algorithms tab.
2. Add the required algorithm component to the analysis by double-clicking it in the components panel. The component is automatically added to the analysis and linked to the component that is selected in the analysis editor, provided the components are compatible. Alternatively, you can drag the required algorithm component from the components panel onto the analysis editor. When you drag a component from the components panel very close to another component in the analysis, and the dragged component is compatible with the existing component, the analysis editor automatically detects an association and suggests a connection between the components. You can choose to retain the suggested connection or connect the components manually.
3. Select the algorithm component in the analysis. In the properties view, configure component properties:
   a. On the Settings tab, enter the necessary details for the algorithm component properties.
   b. On the General tab, enter the display name and description for the algorithm component. The component display name is the component name that appears in the analysis editor.
For information on each algorithm and its properties, see the "Algorithms" section in the Component Properties chapter.
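The idea of combining algorithms, mentioned at the start of this section (smooth the data with a time series algorithm, then find the trend with a regression algorithm), can be sketched in plain Python. This is an illustration under stated assumptions, not the product's implementation: it uses single exponential smoothing with a hypothetical smoothing factor alpha, followed by an ordinary least-squares line fitted against the period index.

```python
from typing import List, Tuple


def exponential_smoothing(values: List[float], alpha: float) -> List[float]:
    """Single exponential smoothing; the first value seeds the series."""
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def linear_trend(values: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept of values against periods 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


sales = [10.0, 14.0, 11.0, 17.0, 15.0, 21.0]  # hypothetical sample data
smoothed = exponential_smoothing(sales, alpha=0.5)
slope, intercept = linear_trend(smoothed)
```

A positive slope on the smoothed series suggests an upward trend that is less sensitive to period-to-period noise than a fit on the raw values.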

5.2.4 Storing Results of the Analysis


This is an optional step. After executing an analysis, you can store the results of the analysis in flat files or databases for further analysis using Data Writer components.


1. From the components panel, select the Data Writers tab.
2. Add the required data writer component to the analysis by double-clicking it in the components panel. The component is automatically added to the analysis and linked to the component that is selected in the analysis editor, if the components are compatible. Alternatively, you can drag the required data writer component from the components panel onto the analysis editor. When you drag a component from the components panel very close to another component in the analysis, and the dragged component is compatible with the existing component, the analysis editor automatically detects an association and suggests a connection between the components. You can choose to retain the suggested connection or connect the components manually.
3. Select the data writer component in the analysis. In the properties view, configure component properties:
   a. On the Settings tab, enter the necessary details for the data writer component properties.
   b. On the General tab, enter the display name and description for the data writer component. The component display name is the component name that appears in the analysis.
Related Topics: CSV Writer, JDBC Writer, HANA Writer

5.3 Running the Analysis


To run the analysis, from the Run menu, choose Run Analysis. Alternatively, you can run the analysis in the following ways:
- Right-click the component and choose Run Analysis.
- In the toolbar, choose the corresponding icon.
- Choose the corresponding icon in the analysis editor.

If your analysis is very large and complex, you can run it component by component and analyze the data at each stage. To run a part of the analysis, from the Run menu, select Run till here. Alternatively, you can run the analysis partially in the following ways:
- Right-click the component and choose Run till here.
- In the toolbar, choose the corresponding icon.
- In the analysis, choose the corresponding icon from the component's hover menu.


5.4 Saving the Analysis


After creating an analysis, you can save it for future reuse. Saving an analysis includes saving its metadata.
Note: Results of component execution are not saved with analyses. You need to execute the analysis again to view component results.
1. Choose File > Save.
2. Enter the name for the analysis.
3. Select a folder to save the analysis.
4. Choose OK.

5.5 Viewing Results


After running the analysis, SAP Predictive Analysis notifies you of the execution status of the analysis. After successful execution, the designer prompts you to switch to the visualization perspective to view the result of each component in the analysis. To view the results immediately, choose Yes. To view the results later, choose No. Alternatively, you can view the results in the following ways:
- From the Component menu, choose View Result.
- In the toolbar, choose the corresponding icon.
- Right-click the component and choose View Result.
- In the analysis, click the corresponding icon in the component's hover menu.


Analyzing Data

SAP Predictive Analysis provides various visualization techniques for analyzing data and results of an execution. To analyze data, switch to the visualization perspective by choosing the Visualization button in the toolbar. After the successful execution of the analysis, the result of each component in the analysis is represented using different visualization techniques. The following table summarizes components and their supported visualization techniques.
Components | Visualization Techniques
Data Readers and Data Preparation | Table, Scatter Plot Chart, Summary Statistics, Parallel Coordinates, and Ad hoc Chart
Clustering Algorithms | Table, Cluster Graph, Algorithm Summary, and Ad hoc Chart
Classification and Regression Algorithms | Table, Decision Tree, Algorithm Summary, and Ad hoc Chart
Time Series Algorithms | Table, Time Series Graph, Algorithm Summary, and Ad hoc Chart
Regression Algorithms | Table, Regression Graph, Algorithm Summary, and Ad hoc Chart

The visualization of data reader components enables you to understand the data and look for patterns in the data. The visualization of data preparation components enables you to ensure that you have the appropriate data for the analysis. The visualization of algorithms helps you to analyze algorithm results. To analyze data, perform the following steps:
1. Switch to the visualization perspective by choosing the Visualization button in the toolbar.
2. From the Executed Analyses view, double-click the required component in the analysis to view its visualization. The first five continuous attributes (columns) from the component are selected and visualizations of the selected attributes are displayed. However, you can also select the attributes required for analysis from the Attribute Selection view and choose Apply.
Note: The following table describes the data size limit for visualizations.

Visualization | No. of cells/rows supported
Table | No limit
Scatter Plot | 150,000 cells
Cluster Graph | 200,000 cells
Parallel Coordinates | 400,000 cells
Ad Hoc Chart | 7,500 cells
Regression Graph | 100,000 rows
Time Series Graph | 100,000 rows
Statistical Summary | No limit
Decision Tree | 32 categorical values
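A quick way to estimate whether a dataset fits within these limits is sketched below. It is an illustration only and rests on two assumptions not stated in the guide: that the number of cells equals rows multiplied by selected columns, and that the limits are inclusive. The Decision Tree limit is left out because it depends on categorical values rather than on dataset size.

```python
# Limits copied from the table above.
CELL_LIMITS = {
    "Scatter Plot": 150_000,
    "Cluster Graph": 200_000,
    "Parallel Coordinates": 400_000,
    "Ad Hoc Chart": 7_500,
}
ROW_LIMITS = {"Regression Graph": 100_000, "Time Series Graph": 100_000}
NO_LIMIT = {"Table", "Statistical Summary"}


def within_limit(visualization: str, rows: int, columns: int) -> bool:
    """Return True if the dataset size is within the documented limit."""
    if visualization in NO_LIMIT:
        return True
    if visualization in CELL_LIMITS:
        # Assumption: cells = rows x selected columns.
        return rows * columns <= CELL_LIMITS[visualization]
    if visualization in ROW_LIMITS:
        return rows <= ROW_LIMITS[visualization]
    raise ValueError(f"no size limit recorded for {visualization!r}")
```

For example, under these assumptions a scatter plot of five attributes would accept up to 30,000 rows.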

You can change the background of the chart by selecting a required template from the Template drop-down list. To apply a different color to the chart, select the required color from the Palette drop-down list.

6.1 Visualization Techniques

6.1.1 Table
Table is the default visualization for all components. The table view displays data in a tabular format. After the execution of the analysis, most algorithms add new columns to the output data. The new columns can be easily distinguished from the source columns using highlighting options. The following table summarizes the new columns that algorithms add.

Algorithms | New Columns
Outlier Algorithms | If you select the output mode as "Show Outliers", a new column (Outlier Column) is added with Boolean values 1 and 0. 1 indicates that the value is an outlier and 0 indicates that the value is not an outlier.
Regression Algorithms | If you select the output mode as "Trend", a new column is added that contains predicted values.
Clustering Algorithms | A new column is added that contains the cluster number to which each row belongs.
Time Series Algorithms | Three new columns (Year, Trend/Forecast values, and Period) are added.
Association Algorithms | Four new columns (Rules, Support, Confidence, and Lift) are added.
Classification and Regression Algorithms | If you select the output mode as "Trend", a new column (Predicted values) is added.
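To make the 1/0 outlier column concrete, the following sketch flags values with an inter quartile range rule. It is an assumption-laden illustration, not the product's algorithm: the fence multiplier of 1.5 and the quartile method (the default of Python's statistics.quantiles) are choices made here, since the guide does not specify them in this section.

```python
import statistics
from typing import List


def outlier_column(values: List[float], fence: float = 1.5) -> List[int]:
    """Return 1 for values outside the IQR fences and 0 otherwise."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    return [1 if value < low or value > high else 0 for value in values]


values = [10, 12, 11, 13, 12, 95, 11]  # hypothetical source column
flags = outlier_column(values)         # new column with 1 (outlier) or 0
```

Here only the value 95 falls outside the fences, so it is the only row flagged with 1.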

Highlighting Options
Each component in an analysis supports different highlighting options. The highlighting options are: Highlight Null Values, Highlight Newly Added Columns, and Highlight Outliers.

6.1.2 Scatter Plot


Scatter plot charts are matrices of charts (n*n charts, where n is the number of selected attributes) used to compare data across different dimensions. By default, a maximum of five continuous attributes starting from the first attribute from the source data are selected for analysis and a 5*5 matrix of charts is plotted. However, you can manually select the required attributes from the Attribute Selection view and refresh the visualization by choosing Apply. To view individual charts, double-click the required chart and double-click again to return. Scatter plot charts are plotted for each attribute combination and histograms are plotted for the individual attribute.


6.1.3 Statistical Summary


Statistical Summary provides summary information for continuous attributes in the data source. The summary information includes count, minimum value, maximum value, variance, standard deviation, sum, average, range, and number of records. A histogram chart is plotted for each attribute.
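The measures listed above can be reproduced for a single continuous attribute with Python's standard library. This is a sketch for orientation only; it assumes sample (n - 1) variance and standard deviation, because the guide does not say whether the product reports sample or population statistics.

```python
import statistics
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Compute summary measures for one continuous attribute."""
    return {
        "count": len(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "average": statistics.mean(values),
        "range": max(values) - min(values),
        # Sample variance and standard deviation (assumption).
        "variance": statistics.variance(values),
        "standard deviation": statistics.stdev(values),
    }


summary = summarize([2.0, 4.0, 4.0, 6.0])  # hypothetical attribute values
```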

6.1.4 Parallel Coordinates


Parallel Coordinates is a visualization technique used to visualize multi-dimensional data and look for multivariate patterns in the data for analysis. In this chart, by default, the first six attributes are represented as vertically-spaced parallel axes. The Attribute Selection view allows you to choose the subset of attributes to be viewed in the chart. Each axis is labeled with the attribute name, and minimum and maximum values for attributes. Each observation is represented as a series of connected points along the parallel axes. You can click on the axis to view the value of data at that point.

From the Categorial Attributes drop-down list, you can select an attribute based on which you want to determine the pattern. The Categorial Attributes drop-down list contains string or non-continuous values. Each distinct value of the categorical attribute is associated with a color. If the dataset does not contain any categorical attributes, the parallel coordinates are displayed in blue.

6.1.5 Ad Hoc Chart


The Ad Hoc chart enables you to create various kinds of interactive charts. To create a chart, perform the following steps:
1. Under Available objects, double-click the required objects. The objects are added to both the Labels axis (X-axis) and Values axis (Y-axis) and a column chart is rendered by default.
   Note: Specify the number of rows in the Max Rows field and choose Refresh.
2. From the Choose Chart drop-down list, choose the required chart type.
3. To filter values of an attribute, drag the required attribute from Available objects onto the Filters pane, choose the required operator from the drop-down list, and select a value in the range slider. The chart is updated based on your selection.
4. To view distinct values in an attribute, switch to the Distinct values tab.
5. To view the raw data in the data source, switch to the Raw Data tab.
   Note: The Raw Data tab does not appear if the source data is large.
6. To save the chart as an image, perform the following steps:
   a. Choose Save as image.
   b. In the "File Name" field, choose Browse.
   c. Enter a name for the image and choose Save.
   d. Enter a title for the chart.
   e. Choose the required background color.
   f. Enter image width and image height.
   g. Choose OK.

6.1.6 Algorithm Summary


Algorithm Summary provides summary information for the algorithm execution. It contains information about the input source columns used in the algorithm. It also contains measures such as least squares, F-statistics, confidence level, and various other parameters, depending on the type of the algorithm, that indicate the efficiency of the algorithm. This information helps you to understand whether the algorithm is the best choice for the given dataset.

6.1.7 Time Series Graph


The Time Series graph enables you to visualize time series data in comparison with the fitted or predicted values from the algorithm. Using this graph, you can view the data forecasted over a specified period. In trend mode, a dependent variable is represented as a bar graph and trend values are represented as a line graph. In predict mode, a dependent variable is represented as a bar graph and predicted values are represented as a line graph. When you use a time series model for creating an analysis, only predicted values are represented as a line graph.


If the dataset is very large, the graph may be unclear. For better visibility of data, use the Range slider located at the bottom of the graph to select a specific data range from the large dataset. The data between the two drag handles of the range slider is represented in the visualization editor.

6.1.8 Regression Graph


The Regression graph is used to visualize the correlation between the dependent and independent variables. In trend mode, you can analyze the performance of the algorithm by comparing the actual dependent variables with predicted values, where dependent variables are represented as a bar graph and predicted values are represented as a line graph. In fill mode, the algorithm fills the missing values and displays the output as a line graph.


If the dataset is very large, the graph may be unclear. For better visibility of data, use the Range slider located at the bottom of the graph to select a specific data range from the large dataset. The data between the two drag handles of the range slider is represented in the visualization editor.

6.1.9 Cluster Graph


A cluster graph is a visualization technique that uses different charts to represent cluster information such as cluster size, cluster density and distance, cluster variable comparison, and cluster comparison. Note: If you use HANA K-Means algorithm to cluster observations, then only cluster size and cluster variable comparison information are represented as charts.


Cluster Size
Cluster size is the number of elements in each cluster and is represented by a horizontal bar chart. However, you can also visualize the cluster size in a pie chart or a vertical bar chart.
Cluster Density and Distance
The distance between clusters and the density of each cluster are represented by a network chart. Each node in the network represents a cluster and its size. The color of the node represents density. You can enlarge the network chart by choosing the corresponding icon.
Cluster Variable Comparison
The comparison of the total distribution of all clusters against the distribution of each cluster is represented by a histogram. You can select the required attribute of the cluster from the variable drop-down list and change the cluster using the slider.
Cluster Comparison
The R-K Means algorithm computes center points for each input attribute in each cluster. The comparison of each center point and cluster is represented by the radar chart.
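The quantities behind the Cluster Size and Cluster Comparison charts (elements per cluster and a center point per input attribute) can be illustrated with a very small k-means routine. This is a teaching sketch, not the R or HANA implementation: it initialises the centers with the first k points and runs a fixed number of iterations, both of which are simplifying assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: Sequence[Point], k: int, iterations: int = 10):
    """Return (cluster number per point, center point per cluster)."""
    centers: List[Point] = list(points[:k])  # naive initialisation (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = k_means(points, k=2)
sizes = [labels.count(c) for c in range(2)]  # what the Cluster Size chart plots
```

The labels list corresponds to the cluster number column added to the output table, and centers holds one value per input attribute for each cluster.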

6.1.10 Decision Tree


A decision tree is a visualization technique that enables you to classify observations into groups and predict future events based on the set of decision rules. This presentation is used for decision tree analysis. In this technique, a binary decision tree is built by splitting observations into smaller sub-groups until the stopping criterion is met. The leaf node indicates classified data. You can enlarge the decision tree by choosing the zoom-in button.
Note: A decision tree cannot be rendered if there are more than 32 categorical values for a dependent column.
The look and feel of the decision tree differs based on the algorithm vendor. For example, the decision tree for the R-CNR Tree algorithm differs from that for the HANA C4.5 algorithm.


Each node in the decision tree represents the classification of data at that level. You can view node contents by choosing on each node.


Working with Models

A model is a reusable component created by training an algorithm using historical data and saving the instance. You create models for the following reasons:
- To share computed business rules that can be applied to similar data
- To quickly analyze results without the historical data by using the trained instance of the algorithm

7.1 Creating a Model


You can create a model by saving the state of a trained algorithm.
1. In the Components panel, choose Algorithms.
2. Drag the required algorithm component onto the analysis editor or double-click the required algorithm component.
3. In the analysis editor, right-click the algorithm component and choose Properties.
4. On the Settings tab of the properties view, configure component properties.
   a. Enter the necessary property values for the algorithm.
   b. Choose Save as Model.
   c. Enter model name and description.
   d. If you want to overwrite the existing model with a new model, select Model Overwrite.
   Note: A model created by using a different algorithm but with the same name will not be overwritten.
5. On the General tab of the properties view, enter the component display name and description.
6. Choose Run Analysis.
The model is created and saved in the metadata repository (MDR). This model can be used just like any other component for creating an analysis.
Note: Independent column names used while scoring the model should be the same as independent column names used while creating the model.
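The notion of a model as the saved state of a trained algorithm, and the requirement that scoring data use the same independent column names, can be sketched as follows. Everything here is hypothetical: the JSON format, the function names, and the column names are illustration choices and say nothing about how the metadata repository stores models.

```python
import json
from typing import Dict, List


def train_linear_model(rows: List[Dict[str, float]], independent: str, dependent: str) -> dict:
    """Fit y = intercept + slope * x by least squares and return the state."""
    n = len(rows)
    mean_x = sum(r[independent] for r in rows) / n
    mean_y = sum(r[dependent] for r in rows) / n
    sxx = sum((r[independent] - mean_x) ** 2 for r in rows)
    sxy = sum((r[independent] - mean_x) * (r[dependent] - mean_y) for r in rows)
    slope = sxy / sxx
    return {"independent": independent, "slope": slope, "intercept": mean_y - slope * mean_x}


def score(model: dict, rows: List[Dict[str, float]]) -> List[float]:
    """Apply a saved model; the independent column name must match training."""
    column = model["independent"]
    if any(column not in r for r in rows):
        raise KeyError(f"independent column {column!r} is missing from the scoring data")
    return [model["intercept"] + model["slope"] * r[column] for r in rows]


history = [
    {"ad_spend": 1.0, "sales": 3.0},
    {"ad_spend": 2.0, "sales": 5.0},
    {"ad_spend": 3.0, "sales": 7.0},
]
saved = json.dumps(train_linear_model(history, "ad_spend", "sales"))  # saved state
model = json.loads(saved)  # later: reuse without the historical data
predictions = score(model, [{"ad_spend": 4.0}])
```

Scoring rows that name the column differently (for example "spend") raise an error in this sketch, which mirrors the note above about matching column names.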


7.2 Viewing Model Information


Model information contains the saved state of the trained algorithm. This information is helpful for data analysts to understand the structure of the model. Model information includes:
- Column details such as which columns were used for generating the model
- Summary of the algorithm
1. In the Components panel of the analysis editor, select Models.
2. Drag the required model onto the component chain editor.
3. Right-click the model and choose View Model Information.
Based on the algorithm selected to generate the model, the corresponding visualization is displayed.

7.3 Exporting a Model as PMML


You can export the model information into a local file in the industry-standard Predictive Modeling Markup Language (PMML) format and share the model with PMML-compliant applications for the analysis of similar data.
1. Create an analysis.
2. Save the analysis as a model.
3. In the Component panel, choose Models.
4. Drag the model onto the analysis editor.
5. Right-click the model and choose Export Model.
6. Enter a name for the file.
7. Select the file type, either PMML or XML, as required.
8. Click Save.
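For readers unfamiliar with PMML, the sketch below builds a minimal PMML document for a simple linear regression using Python's standard XML library. It is an illustration of the general shape of the format only: the PMML version (4.1 is assumed here), the field names, and the coefficient values are hypothetical, and the file that SAP Predictive Analysis exports may differ in structure and detail.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def to_pmml(target: str, coefficients: Dict[str, float], intercept: float) -> str:
    """Serialize a linear regression as a minimal PMML document."""
    # PMML 4.1 namespace is an assumption; the guide does not state a version.
    pmml = ET.Element("PMML", version="4.1", xmlns="http://www.dmg.org/PMML-4_1")
    ET.SubElement(pmml, "Header", description="Illustrative linear regression")
    dictionary = ET.SubElement(
        pmml, "DataDictionary", numberOfFields=str(len(coefficients) + 1)
    )
    for name in [*coefficients, target]:
        ET.SubElement(dictionary, "DataField", name=name, optype="continuous", dataType="double")
    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", name=name)
    ET.SubElement(schema, "MiningField", name=target, usageType="predicted")
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for name, value in coefficients.items():
        ET.SubElement(table, "NumericPredictor", name=name, coefficient=str(value))
    return ET.tostring(pmml, encoding="unicode")


document = to_pmml("sales", {"ad_spend": 2.0}, intercept=1.0)
```

A PMML-compliant consumer reads the DataDictionary to learn the field names and the RegressionTable to recover the intercept and coefficients.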

7.4 Deleting a Model


We recommend that you use this option with caution, since deleting a model might make any analysis that references the model unusable. To delete a model, perform the following steps:
1. In the components panel, choose Models.
2. Right-click the required model and choose Delete.


Component Properties

This section describes components and their properties.

8.1 Data Readers


Use data reader components to read data from different kinds of data sources such as flat files, relational databases, SAP BusinessObjects universes, and SAP HANA. These components are mandatory to create any analysis.

8.1.1 BI4 Unv Reader


The BI4 Unv Reader component enables you to read data from universes (.unv files) which are stored in the SAP BusinessObjects BI 4.0 platform. Note: In this release, reading data from .unx universes is not supported. Therefore, when you connect to the CMS using BI4 Unv Reader, only .unv universes are displayed.


BI4 Unv Reader Properties


Property Name | Description | Default Value | Possible Values | Input Required?
Host Name | Name of the machine on which the Central Management Server (CMS) is running. | None | Any string value. For example, IP address of the machine or machine name. | Mandatory
User Name | CMS user name. | None | Any string value. | Mandatory
Password | CMS password. | None | Any string value. | Optional
Authentication Type | CMS authentication type. | Enterprise | Enterprise, LDAP, Windows AD | Mandatory

Supported features in BI4 Unv Reader
Universe or Query Features | Supported in Release 1.0?
Context | Yes
Predefined filters | Yes
Query level filter - Constant | Yes
Query level filter - LOV | Yes
Query level filter - Objects | Yes
Universe level Prompts | Yes
Cascading Prompts | Yes
Object level security | Yes
Row level security | Yes
Linked universe | Yes
Query on Query | Not Available
Sub Query | Not Available
View/Edit SQL | Not Available
Query level Prompts | Not Available
Multi Queries | Not Available
Combined Queries | Not Available
Database Ranking | Not Available

To read data from a universe stored in the SAP BusinessObjects BI 4.0 platform, perform the following steps:
1. Drag the BI4 Unv Reader component onto the analysis editor.
2. Right-click the BI4 Unv Reader component, and select Properties.
3. Enter the host name (CMS name).
   Note: Before establishing a connection to the Central Management Server (CMS), ensure that you enter the host name details in the Microsoft Windows hosts file. The hosts file is located on your system at the following location: <Win_Install_Dir>\System32\drivers\etc
4. Enter the CMS user name, CMS password, and authentication type.
5. Choose Test Connection to validate the connection.
6. Choose Add Query. The "Universe Selection Wizard" is displayed with the list of universes available in the CMS.
7. Select the required universe and choose Next. The "Query Panel" is displayed with the list of all classes and data source objects.
8. To create a query, add the required data source objects and filter objects onto the Result Objects and Filters panes, respectively.
9. Choose Finish.
   If the query contains contexts, you are taken to the context tab to select contexts. Select the required context and choose Finish.
   If the query contains prompts, you are taken to the prompts tab to select prompts. Select the required prompt to obtain the corresponding list of values (LOV). Select the required list of values from the LOV list and choose Finish.
   If the query is validated successfully, you are taken to the properties view of the BI4 Unv Reader component.
If you want to choose a different universe for reading data, perform the following steps:
1. Choose Edit Query.
2. Choose the Back button.
3. Select the required universe.
To remove the query from the BI4 Unv Reader component, choose the Remove Query button.
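Before testing the connection, it can help to confirm that the CMS host name really appears in the hosts file mentioned in step 3. The sketch below checks hosts-file text for a name; it is a hypothetical helper, and the IP address and host name in the sample are invented for illustration.

```python
def hosts_entry_exists(hosts_text: str, host_name: str) -> bool:
    """Return True if host_name appears as a name or alias in hosts-file text."""
    for line in hosts_text.splitlines():
        # Drop comments, then split into the IP address and the names after it.
        entry = line.split("#", 1)[0].split()
        if host_name.lower() in (name.lower() for name in entry[1:]):
            return True
    return False


# Hypothetical hosts-file content; 10.0.0.15 and cmsserver are made up.
sample = """
# Sample hosts file
127.0.0.1       localhost
10.0.0.15       cmsserver   # hypothetical CMS machine
"""
found = hosts_entry_exists(sample, "cmsserver")
```

To check a real system, read the hosts file from the location given in step 3 and pass its text to the function.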

8.1.2 CSV Reader


The CSV Reader component enables you to read data from flat files such as CSV, Text, and DAT files. Using this component, you can change the name, data type, and date format of the source column.


CSV Reader Properties


Property Name | Description | Default Value | Possible Values | Input Required?
File Name | Fully qualified name of the CSV, Text, or DAT file. | None | C:\abc.csv | Mandatory
Column Separator | Column delimiter that separates data tokens in the file. | Comma(,) | Comma(,), Semi colon(;), Tab( ), Colon(:), Pipe(|), Space(" ") | Mandatory
Escape Quote Char | Specifies how single or double quotation marks must be treated on importing data. | None | Single('), Double("), None | Optional
First Row As Column Header | If set to true, the data in the first row of the file is used as the column name/header. Otherwise, CSV Reader creates column names such as column1, column2, and so on. | True | True, False | Optional
No. Lines for scanning Data Types | Number of lines to be used for scanning the data type of the data. | 10 | 1, 2, and so on | Optional
Number of Columns | First set of 'n' columns to be considered for importing. | None | 1, 2, and so on | Optional
Decimal Separator | Decimal separator used in the data source. | Period(.) | Period(.), Comma(,) | Optional
Grouping Separator | Grouping separator or thousands separator to be used in the data source. | Comma(,) | Period(.), Comma(,) | Optional
Trim Spaces | Discards leading and trailing spaces in the data. | None | None: 1, Left: 2, Right: 3, Both: 4 | Optional
Encoding | Text-encoding method. | None | UTF-8, None | Optional
Comment Character | Character that indicates the start of a comment. | # | #, <---, //// | Optional
Define String | Defines what value to be considered for missing strings. | False | Empty String (True): considers the missing string as empty string. Null String (False): considers the missing string as null string. | Optional

To change the name and data type of the source column, perform the following steps:
1. Drag the CSV Reader component onto the analysis editor.
2. To change the column name, in the Data Mappings tab of the properties view, enter an alias name for the required source column.
3. To change the data type of a column, in the Data Mappings tab of the properties view, select the required data type for the source column (primary data type).
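Several of the CSV Reader properties have close counterparts in Python's csv module, which makes their effect easy to try out. The sketch below is only an analogy, not the component's implementation: the parameter names are hypothetical, trimming is reduced to an on/off switch, and comment handling is done by skipping lines that start with the comment character.

```python
import csv
from typing import List, Tuple


def read_csv(
    text: str,
    separator: str = ",",
    quote_char: str = '"',
    first_row_as_header: bool = True,
    comment_char: str = "#",
    trim: bool = True,
) -> Tuple[List[str], List[List[str]]]:
    """Parse CSV text and return (column names, data rows)."""
    # Skip blank lines and lines that start with the comment character.
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith(comment_char)
    ]
    rows = list(csv.reader(lines, delimiter=separator, quotechar=quote_char))
    if trim:
        rows = [[cell.strip() for cell in row] for row in rows]
    if first_row_as_header:
        return rows[0], rows[1:]
    # Without a header row, generate names such as column1, column2, ...
    return [f"column{i + 1}" for i in range(len(rows[0]))], rows


sample = """# exported sales data (hypothetical)
region;amount
North; 1200.5
South; 980
"""
header, data = read_csv(sample, separator=";")
```

Calling read_csv with first_row_as_header=False on the same text yields the generated names column1 and column2 and keeps the first row as data.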

8.1.3 Excel Reader


The Excel Reader component enables you to read data from Microsoft Excel files (.xls, .xlsx, .xlsm). Using this component, you can change the name, data type, and date format of the source column.


Excel Reader Properties


Property Name | Description | Default Value | Possible Values | Input Required?
File Name | Name of the Excel file. | None | C:\abc.xls, C:\abc.xlsx, C:\abc.xlsm | Mandatory
Worksheet Name | Name of the sheet in the Excel file. | None | Sheet1, Sheet2 | Mandatory
First Row As a Column Header | If set to true, the data in the first row of the file is used as the column name/header. Otherwise, Excel Reader creates column names such as column1, column2, and so on. | True | True, False | Optional
Excel Select Mode | Mode to be considered for fetching data. | All fields (1) | All Fields (1): Selects all the fields in the Excel sheet. Field Range (2): Specify the range of data to be fetched. For example, A1:B5. Range Name (3): Provides option to select a predefined range. | Mandatory

To change the name and data type of the source column, perform the following steps:
1. Drag the Excel Reader component onto the analysis editor.
2. To change the column name, in the Data Mappings tab of the properties view, enter an alias name for the required source column.
3. To change the data type of a column, in the Data Mappings tab of the properties view, select the required data type for the source column (primary data type).

8.1.4 HANA Reader


The HANA Reader component enables you to read data from the SAP HANA database only. Using this component, you can understand the data, look for patterns, and visualize the data using various visualization techniques. Use this component if you want to use Predictive Analysis Library (PAL) algorithms provided by SAP HANA.


Note: If you choose to use HANA Reader to read the data, you cannot use SAP and R specific algorithms.

HANA Reader Properties

Property Name: Table Name
Description: The name of the table.
Default Value: None
Possible Values: The table name. For example, Customer
Input Required?: Mandatory

Property Name: Column Name
Description: Input columns to be considered for retrieving data.
Default Value: If you do not select any column, then all columns are selected by default.
Possible Values: Column names.
Input Required?: Optional

Property Name: Schema Name
Description: The name of the schema.
Default Value: None
Possible Values: Any String value
Input Required?: Mandatory

Property Name: Define Datasource
Description: Define tables/views or analysis views for retrieving data.
Default Value: Analysis Views
Possible Values: Tables and Views: Select this option if you want to retrieve data from tables or views. Analysis Views: Select this option if you want to retrieve data from attribute views, analytic views, and calculation views.
Input Required?: Mandatory

Following are the steps to perform an in-DB analysis using SAP HANA database:
1. From the components panel, drag the HANA Reader component onto the analysis editor.
   Note: If you select HANA Reader component, then only in-DB specific data preparation, algorithms, and writer components are available in the components panel.
   If you have not yet created any HANA connection, the "New Connection" wizard appears and allows you to create a new connection. Continue with step 2 to proceed.
   If you have already created HANA connections, then the default connection is selected for the analysis. You can change the associated connection to other existing connections by using the In-DB Connection tab in the analysis properties panel. Continue with step 5 to proceed.
2. Enter a name for the connection.
3. Enter the host name, port number, user name, and password.
4. Choose Finish. The newly created connection is selected for the analysis.
5. In the HANA Reader properties panel, perform the following steps:
   a. Select the required data source.
      Analysis Views: If you select this option, HANA Reader displays attribute, analytic, and calculation views.
      Tables and Views: If you select this option, HANA Reader displays the list of schemas available in the database. You need to select the required schema to display the list of tables and views.
   b. Select the required table or view or analysis view to display the list of columns.
   c. Select the required columns from the columns list.
      Note: If you do not select any column, then all columns are selected by default.

8.1.5 JDBC Reader


The JDBC Reader component enables you to read data from relational databases such as MySQL, MS SQL Server, MS Access, Oracle, SAP HANA, Sybase IQ, and Sybase SQL Anywhere.
Note: You need to configure SAP Predictive Analysis to use the corresponding JDBC driver. For configuration information, see the SAP Predictive Analysis Installation and Configuration Guide.
If you use the JDBC Reader component to read data from SAP HANA, you can only use SAP and R specific algorithms. To use Predictive Analysis Library (PAL) algorithms provided by SAP HANA, you need to use the HANA Reader component.

JDBC Reader Properties

Property Name: Database Type
Description: List of database names supported by Predictive Analysis service.
Default Value: None
Possible Values: MySQL, MS SQL Server, Oracle, DB2, Sybase SQL Anywhere, SAP HANA database, Sybase IQ
Input Required?: Mandatory

Property Name: Machine Name
Description: Name of the machine on which the database is installed.
Default Value: None
Possible Values: Any string value
Input Required?: Mandatory

Property Name: Port Number
Description: Database port number.
Default Value: None
Possible Values: For example, 3306, 30415, and so on
Input Required?: Mandatory

Property Name: Database Name
Description: Name of the database.
Default Value: None
Possible Values: Any string value
Input Required?: Mandatory

Property Name: User Name
Description: Database user name.
Default Value: None
Possible Values: Any string value
Input Required?: Mandatory

Property Name: Password
Description: Database password.
Default Value: None
Possible Values: Any string value
Input Required?: Optional

Property Name: Table Name
Description: Name of the table or view to be considered for retrieving the data.
Default Value: None
Possible Values: The table or view name. For example, Customer
Input Required?: Mandatory

Property Name: Column Name
Description: Input columns to be considered for retrieving data.
Default Value: If you do not select columns, then all columns are selected by default.
Possible Values: Column names
Input Required?: Optional

To read data from a relational database, perform the following steps:
1. From the components panel, drag the JDBC Reader component onto the analysis editor.
2. In the properties view, under Connection Credentials section, enter necessary details and choose Connect. If the connection to the database is established successfully, the tables and views of the database are loaded in the Table Details section.
3. Select a table or view from the list of tables and views.
4. Select the required columns.

8.1.6 XI3 Unv Reader


The XI3 Unv Reader component enables you to read data from universes which are stored in the SAP BusinessObjects XI 3.1 platform.

XI3 Unv Reader Properties

Property Name: Host Name
Description: Name of the machine on which the Central Management Server (CMS) is running.
Default Value: None
Possible Values: Any string value. For example, IP address of the machine or machine name.
Input Required?: Mandatory

Property Name: User Name
Description: CMS user name.
Default Value: None
Possible Values: Any string value.
Input Required?: Mandatory

Property Name: Password
Description: CMS password.
Default Value: None
Possible Values: Any string value.
Input Required?: Optional

Property Name: Authentication Type
Description: CMS authentication type.
Default Value: Enterprise
Possible Values: Enterprise, LDAP, Windows AD
Input Required?: Mandatory

Supported features in XI3 Unv Reader

Universe or Query Feature       Supported in Release 1.0?
Context                         Yes
Predefined filters              Yes
Query level filter - Constant   Yes
Query level filter - LOV        Yes
Query level filter - Objects    Yes
Universe level Prompts          Yes
Cascading Prompts               Yes
Object level security           Yes
Row level security              Yes
Linked universe                 Yes
Query on Query                  Not Available
Sub Query                       Not Available
View/Edit SQL                   Not Available
Query level Prompts             Not Available
Multi Queries                   Not Available
Combined Queries                Not Available
Database Ranking                Not Available

To read data from a universe stored in the SAP BusinessObjects XI 3.1 platform, perform the following steps:
1. Drag the XI3 Unv Reader component onto the analysis editor.
2. Right-click the XI3 Unv Reader component, and select Properties.
3. Enter the host name (CMS name).
   Note: Before establishing a connection to the Central Management Server (CMS), ensure that you enter the host name details in the Microsoft Windows hosts file. The hosts file is located on your system at the following location: <Win_Install_Dir>\System32\drivers\etc
   For example, enter 1.1.1.1 <host_name> in the hosts file.
4. Enter the CMS user name, CMS password, and authentication type.
5. Choose Test Connection to validate the connection.
6. Choose Add Query. The "Universe Selection Wizard" is displayed with the list of universes available in the CMS.
7. Select the required universe and choose Next. The "Query Panel" is displayed with the list of all classes and data source objects.
8. To create a query, add the required data source objects and filter objects onto the Result Objects and Filters panes, respectively.
9. Choose Finish.
   If the query contains contexts, you are taken to the context tab to select contexts. Select the required context and choose Finish.
   If the query contains prompts, you are taken to the prompts tab to select prompts. Select the required prompt to obtain the corresponding list of values (LOV). Select the required list of values from the LOV list and choose Finish.
   If the query is validated successfully, you are taken to the properties view of the XI3 Unv Reader component.

If you want to choose a different universe for reading data, perform the following steps:
1. Choose Edit Query.
2. Choose the Back button.
3. Select the required universe.

To remove the query from the XI3 Unv Reader component, choose the Remove Query button.

8.2 Data Preparation


Use data preparation components to prepare the data for analysis. These are optional components.


8.2.1 Data Type Definition


The Data Type Definition component enables you to change the name, data type, and date format of the source column. Defining the data type helps you prepare data so that it is suitable for further analysis. For example:
If the name of the column in the data source is "des", it may not be clear during analysis. You can change the name of the column to "Designation" in the analysis so that end users can easily understand it.
If the date is stored in the mmddyy format (120201, without any date separator), the system may treat it as an integer value. Using the Data Type Definition component, you can change the date format to any valid format such as mm/dd/yyyy or dd/mm/yyyy.

Data Type Definition Properties

Property Name: Column Name
Description: Name of the source column.
Default Value: The last modified name of the column or the name of the column in the source data.
Possible Values: None
Input Required?: No

Property Name: Alias Name
Description: New name for the source column.
Default Value: None
Possible Values: Any string value.
Input Required?: Optional

Property Name: Primary Data Type
Description: Data type of the column.
Default Value: System calculated type
Possible Values: String, Integer, Date, Double
Input Required?: Optional

Property Name: Date Format
Description: Date format pattern for columns of type date.
Default Value: None
Possible Values: dd/mm/yyyy, dd/mm/yy, dd-mm-yyyy, and so on.
Input Required?: Optional

To change the name and data type of the source column, perform the following steps:
1. Drag the JDBC Reader component onto the analysis editor and configure its properties.
2. Drag the Data Type Definition component onto the analysis editor.
3. To change the column name, in the properties view, enter an alias name for the required source column.
4. To change the data type of the column, in the properties view, select the required data type for the source column.
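The mmddyy example described for this component (120201 being read as an integer) can be reproduced in a few lines. The Python sketch below only illustrates the idea of declaring a date format; it is not how the component is implemented, and the variable names are chosen for this example.

```python
from datetime import datetime

raw = "120201"  # mmddyy with no separator, as it might arrive from the source

# Declaring the pattern lets the value be read as a date instead of an integer
parsed = datetime.strptime(raw, "%m%d%y").date()

# The same date rendered in two of the formats mentioned in this section
as_mm_dd_yyyy = parsed.strftime("%m/%d/%Y")
as_dd_mm_yyyy = parsed.strftime("%d/%m/%Y")
```

Here the two-digit year is expanded by Python's own rule for `%y`, which may differ from the rule the product applies.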

8.2.2 Filter


The Filter component enables you to filter rows and columns based on a specified condition.
Note: If you are performing an in-DB analysis and using the in-DB Filter component, the advanced filter editor is not supported. This means that the in-DB Filter component does not support functions and advanced expressions.
After configuring the Filter component, if you change the data source, the Filter component still retains the previously defined row filters.

Filter Properties

Property Name: Selected Columns
Description: Columns to be considered for applying a filter.
Default Value: None
Possible Values: Column indexes, separated by a comma. Any positive integer value >=0. If you do not specify any columns, all columns are selected by default.
Input Required?: Optional

Property Name: Filter Condition
Description: Filter condition.
Default Value: None
Possible Values: If you do not specify any conditions, all rows are selected by default.
Input Required?: Optional

Row Filter Operators
The following are the operators available in the Row Filter panel.

Equal to: Selects rows where the column value is equal to the given value.
Not Equal to: Selects rows where the column value is not equal to the given value.
Greater Than: Selects rows where the column value is greater than the given value.
Greater Than Equal to: Selects rows where the column value is greater than or equal to the given value.
Lesser Than: Selects rows where the column value is less than the given value.
Lesser Than Equal to: Selects rows where the column value is less than or equal to the given value.
Between: Selects rows where the column value is between the given values. For example, Column1 BETWEEN 25 AND 45 selects the rows where Column1 has values that are greater than or equal to 25 and less than or equal to 45.
Not Between: Selects rows where the column value is not between the given values. For example, Column1 NOT BETWEEN 25 AND 45 selects the rows where Column1 has values that are less than 25 or greater than 45.
In List: Selects rows where the column value belongs to the list of given values. For example, Column1 IN LIST 25;45;23;12 selects rows where Column1 has a value equal to 25, 45, 23, or 12.
Not In List: Selects rows where the column value does not belong to the list of given values. For example, Column1 NOT IN LIST 25;45;23;12 selects rows where Column1 has a value that is not equal to 25, 45, 23, or 12.
Is Null: Selects rows where the column value is NULL.
Is Not Null: Selects rows where the column value is not NULL.
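The boundary behavior of Between, Not Between, and In List is easy to misread, so the following Python sketch restates it as predicates. The helper names and the sample Column1 values are made up for illustration, and excluding NULL values before comparing is an assumption, since this guide does not describe how comparisons treat NULL.

```python
def between(value, low, high):
    """BETWEEN is inclusive at both ends."""
    return low <= value <= high

def in_list(value, spec):
    """IN LIST takes semicolon-separated values, for example '25;45;23;12'."""
    return value in {int(item) for item in spec.split(";")}

column1 = [12, 25, 30, 45, 46, None]        # hypothetical Column1 values
known = [v for v in column1 if v is not None]  # assumption: NULLs never match

between_rows = [v for v in known if between(v, 25, 45)]
not_between_rows = [v for v in known if not between(v, 25, 45)]
in_list_rows = [v for v in known if in_list(v, "25;45;23;12")]
is_null_count = sum(v is None for v in column1)
```

The values 25 and 45 satisfy Between because both limits are included, while Not Between keeps only values strictly outside that range.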

Example:

Store      Revenue  Profit
Land Mark  10000    1000
Spencer    20000    4500
Soch       25000    8000

1. Remove the "Store" column from the right pane of Selected Columns.
2. In the Row Filter panel, double-click the Profit column or drag the Profit column onto the Filters pane.
3. Select the Greater than operator from the operator drop-down list.
4. Enter 2000 in the text box.
5. Execute the analysis.

Revenue  Profit
20000    4500
25000    8000
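The example above combines a column selection with one row filter. As a rough equivalent outside the product, the same result can be produced with the following Python sketch; the list-of-dictionaries representation is an assumption made for this illustration.

```python
rows = [
    {"Store": "Land Mark", "Revenue": 10000, "Profit": 1000},
    {"Store": "Spencer", "Revenue": 20000, "Profit": 4500},
    {"Store": "Soch", "Revenue": 25000, "Profit": 8000},
]

selected_columns = ["Revenue", "Profit"]  # "Store" has been removed

# Keep rows where Profit is greater than 2000, then project the selected columns
result = [
    {column: row[column] for column in selected_columns}
    for row in rows
    if row["Profit"] > 2000
]
```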

Note: The Filter component supports only expressions that return a Boolean result. For example, in the Employee table below:

Emp ID  Emp Name  DOB         Age  Date of Joining  Date of Confirmation
1       Laura     11/11/1986  25   12/9/2005        27/11/2005
2       Desy      12/5/1981   30   24/6/2000        10/7/2000
3       Alex      30/5/1978   33   10/10/1998       24/10/1998
4       John      6/6/1979    32   2/12/1999        20/12/1999

The expression DAYSBETWEEN([Date of Joining],[Date of Confirmation]) is not a valid filter expression because it returns a numeric value. The correct usage of the DAYSBETWEEN expression in a filter is DAYSBETWEEN([Date of Joining],[Date of Confirmation]) == 14. This expression selects those rows where the number of days between "Date of Joining" and "Date of Confirmation" is 14. For the Employee table above, the third row is selected.
DAYNAME([Date of Joining]) == 'Saturday' selects the second and third rows in the Employee table.
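The two filter expressions above can be checked against the Employee table with a short Python sketch. The helper functions `days_between` and `day_name` are stand-ins written for this example and assume that the dates use the dd/mm/yyyy pattern; they are not the product's DAYSBETWEEN and DAYNAME functions.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def parse(text):
    """Parse a dd/mm/yyyy date as used in the Employee table."""
    return datetime.strptime(text, "%d/%m/%Y").date()

def days_between(start, end):
    return (parse(end) - parse(start)).days

def day_name(text):
    return DAY_NAMES[parse(text).weekday()]

# (Emp Name, Date of Joining, Date of Confirmation)
employees = [
    ("Laura", "12/9/2005", "27/11/2005"),
    ("Desy", "24/6/2000", "10/7/2000"),
    ("Alex", "10/10/1998", "24/10/1998"),
    ("John", "2/12/1999", "20/12/1999"),
]

# Each filter needs a Boolean test, so the numeric result is compared with a value
confirmed_after_14_days = [n for n, j, c in employees if days_between(j, c) == 14]
joined_on_saturday = [n for n, j, _ in employees if day_name(j) == "Saturday"]
```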

When entering a String literal that contains single quotation marks, each single quotation mark inside the String literal must be escaped with a backslash character. For example, 'Customer's' must be represented as 'Customer\'s'. When entering a column name that contains square brackets, each square bracket inside the column name must be escaped with a backslash character. For example, [Customer[Age]] must be represented as [Customer\[Age\]].
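If expressions are assembled by a script before being pasted into the editor, the two escaping rules can be applied mechanically. The following Python helpers are hypothetical and only cover the two cases named above; whether a backslash inside a value needs its own escaping is not stated in this guide, so the sketch leaves backslashes untouched.

```python
def escape_string_literal(text):
    """Wrap text in single quotes, escaping each embedded single quote."""
    return "'" + text.replace("'", "\\'") + "'"

def escape_column_name(name):
    """Wrap a column name in square brackets, escaping embedded brackets."""
    escaped = name.replace("[", "\\[").replace("]", "\\]")
    return "[" + escaped + "]"

literal = escape_string_literal("Customer's")
column = escape_column_name("Customer[Age]")
```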

Supported Functions
Note: The Filter component does not support data manipulation functions.

Date functions:
DAYSBETWEEN: Returns number of days between two dates.
CURRENTDATE: Returns the current system date.
MONTHSBETWEEN: Returns months between two dates. For example, the new column contains 2,0,2,0 when MONTHSBETWEEN([Date of Joining],[Date of Confirmation]) is applied on the Employee table.
DAYNAME: Returns the day name in the string format. For example, the new column contains Monday, Saturday, Saturday, Thursday when DAYNAME([Date of Joining]) is applied on the Employee table.
DAYNUMBEROFMONTH: Returns the day number of the particular month. For example, 12/11/1980 returns 12.
DAYNUMBEROFWEEK: Returns the day number in a week. For example, Sunday=1, Monday=2.
DAYNUMBEROFYEAR: Returns the day number in a year. For example, 1st Jan=1, 1st Feb=32, 3rd Feb=34.
LASTDATEOFWEEK: Returns the date of last day in a week. For example, 12/9/2005 returns 17/9/2005.
LASTDATEOFMONTH: Returns the date of last day in a month. For example, 12/9/2005 returns 30/9/2005.
MONTHNUMBEROFYEAR: Returns the month number in a date. For example, Jan=1, Feb=2, Mar=3, and so on.
WEEKNUMBEROFYEAR: Returns the week number in a year. For example, 12/9/2005 returns 38.
QUARTERNUMBEROFDATE: Returns the quarter number in a date. For example, 12/9/2005 returns 3.

String functions:
CONCAT: Concatenates two strings. For example, CONCAT('USA', 'Australia') returns USAAustralia.
INSTR: Returns true, if the search string is found in the source string. For example, INSTR('USA', 'US') returns true.
SUBSTRING: Returns substring from the source string. For example, SUBSTRING('USA', 1,2) returns US.

Math functions:
MAX: Returns the maximum value in a column.
MIN: Returns the minimum value in a column.
COUNT: Returns the number of values in a column.
SUM: Returns the sum of the values in a column.
AVERAGE: Returns the average of the values in a column.

Conditional Expression:
IF(condition) THEN(string expression/mathematical expression/conditional expression) ELSE(string expression/mathematical expression/conditional expression): Checks whether the condition is met, and returns one value if 'true' and another value if 'false'. For example, IF([Date of Joining]>12/9/2005) THEN ('Employee joined after Sept 12, 2005') ELSE ('Employee joined on or before Sept 12, 2005')
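The date function examples in the table above imply a particular calendar convention. The Python sketch below reproduces those example values under the assumption that weeks run from Sunday to Saturday and that week 1 is the week containing 1 January. This convention is inferred from the examples (12/9/2005 gives week 38 and a last date of week of 17/9/2005) and is not stated in this guide; the function names are lowercase stand-ins, not the product functions.

```python
import calendar
from datetime import date, timedelta

def day_number_of_week(d):
    """Sunday = 1, Monday = 2, ..., Saturday = 7."""
    return d.isoweekday() % 7 + 1

def day_number_of_year(d):
    return d.timetuple().tm_yday

def last_date_of_week(d):
    """Saturday that closes the Sunday-to-Saturday week containing d."""
    return d + timedelta(days=7 - day_number_of_week(d))

def last_date_of_month(d):
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

def week_number_of_year(d):
    """Assumed convention: week 1 contains 1 January, weeks start on Sunday."""
    jan1_offset = day_number_of_week(date(d.year, 1, 1)) - 1
    return (day_number_of_year(d) + jan1_offset - 1) // 7 + 1

def quarter_number_of_date(d):
    return (d.month - 1) // 3 + 1

joining = date(2005, 9, 12)  # 12/9/2005 in the dd/mm/yyyy notation of the table
summary = {
    "last_date_of_week": last_date_of_week(joining),
    "last_date_of_month": last_date_of_month(joining),
    "week_number_of_year": week_number_of_year(joining),
    "quarter_number_of_date": quarter_number_of_date(joining),
}
```

Under ISO 8601 week numbering the same date would fall in week 37, so the convention matters when comparing results with other tools.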

Note: Mathematical expressions containing functions that return a numeric value are not supported. For example, the expression DAYNUMBEROFMONTH(CURRENTDATE())==2 is not supported because DAYNUMBEROFMONTH returns a numeric value.

Mathematical Operators
Use mathematical operators to create formulas containing numeric columns and/or numbers. For example, the expression [Age] + 1 adds a new column with values 26, 31, 34, 33.

+   Addition operator
-   Subtraction operator
*   Multiplication operator
/   Division operator
()  Round brackets or parenthesis
^   Power operator
%   Modulo operator
E   Exponential operator

Conditional Operators
Use conditional operators to create IF THEN ELSE or SELECT expressions.

==  Equal to
!=  Not Equal to
<   Less than
>   Greater than
<=  Less than or equal to
>=  Greater than or equal to

Logical Operators
Use logical operators to compare two conditions and return 'true' or 'false'. For example, IF([Date of Joining]>12/9/2005 && [Age] >=25 ) THEN ('True') ELSE ('False') adds a new column with values True, False, False, False.

&&  AND
||  OR

8.2.3 Formula
The Formula component enables you to apply predefined functions and operators on the data. All functions and expressions except data manipulation functions add a new column with the formula result.
Note: When entering a String literal that contains single quotation marks, each single quotation mark inside the String literal must be escaped with a backslash character. For example, 'Customer's' must be represented as 'Customer\'s'.
When entering a column name that contains square brackets, each square bracket inside the column name must be escaped with a backslash character. For example, [Customer[Age]] must be represented as [Customer\[Age\]].


Formula Properties

Property Name: New Column Name
Description: The name of the new column created by applying formula
Default Value: None
Possible Values: Any string value
Input Required?: Mandatory

Property Name: Formula String
Description: Formula you want to apply
Default Value: None
Possible Values: For example, Average([Age])
Input Required?: Mandatory

Example:
Employee Table:

Emp ID  Emp Name  DOB         Age  Date of Joining  Date of Confirmation
1       Laura     11/11/1986  25   12/9/2005        27/11/2005
2       Desy      12/5/1981   30   24/6/2000        10/7/2000
3       Alex      30/5/1978   33   10/10/1998       24/12/1998
4       John      6/6/1979    32   2/12/1999        20/12/1999

1. Drag the Formula component onto the analysis editor.
2. In the properties view, enter a name for the formula. For example, Average_Age.
3. In the Expression field, enter the formula: AVERAGE([Age])
4. Choose Validate and Apply to validate the formula syntax.

Output table:

Emp ID  Emp Name  DOB         Age  Date of Joining  Date of Confirmation  Average_Age
1       Laura     11/11/1986  25   12/9/2005        27/11/2005            30
2       Desy      12/5/1981   30   24/6/2000        10/7/2000             30
3       Alex      30/5/1978   33   10/10/1998       24/12/1998            30
4       John      6/6/1979    32   2/12/1999        20/12/1999            30
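The output table shows that a column-level function such as AVERAGE produces one value that is repeated on every row of the new column. A minimal Python sketch of that behavior, using only the Emp Name and Age columns and a dictionary layout chosen for this example, could look as follows.

```python
ages = {"Laura": 25, "Desy": 30, "Alex": 33, "John": 32}

# AVERAGE([Age]) is computed once over the whole column ...
average_age = sum(ages.values()) / len(ages)

# ... and the same result is attached to every row as the new column
output = [
    {"Emp Name": name, "Age": age, "Average_Age": average_age}
    for name, age in ages.items()
]
```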

Supported Functions

Date functions:
DAYSBETWEEN: Returns number of days between two dates.
CURRENTDATE: Returns the current system date.
MONTHSBETWEEN: Returns months between two dates. For example, the new column contains 2,0,2,0 when MONTHSBETWEEN([Date of Joining],[Date of Confirmation]) is applied on the Employee table.
DAYNAME: Returns the day name in the string format. For example, the new column contains Monday, Saturday, Saturday, Thursday when DAYNAME([Date of Joining]) is applied on the Employee table.
DAYNUMBEROFMONTH: Returns the day number of the particular month. For example, 12/11/1980 returns 12.
DAYNUMBEROFWEEK: Returns the day number in a week. For example, Sunday=1, Monday=2.
DAYNUMBEROFYEAR: Returns the day number in a year. For example, 1st Jan=1, 1st Feb=32, 3rd Feb=34.
LASTDATEOFWEEK: Returns the date of last day in a week. For example, 12/9/2005 returns 17/9/2005.
LASTDATEOFMONTH: Returns the date of last day in a month. For example, 12/9/2005 returns 30/9/2005.
MONTHNUMBEROFYEAR: Returns the month number in a date. For example, Jan=1, Feb=2, Mar=3, and so on.
WEEKNUMBEROFYEAR: Returns the week number in a year. For example, 12/9/2005 returns 38.
QUARTERNUMBEROFDATE: Returns the quarter number in a date. For example, 12/9/2005 returns 3.

String functions:
CONCAT: Concatenates two strings. For example, CONCAT('USA', 'Australia') returns USAAustralia.
INSTR: Returns true, if the search string is found in the source string. For example, INSTR('USA', 'US') returns true.
SUBSTRING: Returns substring from the source string. For example, SUBSTRING('USA', 1,2) returns US.
STRLEN: Returns the number of characters in the source string. For example, STRLEN('Australia') returns 9.

Math functions:
MAX: Returns the maximum value in a column.
MIN: Returns the minimum value in a column.
COUNT: Returns the number of values in a column.
SUM: Returns the sum of the values in a column.
AVERAGE: Returns the average of the values in a column.

Data Manipulation functions:
@REPLACE: Performs in-place replacement of a string. For example, @REPLACE([country],'USA', 'AMERICA') replaces USA by AMERICA in country column.
@BLANK: Replaces blank values with a specified value. For example, @BLANK([country], 'USA') replaces all blank values with USA in country column.
@SELECT: Selects rows that satisfy the given condition. You can use any conditional operator to specify the condition. For example, @SELECT([country]=='USA') selects rows where country is equal to USA.

Conditional Expression:
IF(condition) THEN(string expression/mathematical expression/conditional expression) ELSE(string expression/mathematical expression/conditional expression): Checks whether the condition is met, and returns one value if 'true' and another value if 'false'. For example, IF([Date of Joining]>12/9/2005) THEN ('Employee joined after Sept 12, 2005') ELSE ('Employee joined on or before Sept 12, 2005')

Note: Mathematical expressions containing functions that return a numeric value are not supported. For example, the expression DAYNUMBEROFMONTH(CURRENTDATE())+2 is not supported because DAYNUMBEROFMONTH returns a numeric value.

Mathematical Operators
Use mathematical operators to create formulas containing numeric columns and/or numbers. For example, the expression [Age] + 1 adds a new column with values 26, 31, 34, 33.

+   Addition operator
-   Subtraction operator
*   Multiplication operator
/   Division operator
()  Round brackets or parenthesis
^   Power operator
%   Modulo operator
E   Exponential operator

Conditional Operators
Use conditional operators to create IF THEN ELSE or SELECT expressions.

==  Equal to
!=  Not Equal to
<   Less than
>   Greater than
<=  Less than or equal to
>=  Greater than or equal to

Logical Operators
Use logical operators to compare two conditions and return 'true' or 'false'. For example, IF([Date of Joining]>12/9/2005 && [Age] >=25 ) THEN ('True') ELSE ('False') adds a new column with values True, False, False, False.

&&  AND
||  OR

8.2.4 Sample
The Sample component enables you to select a subset of data from large datasets. The Sample component supports the following sample types:
First N: Selects the first N records in the dataset.
Last N: Selects the last N records in the dataset.
Every N: Selects every Nth record in the dataset, where N is an interval. For example, if N=2, the 2nd, 4th, 6th, 8th records, and so on, are selected.
Simple Random: Randomly selects N records or N percent of the records in the dataset.
Systematic Random: In this sample type, sample intervals or buckets are created based on the bucket size. The Sample component selects a record at a random position N in the first bucket, and then selects the record at the same position N in each subsequent bucket.

Sample Properties

Property Name: Sampling Type
Description: Type of sampling.
Default Value: First N
Possible Values: First N, Last N, Every N, Systematic Random, Simple Random
Input Required?: Mandatory

Property Name: Limit Rows by
Description: Select the method for limiting the rows.
Default Value: Number of Rows
Possible Values: Number of Rows, Percentage of Rows
Input Required?: Mandatory for First N, Last N, and Simple Random sampling types.

Property Name: Number of Rows
Description: Number of rows to fetch. This property applies for First N, Last N, and Simple Random sample types.
Default Value: None
Possible Values: Any positive integer value >=0
Input Required?: Either "Number of Rows" or "Percentage of Rows" is mandatory

Property Name: Percentage of Rows
Description: Percentage of rows to fetch. This property applies for First N, Last N, and Simple Random sample types.
Default Value: None
Possible Values: Range: 0-100. For example, 50, 50.25, 12.444, and so on.
Input Required?: Either "Number of Rows" or "Percentage of Rows" is mandatory. Optional for First N and Last N sampling types.

Property Name: Maximum Rows
Description: Maximum number of rows to be selected.
Default Value: None
Possible Values: Any positive integer value >=0
Input Required?: Optional for all sample types

Property Name: Interval
Description: Interval between rows to be selected. This property applies for the Every N sample type.
Default Value: None
Possible Values: Any positive integer value >0
Input Required?: Mandatory for Every N

Property Name: Bucket Size
Description: Size of the bucket in the Systematic Random sample type.
Default Value: None
Possible Values: Any positive integer value >0
Input Required?: Mandatory

Example:

Emp ID  Emp Name  DOB         Age
1       Laura     11/11/1986  25
2       Desy      12/5/1981   30
3       Alex      30/5/1978   33
4       John      6/6/1979    32
5       Ted       4/7/1987    24
6       Tom       30/6/1970   41
7       Anna      24/6/1965   46
8       Valerie   6/7/1990    21
9       Mary      19/9/1985   26
10      Martin    21/11/1986  25

1. First N: For N=5

Emp ID  Emp Name  DOB         Age
1       Laura     11/11/1986  25
2       Desy      12/5/1981   30
3       Alex      30/5/1978   33
4       John      6/6/1979    32
5       Ted       4/7/1987    24

2. Last N: For N=4

Emp ID  Emp Name  DOB         Age
7       Anna      24/6/1965   46
8       Valerie   6/7/1990    21
9       Mary      19/9/1985   26
10      Martin    21/11/1986  25

3. Every N: Interval=3

Emp ID  Emp Name  DOB         Age
3       Alex      30/5/1978   33
6       Tom       30/6/1970   41
9       Mary      19/9/1985   26

4. Simple Random: For number of rows=2
The result can be any two rows.

Emp ID  Emp Name  DOB         Age
7       Anna      24/6/1965   46
8       Valerie   6/7/1990    21

5. Systematic Random: Bucket Size=4

Emp ID  Emp Name  DOB         Age
2       Desy      12/5/1981   30
6       Tom       30/6/1970   41
10      Martin    21/11/1986  25

or

Emp ID  Emp Name  DOB         Age
1       Laura     11/11/1986  25
5       Ted       4/7/1987    24
9       Mary      19/9/1985   26
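The five sample types in the example can be expressed as list operations. The Python sketch below works on the Emp ID values only and is an illustration of the documented results, not the Sample component's implementation; the function names and the use of `random.Random` are assumptions made for this example.

```python
import random

def first_n(rows, n):
    return rows[:n]

def last_n(rows, n):
    return rows[len(rows) - n:]

def every_n(rows, interval):
    """Select the Nth, 2Nth, 3Nth, ... record."""
    return rows[interval - 1::interval]

def simple_random(rows, n, rng):
    """Select any n records at random."""
    return rng.sample(rows, n)

def systematic_random(rows, bucket_size, rng):
    """Pick a random position in the first bucket, then reuse it in every bucket."""
    position = rng.randrange(bucket_size)
    return rows[position::bucket_size]

emp_ids = list(range(1, 11))
rng = random.Random()

first_five = first_n(emp_ids, 5)
last_four = last_n(emp_ids, 4)
every_third = every_n(emp_ids, 3)
any_two = simple_random(emp_ids, 2, rng)
systematic = systematic_random(emp_ids, 4, rng)
```

With a bucket size of 4 and ten rows, the systematic sample is one of 1, 5, 9; 2, 6, 10; 3, 7; or 4, 8, depending on the random position, which matches the two alternatives shown in the example.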

8.3 Algorithms
Use algorithms to perform data mining and statistical analysis on your data, for example, to determine trends and patterns in data. SAP Predictive Analysis provides built-in algorithms such as Regressions, Time Series, and Outliers. The application also supports Decision Trees, K-Means, Neural Network, Time Series, and Regression algorithms from the R open-source library. You can also perform in-database analysis using Predictive Analysis Library (PAL) algorithms from SAP HANA.

8.3.1 Association

8.3.1.1 HANA Apriori


The HANA Apriori algorithm enables you to find frequent itemset patterns in large transactional datasets for generating association rules. This algorithm is used to understand which products and services customers tend to purchase at the same time. By analyzing the purchasing trends of customers with association analysis, you can predict their future behavior. For example, the information that a customer who buys shoes is more likely to buy socks at the same time can be represented in an association rule (with a given minimum support and minimum confidence) as:
Shoes=>Socks [support = 0.5, confidence= 0.1]
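Support, confidence, and lift for a rule such as Shoes=>Socks follow the standard association-rule definitions. The Python sketch below computes them for a small set of made-up transactions; the data and function names are illustrative only, and the sketch evaluates a single rule rather than running the Apriori search performed by the PAL algorithm.

```python
# Hypothetical transactions; each set holds the items bought together
transactions = [
    {"Shoes", "Socks"},
    {"Shoes", "Socks", "Belt"},
    {"Shoes"},
    {"Belt"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Support of the whole rule divided by the support of its left-hand side."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Confidence of the rule relative to the support of its right-hand side."""
    return confidence(lhs, rhs) / support(rhs)

rule_support = support({"Shoes", "Socks"})
rule_confidence = confidence({"Shoes"}, {"Socks"})
rule_lift = lift({"Shoes"}, {"Socks"})
```

In this sample data, two of four transactions contain both items, and two of the three transactions with shoes also contain socks, so the rule would pass the default thresholds of 0.5 for support and 0.1 for confidence.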


HANA Apriori Properties

Property Name: Item Column
Description: Column name containing items.
Default Value: None
Possible Values: Column names
Input Required?: Mandatory

Property Name: Transaction-ID Column
Description: Column name containing the transaction ID.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory only if Input Mode is "Transactions"

Property Name: Missing Values
Description: Method for handling missing values.
Default Value: Remove
Possible Values: Remove: The algorithm skips the records containing missing values in the independent or dependent column. Keep: The algorithm retains missing values for processing.
Input Required?: Mandatory

Property Name: Support
Description: Percentage of task-relevant data transactions for which the pattern is true.
Default Value: 0.5
Possible Values: Range: 0-1
Input Required?: Mandatory

Property Name: Confidence
Description: Measure of certainty or trustworthiness associated with each discovered pattern.
Default Value: 0.1
Possible Values: Range: 0-1
Input Required?: Mandatory

Property Name: Pre Rule
Description: Name for the new column that contains antecedent (LHS) of the apriori rule for the given dataset.
Default Value: PreRule
Possible Values: Any String value
Input Required?: Mandatory

Property Name: Post Rule
Description: Name for the new column that contains consequent (RHS) of the apriori rule for the given dataset.
Default Value: PostRule
Possible Values: Any String value
Input Required?: Mandatory

Property Name: Support Values
Description: Name for the new column that contains support values for the corresponding rules.
Default Value: Support
Possible Values: Any String value
Input Required?: Mandatory

Property Name: Confidence Values
Description: Name for the new column that contains confidence values for the corresponding rules.
Default Value: Confidence
Possible Values: Any String value
Input Required?: Mandatory

Property Name: Lift Values
Description: Name for the new column that contains lift values for the corresponding rules.
Default Value: Lift
Possible Values: Any String value
Input Required?: Mandatory

Property Name: Number of Threads
Description: Number of threads to be used for execution.
Default Value: 1
Possible Values: Any positive Integer value
Input Required?: Optional

8.3.1.2 R-Apriori
The R-Apriori algorithm enables you to find frequent itemset patterns in large transactional datasets for generating association rules using the "arules" R package. For example, the information that a customer who buys shoes is more likely to buy socks at the same time can be represented in an association rule (with a given minimum support and minimum confidence) as:
Shoes=>Socks [support = 0.5, confidence= 0.1]


R-Apriori Properties
Property Name | Description | Default Value | Possible Values | Input Required? | Is this property configurable when the algorithm is used as a model?
Output Mode | Mode in which the output must be displayed. | Rules | Rules; Transactions | Mandatory | No
Input Format | Input data format. | Transactions | Tabular Format: Input contains only items. Transactions: Input contains two columns (TransactionID and Item). Note: Tabular format with input value true or false for items is also supported. | Mandatory | Yes
Item Column(s) | Column names containing items. | None | Column names | Mandatory | Yes
TransactionID Column | Column name containing the transaction ID. | None | Column name | Mandatory only if Input Mode is "Transactions" | Yes
Support | Percentage of task-relevant data transactions for which the pattern is true. | 0.5 | Range: 0-1 | Mandatory | No
Confidence | Measure of certainty or trustworthiness associated with each discovered pattern. | 0.1 | Range: 0-1 | Mandatory | No
Transaction ID | Name of the new column that contains transaction IDs. | Transaction ID | Any string value | Mandatory | Yes
Items | Name of the new column that contains items. | Items | Any string value | Mandatory | Yes
Matching Rules | Name of the new column that contains matching rules. | Matching Rules | Any string value | Mandatory | Yes
Lhs Item(s) | Comma-separated labels of the items that appear on the left-hand side of the rules or itemsets. | None | The items in the data source. For example, Cake, Bread, and so on | Optional | No
Rhs Item(s) | Comma-separated labels of the items that appear on the right-hand side of the rules or itemsets. | None | The items in the data source. For example, Cake, Bread, and so on | Optional | No
Both Item(s) | Comma-separated labels of the items that appear on both sides of the rules or itemsets. | None | The items in the data source. For example, Cake, Bread, and so on | Optional | No
None Item(s) | Comma-separated labels of the items that need not appear in the rules or itemsets. | None | The items in the data source. For example, Cake, Bread, and so on | Optional | No
Default Appearance | Default appearance of the items that are not explicitly mentioned. | Both Item(s) | Lhs Item(s); Rhs Item(s); Both Item(s); None Item(s) | Optional | No
Sort Item(s) | Indicates how to sort items with respect to their frequency. | Ascending Transaction Size | Ascending; Descending; Do Not Sort; Ascending Transaction Size; Descending Transaction Size | Optional | No
Filter Items | Numeric value that indicates how to filter unused items from transactions. | 0 | Range: -1 to +1 | Optional | No
Tree View | Boolean value that indicates whether to organize transactions as a prefix tree. | True | True, False | Optional | No
Use Heap Sort | Boolean value that indicates whether to use heap sort instead of quick sort to sort the transactions. | True | True, False | Optional | No
Minimize Memory | Boolean value that indicates whether to minimize memory usage instead of maximizing speed. | False | True, False | Optional | No
Load Transaction | Boolean value that indicates whether to load transactions into memory. | True | True, False | Optional | No

8.3.2 Classification

8.3.2.1 HANA KNN


The HANA KNN algorithm enables you to classify objects based on trained sample data. In KNN, each object is classified by a majority vote of its nearest neighbors.
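The majority-vote idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the HANA implementation: the function name `knn_classify`, the Euclidean distance measure, and the sample rows are all hypothetical.

```python
import math
from collections import Counter


def knn_classify(training, query, k):
    """Classify `query` by majority vote of its k nearest training rows.

    `training` is a list of (features, label) pairs.
    Assumption: distance is Euclidean; ties go to the label seen first.
    """
    by_distance = sorted(training, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical trained sample data: (independent columns, dependent column).
trained = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
    ((0.9, 1.1), "A"),
]
label = knn_classify(trained, (1.1, 1.0), k=3)
print(label)
```

Here `k` plays the role of the Neighborhood Count property; a distance-weighted vote would weight each neighbor's label by its closeness instead of counting it once.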


HANA KNN Properties


Property Name | Description | Default Value | Possible Values | Input Required?
Independent Columns | Input source columns. | None | Column names | Mandatory
Neighborhood Count | Number of neighbors to consider for finding distances. | | Any positive integer value | Mandatory
Voting Type | Type of voting. | Majority Voting | Majority Voting; Distance Weighted Voting | Mandatory
Missing Values | Method for handling missing values. | Remove | Remove: The algorithm skips the records containing missing values in the independent or dependent columns. Keep: The algorithm considers missing values for processing. Stop: The algorithm stops the execution if the missing value is found in independent or dependent columns. | Mandatory
Schema Name | Name of the schema that contains trained data. | None | Any String value | Mandatory
Table Name | Name of the table that contains trained data. | None | Any String value | Mandatory
Independent Columns | Name of the input columns to be considered for training data. | None | Any positive integer value | Mandatory
Dependent Column | Name of the output column to be considered for training data. | None | Any positive integer value | Mandatory
Number of Threads | Number of threads to be used for execution. | 1 | Any positive integer value | Optional
Predicted Column Name | Name of the column that contains classification value. | None | Any String value | Mandatory

8.3.3 Clustering

8.3.3.1 HANA K-Means


The HANA K-Means algorithm is used to cluster observations into groups of related observations without any prior knowledge of those relationships. The algorithm clusters observations into k groups, where k is provided as an input parameter. It assigns each observation to the cluster whose mean is closest to the observation, and the process continues until the clusters converge.
Note: Each time you execute the HANA K-Means algorithm, you might obtain a different cluster number for each cluster. However, the observations in each cluster remain the same.
Note: Creating models using the HANA K-Means algorithm is not supported.
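The assign-and-update loop described above can be sketched as follows. This is a simplified illustration, not the HANA implementation: the function name `k_means` is hypothetical, Euclidean distance is assumed, and the initial centers are simply the first k rows (one of several possible seeding methods).

```python
import math


def k_means(points, k, max_iterations=100):
    """Cluster `points` into k groups; returns one cluster number per point."""
    # Assumption: the first k rows are used as the initial cluster centers.
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assign every observation to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # clusters have converged
        assignment = new_assignment
        # Recompute each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical observations with two independent columns.
data = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.9), (8.8, 9.1), (0.8, 1.1)]
clusters = k_means(data, k=2)
print(clusters)
```

Because the cluster numbers depend on how the centers are seeded, a different seeding can label the same groups with different numbers, which is the behavior the note above describes.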


HANA K-Means Properties


Property | Description | Default Value | Possible Values | Input Required?
Output Mode | Mode in which the output is to be displayed. | Cluster | Cluster | NA
Independent Columns | Select the source columns on which you want to apply the algorithm. | None | Column name | Mandatory
Missing Values | Method for handling missing values. | Remove | Remove: The algorithm skips the records containing missing values in the independent or dependent column. Ignore: The algorithm ignores the record containing missing values during calculation. However, the records are retained in the result table. Stop: The algorithm stops if the missing value is found in independent or dependent column. | Mandatory
Number of Clusters | Enter number of groups for clustering. | | Any positive integer value | Mandatory
Cluster Name | Enter a name for the new column added in the result that contains cluster number. | ClusterNumber | Any string value | Mandatory
Maximum Iterations | Enter the number of iterations allowed for finding clusters. | 100 | Any positive integer value | Optional
Center Calculation Method | Select the method used for calculating initial cluster centres. | First K values | First K values; Weighted random with replacement; Random without replacement | Optional
Normalization | Select True to normalize the data. | True | True, False | Optional
Number of Threads | Enter the number of threads that can be used for execution. | 1 | Any positive integer value > 0 | Optional
Exit Threshold | Enter the threshold value for exiting from the iterations. | 0.0001 | Range: 0-1 | Optional

8.3.3.2 R-K-Means
The R-K-Means algorithm is used to cluster observations into groups of related observations without any prior knowledge of those relationships. The algorithm clusters observations into k groups, where k is provided as an input parameter. The algorithm then assigns each observation to clusters based on the proximity of the observation to the mean of the cluster. The process continues until the clusters converge.
Note: Each time you execute the R-K-Means algorithm, you might obtain a different cluster number for each cluster. However, the observations in each cluster remain the same.
Note: Creating models using the R-K-Means algorithm is not supported.

R-K-Means Properties
Property Name | Description | Default Values | Possible Values | Input Required?
Output Mode | Mode in which the output is to be displayed. | Cluster | Cluster | NA
Independent Columns | Source columns on which clustering is performed. | None | Column names | Mandatory
Number of Clusters | Number of groups into which the dataset is to be clustered. | None | Any positive integer value. | Mandatory
Cluster Name | Name of the new column that contains the cluster numbers. | ClusterNumber | Any string value | Mandatory
Maximum Iterations | Number of iterations for finding the clusters. | 10 | Any positive integer value. | Optional
Number of Initial Sets | Number of random initial sets to be chosen. | 1 | Any positive integer value. | Optional
Algorithm | Algorithm to be used for performing k-means clustering. | Hartigan-Wong | Hartigan-Wong; Lloyd; Forgy; MacQueen | Optional

8.3.4 Decision Trees

8.3.4.1 HANA C 4.5


The HANA C 4.5 Decision Tree algorithm is used as a classifier for determining an appropriate action or decision from a predetermined set of actions. A decision tree helps you to effectively identify the factors you must consider and how each factor has historically been associated with different outcomes of the decision.
A decision tree is a classifier that uses a tree-like structure of conditions and their possible consequences. Each node of a decision tree is either a leaf node or a decision node.
Leaf node - contains the value of the dependent variable.
Decision node - contains one condition that specifies a test on an attribute value. The outcome of the condition is further divided into branches with sub-trees or leaf nodes.
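The two node types can be modeled with a small data structure. The sketch below only shows how a finished tree is traversed to reach a decision; it does not build a tree, and the class names, the attribute names ("income", "age"), and the thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: str  # value of the dependent variable


@dataclass
class Decision:
    attribute: str                        # attribute tested by this node
    threshold: float                      # condition: record[attribute] <= threshold
    if_true: Union["Leaf", "Decision"]    # branch taken when the condition holds
    if_false: Union["Leaf", "Decision"]   # branch taken otherwise


def predict(node, record):
    """Follow decision nodes until a leaf is reached and return its value."""
    while isinstance(node, Decision):
        if record[node.attribute] <= node.threshold:
            node = node.if_true
        else:
            node = node.if_false
    return node.value


# Hypothetical tree with two decision nodes and three leaf nodes.
tree = Decision(
    "income", 5000,
    Leaf("reject"),
    Decision("age", 30, Leaf("review"), Leaf("approve")),
)
outcome = predict(tree, {"income": 8000, "age": 45})
print(outcome)
```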


HANA C 4.5 Decision Tree Properties


Property Name | Description | Default Value | Possible Values | Input Required? | Is this property configurable when the algorithm is used as a model?
Output Mode | Mode in which the output is to be displayed. | Trend | Fill: Fills missing values in the target column. Trend: Provides the input data from the data source along with predicted values for the given dataset. | Mandatory | Yes
Independent Column | Input source columns. | None | Column names | Mandatory | Yes
Dependent Column | Target or output column. | None | Column name | Mandatory | Yes
Missing Values | Method for handling missing values. | Remove | Remove: The algorithm skips the records containing missing values in the independent or dependent columns. Keep: The algorithm considers missing values for processing. Stop: The algorithm stops the execution if the missing value is found in independent or dependent columns. | Mandatory | Yes
Percentage | Percentage of data to be considered for analysis. | 1 | Any positive double value | Mandatory | No
Number of Threads | Number of threads to be used for execution. | 1 | Any positive Integer value | Optional | Yes
Column Name | Name of the independent column containing numeric values. | None | Any String value | Optional | No
Enter Bin Ranges | | None | Any String value. For example, 5000,10000;12000,15000 | Optional | No

8.3.4.2 R-CNR Tree


The R-Classification and Regression Tree algorithm classifies observations into groups and predicts one or more discrete variables based on other variables. This algorithm scans the input column and builds a binary decision tree by splitting observations into two sub-groups. Each sub-group is then split into two more sub-groups, and the splitting continues until the stopping criterion is met.
Note: The "rpart" package, which is part of R 2.11.1, cannot handle column names with spaces or special characters. The "rpart" package supports only the input column name format that is supported by R data frames.
Note: The CNR tree does not work if the version of the "caret" package, which is part of R, is lower than 4.85.
Note: Independent column names used while scoring the model should be the same as the independent column names used while creating the model.
Note: Column names containing spaces or any special character other than period (.) are not supported.
PMML version supported: PMML 3.2 for R components


R-CNR Tree Properties


Property Name | Description | Default Values | Possible Values | Input Required? | Is this property configurable when the algorithm is used as a model?
Output Mode | Mode in which the output is to be displayed. | Trend | Trend: Provides the input data from the data source along with predicted values for the given dataset. Fill: Fills missing values in the target column. | Mandatory | Yes
Independent Columns | Input source columns. | None | Column names | Mandatory | Yes
Dependent Column | Target column. | None | Column name | Mandatory | Yes, only if output mode is fill
Missing Values | Method for handling missing values. | Ignore | rpart: The algorithm deletes all observations for which dependent column is missing, but retains those observations in which one or more independent columns are missing. Ignore: The algorithm skips the records containing missing values in the independent or dependent columns. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent columns. | Mandatory | Yes
Method | Type of splitting rule to use. | Classification | Classification: Use this method if the dependent variable has categorical values. Regression: Use this method if the dependent variable has continuous values. | Mandatory | Yes
Minimum Split | Minimum number of observations that need to exist for splitting a node. | 20 | Any positive integer value | Mandatory | No
Split Criteria | Splitting criteria of the node. | Gini | Gini: Gini impurity. Information: Information gain. | Mandatory | No
Predicted Column Name | Name for the new column that contains predicted values. | PredictedValues | Any string value | Mandatory | Yes
Complexity Parameter | Complexity parameter is a user input that saves computing time by preventing any split that does not improve the fit. | 0.005 | Range: 0-1 | Optional | No
Maximum Depth | Maximum level of any node of the final tree with the root node counted as level 0. Note: If the maximum depth is greater than 30, then the algorithm does not produce results as expected (on 32-bit machine). | None | Any positive integer value. | Optional | No
Cross Validation | Specifies the number of cross validations to be done. A higher cross validations value increases the computational time and produces more accurate results. | None | Any positive integer value. | Optional | No
Prior Probability | Refers to vector of prior probabilities. | None | Any possible decimal number. For example, 0.65, 0.35 | Optional | No
Use Surrogate | Defines how to use surrogates in the splitting process. | Display Only | Display Only - an observation with a missing value for the primary split rule is not sent further down the tree. Use Surrogate - use this option for splitting subjects missing the primary variable; if all surrogates are missing, the observation is not split. Stop if missing - if all surrogates are missing, send the observation in the majority direction. | Optional | No
Surrogate Style | Controls the selection of a best surrogate. | | Use total correct classification - algorithm uses total number of correct classification for finding a potential surrogate variable. Use percent non missing cases - algorithm uses the percentage of non missing cases classified to find a potential surrogate. | Optional | No
Maximum Surrogate | While building a CART tree, surrogate rules are stored at non-terminal nodes and are used if missing values are found during prediction. Max Surrogates specifies the maximum number of surrogates that can be retained at each node in a tree. A higher value increases the computational time and produces more accurate results if there are missing values. | 5 | Any positive integer value | Optional | No
Model Name | Name for the model. | None | Any string value | Optional | No

8.3.5 Neural Network

8.3.5.1 R-MONMLP Neural Network


The R-MultiLayerPerceptron Neural Network is used for forecasting, classification, and statistical pattern recognition using R library functions. Note: R does not support PMML storage for MONMLP Neural Network.

R-MONMLP Neural Net Properties


Property Name | Description | Default Values | Possible Values | Input Required? | Is this property configurable when the algorithm is used as a model?
Output Mode | Mode in which the output is to be displayed. | Trend | Fill: Fills missing values in the target column. Trend: Provides the input data from the data source along with predicted values for the given dataset. | Mandatory | Yes
Independent Columns | Input columns for applying the algorithm. | None | Column names | Mandatory | Yes
Dependent Columns | Target column for applying the algorithm. | None | Column Name | Mandatory | Yes, only if output mode is fill
Hidden Layer1 Neurons | Number of nodes or neurons in the first hidden layer. | None | Any positive integer value. | Mandatory | No
Hidden Layer2 Neurons | Number of nodes or neurons in the second hidden layer. | 0 | Any positive integer value. | Optional | No
Predicted Column Name | Name for the new column that contains predicted values. | PredictedValues | Any string value | Mandatory | Yes
Maximum Iterations | Maximum number of iterations used by the optimization algorithm. | 5000 | Any positive integer value | Optional | No
Monotone Columns | Column indexes of covariates for which the monotonicity constraint should hold. | None | Any positive integer value | Optional | No
Training Iterations | Specifies the number of stopped training iterations between calculation of the cost function on the out-of-bootstrap cases. | None | Any positive integer value | Optional | No
Initial Weights | Either a vector giving the minimum and maximum allowable values of the random weights or an initial weight vector. | -0.5, 0.5 | Double values | Optional | No
Maximum Exceptions | Maximum number of exceptions of the optimization routine allowed before the algorithm is terminated with an error. | 10 | Any positive integer value | Optional | No
Scale Dependent Columns | Boolean value that determines if dependent columns matrix should be scaled to zero mean and unit variance prior to fitting. Set this property to false if you are using an output layer transfer function that limits the range of predictions. | True | True, False | Optional | No
Bagging Required | Boolean value that indicates whether bootstrap aggregation (bagging) can be used. | False | True, False | Optional | No
Trials to Avoid Local Minima | Number of repeated trials used to avoid local minima. | 1 | Any positive integer value | Optional | No
No. Ensemble Members | Number of ensemble members to fit. | 1 | Any positive integer value | Optional | No
Hidden Layer Transfer Function | Represents the activation function to be used for the hidden layer. | Tangent Sigmoid Function | Trivial Identity Function (Linear); Logistic Sigmoid Function; Tangent Sigmoid Function | Optional | No
Output Layer Transfer Function | Represents the activation function to be used for the output layer. | Trivial Identity Function | Trivial Identity Function; Logistic Sigmoid Function; Tangent Sigmoid Function | Optional | No
Derivative of Hidden Layer Transfer Function | Represents the derivative of hidden layer activation function. | Tansig.prime | Derivative of Trivial Identity Function (Linear); Derivative of Logistic Sigmoid Function; Derivative of Tangent Sigmoid Function | Optional | No
Derivative of Output Layer Transfer Function | Represents the derivative of output layer activation function. | Linear.prime | Derivative of Trivial Identity Function; Derivative of Logistic Sigmoid Function; Derivative of Tangent Sigmoid Function | Optional | No
Model Name | Name for the model. | None | Any string value | Optional | No

8.3.5.2 R-NNet Neural Network


The R-Neural Network algorithm is used for forecasting, classification, and statistical pattern recognition using R library functions.


Note: PMML version supported: PMML 3.2 for R components

R-Neural Network
Property Name | Description | Default Values | Possible Values | Input Required? | Is this property configurable when the algorithm is used as a model?
Output Mode | Mode in which the output is to be displayed. | Trend | Trend: Provides the input data from the data source along with predicted values for the given dataset. Fill: Fills missing values in the target column. | Mandatory | Yes
Independent Columns | Input columns for applying the algorithm. | None | Column name | Mandatory | Yes
Dependent Columns | Target column for applying the algorithm. | None | Column name | Mandatory | Yes, only if output mode is fill
Missing Values | Method for handling missing values. | Remove | Remove: The algorithm skips the records containing missing values in the independent or dependent columns. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent columns. | Mandatory | Yes
Hidden Layer Neurons | Number of nodes or neurons in the hidden layer. | None | Any positive integer value | Mandatory | No
Predicted Column Name | Name for the new column that contains predicted values. | PredictedValues | Any string value | Mandatory | Yes
Type | Type of analysis to be done by the algorithm. | Regression | Regression: Finds trend in data. Classification: Classifies data. | Mandatory | No
Skip Hidden Layer | Boolean value. If set to true, adds skip-layer connections from input to output. | False | True, False | Optional | No
Linear Output | Boolean value that indicates whether you need a linear output. | False | True, False. If you select analysis type as Classification, this value must be true. | Optional | No
Use Softmax | Boolean value that indicates if you want to use "log-linear model" and "maximum conditional likelihood" fittings. linout, entropy, softmax and censored are mutually exclusive. | False | True, False. If you select analysis type as Classification, this value must be true. | Optional | No
Use Entropy | A Boolean value that indicates if you want to use "Maximum Conditional Likelihood" fitting. By default, the algorithm uses Least-Squares method. | False | True: Use "Maximum Conditional Likelihood" fitting. False: Use Least-Squares method. | Optional | No
Use Censored | Boolean value. Censored is a variant of softmax in which non-zero targets indicate possible classes. For softmax, a row of (0,1,1) indicates one example each of classes 2 and 3, but for censored it indicates one example whose class is known only to be 2 or 3. | False | True, False | Optional | No
Range | Represents initial random weights [-rang, rang]. | 0.7 | Set this value to 0.5 unless the input is large. If the input is large, then choose rang by using the formula: rang * max(abs(x)) <= 1 | Optional | No
Weight Decay | Value used for calculating new weights (weight decay). | | Any Double value | Optional | No
Maximum Iterations | Maximum number of iterations. | 100 | Any positive integer value | Optional | No
Hessian Matrix Required | Boolean value. If set to true, returns the Hessian measure at the best set of weights. | False | True, False | Optional | No
Maximum Weights | Maximum number of weights allowed in the calculation. There is no intrinsic limit in the code, but increasing the maximum number of weights may allow fits that are very slow and time-consuming. | None | Any positive integer value | Optional | No
Abstol | Algorithm terminates if the fit criterion falls below the specified abstol value, indicating an essentially perfect fit. | 1.0e-4 | Any double value | Optional | No
Reltol | Algorithm terminates if the optimizer is unable to reduce the fit criterion by a factor: 1 - reltol | 1.0e-8 | Any double value | Optional | No
Contrasts | List of contrasts to be used for factors appearing as variables in the model. | None | contr.helmert; contr.poly; contr.sum; contr.treatment; contr.SAS; formula | Optional | No
Model Name | Name for the model. | None | Model name | Optional | No

8.3.6 Outliers

8.3.6.1 Inter Quartile Range


The Inter Quartile Outlier (IQR) algorithm enables you to find outlying values based on the statistical distribution between the first and third quartiles.
Inter Quartile Range = Q3 - Q1
Outlier value = any value below Q1 - (Fence Coefficient * Inter Quartile Range) or above Q3 + (Fence Coefficient * Inter Quartile Range)
Note: The input data to the IQR algorithm must contain at least 4 rows.
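The fence calculation can be tried out with a short Python sketch. The function name `iqr_outliers` and the sample readings are hypothetical, and the quartile method (linear interpolation, the "inclusive" method of Python's `statistics.quantiles`) is an assumption; the product may compute quartiles differently.

```python
import statistics


def iqr_outliers(values, fence_coefficient=1.5):
    """Return the values that fall outside the inter quartile fences."""
    if len(values) < 4:
        raise ValueError("the IQR algorithm needs at least 4 rows")
    # Assumption: quartiles are computed with linear interpolation.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - fence_coefficient * iqr
    upper = q3 + fence_coefficient * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical input column.
readings = [1, 2, 3, 4, 5, 6, 7, 100]
outliers = iqr_outliers(readings)
print(outliers)
```

Raising the fence coefficient widens the range of values treated as normal, so fewer values are reported as outliers.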


Inter Quartile Outlier Properties


Property Name | Description | Default Value | Possible Values | Input Required?
Output Mode | Mode in which the output is displayed. | Remove Outliers | Show Outliers: Adds a Boolean column to the input data specifying if the corresponding value is an outlier. Remove Outliers: Removes outlying values from the input data. | Mandatory
Independent Column | Input source column. | None | Column name | Mandatory
Missing Values | Method for handling missing values. | Remove | Remove: The algorithm skips the records containing missing values in the independent or dependent columns. Keep: The algorithm considers missing values for processing. Stop: The algorithm stops the execution if the missing value is found in independent or dependent columns. | Mandatory
Fence Coefficient | Represents the allowed deviation of values from the inter quartile range. If this value is high, then the range of normal values is also high. | 1.5 | Range: 1.5 - 3.0 | Mandatory

8.3.6.2 Nearest Neighbour Outlier


The K Nearest Neighbour (KNN) Outlier algorithm enables you to find outlying values based on the number of neighbours (N) and the average distance of values considered to their nearest N neighbours.
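One way to picture this is to score each value by its average distance to its N nearest neighbours and treat the highest-scoring values as outliers. The sketch below follows that reading for a single numeric column; the function name `knn_outliers`, the scoring details, and the sample data are assumptions rather than the product's exact procedure.

```python
def knn_outliers(values, neighbours, n_outliers):
    """Return the indexes of the n_outliers values with the largest
    average distance to their `neighbours` nearest neighbours."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    scores = []
    for i, v in enumerate(values):
        # Distances from this value to every other value, nearest first.
        distances = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        nearest = distances[:neighbours]
        scores.append(sum(nearest) / len(nearest))
    ranked = sorted(range(len(values)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_outliers])


# Hypothetical input column.
column = [10, 11, 12, 11, 10, 50, 12]
flagged = knn_outliers(column, neighbours=2, n_outliers=1)
print(flagged)
```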


Nearest Neighbour Outlier Properties


Property Name | Description | Default Value | Possible Values | Input Required?
Output Mode | Mode in which the output is displayed. | Remove Outliers | Show Outliers: Adds a Boolean column to the input data specifying if the corresponding value is an outlier. Remove Outliers: Removes outlying values from the input data. | Mandatory
Independent Columns | Input source column. | None | Column name | Mandatory
Missing Values | Method for handling missing values. | Remove | Remove: The algorithm skips the records containing missing values in the independent or dependent columns. Keep: The algorithm considers missing values for processing. Stop: The algorithm stops the execution if the missing value is found in independent or dependent columns. | Mandatory
Neighborhood Count | Number of neighbors to consider for finding distances. | 5 | Any positive integer value | Mandatory
Number of Outliers | The number of outlier values to be removed from the source data. | None | Any positive integer value | Mandatory

8.3.7 Regressions

8.3.7.1 Exponential Regression


The Exponential Regression algorithm provides an objective method of fitting an exponential curve to a sample of observations. It is used to determine how an individual variable is able to influence another variable using an exponential function by means of the least squares methodology.
The dependency between the independent variable (x) and the dependent variable (y) is of the form:
y = a b^{x}
where:
y is the dependent variable.
x is the independent variable.
a is the intercept.
b is the slope.
Note: The data type of the columns used during model scoring should be the same as the data type of the columns used while building the model.
PMML version supported: PMML 3.2 for R components, PMML 4.0 for SAP components
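A common way to fit a model of this form with least squares is to take logarithms, which turns it into the straight line ln(y) = ln(a) + x ln(b). The Python sketch below follows that approach; it assumes all y values are positive, the function names are hypothetical, and the product's own fitting procedure may differ.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_exponential(xs, ys):
    """Fit y = a * b**x by a straight-line fit on (x, ln y)."""
    intercept, slope = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), math.exp(slope)


# Hypothetical data generated from a = 3, b = 2.
xs = [0, 1, 2, 3]
ys = [3 * 2 ** x for x in xs]
a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))
```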


Exponential Regression Properties


Property Name | Description | Default Value | Possible Values | Input Required? | Is this property configurable when the algorithm is used as a model?
Output Mode | Mode in which the output is to be displayed. | Trend | Fill: Fills missing values in the target column. Trend: Provides the input data from the data source along with predicted values for the given dataset. | Mandatory | Yes
Dependent Column | Target column on which the regression is to be performed. | None | Column name | Mandatory | Yes, only if output mode is fill
Independent Column | Input source column using which the regression is to be performed. | None | Column name | Mandatory | Yes
Missing Values | Method for handling missing values. | Remove | Remove: The algorithm skips the records containing missing values in the independent or dependent column. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent column. | Mandatory | Yes
Predicted Column Name | Name for the new column that contains predicted values. | PredictedValues | Any string value | Mandatory | Yes
Model Name | Name for the model. | None | Any string value | Optional | No


8.3.7.2 Geometric Regression


The Geometric Regression algorithm provides an objective method of fitting a geometric curve to a sample of observations. It is used to determine how an individual variable is able to influence another variable using a geometric function by means of the least squares methodology.
The dependency between the independent variable (x) and the dependent variable (y) is of the form:
y = a x^{b}
where:
y is the dependent variable.
x is the independent variable.
a is the intercept.
b is the slope.
Note: The data type of the columns used during model scoring should be the same as the data type of the columns used while building the model.
PMML version supported: PMML 3.2 for R components, PMML 4.0 for SAP components
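Taking logarithms of both sides turns this model into the straight line ln(y) = ln(a) + b ln(x), which can be fitted with ordinary least squares. The Python sketch below uses that approach; it assumes all x and y values are positive, the function names are hypothetical, and the product's own fitting procedure may differ.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_geometric(xs, ys):
    """Fit y = a * x**b by a straight-line fit on (ln x, ln y)."""
    intercept, slope = fit_line(
        [math.log(x) for x in xs], [math.log(y) for y in ys]
    )
    return math.exp(intercept), slope


# Hypothetical data generated from a = 2, b = 3.
xs = [1, 2, 4, 8]
ys = [2 * x ** 3 for x in xs]
a, b = fit_geometric(xs, ys)
print(round(a, 6), round(b, 6))
```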


Geometric Regression Properties


Property Name | Description | Default Value | Possible Values | Input Required? | Is this property configurable when the algorithm is used as a model?
Output Mode | Mode in which the output is to be displayed. | Trend | Fill: Fills missing values in the target column. Trend: Provides the input data from the data source along with predicted values for the given dataset. | Mandatory | Yes
Dependent Column | Target column on which the regression is to be performed. | None | Column name | Mandatory | Yes, only if output mode is fill
Independent Column | Input source column using which the regression is to be performed. | None | Column name | Mandatory | Yes
Missing Values | Method for handling missing values. | Remove | Remove: The algorithm skips the records containing missing values in the independent or dependent column. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent column. | Mandatory | Yes
Predicted Column Name | Name for the new column that contains predicted values. | PredictedValues | Any string value | Mandatory | Yes
Model Name | Name for the model. | None | Any string value | Optional | No


8.3.7.3 HANA Multiple Linear Regression


The HANA Multiple Linear Regression algorithm finds the linear relationship that best predicts a continuous dependent variable from a group of independent variables, by means of the least squares method.

The dependency between the independent variables (x_i1, ..., x_ip) and the dependent variable (y) is of the form:

y_i = B_0 + B_1 x_i1 + B_2 x_i2 + ... + B_p x_ip + E_i, where i = 1, 2, ..., n

where:
y is the dependent variable.
x_i1, ..., x_ip are the independent variables.
E is the error.
B_0, ..., B_p are the coefficients.
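As an illustration of the least squares fit described above, the following sketch estimates the coefficients B_0, ..., B_p by solving the normal equations. This is a textbook approach shown under that assumption, not the HANA implementation; the function name `fit_multiple_linear` and the sample data are hypothetical.

```python
def fit_multiple_linear(rows, ys):
    """Return [B0, B1, ..., Bp] minimizing the sum of squared errors."""
    # Design matrix with a leading 1 in each row for the constant term B0.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X^T X) B = X^T y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, ys))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; coefficients are not unique")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]  # generated with B = (1, 2, 3)
coeffs = fit_multiple_linear(rows, ys)
print([round(c, 6) for c in coeffs])
```

Solving the normal equations directly keeps the sketch short; production implementations typically prefer a QR or similar decomposition for numerical stability.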


HANA Multiple Linear Regression Properties


Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Fill: Fills missing values in the target column. Trend: Provides the input data from the data source along with predicted values for the given dataset.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Independent Columns
Description: Input source columns using which the regression is to be performed.
Default Value: None
Possible Values: Any positive integer value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Dependent Column
Description: Target column on which the regression is to be performed.
Default Value: None
Possible Values: Any positive integer value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes, only if output mode is fill

Property Name: Missing Values
Description: Method for handling missing values.
Default Value: Remove
Possible Values: Remove: The algorithm skips the records containing missing values in the independent or dependent column. Ignore: The algorithm ignores the record containing missing values during calculation. However, the records are retained in the result table. Stop: The algorithm stops if the missing value is found in independent or dependent column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Predicted Column Name
Description: Enter a name for the new column that contains the calculated values.
Default Value: PredictedValue
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Number of Threads
Description: Enter the number of threads that can be used for execution.
Default Value: 1
Possible Values: Any positive integer value > 0
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

8.3.7.4 Linear Regression


The Linear Regression algorithm provides an objective method of fitting a straight line to a sample of observations. It is used to determine how one variable influences another by means of the least squares method.

The dependency between the independent variable (x) and the dependent variable (y) is of the form:

y = a + bx

where:
y is the dependent variable.
x is the independent variable.
a is the intercept.
b is the slope.

Note: The data type of the columns used during model scoring should be the same as the data type of the columns used while building the model.

PMML version supported: PMML 3.2 for R components, PMML 4.0 for SAP components
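A minimal sketch of the straight-line fit follows. It also shows one plausible reading of a fill-style output, where records with a missing dependent value are excluded from the fit and then filled with the predicted value. How the component itself combines missing-value handling with the output mode is an assumption here, and the names `fit_linear`, `records`, and `filled` are hypothetical.

```python
def fit_linear(xs, ys):
    """Return (a, b) for the least squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# (independent, dependent) pairs; None marks a missing dependent value.
records = [(1.0, 3.0), (2.0, 5.0), (3.0, None), (4.0, 9.0)]

# Fit only on the complete records.
complete = [(x, y) for x, y in records if y is not None]
a, b = fit_linear([x for x, _ in complete], [y for _, y in complete])

# Fill the missing dependent values with the model's prediction.
filled = [(x, y if y is not None else a + b * x) for x, y in records]
print(filled)
```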


Linear Regression Properties


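Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Fill: Fills missing values in the target column. Trend: Provides the input data from the data source along with predicted values for the given dataset.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Dependent Column
Description: Target column on which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes, only if output mode is fill

Property Name: Independent Column
Description: Input source column using which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Missing Values
Description: Method for handling missing values.
Default Value: Remove
Possible Values: Remove: The algorithm skips the records containing missing values in the independent or dependent column. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Predicted Column Name
Description: Name for the new column that contains predicted values.
Default Value: PredictedValues
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No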


8.3.7.5 Logarithmic Regression


The Logarithmic Regression algorithm fits a logarithmic function to a sample of observations by means of the least squares method. It is used to determine how one variable influences another.

The dependency between the independent variable (x) and the dependent variable (y) is of the form:

y = a log(x) + b

where:
y is the dependent variable.
x is the independent variable.
a is the coefficient of log(x) (the slope).
b is the constant term (the intercept).

Note: The data type of the columns used during model scoring should be the same as the data type of the columns used while building the model.

PMML version supported: PMML 3.2 for R components, PMML 4.0 for SAP components
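The logarithmic form can be sketched as a straight-line fit of y against log(x). The sketch assumes the natural logarithm, which the guide does not specify, and the function name `fit_logarithmic` is hypothetical.

```python
import math

def fit_logarithmic(xs, ys):
    """Fit y = a * ln(x) + b by least squares.

    Assumptions: natural logarithm, and every x is strictly positive.
    """
    us = [math.log(x) for x in xs]
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    # a multiplies ln(x); b is the constant term.
    a = (sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
         / sum((u - mean_u) ** 2 for u in us))
    b = mean_y - a * mean_u
    return a, b

xs = [1.0, math.e, math.e ** 2, math.e ** 3]
ys = [2.0 * math.log(x) + 5.0 for x in xs]  # generated with a = 2, b = 5
a, b = fit_logarithmic(xs, ys)
print(round(a, 6), round(b, 6))
```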


Logarithmic Regression Properties


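Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Fill: Fills missing values in the target column. Trend: Provides the input data from the data source along with predicted values for the given dataset.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Dependent Column
Description: Target column on which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes, only if output mode is fill

Property Name: Independent Column
Description: Input source column using which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Missing Values
Description: Method for handling missing values.
Default Value: Remove
Possible Values: Remove: The algorithm skips the records containing missing values in the independent or dependent column. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Predicted Column Name
Description: Name for the new column that contains predicted values.
Default Value: PredictedValues
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No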


8.3.7.6 R-Exponential Regression


The R-Exponential Regression algorithm fits an exponential function to a sample of observations by using the R library. It is used to determine how one variable influences another.

The dependency between the independent variable (x) and the dependent variable (y) is of the form:

y = a b^x

where:
y is the dependent variable.
x is the independent variable.
a is the intercept (the value of y at x = 0).
b is the slope (on a logarithmic scale, the slope is log b).

Note: The data type of the columns used during model scoring should be the same as the data type of the columns used while building the model.

PMML version supported: PMML 3.2 for R components, PMML 4.0 for SAP components
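The exponential form can be sketched by taking the logarithm of the dependent variable and fitting a straight line, since ln(y) = ln(a) + x ln(b). This is an illustrative assumption written in Python; it does not reproduce the R calls the component makes, and the function name `fit_exponential` is hypothetical.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on ln(y).

    Assumption: every y is strictly positive so the logarithm exists.
    """
    vs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_v = sum(vs) / n
    # Slope and intercept of the line ln(y) = ln(a) + x * ln(b).
    slope = (sum((x - mean_x) * (v - mean_v) for x, v in zip(xs, vs))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_v - slope * mean_x
    return math.exp(intercept), math.exp(slope)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * 3.0 ** x for x in xs]  # generated with a = 2, b = 3
a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))
```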


R-Exponential Regression Properties


Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Trend: Provides the input data from the data source along with predicted values for the given dataset. Fill: Fills missing values in the target column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Independent Column
Description: Input source column using which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Dependent Column
Description: Target column on which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes, only if output mode is fill

Property Name: Missing Values
Description: Method for handling missing values.
Default Value: Remove
Possible Values: Remove: The algorithm skips the records containing missing values in the independent or dependent column. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Predicted Column Name
Description: Name for the new column that contains predicted values.
Default Value: PredictedValues
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Allow Singular Fit
Description: Boolean value. If set to true, the aliased coefficients are ignored in the coefficient covariance matrix. If set to false, a model having aliased coefficients produces an error. A model having aliased coefficients signifies that the square matrix x*x is singular.
Default Value: True
Possible Values: True, False
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Contrasts
Description: List of contrasts to be used for factors appearing as variables in the model.
Default Value: None
Possible Values: contr.treatment, contr.poly, contr.sum, contr.helmert
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

8.3.7.7 R-Geometric Regression


The R-Geometric Regression algorithm fits a geometric function to a sample of observations by using the R library. It is used to determine how one variable influences another.

The dependency between the independent variable (x) and the dependent variable (y) is of the form:

y = a x^b

where:
y is the dependent variable.
x is the independent variable.
a is the intercept (on a log-log scale, the constant term is log a).
b is the slope (on a log-log scale).

Note: The data type of the columns used during model scoring should be the same as the data type of the columns used while building the model.


PMML version supported: PMML 3.2 for R components, PMML 4.0 for SAP components

R-Geometric Regression Properties


Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Trend: Provides the input data from the data source along with predicted values for the given dataset. Fill: Fills missing values in the target column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Independent Column
Description: Input source column using which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Dependent Column
Description: Target column on which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes, only if output mode is fill

Property Name: Missing Values
Description: Method for handling missing values.
Default Value: Remove
Possible Values: Remove: The algorithm skips the records containing missing values in the independent or dependent column. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Predicted Column Name
Description: Name for the new column that contains predicted values.
Default Value: PredictedValues
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Allow Singular Fit
Description: Boolean value. If set to true, the aliased coefficients are ignored in the coefficient covariance matrix. If set to false, a model having aliased coefficients produces an error. A model having aliased coefficients signifies that the square matrix x*x is singular.
Default Value: True
Possible Values: True, False
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Contrasts
Description: List of contrasts to be used for factors appearing as variables in the model.
Default Value: None
Possible Values: contr.treatment, contr.poly, contr.sum, contr.helmert
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

8.3.7.8 R-Linear Regression


The R-Linear Regression algorithm provides an objective method of fitting a straight line to a sample of observations. It is used to determine how one variable influences another by using the R library.

The dependency between the independent variable (x) and the dependent variable (y) is of the form:

y = a + bx

where:
y is the dependent variable.
x is the independent variable.
a is the intercept.
b is the slope.

Note: The data type of the columns used during model scoring should be the same as the data type of the columns used while building the model.


PMML version supported: PMML 3.2 for R components, PMML 4.0 for SAP components

R-Linear Regression Properties


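Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Trend: Provides the input data from the data source along with predicted values for the given dataset. Fill: Fills missing values in the target column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Independent Column
Description: Input source column using which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Dependent Column
Description: Target column on which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes, only if output mode is fill

Property Name: Missing Values
Description: Method for handling missing values.
Default Value: Remove
Possible Values: Remove: The algorithm skips the records containing missing values in the independent or dependent column. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Predicted Column Name
Description: Name for the new column that contains predicted values.
Default Value: PredictedValues
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Allow Singular Fit
Description: Boolean value. If set to true, the aliased coefficients are ignored in the coefficient covariance matrix. If set to false, a model having aliased coefficients produces an error. A model having aliased coefficients signifies that the square matrix x*x is singular.
Default Value: True
Possible Values: True, False
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Confidence Level
Description: Accuracy of the predictions.
Default Value: 0.95
Possible Values: Range: 0-1
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Degrees of Freedom
Description: Number of values in the final calculation of a statistic that are free to vary.
Default Value: 5
Possible Values: Any positive integer value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Contrasts
Description: List of contrasts to be used for factors appearing as variables in the model.
Default Value: None
Possible Values: contr.treatment, contr.poly, contr.sum, contr.helmert
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No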

8.3.7.9 R-Logarithmic Regression


The R-Logarithmic Regression algorithm fits a logarithmic function to a sample of observations by using the R library. It is used to determine how one variable influences another.

The dependency between the independent variable (x) and the dependent variable (y) is of the form:

y = a log(x) + b

where:
y is the dependent variable.
x is the independent variable.
a is the coefficient of log(x) (the slope).
b is the constant term (the intercept).

Note: The data type of the columns used during model scoring should be the same as the data type of the columns used while building the model.

PMML version supported: PMML 3.2 for R components, PMML 4.0 for SAP components


R-Logarithmic Regression Properties


Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Trend: Provides the input data from the data source along with predicted values for the given dataset. Fill: Fills missing values in the target column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Independent Column
Description: Input source column using which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Dependent Column
Description: Target column on which the regression is to be performed.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes, only if output mode is fill

Property Name: Missing Values
Description: Method for handling missing values.
Default Value: Remove
Possible Values: Remove: The algorithm skips the records containing missing values in the independent or dependent column. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Predicted Column Name
Description: Name for the new column that contains predicted values.
Default Value: PredictedValues
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Allow Singular Fit
Description: Boolean value. If set to true, the aliased coefficients are ignored in the coefficient covariance matrix. If set to false, a model having aliased coefficients produces an error. A model having aliased coefficients signifies that the square matrix x*x is singular.
Default Value: True
Possible Values: True, False
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Contrasts
Description: List of contrasts to be used for factors appearing as variables in the model.
Default Value: None
Possible Values: contr.treatment, contr.poly, contr.sum, contr.helmert
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

8.3.7.10 R-Multiple Linear Regression


The R-Multiple Linear Regression algorithm finds the linear relationship that best predicts a continuous dependent variable from a group of independent variables, by means of the least squares method.

The dependency between the independent variables (x_i1, ..., x_ip) and the dependent variable (y) is of the form:

y_i = B_0 + B_1 x_i1 + B_2 x_i2 + ... + B_p x_ip + E_i, where i = 1, 2, ..., n

where:
y is the dependent variable.
x_i1, ..., x_ip are the independent variables.
E is the error.
B_0, ..., B_p are the coefficients.

Note: The data type of the columns used during model scoring should be the same as the data type of the columns used while building the model.


PMML version supported: PMML 3.2 for R components, PMML 4.0 for SAP components

R-Multiple Linear Regression Properties


Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Fill: Fills missing values in the target column. Trend: Provides the input data from the data source along with predicted values for the given dataset.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Independent Columns
Description: Input source columns using which the regression is to be performed.
Default Value: None
Possible Values: Any positive integer value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Dependent Column
Description: Target column on which the regression is to be performed.
Default Value: None
Possible Values: Any positive integer value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes, only if output mode is fill

Property Name: Missing Values
Description: Method for handling missing values.
Default Value: Remove
Possible Values: Remove: The algorithm skips the records containing missing values in the independent or dependent column. Keep: The algorithm retains missing values for processing. Stop: The algorithm stops if the missing value is found in independent or dependent column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Confidence Level
Description: Accuracy of the predictions.
Default Value: 0.95
Possible Values: Range: 0-1
Input Required?: Optional

Property Name: Degrees of Freedom
Description: Number of values in the final calculation of a statistic that are free to vary.
Possible Values: Any positive integer value
Input Required?: Optional

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

8.3.8 Time Series

8.3.8.1 Triple Exponential Smoothing


The Triple Exponential Smoothing algorithm enables you to smooth the source data by reducing noise and to perform predictions on time series data. Time series data is a sequence of observations over a period of time. Using this algorithm, you can also find seasonal trends in data.
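The idea can be sketched with the additive Holt-Winters recurrences, which track a level, a trend, and one seasonal component per period using three smoothing constants (alpha, beta, gamma). The additive form, the initialization from the first two seasons, and the function name `triple_exponential_smoothing` are all assumptions made for illustration; the guide does not say which variant or initialization the component uses.

```python
def triple_exponential_smoothing(series, period, alpha, beta, gamma, n_predict):
    """Additive Holt-Winters: one-step-ahead fitted values, then forecasts."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons of data")
    # Assumed initialization: level = mean of the first season, trend =
    # average per-step change between the first two seasons.
    level = sum(series[:period]) / period
    trend = (sum(series[period:2 * period]) - sum(series[:period])) / period ** 2
    seasonal = [series[i] - level for i in range(period)]
    output = []
    for t, value in enumerate(series):
        s = seasonal[t % period]
        output.append(level + trend + s)  # prediction made before seeing value
        last_level = level
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (value - level) + (1 - gamma) * s
    n = len(series)
    for h in range(1, n_predict + 1):
        output.append(level + h * trend + seasonal[(n + h - 1) % period])
    return output

# Three years of quarterly data with a repeating seasonal pattern.
quarterly = [10, 20, 30, 40] * 3
result = triple_exponential_smoothing(quarterly, 4, 0.3, 0.1, 0.1, 2)
print([round(v, 3) for v in result[-2:]])  # the two forecasted quarters
```

With period set to 4 the sketch corresponds to quarterly data; a period of 12 would correspond to monthly data.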


Triple Exponential Smoothing Properties


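Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Trend: Outputs the source data along with predicted values for the given dataset. Forecast: Outputs forecasted values for the given time period.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Dependent Column
Description: Input column to be forecasted.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Consider Date Column
Description: A Boolean value that specifies whether to use the date column.
Default Value: False
Possible Values: True, False
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Date Column
Description: Column containing dates.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory if "Consider Date Column" is selected.
Is this property configurable when the algorithm is used as a model?: No

Property Name: Missing Values
Description: Method for handling missing values.
Default Value: Remove if "Consider Date Column" is not selected. Stop if "Consider Date Column" is selected.
Possible Values: Remove: The algorithm skips the records containing missing values in the independent or dependent column. Stop: The algorithm stops if the missing value is found in independent or dependent column.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Start Year
Description: Year from which the observations are to be considered.
Default Value: None
Possible Values: Any valid year. For example, 2009, 1987, 2019.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Start Period
Description: Period from which the observations are to be considered.
Default Value: 1
Possible Values: If the value of Period is quarter, the start period value can be between 1 and 4. If the value of Period is month, the start period value can be between 1 and 12.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Period
Description: Period to be considered in a year.
Default Value: None
Possible Values: Quarter (4), Month (12), Year (1), Custom
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Periods Per Year
Description: Periods to be considered in a year.
Default Value: None
Possible Values: Any positive integer value.
Input Required?: Mandatory if "Period" is Custom.
Is this property configurable when the algorithm is used as a model?: No

Property Name: Periods to Predict
Description: Number of future periods to predict.
Default Value: None
Possible Values: Any positive integer value. For example, if the time series input data ranges from the 1st quarter of 2000 to the 1st quarter of 2009, and you select Period as Quarter and Periods to Predict as 1, then the 2nd quarter of 2009 is predicted.
Input Required?: Mandatory if "Output Mode" is Forecast.
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Predicted Column Name
Description: Name for the new column that contains the predicted values.
Default Value: PredictedValues
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Year Values
Description: Name for the new column that contains year values.
Default Value: Year
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Period Values
Description: Name for the new column that contains period values.
Default Value: Period
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Custom
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Month Values
Description: Name for the new column that contains month values.
Default Value: Month
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Month
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Quarter Values
Description: Name for the new column that contains quarter values.
Default Value: Quarter
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Quarter
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Alpha
Description: Smoothing constant used for smoothing observations.
Default Value: 0.3
Possible Values: Range: 0-1
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Beta
Description: Smoothing constant used for finding trend parameters.
Default Value: 0.1
Possible Values: Range: 0-1
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Gamma
Description: Smoothing constant used for finding seasonal trends in observations.
Default Value: 0.1
Possible Values: Range: 0-1
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No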

8.3.8.2 R-Double Exponential Smoothing


The R-Double Exponential Smoothing algorithm enables you to smooth the source data by reducing noise and performing prediction for the time series data by using R library functions. Time series data is a sequence of observations over a period of time. You can also use this algorithm to find trends in data.

Note: When you use a time series model for analysis, the model does not consider data from the selected data source. Instead, it forecasts by considering data that was used while generating a model.

R-Double Exponential Smoothing Properties
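
Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Trend: Outputs the source data along with predicted values for the given dataset. Forecast: Outputs forecasted values for the given time period.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Dependent Column
Description: Column to be forecast.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Period
Description: Period in which the data is to be forecast.
Default Value: None
Possible Values: Quarter (4), Month (12), Custom
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Periods Per Year
Description: Periods to be considered in a year.
Default Value: None
Possible Values: Any positive integer value.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Start Year
Description: Year from which the observations must be considered.
Default Value: None
Possible Values: Any valid year. For example, 2009, 1987, 2019.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Start Period
Description: Period from which the observations must be considered.
Default Value: 1
Possible Values: If the value of Period is quarter, then the start period value can be between 1 and 4. If the value of Period is month, then the start period value can be between 1 and 12.
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Periods to Predict
Description: Number of future periods to predict.
Default Value: None
Possible Values: Any positive integer value. For example, if the time series input data ranges from the 1st quarter of 2000 to the 1st quarter of 2009, and you select Period as Quarter and Periods to Predict as 1, then the 2nd quarter of 2009 is predicted.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Predicted Column Name
Description: Name for the new column that contains the predicted values.
Default Value: PredictedValues
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Year Values
Description: Name for the new column that contains year values.
Default Value: Year
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Period Values
Description: Name for the new column that contains period values.
Default Value: Period
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Custom
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Month Values
Description: Name for the new column that contains month values.
Default Value: Month
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Month
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Quarter Values
Description: Name for the new column that contains quarter values.
Default Value: Quarter
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Quarter
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Alpha
Description: Smoothing constant used for smoothing observations.
Default Value: 0.3
Possible Values: Range: 0-1
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Beta
Description: Smoothing constant used for finding trend parameters.
Default Value: 0.1
Possible Values: Range: 0-1
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: No. Periodic Observations
Description: Number of observations required to start the calculation.
Default Value: 2
Possible Values: Any positive integer value N such that N*Period <= number of input records
Input Required?: Mandatory when Output Mode is Forecast, otherwise optional.
Is this property configurable when the algorithm is used as a model?: No

Property Name: Level
Description: Start value for finding level (a[0]).
Default Value: None
Possible Values: Any double value. For example, 0.4, 1.50, 7.45.
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Trend
Description: Start value for finding trend parameters (b[0]).
Default Value: None
Possible Values: Any double value.
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Optimizer Inputs
Description: Starting values for alpha, beta, and gamma required for the optimizer.
Default Value: 0.3, 0.1, 0.1
Possible Values: Range: 0-1 for alpha, beta, and gamma
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Confidence Level
Description: Accuracy of the prediction.
Default Value: 0.95
Possible Values: Range: 0-1
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No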
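The roles of the Alpha and Beta constants and of the Level and Trend start values in double exponential smoothing can be sketched with Holt's linear recurrences. The sketch is written in Python for illustration and does not reproduce the R library calls used by the component; the default start values chosen when none are supplied and the function name `double_exponential_smoothing` are assumptions.

```python
def double_exponential_smoothing(series, alpha, beta, n_predict,
                                 level=None, trend=None):
    """Holt's linear method: return forecasts for the next n_predict periods."""
    # Assumed defaults when no start values are given: the first observation
    # as the level and the first difference as the trend.
    level = series[0] if level is None else level
    trend = series[1] - series[0] if trend is None else trend
    for value in series[1:]:
        last_level = level
        # alpha smooths the observations; beta smooths the trend estimate.
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, n_predict + 1)]

forecast = double_exponential_smoothing([2.0, 4.0, 6.0, 8.0], 0.3, 0.1, 2)
print([round(v, 6) for v in forecast])
```

For a series that grows by a constant step, the level and trend estimates follow the data exactly, so the forecasts continue the same line.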

8.3.8.3 R-Single Exponential Smoothing


The R-Single Exponential Smoothing algorithm enables you to smooth the source data by reducing noise and performing prediction for the time series data by using R library functions. Time series data is a sequence of observations over a period of time.
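Single exponential smoothing can be sketched as a running weighted average in which a smoothing constant alpha controls how quickly older observations are discounted. The sketch is written in Python for illustration rather than with the R library functions the component uses; the default start value for the level and the function name `single_exponential_smoothing` are assumptions.

```python
def single_exponential_smoothing(series, alpha, n_predict, level=None):
    """Return (one-step-ahead fitted values, flat forecast)."""
    # Assumed default: start the level at the first observation.
    level = series[0] if level is None else level
    fitted = []
    for value in series:
        fitted.append(level)  # prediction made before seeing value
        level = alpha * value + (1 - alpha) * level
    # Without a trend or seasonal term, every future period gets the last level.
    return fitted, [level] * n_predict

fitted, forecast = single_exponential_smoothing([10.0, 20.0], 0.5, 2)
print(fitted, forecast)
```

An alpha close to 1 makes the smoothed level follow the latest observation closely, while an alpha close to 0 gives more weight to the history.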


Note: When you use a time series model for analysis, the model does not consider data from the selected data source. Instead, it forecasts by considering data that was used while generating a model.

R-Single Exponential Smoothing Properties
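
Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Trend: Outputs the source data along with predicted values for the given dataset. Forecast: Outputs forecasted values for the given time period.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Dependent Column
Description: Column to be forecasted.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Period
Description: Period in which the data is to be forecast.
Default Value: None
Possible Values: Quarter (4), Month (12), Custom
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Periods Per Year
Description: Periods to be considered in a year.
Default Value: None
Possible Values: Any positive integer value.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Start Year
Description: Year from which the observations are to be considered.
Default Value: None
Possible Values: Any valid year. For example, 2009, 1987, 2019.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Start Period
Description: Period from which the observations are to be considered.
Default Value: 1
Possible Values: If the value of Period is quarter, then the start period value can be between 1 and 4. If the value of Period is month, then the start period value can be between 1 and 12.
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Periods to Predict
Description: Number of future periods to predict.
Default Value: None
Possible Values: Any positive integer value. For example, if the time series input data ranges from the 1st quarter of 2000 to the 1st quarter of 2009, and you select Period as Quarter and Periods to Predict as 1, then the 2nd quarter of 2009 is predicted.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Predicted Column Name
Description: Name for the new column that contains the predicted values.
Default Value: PredictedValues
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Year Values
Description: Name for the new column that contains year values.
Default Value: Year
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Period Values
Description: Name for the new column that contains period values.
Default Value: Period
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Custom
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Month Values
Description: Name for the new column that contains month values.
Default Value: Month
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Month
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Quarter Values
Description: Name for the new column that contains quarter values.
Default Value: Quarter
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Quarter
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Alpha
Description: Smoothing constant used for smoothing observations.
Default Value: 0.3
Possible Values: Range: 0-1
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: Level
Description: Start value for finding level (a[0]).
Default Value: None
Possible Values: Any double value. For example, 0.4, 1.50, 7.45.
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No

Property Name: No. Periodic Observations
Description: Number of observations required to start the calculation.
Default Value: 2
Possible Values: Any positive integer value N such that N*Period <= number of input records
Input Required?: Mandatory when Output Mode is Forecast, otherwise optional.
Is this property configurable when the algorithm is used as a model?: No

Property Name: Confidence Level
Description: Accuracy of the prediction.
Default Value: 0.95
Possible Values: Range: 0-1
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Model Name
Description: Name for the model.
Default Value: None
Possible Values: Any string value
Input Required?: Optional
Is this property configurable when the algorithm is used as a model?: No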

8.3.8.4 R-Triple Exponential Smoothing


The R-Triple Exponential Smoothing algorithm enables you to smooth the source data by reducing noise and performing prediction for the time series data by using R library functions. Time series data is a sequence of observations over a period of time. You can also use this algorithm to find seasonal trends in data.

Note: When you use a time series model for analysis, the model does not consider data from the selected data source. Instead, it forecasts by considering data that was used while generating a model.


R-Triple Exponential Smoothing


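Property Name: Output Mode
Description: Mode in which the output is to be displayed.
Default Value: Trend
Possible Values: Trend: Outputs the source data along with predicted values for the given dataset. Forecast: Outputs forecasted values for the given time period.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Dependent Column
Description: Column to be forecast.
Default Value: None
Possible Values: Column name
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Period
Description: Period in which the data needs to be forecast.
Default Value: None
Possible Values: Quarter (4), Month (12), Custom
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Periods Per Year
Description: Periods to be considered in a year.
Default Value: None
Possible Values: Any positive integer value.
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: No

Property Name: Predicted Column Name
Description: Name for the new column that contains the predicted values.
Default Value: PredictedValues
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Year Values
Description: Name for the new column that contains year values.
Default Value: Year
Possible Values: Any string value
Input Required?: Mandatory
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Period Values
Description: Name for the new column that contains period values.
Default Value: Period
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Custom
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Month Values
Description: Name for the new column that contains month values.
Default Value: Month
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Month
Is this property configurable when the algorithm is used as a model?: Yes

Property Name: Quarter Values
Description: Name for the new column that contains quarter values.
Default Value: Quarter
Possible Values: Any string value
Input Required?: Mandatory only if "Period" is Quarter
Is this property configurable when the algorithm is used as a model?: Yes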

136

2012-10-04

Component Properties

Property Name Description

Default Value Possible Values

Is this property configurable when the algorithm is Input Required? used as a model?

Start Year Start Period

Year from which the ob- None Any valid year. For exam- Mandatory servations are to be ple, 2009, 1987, 2019. considered. Period from which the observations are to be considered. 1 If the value of Period is quarter, then the start period value can be between 1 and 4. If the value of Period is month, then the start period value can be between 1 and 12. If the value of Period is custom (n), then the start period value can be between 1 and n. Optional

No

No

Peri- Number of future periods None Any positive integer val- Mandatory ods to to predict. ue. PreFor example, consider dict the time series input data ranges from 2000 1st quarter - 2009 1st quarter. If you select Period as Quarter, and Periods to Predict as 1, then the 2009 2nd quarter will be predicted. Alpha Smoothing constant 0.3 used for smoothing observations. Beta Smoothing constant used for finding trend parameters. 0.1 Range: 0-1 Optional

Yes

No

Range: 0-1

Optional

No

Gam- Smoothing constant 0.1 ma used for finding seasonal parameters.

Range: 0-1

Optional

No

137

2012-10-04

Component Properties

Property Name Description

Default Value Possible Values

Is this property configurable when the algorithm is Input Required? used as a model?

Sea- Provides options for se- Addi- sonal lecting which type of tive HoltWinter Exponential Smoothing formulas to use for calculations.

Additive: Multiplicative:

Optional

No

Confi- Accuracy of the predic- 0.95 Range: 0-1 dence tion. Level No. Number of observations 2 Peri- required to start the calodic culation. Observations Level Start value for finding level (a[0]) Any positive integer value N such that N*Period <= number of input records

Optional

Yes

Mandatory No when Output Mode is Predict, otherwise optional. No No No

None Any double value. For Optional example, 0.4, 1.50, 7.45. Optional

Trend Start value for finding None Any double value. trend parameters (b[0]) Season

Start values used for None This value is dependent Optional finding seasonal paramon the period you select. eters. For example, if you select quarter as period, provide four double values. Range: 0-1 for alpha, beta, and gamma Optional

Opti- Starting values for alpha, 0.3, mizer beta, and gamma re0.1, Inputs quired for the optimizer. 0.1 Model Name for the model. Name

No

None Any string value

Optional

No
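The Periods to Predict example and the N*Period constraint from the properties above can be checked with a small calculation. The following Python sketch is illustrative only; the function names `forecast_periods` and `enough_observations` are hypothetical and are not part of SAP Predictive Analysis.

```python
def forecast_periods(last_year, last_period, periods_per_year, periods_to_predict):
    """List the (year, period) pairs that follow the last observed period."""
    if not 1 <= last_period <= periods_per_year:
        raise ValueError("last_period must be between 1 and periods_per_year")
    if periods_to_predict < 1:
        raise ValueError("periods_to_predict must be a positive integer")

    labels = []
    year, period = last_year, last_period
    for _ in range(periods_to_predict):
        period += 1
        if period > periods_per_year:  # roll over into the next year
            period = 1
            year += 1
        labels.append((year, period))
    return labels


def enough_observations(n_periodic_observations, periods_per_year, n_records):
    """Check the documented constraint N * Period <= number of input records."""
    return n_periodic_observations * periods_per_year <= n_records


# Data from 2000 Q1 to 2009 Q1, Period = Quarter, Periods to Predict = 1.
print(forecast_periods(2009, 1, 4, 1))  # [(2009, 2)]
```

Quarterly data from the 1st quarter of 2000 to the 1st quarter of 2009 contains 37 records, so `enough_observations(2, 4, 37)` holds for the default of 2 periodic observations.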

8.4 Data Writers


Use data writers to store the results of the analysis in flat files or databases for further analysis.


8.4.1 CSV Writer


The CSV Writer component enables you to write data to flat files such as CSV, TEXT, and DAT files.

CSV Writer Properties

File Name
  Description: Fully qualified name of the CSV, Text, or DAT file.
  Default Value: None
  Possible Values: C:\abc.csv
  Input Required?: Mandatory

Overwrite
  Description: Overwrites the existing file.
  Default Value: True
  Possible Values: True, False
  Input Required?: Optional

Column Separator
  Description: Column delimiter that separates data tokens in the file.
  Default Value: Comma(,)
  Possible Values: Comma(,) Semicolon(;) Tab( ) Colon(:) Pipe(|) Space(" ")
  Input Required?: Mandatory

Quotation Character
  Description: Specifies how single or double quotation marks are to be treated when importing data.
  Default Value: None
  Possible Values: Single(') Double(") None
  Input Required?: Optional

Include Column Headers
  Description: Boolean value. If set to true, adds the column name/header.
  Default Value: True
  Possible Values: True, False
  Input Required?: Optional

Decimal Separator
  Description: Decimal separator used in the data source.
  Default Value: Period(.)
  Possible Values: Period(.) Comma(,)
  Input Required?: Optional

Grouping Separator
  Description: Grouping separator or thousand separator used in the data source.
  Default Value: None
  Possible Values: Period(.) Comma(,) None
  Input Required?: Optional

Number Format
  Description: Format of the number.
  Default Value: None
  Possible Values: For example, ##,###.### or 1,980.986
  Input Required?: Optional

Date Time Format
  Description: Format of the date.
  Default Value: None
  Possible Values: For example, dd/mm/yyyy
  Input Required?: Optional

Encoding
  Description: Text-encoding method.
  Default Value: None
  Possible Values: UTF-8, None
  Input Required?: Optional
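As a rough illustration of how properties such as Column Separator, Quotation Character, Include Column Headers, Decimal Separator, and Overwrite affect the written file, here is a minimal sketch that uses Python's standard `csv` module. The function `write_csv` and its parameter names are hypothetical and only mirror the property names; this is not how the CSV Writer component is implemented.

```python
import csv
import os


def write_csv(path, header, rows, column_separator=",", quotation_character=None,
              include_column_headers=True, decimal_separator=".",
              overwrite=True, encoding="utf-8"):
    """Write rows to a delimited text file (illustrative helper, not the product API)."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"{path} exists and overwrite is False")

    # Map the Quotation Character property onto the csv module's quoting modes.
    options = {"delimiter": column_separator, "escapechar": "\\"}
    if quotation_character is None:
        options["quoting"] = csv.QUOTE_NONE
    else:
        options["quoting"] = csv.QUOTE_MINIMAL
        options["quotechar"] = quotation_character

    def render(value):
        # Apply the Decimal Separator property to floating-point values only.
        if isinstance(value, float):
            return str(value).replace(".", decimal_separator)
        return value

    with open(path, "w", newline="", encoding=encoding) as handle:
        writer = csv.writer(handle, **options)
        if include_column_headers:
            writer.writerow(header)
        for row in rows:
            writer.writerow([render(value) for value in row])
```

For example, calling `write_csv` with `column_separator=";"` and `decimal_separator=","` writes the value 3.5 as `3,5`, which avoids a clash between the decimal comma and the column delimiter.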

8.4.2 HANA Writer


The HANA Writer component enables you to write data into SAP NewDB database tables.

HANA Writer Properties

Schema Name
  Description: The name of the schema.
  Default Value: None
  Possible Values: The schema name.
  Input Required?: Mandatory

Table Type
  Description: The table type to write the data.
  Default Value: Column Store
  Possible Values: Column Store, Row Store
  Input Required?: Mandatory

Table Name
  Description: The name of the table.
  Default Value: None
  Possible Values: The table name.
  Input Required?: Mandatory

Overwrite
  Description: Overwrite table if already exists.
  Default Value: False
  Possible Values: True, False
  Input Required?: Mandatory

8.4.3 JDBC Writer


The JDBC Writer component enables you to write data into relational databases such as MySQL, MS SQL Server, DB2, Oracle, SAP MaxDB, and SAP NewDB. Note: You need to configure SAP Predictive Analysis to use the corresponding JDBC driver. For configuration information, see the SAP Predictive Analysis Installation and Configuration Guide.

JDBC Writer Properties

Database Type
  Description: List of databases supported by Predictive Analysis.
  Default Value: None
  Possible Values: MySQL, SAP MaxDB, MS SQL Server, Oracle, DB2, Sybase, SAP NewDB
  Input Required?: Mandatory

Machine Name
  Description: Name of the machine on which the database is installed.
  Default Value: None
  Possible Values: Any string value
  Input Required?: Mandatory

Port Number
  Description: Database or service port number.
  Default Value: None
  Possible Values: For example, 3306, 30415, and so on
  Input Required?: Mandatory

Database Name
  Description: Name of the database.
  Default Value: None
  Possible Values: Any string value
  Input Required?: Mandatory

User Name
  Description: Database user name.
  Default Value: None
  Possible Values: Any string value
  Input Required?: Mandatory

Password
  Description: Database password.
  Default Value: None
  Possible Values: Any string value
  Input Required?: Mandatory

Table Type
  Description: The table type to write the data.
  Default Value: Column Store
  Possible Values: Column Store, Row Store
  Input Required?: Mandatory only if you select the "Database Type" as "NewDB"

Table Name
  Description: Name of the table.
  Default Value: None
  Possible Values: Table name
  Input Required?: Mandatory
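The mandatory and conditionally mandatory JDBC Writer properties can be expressed as a simple check. The following Python sketch is only an illustration; the dictionary keys (`database_type`, `machine_name`, and so on) and the function `missing_properties` are hypothetical names chosen to mirror the property names, not an actual SAP Predictive Analysis interface.

```python
# Properties listed as Mandatory for every database type.
ALWAYS_REQUIRED = (
    "database_type",
    "machine_name",
    "port_number",
    "database_name",
    "user_name",
    "password",
    "table_name",
)


def missing_properties(settings):
    """Return the names of required properties that are absent or empty."""
    def is_blank(key):
        return settings.get(key) in (None, "")

    missing = [key for key in ALWAYS_REQUIRED if is_blank(key)]
    # Table Type is mandatory only when the database type is SAP NewDB.
    if settings.get("database_type") == "SAP NewDB" and is_blank("table_type"):
        missing.append("table_type")
    return missing


# Hypothetical settings for a MySQL target; no table type is needed here.
example = {
    "database_type": "MySQL",
    "machine_name": "dbhost",
    "port_number": 3306,
    "database_name": "sales",
    "user_name": "analyst",
    "password": "secret",
    "table_name": "forecast_results",
}
print(missing_properties(example))  # []
```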

8.5 Saved Models


Models that you create by saving the state of algorithms are listed under Saved Models. The SAP Predictive Analysis application does not contain predefined models. Therefore, when you launch the application for the first time, this list will be empty. For information on creating a new model, see the "Creating a Model" section under Working with Models.


Use Case Scenarios


This section provides use case scenarios that describe how you can use SAP Predictive Analysis to analyze data and forecast future events.

9.1 Sales Forecasting


Scenario: The Regional Manager of an airline company wants to develop strategies to increase business and fine-tune operations. Airline passenger data, such as the flight date and the number of passengers who traveled, is stored in a CSV file. The manager wants to analyze the business trend since 2000 and forecast the number of passengers flying in the next year (for example, 2012).

This example assumes that the manager has some basic knowledge of statistical analysis and data mining techniques. Using SAP Predictive Analysis, the manager creates a forecasting analysis. Because the airline passenger data is seasonal in nature, the manager selects the Triple Exponential Smoothing algorithm for forecasting.

To create an analysis for forecasting airline passengers, perform the following steps:

1. Open a new analysis in SAP Predictive Analysis.
2. From the components panel, select the CSV Reader component and drag it onto the analysis editor.
3. In the properties view, perform the following steps:
   a. In the General tab, rename the component to Airline Passengers.
   b. In the Settings tab, select the Airline Passenger.csv file.
4. Preview the data in the CSV file.
5. Run the analysis to retrieve the component results. The "Analyze the result" dialog is displayed. Select Yes to switch to the visualization perspective and visualize the data using an ad hoc chart.
6. In the application toolbar, choose the Designer button to switch to the designer perspective.
7. In the Components panel, select the Algorithms tab.
8. Expand the Time Series subtype.
9. Drag the Triple Exponential Smoothing algorithm onto the analysis editor. When you drag the algorithm component closer to the CSV Reader component in the analysis, the analysis editor automatically detects the association between the two components and suggests a connection. When you drop the algorithm component in the analysis, the connection is automatically created. Alternatively, you can double-click the Triple Exponential Smoothing algorithm. The algorithm component is automatically connected to the CSV Reader component.
10. Select the Triple Exponential Smoothing algorithm. In the properties view, perform the following steps:
   a. Select Forecast as the output mode, because you want to forecast the data.
   b. Select the Airline Passenger column as the dependent column. The algorithm forecasts the data based on the Airline Passenger column.
   c. In the Missing Value field, select Ignore.
   d. In the Period field, select Month(12).
   e. Enter 2000 as the start year.
   f. Enter 1 as the start period. Because the period is Month(12), 1 implies the first month of the year (January).
   g. Enter 12 for the number of periods to predict.
   h. Retain the default values for the advanced properties.
11. Run the analysis. The "Analyze the result" dialog is displayed. Select Yes to switch to the visualization perspective.
12. In the visualization editor, select the Time Series tab. Use the range slider located at the bottom of the graph to select a specific data range.
13. Switch to the designer perspective.
14. In the components panel, select Data Writers.
15. Drag the CSV Writer component onto the analysis editor.
16. Connect it to the Triple Exponential Smoothing algorithm.
17. Select the CSV Writer component.
18. In the properties view, select a CSV file to store the result.
19. Run the analysis. The "Analyze the result" dialog is displayed. Select Yes to switch to the visualization perspective. The fitted and forecast results are stored in the CSV file.
20. Choose File > Save.
21. Enter a name for the analysis.
22. Select the folder to save the analysis and choose OK.
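Outside the application, the same read, forecast, and write flow can be sketched in a few lines of Python. This sketch deliberately swaps in a seasonal-naive forecaster (repeat the value from one season earlier) as a stand-in for Triple Exponential Smoothing, so its numbers will differ from the component's results. The column names `Passengers`, `Year`, `Month`, and `PredictedValues`, and the function names, are assumptions for illustration.

```python
import csv
import io


def seasonal_naive(values, period, horizon):
    """Stand-in forecaster: repeat the value observed one season earlier."""
    if len(values) < period:
        raise ValueError("need at least one full season of observations")
    last_season = values[-period:]
    return [last_season[h % period] for h in range(horizon)]


def forecast_csv(source_text, value_column, start_year, start_period,
                 period=12, horizon=12):
    """Read CSV text, forecast `horizon` periods, and return the result as CSV text."""
    rows = list(csv.DictReader(io.StringIO(source_text)))
    values = [float(row[value_column]) for row in rows]
    predictions = seasonal_naive(values, period, horizon)

    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(["Year", "Month", "PredictedValues"])
    # Zero-based position of the first forecast period, counted from
    # period 1 of the start year.
    first = (start_period - 1) + len(values)
    for h, value in enumerate(predictions):
        index = first + h
        writer.writerow([start_year + index // period, index % period + 1, value])
    return output.getvalue()


# Hypothetical input: 24 monthly observations starting in January 2000.
sample = "Passengers\n" + "\n".join(str(n) for n in range(1, 25)) + "\n"
print(forecast_csv(sample, "Passengers", start_year=2000, start_period=1, horizon=2))
```

The start year, start period, period, and horizon arguments correspond to steps 10d to 10g of the procedure.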

9.2 Retail Store Segmentation


Scenario: The Country Manager of a retail chain (which has 150 stores) has finalized three sales promotion strategies. Data pertaining to stores, such as store location, sales turnover, store size, staff, and profit margin, is stored in a CSV file. The manager wants to segment the 150 stores into three different groups based on sales turnover, profit margin, store size, and staff size so that specific strategies can be applied to each store segment.

This example assumes that the country manager has some basic knowledge of statistical analysis and data mining techniques. Using SAP Predictive Analysis, the manager builds a segmentation analysis by using the R-K-Means algorithm.

To build an analysis for segmentation, perform the following steps:

1. Open a new analysis in SAP Predictive Analysis.
2. From the Components panel, select the CSV Reader component and drag it onto the analysis editor.
3. In the properties view, perform the following steps:
   a. In the General tab, rename the component to Retail Stores.
   b. In the Settings tab, select the Retail Stores.csv file.
4. Preview the data in the CSV file.
5. Run the analysis to retrieve the component results. The "Analyze the result" dialog is displayed. Select Yes to switch to the visualization perspective.
6. In the application toolbar, choose the Designer button to switch to the designer perspective.
7. In the Components panel, select the Algorithms tab.
8. Expand the Clustering subtype.
9. Drag the R-K-Means algorithm onto the analysis editor. When you drag the algorithm component closer to the CSV Reader component in the analysis, the analysis editor automatically detects the association between the two components and suggests a connection. When you drop the algorithm component in the analysis, the connection is automatically created. Alternatively, you can double-click the R-K-Means algorithm. The algorithm component is automatically connected to the CSV Reader component.
10. Select the R-K-Means algorithm. In the properties view, perform the following steps:
   a. Select the columns to be used for cluster analysis.
   b. In the Missing Value field, select Ignore.
   c. In the Number of Clusters field, enter 3.
   d. Retain the default values for the advanced properties.
11. Run the analysis. The "Analyze the result" dialog is displayed. Select Yes to switch to the visualization perspective.
12. In the visualization editor, select the Cluster Graph. The cluster graph contains:
   - The size of the three clusters in the form of a horizontal bar chart. You can change it to a pie chart or a vertical bar chart.
   - The cluster density and the distance between the cluster centroids.
   - The independent variables of each cluster compared to the overall data in the area chart. The variable to be compared can be selected from the Variable drop-down list and the cluster can be selected from the Cluster slider.
   - Cluster comparison using a radar chart. The cluster can be selected from the Cluster slider.
13. In the visualization editor, select Ad hoc chart to view the output in the required format.
14. Switch to the designer perspective.
15. In the components panel, select Data Writers.
16. Drag the CSV Writer component onto the analysis editor.
17. Connect it to the R-K-Means algorithm.
18. Select the CSV Writer component.
19. In the properties view, select a CSV file to store the result.
20. Run the analysis. The "Analyze the result" dialog is displayed. Select Yes to switch to the visualization perspective. The cluster numbers are written to the CSV file.
21. Choose File > Save.
22. Enter a name for the analysis.
23. Select the folder to save the analysis and choose OK.
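To show what segmenting stores into three clusters involves, here is a minimal pure-Python sketch of the k-means procedure. It is not the R function that the R-K-Means component calls: the deterministic farthest-point seeding, the function names, and the sample store data are assumptions for illustration (k-means implementations commonly use random starting centroids instead).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial_centroids(points, k):
    """Deterministic farthest-point seeding (an illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(farthest)
    return centroids


def k_means(points, k=3, max_iterations=100):
    """Return (assignments, centroids) after alternating assign and update steps."""
    centroids = initial_centroids(points, k)
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids


# Hypothetical stores described by two already-scaled measures.
stores = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (9.0, 1.0), (9.1, 1.2), (8.9, 0.9)]
labels, centers = k_means(stores, k=3)
```

Because k-means relies on distances, columns measured on very different scales (for example, sales turnover and staff size) are usually brought to a comparable scale before clustering.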


Glossary Definitions


Metadata Repository (MDR): Abstraction layer on a database schema that contains information about components and analyses.

Execution State Repository (ESR): Stores transient data during the execution of an analysis.

Predictive Analysis Service (PAS): Handles requests from the Predictive Analysis designer client. The Predictive Analysis Service orchestrates between the Metadata Repository, the Execution State Repository, and components to process the client request.

Component: Basic processing unit of SAP Predictive Analysis.

Analysis: Collection of different components linked together in a particular sequence that defines the flow of the data. It helps you to perform predictive analysis.

In-Database (In-DB): In-database (in-DB) is an analysis execution mode in which data processing is performed within the database using data mining capabilities. In this mode, the data is never taken out of the database for processing and hence the processing speed is very high. This mode can be used to process large data sets. SAP HANA supports in-DB data mining through R integration and Predictive Analysis Library (PAL).

In-Process (In-Proc): In-Process is an analysis execution mode in which the data processing is performed by taking data out of the database into the predictive analysis process space.

Data Readers: Components used for reading data from flat files, relational databases, SAP BusinessObjects universes, and in-DB databases.

Data Preparation: Components used to prepare data before processing it. These components can be used for checking data for accuracy and missing fields, merging data from different sources, filtering data based on range values, binning data, and sampling a subset of data from large datasets.

Algorithms: Components used to find meaningful patterns and rules in the data with the help of predictive analysis techniques and predict the future of the business.

Data Writers: Components used for storing results into flat files, relational databases, SAP BusinessObjects universes, and in-DB databases.

Model: Reusable component created by training an algorithm using historical data.


More Information

Information Resource and Location

SAP BusinessObjects product information
  http://www.sap.com

SAP Help Portal
  Navigate to http://help.sap.com/businessobjects and on the "SAP BusinessObjects Overview" side panel click All Products.
  You can access the most up-to-date documentation covering all SAP BusinessObjects products and their deployment at the SAP Help Portal. You can download PDF versions or installable HTML libraries.
  Certain guides are stored on the SAP Service Marketplace and are not available from the SAP Help Portal. These guides are listed on the Help Portal accompanied by a link to the SAP Service Marketplace. Customers with a maintenance agreement have an authorized user ID to access this site. To obtain an ID, contact your customer support representative.

SAP Service Marketplace
  http://service.sap.com/bosap-support > Documentation
  Installation guides: https://service.sap.com/bosap-instguides
  Release notes: http://service.sap.com/releasenotes
  The SAP Service Marketplace stores certain installation guides, upgrade and migration guides, deployment guides, release notes and Supported Platforms documents. Customers with a maintenance agreement have an authorized user ID to access this site. Contact your customer support representative to obtain an ID. If you are redirected to the SAP Service Marketplace from the SAP Help Portal, use the menu in the navigation pane on the left to locate the category containing the documentation you want to access.

Docupedia
  https://cw.sdn.sap.com/cw/community/docupedia
  Docupedia provides additional documentation resources, a collaborative authoring environment, and an interactive feedback channel.

Developer resources
  https://boc.sdn.sap.com/
  https://www.sdn.sap.com/irj/sdn/businessobjects-sdklibrary

SAP BusinessObjects articles on the SAP Community Network
  https://www.sdn.sap.com/irj/boc/businessobjects-articles
  These articles were formerly known as technical papers.

Notes
  https://service.sap.com/notes
  These notes were formerly known as Knowledge Base articles.

Forums on the SAP Community Network
  https://www.sdn.sap.com/irj/scn/forums

Training
  http://www.sap.com/services/education
  From traditional classroom learning to targeted e-learning seminars, we can offer a training package to suit your learning needs and preferred learning style.

Online customer support
  http://service.sap.com/bosap-support
  The SAP Support Portal contains information about Customer Support programs and services. It also has links to a wide range of technical information and downloads. Customers with a maintenance agreement have an authorized user ID to access this site. To obtain an ID, contact your customer support representative.

Consulting
  http://www.sap.com/services/bysubject/businessobjectsconsulting
  Consultants can accompany you from the initial analysis stage to the delivery of your deployment project. Expertise is available in topics such as relational and multidimensional databases, connectivity, database design tools, and customized embedding technology.

