Data mining with SpagoBI, Weka and Oracle.

Stephen Ogutu www.ogutu.org

OGUTU.ORG P.O.Box 8031-00200 Nairobi Kenya

Email: info@ogutu.org Web: www.ogutu.org

Copyright © 2013 by Stephen Ogutu. All rights reserved, including the right to reproduce this book or portions thereof in any form whatsoever. For information, address: Stephen Ogutu, P.O. Box 8031-00200 Nairobi Kenya.

Trademarks: All other trademarks are the property of their respective owners. Stephen Ogutu is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Dedication
This book is dedicated to the memory of my late mother, a great woman, to my beautiful wife Sheila for her unending support, and to my two cute children Emmanuel and Shallin.

Acknowledgments
Special thanks to the SpagoBI community and the OW2 consortium. Thank you all for creating a great product and documenting it effectively.

Introduction.
In most places I have seen, people use business intelligence tools purely for OLAP, creating reports and charts. However, business intelligence tools are much more powerful than this. In this tutorial, we will look at a real-world example of using SpagoBI to discover patterns hidden in a large data set of millions of records containing US census data.

According to Wikipedia, data mining is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

The Problem
We will assume that you are a military recruiter and your problem is to find a list of people who qualify to join the army. You want people who have a certain group of qualities. They must not be children. They should not be earning so much that they are already comfortable and not interested in joining the army. They should not have served in the army before, and so on. To aid in your work, you have been given a large dataset of 2.4 million records from the last census, with ID numbers so you can get the contacts of the people. You want to mine the data using BI so that it groups potential candidates for you and reduces the time taken to recruit. You don’t want to run after people who are not interested in joining the army.

Preparing the data.
To use SpagoBI for data mining, we need to load the data to be analyzed into a relational database. We will be using Oracle since it is the most popular enterprise database and because we want to simulate a real-world scenario as far as possible. So where will we get the data? Download it from http://archive.ics.uci.edu/ml/machine-learning-databases/census1990mld/USCensus1990.data.txt and save it to your computer. It is a large file, 352MB of data. The data is in CSV format, so the first thing we need to do is import it into the Oracle database. If you have no prior experience with Oracle, see my book “SpagoBI, ORACLE and OLAP” available here http://www.scribd.com/doc/133975956/SpagoBI-with-ORACLE-11g

Loading the data.
We need to load the data into Oracle before we can perform data mining with SpagoBI. Start your database and login as user sys.

Create a tablespace that will hold your census data. This tablespace should be 2GB in size but can extend if needed. (A tablespace is a logical container where table data is kept.) I will place the datafile for my tablespace on drive C:\ as that is where I have space. Use the command below.

    CREATE TABLESPACE CENSUS_DATA DATAFILE 'C:\oraclexe\app\oracle\oradata\XE\census_data01.dbf' SIZE 2048M AUTOEXTEND ON;

Create the user spago with password spago. This is the user that will own the census data. Use the command below. Notice that we are granting unlimited usage on the tablespace CENSUS_DATA to spago. That is, user spago can use as much space as he likes on this tablespace.

    CREATE USER SPAGO IDENTIFIED BY SPAGO DEFAULT TABLESPACE CENSUS_DATA QUOTA UNLIMITED ON CENSUS_DATA;

Next, grant the user spago the CREATE SESSION privilege (which allows the user to log in) and the CREATE TABLE privilege (which allows the user to create tables).

    GRANT CREATE SESSION TO SPAGO;
    GRANT CREATE TABLE TO SPAGO;

Confirm that you can log in as user SPAGO.

Next, create the table that will hold the data from the USCensus1990.data.txt file you downloaded previously. Below is the script for creating this table. Save it as C:\US\table.sql
CREATE TABLE CENSUS (caseid INT,dAge INT,dAncstry1 INT,dAncstry2 INT,iAvail INT,iCitizen INT,iClass INT,dDepart INT,iDisabl1 INT,iDisabl2 INT,iEnglish INT,iFeb55 INT,iFertil INT,dHispanic INT,dHour89 INT,dHours INT,iImmigr INT,dIncome1 INT,dIncome2 INT,dIncome3 INT,dIncome4 INT,dIncome5 INT,dIncome6 INT,dIncome7 INT,dIncome8 INT,dIndustry INT,iKorean INT,iLang1 INT,iLooking INT,iMarital INT,iMay75880 INT,iMeans INT,iMilitary INT,iMobility INT,iMobillim INT,dOccup INT,iOthrserv INT,iPerscare INT,dPOB INT,dPoverty INT,dPwgt1 INT,iRagechld INT,dRearning INT,iRelat1 INT,iRelat2 INT,iRemplpar INT,iRiders INT,iRlabor INT,iRownchld INT,dRpincome INT,iRPOB INT,iRrelchld INT,iRspouse INT,iRvetserv INT,iSchool INT,iSept80 INT,iSex INT,iSubfam1 INT,iSubfam2 INT,iTmpabsnt INT,dTravtime INT,iVietnam INT,dWeek89 INT,iWork89 INT,iWorklwk INT,iWWII INT,iYearsch INT,iYearwrk INT,dYrsserv INT);

To create the table, execute the query as shown below when logged in as user spago.

SQL*Loader
The table is now prepared and all that remains is to load the data into it. To load the data, we are going to use an Oracle tool called SQL*Loader, which loads data from a flat file into an Oracle database. SQL*Loader needs a control file that tells it where the data is and which table in the database to load it into. Below is the control file we will use.

    load data
    infile 'C:\US\USCensus1990.data.txt'
    into table CENSUS
    fields terminated by "," optionally enclosed by '"'
    (caseid, dAge, dAncstry1, dAncstry2, iAvail, iCitizen, iClass, dDepart,
     iDisabl1, iDisabl2, iEnglish, iFeb55, iFertil, dHispanic, dHour89, dHours,
     iImmigr, dIncome1, dIncome2, dIncome3, dIncome4, dIncome5, dIncome6,
     dIncome7, dIncome8, dIndustry, iKorean, iLang1, iLooking, iMarital,
     iMay75880, iMeans, iMilitary, iMobility, iMobillim, dOccup, iOthrserv,
     iPerscare, dPOB, dPoverty, dPwgt1, iRagechld, dRearning, iRelat1, iRelat2,
     iRemplpar, iRiders, iRlabor, iRownchld, dRpincome, iRPOB, iRrelchld,
     iRspouse, iRvetserv, iSchool, iSept80, iSex, iSubfam1, iSubfam2,
     iTmpabsnt, dTravtime, iVietnam, dWeek89, iWork89, iWorklwk, iWWII,
     iYearsch, iYearwrk, dYrsserv)

The line infile 'C:\US\USCensus1990.data.txt' tells SQL*Loader that this file is the source of the data. The data will be loaded into the table CENSUS, and the fields are separated by commas. The names in brackets are the columns of the table. Save the control file as C:\US\control.ctl. We are now ready to load the data. Launch the command prompt and type the command below.
sqlldr SPAGO/SPAGO control=C:\US\control.ctl

This means that we are launching the SQL*Loader utility and connecting to the database as user SPAGO (which we created previously) with password SPAGO. It will use the control file in the location specified.

When you hit the Enter key, it will start inserting the data into the database. This might take a while depending on the speed of your machine. On my laptop it took less than 5 minutes to load the 2.45 million records.
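
Because every column of the CENSUS table is an INT, a row with the wrong number of fields or a non-numeric value cannot be loaded. As a rough sketch (not part of the SQL*Loader workflow itself), the Python function below counts such rows so you know what to expect. The function name check_rows and the tiny three-column sample are illustrative assumptions; whether your downloaded file begins with a header line is something to check rather than assume.

```python
import csv
import io

EXPECTED_FIELDS = 69  # caseid plus the 68 attribute columns of the CENSUS table


def check_rows(lines, expected_fields=EXPECTED_FIELDS):
    """Count rows that would fit a table made only of INT columns.

    Returns (good, bad). A row is "bad" if it has the wrong number of
    fields or contains a value that is not an integer.
    """
    good = bad = 0
    for row in csv.reader(lines):
        if len(row) != expected_fields:
            bad += 1
            continue
        try:
            for value in row:
                int(value)
        except ValueError:
            bad += 1
        else:
            good += 1
    return good, bad


# Tiny in-memory sample with 3 fields per row instead of 69:
# a header line, two good rows and one short row.
sample = io.StringIO("caseid,dAge,iSex\n10000,5,0\n10001,6,1\n10002,3\n")
print(check_rows(sample, expected_fields=3))  # (2, 2)
```

To run it against the real download, open C:\US\USCensus1990.data.txt with open(path, newline="") and pass the file object to check_rows with the default of 69 fields.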

WEKA
SpagoBI uses software called Weka (Waikato Environment for Knowledge Analysis) which is a collection of machine learning algorithms developed at the University of Waikato, New Zealand. Though Weka supports many algorithms, only cluster analysis is supported in SpagoBI and therefore we will limit ourselves to clustering for the remainder of this document.

Cluster analysis
Clustering is a method used to discover natural groups in data without prior knowledge of the groups. Suppose you have the database of an insurance company and you run a clustering algorithm against it. What might you discover? It might give you groups of policy holders with high claim costs whom you can blacklist from your firm, and groups with low claim costs whom you can do business with. This is an example of how data mining can be used in the real world.

Marketers also use clustering algorithms to discover groups in their customer data whom they can target with specific products. In the telco sector, you might discover that young people call mostly at a particular time of the day, or use more of a certain service, e.g. internet data as opposed to voice, and you can use this information to target them with offers for internet data bundles. Clustering has many other uses in marketing, image processing, medicine etc.

Looking at the census data that is now in our database, it makes no sense at all, but once we start analyzing it we might discover interesting details. The particular algorithm we will be using is called the k-means algorithm.
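
To make the idea concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) in Python. It is a simplified illustration, not Weka's SimpleKMeans, which has its own initialisation and distance handling. The function name kmeans, the starting centroids and the six sample points are assumptions made up for the example.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on tuples of numbers.

    `centroids` holds the starting centres; returns (centroids, labels).
    """
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # nothing moved, so the algorithm has converged
        centroids = new_centroids
    return centroids, labels


# Made-up sample: (age code, gender code) for six people.
points = [(1, 1), (1, 0), (2, 1), (6, 0), (7, 1), (7, 0)]
centroids, labels = kmeans(points, [(1, 1), (7, 0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The first three people end up in one group and the last three in another, purely because their values are close together. Nobody told the algorithm what the groups mean.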

Downloading Weka.
Download Weka 3.6.1 from http://sourceforge.net/projects/weka/files/?source=navbar and install it on your computer. Next, add the Oracle JDBC library to your computer's class path so that Weka will be able to find it when connecting to the Oracle database. The Oracle library is at C:\oraclexe\app\oracle\product\11.2.0\server\jdbc\lib\ojdbc6.jar. This may differ if you installed Express Edition in a different path.

JDBC Driver
Now Weka needs to know where the Oracle JDBC driver is. We tell it by modifying the file DatabaseUtils.props, which can be found in the jar file C:\Program Files\Weka-3-6\weka.jar. You have three options.
1. Modify the file DatabaseUtils.props to include the Oracle settings. Navigate to the location where you installed Weka, e.g. C:\Program Files\Weka-3-6, right click on weka.jar and open it using WinRAR.

Navigate to experiment/DatabaseUtils.props and extract the file DatabaseUtils.props. Change the line

    jdbcURL=jdbc:idb=experiments.prp

to

    jdbcURL=jdbc:oracle:thin:@localhost:1521:XE

and change the line

    jdbcDriver=RmiJdbc.RJDriver,jdbc.idbDriver,org.gjt.mm.mysql.Driver,com.mckoi.JDBCDriver,org.hsqldb.jdbcDriver

to

    jdbcDriver=oracle.jdbc.driver.OracleDriver

then put the file back into the jar file. The file should now look like this. Notice the highlighted entry.

2. The other easy option is to delete the file DatabaseUtils.props and rename the file DatabaseUtils.props.oracle to DatabaseUtils.props in the jar file.

3. The last and recommended option, which we will be using, is to extract the file DatabaseUtils.props.oracle and copy it to your home directory under the name DatabaseUtils.props, e.g. “C:\Documents and Settings\Stephen Ogutu\DatabaseUtils.props”, then modify it as follows.
a. Change the database URL to jdbcURL=jdbc:oracle:thin:@localhost:1521:XE if you installed Oracle Express Edition.
b. Change the JDBC driver to jdbcDriver=oracle.jdbc.driver.OracleDriver

Weka is now ready to connect to Oracle.

A simple analysis with weka.
Now start Weka and click on explorer.

To select the source of the data we need to analyze, click on the Open DB icon and in the URL field enter jdbc:oracle:thin:@localhost:1521:XE as shown below.

Click on User and enter the following details.

After entering the password, click on connect. The info box should tell you “connecting to: jdbc:oracle:thin:@localhost:1521:XE = true”. Let us start with 20,000 records since most laptops will not handle the 2.4 million records in one go. First let us see if there is any relation between age, marital status, military service, poverty level and gender in the census data. Enter a query that selects those columns, such as the one below, and click on execute.

    select CASEID,DAGE,IMARITAL,IMILITARY,DPOVERTY,ISEX from census where rownum<=20000

When you click OK, you might get this error if you used option one to change the DatabaseUtils.props file.

This is because Weka does not know about the NUMBER data type returned by the JDBC driver, so we need to map it to a type that Weka understands. Since we know the values are integers, we will map them to the Java integer type (represented by the number 5 in the file DatabaseUtils.props). Add the line below to your DatabaseUtils.props and save.

    NUMBER=5

Your file should be similar to this.

Try to run the query in WEKA again. We should now get the screen below.
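
If you prefer to script the changes to DatabaseUtils.props instead of editing the file by hand, the idea can be sketched in Python as below. This is only an illustration under assumptions: set_props is a hypothetical helper, the three-line original text is a shortened stand-in for the real file, and the sketch does not handle property values that continue over several lines.

```python
def set_props(text, updates):
    """Return properties text with each key in `updates` set to its value.

    Existing key=value lines are replaced in place, missing keys are
    appended at the end, and comment lines (starting with #) are kept.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and not line.lstrip().startswith("#") and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Shortened stand-in for the original file contents.
original = (
    "# Database settings\n"
    "jdbcDriver=org.hsqldb.jdbcDriver\n"
    "jdbcURL=jdbc:idb=experiments.prp\n"
)
patched = set_props(original, {
    "jdbcDriver": "oracle.jdbc.driver.OracleDriver",
    "jdbcURL": "jdbc:oracle:thin:@localhost:1521:XE",
    "NUMBER": "5",
})
print(patched)
```

The two existing settings are overwritten where they stand and NUMBER=5 is added as a new last line.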

Understanding the output.
Under Attributes, click on ISEX as shown below. This attribute (column) holds the gender. Remember that the k-means algorithm we will be using only accepts numbers, so we need a way of converting the gender values Male and Female to numbers. This has been achieved by using the number 1 for females and the number 0 for males.

When you click on the ISEX attribute (yellow arrow in the above diagram) and look at the output at the green arrow (Arrow II), you see that the minimum is 0 (males) and the maximum is 1 (females). In our data sample the mean is 0.517, which for a 0/1 attribute is the share of females. In other words, the split between males and females is almost even, with females slightly ahead, which is the norm in most populations. From the graph, we see that males, marked by the red arrow (Arrow III), total 9658 in our sample, and females, marked by the blue arrow (Arrow IV), total 10342. Notice that in the graph there is nothing between 0 and 1 since we either have males or females. Let’s look at another attribute, military service. Click on the IMILITARY attribute.

We see that we have a minimum of 0 and a maximum of 4 with a mean of 2.801. The description of this column, found at http://archive.ics.uci.edu/ml/machine-learning-databases/census1990mld/USCensus1990raw.attributes.txt, is copied below.

It therefore means that from our sample of 20,000 rows (instances), there are 4708 people who have not reached the military age so they are represented by zero. See the blue arrow on the image below.

We have 143 people on active duty (red arrow); remember that 1 represents active duty. There are 2214 people who were on active duty in the past (green arrow), 288 who serve in the National Guard (black arrow) and 12647, the majority, who never served in the armed forces (yellow arrow). You should now be comfortable reading the data. Let us now run a clustering algorithm on it.
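
As a quick cross-check, the means that Weka reports can be recomputed from the counts quoted above. The short Python sketch below does that arithmetic; the helper name mean_from_counts is an illustrative assumption, while the counts are the ones given in the text.

```python
def mean_from_counts(counts):
    """Mean of a coded attribute given {code: number_of_rows}."""
    total = sum(counts.values())
    return sum(code * n for code, n in counts.items()) / total


# Counts quoted in the text for the 20,000-row sample.
isex = {0: 9658, 1: 10342}
imilitary = {0: 4708, 1: 143, 2: 2214, 3: 288, 4: 12647}

print(sum(isex.values()), sum(imilitary.values()))  # 20000 20000
print(round(mean_from_counts(isex), 3))             # 0.517
print(round(mean_from_counts(imilitary), 3))        # 2.801
```

Both sets of counts add up to the 20,000 rows in the sample, and the recomputed means agree with the values of 0.517 and 2.801 shown by Weka.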

K-Means Algorithm
Click on Cluster, click on Choose and select SimpleKMeans.

Next to the Choose button, click on the bold text SimpleKMeans and change the number of clusters to 5.

Click on ignore attributes and select CASEID. We will not be using this attribute (column) in the clustering because it is merely used to identify a row or the instance.

Under cluster mode, select “Use training set” then click on start.

From the results above, we can see that the data sample was 20000 rows (black arrow) and that there were 6 attributes (columns) (yellow arrow), of which CASEID was ignored. The data has been partitioned into 5 groups with similar characteristics. Let us look at cluster 0 (green box).
1. It has a total population of 2370 people.
2. The value of the DAGE column is 1.1093. The age function found at http://archive.ics.uci.edu/ml/machine-learning-databases/census1990mld/USCensus1990.mapping.sql is copied below.

We can deduce that people in this cluster are less than 13 years old, since 1.1093 rounded to the nearest integer is 1.
3. The value of the IMARITAL attribute is 3.9958, which rounded to the nearest integer is 4. The description at http://archive.ics.uci.edu/ml/machine-learning-databases/census1990mld/USCensus1990raw.attributes.txt is copied below.

It means that those in this cluster have never married, since they are less than 15 years old.
4. The value of IMILITARY is 0.0025, which rounded to the nearest integer is 0. The description at http://archive.ics.uci.edu/ml/machine-learning-databases/census1990mld/USCensus1990raw.attributes.txt is copied below.

It means they are under age and so have never been in military service.
5. The value of the attribute DPOVERTY is 1.8342, which rounded to the nearest integer is 2 and means not applicable.
6. Lastly, the ISEX attribute is 1, which means they are females. In summary, this is a cluster of 2,370 underage girls who have never been in the army. If you were employed by the army to recruit soldiers, would you consider members of this cluster?
7. As an exercise, assume you are a recruiting agent for the army and you have this data. Find a cluster of people who would make potential candidates.
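
The reading of cluster 0 above follows one recipe: round each centroid value to the nearest integer and look that code up in the attribute description. A minimal Python sketch of the recipe is shown below. The CODES table lists only the codes that this text spells out, the name describe is a hypothetical helper, and rounding a mean is only a rough guide to what a typical member of a cluster looks like.

```python
# Meanings of the codes mentioned in the text; anything else is left
# for the reader to look up in the attribute description file.
CODES = {
    "DAGE": {1: "under 13 years old"},
    "IMARITAL": {4: "never married, or under 15 years old"},
    "IMILITARY": {
        0: "not applicable (below military age)",
        1: "on active duty now",
        2: "on active duty in the past",
        3: "National Guard",
        4: "never served",
    },
    "ISEX": {0: "male", 1: "female"},
}


def describe(centroid):
    """Map each centroid value to (rounded code, meaning)."""
    result = {}
    for name, value in centroid.items():
        code = round(value)
        meaning = CODES.get(name, {}).get(code, "see the attribute description")
        result[name] = (code, meaning)
    return result


# Centroid of cluster 0 as reported by Weka in the text.
cluster0 = {"DAGE": 1.1093, "IMARITAL": 3.9958, "IMILITARY": 0.0025,
            "DPOVERTY": 1.8342, "ISEX": 1}
for name, (code, meaning) in describe(cluster0).items():
    print(name, code, meaning)
```

The same function can be pointed at the centroids of the other clusters when working through the exercise in step 7, after extending CODES with the remaining codes from the attribute description.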

References:
https://list.scms.waikato.ac.nz/pipermail/wekalist/2005-April/030088.html
http://archive.ics.uci.edu/ml/machine-learning-databases/census1990mld/USCensus1990raw.attributes.txt
http://www.kdd.org/explorations/issues/11-1-2009-07/p2V11n1.pdf
http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html

Enter SpagoBI
We will be using SpagoBI to arrive at the same values we got with the Weka Explorer. The advantage of SpagoBI is that we will be able to store the data so that we can analyze it with other tools available in SpagoBI such as Qbe, charts and OLAP. If you have never used SpagoBI before, see my introductory lessons to SpagoBI here http://spagolabs.wordpress.com/2013/04/25/7/ or write to me at xogutu@gmail.com for the softcopy of the SpagoBI book. SpagoBI needs an XML KnowledgeFlow layout (kfml) file which defines what we have just done above in Weka. To create a kfml file, start Weka and choose KnowledgeFlow.

Knowledge flow does a similar thing to Explorer, except that we put the items on a canvas and connect them so that we can visualize how the data flows. Here are the steps.
1. Under the DataSources tab, select DatabaseLoader. The arrow will change to a cross. Click on the knowledge flow layout with the cross. It will deposit the DatabaseLoader icon on the Layout.

2. Click on the Evaluation tab, select TrainingSetMaker and deposit it on the Layout as shown.

3. Under Filters, select AddCluster and deposit it into the Layout.

4. Lastly under DataSinks tab, select DatabaseSaver and deposit it into the Layout.

5. Double click on Database Loader and enter the data as shown.

The query should be similar to the one below.

    select CASEID,DAGE,IMARITAL,IMILITARY,DPOVERTY,ISEX from census where rownum<=20000

6. Right click on DatabaseLoader and select dataSet.

7. The pointer will change into a rubber band. Click on TrainingSetMaker. The two will now be linked.

8. Link TrainingSetMaker with AddCluster by right clicking on TrainingSetMaker and selecting Training Set.

9. Double click on AddCluster. Click on Choose, then SimpleKMeans.

10. Next to Choose, click on SimpleKMeans and for the number of clusters enter 5.

11. For IgnoredAttributeIndices enter 1 (this is CASEID, since it is the first attribute or column).

12. Link AddCluster to DatabaseSaver.

13. Double click on DatabaseSaver and enter the details below.

This means the data will be saved into the database table RECRUITS_TRAINING_1_OF_1, which will be created automatically.
14. The final diagram should look like the one below.

This means that we will take data from the database, pass it to TrainingSetMaker, then pass it to the clustering algorithm, which will cluster it and save the results in the database.
15. Click on the Save icon to save your layout.
a. Under Files of Type, select KFML.

b. For file name enter Recruit.

Click on Save.
16. We are done with Weka. You can close it.
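
Conceptually, the flow we have just built reads rows, leaves CASEID out of the distance calculation, and writes every row back with one extra column naming its cluster. The Python sketch below illustrates that labelling step only. It is an assumption-laden illustration, not Weka's AddCluster code: the function name add_cluster, the two made-up rows and centroids, and the label format cluster1, cluster2 and so on are all chosen for the example.

```python
def add_cluster(rows, centroids, ignored=(0,)):
    """Append a cluster label to every row, skipping the ignored columns.

    `ignored` uses 0-based positions, so 0 is the first column (CASEID),
    whereas the Weka dialog above counts columns from 1.
    """
    labelled = []
    for row in rows:
        features = [v for i, v in enumerate(row) if i not in ignored]
        nearest = min(
            range(len(centroids)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(features, centroids[c])),
        )
        labelled.append(tuple(row) + (f"cluster{nearest + 1}",))
    return labelled


# Made-up rows: (CASEID, DAGE, IMARITAL, IMILITARY, DPOVERTY, ISEX).
rows = [(10000, 1, 4, 0, 2, 1), (10001, 5, 0, 4, 2, 0)]
# Made-up centroids over the five non-ignored columns.
centroids = [(1.1, 4.0, 0.0, 1.8, 1.0), (5.0, 0.5, 3.9, 2.0, 0.0)]

for row in add_cluster(rows, centroids):
    print(row)
# (10000, 1, 4, 0, 2, 1, 'cluster1')
# (10001, 5, 0, 4, 2, 0, 'cluster2')
```

Keeping CASEID in the output but out of the distance is the point of the ignored-attribute setting: the identifier lets us find the person later, but it says nothing about how similar two people are.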

SpagoBI datamining document.
We will need to create a datamining document in SpagoBI to perform the clustering. Login to SpagoBI (I am using version 3.3) as user biadmin.

Under Resources, select Data source then click on create button. Enter the details as shown.

Click on Test before Save button. It should say “Connection Test OK”.

We have now created a connection to the Oracle database from SpagoBI. Next click on Analytical Document -> Document Development.

Click on Insert.
1. For Label enter Recruits.
2. For Name enter Recruits.
3. For Description enter Datamining recruits data.
4. For Type enter Datamining model.
5. For Engine enter Weka engine.
6. For Data Source enter SpagoBIOracle.
7. For State enter Released.

9. For Template enter the KFML file we created in Weka (Recruit.kfml).

10. Under Show document templates select Data Mining folder shown below.

11. Click on Save.
12. Click on Home Page, select the Business Analysis folder, then Data Mining. Click on the Recruits document.

13. The document will run successfully as shown below.

14. But when you look at the Tomcat log, you will see it inserting the cluster output rows in the database.

This feature is not available in the original Weka engine that comes with SpagoBI; I have added it as a way of debugging. You can get the modified Weka engine for a small fee when you write to me. Note that if you get an error here it might be for one of two reasons.
a. Your SpagoBIWekaEngine has not been configured properly, particularly the file C:\Downloads\All-In-One-SpagoBI-3.3-01242012\SpagoBI-Server-3.3\webapps\SpagoBIWekaEngine\WEB-INF\classes\database.properties, which should be set up for the Oracle connection as shown below.

b. The other problem might be that Oracle is refusing to create the table because one of the columns uses the reserved word CLUSTER as its name. The offending code is found in the file DatabaseSaver.class in the path C:\Downloads\All-In-One-SpagoBI-3.3-01242012\SpagoBI-Server-3.3\webapps\SpagoBIWekaEngine\WEB-INF\classes\weka\core\converters. I had to add the following lines to make it append an underscore if the column name was CLUSTER, which is a reserved word in Oracle.

How did I know this? Well Oracle allows you to trace a session and I saw from the Oracle trace files that this was the problem. See the trace file below.

If hunting for errors in trace files is not your cup of coffee, or if you have no time for it and just need a functioning Weka engine to use with Oracle, mail me for a copy. It will cost you a small fee to compensate for my time.
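
The workaround described in item b, appending an underscore to a column whose name is a reserved word, can be illustrated in a few lines of Python. This is not the change made inside the Java class DatabaseSaver; it is only a sketch of the same idea. The helper name safe_column_name is hypothetical, and the RESERVED set contains only the one word named in the text, although Oracle reserves many more.

```python
RESERVED = {"CLUSTER"}  # only the word named in the text


def safe_column_name(name, reserved=RESERVED):
    """Append an underscore when a column name collides with a reserved word."""
    return name + "_" if name.upper() in reserved else name


columns = ["CASEID", "DAGE", "IMARITAL", "IMILITARY", "DPOVERTY", "ISEX", "cluster"]
print([safe_column_name(c) for c in columns])
# ['CASEID', 'DAGE', 'IMARITAL', 'IMILITARY', 'DPOVERTY', 'ISEX', 'cluster_']
```

The comparison is done in upper case because unquoted identifiers in Oracle are not case sensitive, so cluster and CLUSTER collide with the reserved word in the same way.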

15. Assuming you had no issues, click on User menu events. You will see that it says “Execution of Weka flow successfully terminated!”

16. We can confirm from Oracle that the table RECRUITS_TRAINING_1_OF_1 was created by SpagoBI, clustering done and data inserted.

17. On its own, that data makes no sense at all, so we will need another SpagoBI document to analyze it. That is where Qbe comes in.

Qbe Document.
We will create a Qbe document to help us freely query the clustered data and produce reports from it. Create a datamart using SpagoBIMeta version 3.3. Steps:
1. Create a new General project.

2. Call the project Recruit. Under the Recruit project, create a new SpagoBI model.

3. Name the model RecruitModel and the file RecruitFile.

4. Under Connection, select New Oracle and for schema select SPAGO.

5. Select Physical Model Tables.

6. Select business model class.

7. You will have the following screen.

8. Right click on Business Model, click Create and select Datamart.

9. Select the location.

10. Navigate to the location C:\Downloads\SpagoBIMeta_3.3_Win_20111222\SpagoBIMeta_3.3_win_20111220\workspace\RecruitModel\dist and copy the files datamart.jar and cfields_meta.xml to C:\Downloads\All-In-One-SpagoBI-3.3-01242012\SpagoBI-Server-3.3\resources\qbe\datamarts\RecruitModel

11. Now we need to tell SpagoBI where our datamart is, i.e. the RecruitModel. We do this by creating a simple XML file like the one below.

12. Save the file as C:\DataMart.xml
13. We are done creating the datamart. Next, log in to SpagoBI and create the Qbe document.
14. Click on Analytical documents -> Documents management.
15. Click on Create.
16. Enter data as follows.
a. For Label enter RecruitResults.
b. For Name enter RecruitResults.
c. For Type enter Datamart model.
d. For Engine enter QbeEngine.
e. For State enter Released.
f. For Template choose DataMart.xml

h. Under Show document templates select Business Analysis and Data Mining.

i. Save the document.
j. Under the Business analysis folder, select the Data Mining folder and select Recruit Results.

k. Select the attributes DAGE, IMARITAL, IMILITARY, DPOVERTY, ISEX and CLUSTER from Schema and drop them onto the Query Editor.

l. Under alias, rename the fields as shown: DAGE to AGE, IMARITAL to MARITAL STATUS, IMILITARY to MILITARY SERVICE, DPOVERTY to POVERTY, ISEX to GENDER and CLUSTER to CLUSTER.

m. Now we need to get the average for all fields and group by CLUSTER.

n. Next click on Execute Query.

o. Now let us look at the results of the clusters.

p. We have forgotten to add a count of the people in any given cluster. Go back to Query and add the attribute CASEID and instead of Average function, select count.

q. Run the query again.

Let us look at the results. Cluster 1 (the group that the Weka Explorer output listed as cluster 0) is made up of females (Gender = 1) who have never seen military service (0 means underage for military service), have never married (Marital status = 4) and are below 13 years old (AGE = 1.11). The total number in this group is 2370, so this group is of no interest to a military recruiter. Notice that the results are exactly the same as the ones we got by using the Weka Explorer, shown below.
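
Behind the Qbe screens, the report we built is an aggregate query: an average of each attribute and a count of CASEID, grouped by cluster. The Python sketch below reproduces that shape of query on a tiny in-memory SQLite table so it can be run without Oracle or SpagoBI. Everything in it is an illustrative assumption: the table name recruits, the column name cluster_label, and the five made-up rows are not the real clustered data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recruits (caseid INT, dage INT, imarital INT, "
    "imilitary INT, dpoverty INT, isex INT, cluster_label TEXT)"
)
# Five made-up rows in two clusters.
conn.executemany(
    "INSERT INTO recruits VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 4, 0, 2, 1, "cluster1"),
        (2, 1, 4, 0, 2, 1, "cluster1"),
        (3, 2, 4, 0, 1, 1, "cluster1"),
        (4, 5, 0, 4, 2, 0, "cluster2"),
        (5, 6, 0, 2, 2, 0, "cluster2"),
    ],
)

# Average of every attribute plus a head count, one row per cluster.
summary = conn.execute(
    "SELECT cluster_label, COUNT(caseid), AVG(dage), AVG(imarital), "
    "AVG(imilitary), AVG(dpoverty), AVG(isex) "
    "FROM recruits GROUP BY cluster_label ORDER BY cluster_label"
).fetchall()

for row in summary:
    print(row)
```

Each output row plays the role of one line in the Qbe result: the cluster, how many people are in it, and the average code of each attribute, which can then be rounded and interpreted as before.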

Now you know enough to be productive in data mining with SpagoBI.
