IN 961: Big Data Trial Sandbox for Hortonworks Install and Config
© 2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any
means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other
company and product names may be trade names or trademarks of their respective owners and/or copyrighted
materials of such owners.
Abstract
This document describes how to use Informatica Big Data Edition Sandbox for Hortonworks to run sample mappings
based on common big data use cases. After you understand the sample big data use cases, you can create and run
your own big data mappings.
Supported Versions
Informatica 9.6.1 HotFix 1
Table of Contents
Installation and Configuration Overview
Step 1. Download the Software
Download and Install VMware Player
Register at Informatica Marketplace
Download the Big Data Trial Sandbox for Hortonworks Files
Step 2. Start the Big Data Trial Sandbox for Hortonworks Virtual Machine
Step 3. Configure and Install the Big Data Trial Sandbox for Hortonworks Client
Configure the Domain Properties on the Windows Machine
Configure a Static IP Address on the Windows Machine
Install the Big Data Trial Sandbox for Hortonworks Client
Step 4. Access the Big Data Trial Sandbox for Hortonworks Sandbox
Apache Ambari
Informatica Administrator
Informatica Developer
Big Data Trial Sandbox for Hortonworks Samples
Running Common Tutorial Mappings on Hadoop
Performing Data Discovery on Hadoop
Performing Data Warehouse Optimization
Processing Complex Files
Reading and Parsing Complex Files
Writing to Complex Files
Working with NoSQL Databases
HBase
Troubleshooting
The Big Data Trial Sandbox for Hortonworks virtual machine has the following components:
Note: The Informatica Big Data Trial Sandbox for Hortonworks installation and configuration document is available on
the desktop of the virtual machine.
The Big Data Trial Sandbox for Hortonworks client installs the libraries and binaries required for the Informatica
Developer (Developer tool) client.
The software available for download at the referenced links belongs to a third party or third parties, not Informatica
Corporation. The download links are subject to the possibility of errors, omissions or change. Informatica assumes no
responsibility for such links and/or such software, disclaims all warranties, either express or implied, including but not
limited to, implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and
disclaims all liability relating thereto.
You must have at least 10 GB of RAM and 30 GB of disk space available on the machine on which you download and
install VMware Player.
When you register with Informatica Marketplace, you get a free 60-day trial to use Big Data Trial Sandbox for
Hortonworks.
Download the Big Data Trial Sandbox for Hortonworks Files
After you log in to Informatica Marketplace, download the Big Data Trial Sandbox for Hortonworks virtual machine and
client.
Includes the Big Data Trial Sandbox for Hortonworks virtual machine. Download the file to the machine on
which VMware Player is installed.
961_BigDataTrial_Client_Installer_win32_x86.zip
Includes the compressed Big Data Trial Sandbox for Hortonworks client. Download the file to an Informatica
client installation directory on a Microsoft Windows-32 machine.
Extract the files in the client zip file to a directory on your local machine. For example, extract the files to the C:\ drive on your machine.
Step 3. Configure and Install the Big Data Trial Sandbox for
Hortonworks Client
To communicate with the virtual machine before you run the client, you must configure the domain properties for the
Big Data Trial Sandbox for Hortonworks client installation.
Optionally, to avoid updating the IP address of the virtual machine each time it changes, you can configure a static IP
address for the virtual machine.
Then, you can run the silent installer to install the Big Data Trial Sandbox for Hortonworks client.
1. Click Applications > System Tools > Terminal to open the terminal to run commands.
2. Run the ifconfig command to find the IP address of the virtual machine.
The ifconfig command returns all interfaces on the virtual machine. Select the eth interface to get the IP address value.
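If you prefer to work from the terminal, the address can also be pulled out of the ifconfig output with sed. This is a minimal sketch that assumes the older "inet addr:" output format; the sample line below is illustrative, so substitute a line from your own ifconfig output:

```shell
# Extract the IPv4 address from an "inet addr:" line of ifconfig output.
# The sample line mimics older Linux ifconfig output; replace it with your own.
sample="          inet addr:192.168.159.159  Bcast:192.168.159.255  Mask:255.255.255.0"
ip=$(echo "$sample" | sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p')
echo "$ip"
```

On the sandbox, you would pipe the output of ifconfig itself through the same sed command instead of using a sample string.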
The following image shows the ifconfig command with the return value for inet addr highlighted with a red
arrow:
3. Add the IP address and the default hostname hdp-bde-demo to the hosts file on the Windows machine on
which you install the Developer tool.
The hosts file can be located in the following location: C:\Windows\System32\drivers\etc\hosts. Add the
following line to the hosts file: <IP address> <hostname>. For example, add the following line:
192.168.159.159 hdp-bde-demo
1. Click Applications > System Tools > Terminal to open the terminal to run commands.
2. Run the ifconfig command to find the IP address and hardware ethernet address of the virtual machine.
The ifconfig command returns all interfaces on the virtual machine. Select the eth interface to get the hardware ethernet address value.
The following image shows the ifconfig command with the return values for inet addr and HWaddr outlined
with red boxes:
3. Edit vmnetdhcp.conf to add the values for host name, IP address, and hardware ethernet address.
vmnetdhcp.conf is located in the following directory: C:\ProgramData\VMware.
Add the following entry before the #END tag at the end of the file:
host <hostname> {
hardware ethernet <your HWaddr>;
fixed-address <your inet addr>;
}
The following sample code shows how to set a static IP address:
host hdp-bde-demo {
hardware ethernet 00:0C:29:10:F9:4C;
fixed-address 192.168.159.159;
}
4. Add the IP address and the default hostname hdp-bde-demo to the hosts file on the Windows machine on
which you install the Developer tool.
The hosts file can be located in the following location: C:\Windows\System32\drivers\etc\hosts. Add the
following line to the hosts file: <IP address> <hostname>. For example, add the following line:
192.168.159.159 hdp-bde-demo
5. Shut down the virtual machine.
6. Restart the host machine and virtual machine.
The silent installer runs in the background. The process can take several minutes.
The command window displays a message that indicates that the installation is complete.
You can find the Informatica_Version_Client_InstallLog.log file in the following directory: C:\Informatica\9.6.1_BDE_Trial\
After the installation process is complete, you can launch the Big Data Trial Sandbox for Hortonworks Client.
You can log in to Informatica Administrator (the Administrator tool) to monitor Informatica services and the status of
mapping jobs.
You can log in to the Developer tool to run the sample mappings based on common big data use cases. You can
create your own mappings and run the mappings from the Developer tool.
For more information on how to run mappings in the Developer tool, see the Informatica Big Data Trial Sandbox for
Hortonworks User Guide.
Apache Ambari
You can log in to Ambari from the following URL: http://hdp-bde-demo:8080/#/login.
Password: admin
Informatica Administrator
You can access the Administrator tool from the following URL: http://hdp-bde-demo:6005
Password: Administrator
Informatica Developer
You can start the Developer tool client from the Windows Start menu.
Password: Administrator
The Big Data Trial Sandbox for Hortonworks includes samples for the following use cases:
After you run the mappings in the Developer tool, you can monitor the mapping jobs in the Administrator tool.
m_DataLoad_1
m_DataLoad_1 loads data from the READ_WordFile1 flat file from your machine to the
WRITE_HDFSWordFile1 flat file on HDFS.
m_DataLoad_2
m_DataLoad_2 loads data from the READ_WordFile2 flat file from your machine to the
WRITE_HDFSWordFile2 file on HDFS.
The following image shows the mapping m_DataLoad_2:
m_WordCount
m_WordCount reads two source files from HDFS, parses the data, and writes the output to a flat file on HDFS.
The DataDiscovery project in the Developer tool includes the following samples that you can use to perform data
discovery on Hadoop:
Use the samples to understand how to perform data discovery on Hadoop. You want to discover the quality of the
source customer data in the CustomerData flat file before you use the customer data as a source in a mapping. You
should verify the quality of the customer data to determine whether the data is ready for processing. You can run the
Profile_CustomerData profile based on the source data to determine the characteristics of the customer data.
The profile determines the characteristics of columns in a data source, such as value frequencies, unique values, null
values, patterns, and statistics.
The number of unique and null values in each column, expressed as a number and percentage.
The patterns of data in each column and the frequencies with which these values occur.
Statistics about the column values, such as the maximum value length, minimum value length, first value, and
last value in each column.
The data types of the values in each column.
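The Profile_CustomerData results themselves come from the Developer tool, but the kind of per-column statistics a profile reports can be sketched with standard command-line tools. The sample column values below are made up for illustration and are not taken from the CustomerData file:

```shell
# Count null (empty) and unique values in a single column of data,
# the way a profile reports null and unique value counts per column.
# The column values are made-up sample data.
printf 'US\nUK\n\nUS\nDE\n' > /tmp/country_col.txt
nulls=$(grep -c '^$' /tmp/country_col.txt)
unique=$(sort -u /tmp/country_col.txt | grep -vc '^$')
echo "nulls=$nulls unique=$unique"
```

A real profile computes these statistics, plus patterns and value frequencies, across every column of the source at once.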
The following figure shows the profile results that you can analyze to determine the characteristics of the customer
data:
The DataWarehouseOptimization project in the Developer tool includes samples that you can use to perform data
warehouse optimization on Hadoop.
Use the samples to analyze customer portfolios by processing the records that have changed in a 24-hour period.
You can offload the data on Hadoop, find the customer records that have been inserted, deleted, and updated in the
last 24 hours, and then update those records in your data warehouse. You can capture these changes even if the number of columns changes or the keys change in the source files.
To capture the changes, use the Data Warehouse Optimization workflow. The workflow contains mappings that move
the data from local flat files to HDFS, identify the changes, and then load the final output to flat files.
The following image shows the sample Data Warehouse Optimization workflow:
To run the workflow from the command line, enter the following command:
./infacmd.sh wfs startWorkflow -dn infa_domain -sn infa_dis -un Administrator -pd
Administrator -Application App_DataWarehouseOptimization -wf wf_DataWarehouseOptimization
To run the mappings in the workflow, open a mapping and right-click the mapping to run the mapping.
Mapping_Day2
The workflow object Mapping_Day2 reads customer data from flat files in a local file system and writes to an HDFS target for the next 24-hour period.
m_CDC_DWHOptimization
The workflow object m_CDC_DWHOptimization captures the changed data. It reads data from HDFS and
identifies the data that has changed. To increase performance, you can configure the mapping to run on
Hadoop cluster nodes in a Hive environment.
Sources. HDFS files that were the targets of the previous two mappings. The Data Integration Service
reads all of the data as a single column.
Expression transformations. Extract a key from the non-key values in the data. The expressions use the
INSTR function and SUBSTR function to perform the extraction of key values.
Joiner transformation. Performs a full outer join on the two sources based on the keys generated by the
Expression transformations.
Filter transformations. Use the output of the Joiner transformation to filter rows based on whether or not
the rows should be updated, deleted, or inserted.
Targets. HDFS files. The Data Integration Service writes the data to three HDFS files based on whether
the data is inserted, deleted, or updated.
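The guide does not list the exact expressions, but the INSTR/SUBSTR pattern used to split a key out of a single-column row can be sketched with the equivalent awk string functions, index and substr. The sample row and the comma delimiter below are assumptions for illustration, not the real sample data:

```shell
# Split the leading key from a delimited row, the way INSTR locates the
# delimiter position and SUBSTR extracts the characters before it.
# awk's index() plays the role of INSTR, substr() the role of SUBSTR.
row="CUST001,John,Smith,Portfolio-A"
key=$(echo "$row" | awk '{ print substr($0, 1, index($0, ",") - 1) }')
echo "$key"
```

In the mapping itself, the extracted keys feed the Joiner transformation that matches rows across the two sources.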
Consolidated_Mapping
The workflow object Consolidated_Mapping consolidates the data in the HDFS files and loads the data to the
data warehouse.
Sources. The HDFS files that were the target of the previous mapping are the sources of this mapping.
Expression transformations. Add the deleted, updated, or inserted tags to the data rows.
Union transformation. Combines the records.
Target. Flat file that acts as a staging location on the local file system.
Big Data Trial Sandbox includes samples that demonstrate the following use cases to process complex files:
The LogProcessing project in the Developer tool includes samples that you can use to read and parse complex files.
Use the samples to process daily web logs from an online trading site and write the parsed data to a flat file. The web
logs contain details about visitors who log in to the website and look up the value of stocks using stock symbols.
To process the web logs, use the web log processing workflow.
The following image shows the sample web log processing workflow:
To run the workflow from the command line, enter the following command:
./infacmd.sh wfs startWorkflow -dn infa_domain -sn infa_dis -un Administrator -pd
Administrator -Application app_logProcessing -wf wf_LogProcessing
To run the mappings in the workflow, open a mapping and right-click the mapping to run the mapping.
You can run the following mappings and transformations in the workflow:
m_LoadData
The workflow object m_LoadData reads the parsed web log data and writes to a flat file target. The source
and target are flat files.
m_sample_weblog_parsing
The workflow object m_sample_weblog_parsing is a logical data object read mapping that reads data from an HDFS source, parses the data using a Data Processor transformation, and writes to a logical data object.
The following image shows the mapping m_sample_weblog_parsing:
The following image shows the expanded logical data object read mapping m_sample_weblog_parsing:
Source. HDFS file that was the target of the previous mapping.
Data Processor transformation. Processes the input binary stream of data, parses the data, and writes to
XML format.
Joiner transformation. Combines the activity of visitors who return to the website on the same day with
stock queries.
Expression transformation. Adds the current date to each transformed record.
Target. Flat file.
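The actual parsing logic lives inside the Data Processor transformation, but the kind of field extraction it performs on each web log line can be illustrated with awk and sed. The sample line and its layout below are assumptions for illustration, not the real sample logs:

```shell
# Pull the client address and the queried stock symbol out of one web log line.
# The line format (common-log style with a symbol query parameter) is assumed.
line='10.0.0.1 - - [12/Mar/2014:10:15:32] "GET /quote?symbol=INFA HTTP/1.1" 200 512'
client=$(echo "$line" | awk '{ print $1 }')
symbol=$(echo "$line" | sed -n 's/.*symbol=\([A-Z]*\).*/\1/p')
echo "client=$client symbol=$symbol"
```

The transformation goes further than this sketch: it parses the full binary log stream and emits structured XML rather than individual fields.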
The Complex_File_Writer project in the Developer tool includes samples that you can use to write unstructured data
to complex files.
Use the samples to generate a report in XML format of the sales by country for each customer. You know the customer
purchase order details such as customer ID, product names, and item quantity sold. The purchase order details are
stored in semi-structured compressed XML files in HDFS. Create a mapping that reads all the customer purchase
records from the files in HDFS and use a Data Processor transformation to process the sales by country for each
customer. The mapping converts the semi-structured data to relational data and writes it to a relational target.
The following figure shows the Complex File Writer sample mapping:
Transformations
HDFS output
The output, Write_binary_single_file, is a complex file stored in HDFS.
Big Data Trial Sandbox for Hortonworks provides samples for the following NoSQL database:
HBase
HBase
Use HBase when you need random, real-time reads and writes from a database. HBase is a non-relational distributed
database that runs on top of the Hadoop Distributed File System (HDFS) and can store sparse data. Big Data Trial
Sandbox for Hortonworks provides samples that demonstrate how to read and process binary data from HBase.
The HBase_Binary_Data project in the Developer tool includes samples that you can use to read binary data from HBase tables, convert it to string data, and write it to a flat file target.
The sample HBase table contains the details of people and the cars that they purchased over a period of time. The
table contains the Details and Cars column families. The column names of the Cars column family are of String data
type. You can get all columns in the Cars column family as a single binary column. You can use the sample Java transformation to convert the binary data to string data. You can join the data from both the column families and write it
to a flat file.
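The conversion that the sample Java transformation performs, turning binary column bytes into readable strings, can be sketched in the shell. The decoded value below is a made-up illustration, not data from the sample HBase table, and the sketch assumes bash's printf escape handling:

```shell
# Decode a binary column value (shown here as byte escapes) into a string,
# the way the sample Java transformation converts HBase binary data.
# The bytes spell a made-up car name, not data from the sample table.
text=$(printf '\x54\x6f\x79\x6f\x74\x61\x20\x43\x61\x6d\x72\x79')
echo "$text"
```

In the mapping, the Java transformation performs this decoding for every binary value in the Cars column family before the join.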
To run the workflow from the command line, enter the wfs startWorkflow command.
To run the mappings in the workflow, open a mapping and right-click the mapping to run the mapping.
m_preson_Cars_Write_Static1
The workflow object references the m_pers_cars_static_reader mapping that transforms the binary data in an HBase data object to columns of the String data type and writes the details to a flat file data object.
Person_Car_Static_Read
The first source for the mapping is an HBase data object named Person_Car_Static that contains the
columns in the Details column family. The HBase read data object operation is named
Person_Car_Static_Read.
pers_cars_Static_bin_read
The second source for the mapping is an HBase data object named Person_cars_Static_bin that
contains the data in the Cars column family. The HBase read data object operation is named
pers_cars_Static_bin_read.
Transformations
Write_Person_Cars_FF
The target for the mapping is a flat file data object named Person_Cars_FF. The flat file data object write
operation is named Write_Person_Cars_FF to write data from the Cars and Details column families.
The Data Integration Service converts the binary column in Person_cars_Static_bin, joins the data in
Person_Car_Static, and writes the data to the flat file data object Write_Person_Cars_FF.
Troubleshooting
This section describes troubleshooting information.
The Informatica services might shut down when the machine on which you run the virtual machine goes into
hibernation or when you resume the virtual machine.
Run the following command to restart the services on the operating system of the virtual machine: sh /home/infauser/BDETRIAL/.cmdInfaServiceUtil.sh start
To debug mapping failures, check the error messages in the mapping log file.
VMWare Player displays a message that states it cannot power on a 64-bit virtual machine. Or, VMware
Player might display the following error when you play the virtual machine: The host supports Intel VT-x,
but Intel VT-x is disabled. Intel VT-x might be disabled if it has been disabled in the
BIOS/firmware settings or the host has not been power-cycled since changing this setting.
You must enable Intel Virtualization Technology in the BIOS of the machine on which VMware Player runs. For more information, see the VMware Knowledge Base.
Virtual machine is in a suspended state
If the virtual machine is in a suspended state, resume it and log in to the virtual machine. After you log in, the Informatica services and Hadoop services start automatically.
In VMware Player, select the virtual machine and click Play virtual machine.
Enter a user name and password for the virtual machine. The default user name and password are infa / infa.
The Developer tool takes a long time to connect to the Model repository
The Developer tool might take a long time to connect to the Model repository because the virtual machine
cannot find the IP address and host name of the client machine.
You must add the IP address and host name of the client machine on the hosts file of the virtual machine.
Use the ipconfig and hostname commands from the command line of the Windows machine to find the IP
address and hostname of the Windows machine.
Add the IP address and the host name to the hosts file on the virtual machine.
For example, the hosts file is located in the following location on the virtual machine: /etc/hosts
Mapping fails and job execution failed errors appear in the mapping log
If the mapping fails and you cannot determine the cause of the job execution failed errors that appear in the
mapping log, you can clear the contents of the following directory on the machine that hosts the virtual
machine: /tmp/infa. Then, run the mapping again.