
First-hand knowledge.

Reading Sample
In this selection from Chapter 5, you’ll get a taste of some of the
main building blocks that make the jobs run in SAP Data Services.
This chapter provides both the information you need to understand
each of the objects, as well as helpful step-by-step instructions to
get them set up and running in your system.

“Introduction”
“Objects”

Contents

Index

The Authors

Bing Chen, James Hanck, Patrick Hanck, Scott Hertel, Allen Lissarrague, Paul Médaille

SAP Data Services


The Comprehensive Guide
524 Pages, 2015, $79.95/€79.95
ISBN 978-1-4932-1167-8

www.sap-press.com/3688

© 2015 by Rheinwerk Publishing, Inc. This reading sample may be distributed free of charge. In no way must the file be altered, or
individual pages be removed. The use for any commercial purpose other than promoting the book is strictly prohibited.

Introduction

Welcome to SAP Data Services: The Comprehensive Guide. The mission of this book is to capture information in one source that will show you how to plan, develop, implement, and perfect SAP Data Services jobs to perform data-provisioning processes simply, quickly, and accurately.

This book is intended for those who have years of experience with SAP Data Services (which we'll mainly refer to as just Data Services) and its predecessor, Data Integrator, as well as those who are brand new to this toolset. The book is designed to be useful for architects, developers, data stewards, IT operations, data acquisition and BI teams, and management teams.

Structure of the Book

This book is divided into four primary sections.

왘 PART I: Installation and Configuration (Chapters 1–3)
We start by looking at planning out your systems, the installation process, how the system operates, and how to support and maintain your environments.

왘 PART II: Jobs in SAP Data Services (Chapters 4–8)
The book then delves into the functionality provided in Data Services and how to leverage that functionality to build data-provisioning processes. You are taken through the process of using the integrated development environments, which are the building blocks for creating data-provisioning processes.

왘 PART III: Applied Integrations and Design Considerations (Chapters 9–12)
Here you're taken through use cases that leverage Data Services, from requirements to developed solution. Additionally, design considerations are detailed to make sure the data-provisioning processes that you created perform well and are efficient long after they are released into production environments.

왘 PART IV: Special Topics (Chapters 13–14)
The book then explores how SAP Information Steward, when used in conjunction with Data Services, enables transparency of data-provisioning processes
and other key functionality for organizations' EIM strategy and solutions. Finally, the book looks into the outlook of Data Services.

With these four parts, the aim is to give you a navigation panel on where you should start reading first. Thus, if your focus is more operational in nature, you'll likely focus on Part I, and if you're a developer new to Data Services, you'll want to focus on Parts II and III.

The following is a detailed description of each chapter.

Chapter 1, System Considerations, leads technical infrastructure resources, developers, and IT managers through planning and implementation of the Data Services environment. The chapter starts with a look at identifying the requirements of the Data Services environments for your organization over a five-year horizon and how to size the environments appropriately.

Chapter 2, Installation, is for technical personnel. We explain the process of installing Data Services on Windows and Linux environments.

Chapter 3, Configuration and Administration, reviews components of the landscape and how to make sure your jobs are executed in a successful and timely manner.

Chapter 4, Application Navigation, walks new users through a step-by-step process on how to create a simple batch job using each of the development environments in Data Services.

Chapter 5, Objects, dives into the primary building blocks that make a job run. Along the way, we provide hints to make jobs run more efficiently and make them easier to maintain over their useful life expectancy and then some!

Chapter 6, Variables, Parameters, and Substitution Parameters, explores the process of defining and using variables and parameters. We'll look at a few examples that use these value placeholders to make your Data Services jobs more flexible and easier to maintain over time.

Chapter 7, Programming with SAP Data Services Scripting Language and Python, explores how Data Services makes its toolset extensible through various coding languages and coding objects. Before that capability is tapped, the question, "Why code?" needs to be asked and the answer formulated to ensure that the coding activity is being entered into for the right reasons.

Chapter 8, Change Data Capture, is for developers and technical personnel to design and maintain jobs that leverage change data capture (CDC). It starts with what CDC is and the use cases that it supports. This chapter builds a data flow that leverages CDC and ensures a voluminous source table is efficiently replicated to a data warehouse.

Chapter 9, Social Media Analytics, explores the text data processing (TDP) capability of Data Services, describes how it can be used to analyze social media data, and suggests more use cases for this important functionality.

Chapter 10, Design Considerations, builds on the prior chapters to make sure jobs are architected with performance, transparency, supportability, and long-term total cost in mind.

Chapter 11, Integration into Data Warehouses, takes you through data warehousing scenarios and strategies to solve common integration challenges incurred with dimensional and factual data leveraging Data Services. The chapter builds data flows to highlight proper provisioning of data to data warehouses.

Chapter 12, Industry-Specific Integrations, takes the reader through integration strategies for the distribution and retail industries leveraging Data Services.

Chapter 13, SAP Information Steward, is for the stewardship and developer resources and explores common functionality in Information Steward that enables transparency and trust in the data provisioned by Data Services and how the two technology solutions work together.

Chapter 14, Where Is SAP Data Services Headed?, explores the potential future of Data Services.

You can also find the code that's detailed in the book available for download at www.sap-press.com/3688.

Common themes presented in the book include simplicity in job creation and the use of standards. Through simplicity and standards, we're able to quickly hand off to support or comprehend from the development team how jobs work. Best practice is to design and build jobs with the perspective that someone will have to support it while half asleep in the middle of the night or from the other side of the world with limited or no access to the development team. The goal for this book is that you'll be empowered with the information on how to create and improve your organization's data-provisioning processes through simplicity and standards to ultimately achieve efficiency in cost and scale.
This chapter dives into the primary building blocks that make a job run.
Along the way, we provide hints to make jobs run more efficiently and
make them easier to maintain over their useful life expectancy and then
some!

5 Objects

Now that you know how to create a simple batch integration job using the two
integrated development environments (IDEs) that come with SAP Data Services,
we dive deeper and explore the objects available to specify processes that enable
reliable delivery of trusted and timely information. Before we get started, we
want to remind you about the Data Services object hierarchy, illustrated in Figure
5.1. This chapter traverses this object hierarchy and describes the objects in
detail.

As you can see from the object hierarchy, the project object is at the top. The project object is a collection of jobs that is used as an organization method to create job groupings.

Note

A job can belong to multiple projects.

We’ll spend the remainder of this chapter discussing the commonly used objects
in detail.

[Figure 5.1 depicts the object hierarchy: the project at the top; batch and real-time jobs; workflows, data flows, scripts, annotations, conditionals, while loops, try-catch blocks, and logs; sources, targets, transforms (Data Integrator, Data Quality, and Platform transforms), datastores, functions, and formats (document, file format, message, outbound message, COBOL copybook file format, DTD, Excel workbook format, table, template table, XML file, XML message, XML schema, and XML template). Functions can be called from multiple objects; workflows are optional, as will be described in Section 5.2.]

Figure 5.1 Object Hierarchy

5.1 Jobs

When Data Services is considered as an integration solution (under any of its predecessor names, such as BODS, BODI, DI, or Acta), it's often only considered as an Extraction, Transformation, Load (ETL) tool where a job is scheduled, runs, and finishes. Although this use case only describes the batch job, in addition to the batch job object type, there also exists a real-time job object type. The job object is really two distinct objects, the batch job object and the real-time job object. We break each of these objects down in detail in the following sections.

5.1.1 Batch Job Object

You might be thinking that a "batch" job sounds old, decrepit, and maybe even legacy mainframe-ish. In a way, a batch job is like a legacy mainframe process, as it's a collection of objects ordered and configured in a way to process information that can be scheduled to execute. That's where the similarities end. A batch job can also be scaled to perform on multiple servers and monitored in a way that would make any mainframe job jealous. To carry this out, a batch job has properties and a collection of child objects, including workflows, data flows, scripts, conditionals, while loops, and logs. The properties of the batch job object are discussed in the following subsections.

Batch Job Execution Properties

Batch job execution properties enable control over logging, performance, and overriding default values for global substitution parameters and global variables (we'll go into more detail on this in Chapter 6).

Whether you're trying to diagnose an issue in an integration process you're developing, performing QA tests against a job en route to production, or doing root-cause analysis on an issue in production, the logging options in batch job execution properties enable insight into what's occurring while the process, job object, and all other included objects are executing. We discuss the many logging options in the following subsections.

Monitoring a Sample Rate

This specifies the polling interval, in seconds, of how often you want logs to capture status about sourcing, targeting, and transforming information. Specifying a small number captures information more often; specifying a large number consumes fewer system resources but lets more time pass before you're presented with the errors in the logs.

Trace Messages: Printing All Trace Messages versus Specifying One by One

By specifying Print All Trace Messages, you ignore the individually selected trace options. With all traces being captured, the results can be quite verbose, so this option shouldn't be used when diagnosing an issue, although it does present a simple method of capturing everything. Alternatively, if Print All Trace Messages is unchecked, you can specify individually which traces you want to be captured.
Disable Data Validation Statistics Collection

If your job contains data flows with validation transforms, selecting this option will forgo collecting those statistics.

Enable Auditing

If you've set audits via Data Services Designer (e.g., a checksum on a field in a table source within a data flow), this will enable you to toggle the capture of that information on and off.

Collect Statistics for Optimization and Use Collected Statistics

By running your batch job with Collect statistics for optimization set, each row being processed by contained data flows has its cache size evaluated. So jobs being executed in production shouldn't be scheduled to always collect statistics for optimization. After statistics have been collected and the job is executed with the Use Collected Statistics option, you can determine whether to use In Memory or Pageable caches as well as the cache size.

Collect Statistics for Monitoring

This option captures the information pertaining to caching type and size into the log.

Job Performance and Pickup Options

There are also execution options that enable you to control how the job will perform and pick up after a failed run. We discuss these in the following subsections.

Enable Recovery and Recover Last Failed Execution

By selecting Enable Recovery, a job will capture results to determine where a job was last successful and where it failed. Then if a job did fail with Enable Recovery set, a subsequent run can be performed that will pick up at the beginning of the last failed object (e.g., workflows, data flows, scripts, and a slew of functions).

Note

A workflow's Recover as a unit option will cause Recover last failed execution to start at the beginning of that workflow instead of the last failed object within that workflow.

Job Server or Server Group

A job can be executed from any job server linked with the local repository upon which the job is stored. A listing of these job servers is presented in the drop-down list box. If one or more groups have been specified to enable load balancing, they too are listed as options.

Note

Job servers within a job server group collect load statistics at 60-second intervals to calculate a load balance index. The job server with the lowest load balance index is selected to execute the next job.

Distribution Level

If you've selected for the job to be executed on a server group, the job execution can be processed on multiple job servers depending on the distribution level value chosen. The choices include the following:

왘 Job Level
No distribution among job servers.

왘 Data flow Level
Processes are split among job servers to the data flow granularity; that is, a single data flow's processing isn't divided among multiple job servers.

왘 Sub-Data flow Level
A single data flow's processing can be divided among multiple job servers.

5.1.2 Real-Time Job Object

Like a batch job, a real-time job object has execution properties that enable differing levels of tracing and some of the execution objects. Unlike a batch job, a real-time job is started and typically stays running for hours or days at a time. Now this doesn't mean real-time jobs are slow; rather, they stay running as a service and respond to numerous requests in a single run. The rest of this section will show how you create a real-time job, execute a real-time job, and finally make requests.

To create a real-time job from the Data Services Designer IDE, from the Project Area or the Jobs tab of the local repository, right-click, and choose New Real-time Job as shown in Figure 5.2.
Figure 5.2 Creating a New Real-Time Job

The new job is opened with process begin and end items. These two items segment the job into three sections: the initialization (prior to process begin), the real-time processing loop (between process begin and end), and cleanup (after process end), as shown in Figure 5.3.

Figure 5.3 Real-Time Job Logical Sections

If we were to replace the annotations with objects that commonly exist in a real-time job, it might look like Figure 5.4. In this figure, you can see that initialization, the real-time processing loop, and clean-up are represented by SC_Init, DF_GetCustomerLocation, and SC_CleanUp, respectively.

Figure 5.4 Real-Time Job with Objects

Within the DF_GetCustomerLocation data flow, you'll notice two sources and one target. Although there can be many objects within a real-time job's data flow, it's required that they at least have one XML Message Source object and one XML Message Target object.

A real-time job can even be executed via Data Services Designer the same way you execute a batch job: by pressing the (F8) key (or using the menu option Debug • Execute) while the job window is active. Even though real-time jobs can be executed this way in development and testing situations, the execution in production implementations is managed by setting the real-time job to execute as a service via the access server within the Data Services Management Console.

Note

For a real-time job to be executed successfully via Data Services Designer, the XML Source Message and XML Target Message objects within the data flow must have their XML test file specified, and all sources must exist.
To set up the real-time job to execute as a service, you navigate to the Real-Time Services folder, as shown in Figure 5.5, within the access server under which we want it to run.

Figure 5.5 Real-Time Services

In the Real-Time Services window, click the Real-Time Services Configuration tab, and then click the Add button. The window appears filled in, as shown in Figure 5.6.

Figure 5.6 Real-Time Service Configuration

After the configuration has been applied, you can start the service by clicking the Real-Time Services folder, selecting the newly added service, and then clicking Start (see Figure 5.7). The status changes to Service Started.

Figure 5.7 Real-Time Services—Start

Now that the service is running, it needs to be exposed. To do this, navigate to the Web Services Configuration page, select Add Real-time Service in the combo list box at the bottom, and then click Apply (see Figure 5.8).

Figure 5.8 Web Services Configuration
The Web Services Configuration – Add Real-Time Services window appears as shown in Figure 5.9. Click Add to expose the web service, and a confirmation status message is shown.

Figure 5.9 Web Services Configuration—Add Real-Time Services

The status of the web service can now be viewed from the Web Services Status tab, where it appears as a published web service (see Figure 5.10). The web service is now ready to accept web service requests.

Figure 5.10 Web Services Status

Note

At this point, you can also view the history log from the Web Services Status page, where you can see the status from the initialization and processing steps. The clean-up process status isn't included because the service is still running and hasn't yet been shut down.

To test the web service externally from Data Services, a third-party product such as SoapUI can be used. After importing the web service's Web Services Description Language (WSDL) information into the third-party product, a request can be created and executed to generate a response from your real-time job. The sample request and response are shown in Figure 5.11 and Figure 5.12.

Figure 5.11 Sample SOAP Web Service Request

Figure 5.12 Sample SOAP Web Service Response

The rest of this chapter focuses on the remaining Figure 5.1 objects that job objects rely on to enable their functionality and support their processing.

5.2 Workflow

A workflow and a job are very similar. As shown in the object hierarchy, many of the same objects that interact with a job can optionally also interact with a workflow, and both can contain objects and be configured to execute in a specified
order of operation. Following are the primary differences between a job and a workflow:

왘 Although both have variables, a job has global variables, and a workflow has parameters.

왘 A job can be executed, and a workflow must ultimately be contained by a job to be executed.

왘 A workflow can contain other workflows and even recursively call itself, although a job isn't able to contain other jobs.

왘 A workflow has the Recover as a unit option, described later in this section, as well as the Continuous option, also described later in this section.

You might wonder why you need workflows when you can just specify the order of operations within a job. In addition to some of the functionality mentioned later in this section, the use of workflows in an organization will depend heavily on the organization's standards and design patterns. Such design patterns might be influenced by the following:

왘 Organizations that leverage an enterprise scheduling system sometimes like workflow logic to be built into the scheduler's streams instead of into the jobs that it calls. In such instances, a rule can emerge that each logical unit of work is a separate job. This translates into each job having one activity (e.g., a data flow to effect the insertion of data from table A to table B). On the other side of the spectrum, organizations that build all their workflow into Data Services jobs might have a single job provision customer data from the source of record to all pertinent systems. This latter design pattern might have a separate workflow for each pertinent system so that logic can be compartmentalized to enable easier support and/or rerun ability.

왘 Organizations may have unique logging or auditing activities that are common across a class of jobs. In such cases, rather than rebuilding that logic over and over again to be included in each job, your best option is to create one workflow in a generic manner to contain that logic so that it can be written once and used by all jobs within that class.

Recommendation

Perform a yearly review of the jobs released into the production landscape. Look for opportunities for simplification where an activity has been written multiple times in multiple jobs. These activities can be encapsulated within a common workflow. Doing so will make future updates to that activity easier to implement and ensure that all consuming jobs of the activity get the update. When creating the workflows to contain the common activity, make sure to denote that commonality with the name of the workflow. This is often done with a CWF_ prefix instead of the standard WF_.

5.2.1 Areas of a Workflow

A workflow has two tabs, the General tab and the Continuous Options tab. In addition to the standard object attributes such as name and description, the General tab also allows specification of workflow-only properties as shown in the following list and in Figure 5.13:

1 Execution Type
There are three workflow execution types:

왘 Regular is a traditional encapsulation of operations and enables subworkflows.

왘 Continuous is a workflow type introduced in Data Services version 4.1 (this is discussed in detail in Section 5.2.2).

왘 Single is a workflow that specifies that all objects encapsulated within it are to execute in one operating system (OS) process.

2 Execute only once
This applies in the rare case where a job includes the same workflow more than once and only one execution of the workflow is required. You may be wondering when you would ever include the same workflow in the same job. Although rare, this does come in handy when you have a job with parallel processes, and the workflow contains an activity that more than one subprocess depends on, where it varies which subprocess gets to it first. In that case, you would include the workflow prior to each subprocess and check the Execute only once checkbox on both instances. During execution, the first instance to execute will perform the operations, and the second execution will be bypassed.

3 Recover as a unit
This forces all objects within a workflow to reexecute when the job being executed with the Enable recovery option subsequently fails, and then the job is reexecuted with the Recover from last failed execution option. If the job is executed with Enable recovery, and the workflow doesn't have the Recover as a unit option selected, the job restarts in recovery mode from the last failed
object within that workflow (assuming the prior failure occurred within the workflow).

4 Bypass
This was introduced in Data Services 4.2 SP3 to enable certain workflows (and data flows) to be bypassed by passing a value of YES. Its stated purpose is to facilitate testing processes where not all activities are required, and it isn't intended for production use.

Figure 5.13 Workflow Object—General Tab

5.2.2 Continuous Workflow

Setting the stage: an organization wants a process to wait for some event to occur before it kicks off a set of activities, and then, upon finishing the activities, it needs to await the next event occurrence to repeat. This needs to occur from some start time to some end time. To effect a standard, highly efficient process, Data Services 4.1 released the continuous workflow execution type setting. After setting the execution type to Continuous, the Continuous Options tab becomes enabled (see Figure 5.14).

Prior to Data Services 4.1, this functionality could be accomplished by coding your own polling process to check for an event occurrence, perform some activities, and then wait for a defined interval period. Although the same functional requirements can be met without a continuous workflow, the technical performance and resource utilization are greatly improved by using it. As shown in Figure 5.15, the instantiation and cleanup processes only need to occur once (or as often as resources are released), and as shown in Figure 5.16, instantiation and cleanup processes need to occur for each cycle, resulting in more system resource utilization.

Figure 5.14 Workflow—Continuous

Figure 5.15 Continuous Workflow
Figure 5.16 Continuous Operation Prior to the Continuous Workflow

5.3 Logical Flow Objects

Logical flow objects are used to map business logic in the workflows. These objects are available in the job and workflow workspaces but not in the data flow workspace. The logical flow objects consist of the conditional, while loop, and try and catch block objects.

5.3.1 Conditional

A conditional object can be dragged onto the workspace of a job or workflow object to enable an IF-THEN-ELSE condition. There are three areas of a conditional object:

왘 Condition

왘 Workspace area to be executed on the condition resulting in true

왘 Workspace area to be executed on the condition resulting in false
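For example, the Condition area holds a Boolean expression written in the Data Services scripting language. A minimal sketch, using an illustrative global variable that isn't part of the book's example:

   $GV_LOAD_TYPE = 'FULL'

Objects placed in the Then workspace execute when the expression evaluates to true; objects placed in the Else workspace execute otherwise.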

5.3.2 While Loop

The while loop has two components, a condition and a workspace area. When the while condition results in true, and for as long as it stays true, the objects within its workspace are repeatedly executed.
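As a rough sketch of how the pieces fit together (the variable name and limit are illustrative): a script placed before the while loop initializes a counter, the loop's condition tests it, and a script inside the loop's workspace increments it so the loop eventually ends.

   # Script placed before the while loop
   $GV_ATTEMPT = 1;

   # While loop condition
   #   $GV_ATTEMPT <= 5

   # Script inside the loop workspace, after the objects doing the work
   $GV_ATTEMPT = $GV_ATTEMPT + 1;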
5.3.3 Try and Catch Blocks

The try and catch objects, also collectively referred to as a try and catch block, are two separate objects that are always used as a pair.

The try object is dragged onto the beginning of a sequence of objects on a job or workflow, and a catch object is dragged to where you want the execution encapsulation to end. Between these two objects, if an error is raised in any of the sequenced objects, the execution moves directly to the catch object, where a new sequence of steps can be specified.
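For instance, a script object placed in the catch's workspace can record and surface the failure. This is a sketch only; it assumes the error_message() catch-scope function and the job_name() function available in recent Data Services versions, and the message text is illustrative:

   # Script inside the catch object's workspace
   print('Job ' || job_name() || ' caught an error: ' || error_message());
   raise_exception('Stopping after a handled error; see the error log for details.');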
5.4 Data Flows

Data flows are collections of objects that represent source data, transform objects, and target data. As the name suggests, a data flow also defines the flow the data will take through the collection of objects.

Data flows are reusable and can be attached to workflows to join multiple data flows together, or they can be attached directly to jobs.

There are two types of data flows: the standard data flow and the ABAP data flow. The ABAP data flow is an object that can be called from the standard object. The advantage of the ABAP data flow is that it integrates with SAP applications for better performance. A prerequisite for using ABAP data flows is to have transport files in place on the source SAP application.

The following example demonstrates the use of a standard data flow and an ABAP data flow. We'll create a standard data flow that has an ABAP data flow that accesses an SAP ECC table, a pass-through Query transform, and a target template table.

</ENTRY>
<ENTRY>
<SELECTION>11</SELECTION>
<PRIMARY_NAME1>TRINITY PL</PRIMARY_NAME1>
</ENTRY>
</SUGGESTION_LIST>
Listing 5.1 SUGGESTION_LIST Output Field

DSF2® Walk Sequencer Transform

The DSF2 Walk Sequencer transform allows mailers to get postal discounts on their mailings. Applying walk sequencing to a mailing sorts the list of addresses into the order the carrier walks their route. The USPS encourages the use of walk sequencing so that delivery operations are more efficient. Prior to using the DSF2 Walk Sequencer transform, the addresses must be CASS certified and delivery point validated (DPV). Both of these conditions can be satisfied using the USA Regulatory Address Cleanse transform.

The DSF2 Walk Sequencer transform enables the following types of postal discounts:

왘 Sortcode discount
왘 Walk Sequence discount
왘 Total Active Deliveries discount
왘 Residential Saturation discount

5.6 Datastores

In addition to connecting to relational databases, as explored in Chapter 3 and Chapter 4, datastores are also able to source and target applications such as SAP Business Warehouse (SAP BW). There is also a new datastore type introduced in Data Services version 4.2, RESTful web services. Both are explored in this section.

5.6.1 SAP BW Source Datastores

Because different objects are exposed on the SAP BW side for reading and loading data, as shown in Figure 5.109, Data Services uses two different types of datastores:
왘 SAP BW source datastore for reading data from SAP BW
왘 SAP BW target datastore for loading data into SAP BW

An SAP BW source datastore is very similar to the SAP application datastore in that you can import the same objects as SAP applications except for hierarchies. SAP BW provides additional objects to browse and import, including InfoAreas, InfoCubes, Operational Datastore (ODS) objects, and Open Hub tables.

[Figure 5.109 shows the transactional layout between SAP Business Warehouse (DataSource/PSA, Staging BAPI, InfoCube, InfoObjects, Open Hub Tables, and Open Hub Services) and SAP Data Services (Management Console, RFC Server, Designer, Datastore Object, and Job Server), with load and read paths between the two systems.]

Figure 5.109 SAP Data Services—SAP BW Transactional Layout

There are several steps to setting up a job to read data from SAP BW. We'll detail them all here.

To set up the RFC server configuration, follow these steps:

1. On the Data Services side, through the Administrator module on the Data Services Management Console, go into SAP Connections • RFC Server Interface.

2. Click the RFC Server Interface Configuration tab, and then click the Add button.

3. Enter the information in the fields as shown in Figure 5.110. Click Apply.

The RFC ProgramID is the key identifier between the two systems, and although it can be whatever you want, it needs to be identical between the systems and descriptive. RFC ProgramID is also case sensitive. Username and Password are connection details for SAP BW; it's recommended that this be a communication user ID and not a dialog user ID. The SAP Application Server name and SAP Gateway Hostname fields correspond with the server name of your SAP BW instance. The Client Number and System Number also correspond with those of your SAP BW instance.

Figure 5.110 RFC Configuration in Data Services

To set up the source system, follow these steps:

1. On the SAP BW side, through the BW Workbench: Modeling, create a new source system by going to Source System • External Source • Create.

2. You'll be prompted for a Logical System Name and Source System Name. (For consistency, we tend to make the Logical System Name the same as the RFC ProgramID we entered for the RFC server configuration. The Source System Name we treat as a short description field.) Click the checkmark button to continue.

3. You'll be presented with the RFC Destination page as shown in Figure 5.111. The Logical System Name is shown in the RFC Destination field. The Source System Name is shown in the Description 1 field. The Activation Type should be set to Registered Server Program. The Program ID field needs to match (exactly) the RFC ProgramID. Click the Apply button to complete the configuration.
Figure 5.111 SAP BW Source System Create Configuration

To set up the SAP BW source datastore, follow these steps:

1. In the Data Services Designer, right-click and select New in the Datastore tab of the Local Object Library.

2. Within the Create New Datastore editor, select SAP BW Source as shown in Figure 5.112.

3. Click the Advanced button to open the lower editor window. Under SAP properties, select the ABAP execution option (Generate and Execute or Execute Preloaded).

4. Enter a Client number and a System number.

5. Select a Data Transfer Method (RFC, Shared Directory, Direct Download, FTP, or Customer Transfer).

6. Depending on the Data Transfer Method, RFC Destination, Working Directory, and Generated ABAP Directory information will also be needed.

7. Click OK to complete the datastore creation.

Figure 5.112 Create New Datastore—SAP BW Source

5.6.2 SAP BW Target Datastore

If you'll be loading data into SAP BW, there are five main steps, which we discuss in the following subsections:

1. Set up InfoCubes and InfoSources in SAP BW.
2. Designate Data Services as a source system in SAP BW.
3. Create an SAP BW target datastore in Data Services.
4. Import metadata with the Data Services datastore.
5. Construct and execute a job.

Setting Up the InfoCubes and InfoSources in SAP BW with the SAP Data Warehousing Workbench

An InfoSource will be used to hold data that is placed from Data Services. You create and activate an InfoSource using the following steps:
1. In the Modeling section of the SAP Data Warehousing Workbench, go to the InfoSource window (InfoSources tab).

2. Right-click InfoSources at the top of the hierarchy, and select Create Application Component. (Application components are tree structures used to organize InfoSources.)

3. Complete the window that appears with appropriate information. For Application Comp, for example, you might enter "DSAPPCOMP". For Long description, you might enter "Data Services application component". Press (Enter).

4. The application component is created and appears in the hierarchy list. Right-click the name of your new application component in the component list, and select Create InfoSource.

5. The Create InfoSource: Select Type window appears. Select Transaction data as the type of InfoSource you're creating, and press (Enter).

6. The Create InfoSource (transaction data) window appears. Enter the appropriate information, and press (Enter). The new InfoSource appears in the hierarchy under the application component name.

InfoCubes should be created and activated where the extracted data will ultimately be placed.

Creating the SAP BW Target Datastore

To create the target datastore, follow these steps:

1. In the Data Services Designer, right-click and select New in the Datastore tab of the Local Object Library.

2. Within the Create New Datastore editor, select SAP BW Target as shown in Figure 5.113.

3. Click the Advanced button to open the lower editor window.

4. Enter the Client number and System number.

5. Click OK to complete the datastore creation.

Figure 5.113 Create New Datastore—SAP BW Target

5.6.3 RESTful Web Services

RESTful web services, or Representational State Transfer (REST), is a type of web service communication method that allows for the retrieval and manipulation of data through a web server. It relies on standard HTTP operations to select (GET), insert (POST), update (PUT), and delete (DELETE) data. For example, Google provides a web application based on REST to return geographical data for a given input address. The input address is passed to the REST-based web application using HTTP parameters. The following URL calls this web application and specifies Chicago as the value for the HTTP parameter address (address=Chicago). Listing 5.2 shows the resulting geographical information for Chicago returned by the REST service in an XML format.

http://maps.googleapis.com/maps/api/geocode/xml?address=Chicago

<?xml version="1.0" encoding="UTF-8"?>
<GeocodeResponse>
  <status>OK</status>
  <result>
    <type>locality</type>
    <type>political</type>
    <formatted_address>Chicago, IL, USA</formatted_address>
    <address_component>
      <long_name>Chicago</long_name>
      <short_name>Chicago</short_name>
      <type>locality</type>
      <type>political</type>
    </address_component>
    <address_component>
      <long_name>Cook County</long_name>
      <short_name>Cook County</short_name>
      <type>administrative_area_level_2</type>
      <type>political</type>
    </address_component>
    <address_component>
      <long_name>Illinois</long_name>
      <short_name>IL</short_name>
      <type>administrative_area_level_1</type>
      <type>political</type>
    </address_component>
    <address_component>
      <long_name>United States</long_name>
      <short_name>US</short_name>
      <type>country</type>
      <type>political</type>
    </address_component>
    <geometry>
      <location>
        <lat>41.8781136</lat>
        <lng>-87.6297982</lng>
      </location>
      <location_type>APPROXIMATE</location_type>
      <viewpoint>
        <southwest>
          <lat>41.6443349</lat>
          <lng>-87.9402669</lng>

Listing 5.2 Google Maps API Sample XML Output

RESTful web services have enjoyed increased popularity in recent years over other communication methods such as the Simple Object Access Protocol (SOAP). One of the reasons REST has become more popular is because of its simplicity using URL parameters and standard HTTP methods for communicating with the web application to retrieve and manipulate data. In addition, REST supports a wider range of return data formats such as JavaScript Object Notation (JSON), XML, and plain text. REST is also more efficient when communicating with the web application because SOAP requires an XML-formatted request envelope to be sent rather than just HTTP parameters.

5.6.4 Using RESTful Applications in Data Services

Beginning with version 4.2, Data Services allows for communicating with a REST-based application through the use of a web service datastore object. The available functions from the REST application can then be imported into the repository and called from within data flows. For example, you may have a need to perform a lookup on key master data from a REST application managing master data. The function call can be done in a Query transform, and the data returned by the REST function call can then be used in your data flow.

To configure a REST web service datastore object, a Web Application Description Language (WADL) file is required. A WADL file is an XML-formatted file that describes the functions and their corresponding parameters available from the REST application. These functions typically use the HTTP methods GET, PUT, POST, and DELETE. Listing 5.3 shows a sample WADL file describing a REST application that has various functions which allow for the retrieval and manipulation of container and item data. Each of the available functions is based on a standard HTTP method.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<application xmlns="http://research.sun.com/wadl/200/10">
  <doc xmlns:jersey="http://jersey.dev.java.net/"
       jersey:generatedBy="Jersey: 1.0-ea-SNAPSHOT 10/02/2008 12:17 PM"/>
  <resources base="http://localhost:9998/storage/">
    <resource path="/containers">
      <method name="GET" id="getContainers">
        <response>
          <representation mediaType="application/xml"/>
        </response>
      </method>
      <resource path="(container)">
        <param xmlns:xs="http://www.w3.org/2001/XMLSchema"
               type="xs:string" style="template" name="container"/>
        <method name="PUT" id="putContainer">
          <response>
            <representation mediaType="application/xml"/>
          </response>
        </method>
        <method name="DELETE" id="deleteContainer"/>
        <method name="GET" id="getContainer">
          <request>
            <param xmlns:xs="http://www.w3.org/2001/XMLSchema"
                   type="xs:string" style="query" name="search"/>
          </request>
          <response>
            <representation mediaType="application/xml"/>
          </response>
        </method>
        <resource path="(item: .+)">
          <param xmlns:xs="http://www.w3.org/2001/XMLSchema"
                 type="xs:string" style="template" name="item"/>
          <method name="PUT" id="putItem">
            <request>
              <representation mediaType="*/*"/>
            </request>
            <response>
              <representation mediaType="*/*"/>
            </response>
          </method>
          <method name="DELETE" id="deleteItem"/>
          <method name="GET" id="getItem">
            <response>
              <representation mediaType="*/*"/>
            </response>
          </method>
        </resource>
      </resource>
    </resource>
  </resources>
</application>

Listing 5.3 WADL File Example

In addition to the WADL file, the web service datastore objects can be configured for security and encryption as well as configured to use XML- or JSON-formatted data. After the web service datastore has been created, the available functions can be browsed through the Datastore Explorer in the Data Services Designer application. Figure 5.114 shows the view from the Datastore Explorer in Data Services Designer for the same REST application with functions for manipulating container and item data. Notice all seven functions from the WADL file are present along with a nested structure for the functions.

Any of the available functions can then be imported into the repository through the Datastore Explorer. Once imported, they become ready for use in data flows. Figure 5.115 shows the getItem function from our WADL file with both a request schema and reply schema defined. Notice the function is based on the HTTP method GET. The request schema will include the available input parameters for a function, which in this case is the parameter item. The reply schema will include the XML- or JSON-formatted data in addition to HTTP error codes (AL_ERROR_NUM) and error messages (AL_ERROR_MSG). These fields can then be integrated and used in downstream data flow processing logic.
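As a hedged sketch of such downstream handling, a Query transform output column could flag failed calls with a Data Services expression like the following. The AL_ERROR_NUM and AL_ERROR_MSG names come from the imported reply schema described above; treating 0 as "no error" is an assumption to verify against your own REST application:

   ifthenelse(AL_ERROR_NUM = 0, 'OK', 'REST call failed: ' || AL_ERROR_MSG)

Rows flagged as failed can then be routed to an error table or used to raise an exception in a later step.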
<method name="DELETE" id="deleteItem"/>
<method name="GET" id="getItem">
<response>
<representation mediaType="*/*"/>
</response>
</method>
</resource>
</resource>
</resource>
</resources>
</application>
Listing 5.3 WADL File Example

In addition to the WADL file, the web service datastore objects can be configured
for security and encryption as well as configured to use XML- or JSON-formatted
data. After the web service datastore has been created, the available functions can
be browsed through the Datastore Explorer in the Data Services Designer appli-
Figure 5.115 Data Services REST Function Example
cation. Figure 5.114 shows the view from the Datastore Explorer in Data Services
Designer for the same REST application with functions for manipulating con-
tainer and item data. Notice all seven functions from the WADL file are present
along with a nested structure for the functions. 5.7 File Formats
Any of the available functions can then be imported into the repository through As datastore structures provide metadata regarding column attribution to data-
the Data Store Explorer. Once imported, they become ready for use in data flows. bases and application, file formats provide the same for files. We’ll go over the
Figure 5.115 shows the getItem function from our WADL file with both a request different file format types and explain how to use them in the following sections.
schema and reply schema defined. Notice the function is based on the HTTP

308 309
5.7.1 Flat File Format

To use a flat file as a source or target data source, it needs to first exist in the local repository and be visible in the Local Object Library area of the Data Services Designer. The object is a template rather than a specific file. After the object is added to a data flow, it can be configured to represent a specific file or set of files. We'll step through the configuration later in this section, but first we'll create the file format template.

The first step in creating a flat file format template is to right-click Flat Files, and select New as shown in Figure 5.116. The Flat Files objects are located in the Data Services Designer, in the Local Object Library area, in the File Format section.

Figure 5.116 Creating a Flat File Object

Creating a Flat File Template

The File Format Editor screen will appear after selecting New as shown in Figure 5.117. This editor will be used to configure the flat file's structure, data format, and source or target file location. We'll explain the common configuration steps as we step through examples of creating flat file templates. There are several ways to define the flat file template; one way is to manually create one. To do this, you modify the settings in the left-hand column (Figure 5.117) and define the data structure in the upper-right section.

Figure 5.117 File Format Editor

In the left-hand column, the first option to consider is the type of flat file. The default choice is Delimited, which appropriately is the typical selection. The other choices include Fixed Width, in which the data fields are fixed lengths. These files tend to be easier to read but can be inefficient for storing empty spaces for values smaller than the defined widths. The SAP Transports type is used when you want to create your own SAP application file format. This is the case if the predefined Transport_Format wasn't suitable to read from or write to a file in an SAP data flow. The last two types are Unstructured Text and Unstructured Binary. The Unstructured Text type is used to consume unstructured files such as text files, HTML, or XML. Unstructured Binary is used to read unstructured binary files such as Microsoft Office documents (Word, Excel, PowerPoint, Outlook emails), generic .eml files, PDF files, and other binary files (Open Document, Corel WordPerfect, etc.). You can then use these sources with Text Data Processing transforms to extract and manipulate data.

The second configuration option is Name. It can't be blank, so a generated name is entered by default. Assuming you want a more descriptive name to reference, you'll need to change the name prior to saving. After the template is created, the Name can't be changed.

The third item for the Delimited file type is Adaptable Schema. This is a useful option when processing several files whose formats aren't consistent. For example,
you're processing 10 files, and some have a comments column at the end, whereas others do not. In this case, you'll receive an error because your schema is expecting a fixed number of elements and finds either more or less. With Adaptable Schema set at Yes, you define the maximum number of columns. For the files with fewer columns, null values are inserted into the missing column placeholders. There is an assumption here that the missing columns are at the end of the row, and the order of the fields isn't changed.

Creating a Flat File Template with an Existing File

Another method to create a flat file format is to import an existing file. This can be done by using the second section (Data Files) of the left-hand column of the File Format Editor screen. Using the yellow folder icons, select the directory and file to import. You'll receive a pop-up asking to overwrite the current schema, as shown in Figure 5.118.

Figure 5.118 Import Flat File Format

The File Format Editor creates the template based on the data in the specified file and displays the results in the right-hand section as shown in Figure 5.119. Here are a couple of things to remember when using this method:

왘 The default type of delimiter is Comma. If your results show only one column of combined data, adjust the Column setting in the Delimiter section as necessary.

왘 If the first rows of the files are headers, then you can select Skip row header to change the column names to the headers. If you want the headers written when using the template as a target, then select Yes on the Write row header option.

왘 If you'll be using the template for multiple files, review the data types and field sizes in the right-hand section. The values are determined from the file you imported; if the field sizes are variable, then ensure the values are set to their maximum. Data types can also be incorrect based on a small set of values; for example, the part IDs in the file may be all numbers, but in another file, they may include alphanumeric values.

Figure 5.119 Imported File Format

Variable File Names

When you move the file format into a data flow as a source, you can edit that object to change a subset of configurations compared to its template. After the name is saved, it can't be changed at the template or data flow level. The data types can be changed only at the template level. In the File Format Editor in the data flow (Figure 5.120), you can change the file and directory locations, and you
can change the Adaptable Schema setting. You can modify performance settings that weren't present on the template such as Join Rank, Cache, and Parallel process threads. At the bottom of the left-hand section is the Source Information. The Include file name column setting will add a column to the file's data source that contains the file name. This is a useful setting when processing multiple files.

Figure 5.120 File Format Editor in the Data Flow

The Root Directory setting can be modified from what is in the template and be specific to that one flat file source or target. You can hard-code the path into this setting, but this path will likely change as you move the project through environments. A better practice is to use substitution parameters for the path. Using the substitution parameter configurations for each environment allows the same parameter to represent different values in each environment (or configuration).

The File Name(s) setting is similar to the Root Directory setting. The value comes from the template, but it can be changed and is independent of the template. You can hard-code the name, but that assumes the same file will be used each time. It's not uncommon to have timestamps, department codes, or other identifiers/differentiators appended to file names. In the situation where the name is changed and set during the execution of the job, you'll use a parameter or global variables.
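For example, a script at the start of the job could build the file name into a global variable, and the File Name(s) setting would then reference that variable. A minimal sketch in the Data Services scripting language; the variable name and file-name pattern are illustrative, not from the book's example:

   # Build today's part file name at run time
   $GV_PART_FILE = 'part_update_' || to_char(sysdate(), 'YYYYMMDD') || '.txt';

The File Name(s) field of the flat file source would then be set to $GV_PART_FILE.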

If you intend to process multiple files in a given directory, you can use a wildcard (*). For example, if you want to read all text files in a directory, then the File Name(s) setting would be *.txt. You can also use the wildcard for just a portion of the file name. This is useful if the directory contains different types of data. For example, a directory contains part data and price data, and the files are appended with timestamps. To process only part data from the year 2014, the setting is part_update_2014*.txt.

When using the wildcard, Data Services expects a file to exist. An error occurs when the specified file can't be found. This error is avoided by checking existence prior to reading the flat file. One method of validating file existence is to use the file_exists(filename) function, which, given the path to the file, will return 0 if no files are located. The one shortcoming of this method is that the file_exists function doesn't process file paths that contain wildcards, which in many cases is the scenario presented. There is another function, wait_for_file(filename, timeout, interval), which does allow for wildcards in the file name and can be set to accomplish the same results. Figure 5.121 shows the setup to accomplish this. First a script calls the wait_for_file function. Then a conditional object is used to either process the flat file or catch when no file exists.

Figure 5.121 Validating That Files Are Present to Read
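A minimal sketch of that script and conditional in the Data Services scripting language follows. The path, wildcard pattern, and substitution parameter are illustrative; the timeout and polling interval are assumed to be in milliseconds, and the return-code test should be confirmed against the wait_for_file documentation for your version:

   # Script: poll for up to 5 minutes, checking every 10 seconds
   $GV_FILE_FOUND = wait_for_file('[$$SourceFilePath]/part_update_2014*.txt', 300000, 10000);

   # Conditional's If expression
   #   $GV_FILE_FOUND = 1

When the condition is true, the Then workspace holds the data flow that reads the flat file; the Else workspace can log that no file arrived or raise an exception.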
5.7.2 Creating a Flat File Template from a Query Transform

The methods to create flat file templates discussed so far can be used for either source or target data sources. Target data sources have one other method to quickly create a flat file template. Within the Query Editor, you can right-click the output schema and select Create File Format as shown in Figure 5.122. This will use the output schema as the data type definition and create a template.

Figure 5.122 Query Transform to Create a Flat File Template

5.7.3 Excel File Format

Excel files are widely used in every organization. From marketing to finance departments, a large number of data sets throughout a company are stored in Excel format. As a result, learning to work efficiently with these in Data Services is critical to the success of many data-provisioning projects.

Data Services supports multiple Excel versions and Excel file extensions. The following versions are compatible as of Data Services version 4.2: Microsoft Excel 2003, 2007, and 2010.

Preliminary System Requirements

This section describes the initial steps to complete to use an Excel component in Data Services.

UNIX Server: Adding an Excel Adapter

To read the data from an Excel file, you must first define an Excel adapter on the job server in the Data Services Management Console.

Warning

Only one Excel adapter can be configured for each job server.

Follow these steps:

1. Go to the Data Services Management Console.

2. On the left tab under Adapter Instances, click the job server adapter for which you want to add the Excel adapter (see Figure 5.123).

3. On the Adapter Configuration tab of the Adapter Instances page, click Add.

Figure 5.123 Adding a New Excel Adapter

4. On the Adapter Configuration page, enter "BOExcelAdapter" in the Adapter Instance Name field (see Figure 5.124). The entry is case sensitive.

5. All the other entries may be left at their default values. The only exception to that rule is when processing files that are larger than 1 MB, in which case you would need to modify the Additional Java Launcher Options value to -Xms64m -Xmx512m or -Xms128m -Xmx1024m (the default is -Xms64m -Xmx256m).
Figure 5.124 Configuring the Excel Adapter

6. Start the adapter from the Adapter Instance Status tab (see Figure 5.125).

Figure 5.125 Starting the Adapter

Windows Server: Microsoft Office Compatibility

The 64-bit Data Services Designer and job server are incompatible with Microsoft Office products prior to Office 2010. During installation, if Microsoft Office isn't installed on the server, the installer will attempt to install the Microsoft Access Database Engine 2010. A warning message will appear if an earlier version of Microsoft Office is installed. To remedy this and be able to use Excel workbook sources, upgrade to Office 2010 64-bit or greater, or uninstall the Microsoft Office products and install the Microsoft Access Database Engine 2010 redistributable that comes with Data Services (located in <LINK_DIR>/ext/microsoft/AccessDatabaseEngine_X64.exe).

Warning

Data Services references Excel workbooks as sources only. It's not possible to define an Excel workbook as a target. For situations where we've needed to output to Excel workbooks, we've used a User-Defined transform; an example is given in the "User-Defined Transform" section found under Section 5.5.3.

At this point, you can't import or read a password-protected Excel workbook.

Importing an Excel File

To work with Excel workbooks, file layouts need to be defined. To define the layout by importing an existing file, perform the following steps:

1. Go to the Local Object Library area of Data Services Designer.

2. Find the Excel type.

3. Right-click Excel, and select New (see Figure 5.126).

Figure 5.126 Creating a New Excel Format

An Import Excel Workbook dialog window appears as shown in Figure 5.127.
5 Objects File Formats 5.7

Figure 5.128 Selecting an Excel File

If the first row of your spreadsheet contains the column names, don’t forget to
click the corresponding option as shown in Figure 5.129.

Figure 5.129 First Row as Column Names Option

Figure 5.127 New File Creation Details Tab


After the properties have been specified, click the Import schema button. This
will create a file definition based on your range selection. If the column names are
Format Name
present, they’ll be used to define each column, as shown in Figure 5.130. A blank
Give the component a name in the Format name field, and specify the file loca- will appear if the content type can’t be automatically determined for a column.
tion (or alternatively, a parameter if one has been defined) and file name as When it’s not possible for Data Services to determine the data type, the column
shown in Figure 5.128. will be assigned a default value of varchar(255).

Warning After confirming and updating column attributes if required, click the OK button,
and the Excel format object will be saved.
To import an Excel workbook into Data Services, it must first be available on the Win-
dows file system. You can later change its actual location (e.g., to the UNIX server). The Alternatively, an Excel format can be manually defined by entering the column
same goes if you want to reimport the workbook definition or view its content. names, data types, content types, and description. These can be entered in the
Schema pane at the top of the New File Creation tab. After the file format has
been defined, it can be used in any workflow as a source or a target.


Figure 5.130 Importing the Column Definition

Be careful with worksheet names. Blank names or names that contain special characters may not be processed.

Data Access Tab

The Data Access tab, as shown in Figure 5.131, allows you to define how Data Services will access the data if the file isn't accessible from the job server. Through the tab, you can configure file access through FTP or a custom script, as shown in the following subsections.

Figure 5.131 Data Access Tab

FTP

Provide the host name, fully qualified domain name, or IP address of the computer where the data is stored. Also, provide the user name and password for that server. Lastly, provide the file name and location of the Excel file you want to access.

Custom

Provide the name of the executable script that will access the data. Also provide the user name, password, and any arguments to the program.

Note

If both options (FTP/Custom) are left unchecked, Data Services assumes that the file is located on the same machine where the job server is installed.
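Whichever access method you choose, a small script object placed before the data flow can fail the job early with a clear message when the workbook hasn't arrived yet. This is only a sketch: the global variable $G_ExcelFile is hypothetical, and file_exists() and raise_exception() are standard scripting functions whose exact return values you should verify in your release.

  # Script object placed before the data flow that reads the Excel format
  $G_ExcelFile = 'D:/data/inbound/orders.xls';

  if (file_exists($G_ExcelFile) = 0)
  begin
     raise_exception('Excel workbook not found on the job server: ' || $G_ExcelFile);
  end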


Usage Notes

▶ Data Services can deal with Excel formulas. However, if an invalid formula results in an error such as #DIV/0!, #VALUE!, or #REF!, the software will process the cell as a NULL value (see the sketch after this list).

▶ Be careful when opening the file in Excel while it's being used by Data Services. Because the application reads stored formula values, there is a risk of reading incorrect values if Excel isn't closed properly. Ideally, the file shouldn't be opened by any other software while it's used in Data Services.

▶ Also see Excel usage in Chapter 4, Section 4.2.
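If NULLs produced by broken formulas would distort downstream calculations, you can substitute a default or flag the row in the first Query transform after the Excel source. A minimal sketch with a hypothetical column name, using the standard nvl() and ifthenelse() functions:

  # Mapping expression: replace a NULL caused by a formula error with 0
  nvl(EXCEL_ORDERS.DISCOUNT_PCT, 0)

  # Alternative: flag the row so a Case transform can route it to an error table
  ifthenelse(EXCEL_ORDERS.DISCOUNT_PCT is null, 'FORMULA_ERROR', 'OK')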

5.7.4 Hadoop
With the rise of big data, Apache Hadoop has become a widely used framework for storing and processing very large volumes of structured and unstructured data on clusters of commodity hardware. While Hadoop has many components that can be used in various configurations, Data Services has the capability to connect to and integrate with the Hadoop framework specifically in the following ways:

▶ Hadoop Distributed File System (HDFS)
A distributed file system that provides high aggregate throughput access to data.

▶ MapReduce
A programming model for parallel processing of large data sets.

▶ Hive
An SQL-like interface used for read-only queries written in HiveQL.

▶ Pig
A scripting language, Pig Latin, used to simplify the generation of MapReduce programs.

Setup and Configuration

There are two main ways to configure Data Services to integrate with Hadoop. One method is to run your job server on a node within your Hadoop cluster, and the other is to run the job server and Hadoop on the same machine, outside your Hadoop cluster.

Prerequisites

Before attempting to connect Data Services to your Hadoop instance, make sure to validate the following prerequisites and requirements:

▶ The Data Services job server must be installed on Linux.

▶ Hadoop must be installed on the same server as the Data Services job server. The job server may or may not be part of the Hadoop cluster.

▶ The job server must start from an environment that has sourced the Hadoop environment script.

▶ Text processing components must be installed on the Hadoop Distributed File System (HDFS).

Loading and Reading Data with HDFS

To connect to HDFS from Data Services, you must first configure an HDFS file format in Data Services Designer. After this file format is created, it can be used to load and read data from an HDFS file in a data flow.

In this example, we'll connect to HDFS and read files from it through a Merge transform.

1. Open Data Services Designer, and create a new data flow. Select the Formats tab at the bottom (see Figure 5.132).

Figure 5.132 Create a New HDFS File Format


2. Create a new HDFS file format for the source file (see Figure 5.133).

Figure 5.133 HDFS File Format Editor

3. Set the appropriate Data File(s) properties for your Hadoop instance, such as
the hostname for the Hadoop cluster and path to the file you want to read in.
The HDFS file can now be connected to an object (in this case, a Merge trans-
form) and loaded into a number of different types of targets, such as a database
table, flat file, or even another file in HDFS.

In Figure 5.134, we connect HDFS to a Merge transform and then load the data
into a database table. The HDFS file can be read and sourced much like any other
source object.

Figure 5.134 HDFS File Format as a Source
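Once the data flow has loaded the HDFS data into the database table, a short script object placed after the data flow can confirm that something actually arrived before any dependent steps run. This is only a sketch: the datastore DS_TARGET, the table SALES_FROM_HDFS, and the global variable $G_RowCount (declared as an int on the job) are hypothetical, and sql() simply issues the statement against the target database.

  # Script object placed after the data flow in the batch job
  $G_RowCount = sql('DS_TARGET', 'select count(*) from SALES_FROM_HDFS');

  if ($G_RowCount = 0)
  begin
     raise_exception('No rows were loaded from the HDFS source file.');
  end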

5.8 Summary

This chapter has traversed and described the majority of the Data Services object hierarchy. With these objects at your disposal within Data Services, virtually any data-provisioning requirement can be solved. In the next chapter, we explore how to use value placeholders to enable values to be set, shared, and passed between objects.

Contents

Acknowledgments ............................................................................................ 15
Introduction ..................................................................................................... 17

PART I Installation and Configuration

1 System Considerations .............................................................. 23

1.1 Building to Fit Your Organization .................................................. 23


1.1.1 Data Services Architecture Scenarios ................................ 24
1.1.2 Determining Which Environments Your Organization
Requires ........................................................................... 29
1.1.3 Multi-Developer Environment .......................................... 31
1.1.4 IT and Company Policies .................................................. 33
1.2 Operating System Considerations .................................................. 36
1.2.1 Source Operating System Considerations .......................... 37
1.2.2 Monitoring System Resources .......................................... 37
1.2.3 CPU ................................................................................. 39
1.2.4 Memory Considerations ................................................... 40
1.2.5 Target Operating System Considerations .......................... 41
1.3 File System Settings ....................................................................... 41
1.3.1 Locales ............................................................................. 41
1.3.2 Commands ....................................................................... 42
1.4 Network ........................................................................................ 45
1.5 Sizing Appropriately ...................................................................... 45
1.5.1 Services That Make Up Data Services ............................... 46
1.5.2 Estimating Usage and Growth .......................................... 52
1.6 Summary ....................................................................................... 55

2 Installation ................................................................................ 57

2.1 SAP BusinessObjects Business Intelligence Platform and


Information Platform Services ........................................................ 57
2.1.1 Security and Administration Foundation ........................... 57
2.1.2 Reporting Environment .................................................... 62
2.2 Repositories .................................................................................. 64
2.2.1 Planning for Repositories .................................................. 65


2.2.2 Preparing for Repository Creation ............ 67
2.2.3 Creating Repositories ............ 67
2.3 Postal Directories ............ 74
2.3.1 USA Postal Directories ............ 74
2.3.2 Global Postal Directories ............ 81
2.4 Installing SAP Server Functions ............ 82
2.5 Configuration for Excel Sources in Linux ............ 84
2.5.1 Enabling Adapter Management in a Linux Job Server ............ 84
2.5.2 Configuring an Adapter for Excel on a Linux Job Server ............ 86
2.6 SAP Information Steward ............ 89
2.7 Summary ............ 90

3 Configuration and Administration ............ 91

3.1 Server Tools ............ 91
3.1.1 Central Management Console ............ 91
3.1.2 Data Services Management Console ............ 94
3.1.3 Data Services Server Manager for Linux/UNIX ............ 113
3.1.4 Data Services Server Manager for Windows ............ 125
3.1.5 Data Services Repository Manager ............ 132
3.2 Set Up Landscape Components ............ 132
3.2.1 Datastores ............ 133
3.2.2 Table Owner Aliases ............ 136
3.2.3 Substitute Parameters ............ 138
3.2.4 System Configuration ............ 139
3.3 Security and Securing Sensitive Data ............ 140
3.3.1 Encryption ............ 141
3.3.2 Secure Socket Layer (SSL) ............ 142
3.3.3 Enable SSL Communication via CMS ............ 143
3.3.4 Data Services SSL Configuration ............ 147
3.4 Path to Production ............ 150
3.4.1 Repository-Based Promotion ............ 150
3.4.2 File-Based Promotion ............ 152
3.4.3 Central Repository-Based Promotion ............ 153
3.4.4 Object Promotion ............ 155
3.4.5 Object Promotion with CTS+ ............ 157
3.5 Operation Readiness ............ 157
3.5.1 Scheduling ............ 157
3.5.2 Support ............ 164
3.6 How to Troubleshoot Execution Exceptions ............ 166
3.6.1 Viewing Job Execution Logs ............ 166
3.6.2 Common Causes of Job Execution Failures ............ 168
3.7 Summary ............ 172

PART II Jobs in SAP Data Services

4 Application Navigation ............ 175

4.1 Introduction to Data Services Object Types ............ 175
4.2 Hypothetical Work Request ............ 177
4.3 Data Services Designer ............ 180
4.3.1 Datastore ............ 181
4.3.2 File Format ............ 183
4.3.3 Data Flow ............ 184
4.4 Data Services Workbench ............ 190
4.4.1 Project ............ 192
4.4.2 Datastore ............ 193
4.4.3 File Format ............ 196
4.4.4 Data Flow ............ 198
4.4.5 Validation ............ 202
4.4.6 Data Flow Execution ............ 202
4.5 Summary ............ 205

5 Objects ............ 207

5.1 Jobs ............ 208
5.1.1 Batch Job Object ............ 208
5.1.2 Real-Time Job Object ............ 211
5.2 Workflow ............ 217
5.2.1 Areas of a Workflow ............ 219
5.2.2 Continuous Workflow ............ 220
5.3 Logical Flow Objects ............ 222
5.3.1 Conditional ............ 222
5.3.2 While Loop ............ 223
5.3.3 Try and Catch Blocks ............ 223
5.4 Data Flows ............ 223
5.4.1 Creating a Standard Data Flow with an Embedded ABAP Data Flow ............ 224
5.4.2 Creating an Embedded ABAP Data Flow ............ 225
5.5 Transforms ............ 231
5.5.1 Platform Transforms ............ 231


5.5.2 Data Integrator Transforms ............ 249
5.5.3 Data Quality Transforms ............ 271
5.6 Datastores ............ 299
5.6.1 SAP BW Source Datastores ............ 299
5.6.2 SAP BW Target Datastore ............ 303
5.6.3 RESTful Web Services ............ 305
5.6.4 Using RESTful Applications in Data Services ............ 307
5.7 File Formats ............ 309
5.7.1 Flat File Format ............ 310
5.7.2 Creating a Flat File Template from a Query Transform ............ 316
5.7.3 Excel File Format ............ 316
5.7.4 Hadoop ............ 324
5.8 Summary ............ 327

6 Variables, Parameters, and Substitution Parameters ............ 329

6.1 Substitution Parameters ............ 331
6.2 Global Variables ............ 335
6.3 Variables ............ 336
6.4 Local Variables ............ 337
6.5 Parameters ............ 339
6.6 Example: Variables and Parameters in Action ............ 340
6.7 Summary ............ 342

7 Programming with SAP Data Services Scripting Language and Python ............ 343

7.1 Why Code? ............ 343
7.2 Functions ............ 345
7.3 Script Object ............ 349
7.4 Coding within Transform Objects ............ 352
7.4.1 User-Defined Transforms ............ 352
7.4.2 SQL Transform ............ 356
7.5 Summary ............ 357

8 Change Data Capture ............ 359

8.1 Comparing CDC Types ............ 360
8.2 CDC Design Considerations ............ 362
8.3 Source-Based CDC Solutions ............ 363
8.3.1 Using CDC Datastores in Data Services ............ 363
8.3.2 Oracle ............ 369
8.3.3 SQL Server ............ 369
8.4 Target-Based CDC Solution ............ 372
8.5 Timestamp CDC Process ............ 376
8.5.1 Limitations ............ 376
8.5.2 Salesforce ............ 377
8.5.3 Example ............ 377
8.6 Summary ............ 379

PART III Applied Integrations and Design Considerations

9 Social Media Analytics ............ 383

9.1 The Use Case for Social Media Analytics ............ 384
9.1.1 It’s Not Just Social ............ 385
9.1.2 The Voice of Your Customer ............ 386
9.2 The Process of Structuring Unstructured Data ............ 387
9.2.1 Text Data Processing Overview ............ 387
9.2.2 Entity and Fact Extraction ............ 388
9.2.3 Grammatical Parsing and Disambiguation ............ 393
9.3 The Entity Extraction Transform ............ 394
9.3.1 Language Support ............ 395
9.3.2 Entity Extraction Transform: Input, Output, and Options ............ 396
9.4 Approach to Creating a Social Media Analytics Application ............ 404
9.4.1 A Note on Data Sources ............ 405
9.4.2 The Data Services Social Media Analysis Project ............ 407
9.5 Summary ............ 411

10 Design Considerations ............ 413

10.1 Performance ............ 414
10.1.1 Constraining Results ............ 414
10.1.2 Pushdown ............ 414
10.1.3 Enhancing Performance When Joins Occur on the Job Server ............ 422
10.1.4 Caching ............ 423
10.1.5 Degree of Parallelism (DoP) ............ 424
10.1.6 Bulk Loading ............ 425


10.2 Simplicity ............ 426
10.2.1 Rerunnable ............ 426
10.2.2 Framework ............ 426
10.3 Summary ............ 427

11 Integration into Data Warehouses ............ 429

11.1 Kimball Methodology ............ 430
11.1.1 Dimensional Data Model Overview ............ 430
11.1.2 Conformed Dimensions and the Bus Matrix ............ 431
11.1.3 Dimensional Model Design Patterns ............ 433
11.1.4 Example Orders Star Schema ............ 437
11.1.5 Processing Slowly Changing Dimensions ............ 438
11.1.6 Loading Fact Tables ............ 446
11.2 Hadoop and SAP HANA ............ 451
11.3 Summary ............ 454

12 Industry-Specific Integrations ............ 455

12.1 Retail: Facilitating Customer Loyalty ............ 455
12.1.1 The Solution ............ 455
12.1.2 Results ............ 464
12.2 Distribution: SAP BW/SAP APO Integration with SAP ECC ............ 465
12.2.1 The Solution ............ 466
12.2.2 Challenges ............ 472
12.2.3 Results ............ 473
12.3 Summary ............ 473

PART IV Special Topics

13 SAP Information Steward ............ 477

13.1 Match Review ............ 477
13.1.1 Use Cases for Match Review ............ 479
13.1.2 Terminology ............ 479
13.1.3 Scenario ............ 481
13.2 Cleansing Package Builder ............ 490
13.3 Metadata Management ............ 496
13.4 Summary ............ 498

14 Where Is SAP Data Services Headed? ............ 499

14.1 The Key Themes for SAP Data Services ............ 500
14.2 The State of the State ............ 501
14.2.1 Simple ............ 502
14.2.2 Big Data ............ 502
14.2.3 Enterprise Readiness ............ 502
14.3 Beyond SP03 ............ 504
14.4 Help Shape the Future of Data Services ............ 505
14.5 Summary ............ 508

The Authors ............ 509
Index ............ 513

Index

A Application (Cont.)
settings, 92
ABAP data flow, 223 Architecture
create, 225 performance considerations, 24
extraction, 229 scenario, 24
ABAP program, dynamically load/execute, 82 system availability/uptime considerations, 24
Access server web application performance, 27
configure, 123, 127 Associate transform, 292
parameters, 129 Asynchronous changes, 361
SSL, 148 ATL, export files from Data Services Designer,
Accumulating snapshot, 437, 450 32
Acta Transformation Language, 493 Auditing, prevent pushdown, 421
Adapter Management, 84 Authentication, 58
Adapter SDK, 504 Auto Documentation
Adaptive Job Server, 93 module, 109
Adaptive Process Server, 93 run, 109
Address
census data, 81
cleansing and standardization, 282 B
data, clean and standardize, 275
global validation, 81 BAPI function call, 466
ISO country codes, 295 BAPI function call, read data into
latitude/longitude, 294 Data Services, 470
list of potential matches, 297 Batch job, 208
street-level validation, 82 auditing, 210
Address Cleanse transform, 74 create, 189
Address SHS Directory, 75 execution properties, 209
Administrator module, 95 logging, 209
Adapter Instances submodule, 99 monitoring sample rate, 209
Central Repository, 100 performance, 210
Management configuration, 104 statistics, 210
Object Promotion submodule, 101 trace message, 209
real-time job, 97 Batch Job Configuration tab, 97
Real-Time submodule, 97 Big data, 500
SAP Connections submodule, 98 unstructured repositories, 451
Server Group, 100 BIP, 24, 57, 62, 63
Web Services submodule, 98 CMC, 58
Aggregate fact, 437, 450 licensing, 63
Alias, create, 136 patch, 63
All-world directories, 81 user mapping, 59
Apache Hadoop, 324 Blueprints, 404
Application BOE Scheduler, 160
authorization, 92 BOE 씮 BIP
framework, 46 Brand loyalty, 384


Break key, 288 CMC (Cont.) D Data flow (Cont.)


Bulk loading, 425 authentication, 58 SQL Query transform, 244
Bus matrix, 430, 432, 433 repository, 61 Data target XML message, 460
Business processes and dimensions, 432 server services, 92 cleansing, 490 target-based CDC, 372
Business transaction, historical context, 360 users and groups, 60 compare with CDC, 373 XSD file, 457
uses, 91 consolidation project, 189 Data Generation transform, time dimension
CMS, 58, 90, 94 de-duplicate, 477 output, 251
C Enterprise authentication, 59 dictionary, 271 Data Integrator transform, 249
IPS, 26 extract and load, 231 Data Masking transform, 502
CA Wily Introscope, 131 login request, 58 gather/evaluate from social media site, 384 Data migration, 479
Caching, 423 logon parameters, 107 latency requirements, 362 Data provisioning
in-memory, 374 security plug-in, 58 mine versus query, 387 performance, 414
Case transform, 235, 468 SSO, 60 move from column to row, 267 principles, 413
configure, 236 sync backup with filestore, 26 move into data warehouse, 429 Data Quality Reports module, 113
CDC, 259, 359 Code nested, 263 Data Quality transform, reference data files,
datastore configuration, 363 move to production, 157 parse/clean/standardize, 271 171
datastore output fields, 365 promote between environments, 150 staging, 417 Data Services, 150
design considerations, 362 sharing between repositories, 32 unstructured to structured, 387 code promotion, 150
enable for database table, 369 Column derivations, map, 200 Data Cleanse transform, 271 connect to SAP BW for authorization, 59
Map Operation transform, 366 Command-line execution cleansing package, 273 ETL integration, 430
Oracle database, 369 UNIX, 72 configure, 271 install on multiple servers, 27
Salesforce functionality, 377 Windows, 70 date formatting options, 275 integrate with Salesforce, 377
source-based, 361, 363 Competitive intelligence, 384 firm standardization options, 274 integration, 452
subscription, 363 Conditional, 222 input fields, 272 key themes, 500
synchronous and asynchronous, 369 Conditional, expression, 235 options, 273 repository-based promotion, 150
target-based, 361, 372 Conformed dimension, 431 output fields, 272 script, 347
timestamp, 376 Connection parameter, 178 parsing rule, 272 scripting language, 345
types, 360 Consolidated fact, 437 person standardization options, 273 security and administration, 57
CDC-enabled table, 364 Country ID transform, 295 Data Cleansing Advisor, 478 services, 46
Central Management Console 씮 CMC Country-specific address directories, 82 Data flow, 176 Data Services Designer, 175, 180, 201
Central Management Server 씮 CMS CPU utilization, 39 add objects, 184 data provisioning, 180
Central repository, 65 Custom function branching paths, 235 execute job, 189
code considerations, 32 call, 349 bypass, 502 execute real-time job, 213
reports, 101 Data Services script, 347 calculate delivery time, 466 export files for code sharing, 32
Central repository-based promotion, 153 parameters, 347 CDC tables, 366 mapping derivation, 187
Certificate Logs, 109 Custom Functions tab, 345 configured CDC, 367 Microsoft Office compatibility, 318
Change data capture 씮 CDC Customer create, 184, 198 pros, 179
Change Tracking, 369 data, merge, 177 create with embedded ABAP data flow, 224 set up SAP BW Source datastore, 302
Changed Data Capture, 369 loyalty, 455 definition, 223 start, 180
Changed records, 361 problem, 390 delete records, 246 transform configuration, 394
City Directory, 76 request, 390 execution, 202 versus Data Services Workbench, 178
Cleansing package, 271 sentiment, 384, 390 fact table loading scenario, 448 Data Services Management Console, 91, 94
Cleansing Package Builder, 490 Customer Engagement Initiative (CEI), 507 flat file error, 169 data validation, 242
Client tier, 46 graphical portrayal, 109 validation rule reporting, 242
Cloud-based application, 377 include in batch job, 189 Data Services Repository Manager, 67, 132
CMC, 58, 91 periodic snapshot/aggregate fact table, 450 Data Services Scheduler, 158
application authorization, 92 SCD implementation, 439 Data Services scripting, built-in functions, 348

514 515
Index Index

Data Services Server Manager Degree of Parallelism, 424 E F


Linux/UNIX, 113 Delivery point, 79
Windows, 125 Delivery Point Validation (DPV) directory Early Warning System Directory, 80 Fact extraction, 390
Data Services Workbench, 175, 190 files, 76 Effective Date transform, 257 Fact table
data provisioning, 190 Delivery Sequence Files 2nd Generation configuration, 258 design pattern, 436
data-provisioning process, 196 (DSF2) files, 79 EIM Adaptive Processing Server, 94 load, 446
developer use cases, 191 Delivery Statistics Directory, 79 ELT sequence, 177 Fact-less fact, 437
pros, 179 Delivery time, 465 Encryption, 141 Fault tolerance, 28
validation, 202 Delivery time, calculations, 468 Secure Socket Layer, 142 File format, 196, 309
versus Data Services Designer, 178 Delta data set, 359 Enhanced Change and Transport System 씮 Excel, 183, 316
Data set Developer repository, 61 CTS+ flat, 310
compare and flag differences, 254 Development, 29 Enhanced Line of Travel (eLOT) Directory, 79 importing an Excel file, 319
duplicate records, 288 DI_OPERATION_TYPE, 365 Enterprise scheduler, 161 name, 313
join multiple, 232 DI_SEQUENCE_NUMBER, 365 Entity staging directory, 334
uniqueness, 235 Dictionary file for entity extraction, 390 predefined, 400 wildcard, 315
Data source, collect for analytics, 405 Dimension preserve history, 443 File Format Editor, substitution
Data steward, 480 degenerate, 435 Entity extraction parameter, 334
Data Store Explorer, 308 function call, 448 Entity Extraction transform, 390 File specification, create, 196
Data Transfer transform, 251 junk, 436 example, 391 File-based promotion, 152
automatic transfer type, 253 late-arriving record, 446 with TDP, 388 Filestore, 24
data staging, 418 many-valued, 436 Entity Extraction transform, 394 Flat file, 311
use other table/object, 252 mini, 436 binary document extraction, 401 Flat file template, 310
Data transformations, Python, 292 multiple tables, 436 blueprints, 404 create from Query transform, 316
Data transport object, 229 preserve change history, 434, 441 Dictionaries, 402 create with existing file, 312
Data Validation dashboard, 112 role-playing, 435 inputs, 396 Flattened hierarchy, 261
Data warehouse, 429 segmentation, 436 language support, 394 Format, 176
Match Review, 479 surrogate key storage, 436 Languages option, 399 Function, 176
SCD, 438 table design pattern, 435 multiple languages, 395 local variable, 337
Database Federation, 452 Dimension data, late arriving, 446 output mapping, 396
Database object, import metadata, 182 Dimensional data model, 431 Processing Options group, 400
Database triggers, 361 Dimensional model Voice of the Customer rule, 409 G
Data-intensive operations, 251 create, 433 Entity-attribute-value (EAV) model, 267
Data-provisioning process, 23, 180 design patterns, 433 Environment Generate XML Schema option, 457
Datastore, 176, 299 surrogate key, 435 common, 29 GeoCensus files, 81
alias, 136 Dimensions, 431 isolate, 30 Geocoder transform, 294
configuration, 136 conformed, 431 multi developer, 31 output, 294
connectivity to customer data, 181 represent hierarchies, 434 planning, 23 Geo-spatial data integration, 504
create, 133, 181, 193 Downstream subscriber system, 362 Error log, 97 Global Address Cleanse transform, 81,
editor, 196 Driving table, 422 ETL 275, 282
pushdown, 415 DSF2® Walk Sequencer transform, 299 sequence, 177 address class, 287
report, 112 DSN-less connection, 69 transform, 249 country assignment, 283
SAP BW source, 299 Duplicate records, 288 Excel Country ID Options group, 285
system configuration, 139 Duplicated key, 373 configure adapter communication, 86 Directory Path option, 284
Date dimension, 435 enable data flow from, 84 Engines option group, 283
Date Generation transform, 249 import workbook, 320 Field Class, 286
Decoupling, 339 read data from file, 317 map input fields, 282
Define Best Record Strategy options, 484 Execution properties, 189 standardize input address data, 285

516 517
Index Index

Global address, single transform, 282 Installation Job service Mapping, 200
Global digital landscape, 499 commands, 42 start, 115 input fields to content types, 272
Global Postal Directories, 81 deployment decisions, 62 stop, 115 multiple candidates, 200
Global Suggestion List transform, 297 Linux, 84 Join pair, 201 parent and child objects, 264
Global variable, 335 settings, 41 Join rank, 238, 422 Master data management, 479
data type, 336 Integrated development environment, 180 option, 232 Match results, create through
Integration, 30 Data Services, 481
IPS, 24, 57 Match Review, 189, 477
H cluster installation, 24 K approvers/reviewers, 486
CMC, 58 best record, 484
Hadoop, 451 licensing, 63 Key Generation transform, 249, 441 configuration, 482
Hadoop distributed file system (HDFS), 324 system availability/performance, 24 Kimball approach, 429 configure job status, 485
Haversine formula, 462 user mapping, 59 principles, 430 data migration, 481
HDFS Kimball methodology, 429 process, 478
connect to Merge transform, 326 Klout™ scores, 385 results table, 483
load/read data, 325 J Knowledge transfer, 165 task review, 487
Hierarchical data structure, 265 terminology, 479
read/write, 247 Job, 176, 208 use cases, 479
Hierarchy Flattening transform, 261 add global variable, 335 L Match transform, 288, 479
High-rise business building address, 78 blueprint, 404 break key, 288
History Preserving transform, 256 common execution failures, 168 LACSLink Directory files, 77 consolidate results from, 292
effective dates, 256 common processes, 426 Lastline, 276 Group Prioritization option, 288
History table, 444 data provisioning, 191 Latency, 178 output fields, 291
Horizontal flattening, 261 dependency, 160 Left outer join, 232 scoring method, 289
HotLog mode, 369 execution exception, 166 Lineage, 64, 496 Memory, 40
filter, 96 diagrams, 421 page caching, 41
graphic portrayal, 109 Linux Merge transform, 237
I processing tips, 426 configure Excel sources, 84 Metadata Management, 64, 496
read data from SAP BW, 300 enable Adapter Management, 84 Mirrored mount point, 25
I/O replication, 191 Local repository, 64 Monitor log, 97
performance, 37 rerunnable, 426 Local variable, 337 Monitor, system resources, 37
pushdown, 414 schedule, 157 Log file, retention history, 92 Multi-developer environment, 65
reduce, 414 specify specific execution, 158 Logical flow object, 222 Multi-developer landscape, 31
speed of memory versus disk, 423 standards, 330 lookup_ext(), 447, 448 Multiline field, 276
IDE, 175, 179 Job Error Log tab, 167
primary objects, 175 Job execution
Idea Place, 505 log, 166 M N
IDoc message transform, 471 statistics, 113
Impact and Lineage Analysis module, 112 Job server, 211 Mail carrier's route walk sequence, 79 Name Order option, 273
InfoPackage, 469 associate local repository, 119 Management or intelligence tier, 46 National Change of Address (NCOALink)
Information Platform Services 씮 IPS configure, 115 Map CDC Operation transform, 259, 366 Directory files, 80
Information Steward, 89, 477 create, 116 input data and results, 260 National Directory, 74
expose lineage, 496 default repository, 121 Map Operation transform, 246, 374 Native Component Supportability (NCS), 131
Match Review, 477 delete, 117 inverse ETL job, 246 Natural language processing, 394
InfoSource, create, 303 edit, 117 Map to output, 187 Nested data, 263
Inner join, 232, 422 join performance enhancement, 422 Map_CDC_Operation, 467 Nested Relational Data Model (NRDM), 235
Input and Output File Repository, 94 remove repository, 120 Map_Operation, 444, 446
SSL, 148 transform, 439


Network traffic, 45 Product Availability Matrix (PAM), Real-time processing loop, 213 S
New numeric sequence, 238 repository planning, 65 Replication server, 363
Production, 30 repoman, 72 Salesforce source tables, CDC func-
Profiler repository, 65 Reporting environment, 62 tionality, 377
O Project, 176, 180, 191 patch, 63 Sandbox, 29
create, 189 Repository, 64 SAP APO, 455, 465
Object, 207 data source, 192 create, 67 interface, 469
common, 426 lifecycle, 34 create outside of installation program, 67 load calculations, 467
connect in Data Services Designer, 186 new, 192 default, 121 read data from, 470
hierarchy, 207 Public sector extraction, 391 execution of jobs, 95 SAP BusinessObjects BI platform 씮 BIP
pairs, 223 Pushdown, 414 import ATL, 493 SAP BusinessObjects Business Intelligence
types, 175 Data Services version, 419 manage users and groups, 100 platform, 24
Object Promotion, 155 determine degree of, 415 number of, 65 SAP BusinessObjects reporting, 63
execute, 156 manual, 421 planning for, 65 SAP BW, 465, 469
with CTS+, 157 multiple datastores, 415 pre-creation tasks, 67 load data, 303
Operating system, 36 prevented by auditing, 421 resync, 119 set up InfoCubes/InfoSources, 303
I/O operations, 37 transform instructions, 419 sizing, 66 source datastore, 299, 302
optimize to read from disk, 37 Python, 343, 352 update password, 118 source system, 470
Operation code, UPDATE/DELETE, 372 calculate geographic positions, 463 update with UNIX command line, 72 target datastore, 303, 304
Operational Dashboard module, 113 Expression Editor, 354 update with Windows command line, 70 SAP Change and Transport System (CTS),
Optimized SQL, 415, 416 options, 354 Repository Manager, 61 install, 83
perform insert, 417 programming language, 292 Windows, 68 SAP Customer Connection, 507
Python script, analysis project, 406 Representational State Transfer (REST), 305 SAP Customer Engagement Intelligence
Residential and business addresses, 80 application, 409
P Residential Delivery Indicator SAP Data Services roadmap, 499
Q Directory files, 77 SAP Data Warehousing Workbench, 303
Pageable cache, 122, 130 REST, use applications in Data Services, 307 SAP ECC, 455, 465, 466
Parameters, 339 Quality assurance, 30 RESTful web service, 47, 305 data load, 467
Parse Discrete Input option, 274 Query Editor, mapping derivation, 187 Reverse Soundex Address Directory, 75 extraction, 467
Parsing rule, 271 Query join, compare data source/target, 444 Reverse-Pivot transform, 269 interfaces, 471
Performance metrics and scoping, 131 Query transform, 229, 231 RFC Server configuration, 300 SAP HANA, 451, 505
Performance Monitor, 97 constraint violation, 170 RFC Server Interface, add/status, 98 SAP HANA, integrate platforms, 452
Periodic snapshot, 437, 450 data field modification, 232 Round-robin split, 425 SAP Information Steward 씮
Permissions, users and groups, 60 define joins, 232 Row Generation transform, 238 Information Steward
Pipeline forecast, 465 define mapping, 186 Row, normalized to canonical, 267 SAP server functions, 82
Pivot transform, 267 filter records output, 233 Row-by-row comparison, 361 SCD
Placeholder naming, 331 mapping, 232 Rule, topic subentity, 391 change history, 435
Platform transform, 231 primary key fields, 235 Rule-based scoring, 289 dimension entities, 435
Postal directory, 74, 278 sorting criteria, 234 RuleViolation output pipe, 241 type 1, 438
global, 81 Runbook, 165 type 1 implement in data flow, 439
USA, 74 Runtime type 2, 441
Postal discount, 299 R configure, 130 type 2 implement in data flow, 442
Postcode directory, 76 edit resource, 122 type 3, 443
Postcode reverse directory, 75 Range, date sequence, 249 type 3 implement in data flow, 444
Predictive analytics, 385 Real-time job, 211, 456 type 4, 444
Privileges, troubleshoot, 168 execute as service, 214 type 4 implement in data flow, 445
execute via Data Services Designer, 213 typical scenario, process and load, 437


Script, 176 SQL, execution statements, 415 TDP (Cont.) USA Regulatory Address Cleanse
object, 349 SSL extraction, 388 transform, 275
Scripting language and Python, 343 communication via CMS, 143 grammatical parsing, 393 add-on mailing products, 279
Secure Socket Layer (SSL), 142 configuration for Data Services, 147 public sector extraction, 391 address assignment, 275
Semantic disambiguation, 393 configure, 130 SAP Customer Engagement Intelligence address class, 281
Semantics and linguistic context, 388 enable for Data Services Designer Clients, 146 application, 409 address standardized on output, 279
Sentiment extraction demonstration, 407 enable for the Server Intelligent Agent, 143 SAP HANA, 410 CASS certification rules, 280
Server enable for web servers, 145 SAP Lumira, 404 Dual Address option, 279
group, 100 key and certificate files, 143 semantic disambiguation, 393 field class, 281
list of, 93 path, 124 Temporary cache file, encrypt, 149 map your input fields, 276
Server-based tools, 91 Staging, 30 Text Data Processing Extraction Customization output fields, 281
Services configuration, 50 tier, 426 Guide, 391 Reference Files option group, 278
Shared directory, export, 103 Star schema, 431, 437 Text data processing 씮 TDP User
Shared object library, 65 dimensional (typical), 433 Text data, social media analysis, 384 acceptance, 30
SIA Start and end dates, 257 Threshold, Match Review, 482 authenticate, 58
clustering, 24 Statistic-generation transform object, 113 Tier, 46 manage for central repository, 100
clustering example, 28 Street name spelling variations, 75 Tier to server to services mapping, 49 permissions, 92
subnet, 27 Structured data, 387 Time dimension table, 250 User acceptance testing (UAT), 31
Simple, 500 Substitute parameter, 133, 138 Timestamp User-defined transform, 292, 352
Simple Object Access Protocol (SOAP), 306 Substitution parameter, 331 CDC, 361, 376 map input, 353
Sizing, 45 configuration, 105 Salesforce table, 377 USPS
Sizing, planning, 54 create, 332 TNS-less connection, 69 change of address, 80
Slowly changing dimensions (SCD), 434 design pattern, 332 Trace log, 97 discounts, 79
SMTP value override, 334 Trace message, 209 UTF-8 locale, 41
configuration, 124 Substitution variable configuration, 171 Transaction fact, 437
for Windows, 131 SuiteLink Directory file, 78 load, 448
Snowflaking, 436 Surrogate key, 435 Transform, 176, 231 V
SoapUI, 217 pipeline, 446 compare set of records, 374
Social media analytics, 383 svrcfg, 113 Data Integrator, 249 Validation rule, 112
create application, 404 Synchronous changes, 361 data quality, 271 definition, 241
Voice of the Customer domain, 390 System availability, 24 Entity Extraction, 394 fail, 241
Social media data, filter, 409 System resources, 37 object, 352 failed records, 243
Source data, surrogate key, 249 platform, 231 reporting, 242
Source object, join multiple, 237 Transparency, decreased, 421 Validation transform, 240
Source system T Try and catch block, 223 define rules, 241
audit column, 414 Twitter data, 408 Varchar, 202
tune statement, 421 Table Comparison transform, 254, 361, 372, Variable, 336
Source table 375, 439, 442 pass, 339
prepare/review match results, 478 comparison columns, 255 U vs. local variable, 339
run snapshot CDC, 376 primary key, 255 Vertical flattening, 261
Source-based CDC, 360 Table, pivot, 267 Unauthorized access, 30 VOC, requests, 390
datastore configuration, 363 Target table, load only updated rows, 367 UNIX, adding an Excel Adapter, 317 Voice of the Customer (VOC) domain, 390
SQL Query transform, 244 Target-based CDC, 360, 372 Unstructured data, 387
SQL Server database CDC, 369 data flow design consideration, 375 Unstructured text, TDP, 387
SQL Server Management Studio, 369 Targeted mailings, 79 Upgrades, 63 W
SQL transform, 356 TDP, 387 US zip codes, 78
SQL transform, custom code, 356 dictionary, 390 Warranty support, 165
Web presence, 384


Web server, SSL, 145 Workflow (Cont.)


Web service, 377 properties, 219
external testing, 217 versus job, 217
history log, 217
Web service datastore object, encryption, 308
Web Services Description Language (WSDL), X
view, 98
Web tier, 46 XML Map transform, 247, 265
clustering, 26 XML Pipeline transform, 263
Web-based application, CMC, 91 XML schema, 456, 457
Weighted average scoring, 290 file format object, 458
While loop, 223 use in data flow, 459
Wildcard, 315 XML_Map, 470, 471
Windows cluster settings, 131 batch mode, 247
Workflow, 176 XML_Pipeline transform, 460
conditional object, 222 XSD, 456
continuous, 220 create, 457
drill down, 111
execution types, 219
Z
Z4 Change Directory, 78

First-hand knowledge.

Bing Chen leads the Advanced Analytics practice at Method360 and has over 18 years of experience in IT consulting, from custom application development to data integration, data warehousing, and business analytics applications on multiple database and BI platforms.

James Hanck is co-founder of Method360 and currently leads the enterprise information management practice.

Patrick Hanck is a lead Enterprise Information Management consultant at Method360. He specializes in provisioning information, system implementation, and process engineering, and he has been recognized through business excellence and process improvement awards.

Scott Hertel is a senior Enterprise Information Management consultant with over 15 years of consulting experience helping companies integrate, manage, and improve the quality of their data.

Allen Lissarrague is a senior BI technologist with over 15 years of experience directly in SAP analytics and information products.

Paul Médaille has worked for SAP for over 15 years as a consultant, trainer, and product manager, and he is currently a director in the Solution Management group covering Enterprise Information Management topics, including Data Services.

We hope you have enjoyed this reading sample. You may recommend or pass it on to others, but only in its entirety, including all pages. This reading sample and all its parts are protected by copyright law. All usage and exploitation rights are reserved by the author and the publisher.