
PUBLIC

SAP Data Services


Document Version: 4.2 Support Package 9 (14.2.9.0) – 2018-02-26

Data Services Supplement for Big Data


Content

1 About this supplement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Big data in SAP Data Services. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5


2.1 Apache Cassandra. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5
Setting ODBC driver configuration on Unix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Data source properties for Cassandra. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Apache Hadoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7
Prerequisites. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Common commands for correct Linux setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8
Hadoop support for the Windows platform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
HDFS file format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Connecting to HDFS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Connecting to Hive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Hive adapter datastore configuration options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16
Using Hive metadata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 HP Vertica. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .24
Enable MIT Kerberos for HP Vertica SSL protocol. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .24
Creating a DSN for HP Vertica. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Creating HP Vertica datastore with SSL encryption. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Bulk loading for HP Vertica. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
HP Vertica data type conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
HP Vertica table source. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
HP Vertica target table options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34
2.4 MongoDB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Using MongoDB metadata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
MongoDB as a source. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
MongoDB as a target. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
MongoDB template documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Previewing MongoDB document data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Parallel Scan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Re-importing schemas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Searching for MongoDB documents in the repository. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5 SAP HANA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Configure DSN SSL for SAP HANA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .45
Creating an SAP HANA datastore with SSL encryption. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
SAP HANA datastore options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
SAP HANA target table options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Creating stored procedures in SAP HANA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Bulk loading in SAP HANA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Metadata mapping for SAP HANA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Using spatial data with SAP HANA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3 Data Services Connection Manager. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4 Cloud computing services. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60


4.1 Cloud databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Amazon Redshift database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Azure SQL database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Google BigQuery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Cloud storages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Amazon S3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .75
Azure blob storage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Google cloud storage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

1 About this supplement

This supplement contains information about the big data products that SAP Data Services supports.

The supplement contains information about the following:

● Supported big data products


● Supported cloud computing technologies including cloud databases and cloud storages.

Find basic information in the Reference Guide, Designer Guide, and some of the applicable supplement guides. For
example, to learn about datastores and creating datastores, see the Reference Guide. To learn about Google
BigQuery, refer to the Supplement for Google BigQuery.

2 Big data in SAP Data Services

SAP Data Services supports many types of big data through various object types and file formats.

2.1 Apache Cassandra

Apache Cassandra is an open-source data storage system that you can access with SAP Data Services as a source
or target in a dataflow.

Data Services natively supports Cassandra as an ODBC data source with a DSN connection. Cassandra uses the
generic ODBC driver. Use Cassandra on Windows or Linux operating systems.

You can do the following with Cassandra:

● Use as sources, targets, or template tables


● Preview data
● Query using distinct, where, group by, and order by
● Use with functions such as math, string, date, aggregate, and ifthenelse

To use Cassandra in Data Services:

● Add the appropriate environment variables to the al_env.sh file.


● For Data Services on Unix platforms, set the ODBC configurations using the Connection Manager.

Note
For Data Services on Windows platforms, driver support is through the generic ODBC driver.

2.1.1 Setting ODBC driver configuration on Unix

Use the Connection Manager to create, edit, or delete ODBC data sources and ODBC drivers for natively
supported ODBC databases when Data Services is installed on a Unix platform.

1. In a command prompt, set $ODBCINI to a file in which the Connection Manager defines the DSN. The file must
be readable and writable.

Sample Code

export ODBCINI=<dir-path>/odbc.ini
touch $ODBCINI

The Connection Manager uses this .ini file, along with other information that you enter into the Connection
Manager Data Sources tab to define the DSN for Cassandra.

Note
Do not point to the Data Services ODBC .ini file.

2. Start the Connection Manager by entering the following command:

Sample Code

$ cd <LINK_DIR>/bin/
$ ./DSConnectionManager.sh

Note
<LINK_DIR> is the Data Services installation directory.

3. In Connection Manager, open the Data Sources tab, and click Add to display the list of database types.
4. On the Select Database Type window, select Cassandra and click OK.

The Configuration for... window opens. It contains the absolute location of the odbc.ini file that you set in the
first step.
5. Provide values for additional connection properties for the Cassandra database type as applicable. See Data
source properties for Cassandra [page 7] for Cassandra properties.
6. Provide the following properties:

○ User name
○ Password

Note
The software does not save these properties for other users.

7. To test the connection, click Test Connection.


8. Click Restart Services to restart the following services:

If Data Services is installed on the same machine and in the same folder as the IPS or BI platform, restart the
following services:
○ EIM Adaptive Process Service
○ Data Services Job Service

If Data Services is not installed on the same machine and in the same folder as the IPS or BI platform, restart
the following service:
○ Data Services Job Service
9. If you run another command, such as the Repository Manager, source the al_env.sh script to set the
environment variables.

By default, the script is located at <LINK_DIR>/bin/al_env.sh.

2.1.2 Data source properties for Cassandra

The Connection Manager configures the $ODBCINI file based on the property values that you enter on the Data
Sources tab. The following table lists the properties that are relevant for Apache Cassandra.

Database Type Properties on Data Sources tab

Apache Cassandra ● User Name


● Database password
● Host Name
● Port
● Database
● Unix ODBC Lib Path
● Driver
● Cassandra SSL Certificate Mode [0:disabled|1:one-way|2:two-way]

Depending on the value you choose for the certificate mode, you may be asked to
define some or all of the following

● Cassandra SSL Server Certificate File


● Cassandra SSL Client Certificate File
● Cassandra SSL Client Key File
● Cassandra SSL Client Key Password
● Cassandra SSL Validate Server Hostname? [0:disabled|1:enabled]
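For reference, after you complete the Data Sources tab, the Connection Manager writes a DSN entry for Cassandra into the $ODBCINI file. The following sketch only illustrates the general shape of such an entry; the section name, key names, and values are examples and depend on the ODBC driver you installed and the values you entered.

Sample Code

[Cassandra_DSN]
; example entry only; key names depend on your Cassandra ODBC driver
Driver=/opt/cassandra/odbc/lib/libcassandraodbc_sb64.so
Host=cassandra01.example.com
Port=9042
Database=mykeyspace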

2.2 Apache Hadoop

Use SAP Data Services to connect to Apache Hadoop frameworks, including the Hadoop Distributed File System (HDFS) and Hive sources and targets.

Data Services supports Hadoop on both the Linux and Windows platforms. For Windows support, Data Services uses Hortonworks HDP only. See the latest Product Availability Matrix (PAM) at https://apps.support.sap.com/sap/support/pam for the supported versions of Hortonworks HDP.

For information about deploying Data Services on a Hadoop MapR cluster machine, see SAP Note 2404486 .

The following table describes the relevant components of Hadoop:

Component Description

Hadoop distributed file system (HDFS) Stores data on nodes, providing very high aggregate bandwidth across the cluster.

Hive A data warehouse infrastructure that allows SQL-like ad-hoc querying of data (in any
format) stored in Hadoop.

Pig A high-level data-flow language and execution framework for parallel computation
that is built on top of Hadoop. Data Services uses Pig scripts to read from and write
to HDFS, including join and push-down operations.

Map/Reduce A computational paradigm where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Data Services uses map/reduce to do text data processing.

2.2.1 Prerequisites

Before configuring SAP Data Services to connect to Hadoop, verify that your configuration is correct.

Ensure that your Data Services system configuration meets the following prerequisites:

● For Linux and Windows platforms, make sure the machine where the Data Services Job Server is installed is
configured to work with Hadoop.
● For Linux and Windows platforms, make sure the machine where the Data Services Job Server is installed has
the Pig client installed.
● For Linux and Windows platforms, if you are using Hive, verify that the Hive client is installed. To verify this, log
on to the node and issue Pig and Hive commands that invoke the respective interfaces.
● For Linux and Windows platforms, install the Data Services Job Server on one of the Hadoop cluster machines,
which can be either an Edge or a Data node.
● For Linux platforms, ensure that the environment is set up correctly for interaction with Hadoop. The Job
Server should start from an environment that has sourced the Hadoop environment script. For example:

source <$LINK_DIR>/hadoop/bin/hadoop_env_setup.sh -e

● For Linux and Windows platforms, enable text data processing. To enable text data processing, ensure that you
have copied the necessary text data processing components to the HDFS file system, which enables
MapReduce functionality.
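The text data processing components are copied to HDFS with the Hadoop environment script, as described later in Configuring Hadoop for text data processing. For example, on Linux:

Sample Code

$LINK_DIR/hadoop/bin/hadoop_env_setup.sh -c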

2.2.2 Common commands for correct Linux setup

Use common commands to verify that the SAP Data Services system on Linux is correctly configured for Hadoop.

When you use the commands in this topic, you may get outputs that are different than what we show. That is okay.
The only important factor is that your commands don't result in errors.

Setting up the environment

To set up the Data Services environment for Hadoop, use the following commands:

$ cd <DS Install Directory>/bin


$ source ./al_env.sh
$ cd ../hadoop/bin
$ source ./hadoop_env_setup.sh -e

Checking components

To make sure that Hadoop, Pig, and Hive are set up correctly on the machine where the Data Services Job Server
for Hadoop is configured and installed, use the following command:

$ hadoop fs -ls /

For Hadoop, you should see output similar to the following:

$ hadoop fs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2013-03-21 11:47 /tmp
drwxr-xr-x - hadoop supergroup 0 2013-03-14 02:50 /user

For Pig, you should see output similar to the following:

$ pig
INFO org.apache.pig.Main - Logging error messages to: /hadoop/pig_1363897065467.log
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to hadoop file system at: hdfs://machine:9000
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to map-reduce job tracker at: machine:9001
grunt> fs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2013-03-21 11:47 /tmp
drwxr-xr-x - hadoop supergroup 0 2013-03-14 02:50 /user
grunt> quit

For Hive, you should see output similar to the following:

$ hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201303211318_504071234.txt
hive> show databases;
OK
default
Time taken: 1.312 seconds
hive> quit;

Set up or restart the Job Server

If all commands pass, use $LINK_DIR/bin/svrcfg from within the same shell to set up or restart the Job Server.

By running this, you are giving the Job Server the proper environment from which it can start engines that can call
Hadoop, Pig, and Hive.
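For example, from the same shell in which you sourced the environment scripts:

Sample Code

$ $LINK_DIR/bin/svrcfg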

2.2.3 Hadoop support for the Windows platform

SAP Data Services supports Hadoop on the Windows platform using Hortonworks.

Use the supported version of Hortonworks HDP only. See the Product Availability Matrix (PAM) for the most recent
supported version number.

When you use Hadoop on the Windows platform, you can use Data Services to do the following tasks:

● Use Hive tables as a source or target in your data flows.


● Use HDFS files as a source or target in your data flows using Pig scripts or the HDFS library API.
● Stage non-Hive data in a data flow using the Data_Transfer transform. To do this, join it with a Hive source, and
then push down the Join operation to Hive.
● Preview data for HDFS files and Hive tables.

Requirements

Make sure that you set up your system as follows:

● Install the Data Services Job Server in one of the nodes of the Hadoop cluster.
● Set the system environment variables, such as PATH and CLASSPATH, so that the Job Server can run as a
service.
● Set the HDFS file system permission requirements for using HDFS or Hive.

Related Information

Connecting to HDFS [page 15]


Previewing HDFS file data [page 15]
Supplement for Adapters: Using Hive metadata [page 18]

2.2.3.1 Setting up HDFS and Hive on Windows

Set system environment variables and use command prompts to configure HDFS and Hive for Windows.

Install the SAP Data Services Job Server component.

1. Set the following system environment variable:

HDFS_LIB_DIR = /sap/dataservices/hadoop/tdp

2. Add <%LINK_DIR%>\ext\jre\bin\server to the PATH.


3. Run the command hadoop classpath --jar c:\temp\hdpclasspath.jar and update the CLASSPATH system environment variable to include the generated file, for example CLASSPATH=%CLASSPATH%;c:\temp\hdpclasspath.jar (see the sample commands after this procedure).
4. Set the location of the Hadoop .jar file and the CLASSPATH .jar file that the Hadoop CLASSPATH
command generates.
5. When the Hadoop CLASSPATH command completes successfully, check the content of the .jar file for the
Manifest file.

6. Check that the hdfs.dll has symbols exported.
If the hdfs.dll does not have symbols exported, you should install the fix from Hortonworks for the export of
symbols. If the .dll still does not have symbols exported, use the .dll from Hortonworks 2.3.
7. Required only if you use TDP transforms in jobs, and only once per Data Services install: Run the
Hadoop_env_setup.bat from %LINK_DIR%\bin. The .bat file copies the Text Analysis Language file to the
HDFS cache directory.
8. Ensure that the Hadoop or Hive .jar files are installed. The Data Services Hive adapter uses the
following .jar files:
○ commons-httpclient-3.0.1.jar
○ commons-logging-1.1.3.jar
○ hadoop-common-2.6.0.2.2.6.0-2800.jar
○ hive-exec-0.14.0.2.2.6.0-2800.jar
○ hive-jdbc-0.14.0.2.2.6.0-2800-standalone.jar
○ hive-jdbc-0.14.0.2.2.6.0-2800.jar
○ hive-metastore-0.14.0.2.2.6.0-2800.jar
○ hive-service-0.14.0.2.2.6.0-2800.jar
○ httpclient-4.2.5.jar
○ httpcore-4.2.5.jar
○ libfb303-0.9.0.jar
○ log4j-1.2.16.jar
○ slf4j-api-1.7.5.jar
○ slf4j-log4j12-1.7.5.jar
9. Run the following commands to set up the permissions on the HDFS file system:

hdfs dfs -chmod -R 777 /mapred
hdfs dfs -mkdir /tmp
hdfs dfs -chmod -R 777 /tmp
hdfs dfs -mkdir /tmp/hive/
hdfs dfs -chmod -R 777 /tmp/hive
hdfs dfs -mkdir -p /sap/dataservices/hadoop/tdp
hdfs dfs -mkdir -p /user/hive
hdfs dfs -mkdir -p /hive/warehouse
hdfs dfs -chown hadoop:hadoop /user/hive
hdfs dfs -chmod -R 755 /user/hive
hdfs dfs -chmod -R 777 /hive/warehouse
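As a rough sketch, steps 2 and 3 correspond to commands like the following at a Windows command prompt. The jar path matches the example used above; make the final values permanent system environment variables (for example, through the System Properties dialog) so that the Job Server service picks them up, because set affects only the current session.

Sample Code

rem example only: the jar path must match the file generated in step 3
set PATH=%LINK_DIR%\ext\jre\bin\server;%PATH%
hadoop classpath --jar c:\temp\hdpclasspath.jar
set CLASSPATH=%CLASSPATH%;c:\temp\hdpclasspath.jar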

Related Information

Supplement for Adapters: Using Hive metadata [page 18]

2.2.4 HDFS file format

The file format for the Hadoop distributed file system (HDFS) describes the file system structure.

Characteristic Description

Class Reusable

Access In the object library, click the Formats tab.

Description An HDFS file format describes the structure of a Hadoop distributed file system. Store templates for HDFS file formats in the object library. The format consists of multiple properties that you set in the file format editor. Available properties vary by the mode of the editor.

The HDFS file format editor includes most of the regular file format editor options plus options
that are unique to HDFS.

2.2.4.1 HDFS file format options

File format option descriptions for Hadoop distributed file system (HDFS).

Access the following options in the source or target file editors when you use the HDFS file format in a data flow.

The following options are listed with their possible values; unless noted otherwise, each option is available in all modes of the file format editor (Mode: All).

Data File(s)

NameNode host
Possible values: Computer name, fully qualified domain name, IP address, or variable
Name of the NameNode computer. If you use the following default settings, the local Hadoop system uses what is set as the default file system in the Hadoop configuration files:
● NameNode Host: default
● NameNode port: 0

NameNode port
Possible values: Positive integer or variable
Port on which the NameNode listens.

Hadoop user
Possible values: Alphanumeric characters and underscores, or variable
Hadoop user name. If you are using Kerberos authentication, include the Kerberos realm in the user name. For example: dsuser@BIGDATA.COM.


Authentication
Possible values: Kerberos, Kerberos keytab
Indicates the type of authentication for the HDFS connection. Select either value for Hadoop and Hive data sources when they are Kerberos enabled.
Kerberos: Select when you have a password to enter in the Password option.
Kerberos keytab: Select when you have a generated keytab file. With this option, you do not need to enter a value for Password, but you enter a location for File Location.
A Kerberos keytab file contains a list of authorized users for a specific password. The software uses the keytab information instead of the entered password in the Password option. For more information about keytabs, see the MIT Kerberos documentation at http://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html.

File Location
Possible values: File path
Location for the applicable Kerberos keytab that you generated for this connection.

Password
Possible values: Alphanumeric characters and underscores, or variable
Password associated with the selected authentication type. This field is required for Authentication type Kerberos. This field is not applicable for Authentication type Kerberos keytab.

Root directory
Possible values: Directory path or variable
Root directory path or variable name for the output file.

File name(s)
Possible values: Alphanumeric characters and underscores, or variable
Select the source connection file name or browse to the file by clicking the dropdown arrow. For added flexibility, you can select a variable for this option or use the * wildcard.

Pig

Working directory
Possible values: Directory path or variable
The Pig script uses this directory to store intermediate data.

Note
When you leave this option blank, Data Services creates and uses a directory in /user/sapds_temp, within the HDFS.

Clean up working directory
Possible values: Yes, No
Yes: Deletes working directory files.
No: Preserves working directory files.
The software stores the Pig output file and other intermediate files in the working directory. Files include scripts, log files, and the $LINK_DIR/log/hadoop directory.

Note
If you select No, intermediate files remain in both the Pig Working Directory and the Data Services directory $LINK_DIR/log/hadoop.


Custom Pig script
Possible values: Directory path or variable
Location of a custom Pig script. Use the results of the script as a source in a data flow.
A custom Pig script can contain any valid Pig Latin command, including calls to any MapReduce jobs that you want to use with Data Services. See the Pig documentation for information about Pig Latin commands.
Custom Pig scripts must reside on, and be runnable from, the local file system that contains the Data Services Job Server that is configured for Hadoop; it is not the Job Server on HDFS. Any external reference or dependency in the script should be available on the Data Services Job Server machine configured for Hadoop.
To test your custom Pig script, execute the script from the command prompt and check that it finishes without errors. For example, you could use the following command:

$ pig -f myscript

To use the results of the script by using the HDFS file format as a source in a data flow, complete the steps in Configuring custom Pig script results as source [page 14].

Locale

Code page
Possible values: <default>, us-ascii
The applicable Pig code page. The Default option uses UTF-8 for the code page. Select one of these options for better performance.

Note
For other types of code pages, Data Services uses HDFS API-based file reading.

2.2.4.2 Configuring custom Pig script results as source

Output the results of a custom Pig script to a specified file so that you can use it as a source in a data flow.

Create a new HDFS file format or edit an existing one. Create or locate a custom Pig script that outputs data to use
as a source in your data flow.

Follow these steps to use the results of a custom Pig script in your HDFS file format as a source:

1. In the HDFS file format editor, select Delimited for Type in the General section.
2. Enter the location for the custom Pig script results output file in Root directory in the Data File(s) section.
3. Enter the name of the file to contain the results of the custom Pig script in File name(s).
4. In the Pig section, set Custom Pig script to the path of the custom Pig script. The location must be on the
machine that contains the Data Services Job Server.

5. Complete the applicable output schema options for the custom Pig script.
6. Set the delimiters for the output file in the Delimiters section.

Use the file format as a source in a data flow. When the software runs the custom Pig script in the HDFS file format,
the software uses the script results as source data in the job.
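The following is a minimal sketch of a custom Pig script that produces delimited output suitable for this procedure. The paths, field names, and delimiter are examples only; adjust them to your own data and to the Root directory and File name(s) values in the file format.

Sample Code

-- example only: load delimited HDFS data, filter it, and store delimited results
raw_data = LOAD '/user/sapds/input/orders' USING PigStorage(',') AS (id:int, amount:double);
filtered = FILTER raw_data BY amount > 100.0;
STORE filtered INTO '/user/sapds/output/orders_filtered' USING PigStorage(',');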

2.2.5 Connecting to HDFS

Quickly connect to an HDFS file using a file format.

To connect to a Hadoop Distributed File System (HDFS), configure an HDFS file format. Use the file format as a
source or target in a data flow.

Related Information

Reference Guide: HDFS file format [page 12]

2.2.5.1 Previewing HDFS file data

Preview HDFS file data for delimited and fixed width file types.

To preview the first 20 or so rows of an HDFS file:

1. Right-click an HDFS file name in the Format tab of the Local Object Library
2. Click Edit.

The File Format Editor opens. You can only view the data. Sorting and filtering are not available when you view
sample data in this manner.

Use one of the following methods to access HDFS file data so that you can view, sort, and filter the data:

● Right-click on HDFS source or target object in a data flow and click View Data.
● Click the magnifying glass icon located in the lower right corner of the HDFS source or target objects in the
data flow.
● Right-click an HDFS file in the Format tab of the Local Object Library, click Properties, and then open the View
Data tab.

Note
By default, the maximum number of rows displayed for data preview and filtering is 1000, but you can adjust the
number lower or higher, up to a maximum of 5000. To change the maximum number of rows to display:

1. Select Tools > Options > Designer > General.


2. Set the View data sampling size (rows) to the desired number of rows.

2.2.6 Connecting to Hive

Use the Hive adapter to connect to Hive.

Complete the following group of tasks to connect to Hive using the Hive adapter:

1. Open the Administrator in the Management Console and enable the Job Server to support adapters.
2. In the Administrator, add, configure, and start an adapter instance.
3. In Data Services Designer, add and configure a Hive adapter datastore.

Note
Data Services supports Apache Hive and HiveServer2 version 0.11 and higher. For the most recent compatibility
information, see the Product Availability Matrix (PAM) at https://apps.support.sap.com/sap/support/pam .

Related Information

Hive adapter datastore configuration options [page 16]

2.2.7 Hive adapter datastore configuration options

Option descriptions for the Hive adapter datastore editor.

The following datastore configuration options apply to the Hive adapter:

Option Description

Host name The name of the machine that is running the Hive service.

Port number The port number of the machine that is running the Hive service.

Username and Password The user name and password associated with the adapter database to which you are
connecting.

If you are using Kerberos authentication, the user name should include the Kerberos
realm. For example: dsuser@BIGDATA.COM. If you use Kerberos keytab for authenti­
cation, you do not need to complete this option.

Local working directory The path to your local working directory.

HDFS working directory The path to your Hadoop Distributed File System (HDFS) directory. If you leave this
blank, Data Services uses /user/sapds_hivetmp as the default.

String size The size of the Hive STRING datatype. The default is 100.


SSL enabled Select Yes to use a Secure Socket Layer (SSL) connection to connect to the Hive server.

Note
If you use Kerberos or Kerberos keytab for authentication, set this option to No.

SSL Trust Store The name of the trust store that verifies credentials and stores certificates.

Trust Store Password The password associated with the trust store.

Authentication Indicates the type of authentication you are using for the Hive connection:

Kerberos: Enter your Kerberos password in the Username and Password option.

Kerberos keytab: The generated keytab file. Enter the keytab file location in Kerberos
Keytab Location option.

A Kerberos keytab file contains a list of authorized users for a specific password. The
software uses the keytab information instead of the entered password in the Username
and Password option. For more information about keytabs, see the MIT Kerberos docu­
mentation at http://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html .

Data Services supports Kerberos authentication for Hadoop and Hive data sources
when you use Hadoop and Hive services that are Kerberos enabled.

Note
● Data Services supports Hadoop and Hive on Linux 64-bit platform only.
● You cannot use SSL and Kerberos or Kerberos keytab authentication together.
Set the SSL enabled option to No when using Kerberos authentication.
● To enable SASL-QOP support for Kerberos, enter a sasl.qop value into the
Additional Properties field. For more information, see the Additional Properties
field description.

To use Kerberos authentication, do the following:

1. Install Kerberos 5 client 64-bit packages (krb5, krb5-client).


2. Configure Kerberos KDC according to the Hadoop/Hive distribution requirements.
3. Make sure the Kerberos configuration file (krb5.conf) is available and contains the
correct REALM/KDC configurations. Note that the location is installation-specific,
under /etc/krb5.conf on Linux.
4. Create a symbolic link so that /usr/lib64/libkrb5.so points to the preferred version of the libkrb5.so.<version> library (see the example command after this table).

For more information about Kerberos, visit http://web.mit.edu/kerberos/ .

Kerberos Realm Specifies the name of your Kerberos realm. A realm contains the services, host ma­
chines, and so on, that users can access. For example, BIGDATA.COM.


Kerberos KDC Specifies the server name of the Key Distribution Center (KDC). Secret keys for user
machines and services are stored in the KDC database.

Configure the Kerberos KDC with renewable tickets (ticket validity as required by Ha­
doop/Hive installation).

Note
Data Services supports MIT KDC and Microsoft AD for Kerberos authentication.

Kerberos Hive Principal The Hive principal name for the KDC. The name can be the same as the user name that
you use when installing Data Services. Find the Hive service principal information in the
hive-site.xml file. For example, hive/<hostname>/@realm.

Kerberos Keytab Location Location for the applicable Kerberos keytab that you generated for this connection.

See the description for Authentication for more information about Kerberos keytab au­
thentication.

Additional Properties Specify any additional connection properties. Follow property value pairs with a semicolon (;). Separate multiple property value pairs with a semicolon. For example:

name1=value1;

name1=value1; name2=value2;

To enable SASL-QOP support, set the Authentication option to Kerberos. Then enter one
of the following values, which should match the value on the Hive server:

● Use ;sasl.qop=auth; for authentication only.


● Use ;sasl.qop=auth-int; for authentication with integrity protection.
● Use ;sasl.qop=auth-conf; for authentication with integrity and confidentiality protection.
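For step 4 in the Kerberos setup above, the symbolic link is typically created with a command similar to the following; the library version shown is only an example and depends on your system.

Sample Code

# example version number; check which libkrb5.so.<version> is installed
ln -s /usr/lib64/libkrb5.so.3 /usr/lib64/libkrb5.so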

Related Information

Using Hive metadata [page 18]

2.2.8 Using Hive metadata

Use the Hive adapter to connect to a Hive server so that you can work with tables from Hadoop.

You can use a Hive table as a source or a target in a data flow.

Note
Data Services supports Apache Hive and HiveServer2 version 0.11 and higher. For the most recent compatibility
information, see the Product Availability Matrix (PAM) at https://apps.support.sap.com/sap/support/pam .

For more information about Hadoop and Hive, see "Hadoop" in the Reference Guide.

Related Information

Hive adapter datastore configuration options [page 16]


Metadata mapping for Hive [page 23]

2.2.8.1 Configuring Hadoop for text data processing

SAP Data Services supports text data processing in the Hadoop framework using a MapReduce form of the Entity
Extraction transform.

To use text data processing in Hadoop, copy the language modules and other dependent libraries to the Hadoop
file system (so they can be distributed during the MapReduce job setup) by running the Hadoop environment
script as follows:

$LINK_DIR/hadoop/bin/hadoop_env_setup.sh -c

You only have to do this file-copying operation once after an installation or update, or when you want to use
custom dictionaries or rule files. If you are using the Entity Extraction transform with custom dictionaries or rule
files, you must copy these files to the Hadoop file system for distribution. To do so, first copy the files into the
languages directory of the Data Services installation, then rerun the Hadoop environment script. For example:

cp /myhome/myDictionary.nc $LINK_DIR/TextAnalysis/languages

$LINK_DIR/hadoop/bin/hadoop_env_setup.sh -c

Once this environment is set up, to have the Entity Extraction transform operations pushed down and handled by the Hadoop system, the transform must be connected to a single HDFS Unstructured Text source.

Optimizing text data processing for use in the Hadoop framework

When using text data processing in the Hadoop framework, the amount of data a mapper can handle, and consequently the number of mappers a job uses, is controlled by the Hadoop configuration setting mapred.max.split.size.

You can set the value for mapred.max.split.size in the Hadoop configuration file (located at $HADOOP_HOME/
conf/core-site.xml or an alternate configuration location, depending on the flavor of Hadoop you are using).

By default, the value for mapred.max.split.size is 0, which means that there is no limit and text data
processing would run with only one mapper. You should change this configuration value to the amount of data a
mapper can handle.

For example, you might have a Hadoop cluster that contains twenty machines and each machine is set up to run a
maximum of ten mappers (20 x 10 = 200 mappers available in the cluster). The input data averages 200 GB. If you

want the text data processing job to consume 100 percent of the available mappers (200 GB ÷ 200 mappers = 1
GB per mapper), you would set mapred.max.split.size to 1073741824 (1 GB).

<property>
<name>mapred.max.split.size</name>
<value>1073741824</value>
</property>

If you want the text data processing job to consume 50 percent of the available mappers (200 GB ÷ 100 mappers
= 2 GB per mapper), you would set mapred.max.split.size to 2147483648 (2 GB).

Related Information

HDFS file format [page 12]

2.2.8.2 Hadoop Hive adapter source options

The options and descriptions for Hadoop Hive adapter source.

You can set the following options on the Adapter Source tab of the source table editor.

Option Possible values Description

Clean up working directory True, False Select True to delete the working directory after the job com­
pletes successfully.

Execution engine type Default, Map Reduce, Spark ● Default: Data Services uses the default Hive engine.
● Spark: Data Services uses the Spark engine to read data
from Spark.
● Map Reduce: Data Services uses the Map Reduce engine to read data from Hive.

Parallel process threads Positive integers Specify the number of threads for parallel processing. More
than one thread may improve performance by maximizing
CPU usage on the Job Server computer. For example, if you
have four CPUs, enter 4 for the number of parallel process
threads.

2.2.8.3 Hadoop Hive adapter target options

The options and descriptions for Hadoop Hive adapter target options.

You can set the following options on the Adapter Target tab of the target table editor.

Option Possible values Description

Append True, False Select True to append new data to the table or partition.

Select False to delete all existing data, then add new data.

Clean up working directory True, False Select True to delete the working directory after the job com­
pletes successfully.

Dynamic partition True, False Select True for dynamic partitions. Hive evaluates the parti­
tions when scanning the input data.

Select False for static partitions.

Only all-dynamic or all-static partitions are supported.

Drop and re-create table True, False Select True to drop the existing table and create a new one with the same name before loading.

This option displays only for template tables. Template tables are used in design or test environments.

Number of loaders Positive integers Enter a positive integer for the number of loaders (threads).

Loading with one loader is known as single loader loading.


Loading when the number of loaders is greater than one is
known as parallel loading. You can specify any number of load­
ers. The default is 1.

2.2.8.4 Hive adapter datastore support for SQL function and transform

The Hive adapter datastore can process data using the SQL function and the SQL transform.

After connecting to a Hive datastore, you can do the following in Data Services:

● Use the SQL Transform to read data through a Hive adapter datastore. Keep in mind that the SQL transform
supports a single SELECT statement only.

Note
Select table column plus constant expression is not supported.

● Use the sql() function to:
○ create, drop, or insert into Hive tables
○ return a single string value from a Hive table
○ select from a Hive table using aggregate functions (max, min, count, avg, and sum)
○ perform inner and outer joins
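As a simple illustration, a Data Services script could call the sql() function against a Hive adapter datastore as follows. The datastore name, table name, and column are placeholders only.

Sample Code

# 'Hive_DS' and 'sales' are hypothetical names
$max_amount = sql('Hive_DS', 'SELECT max(amount) FROM sales');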

2.2.8.5 Pushing the JOIN operation to Hive

Stage non-Hive data in a dataflow with the Data Transfer transform before joining it with a Hive source.

When you join the non-Hive data to a Hive source, push down the Join operation to Hive.

Using pushdown, staging data is more efficient because Data Services doesn't have to read all the data from the
Hive data source into memory before performing the join.

Before staging can occur, you must first enable the Enable automatic data transfer option for the Hive datastore.
Find this option in the Create New Datastore or Edit Datastore window.

After adding the Data_Transfer transform to your dataflow, open the editor and verify that Transfer Type is set to
Table and Database type is set to Hive.

Note
If you select Automatic for the Data Transfer Type in the Data Transfer transform, you need to turn off the Enable automatic data transfer option in all relational database datastores (with the exception of the Hive datastore).

2.2.8.6 About partitions

Partition columns display at the end of the table column list.

Data Services imports Hive partition columns the same way as regular columns. The column attribute Partition
Column identifies whether the column is partitioned.

When loading to a Hive target, select whether or not to use the Dynamic partition option on the Adapter Target tab
of the target table editor. The partitioned data is evaluated dynamically by Hive when scanning the input data. If
Dynamic partition is not selected, Data Services uses Hive static loading. All rows are loaded to the same partition.
The partitioned data comes from the first row that the loader receives.

Related Information

Hadoop Hive adapter target options [page 20]

2.2.8.7 Previewing Hive table data

Preview data in Hive tables.

To preview Hive table data, right-click a Hive table name in the Local Object Library and click View Data.
Alternatively, you can click the magnifying glass icon on Hive source and target objects in a data flow or open the
View Data tab of the Hive table view.

Note
Hive table data preview is only available with Apache Hive version 0.11 and later.

2.2.8.8 Using Hive template tables

After you create a Hive application datastore in Data Services, use a Hive template table in a data flow.

Start to create a data flow in Data Services Designer and follow these steps to add a Hive template table as a
target.

1. When you are ready to complete the target portion of the data flow, either drag a template table from the
toolbar to your workspace or drag a template table from the Datastore tab under the Hive node to your
workspace.

The Create Template window opens.


2. Enter a template table name in Template name.
3. Select the applicable Hive datastore name from the In datastore dropdown list.
4. Enter the Hive dataset name in Owner name.
5. Select the format of the table from the Format dropdown list. Select Text file, Parquet, ORC, or AVRO.
6. Click OK to close the Create Template window.
7. Connect the dataflow to the Hive target template table.
8. Open the target table and set the options in the Target tab.
The software automatically completes the input and output schema areas based on the schema in the stated
Hive dataset.
9. Execute the data flow.

The software opens the applicable project and dataset, and creates the table. The table name is the name you
entered for Template name in the Create Template window. The software populates the table with the results of the
data flow.

2.2.8.9 Metadata mapping for Hive

Data type conversion when you import metadata from Hadoop Hive to SAP Data Services.

The following table shows the conversion between Hadoop Hive data types and Data Services data types when
Data Services imports metadata from a Hadoop Hive source or target.

Hadoop Hive data type Converts to Data Services data type

tinyint int

smallint int

int int


bigint decimal(20,0)

float real

double double

string varchar

boolean varchar(5)

complex not supported

2.3 HP Vertica

Process your HP Vertica data in SAP Data Services by creating an HP Vertica database datastore.

Use an HP Vertica datastore as a source or target in a data flow. Implement SSL secure data transfer with MIT Kerberos to securely access HP Vertica data. Additionally, adjust settings in the source or target table options to enhance HP Vertica performance.

2.3.1 Enable MIT Kerberos for HP Vertica SSL protocol

SAP Data Services uses MIT Kerberos 5 authentication to securely access an HP Vertica database using SSL
protocol.

You must have Database Administrator permissions to install MIT Kerberos 5 on your Data Services client
machine. Additionally, the Database Administrator must establish a Kerberos Key Distribution Center (KDC)
server for authentication. The KDC server must support Kerberos 5 using the Generic Security Service (GSS) API.
The GSS API also supports non-MIT Kerberos implementations, such as Java and Windows clients.

Note
Specific Kerberos and HP Vertica database processes are required before you can enable SSL protocol in Data
Services. For complete explanations and processes for security and authentication, consult your HP Vertica
user documentation and the MIT Kerberos user documentation.

MIT Kerberos authorizes connections to the HP Vertica database using a ticket system. The ticket system
eliminates the need for users to enter a password.

Related Information

Information to edit configuration or initialization file [page 25]


Generate secure key with kinit command [page 27]
Creating a DSN for HP Vertica [page 27]

2.3.1.1 Information to edit configuration or initialization file

Descriptions for kerberos properties for configuration or initialization files.

After you install MIT Kerberos, define the specific Kerberos properties in the Kerberos configuration or
initialization file and save it to your domain. For example, save krb5.ini to C:\Windows.

See the MIT Kerberos documentation for information about completing the Unix krb5.conf property file or the
Windows krb5.ini property file. Kerberos documentation is located at: http://web.mit.edu/kerberos/krb5-
current/doc/admin/conf_files/krb5_conf.html .

Log file locations for Kerberos

[logging] Locations for Kerberos log files

Property Description

default = <value> The location for the Kerberos library log file, krb5libs.log. For example: default = FILE:/var/log/krb5libs.log

kdc = <value> The location for the Key Distribution Center log file, krb5kdc.log. For example: kdc = FILE:/var/log/krb5kdc.log

admin_server = <value> The location for the administrator log file, kadmind.log. For example: admin_server = FILE:/var/log/kadmind.log

Kerberos 5 library settings

[libdefaults] Settings used by the Kerberos 5 library

Property Description

default_realm = <value> The name of your domain (realm). The domain must be in all capital letters. Example: default_realm = EXAMPLE.COM

dns_lookup_realm = <value> Set to False: dns_lookup_realm = false

dns_lookup_kdc = <value> Set to False: dns_lookup_kdc = false


ticket_lifetime = <value> Set number of hours for the initial ticket request. For example:
ticket_lifetime = 24h

The default is 24h.

renew_lifetime = <value> Set number of days a ticket can be renewed after the ticket
lifetime expiration. For example: renew_lifetime = 7d

The default is 0.

forwardable = <value> Initial tickets can be forwarded when this value is set to True.
For example: forwardable = true

Kerberos realm values

[realms] Value for each Kerberos realm

Property Description

<kerberos_realm> = {<subsection_property> = <value>} Location for each property of the Kerberos realm. For example:

EXAMPLE.COM = {kdc=<location>
admin_server=<location>
kpasswd_server=<location>}

Properties include:

● KDC location
● Admin Server location
● Kerberos Password Server location

Note
Host and server names are lowercase.

Kerberos domain realm

[domain_realm]

Property Description

<server_host_name>=<kerberos_realm> Maps the server host name to the Kerberos realm name. If you
use a domain name, prefix the name with a period (.).
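Putting these properties together, a krb5.ini (or krb5.conf) file follows the general pattern sketched below. The realm, host names, and log file locations are placeholders based on the examples in the tables above; replace them with your own values.

Sample Code

# placeholder realm, host names, and log locations; adjust for your environment
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = EXAMPLE.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 EXAMPLE.COM = {
  kdc = kdc.example.com
  admin_server = kdc.example.com
  kpasswd_server = kdc.example.com
 }

[domain_realm]
 .example.com = EXAMPLE.COM
 example.com = EXAMPLE.COM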

Related Information

Generate secure key with kinit command [page 27]

2.3.1.2 Generate secure key with kinit command

Execute the kinit command to generate a secure key.

After you have updated the configuration or initialization file and saved it to the client domain, execute the kinit
command to generate a secure key.

For example, enter the following command using your own information for the variables: kinit
<user_name>@<realm_name>

The command should generate the following keys:

Key Description

-k Precedes the service name portion of the Kerberos principal. The default is vertica.

-K Precedes the instance or host name portion of the Kerberos principal.

-h Precedes the machine host name for the server.

-d Precedes the HP Vertica database name that you want to connect to.

-U Precedes the user name of the administrator user.

See the MIT Kerberos ticket management documentation for complete information about using the kinit
command to obtain tickets: http://web.mit.edu/kerberos/krb5-current/doc/user/tkt_mgmt.html .

2.3.2 Creating a DSN for HP Vertica

To enable SSL for HP Vertica database datastores, first create a data source name (DSN).

You must be an HP Vertica user with database administrator permissions to perform these steps. Other non-database administrators can access the HP Vertica database only when they are associated with an authentication method through a GRANT statement.

You must be using SAP Data Services 4.2 SP7 Patch 1 (14.2.7.1) or later to create a DSN for HP Vertica.

Install MIT Kerberos 5 and perform all of the required steps for MIT Kerberos authentication for HP Vertica. See
your HP Vertica documentation in the security and authentication sections for details.

Follow these steps to create a DSN for HP Vertica:

1. Open the ODBC Data Source Administrator.

You can access the ODBC Data Source Administrator either from the Datastore Editor in Data Services
Designer or directly from your Start menu.
2. In the ODBC Data Source Administrator, open the System DSN tab and click Add.
3. Select the HP Vertica driver from the list and click Finish.
4. Open the Basic Settings tab and complete the following options:

HP Vertica ODBC DSN Configuration Basic Settings tab

Option Value

DSN Enter the HP Vertica data source name.

Description Optional. Enter a description for this data source.

Database Enter the name of the database that is running on the


server.

Server Enter the server name.

Port Enter the port number on which HP Vertica listens for ODBC
connections. The default is 5433.

User Name Enter the database user name. This is the user with DBADMIN permission, or a user who is associated with the authentication method through a GRANT statement.

5. Optional. Select Test Connection.


6. Open the Client Settings tab and complete the options as described in the following table.

HP Vertica ODBC DSN Configuration Client Settings tab

Option Value

Kerberos Host Name Enter the name of the host computer where Kerberos is in­
stalled.

Kerberos Service Name Enter the applicable value.

SSL Mode Select Require.

Address Family Preference Select None.

Autocommit Select this option.

Driver String Conversions Select Output.

Result Buffer Size (bytes) Enter the applicable value in bytes. Default is 131072.

Three Part Naming Select this option.


Log Level Select No logging from the dropdown list.

7. Click Test Connection. When the connection test is successful click OK and close the ODBC Data Source
Administrator.

Now the HP Vertica DSN that you just created is included in the DSN option in the datastore editor.

Create the HP Vertica database datastore in Data Services Designer and select the DSN that you just created.

Related Information

Creating HP Vertica datastore with SSL encryption [page 29]

2.3.3 Creating HP Vertica datastore with SSL encryption

SSL encryption protects data as it is transferred between the database server and Data Services.

An administrator must install MIT Kerberos 5 and enable Kerberos for HP Vertica SSL protocol. Additionally, an
administrator must create an SSL data source name (DSN) using the ODBC Data Source Administrator so that it is
available to choose when you create the datastore. See the Administrator Guide for more information about
configuring MIT Kerberos.

SSL encryption for HP Vertica is available in SAP Data Services version 4.2 Support Package 7 Patch 1 (14.2.7.1) or
later.

Note
Enabling SSL encryption slows down job performance.

Note
An HP Vertica database datastore requires that you choose DSN as a connection method. DSN-less
connections are not allowed for HP Vertica datastore with SSL encryption.

1. In Designer, select Project > New Datastore.


2. Complete the options as you would for an HP Vertica database datastore. Complete the following options
specifically for SSL encryption:

SSL-specific options

Option Value

Use Data Source Name (DSN) Select this option.


Data Source Name Select the HP Vertica SSL DSN data source file that was cre­
ated previously in the ODBC Data Source Administrator.

3. Complete the remaining applicable advanced options and save your datastore.

Related Information

HP Vertica datastore options [page 30]

2.3.3.1 HP Vertica datastore options

Options, descriptions, and possible values for creating an HP Vertica database datastore.

Configure an HP Vertica database datastore using a DSN (data source name).

After you create the HP Vertica database datastore, you can import HP Vertica tables into Data Services. Use the
tables as source or targets in a dataflow, and create HP Vertica template tables.

SSL protocol is available for HP Vertica database datastores. Before you can create an SSL-enabled HP Vertica
datastore, the HP Vertica database administrator user must install and configure MIT Kerberos 5 and create a DSN
in the ODBC Data Source Administrator.

Main window

HP Vertica option Possible values Description

Database version HP Vertica 7.1.x Select your HP Vertica client version from the drop-down list. This is the version of HP Vertica that this datastore accesses.

Data source name Refer to the requirement of your database Required. Select a DSN from the dropdown list if you have already defined one. If you haven't defined a DSN previously, click ODBC Admin to define a DSN.

You must first install and configure MIT Kerberos 5 and perform other HP Vertica setup tasks before you can define a DSN.

For more information about MIT Kerberos and DSN configuration for HP Vertica, read the Server Management section of the Administrator Guide.


User name Alphanumeric characters and underscores Enter the user name of the account through which SAP Data Services accesses the database.

Password Alphanumeric characters and underscores Enter the user password.

Connection

HP Vertica option Possible values Description

Additional connection parameters Alphanumeric characters and underscores, or blank Enter information for any additional connection parameters. Use the format: <parameter1=value1; parameter2=value2>

General

HP Vertica option Possible values Description

Rows per commit Positive integer Enter the maximum number of rows
loaded to a target table before saving the
data. This value is the default commit
size for target tables in this datastore.
You can overwrite this value for individual
target tables.

Overflow file directory Directory path or click Browse A working directory on the database
server that stores files such as logs.
Must be defined to use FTP.

Session

HP Vertica option Possible values Description

Additional session parameters A valid SQL statement or multiple SQL Additional session parameters specified
statements delimited by semicolon as valid SQL statements.

Aliases (Click here to create)

HP Vertica option Possible values Description

Aliases Alphanumeric characters and under­ Click the option to open a Create New
scores, or blank Alias window.

2.3.4 Bulk loading for HP Vertica

Set up an HP Vertica database datastore for bulk loading by increasing the commit size in the loader and by
selecting to use the native connection load balancing option when you configure the ODBC driver.

Consult the Connecting to HP Vertica guide at https://my.vertica.com/docs/7.1.x/PDF/HP_Vertica_7.1.x_ConnectingToHPVertica.pdf.

There are no specific bulk loading options when you create an HP Vertica database datastore. However, when you load data to an HP Vertica target in a data flow, the software automatically executes an HP Vertica COPY LOCAL statement. This statement makes the ODBC driver read and stream the data file from the client to the server.
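As an illustration only, the statement that the software generates resembles the following COPY FROM LOCAL command; the table name, file path, and delimiter here are hypothetical, and the exact options that Data Services uses may differ:

Sample Code

-- Minimal sketch of a COPY FROM LOCAL statement (hypothetical names)
COPY public.sales_target
FROM LOCAL '/tmp/sales_target_loader1.dat'
DELIMITER ','
NULL ''
ABORT ON ERROR;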

You can further increase loading speed by making the following settings in Designer:

1. Increase rows per commit for the target:


a. In Designer, open the applicable data flow.
b. In the workspace, double-click the HP Vertica datastore target object to open it.
c. Open the Options tab in the lower pane.
d. Increase the number of rows in the Rows per commit option.
2. In HP Vertica, enable the option to use native connection load balancing when you configure the ODBC driver
for HP Vertica.

2.3.5 HP Vertica data type conversion

Data type conversion between HP Vertica and SAP Data Services.

HP Vertica data type Data Services data type

Boolean Int

Integer, INT, BIGINT, INT8, SMALLINT, TINYINT Decimal

FLOAT Double

Money Decimal

Numeric Decimal

Number Decimal

Decimal Decimal

Binary, Varbinary, Long Varbinary Blob

Long Varchar Long

Char Varchar


Varchar Varchar

Char(n), Varchar(n) Varchar(n)

DATE Date

TIMESTAMP Datetime

TIMESTAMPTZ Varchar

Time Time

TIMETZ Varchar

INTERVAL Varchar

Data type conversion from internal data types to HP Vertica data types for template tables or Data_Transfer
transform tables.

Data Services data type HP Vertica data type in template table

Blob Long Varbinary

Date Date

Datetime Timestamp

Decimal Decimal

Double Float

Int Int

Interval Float

Long Long Varchar

Real Float

Time Time

Varchar Varchar

Timestamp Timestamp

2.3.6 HP Vertica table source

Options and descriptions for setting up an HP Vertica table as a source in a data flow.

Option Description

Table name The name of the table that you added as a source to the data­
flow.

Table owner The owner that you entered when you created the HP Vertica
table.

Datastore name The name of the HP Vertica datastore.

Database type Set to HP Vertica by default. The database type that you chose
when you created the datastore. You cannot change this op­
tion.

2.3.7 HP Vertica target table options

Options and descriptions for setting up an HP Vertica table as a target in a data flow.

General

Option Description

Column comparison Default is Compare by name.

Specifies how the input columns are mapped to output columns. There are two options:

● Compare by position: The software disregards the column names and maps source columns to target columns by position.
● Compare by name: The software maps source columns to target columns by name.

Validation errors occur if the data types of the columns do not match.


Number of loaders The default number of loaders is 1, which is single loader loading.

● Single loader loading: Loading with one loader.
● Parallel loading: Loading when the number of loaders is greater than one.

When parallel loading, each loader receives the number of rows indicated in the Rows per commit option. Each loader applies the rows in parallel with the other loaders.

For example, if you choose a Rows per commit of 1000 and set the Number of Loaders to 3, the software loads data as follows:

● Sends the first 1000 rows to the first loader
● Sends the second 1000 rows to the second loader
● Sends the third 1000 rows to the third loader
● Sends the next 1000 rows back to the first loader

Error handling

Option Description

Use overflow file Default is No.

This option is used for recovery purposes. If the software cannot load a row, the row is written to a file. When this option is set to Yes, options are enabled for the file name and file format.

Update control

Option Description

Use input keys Default is No.

Yes: If the target table does not contain a primary key, this option enables the software to use the primary keys from the input.

No: If the target is a Microsoft SQL Server database table and the identity column is mapped as the primary key, this option must be set to No.

Update key columns Default is No.

Yes: The software updates key column values when it loads data to the target.


Auto correct load Default is No.

Yes: Use auto correct loading. Auto correct loading ensures that the same row is not duplicated in a target table. This is particularly useful for data recovery operations.

Note
This option is not available for targets in real time jobs or target tables that contain LONG columns.

When you select Yes for this option, the software reads a row from the source, then checks if the row exists in the target table with the same values in the primary key. If Use input keys is set to Yes, the software uses the primary key of the source table. Otherwise, the software uses the primary key of the target table. If the target table has no primary key, the software considers the primary key to be all the columns in the target.

If a matching row does not exist, the software inserts a new row, regardless of other options.

If a matching row exists, the software updates the row depending on the value of Ignore columns with value.

When the column data from the source matches the value in Ignore columns with value, the software does not update the corresponding column in the target table. The value may be spaces. Otherwise, the software updates the corresponding column in the target with the source data.

Ignore columns with value Enter a value that might appear in a source column and that you do not want updated in the target table. The value must be a string; it can include spaces, but it cannot be enclosed in single or double quotation marks. When this value appears in the source column, the software does not update the corresponding target column during auto correct loading.

Transaction control

Option Description

Include in transaction Default is No.

Yes: Indicates that this target is included in the transaction processed by a batch or real-time job. This option allows you to commit data to multiple tables as part of the same transaction. If loading fails for any one of the tables, the software does not commit any data to any of the tables. The tables must be from the same datastore.

Transactional loading can require rows to be buffered to ensure the correct load order. If the data being buffered is larger than the virtual memory available, the software reports a memory error.

If you choose to enable transactional loading, the following options are not available:

● Rows per commit
● Use overflow file and overflow file specification
● Number of loaders

The software does not push down a complete operation to the database if transactional loading is enabled.

2.4 MongoDB

The MongoDB adapter allows you to read data from MongoDB and load it to other SAP Data Services targets.

MongoDB is an open-source document database that stores JSON-like documents called BSON with dynamic schemas instead of traditional schema-based data.

Data Services needs metadata to gain access to data for task design and execution. Use Data Services processes
to generate schema by converting each row of the BSON file into XML and converting XML to XSD.

Data Services uses the converted metadata in XSD files to access MongoDB data.

2.4.1 Using MongoDB metadata

Use data from MongoDB as a source or target in a data flow, and also create templates.

The embedded documents and arrays in MongoDB are represented as nested data. SAP Data Services processes
can convert MongoDB BSON files to XML and then to XSD. Data Services saves the XSD file to the following
location: %DS_COMMON_DIR%\ext\mongo\mcache in your local drive.

Restrictions and limitations

Data Services has the following restrictions and limitations for working with MongoDB:

● In the MongoDB collection, the tag name should not contain special characters that are invalid in the XSD file (for example, >, <, &, /, \, #, and so on). If special characters exist, Data Services removes them.
● MongoDB data is always changing, so the XSD may not reflect the entire data structure of all the documents in MongoDB.
● Projection queries on adapters are not supported.
● Data Services ignores any new fields that you add after the metadata schema creation that were not present in
the common documents.
● Push down operators are not supported when using MongoDB as a target.

2.4.2 MongoDB as a source

Use MongoDB as a source in Data Services and then flatten the schema by using the XML_Map transform.

Example 1: This data flow changes the schema via the Query transform and then loads the data to an XML target.

Example 2: This data flow simply reads the schema and then loads it directly into an XML template file.

Example 3: This data flow flattens the schema using the XML_Map transform and then loads the data to a table or flat file.

Note
Specify conditions in the Query and XML_Map transforms. Some of them can be pushed down and others are
processed by Data Services.

Related Information

MongoDB query conditions [page 39]


Push down operator information [page 40]

2.4.2.1 MongoDB query conditions

Use query criteria to retrieve documents from a MongoDB collection.

Query criteria is used as a parameter of the db.<collection>.find() method. After dropping a MongoDB
table into a data flow as a source, open the source and add MongoDB query conditions.

To add a MongoDB query format, enter a value next to the Query criteria parameter in the Adapter Source tab.

Note
The query criteria should be in MongoDB query format. For example, { type: { $in: ['food', 'snacks'] } }.

For example, given a value of {prize:100}, MongoDB returns only rows that have a field named “prize” with a
value of 100. MongoDB won't return rows that don't match this condition. If you don’t specify a value, MongoDB
returns all the rows.

If you specify a Where condition in a Query or XML_Map transform that comes after the MongoDB source in the
data flow, Data Services pushes down the condition to MongoDB so that MongoDB returns only the rows that you
want.
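For illustration only, a condition such as the following (the collection and field names are hypothetical) could be pushed down as a roughly equivalent MongoDB filter:

Sample Code

-- Hypothetical Where condition in a Query or XML_Map transform that
-- follows the MongoDB source (CUSTOMERS is the imported collection):
CUSTOMERS.STATUS = 'active' AND CUSTOMERS.AGE >= 21

-- Roughly equivalent MongoDB filter that could be pushed down:
-- { $and: [ { STATUS: 'active' }, { AGE: { $gte: 21 } } ] }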

For more information about the MongoDB query format, see the MongoDB website.

Note
When you use the XML_Map transform, you may have a query condition with a SQL format. When this happens,
Data Services converts the SQL format to the MongoDB query format and uses the MongoDB specification to
push down operations to the source database. In addition, be aware that Data Services does not support push
down of query for nested arrays.

Related Information

Push down operator information [page 40]

2.4.2.2 Push down operator information

How SAP Data Services processes push down operators in a MongoDB source.

Data Services does not push down Sort by conditions but it does push down Where conditions. However, if you
use a nested array in a Where condition, Data Services does not push down the nested array.

Note
Data Services does not support push down operators when you use MongoDB as a target.

Data Services supports the following operators when you use MongoDB as a source:

● Comparison operators =, !=, >, >=, <, <=, like, and in.
● Logical operators and and or in SQL query.

2.4.3 MongoDB as a target

Use MongoDB as a target in your data flow.

Note
The _id field is considered the primary key. If you create a new document with a field named _id, that field will
be recognized as the unique BSON ObjectID. If a document contains more than one _id field (at a different
level), only the _id field in the first level will be considered the ObjectID.

You can set the following options in the Adapter Target tab of the target document editor:

Option Description

Use auto correct Specifies basic operations when using MongoDB as your target datastore. The following values are
available:

● True: The writing behavior is in Upsert mode. The software updates the document with the
same _id or it inserts a new _id.

Note
Using True may slow the performance of writing operations.

● False (default): The writing behavior is in Insert mode. If documents have the same _id in the
MongoDB collection, then an error message appears.


Write concern level Write concern is a guarantee that MongoDB provides when reporting on the success of a write oper­
ation. This option allows you to enable or disable different levels of acknowledgment for writing oper­
ations.

The following values are available:

● Acknowledged (default): Provides acknowledgment of write operations on a standalone


mongod or the primary in a replica set.
● Unacknowledged: Disables the basic acknowledgment and only returns errors of socket excep­
tions and networking errors.
● Replica Set Acknowledged: Guarantees that write operations have propagated successfully to
the specified number of replica set members, including the primary.
● Journaled: Acknowledges the write operation only after MongoDB has committed the data to a
journal.
● Majority: Confirms that the write operations have propagated to the majority of voting nodes.

Use bulk Indicates whether or not you want to execute write operations in bulk, which provides better performance.

When set to True, the software runs write operations in bulk for a single collection in order to optimize CRUD efficiency.

If a bulk write operation contains more than 1000 operations, MongoDB automatically splits it into multiple bulk groups.

For more information about bulk, ordered bulk, and bulk maximum rejects, see the MongoDB documentation at http://help.sap.com/disclaimer?site=http://docs.mongodb.org/manual/core/bulk-write-operations/.

Use ordered bulk Specifies if you want to execute the write operations in serial (True) or parallel (False) order. The de­
fault value is False.

If you execute in parallel order (False), then MongoDB processes the remaining write operations even
when there are errors.

Documents per commit Specifies the maximum number of documents that are loaded to a target before the software saves
the data. If this option is left blank, the software uses 1000 (default).

Bulk maximum rejects Specifies the maximum number of acceptable errors before Data Services fails the job. Note that
data will still load to the target MongoDB even if the job fails.

For unordered bulk loading, if the number of errors is less than, or equal to, the number you specify
here, Data Services allows the job to succeed and logs a summary of errors in the adapter instance
trace log.

Enter -1 to ignore any bulk loading errors. Errors will not be logged in this situation.

Note
This option does not apply when Use ordered bulk is set to True.

Delete data before loading Deletes existing documents in the current collection before loading occurs, and retains all the configuration, including indexes, validation rules, and so on.

Drop and re-create Drops the existing MongoDB collection and creates a new one with the same name before loading
occurs. If Drop and re-create is set to True, the software ignores the value of Delete data before
loading. This option is available for template documents only. The default value is True.


Use audit Logs data for auditing. Data Services creates audit files containing write operation information and
stores them in the <DS_COMMON_DIR>/adapters/audits/ directory. The name of the file is
<MongoAdapter_instance_name>.txt.

Here's what you can expect to see when using this option:

● If a regular load fails and Use audit is set to False, loading errors appear in the job trace log.
● If a regular load fails and Use audit is set to True, loading errors appear in the job trace log and in
the audit log.
● If a bulk load fails and Use audit is set to False, the job trace log provides a summary, but it does
not contain details about each row of bad data. There is no way to obtain details about bad data.
● If a bulk load fails and Use audit is set to True, the job trace log provides a summary, but it does
not contain details about each row of bad data. However, the job trace log tells you where to look
in the audit file for this information.

2.4.4 MongoDB template documents

Use template documents as a target in one data flow or as a source in multiple data flows.

Template documents are particularly useful in early application development when you are designing and testing a project. Find template documents in the Datastore tab of the Local Object Library: expand the MongoDB datastore and find the Template Documents node.

When you import a template document, the software converts it to a regular document. You can use the regular
document as a target or source in your data flow.

Note
Template documents are available in Data Services 4.2.7 and later. If you are upgrading from a previous version,
you need to edit the MongoDB datastore and then click OK to see the Template Documents node and any other
template document related options.

Template documents are similar to template tables. For information about template tables, see the Data Services
User Guide and the Reference Guide.

2.4.4.1 Creating template documents

Create MongoDB template documents and use them as targets or sources in data flows.

1. In Data Services Designer, click the template icon from the tool palette.
2. Click inside a data flow in the workspace.

The template appears in the workspace.


3. Open the Datastore tab in the Local Object Library and choose the MongoDB datastore.
4. In the Create Template window, enter a template name.

Note
The maximum length of the collection namespace (<database>.<collection>) should not exceed 120
bytes.

5. Click OK.
6. To use the template document as a target in the data flow, connect the template document to an input object.
7. Click Save.

Linking a data source to the template document and then saving the project generates a schema for the
template document. The icon changes in the workspace and the template document appears in the Template
Documents node under the datastore in the Local Object Library.

Drag template documents from the Template Documents node into the workspace to use them as a source.

Related Information

Previewing MongoDB document data [page 44]


MongoDB as a source [page 38]
MongoDB as a target [page 40]

2.4.4.2 Converting a template document into a regular document

Importing a template document to convert it into a regular document.

Use one of the following methods to import a MongoDB template document:

● Open a data flow and select one or more template target documents in the workspace. Right-click, and choose
Import Document.
● Select one or more template documents in the Local Object Library, right-click and choose Import Document.

The icon changes and the document appears under Documents instead of Template Documents in the Local Object
Library.

Note
The Drop and re-create configuration option is available only for template target documents. Therefore it is not
available after you convert the template target into a regular document.

Related Information

Re-importing schemas [page 44]

2.4.5 Previewing MongoDB document data

Data preview allows you to view a sampling of MongoDB data from documents.

To preview MongoDB document data, right-click on a MongoDB document name in the Local Object Library or on a
document in the data flow and then select View Data.

You can also click the magnifying glass icon on a MongoDB source and target object in the data flow.

Note
By default, the maximum number of rows displayed for data preview is 100. To change this number, use the
Rows To Scan adapter datastore configuration option. Enter -1 to display all rows.

For more information, see “Using View Data”, “Viewing and adding filters”, and “Sorting” in the Designer Guide.

2.4.6 Parallel Scan

SAP Data Services uses a Parallel Scan process to improve performance while it generates metadata for big data.

Generating metadata can be time consuming because Data Services needs to first scan all documents in the
MongoDB collection. Parallel Scan allows Data Services to use multiple parallel cursors when reading all the
documents in a collection, thus increasing performance.

Note
Parallel Scan works with MongoDB server version 2.6.0 and above.

For more information about the parallelCollectionScan command, consult the MongoDB documentation.

2.4.7 Re-importing schemas

The software honors the MongoDB adapter datastore settings when re-importing.

To re-import a single document, right-click on the document and click Reimport.

To re-import all documents, right-click on a MongoDB datastore or on the Documents node and click Reimport All.

Note
When Use Cache is enabled, the software uses the cached schema.

When Use Cache is disabled, the software looks in the sample directory for a sample JSON file with the same
name. If there is a matching file, the software uses the schema from that file. If there isn't a matching JSON file
in the sample directory, the software re-imports the schema from the database.

2.4.8 Searching for MongoDB documents in the repository

Search for MongoDB documents in a repository from within the object library.

1. In the Designer, right-click in the object library and choose Search.


The Search window appears.
2. Select the MongoDB datastore name to which the document belongs from the Look in drop-down menu.
Choose Repository to search the entire repository.
3. Select Documents from the Object Type drop-down menu.
4. Enter the criteria for the search.
5. Click Search.
The documents matching your entries are listed in the window. A status line at the bottom of the Search
window shows where the search was conducted (Local or Central), the total number of items found, and the
amount of time it took to complete the search.

2.5 SAP HANA

Process your SAP HANA data in SAP Data Services by creating an SAP HANA database datastore.

Use SAP HANA database datastores as sources and targets in Data Services processes. Protect your HANA data
using SSL protocol and cryptographic libraries. Create stored procedures and enable bulk loading for faster
reading and loading. Additionally, load spatial and complex spatial data from Oracle to SAP HANA.

Note
Beginning with SAP HANA 2.0 SP1, you can access databases only through a multitenant database container
(MDC). If you use a version of SAP HANA that is earlier than 2.0 SP1, you can access only a single database.

2.5.1 Configure DSN SSL for SAP HANA

Configure SAP HANA database datastores to use SSL encryption for all network transmissions between the
database server and SAP Data Services.

Caution
Only an administrator or someone with sufficient experience should configure SSL encryption for SAP HANA.

Using DSN SSL for SAP HANA network transmissions is available in SAP Data Services version 4.2 SP7 (14.2.7.0) or later.

Configure SSL on both the SAP HANA server side and the Data Services client side.

SSL encryption for SAP HANA database datastores requires a DSN (data source name) connection. You cannot
use a server name connection.

Note
Enabling SSL encryption slows down job performance but may be necessary for security purposes.

The tasks for enabling SSL encryption require you to have either the SAPCrypto library or the OpenSSL library.
These libraries may have been included with the database or with the platform you use. If you do not have either of
these libraries, or you have older versions, download the latest versions from the SAP Support Portal. To configure
the server side, make settings in the communication section of the global.ini file.

For more information about cryptographic libraries and settings for secure external connections in the
global.ini file for SAP HANA database, see the SAP HANA Network and Communication Security section of the
SAP HANA Security Guide.

2.5.1.1 Cryptographic libraries and global.ini settings

When you create an SAP HANA database datastore with SSL encryption, configure the database server and SAP
Data Services for certificate authentication.

On the database server side, make settings in the communications section of the global.ini file based on the
cryptographic library you use.

For more information about cryptographic libraries and settings for secure external connections in the
global.ini file for SAP HANA database, see the SAP HANA Network and Communication Security section of the
SAP HANA Security Guide.

The following table lists the requirements for each type of SAP Data Services SSL provider.

SSL providers for SAP HANA

SSL provider Requirements

SAPCrypto 1. Install SAPCrypto if necessary.


2. Set the environment variable SECUDIR in your system to
the location of the SAPCrypto library.
3. Ensure that you set the global.ini based on the infor­
mation in the SAP HANA database documentation.
4. Import the SAP HANA server certificate into your SAP
Data Services trusted certificate folder using Microsoft
Management Console.


OpenSSL Installed by default with your operating system.

1. Make the suggested global.ini connection settings and then restart the server. For suggested global.ini settings, see the SAP HANA database documentation.
2. Configure an SAP HANA data source with SSL informa­
tion. See the following topics for steps:
○ Windows: Follow Configuring SSL for SAP HANA on
Windows [page 47]
○ Unix: Follow Configuring SSL for SAP HANA on Unix
[page 48]
3. Create the SAP HANA database datastore using the SSL
DSN that you created.
4. Import the server certificate using Microsoft Manage­
ment Console.

Related Information

Configuring SSL for SAP HANA on Windows [page 47]


Configuring SSL for SAP HANA on Unix [page 48]

2.5.1.2 Configuring SSL for SAP HANA on Windows

Configure SSL (Secure Sockets Layer) encryption for an SAP HANA database datastore on a Windows operating system.

Verify the type of your SSL encryption. It should be either SAPCrypto or OpenSSL.

1. Open Microsoft Management Console (MMC).


2. Select to import the SAP HANA certificate file into the Data Services trusted root certificates folder.

For information about MMC, see the Microsoft Web site https://msdn.microsoft.com/en-us/library/bb742442.aspx.
3. Open the ODBC Data Source Administrator.

Access the ODBC Data Source Administrator either from the Datastore Editor in Data Services Designer or
directly from your Start menu.
4. In the ODBC Data Source Administrator, open the System DSN tab and click Add.
5. Select the driver HDBODBC and click Finish.
6. Enter values in Data Source Name and Description.
7. Enter the Server:Port information and click Settings.
8. In the SSL Connection group, select the following options:

○ Connect using SSL
○ Validate the SSL certificate
9. Complete any remaining settings as applicable and click OK.
10. Close the ODBC Data Source Administrator.

Create the SAP HANA datastore using the SSL DSN that you just created.

Related Information

Creating an SAP HANA datastore with SSL encryption [page 50]

2.5.1.3 Configuring SSL for SAP HANA on Unix

Configure SSL encryption for an SAP HANA database datastore on a Unix operating system.

Verify the type of your SSL encryption. It should be either SAPCrypto or OpenSSL.

During this configuration process, use the SAP Data Services Connection Manager. Read about the Connection
Manager in the Administrator Guide.

1. Export $ODBCINI to a file in the same computer as the SAP HANA data source. For example:

Sample Code

export ODBCINI=<dir_path>/odbc.ini
touch $ODBCINI

2. Start Connection Manager by entering the following command:

Sample Code

$LINK_DIR/bin/DSConnectionManager.sh

3. Click the Data Sources tab and click Add to display the list of database types.
4. On the Select Database Type window, select the SAP HANA database type and click OK.

The configuration page opens with some of the connection information automatically completed:
○ Absolute location of the odbc.ini file
○ Driver for SAP HANA
○ Driver Version
5. Complete the remaining applicable options including:

○ DSN Name
○ Driver
○ Server Name
○ Instance

○ User Name
○ Password
6. Select Y for Specify the SSL Encryption Option.
7. Complete the remaining SSL options based on the descriptions in the following table.

SAP HANA DSN SSL connection options in Connection Manager

Option Description

SSL Encryption Option (Y/N) Set to Y so that the client server, Data Services, verifies the
certificate from the database server before accepting it. If
you set to N, the client server accepts the certificate without
verifying it, which is less secure.

HANA SSL Provider (sapcrypto/openssl) Specify the cryptographic provider for your SAP HANA SSL
connectivity. Options are:
○ OpenSSL (.pem)
○ SAPCrypto (.pse)

SSL Certificate File Enter the location and file name for the SSL certificate file.

SSL Key File Enter the location and file name for the SSL key file.

If you choose OpenSSL (.pem) for the HANA SSL provider option, use the Data Services bundled OpenSSL and
not your operating system OpenSSL.

8. To ensure that you use the Data Services bundled OpenSSL, follow these substeps:
a. Check your OpenSSL version and dependencies with the shared library (using the ldd command).
For example, if your client operating system has OpenSSL version 0.9.8, run the following command:

Sample Code

ldd /usr/bin/openssl
linux-vdso.so.1 => (0x00007fff37dff000)
libssl.so.0.9.8 => /usr/lib64/libssl.so.0.9.8 (0x00007f0586e05000)
libcrypto.so.0.9.8 => /usr/lib64/libcrypto.so.0.9.8
(0x00007f0586a65000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f058682f000)
libz.so.1 => /build/i311498/ds427/dataservices/bin/libz.so.1
(0x00007f0586614000)
libc.so.6 => /lib64/libc.so.6 (0x00007f058629c000)
/lib64/ld-linux-x86-64.so.2 (0x00007f058705c000)

b. Create a soft link in <LINK_DIR>/bin. Use the same version name but refer to the Data Services SSL
libraries:

Sample Code

# Link names follow the versions reported by ldd; the link targets are the Data Services SSL libraries
ln -s libbodi_ssl.so.1.0.0 libssl.so.0.9.8
ln -s libbodi_crypto.so.1.0.0 libcrypto.so.0.9.8

When you have completed the configuration, the Connection Manager automatically tests the connection.

Create the SAP HANA datastore using the SSL DSN that you just created.

2.5.2 Creating an SAP HANA datastore with SSL encryption

Create an SAP HANA database datastore with SSL encryption. SSL encryption protects data as it is transferred
between the database server and Data Services.

An administrator must import and configure the SAP HANA database certificate. Additionally, you must create an
SSL data source (DSN) so that it is available to choose when you create the datastore. Information about
importing and configuring an SAP HANA database certificate is in the Administrator Guide.

SSL encryption is available in SAP Data Services version 4.2 SP7 (14.2.7.0) or later.

Note
Enabling SSL encryption will slow down job performance.

Note
An SAP HANA database datastore requires that you choose DSN as a connection method. DSN-less
connections are not allowed when you enable SSL encryption.

Note
If you are using SAP HANA version 2.0 SPS 01 multitenancy database container (MDC) or later, specify the port
number and the database server name specific to the tenant database you are accessing.

1. In Designer, select Project > New > Datastore.


2. Complete the options as you would for an SAP HANA database datastore. Complete the following options
specifically for SSL encryption:

SSL-specific options

Option Value

Use Data Source Name (DSN) Select

Data Source Name Select the SAP HANA SSL DSN data source file that was
created previously (see Prerequisites above).

Find descriptions for all of the SAP HANA database datastore options in the Reference Guide.

3. Complete the remaining applicable Advanced options and save your datastore.

2.5.3 SAP HANA datastore options

When you create an SAP HANA database datastore, there are several options and settings that are unique for SAP
HANA.

Beginning with SAP HANA 2.0 SPS 01 MDC, use a database datastore to access a specified tenant database.

Possible values Description


Option

Database version SAP HANA database <version number> Select the version of your SAP HANA database client
(the version of the SAP HANA database that this data­
store accesses).

Use data source Checkbox selected or not selected Select to use a data source name (DSN) to connect to
name (DSN) the database.

Note
For SSL encryption, use the DSN SSL that you cre­
ated in “Configure DSN SSL for SAP HANA” in the
Administrator Guide.

This option is not selected by default. When not selected, the software uses a server name, also known as a DSN-less connection. For a DSN-less connection, complete the Database server name and Port options.

Database server Computer name Enter the name of the computer where the SAP HANA
name server is located.

This option is required if you did not select Use data source name (DSN).

If you are connecting to HANA MDC, enter the HANA database server name for the applicable tenant database.

Port Five-digit integer Enter the port number to connect to the SAP HANA server.

Default: 30015

This option is required if you did not select Use data source name (DSN).

If you are connecting to SAP HANA 2.0 SPS 01 MDC or later, enter the port number of the specific tenant database.

Note
See SAP HANA documentation to learn how to find
the specific tenant database port number.


Data source name Refer to the requirements of your database Select or type the data source name that you defined in
the ODBC Administrator for connecting to your data­
base.

Note
For SSL encryption, use the DSN SSL that you cre­
ated in “Configure DSN SSL for SAP HANA” in the
Administrator Guide.

This option is required when you select Use data source


name (DSN).

User name Alphanumeric characters and underscores Enter the user name of the account through which the
software accesses the database.

Password Alphanumeric characters, underscores, and Enter the user password.


punctuation

Enable automatic data transfer This option is selected by default when you create a new datastore and choose Database for the Datastore type option.

Select this option so that the Data Transfer transform pushes down subsequent database operations to the database.

Database name Refer to the requirements of your database Optional. Enter the specific tenant database name.

Applicable for SAP HANA versions 2.0 SPS 01 MDC and


later.

Additional Alphanumeric characters and underscores, or Enter information for any additional parameters that
connection blank the data source supports (parameters that the data
information source ODBC driver and database support). Use the
format:

<parameter1=value1; parameter2=val
ue2>

Rows per commit Positive integer Enter the maximum number of rows loaded to a target
table before saving the data.

This value is the default commit size for target tables in


this datastore. You can overwrite this value for individ­
ual target tables.

Overflow file Directory path or click Browse. Enter the location of overflow files written by target ta­
directory bles in this datastore. You could also use a variable.

Aliases (Click here Enter the alias name and the owner name to which the
to create) alias name maps.

2.5.4 SAP HANA target table options

Use SAP HANA tables as targets in a data flow when applicable, and complete the options specific to SAP HANA.

Options
Option Description

Table type For template tables, select the appropriate table type for your SAP HANA target:

● Column Store (default)


● Row Store

Bulk loading
Option Description

Bulk load Select to enable bulk loading.

Mode Specify the mode for loading data to the target table:

● Append: Adds new records to the table (default).


● Truncate: Deletes all existing records in the table and then adds new records.

Commit size default: Data Services identifies the SAP HANA target table type and applies a default com­
mit size for the maximum number of rows loaded to the staging and target tables before sav­
ing the data (committing):

● Column Store: commit size is 10,000


● Row Store: commit size is 1,000

You can also type any value in the field that is greater than one.

Update method Specify how the input rows are applied to the target table:

Default: Data Services applies the default value for this option based on the SAP HANA target
table type:

● Column Store tables use UPDATE.


● Row Store tables use DELETE-INSERT.

● UPDATE: Issues an UPDATE to the target table.


● DELETE-INSERT: Issues a DELETE to the target table for data that matches the old data
in the staging table, and then issues an INSERT with the new data.

Note
Do not use DELETE-INSERT if the update rows do not contain data for all columns in
the target table, because Data Services will replace missing data with NULLs.

Related Information

Performance Optimization Guide: Using Bulk Loading, Bulk loading in SAP HANA [page 54]

2.5.5 Creating stored procedures in SAP HANA

SAP Data Services supports SAP HANA stored procedures with zero, one, or more output parameters.

Data Services supports scalar data types for input and output parameters. Data Services does not support table
data types. If you try to import a procedure with table data type, the software issues an error. Data Services does
not support data types such as binary, blob, clob, nclob, or varbinary for SAP HANA procedure parameters.

Procedures can be called from a script or from a Query transform as a new function call.

Example
Syntax

The SAP HANA syntax for the stored procedure:

CREATE PROCEDURE GET_EMP_REC (IN EMP_NUMBER INTEGER, OUT EMP_NAME VARCHAR(20),


OUT EMP_HIREDATE DATE) AS
BEGIN
SELECT ENAME, HIREDATE
INTO EMP_NAME, EMP_HIREDATE
FROM EMPLOYEE
WHERE EMPNO = EMP_NUMBER;
END;
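For illustration, after you import the procedure you could test it directly against SAP HANA with a CALL statement like the following sketch (the employee number is a hypothetical value; the question marks stand for the output parameters):

Sample Code

-- Minimal sketch: calling the procedure from the SAP HANA SQL console
CALL GET_EMP_REC(1001, ?, ?);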

Limitations

SAP HANA provides limited support of user-defined functions that can return one or several scalar values. These
user-defined functions are usually written in L. If you use user-defined functions, limit them to the projection list
and the GROUP BY clause of an aggregation query on top of an OLAP cube or a column table. These functions are
not supported by Data Services.

SAP HANA procedures cannot be called from a WHERE clause.

2.5.6 Bulk loading in SAP HANA

SAP Data Services improves bulk load performance by using a staging mechanism during bulk loading to the SAP
HANA database.

When Data Services uses changed data capture (CDC) or auto correct load, it uses a temporary staging table to
load the target table. Data Services loads the data to the staging table and applies the operation codes INSERT,
UPDATE, and DELETE to update the target table. With the Bulk load option selected in the target table editor, any
one of the following conditions triggers the staging mechanism:

● The data flow contains a Map CDC Operation transform.


● The data flow contains a Map Operation transform that outputs UPDATE or DELETE rows.
● The data flow contains a Table Comparison transform.
● The Auto correct load option in the target table editor is set to Yes.

If none of these conditions are met, that means the input data contains only INSERT rows. Therefore Data Services
does only a bulk insert operation, which does not require a staging table or the need to execute any additional SQL.

By default, Data Services automatically detects the SAP HANA target table type and updates the table accordingly
for optimal performance.

Because the bulk loader for SAP HANA is scalable and supports UPDATE and DELETE operations, the following
options in the target table editor are also available for bulk loading:

● Use input keys


● Auto correct load

Find these options in the Target Table editor under Options > Advanced > Update Control.

Related Information

Reference Guide: Objects, SAP HANA target table options [page 53]

2.5.7 Metadata mapping for SAP HANA

Data Services performs data type conversions when it imports metadata from external sources or targets into the
repository and when it loads data into an external table or file.

Data Services uses its own conversion functions instead of conversion functions that are specific to the database
or application that is the source of the data.

Additionally, if you use a template table or Data_Transfer table as a target, the software converts from internal data
types to the data types of the respective DBMS.

2.5.7.1 SAP HANA

Data type conversion when SAP Data Services imports metadata from an SAP HANA source or target into the
repository and then loads data to an external table or file.

SAP HANA data type Converts to Data Services data type

integer int

tinyint int

smallint int

bigint decimal

char varchar

nchar varchar

varchar varchar


nvarchar varchar

decimal or numeric decimal

float double

real real

double double

date date

time time

timestamp datetime

clob long

nclob long

blob blob

binary blob

varbinary blob

The following table shows the conversion from internal data types to SAP HANA data types in template tables.

Data Services data type Converts to SAP HANA data type

blob blob

date date

datetime timestamp

decimal decimal

double double

int integer

interval real

long clob/nclob

real decimal

time time

timestamp timestamp

varchar varchar/nvarchar

2.5.8 Using spatial data with SAP HANA

SAP Data Services supports spatial data (such as point, line, polygon, collection, or a heterogeneous collection) for specific databases.

The following list contains specific databases that support spatial data in SAP Data Services:

● Microsoft SQL Server for reading
● Oracle for reading
● SAP HANA for reading and loading

When you import a table with spatial data columns, Data Services imports the spatial type columns as character
based large objects (clob). The column attribute is Native Type, which has the value of the actual data type in the
database. For example, Oracle is SDO_GEOMETRY, Microsoft SQL Server is geometry/geography, and SAP HANA
is ST_GEOMETRY.

Limitations

● You cannot create template tables with spatial types because spatial columns are imported into Data Services
as clob.
● You cannot manipulate spatial data inside a data flow because the spatial utility functions are not supported.

2.5.8.1 Loading spatial data to SAP HANA

Load spatial data from Oracle or Microsoft SQL Server to SAP HANA.

Learn more about spatial data by reading the SAP HANA documentation.

1. Import a source table from Oracle or Microsoft SQL Server to SAP Data Services.
2. Create a target table in SAP HANA with the appropriate spatial columns (a sketch of such a table follows these steps).
3. Import the SAP HANA target table into Data Services.
4. Create a data flow with an Oracle or Microsoft SQL Server source as reader.
Include any necessary transformations.
5. Add the SAP HANA target table as a loader.
Make sure not to change the data type of spatial columns inside the transformations.
6. Build a job that includes the data flow and run it to load the data into the target table.
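For step 2, a minimal sketch of such a target table in SAP HANA SQL follows; the schema, table, column names, and SRID 4326 are hypothetical and should match your source data:

Sample Code

-- Hypothetical SAP HANA target table with a spatial column
CREATE COLUMN TABLE "MYSCHEMA"."SPATIAL_TARGET" (
   "ID" INTEGER PRIMARY KEY,
   "NAME" NVARCHAR(100),
   "SHAPE" ST_GEOMETRY(4326)
);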

2.5.8.2 Loading complex spatial data from Oracle to SAP HANA

Complex spatial data is data such as circular arcs and LRS geometries.

1. Create an Oracle datastore for the Oracle table.

For instructions, see the guide Supplement for Oracle Applications.


2. Import a source table from Oracle to SAP Data Services using the Oracle datastore.
3. Create a target table in SAP HANA with the appropriate spatial columns.
4. Import the SAP HANA target table into Data Services.
5. Create a data flow in Data Services, but instead of including an Oracle source, include a SQL transform as
reader.
6. Retrieve the data from the Oracle database directly: open the SQL transform and add a SQL SELECT statement that calls the following functions against the spatial data column:

○ SDO_UTIL.TO_WKTGEOMETRY
○ SDO_GEOM.SDO_ARC_DENSIFY

For example, in the SQL below, the table name is “Points”. The “geom” column contains the following
geospatial data:

SELECT
SDO_UTIL.TO_WKTGEOMETRY(
SDO_GEOM.SDO_ARC_DENSIFY(
geom,
(MDSYS.SDO_DIM_ARRAY(
MDSYS.SDO_DIM_ELEMENT('X',-83000,275000,0.0001),
MDSYS.SDO_DIM_ELEMENT('Y',366000,670000,0.0001)
)),
'arc_tolerance=0.001'
)
)
from "SYSTEM"."POINTS"

For more information about how to use these functions, see the Oracle Spatial Developer's Guide on the Oracle
Web page at SDO_GEOM Package (Geometry) .
7. Build a job in Data Services that includes the data flow and run it to load the data into the target table.

3 Data Services Connection Manager

Use the Data Services Connection Manager for Unix platforms to configure ODBC databases and ODBC drivers to
use specific databases as repositories, sources, and targets in Data Services.

The Connection Manager is a command-line utility. However, a graphical user interface (GUI) is available.

Note
To use the graphical user interface for Connection Manager, install the GTK+2 library. The GTK+2 is a free multi-
platform toolkit that creates user interfaces. For more information about obtaining and installing GTK+2, see
https://help.sap.com/viewer/disclaimer-for-links?q=https%3A%2F%2Fwww.gtk.org%2F.

When you use DSConnectionManager.sh in the command line, the -c parameter must be the first parameter.

If an error occurs when using the Connection Manager, use the -d option to show details in the log.

Example
$LINK_DIR/bin/DSConnectionManager.sh -c -d

Note
For Windows installation, use the ODBC Driver Selector to configure ODBC databases and drivers for
repositories, sources, and targets.

4 Cloud computing services

SAP Data Services provides access to various cloud databases and storages to use for reading or loading big data.

4.1 Cloud databases

Access various cloud databases through file location objects and file format objects.

SAP Data Services supports many cloud database types to use as readers and loaders in a data flow.

4.1.1 Amazon Redshift database

Redshift is a cloud database designed for large data files.

In SAP Data Services, you create a database datastore to access your data from Amazon Redshift. Additionally, you can load Amazon S3 data files into Redshift using the built-in function load_from_s3_to_redshift.

4.1.1.1 Amazon Redshift

Option descriptions for creating an Amazon Redshift database datastore.

Use the Amazon Redshift ODBC driver to connect to the Redshift cluster database. The Redshift ODBC driver
connects to Redshift on Windows and Linux platforms only.

For information about downloading and installing the Amazon Redshift ODBC driver, see the Amazon Redshift
documentation on the Amazon website.

Note
SSL settings are managed through the Amazon Redshift ODBC Driver. In the Amazon Redshift ODBC Driver DSN
Setup window, set the SSL Authentication option to allow.

Use a Redshift database datastore for the following tasks:

● Import tables
● Read or load Redshift tables in a data flow
● Preview data
● Create and import template tables
● Load Amazon S3 data files into a Redshift table using the built-in function load_from_s3_to_redshift

For more information about template tables and data preview, see the Designer Guide.

Main window

Redshift option Possible values Description

Database Version Redshift <version number> Enter the Redshift database version. For example, Red­
shift 8.<x>.

Data Source Name Refer to the requirements of your Type the data source name (DSN) configuration name,
database. which is defined in Amazon Redshift ODBC Driver, for
connecting to your database.

User Name Alphanumeric characters and un­ Enter the user name of the account through which Data
derscores Services accesses the database.

Password Alphanumeric characters, under­ Enter the user's database password.


scores, and punctuation

Enable Automatic Data n/a Enables transfer tables in this datastore, which the
Transfer Data_Transfer transform can use to push down subse­
quent database operations.

This option is enabled by default.

Connection

Redshift option Possible values Description

Additional connection parameters Alphanumeric characters and under­ Enter information for any additional con­
scores, or blank nection parameters. Use the format:
<parameter1=value1;
parameter2=value2>

General

Redshift option Possible values Description

Rows per commit Positive integer Enter the maximum number of rows loaded to a target ta­
ble before saving the data. This value is the default com­
mit size for target tables in this datastore. You can over­
write this value for individual target tables.

Bulk loader directory Directory path or click Browse Enter the location where data files are written for bulk loading. You can enter a variable for this option.

The default value is %ds_common_dir%/log/bulkload.

Overflow file directory Directory path or click Browse Enter the location of overflow files written by target tables
in this datastore. A variable can also be used.

Session

Redshift option Possible values Description

Additional session parameters A valid SQL statement or multiple Additional session parameters specified as valid SQL
SQL statements delimited by a semi­ statement(s)
colon

Aliases
Redshift option Possible values Description

Aliases Alphanumeric characters and the un­ Enter the alias name of the database owner. For more
derscore symbol (_) information, see “Creating an alias” in the Designer
Guide.

Related Information

Amazon S3 protocol [page 75]


load_from_s3_to_redshift [page 77]
Amazon Redshift data types [page 66]
Amazon Redshift source [page 64]
Amazon Redshift target table options [page 64]
Configuring Amazon Redshift as a data source using DSConnectionManager [page 62]

4.1.1.2 Configuring Amazon Redshift as a data source using DSConnectionManager

Information about how to configure Amazon Redshift as a data source in DSConnectionManager.

1. Download and install the Amazon Redshift ODBC driver for Linux. For more information, see “Install the
Amazon Redshift ODBC Driver on Linux Operating Systems” in the Amazon Redshift Management Guide on
the Amazon website ( http://docs.aws.amazon.com/redshift/latest/mgmt/install-odbc-driver-linux.html ).

After installing the ODBC driver on Linux, you'll need to configure the following files:
○ amazon.redshiftodbc.ini
○ odbc.ini
○ odbcinst.ini

For more information about these files and other configuration information, see “Configure the ODBC Driver
on Linux and Mac OS X Operating Systems” in the Amazon Redshift Management Guide on the Amazon
website (http://docs.aws.amazon.com/redshift/latest/mgmt/odbc-driver-configure-linux-mac.html ).
2. At the end of /opt/amazon/redshiftodbc/lib/64/amazon.redshiftodbc.ini, add a line to point to
the libodbcinst.so file. This file is in the unixODBC/lib directory.

For example, ODBCInstLib=/home/ec2-user/unixODBC/lib/libodbcinst.so.

In addition, in the [Driver] section of the amazon.redshiftodbc.ini file , set DriverManagerEncoding to
UTF-16.

For example,

[Driver]
DriverManagerEncoding=UTF-16

3. Configure the Linux ODBC environment.


a. Run DSConnectionManager.sh and configure a data source for Redshift.

Note
The Unix ODBC Lib Path is based on where you install the driver. For example, for Unix ODBC 2.3.4 the
path would be /build/unixODBC-232/lib.

Specify the DSN name from the list or add a new one:
DS42_REDSHIFT
Specify the User Name:
<name of the user>
Type database password:(no echo)
Retype database password:(no echo)
Specify the Unix ODBC Lib Path:
/build/unixODBC-232/lib
Specify the Driver:
/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so
Specify the Driver Version:'8'
8
Specify the Host Name:
<host name/IP address>
Specify the Port:
<port number>
Specify the Database:
<database name>
Specify the Redshift SSL certificate verification mode
[require|allow|disable|prefer|verify-ca|verify-full]:'require'
require
Testing connection...
Successfully added database source.

Related Information

Amazon Redshift data types [page 66]


Amazon Redshift source [page 64]
Amazon Redshift target table options [page 64]
Amazon S3 protocol [page 75]
load_from_s3_to_redshift [page 77]
Amazon Redshift [page 60]

4.1.1.3 Amazon Redshift source

Option descriptions for using an Amazon Redshift database table as a source in a data flow.

When you use an Amazon Redshift table as a source, the software supports the following features:

● All Redshift data types


● Optimized SQL
● Basic push-down functions

The following list describes Data Services behavior differences when you use certain functions with Amazon Redshift:

● When using add_month(datetime, int), pushdown doesn't occur if the second parameter is not in an
integer data type.
● When using cast(input as ‘datatype’), pushdown does not occur if you use the real data type.
● When using to_char(input, format), pushdown doesn't occur if the format is ‘XX’ or a number such as
‘099’, ‘999’, ‘99D99’, ‘99G99’.
● When using to_date(date, format), pushdown doesn't occur if the format includes a time part, such as
‘YYYY-MM-DD HH:MI:SS’.
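As an illustration of the to_date behavior (the table and column names are hypothetical), the first mapping expression below can be pushed down while the second is evaluated by Data Services:

Sample Code

-- Pushed down to Redshift: the format contains only a date part
to_date(ORDERS.ORDER_DATE, 'YYYY-MM-DD')

-- Not pushed down: the format includes a time part
to_date(ORDERS.ORDER_DATE, 'YYYY-MM-DD HH:MI:SS')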

For more information, see SAP Note 2212730 and “Maximizing Push-Down Operations” in the Performance
Optimization Guide.
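
As a rough illustration of the to_date behavior above, compare the following two mapping expressions. The
column names are placeholders; assuming a Redshift source table, the first expression can be pushed down
because the format contains only a date part, while the second is evaluated by the Data Services engine because
the format includes a time part.

Sample Code

# Eligible for pushdown to Redshift: the format has no time part
to_date(ORDER_DATE_STR, 'YYYY-MM-DD')

# Not pushed down: the format includes a time part
to_date(ORDER_TS_STR, 'YYYY-MM-DD HH:MI:SS')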

The following table lists source options when you use an Amazon Redshift table as a source:

Option Description

Table name Name of the table that you added as a source to the data flow.

Table owner Owner that you entered when you created the Redshift table.

Datastore name Name of the Redshift datastore.

Database type Database type that you chose when you created the datastore. You cannot change this
option.

The Redshift source table also uses common table source options.

Related Information

Amazon Redshift data types [page 66]


Amazon Redshift target table options [page 64]
Amazon Redshift [page 60]

4.1.1.4 Amazon Redshift target table options

Descriptions of options for using an Amazon Redshift table as a target in a data flow.

The Amazon Redshift target supports the following features:

● input keys
● auto correct
● data deletion from a table before loading
● transactional loads
● load triggers, pre-load commands, and post-load commands
● bulk loading
When you use the bulk load feature, Data Services generates files and saves the files to the bulk load directory
that is defined in the Amazon Redshift datastore. If there is no value set for the bulk load directory, the
software saves the data files to the default bulk load location at: %DS_COMMON_DIR%/log/BulkLoader. Data
Services then copies the files to Amazon S3 and executes the Redshift copy command to upload the data files
to the Redshift table.

Note
The Amazon Redshift primary key is informational only and the software does not enforce key constraints for
the primary key. Be aware that using SELECT DISTINCT may return duplicate rows if the primary key is not
unique.

Note
The Amazon Redshift ODBC driver does not support parallelized loads via ODBC into a single table. Therefore,
the Number of Loaders option in the Options tab is not applicable for a regular loader.

Bulk loader tab

Option Description

Bulk load Select to use bulk loading options to write the data.

Mode Select the mode for loading data in the target table:

● Append: Adds new records to the table.

Note
Append mode does not apply to template tables.

● Truncate: Deletes all existing records in the table, and then adds new records.

S3 file location Enter or select the path to the Amazon S3 configuration file. You can enter a variable for this
option.

Maximum rejects Enter the maximum number of acceptable errors. After the maximum is reached, the
software stops bulk loading. Set this option when you expect some errors. If you enter 0, or if
you do not specify a value, the software stops the bulk loading when the first error occurs.

Column delimiter Enter a single-character column delimiter.


Generate files only Enable to generate data files that you can use for bulk loading.

When enabled, the software loads data into data files instead of the target in the data flow.
The software writes the data files into the bulk loader directory specified in the datastore
definition.

If you do not specify a bulk loader directory, the software writes the files to
<%DS_COMMON_DIR%>\log\bulkloader\<tablename><PID>. Then you manually
copy the files to the Amazon S3 remote system.

The file name is <tablename><PID>_<timestamp>_<loader_number>_<number of files
generated by each loader>.dat, where <tablename> is the name of the target table.

Clean up bulk loader directory after load Enable to delete all bulk load-oriented files from the bulk load directory and the Amazon S3
remote system after the load is complete.

Number of loaders Sets the number of threads to generate multiple data files for a parallel load job. Enter a
positive integer for the number of loaders (threads).

Related Information

Amazon Redshift source [page 64]


Amazon Redshift [page 60]

4.1.1.5 Amazon Redshift data types

SAP Data Services converts Redshift data types to Data Services data types when Data Services imports
metadata from a Redshift source or target into the repository.

The following table shows the data type conversions.

Redshift data type Converts to Data Services data type

smallint int

integer int

bigint decimal(19,0)

decimal decimal

real real


float double

boolean varchar(5)

char char

Note
The char data type doesn't support multi-byte characters. The maximum range is 4096 bytes.

nchar char

varchar varchar

nvarchar varchar

Note
The varchar and nvarchar data types support utf8 multi-byte characters. The size is the number
of bytes and the max range is 65535.

Caution
If you try to load multi-byte characters into a char or nchar data type column, Redshift will produce
an error. Redshift internally converts the nchar and nvarchar data types to char and varchar.
The char data type in Redshift doesn't support multi-byte characters. Use overflow to catch the
unsupported data or, to avoid this problem, create a varchar column instead of using the char
data type.

date date

timestamp datetime

text varchar(256)

bpchar char(256)

The following data type conversions apply when you create a template table:

Data Services data type Redshift template table data type

blob varchar(max)

date date

datetime datetime

decimal decimal

double double precision


int integer

interval float

long varchar(8190)

real float

time varchar(25)

timestamp datetime

varchar varchar/nvarchar

char char/nchar

4.1.2 Azure SQL database

Developers and administrators who use Microsoft SQL Server can store on-premises SQL Server workloads on an
Azure virtual machine in the cloud.

The Azure virtual machine supports both Unix and Windows platforms.

4.1.2.1 Moving files to and from Azure containers

Data Services lets you move files from local storage such as a local drive or folder to an Azure container.

You can use an existing container or create one if it does not exist. You can also import files (called “blobs” when in
a container) from an Azure container to a local drive or folder. The files can be any type and are not internally
manipulated by Data Services. Currently, Data Services supports block blobs in the container storage type.

You use a file format to describe a blob file and use it within a data flow to perform extra operations on the file. The
file format can also be used in a script to automate upload and local file deletion.

The following are the high-level steps for uploading files to a container storage blob in Microsoft Azure.

1. Create a storage account in Azure and take note of the primary shared key. For more information, see
Microsoft documentation or Microsoft technical support.
2. Create a file location object with the Azure Cloud Storage protocol. For details about the file location object
option settings in Azure, see the Reference Guide.
3. Create a job in Data Services Designer.
4. Add a script containing the appropriate function to the job.

○ copy_to_remote_system

○ copy_from_remote_system
Example

copy_to_remote_system('New_FileLocation', '*')

A script that contains this function copies all of the files from the local directory specified in the file location
object to the container specified in the same object. A sample download script is sketched after these steps.
5. Save and run the job.
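
To move blobs in the other direction, a script can call the complementary function named in step 4. The following
sketch assumes the same file location object name as the example above; it copies all blobs from the container
specified in the object to the local directory specified in the same object.

Sample Code

copy_from_remote_system('New_FileLocation', '*')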

Related Information

Azure Cloud Storage protocol [page 80]

4.1.3 Google BigQuery

The Google BigQuery datastore contains access information and passwords so that the software can open your
Google BigQuery account on your behalf.

After accessing your account, SAP Data Services can load data to or extract data from your Google BigQuery
projects:

● Extract data from a Google BigQuery table to use as a source for Data Services processes.
● Load generated data from Data Services to Google BigQuery for analysis.
● Automatically create and populate a table in your Google BigQuery dataset by using a Google BigQuery
template table.

For complete information about Data Services and Google BigQuery, see the Supplement for Google BigQuery.

4.1.3.1 Datastore option descriptions

Option descriptions for the SAP Data Services Google BigQuery datastore editor.

Create a new datastore to open the Google BigQuery datastore editor. See the Designer Guide for information
about creating a datastore.

Datastore option descriptions

Option Instruction

Datastore Name Enter a unique name for the datastore.

Datastore Type Select Google BigQuery.

Web Service URL Accept the default: https://www.googleapis.com/bigquery/v2.


Advanced options (click Advanced to access)

Authentication Server URL Accept the default: https://accounts.google.com/o/oauth2/token.

Consists of the Google URL plus the name of the Web access
service provider, OAuth 2.0.

Authentication Access Scope Accept the default: https://www.googleapis.com/auth/bigquery.

Grants Data Services read and write access to your Google
projects.

Service Account Email Address Paste the service account e-mail address that you copied
from your Google project.

For instructions, see the Supplement for Google BigQuery.

Service Account Signature Algorithm Accept the default: SHA256withRSA

Algorithm that Data Services uses to sign JSON Web Tokens
with your service account private key to obtain an access token
from the Authentication Server.

Substitute Access Email Address Optional. Enter the substitute e-mail address from your Google
BigQuery datastore.

Proxy Required only if your network uses a proxy service to connect


to the internet. Complete the following options under Proxy
when applicable:

● Proxy host
● Proxy port
● Proxy user name
● Proxy password

Google Cloud Storage for Reading Set this option only when you are downloading data from Google
BigQuery as a source, and the data sets are larger than approximately
10 MB. Otherwise, leave the default setting of blank.

Using your Google Cloud Storage (GCS) account for reading
Google BigQuery data may improve processing performance.

For more information, see Optimize data extraction performance [page 71].

4.1.3.2 Google BigQuery target table

Option descriptions for the Target tab in the datastore explorer for the Google BigQuery datastore table.

When you include a Google BigQuery table in a data flow, you edit the target information for the target table.
Double-click the target table in the data flow to open the target editor.

Options specific to Google BigQuery

Option Description

Make Port Creates an embedded data flow port from a source or target
file.

Default is No. Choose Yes to make a source or target file an


embedded data flow port.

For more information, see “Creating embedded data flows” in


the Designer Guide.

Mode Designates how Data Services updates the Google BigQuery


table. The default is Truncate.

● Append: Adds new records generated from Data Services


processing to the existing Google BigQuery table.
● Truncate: Replaces all existing records from the Google
project table with the uploaded data from Data Services.

Number of loaders Sets the number of threads to use for processing.

Enter a positive integer for the number of loaders (threads).

Each loader starts one resumable load job in Google BigQuery


to load data.

Loading with one loader is known as single loader loading.


Loading when the number of loaders is greater than 1 is known
as parallel loading. You can specify any number of loaders.

Maximum failed records per loader Sets the maximum number of records that can fail per loader
before Google stops loading records. The default is zero (0).

The Target tab also displays the Google table name and the datastore used to access the table.

4.1.3.3 Optimize data extraction performance

When you have larger data files to extract from Google BigQuery, create a file location object that uses Google
Cloud Storage (GCS) protocol to optimize data extraction.

Consider the following factors before you decide to use the GCS file location object for optimization. Compare the
time saved using optimization against the potential fees from using your GCS account in this manner. Additionally,
the optimization may not be beneficial for smaller data files of less than or equal to 10 MB.

What you need

Required information to complete the GCS file location object includes the following:

● GCS bucket name


● Bucket folder structure
● Authentication access scope
● Service account private key file

How to set it up

1. Create a GCS file location object in Designer.
2. Select gzip for the compression type in the GCS file location.
3. Create a Google BigQuery datastore object in Designer.
4. In the datastore, complete the Use Google Cloud Storage for Reading option by selecting the GCS file location
name from the dropdown list.
5. Create a data flow in Designer and add a SQL transform as a reader.
6. Open the SQL transform and enter a SQL statement in the SQL tab to specify the data to extract (see the
sample statement after these steps).
7. Set up the remaining data flow in Designer.
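
The SQL statement in step 6 is ordinary BigQuery SQL. The following sketch uses a hypothetical dataset and table;
replace the names with your own objects, and adjust the table qualification to the SQL dialect configured for your
project.

Sample Code

SELECT id, customer_name, order_total
FROM mydataset.orders
WHERE order_date >= '2017-01-01'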

4.1.3.4 load_from_gcs_to_gbq

A function that uses information from the named file location object to copy data from Google Cloud Storage into
Google BigQuery tables.

Use this function in a workflow script to transfer data from Google Cloud Storage into Google BigQuery tables to
be used as a source in a data flow. The software uses the local and remote paths and Google Cloud Storage
protocol information from the named file location object.

Syntax

load_from_gcs_to_gbq(“<datastore_name>”, “<remote_file_name>”, “<table_name>”,


“<write_mode>”, “<file_format>”)

Return value

int

Returns 1 if function is successful. Returns 0 if function is not successful.

Where

<datastore_name> Name of the Google BigQuery datastore.

<remote_file_name> Name of the file to copy from the remote server in the format gs://bucket/filename.
Wildcards may be used.

<table_name> Name of the Google BigQuery table, in the format dataset.table.

<write_mode> (Optional.) The write mode value can be append (default) or truncated.

<file_format> The format of the data files using one of the following values:

● CSV: For CSV files. This is the default value.


● DATASTORE_BACKUP: For datastore backups.
● NEWLINE_DELIMITED_JSON: For newline-delimited JSON.
● AVRO: For Avro.

Example
To copy a file json08_from_gbq.json from a Google BigQuery datastore named NewGBQ1 on a remote server
to a Google BigQuery table named test.json08 on a local server, set up a script object that contains the
load_from_gcs_to_gbq function as follows:

Sample Code

load_from_gcs_to_gbq('NewGBQ1', 'gs://test-bucket_1229/from_gbq/
json08_from_gbq.json', 'test.json08', 'append', 'NEWLINE_DELIMITED_JSON');

4.1.3.5 gbq2file

A function that optimizes software performance when you export large-volume Google BigQuery results to a user-
specified file on your local machine.

The software uses information in the associated Google cloud storage (GCS) file location object to identify your
GCS connection information, bucket name, and compression information.

Syntax

gbq2file('<GBQ_datastore_name>','<any_query_in_GBQ>','<local_file_name>',
'<file_location_object>','<field_delimiter>','/<numeric_row_delimiter>');

Return value

int

Returns 1 if function is successful. Returns 0 if function is not successful.

Where

<GBQ_datastore_name> Name of the Google BigQuery application datastore in Data Services.

<any_query_in_GBQ> Name of the applicable query in Google BigQuery.

<local_file_name> Local file location and name in which to store the Google data.

Should be the location of your local server.

<file_location_object> Name of the Google Cloud Storage file location object in Data Services.

<field_delimiter> Optional. The field delimiter to use between fields in the exported data. The default is a comma.

<numeric_row_delimiter> Numeric value for the row delimiter. For example, /013.

Note
The default is 10, hex 0A.

How the function works

1. The function saves your Google BigQuery results to a temporary table in Google.
2. The function uses an export job to export data from the temporary table to GCS.

Note
If the data is larger than 1 GB, Google exports the data in multiple files.

3. The function transfers the data from your Google Cloud Storage to the local file that you specified.
4. After the transfer is complete, the function deletes the temporary table and any files from Google Cloud
Storage.

For details about creating a Google BigQuery application datastore, see the Supplement for Google BigQuery.
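
The following sketch shows one possible call in a workflow script. The datastore name, query, local file path, and
file location object name are placeholders for illustration only.

Sample Code

gbq2file('NewGBQ1', 'select * from test.sales_2017', '/tmp/gbq_sales_2017.dat',
'GCS_FileLocation', ',', '/010');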

Related Information

Google Cloud Storage protocol [page 84]

4.2 Cloud storages

Access various cloud storages through file location objects.

File location objects specify specific file transfer protocols so that SAP Data Services safely transfers data from
server to server.

4.2.1 Amazon S3

Amazon Simple Storage Service (S3) is a product of Amazon Web Services that provides scalable storage in the
cloud.

Amazon S3 provides a service where you can store large volumes of data. In SAP Data Services, access your
Amazon S3 account using a file location object.

Data Services provides built-in functions for processing data that you can use with data from S3 and data that you
load to S3. There is one built-in function specifically for moving data from S3 to Amazon Redshift named
load_from_s3_to_redshift.

4.2.1.1 Amazon S3 protocol

Use a file location object to access data or upload data stored in your Amazon S3 account.

To create an Amazon S3 file location object:

1. Open the Format tab in the Local Object Library in Designer.


2. Right-click the File Location node.
3. Select New.

The following table describes the file location options that are specific to the Amazon S3 protocol.

Option Description

Access Key Amazon S3 identification input value.

Secret Key Amazon S3 authorization input value.

Region Name of the region you are transferring data to and from; for
example, "South America (São Paulo)".


Communication Protocol Communication protocol you are using with S3, either http or
https.

Compression Type The compression type to use.

Files are compressed before upload to S3 and decompressed


after download from S3.

Connection Retry Count Number of times the software should try to upload or download
data before stopping the upload or download.

Batch size for uploading data, MB Size of the data transfer you want the software to use for
uploading data to S3.

Data Services uses single-part uploads for files less than 5 MB
in size, and multi-part uploads for files larger than 5 MB. Data
Services limits the total upload batch size to 100 MB.

Batch size for downloading data, MB Size of the data transfer the software uses to download data
from S3.

Number of threads Number of upload and download threads for transferring data
to S3.

Remote directory Optional. Name of the directory for Amazon S3 to transfer files
to and from.

Bucket Name of the Amazon S3 bucket containing the data.

Local directory Optional. Name of the local directory to use to create the files.
If you leave this field empty, the software uses the default Data
Services workspace.

Proxy host, port, user name, password Proxy information if you use a proxy server.

Related Information

Amazon Redshift [page 60]


load_from_s3_to_redshift [page 77]

4.2.1.2 load_from_s3_to_redshift

Uses the Redshift COPY command to copy data files from an Amazon Simple Storage Service (S3) bucket to a
Redshift table.

Before using this function, set up an S3 file location object. For more information, see Amazon S3 protocol [page
75].

Syntax

load_from_s3_to_redshift("<datastore name>", "<table name>", "<file location


name>", "<file name>", "<file options>")

Where

<datastore name> Name of the Redshift datastore.

<table name> Name of the target table.

You can also specify the following:

● <table name> = <table name>


● <table name> = <schema name>.<table name>
● <table name> = <Redshift datastore name>.<schema name>.<table name>
● <table name> = <Redshift datastore name>.<Alias name used in
datastore>.<table name>

<file location name> Name of the Amazon S3 file location object.

<file name> Fully qualified name of the Amazon S3 file to copy to the Redshift table. Wild cards are allowed.

<file options> (optional) Use the following file options as applicable when copying a file:

● acceptanydate: Accepts any date, even those with invalid formats, without throwing an error.
● acceptinvchars: Replaces invalid UTF-8 characters.
● blankasnull: Inserts null if the input data is blank.
● dateformat: Defines the date format. For example, \'YYYY-MM-DD\'.
● delimiter: Defines the column delimiter. For example, \'|\'.
● emptyasnull: Inserts null if input data is empty.
● encoding: Defines the data file encoding type. Valid values include utf8 (default), utf16, utf16le,
and utf16be.
● encrypted: Loads encrypted data files from S3.
● escape: Removes the escape (\) character. For example, a\\b\\c would be a\b\c.
● explicit_ids: Data values must match the Identity format and Identity columns.
● fillrecord: Fills null if any record is missed.
● ignoreblanklines: Ignores blank lines.
● ignoreheader: Skips the specified number of rows as a file header. The default is 0.
● manifest: Loads manifest data files from S3.
● maxerror: Defines the maximum number of errors allowed. The default is 0.
● null as: Defines the special null string.
● removequotes: Removes quotes from the data file.
● roundec: Rounds up numeric values when the input value is greater than the scale defined for
the column.
● timeformat: Defines the timestamp format. For example, \'YYYY-MM-DD HH:MI:SS\'.
● trimblanks: Removes whitespace characters. Only applies to the varchar data type.
● truncatecolumns: Truncates data in columns when the input value is greater than the column
defined. Applies to varchar or char data types and rows 4MB or less in size.
● gzip: Loads compressed data files from S3.
● lzop: Loads compressed data files from S3.
● bzip2: Loads compressed data files from S3.

Sample Code

CREATE __AL_REPO_FUNCTION load_from_s3_to_redshift("Datastore" __FUNC_CHAR IN,


"Table name" __FUNC_CHAR IN, "File location name" __FUNC_CHAR IN, "File name"
__FUNC_CHAR IN, "File options" __FUNC_CHAR IN )
SET(database_type = 'ACTA',
function_type = 'Miscellaneous_Function',
DB_FunctionName = 'load_from_s3_to_redshift',
Description = 'This function loads Amazon S3 data file(s) to a Amazon Redshift
table',
Parallelizable = 'NO',
External_name = 'load_from_s3_to_redshift',
return_param_dep = 'null',
return_datatype = '5',
return_datatype_size = '4',
param0 = 'Name of the Amazon Redshift datastore.',
param1 = 'Name of the target table.',
param2 = 'Name of the Amazon S3 File location.' ,
param3 = 'Fully qualified name of the Amazon S3 file(s). Wild cards are
allowed.' ,
param4 = 'File options that can be applied when copying the file. For example,
\'delimiter \',\' encoding \'utf8\'\'.'
)

Example
To copy a data file inside <bucket name>/<sub directory> on S3 to a Redshift table, define the following in
the S3 datastore:

● Bucket = <bucket name>


● Remote Directory=<sub directory>

Then enter the following:

load_from_s3_to_redshift('redshift_ft', 'customer', 'S3_to_Redshift',
'customer.dat', 'delimiter \',\' ');

Example
To generate an AES256 key, enter the following:

encrypt_aes('<plain password>', '<passphrase>', 256)

You can then use the key to upload data from the Redshift table to the S3 bucket.

unload ('select * from <redshift table>')


to 's3://<bucket name>/<sub directory>/'
credentials 'aws_access_key_id=<access key>;aws_secret_access_key=<secret access
key>;master_symmetric_key=<AES256 key> '
delimiter '|' encrypted bzip2;

To copy the encrypted data files on S3 back to a Redshift table, enter the following:

load_from_s3_to_redshift('redshift_ft', 'public.t31_household',
'S3_to_Redshift_3', 't31_encrypted', 'master_symmetric_key \'<AES256 key> \'
encrypted bzip2 delimiter \'|\'');

Example
To copy JSON data from S3 to a Redshift table, with a JSON path, enter the following:

load_from_s3_to_redshift('redshift_ft', 'public.t32_category',
'S3_to_Redshift_3', 't33_category.json', 'json \'s3://dsqa-redshift-bkt3/
t33_category_jsonpath.json\'');

Example
To copy CSV data from S3 to a Redshift table, enter the following:

load_from_s3_to_redshift('redshift_ft', 'public.t32_category',
'S3_to_Redshift_3', 't34_category_csv.txt', 'csv quote as \'%\'');

Example
To copy fixed-width data from S3 to a Redshift table, enter the following:

load_from_s3_to_redshift('redshift_ft', 'public.t35_fixed_width',
'S3_to_Redshift_3', 't35', 'fixedwidth \'catid:5,catgroup:10,catname:9,catdesc:
40\'');

4.2.2 Azure blob storage

Blob data is unstructured data that is stored as objects in the cloud. Blob data is text or binary data such as
documents, media files, or application installation files.

Access Azure blob storage by creating an Azure cloud file location object.

Related Information

Moving files to and from Azure containers [page 68]

4.2.2.1 Azure Cloud Storage protocol

Option descriptions for the Create New File Location window for the Azure Cloud Storage protocol.

Follow these steps to open the File Location editor to create a new file location object:

1. Open the Format tab in the Designer Local Object Library.


2. Right-click on the File Locations node and select New.

The following table lists the file location object descriptions for the Azure Cloud Storage protocol.

Option Description

Name File name of the file location object.

Protocol Type of file transfer protocol.

For Azure, the protocol is Azure Cloud Storage.

Account Name Name for the Azure storage account in the Azure Portal.

Storage Type Select Container storage, block blobs.

Data Services only supports this type of storage for Azure


Cloud Storage.


Authorization Type Select Primary Shared Key.

Data Services only supports this authorization type for Azure


Cloud Storage.

Account Shared Key Copy and paste the primary shared key from the Azure portal
in the storage account information.

Note
For security, the software does not export the account
shared key when you export a data flow or file location object
that specifies Azure Cloud Storage as the protocol.

Web Service URL Web services server URL that the data flow uses to access the
Web server.

Connection Retry Count Number of times the computer tries to create a connection
with the remote server after a connection fails. After the specified
number of retries, Data Services issues an error message
and stops the job.

The default value is 10. The value cannot be zero.

Batch size for uploading data, MB Maximum size of a data block per request when transferring
data files. The limit is 4 MB.

Caution
Accept the default setting unless you are an experienced
user with an understanding of your network capacities in
relation to bandwidth, network traffic, and network speed.

Batch size for downloading data, MB Maximum size of a data range to be downloaded per request
when transferring data files. The limit is 4 MB.

Caution
Accept the default setting unless you are an experienced
user with an understanding of your network capacities in
relation to bandwidth, network traffic, and network speed.

Number of threads Number of upload and download threads for transferring data
to Azure Cloud Storage. The default value is 1.

When you set this parameter correctly, it could decrease the


download and upload time for blobs. For more information,
see Number of threads for Azure blobs [page 83].


Remote Path Prefix Optional. File path for the remote server, excluding the server
name. You must have permission to this directory.

If you leave this option blank, the software assumes that the
remote path prefix is the user home directory used for FTP.

When an associated file format is used as a reader in a data
flow, the software accesses the remote directory and transfers
a copy of the data file to the local directory for processing.

When an associated file format is used as a loader in a data
flow, the software accesses the local directory location and
transfers a copy of the processed file to the remote directory.

Container type storage is a flat file storage system and does
not support subfolders. However, Microsoft allows forward
slashes in names to form the remote path prefix, which acts as
a virtual folder in the container where you upload the files.

Example
You currently have a container for finance database files.
You want to create a virtual folder for each year to upload the
blob files into. For 2016, you set the remote path prefix to
2016/. When you use this file location, all of the files upload
into the virtual folder “2016”.

Local Directory Path of your local server directory for the file upload or down­
load.

Requirements for local server:

● must exist
● located where the Job Server resides
● you have appropriate permissions for this directory

When an associated file format is used as a reader in a data


flow, the software accesses the remote directory and transfers
a copy of the data file to the local directory for processing.

When an associated file format is used as a loader in a data


flow, the software accesses the local directory location and
transfers a copy of the processed file to the remote directory.


Container Azure container name for uploading or downloading blobs to


your local directory.

If you specified the connection information, including account


name, shared key, and proxy information (if applicable), click
the Container field. The software sends a request to the server
for a list of existing containers for the specific account. Either
select an existing container or specify a new one. When you
specify a new one, the software creates it when you run a job
using this file location object.

Proxy Host, Port, User Name, Password Optional. Enter the proxy information if you use a proxy server.

4.2.2.1.1 Number of threads for Azure blobs

The number of threads is the number of parallel uploaders or downloaders to be run simultaneously when you
upload or download blobs.

The Number of threads setting affects the efficiency of downloading and uploading blobs to or from Azure Cloud
Storage.

Determine the number of threads

To determine the number of threads to set for the Azure file location object, base the number of threads on the
number of logical cores in the processor that you use.

Example thread settings

Processor logical cores Set Number of threads

8 8

16 16

The software automatically re-adjusts the number of threads based on the blob size you are uploading or
downloading. For example, when you upload or download a small file, the software may adjust to use fewer
threads and use the block or range size you specified in the Batch size for uploading data, MB or Batch size for
downloading data, MB options.

Upload Blob to an Azure container

When you upload a large file to an Azure container, the software may divide the file into the same number of lists of
blocks as the setting you have for Number of threads in the file location object. For example, when the Number of

threads is set to 16 for a large file upload, the software divides the file into 16 lists of blocks. Additionally, each
thread reads the blocks simultaneously from the local file and also uploads the blocks simultaneously to the Azure
container.

When all the blocks are successfully uploaded, the software sends a list of commit blocks to the Azure Blob
Service to commit the new blob.

If there is an upload failure, the software issues an error message. If they already existed before the upload failure,
the blobs in the Azure container stay intact.

When you set the number of threads correctly, you may see a decrease in upload time for large files.

Download Blob from an Azure container

When you download a large file from the Azure container to your local storage, the software may divide the file into
as many lists of ranges as the Number of threads setting in the file location object. For example, when the Number
of threads is set to 16 for a large file download to your local container, the software divides the blobs into 16 lists of
ranges. Additionally, each thread downloads the ranges simultaneously from the Azure container and also writes
the ranges simultaneously to your local storage.

When your software downloads a blob from an Azure container, it creates a temporary file to hold the ranges from
all of the threads. When all of the ranges are successfully downloaded, the software deletes the existing file from
your local storage if it existed, and renames the temporary file using the name of the file that was deleted from
local storage.

If there is a download failure, the software issues an error message. The existing data in local storage stays intact if
it existed before the download failure.

When you set the number of threads correctly, you may see a decrease in download time.

4.2.3 Google cloud storage

Use a Google file location object to access data in your Google cloud account.

4.2.3.1 Google Cloud Storage protocol

Option descriptions for the Create New File Location editor for Google Cloud Storage protocol.

To open this editor, follow these steps:

1. Open the Format tab in the Designer Local Object Library.


2. Right-click on the File Locations node, and select New.

The following table lists the file location object descriptions for the Google Cloud Storage protocol.

Option Description

Name File name of the file location object.

Protocol Type of file transfer protocol.

For Google, the protocol is Google Cloud Storage.

Project Google BigQuery project name.

Upload URL Accept the default, https://www.googleapis.com/upload/storage/v1.

Download URL Accept the default, https://www.googleapis.com/storage/v1.

Authentication Server URL Accept the default, https://accounts.google.com/o/oauth2/token.

The default is the Google URL plus the name of the Web access
service provider, OAuth 2.0.

Authentication Access Scope Enables access to specific user data. Cloud-platform is the default.

● Read-only: Allows access to read data, including listing
buckets.
Google information about read-only: https://
www.googleapis.com/auth/devstorage.read_only
● Read-write: Allows access to read and change data, but
not metadata like ACLs.
Google information about read-write: https://
www.googleapis.com/auth/devstorage.read_write
● Full-control: Allows full control over data, including the
ability to modify ACLs.
Google information about full-control: https://
www.googleapis.com/auth/devstorage.full_control
● Cloud-platform.read-only: View your data across Google
Cloud Platform services. For Google Cloud Storage, this
option is the same as devstorage.read-only.
Google information about cloud-platform.read-only:
https://www.googleapis.com/auth/cloud-platform.read-
only
● Cloud-platform: View and manage data across all Google
Cloud Platform services. For Google Cloud Storage, this
option is the same as devstorage.full-control.
Google information about cloud-platform: https://
www.googleapis.com/auth/cloud-platform

Service Account Email Address Enter the e-mail address from your Google project. This e-mail
is the same as the service account e-mail address that you enter
into the applicable Google BigQuery datastore.


Service Account Private Key Click the Browse icon and select the .p12 file that you created
in your Google project and downloaded locally. Click Open.

Service Account Signature Algorithm Accept the default: SHA256withRSA. This value is the algorithm
type that the software uses to sign JSON Web Tokens.
The software uses this value, along with your service account
private key, to obtain an access token from the Authentication
Server.

Substitute Access Email Address Optional. Enter the substitute e-mail address from your Google
BigQuery application datastore.

Web Service URL Web services server URL that the data flow uses to access the
Web server.

Compression Type Select None or gzip. The gzip type lets you upload gzip files to
Google Cloud Storage.

Connection Retry Count Number of times the computer tries to create a connection
with the remote server after a connection fails. After the specified
number of retries, Data Services issues an error notification
and stops the job.

The default value is 10. The value cannot be zero.

Batch size for uploading data, MB Maximum size of a data block to be uploaded per request
when transferring data files. The limit is 5 TB.

Batch size for downloading data, MB Maximum size of a data block to be downloaded per request
when transferring data files. The limit is 5 TB.

Number of threads Number of upload and download threads for transferring data
to Google Cloud Storage.

The default is 1.

Enter a number from 1 to 30. If you enter any number outside


this range, the software automatically adjusts the number at
runtime.

Bucket Bucket name, which is the name of the basic container that
holds your data.

Select a bucket name from the dropdown list. The list only
contains bucket names that exist in the datastore. To create a
new bucket, enter the name of the bucket here. If the bucket
does not exist in Google Cloud Storage, Google creates the
bucket when you perform an upload for the specified bucket.

Note
If you attempt to download the bucket and it does not exist
in Google, the software issues an error.


Remote Path Prefix Optional. Folder structure of the Google Cloud Storage bucket.
It should end with a forward slash (/). For example,
test_folder1/folder2/. You must have permission to
this directory.

If you leave this option blank, the software assumes the home
directory of your file transfer protocol.

When an associated file format is used as a reader in a data


flow, the software accesses the remote directory and transfers
a copy of the data file to the local directory for processing.

When an associated file format is used as a loader in a data


flow, the software accesses the local directory location and
transfers a copy of the processed file to the remote directory

Local Directory The file path of the local server that you use for this file location
object. The local server directory is located where the Job
Server resides. You must have permission to this directory.

Note
If this option is blank, the software assumes the directory
%DS_COMMON_DIR%/workspace as the default directory.

When an associated file format is used as a reader in a data


flow, the software accesses the remote directory and transfers
a copy of the data file to the local directory for processing.

When an associated file format is used as a loader in a data


flow, the software accesses the local directory location and
transfers a copy of the processed file to the remote directory.

Proxy Host, Port, User Name, Password Optional. Enter the proxy information if you use a proxy server.

Important Disclaimers and Legal Information

Coding Samples
Any software coding and/or code lines / strings ("Code") included in this documentation are only examples and are not intended to be used in a productive system
environment. The Code is only intended to better explain and visualize the syntax and phrasing rules of certain coding. SAP does not warrant the correctness and
completeness of the Code given herein, and SAP shall not be liable for errors or damages caused by the usage of the Code, unless damages were caused by SAP
intentionally or by SAP's gross negligence.

Gender-Neutral Language
As far as possible, SAP documentation is gender neutral. Depending on the context, the reader is addressed directly with "you", or a gender-neutral noun (such as "sales
person" or "working days") is used. If when referring to members of both sexes, however, the third-person singular cannot be avoided or a gender-neutral noun does not
exist, SAP reserves the right to use the masculine form of the noun and pronoun. This is to ensure that the documentation remains comprehensible.

Internet Hyperlinks
The SAP documentation may contain hyperlinks to the Internet. These hyperlinks are intended to serve as a hint about where to find related information. SAP does not
warrant the availability and correctness of this related information or the ability of this information to serve a particular purpose. SAP shall not be liable for any damages
caused by the use of related information unless damages have been caused by SAP's gross negligence or willful misconduct. All links are categorized for transparency (see:
https://help.sap.com/viewer/disclaimer).


© 2018 SAP SE or an SAP affiliate company. All rights reserved.


No part of this publication may be reproduced or transmitted in any
form or for any purpose without the express permission of SAP SE
or an SAP affiliate company. The information contained herein may
be changed without prior notice.
Some software products marketed by SAP SE and its distributors
contain proprietary software components of other software vendors.
National product specifications may vary.
These materials are provided by SAP SE or an SAP affiliate company
for informational purposes only, without representation or warranty
of any kind, and SAP or its affiliated companies shall not be liable for
errors or omissions with respect to the materials. The only
warranties for SAP or SAP affiliate company products and services
are those that are set forth in the express warranty statements
accompanying such products and services, if any. Nothing herein
should be construed as constituting an additional warranty.
SAP and other SAP products and services mentioned herein as well
as their respective logos are trademarks or registered trademarks of
SAP SE (or an SAP affiliate company) in Germany and other
countries. All other product and service names mentioned are the
trademarks of their respective companies.
Please see https://www.sap.com/corporate/en/legal/copyright.html
for additional trademark information and notices.
