Ds 42 Big Data en
This supplement contains information about the big data products that SAP Data Services supports.
Find basic information in the Reference Guide, the Designer Guide, and the applicable supplement guides. For
example, to learn about datastores and creating datastores, see the Reference Guide. To learn about Google
BigQuery, see the Supplement for Google BigQuery.
SAP Data Services supports many types of big data through various object types and file formats.
Apache Cassandra is an open-source data storage system that you can access with SAP Data Services as a source
or target in a dataflow.
Data Services natively supports Cassandra as an ODBC data source with a DSN connection. Cassandra uses the
generic ODBC driver. Use Cassandra on Windows or Linux operating systems.
Note
For Data Services on Windows platforms, driver support is through the generic ODBC driver.
Use the Connection Manager to create, edit, or delete ODBC data sources and ODBC drivers for natively
supported ODBC databases when Data Services is installed on a Unix platform.
1. In a command prompt, set $ODBCINI to a file in which the Connection Manager defines the DSN. The file must
be readable and writable.
Sample Code
export ODBCINI=<dir-path>/odbc.ini
touch $ODBCINI
The Connection Manager uses this .ini file, along with other information that you enter on the Connection
Manager Data Sources tab, to define the DSN for Cassandra.
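After you complete the remaining steps, the Connection Manager writes a DSN entry for Cassandra into this file. A generated entry might look like the following sketch; the section name, driver path, and property names are illustrative only and vary by driver version (see the Cassandra data source properties table for the actual properties):

```
[Cassandra_DSN]
Driver=/opt/cassandra/odbc/lib/libcassandraodbc.so
Host=cassandra-host.example.com
Port=9042
DefaultKeyspace=sales
```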
2. Run the Connection Manager from the Data Services bin directory:
Sample Code
$ cd <LINK_DIR>/bin/
$ ./DSConnectionManager.sh
Note
<LINK_DIR> is the Data Services installation directory.
3. In Connection Manager, open the Data Sources tab, and click Add to display the list of database types.
4. On the Select Database Type window, select Cassandra and click OK.
The Configuration for... window opens. It contains the absolute location of the odbc.ini file that you set in the
first step.
5. Provide values for additional connection properties for the Cassandra database type as applicable. See Data
source properties for Cassandra [page 7] for Cassandra properties.
6. Provide the following properties:
○ User name
○ Password
Note
The software does not save these properties for other users.
7. If Data Services is installed on the same machine and in the same folder as the IPS or BI platform, restart the
following services:
○ EIM Adaptive Process Service
○ Data Services Job Service
8. If Data Services is not installed on the same machine and in the same folder as the IPS or BI platform, restart
the following service:
○ Data Services Job Service
9. If you run another command, such as the Repository Manager, source the al_env.sh script to set the
environment variables.
The Connection Manager configures the $ODBCINI file based on the property values that you enter on the Data
Sources tab. The following table lists the properties that are relevant for Apache Cassandra.
Depending on the value you choose for the certificate mode, you may be asked to define some or all of the
following properties.
Use SAP Data Services to connect to Apache Hadoop frameworks, including Hadoop Distributed File System
(HDFS) and Hive sources and targets.
Data Services supports Hadoop on both the Linux and Windows platforms. For Windows support, Data Services
uses Hortonworks HDP only. See the latest Product Availability Matrix (PAM) for the supported versions of
Hortonworks HDP: https://apps.support.sap.com/sap/support/pam?hash=pvnr%3D67838200100900005703 .
For information about deploying Data Services on a Hadoop MapR cluster machine, see SAP Note 2404486 .
Component Description
Hadoop distributed file system (HDFS): Stores data on nodes, providing very high aggregate bandwidth across the cluster.
Hive: A data warehouse infrastructure that allows SQL-like ad hoc querying of data (in any format) stored in Hadoop.
Pig: A high-level data-flow language and execution framework for parallel computation that is built on top of Hadoop. Data Services uses Pig scripts to read from and write to HDFS, including join and push-down operations.
Map/Reduce: A computational paradigm where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Data Services uses map/reduce to do text data processing.
2.2.1 Prerequisites
Before configuring SAP Data Services to connect to Hadoop, verify that your configuration is correct.
Ensure that your Data Services system configuration meets the following prerequisites:
● For Linux and Windows platforms, make sure the machine where the Data Services Job Server is installed is
configured to work with Hadoop.
● For Linux and Windows platforms, make sure the machine where the Data Services Job Server is installed has
the Pig client installed.
● For Linux and Windows platforms, if you are using Hive, verify that the Hive client is installed. To verify this, log
on to the node and issue Pig and Hive commands that invoke the respective interfaces.
● For Linux and Windows platforms, install the Data Services Job Server on one of the Hadoop cluster machines,
which can be either an Edge or a Data node.
● For Linux platforms, ensure that the environment is set up correctly for interaction with Hadoop. The Job
Server should start from an environment that has sourced the Hadoop environment script. For example:
source $LINK_DIR/hadoop/bin/hadoop_env_setup.sh -e
● For Linux and Windows platforms, enable text data processing. To enable text data processing, ensure that you
have copied the necessary text data processing components to the HDFS file system, which enables
MapReduce functionality.
Use common commands to verify that the SAP Data Services system on Windows is correctly configured for Hadoop.
When you use the commands in this topic, your output may differ from the examples shown. What matters is
that the commands complete without errors.
To set up the Data Services environment for Hadoop, first run the Hadoop environment setup script for your platform.
Checking components
To make sure that Hadoop, Pig, and Hive are set up correctly on the machine where the Data Services Job Server
for Hadoop is configured and installed, use the following command:
$ hadoop fs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2013-03-21 11:47 /tmp
drwxr-xr-x - hadoop supergroup 0 2013-03-14 02:50 /user
$ pig
INFO org.apache.pig.Main - Logging error messages to: /hadoop/pig_1363897065467.log
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to hadoop file system at: hdfs://machine:9000
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to map-reduce job tracker at: machine:9001
grunt> fs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2013-03-21 11:47 /tmp
drwxr-xr-x - hadoop supergroup 0 2013-03-14 02:50 /user
grunt> quit
$ hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201303211318_504071234.txt
hive> show databases;
OK
default
Time taken: 1.312 seconds
hive> quit;
If all commands pass, run $LINK_DIR/bin/svrcfg from within the same shell to set up or restart the Job Server.
Running it from this shell gives the Job Server the environment it needs to start engines that can call Hadoop,
Pig, and Hive.
SAP Data Services supports Hadoop on the Windows platform using Hortonworks.
Use the supported version of Hortonworks HDP only. See the Product Availability Matrix (PAM) for the most recent
supported version number.
When you use Hadoop on the Windows platform, you can use Data Services to do the following tasks:
Requirements
● Install the Data Services Job Server in one of the nodes of the Hadoop cluster.
● Set the system environment variables, such as PATH and CLASSPATH, so that the Job Server can run as a
service.
● Set the HDFS file system permission requirements for using HDFS or Hive.
Related Information
Set system environment variables and use command prompts to configure HDFS and Hive for Windows.
For example, set the following environment variable:
HDFS_LIB_DIR = /sap/dataservices/hadoop/tdp
Related Information
The file format for the Hadoop distributed file system (HDFS) describes the file system structure.
Characteristic Description
Class Reusable
Description An HDFS file format describes the structure of a Hadoop distributed file system. Store templates
for HDFS file formats in the object library. The format consists of multiple properties that you set in the file
format editor. Available properties vary by the mode of the editor.
The HDFS file format editor includes most of the regular file format editor options plus options that are unique
to HDFS.
File format option descriptions for Hadoop distributed file system (HDFS).
Access the following options in the source or target file editors when you use the HDFS file format in a data flow.
Data File(s)
NameNode host
Possible values: computer name, fully qualified domain name, IP address, or variable
Name of the NameNode computer. If you use the default settings, the local Hadoop system uses what is set as the default file system in the Hadoop configuration files. (All modes)

NameNode port
Possible values: positive integer or variable
Port on which the NameNode listens. (All modes)

Authentication
Possible values: Kerberos, Kerberos keytab
Indicates the type of authentication for the HDFS connection. Select either value for Hadoop and Hive data sources when they are Kerberos enabled. Kerberos keytab: Select when you have a generated keytab file. With this option, you do not need to enter a value for Password, but you enter a location for File Location. (All modes)

File Location
Possible values: file path
Location for the applicable Kerberos keytab that you generated for this connection.

Password
Possible values: alphanumeric characters and underscores, or variable
Password associated with the selected authentication type. This field is required for Authentication type Kerberos. It is not applicable for Authentication type Kerberos keytab. (All modes)

Root directory
Possible values: directory path or variable
Root directory path or variable name for the output file. (All modes)

File name(s)
Possible values: alphanumeric characters and underscores, or variable
Select the source connection file name or browse to the file by clicking the dropdown arrow. For added flexibility, you can select a variable for this option or use the * wildcard. (All modes)
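As an illustration of how the Data File(s) options combine (host, port, and paths below are placeholders, not values from this guide), the following settings address files under a single HDFS directory:

```
NameNode host:  namenode1
NameNode port:  9000
Root directory: /user/sapds/data
File name(s):   sales_*.dat

Resulting location: hdfs://namenode1:9000/user/sapds/data/sales_*.dat
```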
Pig
Working directory
Possible values: directory path or variable
The Pig script uses this directory to store intermediate data. (All modes)
Note
When you leave this option blank, Data Services creates and uses a directory in /user/sapds_temp, within the HDFS.
Note
If you select No, intermediate files remain in both the Pig Working Directory and the Data Services directory $LINK_DIR/log/hadoop.

Custom Pig script
Possible values: directory path or variable
Location of a custom Pig script. Use the results of the script as a source in a data flow. (All modes)
A custom Pig script can contain any valid Pig Latin command, including calls to any MapReduce jobs that you want to use with Data Services. See the Pig documentation for information about Pig Latin commands.
Custom Pig scripts must reside on, and be runnable from, the local file system that contains the Data Services Job Server that is configured for Hadoop, not on HDFS. Any external reference or dependency in the script should be available on the Data Services Job Server machine configured for Hadoop.
To test your custom Pig script, execute the script from the command prompt and check that it finishes without errors. For example, you could use the following command:
$ pig -f myscript
To use the results of the script by using the HDFS file format as a source in a data flow, complete the steps in Configuring custom Pig script results as source [page 14].
Locale
us-ascii: The Default option uses UTF-8 for the code page. Select one of these options for better performance.
Note
For other types of code pages, Data Services uses HDFS API-based file reading.
Output the results of a custom Pig script to a specified file so that you can use it as a source in a data flow.
Create a new HDFS file format or edit an existing one. Create or locate a custom Pig script that outputs data to use
as a source in your data flow.
Follow these steps to use the results of a custom Pig script in your HDFS file format as a source:
1. In the HDFS file format editor, select Delimited for Type in the General section.
2. Enter the location for the custom Pig script results output file in Root directory in the Data File(s) section.
3. Enter the name of the file to contain the results of the custom Pig script in File name(s).
4. In the Pig section, set Custom Pig script to the path of the custom Pig script. The location must be on the
machine that contains the Data Services Job Server.
Use the file format as a source in a data flow. When the software runs the custom Pig script in the HDFS file format,
the software uses the script results as source data in the job.
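For instance, a custom Pig script that produces delimited output suitable for an HDFS file format source might look like the following sketch. The paths, schema, and delimiter are assumptions for illustration, not values from this guide:

```
-- Read comma-delimited input, filter it, and store delimited results
-- in the directory and file that the HDFS file format points to.
raw = LOAD '/user/sapds/input/orders.csv' USING PigStorage(',')
      AS (id:int, region:chararray, amount:double);
big = FILTER raw BY amount > 1000.0;
STORE big INTO '/user/sapds/pig_out' USING PigStorage(',');
```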
To connect to a Hadoop Distributed File System (HDFS), configure an HDFS file format. Use the file format as a
source or target in a data flow.
Related Information
Preview HDFS file data for delimited and fixed width file types.
1. Right-click an HDFS file name in the Format tab of the Local Object Library.
2. Click Edit.
The File Format Editor opens. You can only view the data. Sorting and filtering are not available when you view
sample data in this manner.
Use one of the following methods to access HDFS file data so that you can view, sort, and filter the data:
● Right-click on HDFS source or target object in a data flow and click View Data.
● Click the magnifying glass icon located in the lower right corner of the HDFS source or target objects in the
data flow.
● Right-click an HDFS file in the Format tab of the Local Object Library, click Properties, and then open the View
Data tab.
Note
By default, the maximum number of rows displayed for data preview and filtering is 1000, but you can adjust the
number lower or higher, up to a maximum of 5000. To change the maximum number of rows to display:
Complete the following group of tasks to connect to Hive using the Hive adapter:
1. Open the Administrator in the Management Console and enable the Job Server to support adapters.
2. In the Administrator, add, configure, and start an adapter instance.
3. In Data Services Designer, add and configure a Hive adapter datastore.
Note
Data Services supports Apache Hive and HiveServer2 version 0.11 and higher. For the most recent compatibility
information, see the Product Availability Matrix (PAM) at https://apps.support.sap.com/sap/support/pam .
Related Information
Option Description
Host name The name of the machine that is running the Hive service.
Port number The port number of the machine that is running the Hive service.
Username and Password The user name and password associated with the adapter database to which you are
connecting.
If you are using Kerberos authentication, the user name should include the Kerberos realm. For example:
dsuser@BIGDATA.COM. If you use Kerberos keytab for authentication, you do not need to complete this option.
HDFS working directory The path to your Hadoop Distributed File System (HDFS) directory. If you leave this
blank, Data Services uses /user/sapds_hivetmp as the default.
String size The size of the Hive STRING datatype. The default is 100.
SSL enabled Select Yes to use a Secure Socket Layer (SSL) connection to connect to the Hive server.
Note
If you use Kerberos or Kerberos keytab for authentication, set this option to No.
SSL Trust Store The name of the trust store that verifies credentials and stores certificates.
Trust Store Password The password associated with the trust store.
Authentication Indicates the type of authentication you are using for the Hive connection:
Kerberos: Enter your Kerberos password in the Username and Password option.
Kerberos keytab: The generated keytab file. Enter the keytab file location in Kerberos
Keytab Location option.
A Kerberos keytab file contains a list of authorized users for a specific password. The software uses the keytab
information instead of the entered password in the Username and Password option. For more information about
keytabs, see the MIT Kerberos documentation at http://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html .
Data Services supports Kerberos authentication for Hadoop and Hive data sources
when you use Hadoop and Hive services that are Kerberos enabled.
Note
● Data Services supports Hadoop and Hive on the Linux 64-bit platform only.
● You cannot use SSL and Kerberos or Kerberos keytab authentication together.
Set the SSL enabled option to No when using Kerberos authentication.
● To enable SASL-QOP support for Kerberos, enter a sasl.qop value into the
Additional Properties field. For more information, see the Additional Properties
field description.
Kerberos Realm Specifies the name of your Kerberos realm. A realm contains the services, host machines, and
so on, that users can access. For example, BIGDATA.COM.
Kerberos KDC Specifies the server name of the Key Distribution Center (KDC). Secret keys for user machines
and services are stored in the KDC database.
Configure the Kerberos KDC with renewable tickets (ticket validity as required by the Hadoop/Hive installation).
Note
Data Services supports MIT KDC and Microsoft AD for Kerberos authentication.
Kerberos Hive Principal The Hive principal name for the KDC. The name can be the same as the user name that
you use when installing Data Services. Find the Hive service principal information in the hive-site.xml file. For
example, hive/<hostname>@<realm>.
Kerberos Keytab Location Location for the applicable Kerberos keytab that you generated for this connection.
See the description for Authentication for more information about Kerberos keytab authentication.
Additional Properties Specify any additional connection properties as property=value pairs, each followed by a
semicolon (;). For example:
name1=value1;
name1=value1; name2=value2;
To enable SASL-QOP support, set the Authentication option to Kerberos. Then enter one
of the following values, which should match the value on the Hive server:
Related Information
Use the Hive adapter to connect to a Hive server so that you can work with tables from Hadoop.
Note
Data Services supports Apache Hive and HiveServer2 version 0.11 and higher. For the most recent compatibility
information, see the Product Availability Matrix (PAM) at https://apps.support.sap.com/sap/support/pam .
Related Information
SAP Data Services supports text data processing in the Hadoop framework using a MapReduce form of the Entity
Extraction transform.
To use text data processing in Hadoop, copy the language modules and other dependent libraries to the Hadoop
file system (so they can be distributed during the MapReduce job setup) by running the Hadoop environment
script as follows:
$LINK_DIR/hadoop/bin/hadoop_env_setup.sh -c
You only have to do this file-copying operation once after an installation or update, or when you want to use
custom dictionaries or rule files. If you are using the Entity Extraction transform with custom dictionaries or rule
files, you must copy these files to the Hadoop file system for distribution. To do so, first copy the files into the
languages directory of the Data Services installation, then rerun the Hadoop environment script. For example:
cp /myhome/myDictionary.nc $LINK_DIR/TextAnalysis/languages
$LINK_DIR/hadoop/bin/hadoop_env_setup.sh -c
Once this environment is set up, to have Entity Extraction transform operations pushed down and handled by
the Hadoop system, connect the transform to a single HDFS Unstructured Text source.
When using text data processing in the Hadoop framework, the amount of data a mapper can handle, and
consequently the number of mappers a job uses, is controlled by the Hadoop configuration setting
mapred.max.split.size.
You can set the value for mapred.max.split.size in the Hadoop configuration file (located at $HADOOP_HOME/
conf/core-site.xml or an alternate configuration location, depending on the flavor of Hadoop you are using).
By default, the value for mapred.max.split.size is 0, which means that there is no limit and text data
processing would run with only one mapper. You should change this configuration value to the amount of data a
mapper can handle.
For example, you might have a Hadoop cluster that contains twenty machines and each machine is set up to run a
maximum of ten mappers (20 x 10 = 200 mappers available in the cluster). The input data averages 200 GB. If you
want the text data processing job to consume all of the available mappers (200 GB ÷ 200 mappers = 1 GB per
mapper), you would set mapred.max.split.size to 1073741824 (1 GB):
<property>
<name>mapred.max.split.size</name>
<value>1073741824</value>
</property>
If you want the text data processing job to consume 50 percent of the available mappers (200 GB ÷ 100 mappers
= 2 GB per mapper), you would set mapred.max.split.size to 2147483648 (2 GB).
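The arithmetic behind these settings can be checked with a quick shell calculation, using the cluster figures from the example above:

```shell
MACHINES=20
MAPPERS_PER_MACHINE=10
INPUT_GB=200
TOTAL_MAPPERS=$((MACHINES * MAPPERS_PER_MACHINE))

# Use all 200 mappers: 1 GB per mapper, in bytes
echo $((INPUT_GB / TOTAL_MAPPERS * 1024 * 1024 * 1024))        # 1073741824

# Use 50 percent of the mappers: 2 GB per mapper, in bytes
echo $((INPUT_GB / (TOTAL_MAPPERS / 2) * 1024 * 1024 * 1024))  # 2147483648
```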
Related Information
You can set the following options on the Adapter Source tab of the source table editor.
Clean up working directory
Possible values: True, False
Select True to delete the working directory after the job completes successfully.

Execution engine type
Possible values: Default, Map Reduce, Spark
● Default: Data Services uses the default Hive engine.
● Spark: Data Services uses the Spark engine to read data from Spark.
● Map Reduce: Data Services uses the Map Reduce engine to read data from Hive.

Parallel process threads
Possible values: positive integers
Specify the number of threads for parallel processing. More than one thread may improve performance by maximizing CPU usage on the Job Server computer. For example, if you have four CPUs, enter 4 for the number of parallel process threads.
The options and descriptions for Hadoop Hive adapter target options.
You can set the following options on the Adapter Target tab of the target table editor.
Append
Possible values: True, False
Select True to append new data to the table or partition. Select False to delete all existing data, then add new data.

Clean up working directory
Possible values: True, False
Select True to delete the working directory after the job completes successfully.

Dynamic partition
Possible values: True, False
Select True for dynamic partitions. Hive evaluates the partitions when scanning the input data.

Drop and re-create table before loading
Possible values: True, False
Select True to drop the existing table and create a new one with the same name before loading.

Number of loaders
Possible values: positive integers
Enter a positive integer for the number of loaders (threads).
The Hive adapter datastore can process data using the SQL function and the SQL transform.
After connecting to a Hive datastore, you can do the following in Data Services:
● Use the SQL Transform to read data through a Hive adapter datastore. Keep in mind that the SQL transform
supports a single SELECT statement only.
Note
Select table column plus constant expression is not supported.
Stage non-Hive data in a data flow with the Data Transfer transform before joining it with a Hive source.
When you join the non-Hive data to a Hive source, push down the join operation to Hive.
With pushdown, staging data is more efficient because Data Services doesn't have to read all the data from the
Hive data source into memory before performing the join.
Before staging can occur, you must first enable the Enable automatic data transfer option for the Hive datastore.
Find this option in the Create New Datastore or Edit Datastore window.
After adding the Data_Transfer transform to your dataflow, open the editor and verify that Transfer Type is set to
Table and Database type is set to Hive.
Note
If you select Automatic for the Data Transfer Type in the Data Transfer transform, you need to turn off the Enable
automatic data transfer option in all relational database datastores (with the exception of the Hive datastore).
Data Services imports Hive partition columns the same way as regular columns. The column attribute Partition
Column identifies whether the column is partitioned.
When loading to a Hive target, select whether or not to use the Dynamic partition option on the Adapter Target tab
of the target table editor. The partitioned data is evaluated dynamically by Hive when scanning the input data. If
Dynamic partition is not selected, Data Services uses Hive static loading. All rows are loaded to the same partition.
The partitioned data comes from the first row that the loader receives.
Related Information
To preview Hive table data, right-click a Hive table name in the Local Object Library and click View Data.
Alternatively, you can click the magnifying glass icon on Hive source and target objects in a data flow or open the
View Data tab of the Hive table view.
After you create a Hive application datastore in Data Services, use a Hive template table in a data flow.
Start to create a data flow in Data Services Designer and follow these steps to add a Hive template table as a
target.
1. When you are ready to complete the target portion of the data flow, either drag a template table from the
toolbar to your workspace or drag a template table from the Datastore tab under the Hive node to your
workspace.
The software opens the applicable project and dataset, and creates the table. The table name is the name you
entered for Template name in the Create Template window. The software populates the table with the results of the
data flow.
Data type conversion when you import metadata from Hadoop Hive to SAP Data Services.
The following table shows the conversion between Hadoop Hive data types and Data Services data types when
Data Services imports metadata from a Hadoop Hive source or target.
Hive data type    Data Services data type
tinyint           int
smallint          int
int               int
bigint            decimal(20,0)
float             real
double            double
string            varchar
boolean           varchar(5)
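For example, a Hive table defined as follows (the table and column names are illustrative) would import with the indicated Data Services types:

```
-- HiveQL sketch; each comment shows the Data Services type after import.
CREATE TABLE sales (
  id     BIGINT,   -- imported as decimal(20,0)
  qty    INT,      -- imported as int
  price  DOUBLE,   -- imported as double
  sold   BOOLEAN,  -- imported as varchar(5)
  note   STRING    -- imported as varchar (length set by the String size option)
);
```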
2.3 HP Vertica
Process your HP Vertica data in SAP Data Services by creating an HP Vertica database datastore.
Use an HP Vertica datastore as a source or target in a data flow. Implement SSL secure data transfer with MIT
Kerberos to securely access HP Vertica data. Additionally, adjust settings in the source or target table options to
enhance HP Vertica performance.
SAP Data Services uses MIT Kerberos 5 authentication to securely access an HP Vertica database using SSL
protocol.
You must have Database Administrator permissions to install MIT Kerberos 5 on your Data Services client
machine. Additionally, the Database Administrator must establish a Kerberos Key Distribution Center (KDC)
server for authentication. The KDC server must support Kerberos 5 using the Generic Security Service (GSS) API.
The GSS API also supports non-MIT Kerberos implementations, such as Java and Windows clients.
Note
Specific Kerberos and HP Vertica database processes are required before you can enable SSL protocol in Data
Services. For complete explanations and processes for security and authentication, consult your HP Vertica
user documentation and the MIT Kerberos user documentation.
MIT Kerberos authorizes connections to the HP Vertica database using a ticket system. The ticket system
eliminates the need for users to enter a password.
Related Information
After you install MIT Kerberos, define the specific Kerberos properties in the Kerberos configuration or
initialization file and save it to your domain. For example, save krb5.ini to C:\Windows.
See the MIT Kerberos documentation for information about completing the Unix krb5.conf property file or the
Windows krb5.ini property file. Kerberos documentation is located at: http://web.mit.edu/kerberos/krb5-
current/doc/admin/conf_files/krb5_conf.html .
[logging]
Property Description
default = <value> The location for the Kerberos library log file, krb5libs.log. For example: default = FILE:/var/log/krb5libs.log
kdc = <value> The location for the Key Distribution Center (KDC) log file, krb5kdc.log. For example: kdc = FILE:/var/log/krb5kdc.log
admin_server = <value> The location for the administrator log file, kadmind.log. For example: admin_server = FILE:/var/log/kadmind.log
[libdefaults]
Property Description
ticket_lifetime = <value> Set the number of hours for the initial ticket request. For example: ticket_lifetime = 24h
renew_lifetime = <value> Set the number of days a ticket can be renewed after the ticket lifetime expiration. For example: renew_lifetime = 7d. The default is 0.
forwardable = <value> Initial tickets can be forwarded when this value is set to true. For example: forwardable = true
[realms]
Property Description
<kerberos_realm> = {<subsection_property> = <value>} Location for each property of the Kerberos realm. For example:
EXAMPLE.COM = {kdc=<location>
admin_server=<location>
kpasswd_server=<location>}
Properties include:
● KDC location
● Admin Server location
● Kerberos Password Server location
Note
Host and server names are lowercase.
[domain_realm]
Property Description
<server_host_name>=<kerberos_realm> Maps the server host name to the Kerberos realm name. If you
use a domain name, prefix the name with a period (.).
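Assembled from the sections above, a minimal configuration file might look like the following sketch. The realm, host names, and log locations are placeholders, and default_realm is a standard [libdefaults] property not listed in the tables above:

```
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log

[libdefaults]
default_realm = EXAMPLE.COM
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true

[realms]
EXAMPLE.COM = {
  kdc = kdc.example.com
  admin_server = kdc.example.com
  kpasswd_server = kdc.example.com
}

[domain_realm]
.example.com = EXAMPLE.COM
```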
After you have updated the configuration or initialization file and saved it to the client domain, execute the kinit
command to generate a secure key.
For example, enter the following command using your own information for the variables:
kinit <user_name>@<realm_name>
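A successful run looks similar to the following transcript (the principal and cache location are illustrative); running klist afterward confirms that a ticket was granted:

```
$ kinit dsuser@EXAMPLE.COM
Password for dsuser@EXAMPLE.COM:
$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: dsuser@EXAMPLE.COM
```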
See the MIT Kerberos ticket management documentation for complete information about using the kinit
command to obtain tickets: http://web.mit.edu/kerberos/krb5-current/doc/user/tkt_mgmt.html .
To enable SSL for HP Vertica database datastores, first create a data source name (DSN).
You must be an HP Vertica user with database administrator permissions to perform these steps. Non-database-
administrator users can access the HP Vertica database only when they are associated with an authentication
method through a GRANT statement.
You must be using SAP Data Services 4.2 SP7 Patch 1 (14.2.7.1) or later to create a DSN for HP Vertica.
Install MIT Kerberos 5 and perform all of the required steps for MIT Kerberos authentication for HP Vertica. See
your HP Vertica documentation in the security and authentication sections for details.
1. Open the ODBC Data Source Administrator. You can access it either from the Datastore Editor in Data Services
Designer or directly from your Start menu.
2. In the ODBC Data Source Administrator, open the System DSN tab and click Add.
3. Select the HP Vertica driver from the list and click Finish.
4. Open the Basic Settings tab and complete the following options:
Option Value
Port Enter the port number on which HP Vertica listens for ODBC connections. The default is 5433.
User Name Enter the database user name. This is the user with DBADMIN permission, or a user who is associated with the authentication method through a GRANT statement.
Option Value
Kerberos Host Name Enter the name of the host computer where Kerberos is installed.
Result Buffer Size (bytes) Enter the applicable value in bytes. Default is 131072.
7. Click Test Connection. When the connection test is successful, click OK and close the ODBC Data Source
Administrator.
Now the HP Vertica DSN that you just created is included in the DSN option in the datastore editor.
Create the HP Vertica database datastore in Data Services Designer and select the DSN that you just created.
Related Information
SSL encryption protects data as it is transferred between the database server and Data Services.
An administrator must install MIT Kerberos 5 and enable Kerberos for HP Vertica SSL protocol. Additionally, an
administrator must create an SSL data source name (DSN) using the ODBC Data Source Administrator so that it is
available to choose when you create the datastore. See the Administrator Guide for more information about
configuring MIT Kerberos.
SSL encryption for HP Vertica is available in SAP Data Services version 4.2 Support Package 7 Patch 1 (14.2.7.1) or
later.
Note
Enabling SSL encryption slows down job performance.
Note
An HP Vertica database datastore requires that you choose DSN as the connection method. DSN-less
connections are not allowed for HP Vertica datastores with SSL encryption.
SSL-specific options
Option Value
Data Source Name Select the HP Vertica SSL DSN data source file that was created previously in the ODBC
Data Source Administrator.
3. Complete the remaining applicable advanced options and save your datastore.
Related Information
Options, descriptions, and possible values for creating an HP Vertica database datastore.
After you create the HP Vertica database datastore, you can import HP Vertica tables into Data Services. Use the tables as sources or targets in a data flow, and create HP Vertica template tables.
SSL protocol is available for HP Vertica database datastores. Before you can create an SSL-enabled HP Vertica
datastore, the HP Vertica database administrator user must install and configure MIT Kerberos 5 and create a DSN
in the ODBC Data Source Administrator.
Main window
Data source name (refer to the requirements of your database): Required. Select a DSN from the drop-down list if you have already defined one. If you haven't defined a DSN previously, click ODBC Admin to define a DSN.
User name (alphanumeric characters and underscores): Enter the user name of the account through which SAP Data Services accesses the database.
Connection
Additional connection parameters (alphanumeric characters and underscores, or blank): Enter information for any additional connection parameters. Use the format: <parameter1=value1;parameter2=value2>
General
Rows per commit (positive integer): Enter the maximum number of rows loaded to a target table before saving the data. This value is the default commit size for target tables in this datastore. You can overwrite this value for individual target tables.
Overflow file directory (directory path, or click Browse): A working directory on the database server that stores files such as logs. Must be defined to use FTP.
Session
Additional session parameters (a valid SQL statement, or multiple SQL statements delimited by semicolons): Additional session parameters specified as valid SQL statements.
Aliases (alphanumeric characters and underscores, or blank): Click the option to open a Create New Alias window.
Set up an HP Vertica database datastore for bulk loading by increasing the commit size in the loader and by
selecting to use the native connection load balancing option when you configure the ODBC driver.
There are no specific bulk-loading options when you create an HP Vertica database datastore. However, when you load data to an HP Vertica target in a data flow, the software automatically executes an HP Vertica COPY statement with the LOCAL option. This statement makes the ODBC driver read and stream the data file from the client to the server.
You can further increase loading speed by adjusting the loader settings in Designer.
The following table shows the conversion from HP Vertica data types to internal data types:
BOOLEAN int
FLOAT double
MONEY decimal
NUMERIC decimal
NUMBER decimal
DECIMAL decimal
CHAR varchar
VARCHAR varchar
DATE date
TIMESTAMP datetime
TIMESTAMPTZ varchar
TIME time
TIMETZ varchar
INTERVAL varchar
Data type conversion from internal data types to HP Vertica data types for template tables or Data_Transfer
transform tables.
Date Date
Datetime Timestamp
Decimal Decimal
Double Float
Int Int
Interval Float
Real Float
Time Time
Varchar Varchar
Timestamp Timestamp
Options and descriptions for setting up an HP Vertica table as a source in a data flow.
Option Description
Table name: The name of the table that you added as a source to the data flow.
Table owner: The owner that you entered when you created the HP Vertica table.
Database type: Set to HP Vertica by default. This is the database type that you chose when you created the datastore. You cannot change this option.
Options and descriptions for setting up an HP Vertica table as a target in a data flow.
General
Option Description
Number of loaders: The default number of loaders is 1, which is single-loader loading. For example, if you choose a Rows per commit of 1000 and set the Number of Loaders to 3, the software sends the first 1000 rows to the first loader, the next 1000 rows to the second loader, the next 1000 rows to the third loader, and then starts over with the first loader.
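The Number of loaders example above can be sketched as follows. This assumes that commit-sized batches are distributed to the loaders in round-robin order (an assumption for illustration; the function below is not product code):

```python
def distribute(total_rows: int, rows_per_commit: int, loaders: int) -> list:
    """Hypothetical sketch: assign each commit-sized batch of rows to loaders
    in round-robin order, returning (start, end) row ranges per loader."""
    batches = [[] for _ in range(loaders)]
    for i, start in enumerate(range(0, total_rows, rows_per_commit)):
        end = min(start + rows_per_commit, total_rows)
        batches[i % loaders].append((start, end))
    return batches

# 5000 rows, Rows per commit = 1000, Number of loaders = 3:
# loader 1 gets batches 1 and 4, loader 2 gets batches 2 and 5, loader 3 gets batch 3.
for n, b in enumerate(distribute(5000, 1000, 3), start=1):
    print("loader", n, b)
```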
Error handling
Option Description
Update control
Option Description
Use input keys: Yes: If the target table does not contain a primary key, this option enables the software to use the primary keys from the input.
Note
This option is not available for targets in real-time jobs or target tables that contain LONG columns.
Auto correct load: When you select Yes for this option, the software reads a row from the source, then checks whether the row exists in the target table with the same values in the primary key. If Use input keys is set to Yes, the software uses the primary key of the source table. Otherwise, the software uses the primary key of the target table. If the target table has no primary key, the software considers the primary key to be all the columns in the target.
When the column data from the source matches the value in Ignore columns with value, the software does not update the corresponding column in the target table. The value may be spaces. Otherwise, the software updates the corresponding column in the target with the source data.
Ignore columns with value: Enter a value that might appear in a source column and that you do not want updated in the target table. The value must be a string; it can include spaces, but the string cannot be enclosed in single or double quotation marks. When this value appears in the source column, the software does not update the corresponding target column during auto correct loading.
2.4 MongoDB
The MongoDB adapter allows you to read data from MongoDB and load it to other SAP Data Services targets.
MongoDB is an open-source document database that stores JSON-like documents, called BSON, with dynamic schemas instead of traditional schema-based data.
Data Services needs metadata to gain access to data for task design and execution. Use Data Services processes
to generate schema by converting each row of the BSON file into XML and converting XML to XSD.
Data Services uses the converted metadata in XSD files to access MongoDB data.
Use data from MongoDB as a source or target in a data flow, and also create templates.
The embedded documents and arrays in MongoDB are represented as nested data. SAP Data Services processes
can convert MongoDB BSON files to XML and then to XSD. Data Services saves the XSD file to the following
location: %DS_COMMON_DIR%\ext\mongo\mcache in your local drive.
Data Services has the following restrictions and limitations for working with MongoDB:
● In the MongoDB collection, the tag name should not contain special characters, which are invalid for the XSD
file (for example, >, <, &, /, \, #, and so on). If special characters exist, Data Services removes them.
● MongoDB data is always changing, so the XSD may not reflect the entire data structure of all the documents in the MongoDB collection.
● Projection queries on adapters are not supported.
● Data Services ignores any new fields that you add after the metadata schema creation that were not present in
the common documents.
● Push down operators are not supported when using MongoDB as a target.
Use MongoDB as a source in Data Services and then flatten the schema by using the XML_Map transform.
Example 1: This data flow changes the schema via the Query transform and then loads the data to an XML target.
Example 2: This data flow simply reads the schema and then loads it directly into an XML template file.
Example 3: This data flow flattens the schema using the XML_Map transform and then loads the data to a table or flat file.
Note
Specify conditions in the Query and XML_Map transforms. Some of them can be pushed down and others are
processed by Data Services.
Query criteria is used as a parameter of the db.<collection>.find() method. After dropping a MongoDB
table into a data flow as a source, open the source and add MongoDB query conditions.
To add a MongoDB query format, enter a value next to the Query criteria parameter in the Adapter Source tab.
Note
The query criteria should be in MongoDB query format. For example, { type: { $in: ["food", "snacks"] } }.
For example, given a value of {prize:100}, MongoDB returns only rows that have a field named “prize” with a
value of 100. MongoDB won't return rows that don't match this condition. If you don’t specify a value, MongoDB
returns all the rows.
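For illustration, the same criteria can be written as the filter documents that a MongoDB client library such as pymongo would pass to find(). This is a sketch of the query format only; pymongo is not part of Data Services:

```python
# The query criteria string entered in the Adapter Source tab corresponds to
# a MongoDB filter document.
criteria = {"type": {"$in": ["food", "snacks"]}}  # rows where type is food or snacks
prize_filter = {"prize": 100}                     # rows where the prize field equals 100

# With a live MongoDB connection this would be passed to find(), for example:
# from pymongo import MongoClient
# rows = MongoClient()["mydb"]["mycollection"].find(criteria)
print(criteria)
```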
If you specify a Where condition in a Query or XML_Map transform that comes after the MongoDB source in the
data flow, Data Services pushes down the condition to MongoDB so that MongoDB returns only the rows that you
want.
For more information about the MongoDB query format, see the MongoDB website.
Note
When you use the XML_Map transform, you may have a query condition in SQL format. When this happens, Data Services converts the SQL format to the MongoDB query format and uses the MongoDB specification to push down operations to the source database. In addition, be aware that Data Services does not support pushdown of queries on nested arrays.
Related Information
How SAP Data Services processes push down operators in a MongoDB source.
Data Services does not push down Sort by conditions but it does push down Where conditions. However, if you
use a nested array in a Where condition, Data Services does not push down the nested array.
Note
Data Services does not support push down operators when you use MongoDB as a target.
Data Services supports the following operators when you use MongoDB as a source:
● Comparison operators =, !=, >, >=, <, <=, like, and in.
● Logical operators and and or in SQL query.
Note
The _id field is considered the primary key. If you create a new document with a field named _id, that field will
be recognized as the unique BSON ObjectID. If a document contains more than one _id field (at a different
level), only the _id field in the first level will be considered the ObjectID.
You can set the following options in the Adapter Target tab of the target document editor:
Option Description
Use auto correct: Specifies basic operations when using MongoDB as your target datastore. The following values are available:
● True: The writing behavior is in Upsert mode. The software updates the document with the same _id or it inserts a new _id.
Note
Setting this option to True may slow the performance of writing operations.
● False (default): The writing behavior is in Insert mode. If documents have the same _id in the MongoDB collection, then an error message appears.
Write concern level: Write concern is a guarantee that MongoDB provides when reporting on the success of a write operation. This option allows you to enable or disable different levels of acknowledgment for writing operations.
Use bulk: Indicates whether or not you want to execute writing operations in bulk, which provides better performance.
When set to True, the software runs write operations in bulk for a single collection in order to optimize CRUD efficiency.
If a bulk contains more than 1000 write operations, MongoDB automatically splits it into multiple bulk groups.
For more information about bulk, ordered bulk, and bulk maximum rejects, see the MongoDB documentation at http://help.sap.com/disclaimer?site=http://docs.mongodb.org/manual/core/bulk-write-operations/.
Use ordered bulk: Specifies if you want to execute the write operations in serial (True) or parallel (False) order. The default value is False.
If you execute in parallel order (False), then MongoDB processes the remaining write operations even
when there are errors.
Documents per commit: Specifies the maximum number of documents that are loaded to a target before the software saves the data. If this option is left blank, the software uses the default of 1000.
Bulk maximum rejects: Specifies the maximum number of acceptable errors before Data Services fails the job. Note that data still loads to the target MongoDB even if the job fails.
For unordered bulk loading, if the number of errors is less than, or equal to, the number you specify
here, Data Services allows the job to succeed and logs a summary of errors in the adapter instance
trace log.
Enter -1 to ignore any bulk loading errors. Errors will not be logged in this situation.
Note
This option does not apply when Use ordered bulk is set to True.
Delete data before loading: Deletes existing documents in the current collection before loading occurs, and retains all the configuration, including indexes, validation rules, and so on.
Drop and re-create: Drops the existing MongoDB collection and creates a new one with the same name before loading occurs. If Drop and re-create is set to True, the software ignores the value of Delete data before loading. This option is available for template documents only. The default value is True.
Use audit: Logs data for auditing. Data Services creates audit files containing write operation information and stores them in the <DS_COMMON_DIR>/adapters/audits/ directory. The name of the file is <MongoAdapter_instance_name>.txt.
Here's what you can expect to see when using this option:
● If a regular load fails and Use audit is set to False, loading errors appear in the job trace log.
● If a regular load fails and Use audit is set to True, loading errors appear in the job trace log and in
the audit log.
● If a bulk load fails and Use audit is set to False, the job trace log provides a summary, but it does
not contain details about each row of bad data. There is no way to obtain details about bad data.
● If a bulk load fails and Use audit is set to True, the job trace log provides a summary, but it does
not contain details about each row of bad data. However, the job trace log tells you where to look
in the audit file for this information.
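The Bulk maximum rejects threshold described in the table above can be sketched as follows. This is a simplified illustration; the function is hypothetical, and the ordered-bulk branch is an assumption, since the option does not apply in that mode:

```python
def job_succeeds(error_count: int, bulk_max_rejects: int, ordered_bulk: bool) -> bool:
    """Hypothetical sketch of the Bulk maximum rejects threshold for bulk loads."""
    if ordered_bulk:
        # Assumption for illustration: the option does not apply to ordered bulk,
        # so any error fails the job.
        return error_count == 0
    if bulk_max_rejects == -1:
        return True  # -1 ignores bulk loading errors; nothing is logged
    # Unordered bulk: the job succeeds when errors stay at or below the threshold.
    return error_count <= bulk_max_rejects
```

For example, five errors against a threshold of ten lets the job succeed, and the errors are summarized in the adapter instance trace log.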
Use template documents as a target in one data flow or as a source in multiple data flows.
Template documents are particularly useful in early application development when you are designing and testing a
project. Find template documents in the Datastore tab of the Local Object Library. Expand the Template
Documents node and find the MongoDB datastore.
When you import a template document, the software converts it to a regular document. You can use the regular
document as a target or source in your data flow.
Note
Template documents are available in Data Services 4.2.7 and later. If you are upgrading from a previous version,
you need to edit the MongoDB datastore and then click OK to see the Template Documents node and any other
template document related options.
Template documents are similar to template tables. For information about template tables, see the Data Services
User Guide and the Reference Guide.
Create MongoDB template documents and use them as targets or sources in data flows.
1. In Data Services Designer, click the template icon from the tool palette.
2. Click inside a data flow in the workspace.
5. Click OK.
6. To use the template document as a target in the data flow, connect the template document to an input object.
7. Click Save.
Linking a data source to the template document and then saving the project generates a schema for the
template document. The icon changes in the workspace and the template document appears in the Template
Documents node under the datastore in the Local Object Library.
Drag template documents from the Template Documents node into the workspace to use them as a source.
Related Information
● Open a data flow and select one or more template target documents in the workspace. Right-click, and choose
Import Document.
● Select one or more template documents in the Local Object Library, right-click and choose Import Document.
The icon changes and the document appears under Documents instead of Template Documents in the Local Object
Library.
Note
The Drop and re-create configuration option is available only for template target documents. Therefore it is not
available after you convert the template target into a regular document.
Related Information
Data preview allows you to view a sampling of MongoDB data from documents.
To preview MongoDB document data, right-click on a MongoDB document name in the Local Object Library or on a
document in the data flow and then select View Data.
You can also click the magnifying glass icon on a MongoDB source and target object in the data flow.
Note
By default, the maximum number of rows displayed for data preview is 100. To change this number, use the
Rows To Scan adapter datastore configuration option. Enter -1 to display all rows.
For more information, see “Using View Data”, “Viewing and adding filters”, and “Sorting” in the Designer Guide.
SAP Data Services uses a Parallel Scan process to improve performance while it generates metadata for big data.
Generating metadata can be time consuming because Data Services needs to first scan all documents in the
MongoDB collection. Parallel Scan allows Data Services to use multiple parallel cursors when reading all the
documents in a collection, thus increasing performance.
Note
Parallel Scan works with MongoDB server version 2.6.0 and above.
For more information about the parallelCollectionScan command, consult the MongoDB documentation.
The software honors the MongoDB adapter datastore settings when re-importing.
To re-import all documents, right-click on a MongoDB datastore or on the Documents node and click Reimport All.
Note
When Use Cache is enabled, the software uses the cached schema.
When Use Cache is disabled, the software looks in the sample directory for a sample JSON file with the same
name. If there is a matching file, the software uses the schema from that file. If there isn't a matching JSON file
in the sample directory, the software re-imports the schema from the database.
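The re-import decision in the note above can be sketched as a small function. The function name, paths, and return strings are illustrative only, not part of the product:

```python
import os

def schema_source(use_cache: bool, doc_name: str, sample_dir: str) -> str:
    """Hypothetical sketch of how the schema source is chosen on re-import."""
    if use_cache:
        return "cached schema"                      # Use Cache enabled
    sample = os.path.join(sample_dir, doc_name + ".json")
    if os.path.exists(sample):                      # matching sample JSON file found
        return "schema from sample JSON file"
    return "schema re-imported from the database"   # no cache, no matching sample
```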
Search for MongoDB documents in a repository from within the object library.
Process your SAP HANA data in SAP Data Services by creating an SAP HANA database datastore.
Use SAP HANA database datastores as sources and targets in Data Services processes. Protect your HANA data
using SSL protocol and cryptographic libraries. Create stored procedures and enable bulk loading for faster
reading and loading. Additionally, load spatial and complex spatial data from Oracle to SAP HANA.
Note
Beginning with SAP HANA 2.0 SP1, you can access databases only through a multitenant database container
(MDC). If you use a version of SAP HANA that is earlier than 2.0 SP1, you can access only a single database.
Configure SAP HANA database datastores to use SSL encryption for all network transmissions between the
database server and SAP Data Services.
Caution
Only an administrator or someone with sufficient experience should configure SSL encryption for SAP HANA.
Using DSN SSL for SAP HANA network transmissions is available in SAP Data Services version 4.2 SP7 (14.2.7.0) or later.
Configure SSL on both the SAP HANA server side and the Data Services client side.
SSL encryption for SAP HANA database datastores requires a DSN (data source name) connection. You cannot
use a server name connection.
The tasks for enabling SSL encryption require you to have either the SAPCrypto library or the OpenSSL library.
These libraries may have been included with the database or with the platform you use. If you do not have either of
these libraries, or you have older versions, download the latest versions from the SAP Support Portal. To configure
the server side, make settings in the communication section of the global.ini file.
For more information about cryptographic libraries and settings for secure external connections in the
global.ini file for SAP HANA database, see the SAP HANA Network and Communication Security section of the
SAP HANA Security Guide.
When you create an SAP HANA database datastore with SSL encryption, configure the database server and SAP
Data Services for certificate authentication.
On the database server side, make settings in the communications section of the global.ini file based on the
cryptographic library you use.
For more information about cryptographic libraries and settings for secure external connections in the
global.ini file for SAP HANA database, see the SAP HANA Network and Communication Security section of the
SAP HANA Security Guide.
The following table lists the requirements for each type of SAP Data Services SSL provider.
Related Information
Configure SSL (Secure Sockets Layer) encryption for an SAP HANA database datastore on a Windows operating system.
Verify the type of your SSL encryption. It should be either SAPCrypto or OpenSSL.
For information about MMC, see the Microsoft Web site https://msdn.microsoft.com/en-us/library/
bb742442.aspx .
3. Open the ODBC Data Source Administrator.
Access the ODBC Data Source Administrator either from the Datastore Editor in Data Services Designer or
directly from your Start menu.
4. In the ODBC Data Source Administrator, open the System DSN tab and click Add.
5. Select the driver HDBODBC and click Finish.
6. Enter values in Data Source Name and Description.
7. Enter the Server:Port information and click Settings.
8. In the SSL Connection group, select the following options:
Create the SAP HANA datastore using the SSL DSN that you just created.
Related Information
Configure SSL encryption for an SAP HANA database datastore on a Unix operating system.
Verify the type of your SSL encryption. It should be either SAPCrypto or OpenSSL.
During this configuration process, use the SAP Data Services Connection Manager. Read about the Connection
Manager in the Administrator Guide.
1. Set $ODBCINI to a file on the same computer as the SAP HANA data source. For example:
Sample Code
export ODBCINI=<dir_path>/odbc.ini
touch $ODBCINI
Sample Code
$LINK_DIR/bin/DSConnectionManager.sh
3. Click the Data Sources tab and click Add to display the list of database types.
4. On the Select Database Type window, select the SAP HANA database type and click OK.
The configuration page opens with some of the connection information automatically completed:
○ Absolute location of the odbc.ini file
○ Driver for SAP HANA
○ Driver Version
5. Complete the remaining applicable options including:
○ DSN Name
○ Driver
○ Server Name
○ Instance
Option Description
SSL Encryption Option (Y/N): Set to Y so that the client, Data Services, verifies the certificate from the database server before accepting it. If you set it to N, the client accepts the certificate without verifying it, which is less secure.
HANA SSL Provider (sapcrypto/openssl): Specify the cryptographic provider for your SAP HANA SSL connectivity. Options are:
○ OpenSSL (.pem)
○ SAPCrypto (.pse)
SSL Certificate File: Enter the location and file name for the SSL certificate file.
SSL Key File: Enter the location and file name for the SSL key file.
If you choose OpenSSL (.pem) for the HANA SSL provider option, use the Data Services bundled OpenSSL and
not your operating system OpenSSL.
8. To ensure that you use the Data Services bundled OpenSSL, follow these substeps:
a. Check your OpenSSL version and its shared library dependencies using the ldd command.
For example, if your client operating system has OpenSSL version 0.9.8, run the following command:
Sample Code
ldd /usr/bin/openssl
linux-vdso.so.1 => (0x00007fff37dff000)
libssl.so.0.9.8 => /usr/lib64/libssl.so.0.9.8 (0x00007f0586e05000)
libcrypto.so.0.9.8 => /usr/lib64/libcrypto.so.0.9.8
(0x00007f0586a65000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f058682f000)
libz.so.1 => /build/i311498/ds427/dataservices/bin/libz.so.1
(0x00007f0586614000)
libc.so.6 => /lib64/libc.so.6 (0x00007f058629c000)
/lib64/ld-linux-x86-64.so.2 (0x00007f058705c000)
b. Create a soft link in <LINK_DIR>/bin. Use the same version name but refer to the Data Services SSL libraries:
Sample Code
ln -s libbodi_ssl.so.1.0.0 libssl.so.0.9.8
ln -s libbodi_crypto.so.1.0.0 libcrypto.so.0.9.8
When you have completed the configuration, the Connection Manager automatically tests the connection.
Create an SAP HANA database datastore with SSL encryption. SSL encryption protects data as it is transferred
between the database server and Data Services.
An administrator must import and configure the SAP HANA database certificate. Additionally, you must create an
SSL data source (DSN) so that it is available to choose when you create the datastore. Information about
importing and configuring an SAP HANA database certificate is in the Administrator Guide.
SSL encryption is available in SAP Data Services version 4.2 SP7 (14.2.7.0) or later.
Note
Enabling SSL encryption will slow down job performance.
Note
An SAP HANA database datastore requires that you choose DSN as a connection method. DSN-less
connections are not allowed when you enable SSL encryption.
Note
If you are using SAP HANA version 2.0 SPS 01 multitenancy database container (MDC) or later, specify the port
number and the database server name specific to the tenant database you are accessing.
SSL-specific options
Option Value
Data Source Name: Select the SAP HANA SSL DSN data source file that was created previously (see Prerequisites above).
Find descriptions for all of the SAP HANA database datastore options in the Reference Guide.
3. Complete the remaining applicable Advanced options and save your datastore.
When you create an SAP HANA database datastore, there are several options and settings that are unique for SAP
HANA.
Beginning with SAP HANA 2.0 SPS 01 MDC, use a database datastore to access a specified tenant database.
Database version (SAP HANA database <version number>): Select the version of your SAP HANA database client (the version of the SAP HANA database that this datastore accesses).
Use data source name (DSN) (checkbox selected or not selected): Select to use a data source name (DSN) to connect to the database.
Note
For SSL encryption, use the DSN SSL that you created in “Configure DSN SSL for SAP HANA” in the Administrator Guide.
Database server name (computer name): Enter the name of the computer where the SAP HANA server is located.
Port (five-digit integer): Enter the port number to connect to the SAP HANA server. Default: 30015. This option is required if you did not select Use data source name (DSN).
Note
See the SAP HANA documentation to learn how to find the specific tenant database port number.
Data source name (refer to the requirements of your database): Select or type the data source name that you defined in the ODBC Administrator for connecting to your database.
Note
For SSL encryption, use the DSN SSL that you created in “Configure DSN SSL for SAP HANA” in the Administrator Guide.
User name (alphanumeric characters and underscores): Enter the user name of the account through which the software accesses the database.
Database name (refer to the requirements of your database): Optional. Enter the specific tenant database name.
Additional connection information (alphanumeric characters and underscores, or blank): Enter information for any additional parameters that the data source supports (parameters that the data source ODBC driver and database support). Use the format: <parameter1=value1; parameter2=value2>
Rows per commit (positive integer): Enter the maximum number of rows loaded to a target table before saving the data.
Overflow file directory (directory path, or click Browse): Enter the location of overflow files written by target tables in this datastore. You can also use a variable.
Aliases (click here to create): Enter the alias name and the owner name to which the alias name maps.
Use SAP HANA tables as targets in a data flow when applicable, and complete the options specific to SAP HANA.
Options
Option Description
Table type For template tables, select the appropriate table type for your SAP HANA target:
Bulk loading
Option Description
Mode Specify the mode for loading data to the target table:
Commit size: Default: Data Services identifies the SAP HANA target table type and applies a default commit size for the maximum number of rows loaded to the staging and target tables before saving the data (committing). You can also type any value in the field that is greater than one.
Update method: Specify how the input rows are applied to the target table. Default: Data Services applies the default value for this option based on the SAP HANA target table type.
Note
Do not use DELETE-INSERT if the update rows do not contain data for all columns in
the target table, because Data Services will replace missing data with NULLs.
Related Information
Performance Optimization Guide: Using Bulk Loading, Bulk loading in SAP HANA [page 54]
SAP Data Services supports SAP HANA stored procedures with zero, one, or more output parameters.
Data Services supports scalar data types for input and output parameters. Data Services does not support table data types. If you try to import a procedure with a table data type, the software issues an error. Data Services does not support data types such as binary, blob, clob, nclob, or varbinary for SAP HANA procedure parameters.
Procedures can be called from a script or from a Query transform as a new function call.
Example
Syntax
Limitations
SAP HANA provides limited support of user-defined functions that can return one or several scalar values. These
user-defined functions are usually written in L. If you use user-defined functions, limit them to the projection list
and the GROUP BY clause of an aggregation query on top of an OLAP cube or a column table. Data Services does not support these functions.
SAP Data Services improves bulk load performance by using a staging mechanism during bulk loading to the SAP
HANA database.
When Data Services uses changed data capture (CDC) or auto correct load, it uses a temporary staging table to
load the target table. Data Services loads the data to the staging table and applies the operation codes INSERT,
UPDATE, and DELETE to update the target table. With the Bulk load option selected in the target table editor, any
one of the following conditions triggers the staging mechanism:
By default, Data Services automatically detects the SAP HANA target table type and updates the table accordingly
for optimal performance.
Because the bulk loader for SAP HANA is scalable and supports UPDATE and DELETE operations, the following
options in the target table editor are also available for bulk loading:
Find these options in the Target Table editor, Options Advanced Update Control .
Related Information
Reference Guide: Objects, SAP HANA target table options [page 53]
Data Services performs data type conversions when it imports metadata from external sources or targets into the
repository and when it loads data into an external table or file.
Data Services uses its own conversion functions instead of conversion functions that are specific to the database
or application that is the source of the data.
Additionally, if you use a template table or Data_Transfer table as a target, the software converts from internal data
types to the data types of the respective DBMS.
Data type conversion when SAP Data Services imports metadata from an SAP HANA source or target into the
repository and then loads data to an external table or file.
integer int
tinyint int
smallint int
bigint decimal
char varchar
nchar varchar
varchar varchar
nvarchar varchar
float double
real real
double double
date date
time time
timestamp datetime
clob long
nclob long
blob blob
binary blob
varbinary blob
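The import-time conversions in the table above can be transcribed as a simple lookup table. This is an illustrative sketch, not a Data Services API:

```python
# Conversion from SAP HANA data types to Data Services internal data types,
# transcribed from the table above.
HANA_TO_INTERNAL = {
    "integer": "int", "tinyint": "int", "smallint": "int",
    "bigint": "decimal",
    "char": "varchar", "nchar": "varchar",
    "varchar": "varchar", "nvarchar": "varchar",
    "float": "double", "real": "real", "double": "double",
    "date": "date", "time": "time", "timestamp": "datetime",
    "clob": "long", "nclob": "long",
    "blob": "blob", "binary": "blob", "varbinary": "blob",
}

print(HANA_TO_INTERNAL["timestamp"])  # datetime
```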
The following table shows the conversion from internal data types to SAP HANA data types in template tables.
blob blob
date date
datetime timestamp
decimal decimal
double double
int integer
interval real
long clob/nclob
real decimal
time time
timestamp timestamp
varchar varchar/nvarchar
SAP Data Services supports spatial data (such as point, line, polygon, collection, or a heterogeneous collection) for specific databases.
The following list contains specific databases that support spatial data in SAP Data Services:
When you import a table with spatial data columns, Data Services imports the spatial type columns as character-based large objects (clob). The column attribute is Native Type, which has the value of the actual data type in the database. For example, the native type for Oracle is SDO_GEOMETRY, for Microsoft SQL Server it is geometry/geography, and for SAP HANA it is ST_GEOMETRY.
Limitations
● You cannot create template tables with spatial types because spatial columns are imported into Data Services
as clob.
● You cannot manipulate spatial data inside a data flow because the spatial utility functions are not supported.
Load spatial data from Oracle or Microsoft SQL Server to SAP HANA.
Learn more about spatial data by reading the SAP HANA documentation.
1. Import a source table from Oracle or Microsoft SQL Server to SAP Data Services.
2. Create a target table in SAP HANA with the appropriate spatial columns.
3. Import the SAP HANA target table into Data Services.
4. Create a data flow with an Oracle or Microsoft SQL Server source as reader.
Include any necessary transformations.
5. Add the SAP HANA target table as a loader.
Make sure not to change the data type of spatial columns inside the transformations.
6. Build a job that includes the data flow and run it to load the data into the target table.
Complex spatial data is data such as circular arcs and LRS geometries.
For example, in the SQL below, the table name is “Points” and the “geom” column contains the geospatial data:
SELECT
  SDO_UTIL.TO_WKTGEOMETRY(
    SDO_GEOM.SDO_ARC_DENSIFY(
      geom,
      MDSYS.SDO_DIM_ARRAY(
        MDSYS.SDO_DIM_ELEMENT('X',-83000,275000,0.0001),
        MDSYS.SDO_DIM_ELEMENT('Y',366000,670000,0.0001)
      ),
      'arc_tolerance=0.001'
    )
  )
FROM "SYSTEM"."POINTS"
For more information about how to use these functions, see the Oracle Spatial Developer's Guide on the Oracle
Web page at SDO_GEOM Package (Geometry) .
Use the Data Services Connection Manager for Unix platforms to configure ODBC databases and ODBC drivers to
use specific databases as repositories, sources, and targets in Data Services.
The Connection Manager is a command-line utility. However, a graphical user interface (GUI) is available.
Note
To use the graphical user interface for Connection Manager, install the GTK+2 library. The GTK+2 is a free multi-
platform toolkit that creates user interfaces. For more information about obtaining and installing GTK+2, see
https://help.sap.com/viewer/disclaimer-for-links?q=https%3A%2F%2Fwww.gtk.org%2F.
When you use DSConnectionManager.sh in the command line, the -c parameter must be the first parameter.
If an error occurs when using the Connection Manager, use the -d option to show details in the log.
Example
$LINK_DIR/bin/DSConnectionManager.sh -c -d
Note
For Windows installation, use the ODBC Driver Selector to configure ODBC databases and drivers for
repositories, sources, and targets.
SAP Data Services provides access to various cloud databases and storages to use for reading or loading big data.
Access various cloud databases through file location objects and file format objects.
SAP Data Services supports many cloud database types to use as readers and loaders in a data flow.
In SAP Data Services, you create a database datastore to access your data from Amazon Redshift. Additionally, load Amazon S3 data files into Redshift using the built-in function load_from_s3_to_redshift.
Use the Amazon Redshift ODBC driver to connect to the Redshift cluster database. The Redshift ODBC driver
connects to Redshift on Windows and Linux platforms only.
For information about downloading and installing the Amazon Redshift ODBC driver, see the Amazon Redshift
documentation on the Amazon website.
Note
SSL settings are managed through the Amazon Redshift ODBC Driver. In the Amazon Redshift ODBC Driver DSN
Setup window, set the SSL Authentication option to allow.
● Import tables
● Read or load Redshift tables in a data flow
● Preview data
● Create and import template tables
● Load Amazon S3 data files into a Redshift table using the built-in function load_from_s3_to_redshift
For more information about template tables and data preview, see the Designer Guide.
Option | Possible values | Description
Database Version | Redshift <version number> | Enter the Redshift database version. For example, Redshift 8.<x>.
Data Source Name | Refer to the requirements of your database. | Type the data source name (DSN) configuration name, which is defined in the Amazon Redshift ODBC Driver, for connecting to your database.
User Name | Alphanumeric characters and underscores | Enter the user name of the account through which Data Services accesses the database.
Enable Automatic Data Transfer | n/a | Enables transfer tables in this datastore, which the Data_Transfer transform can use to push down subsequent database operations.
Connection
Option | Possible values | Description
Additional connection parameters | Alphanumeric characters and underscores, or blank | Enter information for any additional connection parameters. Use the format: <parameter1=value1;parameter2=value2>
General
Option | Possible values | Description
Rows per commit | Positive integer | Enter the maximum number of rows loaded to a target table before saving the data. This value is the default commit size for target tables in this datastore. You can overwrite this value for individual target tables.
Bulk loader directory | Directory path or click Browse | Enter the location where data files are written for bulk loading. You can enter a variable for this option.
Overflow file directory | Directory path or click Browse | Enter the location of overflow files written by target tables in this datastore. A variable can also be used.
Additional session parameters | A valid SQL statement, or multiple SQL statements delimited by a semicolon | Additional session parameters specified as valid SQL statement(s).
Aliases
Option | Possible values | Description
Aliases | Alphanumeric characters and the underscore symbol (_) | Enter the alias name of the database owner. For more information, see “Creating an alias” in the Designer Guide.
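The additional-connection-parameters string follows a simple semicolon-separated key=value layout. The parser below is purely illustrative (Data Services passes the string through to the driver; this helper is not part of the product), but it shows how the format is read:

```python
def parse_params(raw: str) -> dict:
    """Split 'parameter1=value1;parameter2=value2' into a dict of settings."""
    pairs = {}
    for part in raw.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)  # split on the first '=' only
            pairs[key.strip()] = value.strip()
    return pairs
```

For example, parse_params("SSLMode=require;Port=5439") yields two settings, SSLMode and Port.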
1. Download and install the Amazon Redshift ODBC driver for Linux. For more information, see “Install the
Amazon Redshift ODBC Driver on Linux Operating Systems” in the Amazon Redshift Management Guide on
the Amazon website ( http://docs.aws.amazon.com/redshift/latest/mgmt/install-odbc-driver-linux.html ).
After installing the ODBC driver on Linux, you'll need to configure the following files:
○ amazon.redshiftodbc.ini
○ odbc.ini
○ odbcinst.ini
For more information about these files and other configuration information, see “Configure the ODBC Driver
on Linux and Mac OS X Operating Systems” in the Amazon Redshift Management Guide on the Amazon
website (http://docs.aws.amazon.com/redshift/latest/mgmt/odbc-driver-configure-linux-mac.html ).
2. At the end of /opt/amazon/redshiftodbc/lib/64/amazon.redshiftodbc.ini, add a line that points to the libodbcinst.so file in the unixODBC lib directory.
For example:
[Driver]
DriverManagerEncoding=UTF-16
ODBCInstLib=/build/unixODBC-232/lib/libodbcinst.so
Note
The unixODBC lib path depends on where you installed the driver manager. For example, for unixODBC 2.3.2 installed under /build, the path would be /build/unixODBC-232/lib.
3. Run <LINK_DIR>/bin/DSConnectionManager.sh and add the Redshift data source. A sample session follows:
Specify the DSN name from the list or add a new one:
DS42_REDSHIFT
Specify the User Name:
<name of the user>
Type database password:(no echo)
Retype database password:(no echo)
Specify the Unix ODBC Lib Path:
/build/unixODBC-232/lib
Specify the Driver:
/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so
Specify the Driver Version:'8'
8
Specify the Host Name:
<host name/IP address>
Specify the Port:
<port number>
Specify the Database:
<database name>
Specify the Redshift SSL certificate verification mode
[require|allow|disable|prefer|verify-ca|verify-full]:'require'
require
Testing connection...
Successfully added database source.
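After a successful run, the Connection Manager records a DSN entry in the file named by $ODBCINI. The sketch below builds such an entry as text so you can see the shape of the result; the helper is hypothetical, and the exact keys that Connection Manager writes can differ by driver version:

```python
def redshift_dsn_entry(dsn, driver, host, port, database, sslmode="require"):
    """Build an odbc.ini-style DSN section for the Amazon Redshift ODBC driver."""
    lines = [
        f"[{dsn}]",
        f"Driver={driver}",
        f"Host={host}",
        f"Port={port}",
        f"Database={database}",
        f"SSLMode={sslmode}",  # matches the certificate verification mode above
    ]
    return "\n".join(lines) + "\n"

entry = redshift_dsn_entry(
    "DS42_REDSHIFT",
    "/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so",
    "redshift-host.example.com",  # hypothetical host
    5439,
    "dev",
)
```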
Option descriptions for using an Amazon Redshift database table as a source in a data flow.
When you use an Amazon Redshift table as a source, the software supports the following features:
The following functions behave differently in Data Services when you use them with Amazon Redshift:
● When using add_month(datetime, int), pushdown doesn't occur if the second parameter is not in an
integer data type.
● When using cast(input as ‘datatype’), pushdown does not occur if you use the real data type.
● When using to_char(input, format), pushdown doesn't occur if the format is ‘XX’ or a number such as
‘099’, ‘999’, ‘99D99’, ‘99G99’.
● When using to_date(date, format), pushdown doesn't occur if the format includes a time part, such as
‘YYYY-MM-DD HH:MI:SS’.
For more information, see SAP Note 2212730 and “Maximizing Push-Down Operations” in the Performance
Optimization Guide.
The following table lists source options when you use an Amazon Redshift table as a source:
Option | Description
Table name | Name of the table that you added as a source to the data flow.
Table owner | Owner that you entered when you created the Redshift table.
Database type | Database type that you chose when you created the datastore. You cannot change this option.
The Redshift source table also uses common table source options.
Descriptions of options for using an Amazon Redshift table as a target in a data flow.
Note
The Amazon Redshift primary key is informational only and the software does not enforce key constraints for
the primary key. Be aware that using SELECT DISTINCT may return duplicate rows if the primary key is not
unique.
Note
The Amazon Redshift ODBC driver does not support parallel loading via ODBC into a single table. Therefore, the Number of Loaders option on the Options tab is not applicable for a regular loader.
Option | Description
Bulk load | Select to use bulk loading options to write the data.
Mode | Select the mode for loading data in the target table:
● Append: Adds new records to the table.
● Truncate: Deletes all existing records in the table, and then adds new records.
Note
Append mode does not apply to template tables.
S3 file location | Enter or select the path to the Amazon S3 configuration file. You can enter a variable for this option.
Maximum rejects | Enter the maximum number of acceptable errors. After the maximum is reached, the software stops bulk loading. Set this option when you expect some errors. If you enter 0, or if you do not specify a value, the software stops the bulk loading when the first error occurs.
Generate files only | Enable to generate data files that you can use for bulk loading. When enabled, the software loads data into data files instead of the target in the data flow. The software writes the data files into the bulk loader directory specified in the datastore definition. If you do not specify a bulk loader directory, the software writes the files to <%DS_COMMON_DIR%>\log\bulkloader\<tablename><PID>. Then you manually copy the files to the Amazon S3 remote system.
Clean up bulk loader directory after load | Enable to delete all bulk load-oriented files from the bulk load directory and the Amazon S3 remote system after the load is complete.
Number of loaders | Sets the number of threads that generate multiple data files for a parallel load job. Enter a positive integer for the number of loaders (threads).
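With Generate files only and Number of loaders set, each loader thread produces its own data file. One simple way to picture the distribution is a round-robin split of rows across loaders, sketched below; this is an illustration of the idea only, not the software's actual partitioning algorithm:

```python
def split_rows(rows, num_loaders):
    """Assign rows round-robin so each loader thread gets its own file's worth."""
    files = [[] for _ in range(num_loaders)]
    for i, row in enumerate(rows):
        files[i % num_loaders].append(row)
    return files

parts = split_rows(list(range(10)), 3)
# parts[0] holds rows 0, 3, 6, 9
```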
SAP Data Services converts Redshift data types to Data Services data types when Data Services imports
metadata from a Redshift source or target into the repository.
Redshift data type | Data Services data type
smallint | int
integer | int
bigint | decimal(19,0)
decimal | decimal
real | real
float | double
boolean | varchar(5)
char | char
Note
The char data type doesn't support multi-byte characters. The maximum range is 4096 bytes.
nchar | char
varchar | varchar
nvarchar | varchar
Note
The varchar and nvarchar data types support UTF-8 multi-byte characters. The size is the number of bytes, and the maximum range is 65535.
Caution
If you try to load multi-byte characters into a char or nchar data type column, Redshift produces an error. Redshift internally converts the nchar and nvarchar data types to char and varchar. The char data type in Redshift doesn't support multi-byte characters. Use overflow to catch the unsupported data or, to avoid this problem, create a varchar column instead of using the char data type.
date | date
timestamp | datetime
text | varchar(256)
bpchar | char(256)
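The caution above can be guarded against before loading. The pre-check below is hypothetical, not part of Data Services; it simply flags values that are unsafe for Redshift char/nchar columns (single-byte characters only, 4096-byte maximum):

```python
def fits_redshift_char(value: str, size: int = 4096) -> bool:
    """True if the value is single-byte only and within the char size limit."""
    encoded = value.encode("utf-8")
    # Multi-byte characters encode to more bytes than the character count.
    return len(encoded) == len(value) and len(encoded) <= size
```

ASCII values such as "household" pass; values containing multi-byte characters such as "café" fail and should go to a varchar column or overflow handling instead.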
The following data type conversions apply when you create a template table:
Data Services data type | Redshift data type
blob | varchar(max)
date | date
datetime | datetime
decimal | decimal
int | integer
interval | float
long | varchar(8190)
real | float
time | varchar(25)
timestamp | datetime
varchar | varchar/nvarchar
char | char/nchar
Developers and administrators who use Microsoft SQL Server can store on-premise SQL Server workloads on an
Azure virtual machine in the cloud.
The Azure virtual machine supports both Unix and Windows platforms.
Data Services lets you move files from local storage, such as a local drive or folder, to an Azure container. You can use an existing container or create one if it does not exist. You can also import files (called “blobs” when in a container) from an Azure container to a local drive or folder. The files can be any type and are not internally manipulated by Data Services. Currently, Data Services supports only the block blob type in container storage.
You use a file format to describe a blob file and use it within a data flow to perform extra operations on the file. The
file format can also be used in a script to automate upload and local file deletion.
The following are the high-level steps for uploading files to a container storage blob in Microsoft Azure.
1. Create a storage account in Azure and take note of the primary shared key. For more information, see
Microsoft documentation or Microsoft technical support.
2. Create a file location object with the Azure Cloud Storage protocol. For details about the file location object
option settings in Azure, see the Reference Guide.
3. Create a job in Data Services Designer.
4. Add a script containing the appropriate function to the job.
○ copy_to_remote_system
copy_to_remote_system('New_FileLocation', '*')
A script that contains this function copies all of the files from the local directory specified in the file location
object to the container specified in the same object.
5. Save and run the job.
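The wildcard passed to copy_to_remote_system selects which local files are uploaded. Its matching behaves like shell-style globbing, sketched below; this is an assumption for illustration, and Data Services' own matcher may differ in detail:

```python
from fnmatch import fnmatch

def select_files(filenames, pattern):
    """Pick the local files whose names match the copy pattern, e.g. '*' or '*.csv'."""
    return [name for name in filenames if fnmatch(name, pattern)]

# '*' matches every file; '*.csv' restricts the upload to CSV files.
select_files(["q1.csv", "q2.csv", "notes.txt"], "*.csv")
```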
The Google BigQuery datastore contains access information and passwords so that the software can open your
Google BigQuery account on your behalf.
After accessing your account, SAP Data Services can load data to or extract data from your Google BigQuery
projects:
● Extract data from a Google BigQuery table to use as a source for Data Services processes.
● Load generated data from Data Services to Google BigQuery for analysis.
● Automatically create and populate a table in your Google BigQuery dataset by using a Google BigQuery
template table.
For complete information about Data Services and Google BigQuery, see the Supplement for Google BigQuery.
Option descriptions for the SAP Data Services Google BigQuery datastore editor.
Create a new datastore to open the Google BigQuery datastore editor. See the Designer Guide for information
about creating a datastore.
Option | Instruction
Consists of the Google URL plus the name of the Web access service provider, OAuth 2.0.
Service Account Email Address | Paste the service account e-mail address that you copied from your Google project.
Substitute Access Email Address | Optional. Enter the substitute e-mail address from your Google BigQuery datastore.
● Proxy host
● Proxy port
● Proxy user name
● Proxy password
Google Cloud Storage for Reading | Set this option only when you are downloading data from Google BigQuery as a source, and the data sets are larger than approximately 10 MB. Otherwise, leave the default setting of blank.
Option descriptions for the Target tab in the datastore explorer for the Google BigQuery datastore table.
When you include a Google BigQuery table in a data flow, you edit the target information for the target table.
Double-click the target table in the data flow to open the target editor.
Option | Description
Make Port | Creates an embedded data flow port from a source or target file.
Maximum failed records per loader | Sets the maximum number of records that can fail per loader before Google stops loading records. The default is zero (0).
The Target tab also displays the Google table name and the datastore used to access the table.
When you have larger data files to extract from Google BigQuery, create a file location object that uses Google
Cloud Storage (GCS) protocol to optimize data extraction.
Consider the following factors before you decide to use the GCS file location object for optimization. Compare the
time saved using optimization against the potential fees from using your GCS account in this manner. Additionally,
the optimization may not be beneficial for smaller data files of less than or equal to 10 MB.
Required information to complete the GCS file location object includes the following:
How to set it up
4.1.3.4 load_from_gcs_to_gbq
A function that uses information from the named file location object to copy data from Google Cloud Storage into Google BigQuery tables.
Use this function in a workflow script to transfer data from Google Cloud Storage into Google BigQuery tables to be used as a source in a data flow. The software uses the local and remote paths and Google Cloud Storage protocol information from the named file location object.
Syntax
load_from_gcs_to_gbq('<GBQ_datastore_name>','<remote_file_name>','<table_name>','<write_mode>','<file_format>');
Return value
int
Where
<GBQ_datastore_name> Name of the Google BigQuery datastore associated with the file location object (NewGBQ1 in the example below).
<remote_file_name> Name of the file to copy from the remote server in the format gs://bucket/filename. Wildcards may be used.
<table_name> Name of the Google BigQuery table in the format dataset.table.
<write_mode> (Optional.) The write mode value can be append (default) or truncated.
<file_format> The format of the data files using one of the following values:
Example
To copy a file json08_from_gbq.json from a Google BigQuery datastore named NewGBQ1 on a remote server
to a Google BigQuery table named test.json08 on a local server, set up a script object that contains the
load_from_gcs_to_gbq function as follows:
Sample Code
load_from_gcs_to_gbq('NewGBQ1', 'gs://test-bucket_1229/from_gbq/
json08_from_gbq.json', 'test.json08', 'append', 'NEWLINE_DELIMITED_JSON');
4.1.3.5 gbq2file
A function that optimizes software performance when you export large-volume Google BigQuery results to a user-
specified file on your local machine.
The software uses information in the associated Google cloud storage (GCS) file location object to identify your
GCS connection information, bucket name, and compression information.
Syntax
gbq2file('<GBQ_datastore_name>','<any_query_in_GBQ>','<local_file_name>','<file_location_object>','<field_delimiter>','/<numeric_row_delimiter>');
Return value
int
Where
<local_file_name> Local file location and name in which to store the Google data.
<file_location_object> Name of the Google Cloud Storage file location object in Data Services.
<field_delimiter> Optional. The field delimiter to use between fields in the exported data. The default is a comma.
<numeric_row_delimiter> Optional. The numeric code of the row delimiter. The default is 10 (hex 0A, the newline character).
1. The function saves your Google BigQuery results to a temporary table in Google.
2. The function uses an export job to export data from the temporary table to GCS.
Note
If the data is larger than 1 GB, Google exports the data in multiple files.
3. The function transfers the data from your Google Cloud Storage to the local file that you specified.
4. After the transfer is complete, the function deletes the temporary table and any files from Google Cloud
Storage.
For details about creating a Google BigQuery application datastore, see the Supplement for Google BigQuery.
File location objects specify specific file transfer protocols so that SAP Data Services safely transfers data from
server to server.
4.2.1 Amazon S3
Amazon Simple Storage Service (S3) is a product of Amazon Web Services that provides scalable storage in the
cloud.
Amazon S3 provides a service where you can store large volumes of data. In SAP Data Services, access your
Amazon S3 account using a file location object.
Data Services provides built-in functions for processing data that you can use with data from S3 and data that you
load to S3. There is one built-in function specifically for moving data from S3 to Amazon Redshift named
load_from_s3_to_redshift.
Use a file location object to access data or upload data stored in your Amazon S3 account.
The following table describes the file location options that are specific to the Amazon S3 protocol.
Option | Description
Region | Name of the region you are transferring data to and from; for example, "South America (Sao Paulo)".
Communication Protocol | Communication protocol you are using with S3: either http or https.
Connection Retry Count | Number of times the software tries to upload or download data before stopping the upload or download.
Batch size for uploading data, MB | Size of the data transfer you want the software to use for uploading data to S3.
Batch size for downloading data, MB | Size of the data transfer the software uses to download data from S3.
Number of threads | Number of upload and download threads for transferring data to S3.
Remote directory | Optional. Name of the directory for Amazon S3 to transfer files to and from.
Local directory | Optional. Name of the local directory to use to create the files. If you leave this field empty, the software uses the default Data Services workspace.
Proxy host, port, user name, password | Proxy information if you use a proxy server.
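Connection Retry Count behaves like a bounded retry loop around a transfer attempt. The generic sketch below illustrates that pattern; it is not Data Services code, and a real client would catch only transfer-related errors:

```python
def with_retries(action, retry_count):
    """Try an action up to retry_count times (>= 1), re-raising the last failure."""
    last_err = None
    for _ in range(retry_count):
        try:
            return action()
        except Exception as err:  # narrow this to transfer errors in real code
            last_err = err
    raise last_err
```

If every attempt fails, the loop stops after retry_count tries and surfaces the last error, which mirrors the option's "try, then stop" behavior.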
Uses the Redshift COPY command to copy data files from an Amazon Simple Storage Service (S3) bucket to a
Redshift table.
Before using this function, set up an S3 file location object. For more information, see Amazon S3 protocol [page
75].
Syntax
load_from_s3_to_redshift('<datastore_name>','<table_name>','<S3_file_location_object>','<file_name>','<copy_options>');
Where
<datastore_name> Name of the Amazon Redshift datastore.
<table_name> Name of the Redshift table to load, in the format owner.table.
<S3_file_location_object> Name of the Amazon S3 file location object.
<file name> Fully qualified name of the Amazon S3 file to copy to the Redshift table. Wild cards are allowed.
<copy_options> Options passed to the Redshift COPY command, including the following:
● acceptanydate: Accepts any date, even those with invalid formats, without throwing an error.
● acceptinvchars: Replaces invalid UTF-8 characters.
● blankasnull: Inserts null if the input data is blank.
● dateformat: Defines the date format. For example, \'YYYY-MM-DD\'.
● delimiter: Defines the column delimiter. For example, \'|\'.
● emptyasnull: Inserts null if input data is empty.
● encoding: Defines the data file encoding type. Valid values include utf8 (default), utf16, utf16le,
and utf16be.
● encrypted: Loads encrypted data files from S3.
● escape: Removes the escape (\) character. For example, a\\b\\c becomes a\b\c.
● explicit_ids: Data values must match the Identity format and Identity columns.
● fillrecord: Fills null if any record is missed.
● ignoreblanklines: Ignores blank lines.
● ignoreheader: Skips the specified number of rows as a file header. The default is 0.
● manifest: Loads manifest data files from S3.
● maxerror: Defines the maximum number of errors allowed. The default is 0.
● null as: Defines the special null string.
● removequotes: Removes quotes from the data file.
● roundec: Rounds up numeric values when the input value is greater than the scale defined for
the column.
● timeformat: Defines the timestamp format. For example, \'YYYY-MM-DD HH:MI:SS\'.
● trimblanks: Removes whitespace characters. Only applies to the varchar data type.
● truncatecolumns: Truncates data in columns when the input value is greater than the column
defined. Applies to varchar or char data types and rows 4MB or less in size.
● gzip: Loads compressed data files from S3.
● lzop: Loads compressed data files from S3.
● bzip2: Loads compressed data files from S3.
Sample Code
Example
To generate an AES256 key, enter the following:
You can then use the key to upload data from the Redshift table to the S3 bucket.
To copy the encrypted data files on S3 back to a Redshift table, enter the following:
load_from_s3_to_redshift('redshift_ft', 'public.t31_household',
'S3_to_Redshift_3', 't31_encrypted', 'master_symmetric_key \'<AES256 key> \'
encrypted bzip2 delimiter \'|\'');
Example
To copy JSON data from S3 to a Redshift table, with a JSON path, enter the following:
load_from_s3_to_redshift('redshift_ft', 'public.t32_category',
'S3_to_Redshift_3', 't33_category.json', 'json \'s3://dsqa-redshift-bkt3/
t33_category_jsonpath.json\'');
Example
To copy CSV data from S3 to a Redshift table, enter the following:
load_from_s3_to_redshift('redshift_ft', 'public.t32_category',
'S3_to_Redshift_3', 't34_category_csv.txt', 'csv quote as \'%\'');
load_from_s3_to_redshift('redshift_ft', 'public.t35_fixed_width',
'S3_to_Redshift_3', 't35', 'fixedwidth \'catid:5,catgroup:10,catname:9,catdesc:
40\'');
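Under the hood, load_from_s3_to_redshift issues a Redshift COPY command. The sketch below composes the equivalent statement text so you can see what the function's parameters map to; it is illustrative only, and credentials plus exact clause handling are managed by Data Services:

```python
def build_copy_statement(table, s3_path, options=""):
    """Compose a Redshift COPY command for a data file in an S3 bucket."""
    stmt = f"COPY {table} FROM '{s3_path}'"
    if options:
        stmt += " " + options  # e.g. "csv quote as '%'" or "delimiter '|'"
    return stmt + ";"

# Mirrors the CSV example above (bucket name is hypothetical).
build_copy_statement(
    "public.t32_category",
    "s3://my-bucket/t34_category_csv.txt",
    "csv quote as '%'",
)
```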
Blob data is unstructured data that is stored as objects in the cloud. Blob data is text or binary data such as
documents, media files, or application installation files.
Access Azure blob storage by creating an Azure cloud file location object.
Option descriptions for the Create New File Location window for the Azure Cloud Storage protocol.
Follow these steps to open the File Location editor to create a new file location object:
The following table lists the file location object descriptions for the Azure Cloud Storage protocol.
Option | Description
Account Name | Name for the Azure storage account in the Azure Portal.
Account Shared Key | Copy and paste the primary shared key from the Azure portal in the storage account information.
Note
For security, the software does not export the account shared key when you export a data flow or file location object that specifies Azure Cloud Storage as the protocol.
Web Service URL | Web services server URL that the data flow uses to access the Web server.
Connection Retry Count | Number of times the computer tries to create a connection with the remote server after a connection fails. After the specified number of retries, Data Services issues an error message and stops the job.
Batch size for uploading data, MB | Maximum size of a data block per request when transferring data files. The limit is 4 MB.
Caution
Accept the default setting unless you are an experienced user with an understanding of your network capacities in relation to bandwidth, network traffic, and network speed.
Batch size for downloading data, MB | Maximum size of a data range to be downloaded per request when transferring data files. The limit is 4 MB.
Caution
Accept the default setting unless you are an experienced user with an understanding of your network capacities in relation to bandwidth, network traffic, and network speed.
Number of threads | Number of upload and download threads for transferring data to Azure Cloud Storage. The default value is 1.
Remote Path Prefix | Optional. File path for the remote server, excluding the server name. You must have permission to this directory. If you leave this option blank, the software assumes that the remote path prefix is the user home directory used for FTP.
Example
You currently have a container for finance database files and want to create a virtual folder for each year to upload the blob files into. For 2016, you set the remote path prefix to 2016/. When you use this file location, all of the files upload into the virtual folder “2016”.
Local Directory | Path of your local server directory for the file upload or download. The directory must exist, must be located where the Job Server resides, and you must have appropriate permissions for it.
Proxy Host, Port, User Name, Password | Optional. Enter the proxy information if you use a proxy server.
The number of threads is the number of parallel uploaders or downloaders to be run simultaneously when you
upload or download blobs.
The Number of threads setting affects the efficiency of downloading and uploading blobs to or from Azure Cloud
Storage.
To determine the number of threads to set for the Azure file location object, base the number of threads on the number of logical cores in the processor that you use:
Number of logical cores | Number of threads
8 | 8
16 | 16
The software automatically re-adjusts the number of threads based on the blob size you are uploading or
downloading. For example, when you upload or download a small file, the software may adjust to use fewer
numbers of threads and use the block or range size you specified in the Batch size for uploading data, MB or Batch
size for downloading data, MB options.
When you upload a large file to an Azure container, the software may divide the file into the same number of lists of blocks as the setting you have for Number of threads in the file location object. For example, when Number of threads is set to 16, the software divides the file into 16 lists of blocks, and each thread uploads its blocks simultaneously to the Azure container.
When all the blocks are successfully uploaded, the software sends a list of commit blocks to the Azure Blob
Service to commit the new blob.
If there is an upload failure, the software issues an error message. If they already existed before the upload failure,
the blobs in the Azure container stay intact.
When you set the number of threads correctly, you may see a decrease in upload time for large files.
When you download a large file from the Azure container to your local storage, the software may divide the file into
the Number of threads setting in the file location object. For example, when the Number of threads is set to 16 for a
large file download to your local container, the software divides the blobs into 16 lists of ranges. Additionally, each
thread downloads the ranges simultaneously from the Azure container and also writes the ranges simultaneously
to your local storage.
When the software downloads a blob from an Azure container, it creates a temporary file to hold the ranges as the threads download them. When all of the ranges are successfully downloaded, the software deletes the existing file from your local storage if it existed, and renames the temporary file using the name of the deleted file.
If there is a download failure, the software issues an error message. The existing data in local storage stays intact if
it existed before the download failure.
When you set the number of threads correctly, you may see a decrease in download time.
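The division of a large blob into per-thread ranges can be sketched as contiguous byte spans, one list per thread. This is an illustration of the idea only, not the software's actual algorithm:

```python
def split_ranges(blob_size, num_threads):
    """Divide blob_size bytes into contiguous (start, end) ranges for the threads."""
    chunk = -(-blob_size // num_threads)  # ceiling division: bytes per range
    ranges = []
    start = 0
    while start < blob_size:
        end = min(start + chunk, blob_size)
        ranges.append((start, end))
        start = end
    return ranges
```

For a 64-byte blob and 16 threads, each thread gets a 4-byte range; the ranges always cover the whole blob with no overlap.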
Use a Google file location object to access data in your Google cloud account.
Option descriptions for the Create New File Location editor for Google Cloud Storage protocol.
The following table lists the file location object descriptions for the Google Cloud Storage protocol.
The default is the Google URL plus the name of the Web ac
cess service provider, OAuth 2.0.
Authentication Access Scope Enables access to specific user data. Cloud-platform is the de
fault.
Service Account Email Address Enter the e-mail address from your Google project. This e-mail
is the same as the service account e-mail address that you en
ter into the applicable Google BigQuery datastore.
Service Account Private Key Click the Browse icon and select the .p12 file that you cre
ated in your Google project and downloaded locally. Click
Open.
Service Account Signature Algorithm Accept the default: SHA256withRSA. This value is the algo
rithm type that the software uses to sign JSON Web Tokens.
The software uses this value, along with your service account
private key, to obtain an access token from the Authentication
Server.
Substitute Access Email Address Optional. Enter the substitute e-mail address from your Goo
gle BigQuery application datastore.
Web Service URL Web services server URL that the data flow uses to access the
Web server.
Compression Type Select None or gzip. The gzip type lets you upload gzip files to
Google Cloud Storage.
Connection Retry Count Number of times the computer tries to create a connection
with the remote server after a connection fails. After the speci
fied number of retries, Data Services issues an error notifica-
tion and stops the job.
Batch size for uploading data, MB Maximum size of a data block to be uploaded per request
when transferring data files. The limit is 5 TB.
Batch size for downloading data, MB Maximum size of a data block to be downloaded per request
when transferring data files. The limit is 5 TB.
Number of threads Number of upload and download threads for transferring data
to Google Cloud Storage.
The default is 1.
Bucket Bucket name, which is the name of the basic container that
holds your data.
Select a bucket name from the dropdown list. The list only
contains bucket names that exist in the datastore. To create a
new bucket, enter the name of the bucket here. If the bucket
does not exist in Google Cloud Storage, Google creates the
bucket when you perform an upload for the specified bucket.
Note
If you attempt to download the bucket and it does not exist
in Google, the software issues an error.
Remote Path Prefix: Optional. Folder structure of the Google Cloud Storage bucket. It should end with a forward slash (/). For example, test_folder1/folder2/. You must have permission to this directory. If you leave this option blank, the software assumes the home directory of your file transfer protocol.
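The trailing-slash requirement exists because the prefix is joined directly to the file name to form the object path in the bucket. A small sketch of that join, with example names only:

```python
# Sketch: build the full object path from the remote path prefix and a
# file name, enforcing the trailing slash described above. Example names.
def object_path(remote_path_prefix: str, file_name: str) -> str:
    if remote_path_prefix and not remote_path_prefix.endswith("/"):
        raise ValueError("Remote Path Prefix must end with a forward slash (/)")
    return remote_path_prefix + file_name

full_path = object_path("test_folder1/folder2/", "data.csv.gz")
print(full_path)  # → test_folder1/folder2/data.csv.gz
```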
Local Directory: The file path of the local server that you use for this file location object. The local server directory is located where the Job Server resides. You must have permission to this directory.
Note
If this option is blank, the software assumes the directory %DS_COMMON_DIR%/workspace as the default directory.
Proxy Host, Port, User Name, Password: Optional. Enter the proxy information if you use a proxy server.