
DEPLOYMENT GUIDE

Cloudian HyperStore
Microsoft SQL Server
Backup and Data Virtualization
Configuration & Deployment Guide
NOVEMBER 2022

©2022 Cloudian, Inc. All Rights Reserved.



Table of Contents

Solution Overview
Backup Integration
Data Virtualization
Configuration for Backup to S3
Prerequisites for the S3 endpoint
Supported features
Backup Configuration Steps
Testing Backup Configuration
Other options
Considerations
Recommendations
Configuration for Data Virtualization with S3
Prerequisites
Configuration Steps
Running Queries on External Data Source


Solution Overview
Microsoft SQL Server is a common enterprise database that typically runs on traditional IT infrastructure.
However, many enterprises are also adopting cloud solutions as they journey through their digital
transformation. At the same time, unstructured data keeps growing, leaving an abundance of unanalysed
data waiting to be tapped. Data can be a liability that needs to be protected, but that liability becomes an
asset when meaningful information is extracted through analytics. With these factors in mind, data
applications that typically run on traditional infrastructure need to evolve into a cloud environment while
gaining increased insight into ever-increasing data.
Microsoft SQL Server 2022 supports the S3 protocol both for backup to object storage and for querying data
lakes stored on S3 object storage.

SQL Server 2022 support for S3 object storage enables backup to URL through an S3 endpoint on Cloudian
HyperStore, which can run in a public cloud or in an on-premises cloud. Cloudian HyperStore is distributed
object storage, so backups from SQL Server 2022 are distributed across the cluster. When more capacity is
needed, the cluster can be scaled out with additional HyperStore nodes. HyperStore is a software solution that
can run on-premises on appliance models, or as VMs on-premises or in the public cloud.
SQL Server 2022 also provides data lake virtualization with S3 object storage: data in Parquet and other
supported file types residing on object storage can be queried with T-SQL. Data does not need to be moved
from the object storage, minimising ETL processing. A data lake on Cloudian HyperStore makes it possible to
process data that resides in on-premises or public clouds, and the data can be distributed and replicated to
different cloud deployments.

Backup Integration
To provide backup integration, SQL Server has been enhanced with a new S3 connector, which uses the S3
REST API to connect to an object storage platform such as Cloudian HyperStore. SQL Server 2022 extends the
existing BACKUP/RESTORE TO/FROM URL syntax by adding support for the new S3 connector using the REST
API. URLs pointing to S3-compatible resources are prefixed with s3:// to denote that the S3 connector is being
used. URLs beginning with s3:// will always assume that the underlying protocol will be https.


Data Virtualization
SQL Server 2022 uses PolyBase to query data from several external data sources, leaving the data in place and
reducing the need to duplicate or move data between platforms.
SQL Server 2022 supports various file formats, including Parquet, Delta, and CSV, as well as text files stored on
S3-compliant object storage such as Cloudian HyperStore.

PolyBase enables your SQL Server instance to query data with T-SQL directly from SQL Server, Oracle, Teradata, MongoDB,
Hadoop clusters, Cosmos DB, and S3-compatible object storage without separately installing client connection software.
You can also use the generic ODBC connector to connect to additional providers using third-party ODBC drivers. PolyBase
allows T-SQL queries to join the data from external sources to relational tables in an instance of SQL Server.
A key use case for data virtualization with the PolyBase feature is to allow the data to stay in its original location and format.
You can virtualize the external data through the SQL Server instance, so that it can be queried in place like any other table
in SQL Server. This process minimises the need for ETL processes for data movement. This data virtualization scenario is
possible with the use of PolyBase connectors.


Configuration for Backup to S3


Follow these steps to configure Microsoft SQL Server 2022 to use Cloudian HyperStore as the S3
backup target.

Prerequisites for the S3 endpoint


The S3 endpoint must be configured as follows:
• TLS must be configured. It is assumed that all connections are transmitted securely over HTTPS,
not HTTP. The endpoint is validated by a certificate installed on the SQL Server OS host.
• Credentials must be created on the S3-compatible object storage with the permissions required to
perform the operation. The user and password created on the storage layer are called the Access Key
ID and Secret Key ID. You will need both to authenticate against the S3 endpoint.
• At least one bucket must be configured on HyperStore. Buckets cannot be created or
configured from SQL Server 2022.

Supported features
High-level overview of the supported features for BACKUP and RESTORE:
• A single backup file can be up to 200,000 MiB per URL (with MAXTRANSFERSIZE set to 20 MB).
• Backups can be striped across a maximum of 64 URLs.
• Mirroring is supported, but only across URLs. Mirroring using both URL and DISK is not supported.
• Compression is supported and recommended.
• Encryption is supported.
• Restore from URL with S3-compatible object storage has no size limitation.
• When restoring a database, MAXTRANSFERSIZE is determined by the value assigned during the
backup phase.
• URLs can be specified either in virtual host or path style format.
• WITH CREDENTIAL is supported.
• REGION is supported and the default value is us-east-1.
• MAXTRANSFERSIZE will range from 5 MB to 20 MB. 10 MB is the default value for the S3
connector.
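
As an illustration of the two URL styles, the sketch below writes the same backup target in path-style and virtual-host-style form. The endpoint s3-us1.czero.cloudian.com, the bucket sql01 and the database StackOverflow2010 are the example values used later in this guide; the file names are placeholders. The virtual-host form is an assumption about the deployment: it only works if DNS and the TLS certificate cover <bucket>.<endpoint>, and if a credential whose name matches that URL prefix exists.

```sql
-- Path style: s3://<endpoint>:<port>/<bucket>/<backup_file>
BACKUP DATABASE StackOverflow2010
TO URL = 's3://s3-us1.czero.cloudian.com:443/sql01/path_style.bak'
WITH FORMAT, COMPRESSION, STATS = 10;

-- Virtual-host style: s3://<bucket>.<endpoint>/<backup_file>
-- Assumption: sql01.s3-us1.czero.cloudian.com resolves and is covered
-- by the endpoint certificate trusted on the SQL Server host.
BACKUP DATABASE StackOverflow2010
TO URL = 's3://sql01.s3-us1.czero.cloudian.com/virtual_host_style.bak'
WITH FORMAT, COMPRESSION, STATS = 10;
```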

Backup Configuration Steps


Follow these steps to configure backup with Cloudian HyperStore as the S3 backup target.
Before proceeding, ensure you have:
• Provisioned a bucket (or multiple buckets) for use with SQL Server 2022
• Noted your S3 endpoint, for example https://s3-region1.yourdomain.com
• Noted your S3 Access Key and Secret Key security credentials


Create S3 credentials on SQL Server 2022 by executing the following T-SQL statement. The credential
stores the S3 access key and secret key and provides the connection to Cloudian HyperStore.
The s3:// prefix in the credential name indicates that the S3 connector will be used.
CREATE CREDENTIAL [s3://<endpoint>:<port>]
WITH
IDENTITY = 'S3 Access Key',
SECRET = '<AccessKeyID>:<SecretKeyID>';
e.g.
CREATE CREDENTIAL [s3://s3-us1.czero.cloudian.com:443]
WITH
IDENTITY = 'S3 Access Key',
SECRET = '00a88b0fe3d5f5c71372:g9+7kH1HMstbRV4WehTBQdvDZseqiWY8NUKIj7dr';

Executing the T-SQL statement to create the S3 security credentials


The S3 Security Credentials visible in the SQL Server 2022 Object Explorer

Note:
• To restrict access, the credential name can also include the S3 bucket name. If it is included,
you will need to create a credential for each bucket you wish to use, in the following format:
[s3://<endpoint>:<port>/<bucket>]
• The IDENTITY value must be the literal string 'S3 Access Key'. It denotes the use of the S3 connector.
• For SECRET, specify the Access Key ID and Secret Key ID separated by a colon.
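
A bucket-scoped credential following the format in the note might look like the sketch below. The bucket name sql01 is the example bucket used in this guide, and the key values are placeholders to be replaced with your own. The second statement simply lists the server-level credentials whose names start with s3:// so the result can be checked.

```sql
-- Hypothetical credential restricted to the bucket sql01.
-- Replace the placeholders with your own Access Key ID and Secret Key ID.
CREATE CREDENTIAL [s3://s3-us1.czero.cloudian.com:443/sql01]
WITH
    IDENTITY = 'S3 Access Key',
    SECRET = '<AccessKeyID>:<SecretKeyID>';

-- Confirm which S3 credentials exist on the instance.
SELECT name, credential_identity, create_date
FROM sys.credentials
WHERE name LIKE 's3://%';
```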

Testing Backup Configuration


Follow these steps to test the backup configuration and to restore from the S3 backup target on Cloudian
HyperStore.
1. Test connectivity to HyperStore by backing up a database to a HyperStore bucket using a single URL. This
will limit the backup to a single object on the HyperStore system. Please note the bucket must already
be created on HyperStore.
In this example, a 10GB Database ‘StackOverflow2010’ is being backed up to a single file ‘stackbackup01.bak’
in bucket ‘sql01’ using compression.
BACKUP DATABASE StackOverflow2010
TO URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup01.bak'
WITH FORMAT -- overwrite
, STATS = 10
, COMPRESSION;

The T-SQL Statement to back up the ‘StackOverflow2010’ database to single HyperStore object


SQL Server output indicating the backup database operation was successfully processed.

Multipart upload (MPU) status in the Cloudian Management Console (CMC) during the backup (10 MB parts by default)

The contents of the target bucket with the single backup object.
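
Beyond checking the bucket contents, the backup object can also be inspected from SQL Server itself. The following is a minimal sketch that reuses the URL from step 1; it reads the backup header and then verifies that the backup set is readable, without restoring anything.

```sql
-- Read the backup header from the object written in step 1.
RESTORE HEADERONLY
FROM URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup01.bak';

-- Check that the backup set is complete and readable without restoring it.
RESTORE VERIFYONLY
FROM URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup01.bak'
WITH STATS = 10;
```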


2. The next statement splits the backup into multiple objects on HyperStore by specifying multiple
URLs. In this example, 4 URLs are used, which creates 4 objects in the HyperStore bucket sql01.
Each object is written using multipart upload (MPU) with a default part size of 10 MB.
BACKUP DATABASE StackOverflow2010
TO URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup01.bak'
, URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup02.bak'
, URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup03.bak'
, URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup04.bak'
WITH FORMAT -- overwrite
, STATS = 10
, COMPRESSION;

The T-SQL Statement to back up the ‘StackOverflow2010’ database into four HyperStore objects

Performance increases when using multiple URLs

Multipart upload (MPU) is for each object/URL using 10MB parts by default


The database backup is split into 4 HyperStore objects when viewed from the bucket.
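
Because a restore must name the same URLs that were used for the backup, it can help to look them up in the backup history. The query below is a sketch against the standard msdb history tables; it lists the URLs of each backup set for the example database along with the uncompressed and compressed sizes, which also gives a rough view of the compression ratio.

```sql
-- List the URLs (one row per stripe) and sizes of recent backups.
SELECT bs.database_name,
       bs.backup_finish_date,
       bmf.physical_device_name,                      -- the s3:// URL
       bs.backup_size / 1048576.0            AS backup_mb,
       bs.compressed_backup_size / 1048576.0 AS compressed_mb
FROM msdb.dbo.backupset AS bs
JOIN msdb.dbo.backupmediafamily AS bmf
    ON bmf.media_set_id = bs.media_set_id
WHERE bs.database_name = N'StackOverflow2010'
ORDER BY bs.backup_finish_date DESC, bmf.family_sequence_number;
```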

3. To restore the database run the following command. The restore statement must include the URLs that were
specified during the backup operation. In the following example, the restore is using 4 URLs.
RESTORE DATABASE StackOverflow2010
FROM URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup01.bak'
, URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup02.bak'
, URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup03.bak'
, URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stackbackup04.bak'
WITH REPLACE -- overwrite
, STATS = 10;

The T-SQL statement to restore the ‘StackOverflow2010’ database from four HyperStore objects


The database restoring in the SQL Object Explorer

The database restored successfully
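
The same result can be confirmed with T-SQL instead of Object Explorer. This sketch checks that the example database is ONLINE and shows the most recent entries in the msdb restore history.

```sql
-- The database should report ONLINE once the restore has completed.
SELECT name, state_desc, recovery_model_desc
FROM sys.databases
WHERE name = N'StackOverflow2010';

-- Most recent restore operations recorded for the database.
SELECT TOP (5) destination_database_name, restore_date, restore_type
FROM msdb.dbo.restorehistory
WHERE destination_database_name = N'StackOverflow2010'
ORDER BY restore_date DESC;
```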


Other options
As with traditional SQL backups, additional backup options are supported when writing to an S3 target.
Please refer to SQL Server 2022 documentation for further details on the following:
• SQL Server Encrypted Backup
• SQL Server MIRROR Backup *Note: Mirror backup is supported between URLs, for example S3 URLs.
Mirror backup between URL and DISK is not supported as of this writing.
• SQL Server COPY_ONLY backup
• SQL Server T-Log Backup
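
As a rough sketch of how some of these options look against an S3 target, the statements below reuse the example endpoint and database from this guide. The object names and the second bucket sql02 are hypothetical. The log backup assumes the database uses the FULL or BULK_LOGGED recovery model and already has a full backup, and mirrored backups may depend on the SQL Server edition in use, so check the SQL Server 2022 documentation before relying on them.

```sql
-- Copy-only full backup: does not affect the differential base.
BACKUP DATABASE StackOverflow2010
TO URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stack_copyonly.bak'
WITH COPY_ONLY, COMPRESSION, STATS = 10;

-- Transaction log backup (FULL or BULK_LOGGED recovery model assumed).
BACKUP LOG StackOverflow2010
TO URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stack_log01.trn'
WITH COMPRESSION, STATS = 10;

-- Mirrored backup between two S3 URLs (bucket sql02 is hypothetical).
BACKUP DATABASE StackOverflow2010
TO URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stack_mirror_a.bak'
MIRROR TO URL = 's3://s3-us1.czero.cloudian.com:443/sql02/stack_mirror_b.bak'
WITH FORMAT, COMPRESSION, STATS = 10;
```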

Considerations
• SQL Server 2022 splits backup files into multiple parts and uses multipart upload.
• Each backup file can be split into up to 10,000 parts. Each part can be between 5 MB and 20 MB.
• SQL Server 2022 uses a default value of 10 MB for the MAXTRANSFERSIZE parameter. This is
optimal for Cloudian HyperStore.
• The maximum size of a single backup file is 10,000 parts * MAXTRANSFERSIZE.
• You can control the part size using MAXTRANSFERSIZE in the T-SQL backup statement.
• You can stripe a larger backup across up to 64 URLs. Therefore, the maximum supported backup size is
10,000 parts * number of URLs * MAXTRANSFERSIZE.
• A single SQL Server backup file can be up to 200,000 MB per URL (20 MB * 10,000 parts).
• Note: You must specify COMPRESSION in the backup statement to change the MAXTRANSFERSIZE value.
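
The limits above can be worked through as simple arithmetic, and the sketch below also shows how a non-default part size would be requested. MAXTRANSFERSIZE is given in bytes, so 20 MB is 20 * 1024 * 1024 = 20971520. The object name is a placeholder, and this is only relevant for backups too large for the 10 MB default, which this guide otherwise recommends leaving unchanged.

```sql
-- Size limits implied by the figures above (values in MB).
SELECT CAST(10000 AS bigint) * 10      AS default_max_mb_per_url,  -- 100,000 MB
       CAST(10000 AS bigint) * 20      AS max_mb_per_url,          -- 200,000 MB
       CAST(10000 AS bigint) * 20 * 64 AS max_mb_across_64_urls;   -- 12,800,000 MB

-- Request 20 MB parts; COMPRESSION is required to change MAXTRANSFERSIZE.
BACKUP DATABASE StackOverflow2010
TO URL = 's3://s3-us1.czero.cloudian.com:443/sql01/stack_20mb_parts.bak'
WITH FORMAT, COMPRESSION, MAXTRANSFERSIZE = 20971520, STATS = 10;
```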

Recommendations
• When creating the S3 security credentials, use the S3 endpoint and port as opposed to setting credentials
for each bucket.
• Use multiple URLs to improve performance
- 5-8 URLs have been optimal in initial testing
• Do not modify MAXTRANSFERSIZE; the 10 MB default is optimal for HyperStore.
• Use compression as recommended by Microsoft. In initial testing a 10GB test database compressed to
2.5GB (4:1).
- No performance gains on restores when not using compression
- Performance is optimal with compression enabled
• MPU is used by default for each object/URL; the default 10 MB part size is optimal for HyperStore.
• HyperStore Storage Policy recommendations: because the object sizes produced with the default settings
are typically optimal for HyperStore, a variety of Storage Policies can be used.


Configuration for Data Virtualization with S3


Follow these steps to configure Microsoft SQL Server 2022 to use Cloudian HyperStore as the S3
data source in data virtualization.

Prerequisites
To use the S3-compatible object storage integration features, you will need the following tools and
resources:
• The PolyBase feature for SQL Server, installed.
• SQL Server Management Studio (SSMS), installed.
• S3-compatible storage.
• An S3 bucket. Buckets cannot be created or configured from SQL Server.
• A configured user whose Access Key ID and Secret Key ID are known to you. You will need both to
authenticate against the S3 object storage endpoint.
• ListBucket permission on the S3 user for browse privileges. **
• ReadOnly permission on the S3 user for read privileges. **
• WriteOnly permission on the S3 user for write privileges. **
• TLS configured. It is assumed that all connections are transmitted securely over HTTPS, not HTTP.
The endpoint is validated by a certificate installed on the SQL Server OS host.
** ListBucket, ReadOnly and WriteOnly permissions only need to be configured if you are
using an IAM user. A regular HyperStore user account has full permissions.

Configuration Steps
Follow these steps to configure an S3 external data source located on Cloudian HyperStore.
1. If not already installed, install PolyBase via the SQL Server 2022 Installation Center.


2. Enable PolyBase by running the following T-SQL statement


EXEC sp_configure @configname = 'polybase enabled', @configvalue = 1;
RECONFIGURE;
EXEC sp_configure @configname = 'polybase enabled';
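
To double-check the result of steps 1 and 2, the following sketch queries the instance directly. It reports whether the PolyBase feature is installed and whether the 'polybase enabled' setting is in effect.

```sql
-- 1 means the PolyBase feature is installed on this instance.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

-- value_in_use should be 1 after RECONFIGURE.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'polybase enabled';
```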


3. Create a master key. In this example a simple password is used; for other options, see:
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-master-key-transact-sql?view=sql-server-ver16
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '123Cloudian456!'
GO

Running Queries on External Data Source


The following steps provide an example of running queries on an S3 external data source located on Cloudian
HyperStore.
1. Create a new database for testing the external data source. In this example, a database called ‘taxi’ is created.
CREATE DATABASE taxi
GO

2. Now switch to the newly created database


USE taxi
GO
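
A database scoped credential is protected by the master key of the database in which it is created. If the master key from the earlier configuration step was created while connected to a different database, a guarded check such as the sketch below can be run in the new database before continuing. The password shown is a placeholder, not a recommendation.

```sql
USE taxi;
GO
-- Create a database master key in taxi only if one does not exist yet.
IF NOT EXISTS (SELECT 1 FROM sys.symmetric_keys
               WHERE name = '##MS_DatabaseMasterKey##')
BEGIN
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';
END
GO
```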

3. The next step is to create a database scoped credential. A database scoped credential is not mapped to a
server login or database user. The credential is used by the database to access the external location any time
the database performs an operation that requires access.
The database scoped credential contains the S3 Access Key and Secret Key for your HyperStore or
IAM user.
In this example the scoped credential is named ‘hscredentials’.
USE taxi;
GO
IF NOT EXISTS(SELECT * FROM sys.database_scoped_credentials WHERE name = 'hscredentials')
BEGIN
CREATE DATABASE SCOPED CREDENTIAL hscredentials
WITH IDENTITY = 'S3 Access Key',
SECRET = '00a88b0fe3d5f5c71372:g9+7kH1HMstbRV4WehTBQdvDZseqiWY8NUKIj7dr';
END
GO

4. Create an external data source that points to your Cloudian HyperStore S3 endpoint and uses the credential
name specified in the step above. In the example below, the data source name is ‘hyperstoredatalake’ and
the credential name is ‘hscredentials’
CREATE EXTERNAL DATA SOURCE hyperstoredatalake
WITH
( LOCATION = 's3://s3-us1.czero.cloudian.com:443/'
, CREDENTIAL = hscredentials
);
GO
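
Before loading data, the two objects created so far can be checked from the catalog views. This sketch assumes the names used in the steps above (hscredentials and hyperstoredatalake) and simply lists them with their key properties.

```sql
USE taxi;
GO
-- The scoped credential should show 'S3 Access Key' as its identity.
SELECT name, credential_identity
FROM sys.database_scoped_credentials
WHERE name = 'hscredentials';

-- The data source should point at the HyperStore S3 endpoint.
SELECT name, location, credential_id
FROM sys.external_data_sources
WHERE name = 'hyperstoredatalake';
```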

5. The next step is to load some data into a bucket that can be used by SQL Server for testing. In this example,
New York taxi data stored in Parquet format is loaded into a bucket named ‘data’. Example Parquet files
can be downloaded from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page


6. The next command selects all rows from the Parquet file stored in the S3 bucket. Note that the bucket and
object name are specified after the BULK keyword, and the DATA_SOURCE option specifies the external
data source created in the step above, which in turn uses the database scoped credential created in the
steps above.
SELECT *
FROM OPENROWSET
( BULK '/data/yellow.parquet'
, FORMAT = 'PARQUET'
, DATA_SOURCE = 'hyperstoredatalake'
) AS [cc];

In the example above ~2.5 million rows are returned in 30 seconds from the S3 bucket.
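
When exploring an unfamiliar Parquet file, it is often cheaper to inspect its schema and a handful of rows before selecting everything. The sketch below reuses the same object path and data source as the query above. It first asks SQL Server to describe the result set, then returns a small sample and a row count.

```sql
-- Show the column names and types SQL Server infers from the Parquet file.
EXEC sp_describe_first_result_set N'
SELECT *
FROM OPENROWSET
( BULK ''/data/yellow.parquet''
, FORMAT = ''PARQUET''
, DATA_SOURCE = ''hyperstoredatalake''
) AS [cc];';

-- Return a small sample instead of every row.
SELECT TOP (10) *
FROM OPENROWSET
( BULK '/data/yellow.parquet'
, FORMAT = 'PARQUET'
, DATA_SOURCE = 'hyperstoredatalake'
) AS [cc];

-- Count the rows in the file.
SELECT COUNT_BIG(*) AS row_count
FROM OPENROWSET
( BULK '/data/yellow.parquet'
, FORMAT = 'PARQUET'
, DATA_SOURCE = 'hyperstoredatalake'
) AS [cc];
```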

177 Bovet Road, Suite 450, San Mateo, California 94402 ©2022 Cloudian, Inc. All Rights Reserved.
650.227.2380 | info@cloudian.com | cloudian.com DG-1222
