
Contents

Introduction
About the exam
About the author
About the method: PQRST
Chapter 1: Manage data with Transact-SQL
Chapter overview
Requirements
Exam objectives
Create Transact-SQL SELECT queries
Identify proper SELECT query structure
Write specific queries to satisfy business requirements
Construct results from multiple queries using set operators
Distinguish between UNION and UNION ALL behavior
Identify the query that would return expected results based on provided table structure and/or data
Query multiple tables by using joins
Write queries with join statements based on provided tables, data, and requirements
Determine proper usage of INNER JOIN, LEFT/RIGHT/FULL OUTER JOIN, and CROSS JOIN
Construct multiple JOIN operators using AND and OR
Determine the correct results when presented with multi-table SELECT statements and source data
Write queries with NULLs on joins
Implement functions and aggregate data
Construct queries using scalar-valued and table-valued functions
Identify the impact of function usage to query performance and WHERE clause sargability
Identify the differences between deterministic and non-deterministic functions
Use built-in aggregate functions
Use arithmetic functions, date-related functions, and system functions
String functions
Modify data
Write INSERT, UPDATE, and DELETE statements
Determine which statements can be used to load data to a table based on its structure and constraints
Construct Data Manipulation Language (DML) statements using the OUTPUT statement
Determine the results of Data Definition Language (DDL) statements on supplied tables and data
Summary
Questions
Answers
Chapter 2: Query data with advanced Transact-SQL components
Chapter overview
Exam objectives
Query data by using subqueries and APPLY
Determine the results of queries using subqueries and table joins
Evaluate performance differences between table joins and correlated subqueries based on provided data and query plans
Distinguish between the use of CROSS APPLY and OUTER APPLY
Write APPLY statements that return a given data set based on supplied data
Query data by using table expressions
Identify basic components of table expressions
Construct recursive table expressions to meet business requirements
Define usage differences between table expressions and temporary tables
Group and pivot data by using queries
Construct complex GROUP BY clauses using GROUPING SETS, and CUBE
Use windowing functions to group and rank the results of a query
Distinguish between using windowing functions and GROUP BY
Construct PIVOT and UNPIVOT statements to return desired results based on supplied data
Determine the impact of NULL values in PIVOT and UNPIVOT queries
Query temporal data and non-relational data
Query historic data by using temporal tables
Query and output JSON data
Query and output XML data
Summary
Questions
Answers
Chapter 3: Program databases by using Transact-SQL
Chapter overview
Exam objectives
Create database programmability objects by using Transact-SQL
Create stored procedures, table-valued and scalar-valued user-defined functions, and views;
Create indexed views
Implement input and output parameters in stored procedures
Identify whether to use scalar-valued or table-valued functions
Distinguish between deterministic and non-deterministic functions
Implement error handling and transactions
Determine results of Data Definition Language (DDL) statements based on transaction control statements
Implement TRY…CATCH error handling with Transact-SQL
Generate error messages with THROW and RAISERROR
Implement transaction control in conjunction with error handling in stored procedures
Implement data types and NULLs
Evaluate results of data type conversions
Determine proper data types for given data elements or table columns
Identify locations of implicit data type conversions in queries
Determine the correct results of joins and functions in the presence of NULL values
Identify proper usage of ISNULL and COALESCE functions
Summary
Questions
Answers
Copyright © 2018 by R.B. van den Berg
All rights reserved. This book or any portion thereof
may not be reproduced or used in any manner whatsoever
without the express written permission of the publisher
except for the use of brief quotations in a book review.
Introduction
This book is about the Microsoft Certified Solutions Associate exam 70-761: Querying data with
Transact-SQL. This book will help prepare you for the exam.

The intended audience for this book is people who already work with SQL Server and have experience querying databases. That means that, if you are new to databases, this exam, and therefore this book, is probably not for you. It might be wiser to start with an entry-level exam, such as the Microsoft Technology Associate exam on Database Fundamentals. For this, we recommend the book MTA 98-364: Database Fundamentals by the same author.

After reading this book, trying the code examples for yourself and practicing the questions, you should
also be able to pass the Microsoft exam 70-761: Querying data with Transact-SQL. This book is not
intended as a kind of cram session. Our intention is to teach you the necessary skills and
understanding; and as we go along, you’ll gain enough knowledge to pass the exam. So if you do not
plan on taking this exam, but do want to learn to query SQL databases, this is still the right book for
you. If you just want to pass the exam and have no interest in actually learning the material, this book is not for you; even if we could tell you just the facts you need to know to pass the exam, and nothing more, we wouldn’t. We want you to understand the why as well as the how. We will, however, use the objectives of this exam as a guideline throughout this book.

This book is divided into 3 chapters. In chapter 1, we’ll begin by laying the foundation: inserting data
into tables, changing data in the tables and removing it. We’ll also teach you how to filter result sets,
and combine result sets using joins and set operators. And we’ll cover a lot of system functions.
In chapter 2, we’ll build on that by covering more advanced concepts, such as windowing functions,
temporal tables and grouping data. Also, we’ll discuss how to work with non-relational data in JSON
and XML format.
In chapter 3, we’ll teach you how to program databases, by creating your own functions, stored
procedures and views. Some of these are concepts we’ll touch upon in the first two chapters, and
revisit in chapter 3.
This is something we’ll have to do quite a lot. We try to teach you every building block one by one, but they are all related, so at times we simply can’t avoid referring to a topic we haven’t explained yet; we’ll have to jump ahead a little, or jump back, to explain how the current building block works in conjunction with the others. This is one reason that real-life experience with SQL Server is so important; if everything is completely new to you, the amount of information will probably be overwhelming.

About the exam


The Microsoft exam 70-761: Querying data with Transact-SQL is one of a series of mid-level
exams on Microsoft technologies. The exam consists of several multiple choice questions, and
questions in other formats. If you pass this exam, plus the 70-762 Developing SQL Databases exam, you can call yourself Microsoft Certified Solutions Associate: Database Development (MCSA). For
more details on the exam, and how to schedule the exam, go to the Microsoft web site. At the time of
writing this book, this was the direct link to the web page:
https://www.microsoft.com/en-us/learning/exam-70-761.aspx
Here, Microsoft makes two important statements we want to bring to your attention. The first is: “This
objective may include but is not limited to…”. This phrase is used for most objectives. The second
is: “This preparation guide is subject to change at any time without prior notice and at the sole
discretion of Microsoft”. So you won’t know up front exactly what might be asked on the exam. For you as a student, this means that you must practice beyond the questions and exercises in this book, and review the list of exam objectives on this web site, as it may have changed since the writing of this book.

This is another reason that, if you have no real-life experience, the exam will be a challenge, even after reading this book. The exam is not limited to the list of exam objectives, so the questions could involve anything related to SQL Server.

As stated above, this exam is the first of two Microsoft exams needed to achieve the title of Microsoft
Certified Solutions Associate: Database Development. The next step after this would be to either try
for one of the two other SQL MCSA titles, Database Administration or Business Intelligence, or
MCSE (Microsoft Certified Solutions Expert) on Data Management and Analytics.

About the author


Robert is an independent IT consultant with over 20 years of IT experience. Starting as a system engineer, he was introduced to a wide variety of hardware and software, among them SQL Server (at that time: version 7.0). He found that he liked working with databases; both the complexity of the technology and the importance of data to the business processes appealed to him, and still do.
That’s why he decided to specialize in SQL Server. Since then, he has held a number of database
related roles: consultant, engineer, architect and database administrator.

About the method: PQRST


When studying this book, we recommend using the PQRST learning method: Preview, Question,
Read, Summarize, Test. This method consists of the following steps:
* Preview. At the start of each chapter, flip through the pages to get an idea of the topics that will be
covered. To support this, we’ll give a chapter overview and mention some key concepts that will be
covered.
* Question. At the beginning of each chapter, think of some questions you might have about these topics: “In what situations can I use this? Why does it work this way? Why not do it like that?”. Maybe you have encountered relevant situations in your past. We advise you to actually write these questions down.
* Read. This should be obvious.
* Summarize. After reading each chapter, we’ll give a summary. It is a good idea to make your own
summary before reading ours, and then compare notes.
* Test. See if you can answer the questions you asked yourself. That’s why we recommend writing
them down before you start reading.

Just reading is not the best way to memorize material. Actually formulating your own questions about
the material beforehand, and seeing if you can answer your own questions afterwards, will make you
a much more active participant. This will help you remember the material. If you’re unable to answer your questions, or if you have additional questions, look them up online, or post them on a SQL Server-related web site.
Also, we suggest following along with every code example we give you. You should be able to come
up with the same result. And get creative: see if you can make some modifications to the code, to
further your understanding.

If you don’t like typing over all the information, you can find most of the code on the accompanying
web site: http://www.rbvandenberg.com/books/

Of course, you’ll also have to practice writing SQL code. A lot. So let’s get started.
Chapter 1: Manage data with Transact-SQL
Chapter overview

In this chapter, we’ll explain all the components of a SELECT query, and how to apply filters to
achieve the desired result. After that, we’ll show you how to combine different result sets using all
sorts of set operators and join operators. We’ll also demonstrate the use of system functions; there is a
very long list of these to cover. Finally, we’ll show you how to modify data in tables, using INSERT, UPDATE and DELETE.

Requirements
* SQL Server trial software (available for download on http://www.microsoft.com/en-us/download/default.aspx);
* a PC or a server that is powerful enough to install SQL Server on. Microsoft states the minimum requirements as:
    * Supported operating systems: Windows Server 2003 Service Pack 2, Windows Server 2008, Windows Vista, Windows Vista Service Pack 1, Windows XP Service Pack 2, Windows XP Service Pack 3
    * 32-bit systems: Computer with Intel or compatible 1 GHz or faster processor (2 GHz or faster is recommended.)
    * 64-bit systems: 1.4 GHz or higher processor
    * Minimum of 512 MB of RAM (2 GB or more is recommended.)
    * 2.2 GB of available hard disk space
* WideWorldImporters sample database (available for download on http://msftdbprodsamples.codeplex.com/). A lot of examples will use this database, so in order to follow along, you’ll need access to a copy of this database as well.

Exam objectives
For the exam, the relevant objectives are:

Manage data with Transact-SQL (40–45%)


Create Transact-SQL SELECT queries
Identify proper SELECT query structure, write specific queries to satisfy business
requirements, construct results from multiple queries using set operators, distinguish
between UNION and UNION ALL behavior, identify the query that would return
expected results based on provided table structure and/or data
Query multiple tables by using joins
Write queries with join statements based on provided tables, data, and requirements;
determine proper usage of INNER JOIN, LEFT/RIGHT/FULL OUTER JOIN, and
CROSS JOIN; construct multiple JOIN operators using AND and OR; determine the
correct results when presented with multi-table SELECT statements and source data;
write queries with NULLs on joins
Implement functions and aggregate data
Construct queries using scalar-valued and table-valued functions; identify the impact of
function usage to query performance and WHERE clause sargability; identify the
differences between deterministic and non-deterministic functions; use built-in
aggregate functions; use arithmetic functions, date-related functions, and system
functions
Modify data
Write INSERT, UPDATE, and DELETE statements; determine which statements can be
used to load data to a table based on its structure and constraints; construct Data
Manipulation Language (DML) statements using the OUTPUT statement; determine the
results of Data Definition Language (DDL) statements on supplied tables and data
Create Transact-SQL SELECT queries
At this point in your career, you should already know how to write a basic select statement. The
prerequisite knowledge of this exam, and therefore this book, is the MTA Database Fundamentals
exam, or equivalent knowledge through working experience. However, we’ll cover the basics anyway, just to make sure nothing is missed.
In this section, we’ll show you how to write a proper SELECT statement, how to use SET operators,
and the difference between UNION and UNION ALL.

Identify proper SELECT query structure


This is what a basic SELECT statement looks like:

SELECT [column A] AS 'Column alias'
     , [column B] AS 'Column alias'
     , ...
FROM [server].[database].[schema].[tableA] AS t
WHERE [filter criteria]
ORDER BY [column A], [column B];

If you’re already familiar with all parts of this statement, you can proceed to the next section. If not,
follow along. Quite often in this section, we’ll take a look forward, and mention that we’ll go into
further detail about a topic later on.

In order, we’ll cover:


* The four-part table name;
* The table alias;
* The list of columns;
* The column alias;
* The filter criteria in the WHERE clause;
* The ORDER BY clause;
* The semicolon at the end of the statement.

In its shortest form, a SELECT statement becomes: SELECT * FROM tableA. We’ll start from here.

FROM clause
WideWorldImporters is our test database, provided by Microsoft. This test database contains sales information for a fictional company. As such, it contains a table called Orders, which holds a record for every order. To retrieve these records, use the following statement:

SELECT *
FROM Orders

Even though this table actually exists, this statement will probably result in an error message:

Msg 208, Level 16, State 1, Line 1


Invalid object name 'Orders'.
It could be that you lack proper permissions, but if you’re a sysadmin on this instance of SQL Server (which is a requirement for this course), this error is probably caused by something else. As mentioned above, a complete FROM clause should actually provide a four-part name, not just the table name: [server].[database].[schema].[tableA]. The table name is essential; for all the other parts of the four-part name, SQL can substitute a default, but those defaults might not be the values you need, which is what goes wrong here. So let’s provide the full four-part name, and run the statement again.
In order to do that, we need to know the name of your server. If you’re unsure about that, you can
check it using the following statement:

SELECT name
FROM sys.servers
WHERE server_id = 0

In my case, the name of my server is DESKTOP-LO5S40T. Therefore, the complete SELECT statement on my laptop is:

SELECT *
FROM [DESKTOP-LO5S40T].WideWorldImporters.sales.Orders

Obviously, on your computer, you’ll have to substitute the name of your computer.

This statement actually works; it will return over 73 thousand rows.


Now that we’ve got a working SELECT statement, let’s break it down. First, the square brackets ( [ ] ) surrounding the server name. The name of my laptop contains a “-”, which is considered a special character. Therefore, this name needs to be surrounded (delimited) by square brackets. Without the square brackets, SQL would raise a syntax error. The official term for the name of an object, such as a server name, database name, schema name or table name, is an identifier; that makes my server name a bracketed identifier. For identifiers without special characters, the square brackets are optional; therefore, the previous statement is identical to this one:

SELECT *
FROM [DESKTOP-LO5S40T].[WideWorldImporters].[Sales].[Orders]

For more information about identifiers, see: https://technet.microsoft.com/en-us/library/ms175874(v=sql.105).aspx ; in that article, you’ll also find a link to the concept of delimited identifiers.
A word about formatting: formatting is not required. You can put the entire SELECT statement on a single line; it doesn’t matter to SQL Server. But it does matter to other human beings reading the code. So pay a little attention to the use of capitals and where you put line endings, spaces, tabs etc. This is especially true if you work in a team, in which case you might want to agree upon a formatting standard with your colleagues. Better yet: use a formatting tool such as Red Gate’s SQL Prompt, or a web site such as http://www.dpriver.com/pp/sqlformat.htm .

Back to the four part name. The server name is an identifier you will not be using a lot for T-SQL
queries. Most of the time you’ll query the SQL Server instance you’re connected to, so if you do not
specify a server name, SQL will default to the local instance. Usually, you only provide the server
name if you’re querying another instance than the one you are connected to.
In order to reference another SQL Server instance in a T-SQL query, you’ll have to do a little more
than just provide that server name; before you can do that, you have to set up a connection to that
instance, and configure security. You can do this through something called a linked server. This is
outside the scope of this exam; just remember that, if you set it up properly, it is possible to reference
another SQL Server instance in a query, and that, if you omit the server name from the four-part table
name, SQL will default to the instance you’re connected to.
The next part of the four-part name is the database name. Like the server name, if you omit the
database name, SQL will use the database you’re connected to. You can change the database you’re
connected to using the statement USE [database_name], such as:

USE master

You can also use the database name in the four-part name. That way, you can reference tables in two
different databases in the same SELECT statement. Same as with querying another instance, there are
some security matters that need to be taken care of in order for this to work, but this is the general
idea of a SELECT statement referencing tables in two different databases:

SELECT *
FROM [databaseA].[schema].[tableA]
, [databaseB].[schema].[tableB]

Later on, we’ll explain how to properly join two tables; for now, we’ll just leave it at this general
idea, and an example that will actually work. We only have one user database to work with, so we’ll use two system databases, master and msdb, for this example:

SELECT d.name
,MAX(b.backup_finish_date) AS 'Most recent full backup'
FROM master.sys.databases d
LEFT OUTER JOIN msdb..backupset b ON d.name = b.database_name
WHERE d.name <> 'tempdb'
AND (b.type = 'D' OR b.type IS NULL)
GROUP BY d.name
ORDER BY d.name;

Don’t worry, we’ll cover all parts of this statement in the remainder of this book. In this example, you
see how to use the database name in a select statement. If you execute this statement in the master
database, you can omit that name from the statement, and the same applies if you execute this
statement in the msdb database. By the way: this statement shows the time of the most recent full
backup for each database except system database tempdb. Now you know how to use the database
name in the four-part name; let’s move on to the third part: the schema name.

First, what is a schema? A schema is a collection of database objects. These objects can be tables,
but also some objects we’ll come across later, such as views and stored procedures. Schemas can be
used for the purpose of security, or to group related objects.
We’ve already seen some schemas: sys, dbo and Sales. The schema sys is reserved for system
objects; the schema dbo is the schema that is created by default for each database for user objects,
and the schema Sales is created in the WideWorldImporters database (along with some other
schemas) to group sales tables.
The name of an object has to be unique within the schema. We’ve seen that there is a table called Orders in the Sales schema, so that will prevent us from making another Orders table in the Sales schema, but it does not prevent us from making an Orders table in the dbo schema. Given proper permissions, the following will work:

CREATE TABLE dbo.Orders (id int)

SELECT *
FROM Orders

DROP TABLE dbo.Orders

You can ignore the details of the CREATE TABLE statement, as that is a subject for a different exam (the 70-762 exam), and just focus on the big picture: we created an Orders table in the dbo schema, queried it (which returned zero records) and then deleted the table again.

As with the server name and the database name, the schema name is optional in a SELECT query; if
you omit this, SQL will use your default schema. Every user has a default schema. If you’re logged in
as sysadmin, your default schema will probably be dbo. By the way, you can check this by looking at
the properties of the user with which you’re connected to the database, or by using the following
code:

SELECT name
,default_schema_name
FROM sys.database_principals
WHERE principal_id = user_id();

But like we said: if you’re a sysadmin, your default schema is probably dbo.

Above, we ran this code twice:

SELECT *
FROM Orders

If your default schema is dbo, this explains why the query didn’t work the first time: we omitted the
schema name, so SQL substituted your default schema, and it didn’t work because there was no table
Orders in the dbo schema; the second time, we had just created the table Orders in the dbo schema,
and the statement worked just fine.
There is no rule that states that, if you specify the database name, you also have to specify the schema name, so in our query to find the most recent backup, we have used msdb..backupset for the table name (leaving out the schema name between the two periods).
A final word about the schema in a SELECT statement: it is considered best practice to specify the
schema name.

Last in the four-part name is the name of the object that holds the data. This part is not optional. It is usually a table, but it could also be a view, a common table expression or a subquery. These are all concepts we’ll cover later in this course.
of the exam requirements, but interesting nonetheless: the synonym. A synonym is just a different name
for an object. For example, if for whatever reason, you’d rather query from dbo.Orders than from
Sales.Orders, you could create the following synonym (provided you’ve dropped the table
dbo.Orders, that is):

CREATE SYNONYM [dbo].[Orders] FOR [WideWorldImporters].[Sales].[Orders]


SELECT *
FROM dbo.Orders

In this case, only the schema name of the synonym differs from the schema of the actual table, but you
could have used a different name for the table as well; just remember that the actual data is still in the
Sales.Orders table; dbo.Orders is just an alias. Synonyms are not used a lot, but it is a useful tool to
have in your toolbox. Before we move on, let’s clean it up:

DROP SYNONYM [dbo].[Orders]

This is all you need to know about the four-part name; now let’s move on to the other part of the FROM clause we are going to cover in this section: the table alias. The table alias is usually optional, but it does make for better readability when selecting data from multiple tables. Selecting from multiple tables is something we’ll cover later on, but it makes more sense to already cover the table alias here.
Let’s look back at our statement that retrieves the most recent backup. The first column we select is d.name. We’ve used the letter “d” as a table alias. Both tables have a column called name; just remove the “d.” and IntelliSense (and SQL Server itself, when you execute the query) will inform you that the column name is ambiguous.

So in order to let SQL know which one you mean, you’ll have to specify that. You could, of course,
just repeat the entire three part table name:

SELECT master.sys.databases.name

This will work just fine (if you remove the table alias, that is, because you can’t use a table alias in
the FROM clause and use the actual table name in the SELECT clause), but it makes the code harder
to read, and therefore, harder to maintain. The best solution is the table alias.
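For example, with a short alias for each table, the ambiguous name column can be qualified without repeating the full table names (this is just a trimmed-down version of the backup query above):

SELECT d.name
     , b.database_name
FROM master.sys.databases d
LEFT OUTER JOIN msdb..backupset b ON d.name = b.database_name;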
As stated, it is almost always optional; an exception is when you select from a subquery instead of
from a table; in that case, the alias is mandatory. Try removing the table alias in the following
statement:

SELECT o.*
FROM (SELECT * FROM Sales.Orders) o

You can safely remove the table alias from the SELECT clause (by changing o.* into *), but you cannot remove it from the FROM clause without getting a syntax error, because the subquery requires an alias.
By the way: in the last example, we used the result of a subquery as if it were an actual table. This is
called a derived table. A derived table has to follow the rules of an actual table. Every column needs
a unique name, and the derived table itself also needs a name. We’ll come back to this example in
chapter 2, when we talk about subqueries and when we talk about common table expressions.

We’ve now covered the four-part table name and the table alias, and this is all we’re going to cover at this point for the FROM clause, so let’s move on to the SELECT clause.

SELECT clause
In the SELECT clause, you specify the columns you want your query to return. Here, we want to
mention the following:
* If you want to see all columns, you can use SELECT *, but it is a best practice to avoid doing this
and only list the columns you need (more on this later on in the next section);
* “Attribute” is a different term for a column;
* If you use a table alias in the FROM clause, you can use the table alias in the SELECT clause (and
other parts of the statement) as well, and as we saw earlier, this is mandatory if two or more columns
have the same name (which can only happen when joining tables, because the name of a column has to
be unique within the table);
* If you want the column name of the result set to differ from the column name of the table, you can
use a column alias. For example, if you’d rather see “Order number” than “OrderID” in your result
set, you can achieve this with either of the following statements:

SELECT OrderID AS 'Order number'
FROM Sales.Orders

Or:

SELECT 'Order number' = OrderID
FROM Sales.Orders

* You do not have to return the data as it is stored in the column. Instead, if you want to, you can use a
function on the column. This will not change the data in the table, but it will display the data
differently in the result set. For example, the table People in the Application schema contains a
column called FullName. If you want to select this column and display the data in capital letters, you
could use the function UPPER:
SELECT UPPER(FullName)
FROM [Application].[People]

We’ll cover functions further on in more detail. Note that, when you use a function on a column, the
column name of the result set will be “(No column name)”, unless you specify a column alias.
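For example, to give the result of the UPPER function a proper column name, you could write:

SELECT UPPER(FullName) AS 'Full name in capitals'
FROM [Application].[People]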
At this moment, this is all you need to know about the SELECT clause, so let’s move on to the
WHERE clause.

WHERE clause
The WHERE clause is used to filter the records which the SELECT statement returns. Without a
WHERE clause, all records will be returned. You can use a single filter, or combine several filters
using the operators AND and OR. The SELECT statement will only return those records that meet the
filter criteria (if any). We’ll first show you how to write a single filter condition, then we’ll show you
how to combine several filters.
In a filter, also known as a predicate, you can compare attributes, variables, constants and functions.
Later on, we’ll also see that you can use a subquery.

We’ll start with a lot of different ways to compare the value of an attribute to a constant. The easiest
example is one of a single filter, comparing an attribute to a constant:

SELECT *
FROM [Application].[People]
WHERE PersonID = 1

This will only return the record with a PersonID of 1.

Note that we used SELECT * to return all columns, even though earlier we said that this is not
recommended. For most examples, we’ll use SELECT * just to keep the examples concise, and focus
on the new information we’re trying to demonstrate.

The next query will return all records from the People table whose full name is Kayla Woodcock:

SELECT *
FROM [Application].[People]
WHERE FullName = 'Kayla Woodcock'

Note that, when you filter on FullName, there are single quotes around the name, but that these quotes were not used when filtering on PersonID. The reason for this: quotes are required for string values, but not for numeric values.

A special kind of filter is a filter for unknown values. Dealing correctly with unknown values is very
important in relational databases. In SQL, an unknown value is stored as NULL. You might expect to
find people with an unknown email address using the following query:
SELECT *
FROM [Application].[People]
WHERE EmailAddress = NULL

That, however, does not work. The query will return 0 rows, even though there are records for people
without an email address. This is for a very important reason: NULL does not equal NULL. NULL
means “unknown”, not “absent”. In the case of the People table: an email address of NULL does not
mean that a person does not have an email address, it just means that you do not know the email
address; just because you don’t know a person’s email address, that does not mean that it is the same
as the email address of another person whose email address you do not know; it also doesn’t mean
that it is not the same. They might both not have email addresses, they might share an email address;
you just don’t know. That is why, for unknown values, you do not use the equal sign, but the keyword
IS :

SELECT *
FROM [Application].[People]
WHERE EmailAddress IS NULL

You are not restricted to using IS or the equal sign. In fact, there is a long list of operators, such as:
* LIKE
* IN
* BETWEEN
* a large number of comparison operators, such as > (greater than), < (less than), >= (greater than or equal), <= (less than or equal) and <> (not equal).

The comparison operators speak for themselves; the others, we’ll cover in more detail. For a complete list of all operators in T-SQL, see: https://docs.microsoft.com/en-us/sql/t-sql/language-elements/operators-transact-sql

The keyword LIKE enables you to search for a pattern instead of an exact match. LIKE is used in
conjunction with a wildcard to replace the unknown part, or parts, of the string. This query returns all
records for persons whose name starts with Daniel:

SELECT *
FROM [Application].[People]
WHERE FullName LIKE 'Daniel%'

The % sign is the wildcard; it matches any character and any number of characters (zero, one or more). There is another wildcard, the underscore (_); this also matches any character, but exactly one character, as we can demonstrate with the following example:

SELECT *
FROM [Application].[People]
WHERE PreferredName LIKE 'Isabell_'
This query will return persons whose preferred name is Isabella, Isabelle and ‘Isabell ’ (this last
record is only returned because there is an extra space after the name).

You can narrow this down even further by specifying a range of allowed characters:

SELECT *
FROM [Application].[People]
WHERE PreferredName LIKE 'Isabell[a-z]'

This will return the records for Isabelle and Isabella, but not Isabell (the extra space after the name
does not match the pattern a-z). In a similar fashion, you can use the caret ( ^ ) to match a character not in the specified range, like [^a-c]. Just try this one for yourself to return Isabell (with the extra space) and Isabelle, but not Isabella.
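A sketch of such a query, using the same PreferredName column:

SELECT *
FROM [Application].[People]
WHERE PreferredName LIKE 'Isabell[^a-c]'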
So to recap, these are the four wildcards:
* % (any character, any number of characters)
* _ (any character, exactly one character)
* [e-h] (exactly one character in the specified range)
* [^e-h] (exactly one character not in the specified range)

In all the examples above, we’ve used only one wildcard, and we’ve placed the wildcard at the end of the search pattern. You’re not restricted to this. You can combine these wildcards, use any number of wildcards and use them anywhere: at the beginning, at the end or anywhere in between. Later on, we’ll see that, if you use a wildcard at the start of a search pattern, this might have an impact on performance, but for now, we’ve said all we need to say about wildcards and the LIKE keyword, so let’s move on to the next keyword: IN.

If you need to find a record with an attribute that is in a list of possible values, you can use the IN
keyword. For example, to return the records for Isabell (with the extra space) and Isabelle, but not
Isabella, from the People table:

SELECT *
FROM [Application].[People]
WHERE PreferredName IN ('Isabelle', 'Isabell ')

You can use any number of items in the list. The keyword IN is often used with a subquery. Instead of
listing the values ('Isabelle', 'Isabell '), you get the values from another query (the subquery),
something like this:

SELECT *
FROM [Application].[People]
WHERE PreferredName IN ( SELECT CustomerName
FROM sales.Customers )

This example won’t return any results, but we’ll get back to the subquery later on. For now, just
notice that the subquery can only return one column (which makes sense; if you execute the subquery
by itself, it should return a list of values, because after all, this subquery substitutes the list of values).

To search for a range of possible values, you can use the BETWEEN keyword. You can do this with
strings, as in this example:

SELECT *
FROM [Application].[People]
WHERE PreferredName BETWEEN 'I' AND 'Isabelle'

More often, though, this is done with numbers or dates. For example, to find the persons with an ID
between 1 and 10:

SELECT *
FROM [Application].[People]
WHERE PersonID BETWEEN 1 AND 10

Notice that records with a PersonID of 1 and 10 are included in the result set.

We’ve now covered comparisons of attributes to constants, using a list of operators. As mentioned
earlier, you can also compare functions and variables. We’ll start with functions. If you do not know
what a function is, don’t worry; we’ll get back to this later. We’ve already seen functions MAX and
UPPER; for now, we’ll just use two other, simple examples.
The first is a system function that returns the current date and time: GETDATE. You can simply try
this out:

SELECT GETDATE()

This also shows that you can have a SELECT statement without even a FROM clause. You can use
this GETDATE function, for example, to return a list of records from the People table whose records
are still valid, by comparing the ValidTo date with the current date:

SELECT *
FROM [Application].[People]
WHERE ValidTo > GETDATE()

If the empty parentheses behind the function look strange to you: this is where you’d normally put the arguments for a function. A function works like this: you call a function with a number of arguments, and the function returns a result. The function GETDATE requires no arguments, but you still need to include the empty parentheses: ().
The second example of a function we’d like to show is a function for string manipulation. The
function LEFT takes two arguments: the string from which to take the leftmost characters, and the
number of characters to take. So for example, this will return the 3 leftmost characters of the name
Isabelle:

SELECT LEFT('Isabelle', 3)
You can not only apply this function to a string literal, but also to an attribute. So for instance, to return all PreferredNames from the People table for which the first 3 letters equal ‘Isa’, you’d use the following query:

SELECT PreferredName
FROM [Application].[People]
WHERE LEFT(PreferredName,3) = 'Isa'

This, by the way, returns the same result set as this one:

SELECT PreferredName
FROM [Application].[People]
WHERE PreferredName LIKE 'Isa%'

This is something you’ll see a lot: at times, there are different ways of obtaining the same result in
SQL. Sometimes, two different ways are just as good; in this case, the second one might be a lot
better for performance. When talking about the position of a wildcard, we already mentioned that, if
you put a wildcard at the first position of the search pattern, this can have an impact on performance.
The same applies when you apply a function to a column. This is covered in the requirement “Identify
the impact of function usage to query performance and WHERE clause sargability” (since this is
the second time we mentioned we’re going to cover this later, you might get the impression that it is
important; it is).

There are a lot of functions built into SQL Server, and you can even create your own. This is
something we’ll cover later on, too. In the context of the current requirement, however, all you need to
know about functions is this: you apply a function to zero or more arguments and you get a result back
you can use in a WHERE clause.

We’ve now covered the use of attributes, constants and functions in the search predicate. The last
thing we need to cover in the search predicate is the variable. You might already know what a
variable is from other programming languages. If not, it is easiest just to show you. In Transact-SQL,
you always have to declare a variable before you can use it, and explicitly give it a data type. There are a lot of data types, and we’ll cover them later. For now, we’ll use varchar(50), meaning you can store a
variable number of characters with a maximum of 50 in the variable. After you’ve declared the
variable, you can assign a value to it (using the SET keyword) and use it, for example to compare to
an attribute:

DECLARE @FirstName varchar(50)

SET @FirstName = 'Isabelle'

SELECT *
FROM [Application].[People]
WHERE PreferredName = @FirstName

This is equivalent to the following statement:


SELECT *
FROM [Application].[People]
WHERE PreferredName = 'Isabelle'

By the way: you can also use the SELECT clause to assign a value to a variable.

DECLARE @FullName varchar(50)

SELECT @FullName = FullName


FROM [Application].[People]
WHERE PreferredName = 'Isabelle'

SELECT @FullName

In this case, you’ll have to ensure that the SELECT statement only returns one value. The SELECT statement might, however, return multiple values, as would be the case if you changed the WHERE clause to:

WHERE PreferredName LIKE 'Isa%'

In that case, 6 values will be returned, and the last value returned will be stored in the variable. With the example shown above, you have no control over which of these 6 values that will turn out to be.
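For reference, the complete statement with the modified WHERE clause would look like this:

DECLARE @FullName varchar(50)

SELECT @FullName = FullName
FROM [Application].[People]
WHERE PreferredName LIKE 'Isa%'

SELECT @FullName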

At times, using a variable will be necessary or useful, for example to improve the legibility of the code, or when using stored procedures with input or output parameters (as we’ll see later on).

We’ve now covered the use of attributes, constants, functions and variables in the search predicate. In
all our examples, we’ve compared one type to another, e.g. a variable to an attribute, or an attribute to
a constant. This is not a requirement; you can also compare a constant to a constant, or an attribute to
an attribute. A particularly interesting example of comparing a constant to a constant is something you
might see in test code:

SELECT *
FROM [Application].[People]
WHERE 1=1

The predicate “1=1” is true, no matter what the contents of a record; therefore, the above example will return every record in the table. Conversely, you might see this code:

SELECT *
FROM [Application].[People]
WHERE 1=2

The predicate “1=2” is false, no matter what the contents of a record; therefore, the above example will return no records at all. This can be useful if you want to know the columns of the table, but you don’t need the data.
An example of comparing an attribute to another attribute is something we’ll see when joining tables.

This is all you need to know in order to create a single search predicate. Now you know how to do
this, let’s start combining predicates. By the way: a combination of predicates is called a search
condition.
Combining search predicates is pretty straightforward if you’ve got just two search predicates. You combine the predicates using either the keyword AND, or the keyword OR. If you use AND, both predicates have to be true for a record to be included in the result set (the combination is more restrictive than either predicate on its own, returning fewer records, or the same number); if you use the keyword OR, one or both of the predicates have to be true for a record to be included (the combination is less restrictive than either predicate on its own, returning more records, or the same number). For example, this statement could never return any records:

SELECT *
FROM [Application].[People]
WHERE PersonID = 1
AND PersonID = 2

And the following statement will return two records:

SELECT *
FROM [Application].[People]
WHERE PersonID = 1
OR PersonID = 2

Which, by the way, is equivalent to the following statement:

SELECT *
FROM [Application].[People]
WHERE PersonID IN (1, 2)

If you start combining more than two search predicates, things might get a bit complicated if you are
using a combination of AND and OR, because it might not be clear which predicates will be
evaluated first. In that case, you should use brackets to group the predicates you want to be evaluated
together. You can test this for yourself using the following example:

SELECT *
FROM [Application].[People]
WHERE (PersonID = 1
OR PersonID = 2)
AND PreferredName = 'Isabella'

SELECT *
FROM [Application].[People]
WHERE PersonID = 1
OR ( PersonID = 2
AND PreferredName = 'Isabella')

That’s it for the WHERE clause. We’ve covered how to make a search predicate by using variables,
functions, attributes and constants, and how to combine search predicates to make a search condition,
and shown that only the records for which the search condition evaluates to true are included in the
result set. Along the way, we’ve also covered a little bit about data types, variables and unknown (or
NULL) values.

ORDER BY clause
This is the easiest clause of the list. In SQL Server, you never know the order in which the records of
the result set are returned to the application, unless you specify that order. You do this by listing the
column on which you’d like the result set to be ordered:

SELECT *
FROM [Application].[People]
ORDER BY PreferredName

The default sort order is always ascending, from lowest value to highest value. In the case of a string:
from a-z. If you want the result set to be sorted descending, you have to specify this:

SELECT *
FROM [Application].[People]
ORDER BY PreferredName DESC

The alternative for the keyword DESC is ASC, but since this is the default, you can omit this.

Any records with the same PreferredName will still be returned in no guaranteed order. If you want to prevent this, you can also sort on more than one column. This will cause all records to be sorted on PreferredName, and all records with the same PreferredName to be sorted on PersonID:

SELECT *
FROM [Application].[People]
ORDER BY PreferredName, PersonID

Note: the column you’re ordering on does not have to be included in the result set. This will work just fine:

SELECT PreferredName
FROM [Application].[People]
ORDER BY ValidFrom

You can also sort on a column alias, or on a function applied to a column. You can even sort on the
number of the column, but that last one is not recommended, as it makes the code more difficult to
maintain.
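A small sketch of both, sorting first on the outcome of the LEN function (which simply returns the length of a string; the choice of LEN here is just our own illustration) and then on a column alias:

SELECT PreferredName AS Name
FROM [Application].[People]
ORDER BY LEN(PreferredName), Name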
That’s it about the ORDER BY clause. But before we end this section on the proper structure of a
SELECT statement, there is one more thing we need to cover: the semicolon at the end of the
statement. This is, in most cases, optional. Therefore, we’ve omitted it in almost all our examples, just
to make these examples as concise as possible, showing only the concept we’re trying to explain, and
as little else as possible. There were some exceptions; the statement to retrieve the last backup for
each database might be code you’d use outside of the preparation for this exam; that’s why we’ve
terminated that statement properly. Most other code in this book is intended only for instructional
purposes.
However, it is recommended to end each statement with the semicolon, to avoid situations where it
might be unclear whether the statement has in fact ended. Microsoft has announced that proper ending
of statements through the use of semicolons will become mandatory in a future version. At this
moment, however, it is optional in most situations. We’ll cover two exceptions later, the common table expression and the THROW statement; the statement preceding either of these has to be terminated with a semicolon.

That being said, we’ve now covered the proper structure of a SELECT statement:
* The four-part table name (server.database.schema.table), with the first three parts being optional;
* The table alias;
* The list of columns in the SELECT clause;
* The column alias;
* The filter criteria in the WHERE clause;
* The ORDER BY clause;
* The semicolon at the end of the statement.

Write specific queries to satisfy business requirements


What is the difference between this objective and the previous (“Identify proper SELECT query
structure”)? Not very much. If you can identify the proper SELECT query structure, you can retrieve
all the data you need. This second objective requires you to only retrieve the data you need, and in the
order you need it.
First, you should retrieve only those rows and columns you actually need. Make sure that the correct
WHERE clause predicates are in place to retrieve only the records you need, and specify the columns
you actually need; avoid using SELECT *. Using SELECT * will retrieve all columns. Even in situations where you do need to retrieve all columns, it is probably better to specify them than to use SELECT *; at a later stage, someone might add a column to the table, causing your SELECT statement to retrieve more columns than it needs. This makes avoiding SELECT * a best practice.
In the code samples throughout this book, we use SELECT * a lot. The reason behind that is, as mentioned earlier, to make the code samples as concise as possible, in order to focus on the topic at hand and to eliminate as much clutter as possible. For code you actually put into a production script or application, however, you should take the extra effort and list all columns.
Second, you should retrieve the columns and rows in the order you need them. For columns, this
means listing the columns in the SELECT clause in the correct order; for rows, this means adding an
ORDER BY clause. If your application handles this ordering, there is no need to do this in T-SQL as
well. In fact, there may be situations where it is better to let your application server handle the load
of sorting records, as this is quite CPU expensive. But just remember: if you do not specify an
ORDER BY clause, SQL Server will return the records in whatever order is most convenient at that
time for SQL Server, so next time you run the exact same query on the exact same data, you may get
your results in a different order.

Construct results from multiple queries using set operators


In a previous section, we’ve demonstrated the proper structure of a SELECT statement. The data you
get back from a SELECT statement is called a result set. SQL Server is optimized to deal with record
sets, not individual records. In this section, we’ll discuss set operators: operators that combine two
or more result sets. These operators are: EXCEPT, INTERSECT, UNION and UNION ALL.
* UNION ALL adds the results of all result sets, including duplicates;
* UNION adds the results of all result sets, eliminating duplicates;
* EXCEPT returns the data of the first result set that is not in the second result set;
* INTERSECT returns the data of the first result set that is also in the second result set.

In order for this to work, there are a few restrictions on all result sets:
* All result sets must have the same number of columns;
* The data type of the columns in each result set has to match (or SQL has to be able to convert the
data type; we’ll get to this later);
* The column names of the first result set are used (you can specify a column alias in the other result
sets, but SQL will not use that);
* Only the last result set can be followed by an ORDER BY clause.

These set operators are most often used on different tables containing similar information (e.g.
address information in a customer table and address information in a Personnel table). In our test
database, we don’t have such different tables containing similar information, but that’s okay: we’ll
make our own data. We already saw that we don’t need an actual table for a working SELECT
statement. For the working of the set operators, it also doesn’t matter whether the data comes from an
actual table, so we’ll skip that part, and start with an easy UNION:
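For example, using nothing but constant values (the values here are just made up for illustration):

SELECT 1 AS Number
UNION
SELECT 2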

As expected, this combines the two result sets. If, instead of UNION, we’d use UNION ALL, the result would be the same, since the result sets do not contain any duplicates (remember, UNION ALL returns all data for both result sets without removing duplicates, and UNION does remove duplicates). If we change the data a little, we can see the difference between UNION and UNION ALL:
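For example, with the same value in both result sets:

SELECT 1 AS Number
UNION
SELECT 1

SELECT 1 AS Number
UNION ALL
SELECT 1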

As you can see, UNION removed the duplicate record, UNION ALL did not. You can check for
yourself that, in order for a record to be considered a duplicate, all columns have to match.

Now you know how the set operator UNION works, you can verify our claim that both result sets
need the same number of columns:
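For example, the following raises an error, because the two result sets do not have the same number of columns (again, the values are made up):

SELECT 1 AS Number, 2 AS AnotherNumber
UNION
SELECT 1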

You can now also test our claim that SQL uses the column names from the first result set:
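For example, in both of the following queries, the column alias in the second result set is ignored:

-- The result set column is called Number
SELECT 1 AS Number
UNION
SELECT 2 AS AnotherNumber

-- The result set column has no name at all
SELECT 1
UNION
SELECT 2 AS AnotherNumber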
As you can see, the column names from the second result set were completely ignored, regardless of
whether a column name was supplied in the first result set.

And finally, before we move on to the other set operators, let’s see what happens when data types don’t match between the corresponding columns of the result sets:
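For example (here, the second result set supplies a string value):

-- The string '2' can be implicitly converted to int, so this works
SELECT 1 AS Number
UNION
SELECT '2'

-- The string 'A' cannot be converted to int, so this fails with a conversion error
SELECT 1 AS Number
UNION
SELECT 'A'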

Now, for the other two set operators. EXCEPT will return all records in the first result set that are not
present in the second result set:
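For example, this returns a single record with the value 1:

SELECT 1 AS Number
EXCEPT
SELECT 2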
INTERSECT will return all rows that are present in both result sets:
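And this returns a single record with the value 1 as well:

SELECT 1 AS Number
INTERSECT
SELECT 1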

The examples for INTERSECT and EXCEPT were maybe a bit too easy, as they contained only one record per result set. Since our test database WideWorldImporters doesn’t contain a suitable table, we’ll have to create two tables containing more than one record. If the examples using result sets containing only one record were enough demonstration for you, you can move on to the next requirement; otherwise, you can follow along with some more elaborate examples.
As mentioned earlier, set operators are often used to combine the result set of queries on tables
containing similar data. For example: you have a table with address information for employees, and a
table with address information for customers, and you want to create a list of addresses for a mailing
list. We’ll use four examples for each of the following criteria:
* You want a mailing list for all customers and all employees, and you want to make sure that an
employee who is also a customer doesn’t receive the mailing twice. In this case, you use UNION.
* You want a mailing list for all customers and all employees, and you don’t care if an employee who
is also a customer receives the mailing twice (or maybe you know that employees aren’t customers).
In this case, you use UNION ALL.
* You want a mailing list for all customers who are also employees. In this case, you use
INTERSECT.
* You want a mailing list for all customers who aren’t also employees. In this case, you use EXCEPT.

We’ll create two tables in another test database, called TestDB:

USE TestDB
GO

Creating tables isn’t an objective for this exam, but since this example requires it, we’ll give a little
explanation anyway.
CREATE TABLE dbo.Customers
(
CustomerID tinyint NOT NULL IDENTITY
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,[Address] varchar(100) NOT NULL
);

You create a table using the CREATE TABLE statement (not too complicated). You specify a name for the table, and optionally, the schema in which you want the table to be created. As in the SELECT statement, if you do not specify a schema, your default schema will be used, but it is advised to always specify the schema (even if you want to use your default schema). After the table name, between the parentheses, you list the columns for the table. For each column, you have to specify a name and a data type (we’ll cover data types later on). Here, we’ve used the data types tinyint (which allows you to store values between 0 and 255, so apparently the growth plans for this company are limited), and varchar(100), which allows you to store a string of characters of a variable length, up to a maximum length of 100 characters. Optionally, for each column, you can specify whether or not you want to allow unknown values (NULL) to be stored in the column; we chose not to allow NULL values. For the CustomerID column, we also specified this to be an identity column, meaning that we don’t enter a value for this column when we insert data; instead, SQL will generate an incrementing number, starting with 1. And finally, we had to enclose the column name Address in square brackets, as Address is a reserved keyword in T-SQL.
Now you know how to create a table, let’s create a second one:

CREATE TABLE dbo.Employees


(
EmployeeID tinyint NOT NULL IDENTITY
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,[Address] varchar(100) NOT NULL
);

These tables look very similar, which is exactly what we need for this example.
Now, let’s insert some records. Inserting is something we’ll cover later in more detail.

INSERT dbo.Customers VALUES ('Bob', 'Jackson', 'Main street 1, Dallas');


INSERT dbo.Customers VALUES ('Frank', 'Smith', 'Second street 2, Miami');
INSERT dbo.Customers VALUES ('Joe', 'Johnson', 'Third Avenue 3, New York');

INSERT dbo.Employees VALUES ('Jack', 'Ford', 'Second street 2, Denver');


INSERT dbo.Employees VALUES ('Donald', 'Charleston', 'Times Square, New York');
INSERT dbo.Employees VALUES ('Bob', 'Jackson', 'Main street 1, Dallas');

Bob Jackson is a customer as well as an employee. We’ll select only the columns FirstName and
LastName; if we’d ignore best practice and use SELECT * instead of listing the columns, results
would differ, as he is CustomerID 1 and EmployeeID 3 (remember, all attributes of a record have to
match for the record to be a match). This means, that when we apply the set operators:
* UNION should return 5 records (the record for Bob will be returned only once);
* UNION ALL should return 6 records (the record for Bob will be returned twice);
* INTERSECT should return 1 record (only Bob).
* EXCEPT should return 2 records (the records for all customers except Bob).

We’ll only show the result of the UNION operator. You can try the others for yourself.
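For example, the UNION query could be written as follows (swap UNION for UNION ALL, INTERSECT or EXCEPT to try the other operators):

SELECT FirstName, LastName
FROM dbo.Customers
UNION
SELECT FirstName, LastName
FROM dbo.Employees;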

Finally, we have to remove these tables, as we don’t need them anymore:

DROP TABLE dbo.Customers;


DROP TABLE dbo.Employees;

This is all you need to know about the set operators for the exam. Just to recap:
* UNION ALL adds the results of all result sets, including duplicates;
* UNION adds the results of all result sets, eliminating duplicates;
* EXCEPT returns the data of the first result set that is not in the second result set;
* INTERSECT returns the data of the first result set that is also in the second result set.
Distinguish between UNION and UNION ALL behavior
The difference between the set operators UNION and UNION ALL is already covered in the previous
requirement: UNION ALL combines two (or more) result sets without eliminating duplicates; UNION
combines two (or more) result sets, but does eliminate duplicates. So why is this important enough to
be a separate requirement? Because the extra step, eliminating duplicates, can be quite costly (in
terms of performance). If the result sets are large, it can require a lot of work on the part of SQL
Server to eliminate the duplicates. So if you do not care about duplicates, or if you know that your
data doesn’t contain any duplicates, you should use UNION ALL instead of UNION.
As stated above: this matters for large result sets. For our two little tables containing three records
each, the difference in performance is negligible; both queries completed in zero seconds. But there is
still a difference, and there is also a way we can make the difference visible. This is the execution
plan. You don’t have to know about execution plans for the exam, but since it is the best way to
illustrate the performance impact of the UNION statement, and it gives very important information for
performance problems in SQL Server, we’ll give a short introduction to execution plans anyway.
The execution plan describes the operations SQL Server needs to perform to execute the query. Sometimes,
there is only one way for SQL to perform a query. For example, when you perform a simple SELECT
statement to find a single record on a table without indexes, SQL will have to read the entire table
until it finds that record. But if there are indexes on the table, SQL will have to choose whether or not
to use one of the indexes, or read from the table, or use a combination of the index(es) and the table.
And when there are more tables involved, SQL will have to choose which table to process first, and
choose one of several methods of combining the intermediate results of the different tables.
Understanding the execution plan is crucial in performance tuning. And fortunately, you can get a
graphical representation of the execution plan for a query. So we can use the graphical representation
of the execution plan of the UNION and UNION ALL queries from our example with the Customer
and Employees table.

First, you have to instruct SQL to display the actual execution plan after the query is finished. To do
this, you have to select the button for the Actual Execution Plan in the SQL Editor toolbar (or, as the
tooltip suggests, use the keyboard shortcut control + M):

If you look at the other buttons in the SQL Editor toolbar, you’ll notice that there is also a button
called Estimated Execution Plan (which will immediately display an execution plan without actually
performing the query), and, new in SQL 2016, a button called Include Live Query Statistics (which
will show the execution plan while the query is running). This last one, Live Query Statistics, is a
huge improvement in the area of troubleshooting, as it displays the execution plan while the query is
actually running, and exactly which part of the operation is being performed; but since our query
finishes in zero seconds flat, we’ll use the Actual Execution Plan instead of Live Query Statistics
(which is a topic for the next book, on exam 70-762).
We’ll start with the execution plan for just the SELECT from the Customers table (without the set
operator):

Next to the tabs Results and Messages, a new tab has appeared, showing the execution plan. Reading
from left to right, you ask for a select statement, and to fulfill that, SQL needs to perform a table scan
on the Customers table. As this is the only operation, this table scan is 100% of the total cost of this
query. Now, let’s perform the UNION ALL:

Again, reading left to right: you ask for a select statement, and to fulfill that, SQL needs to concatenate
two input streams, which it will get from two table scans. Each table scan is 50% of the cost, and the
cost of concatenating them is negligible. The total cost of a query always adds up to 100%. The
absolute cost of the table scan on the Customers table is unchanged, obviously, but relative to the total
cost of the query it has dropped from 100% to 50%.
Now, the UNION:

If you compare this plan to the last one, you can see that an extra operator is needed: a distinct sort.
SQL will sort all records, in order to find duplicates. And the cost of that operator is 63%. The
relative cost of the table scan on the Customers table has dropped from 50% to 18%. In other words,
the total cost of the query has more than doubled, just by using UNION instead of UNION ALL.

This clearly demonstrates the objective for this section: the extra operation required for a UNION (as
compared to a UNION ALL) is very expensive. The set operator UNION ALL combines two result
sets without eliminating duplicates; if you do not care about duplicates, or if you know your data does
not contain duplicates, use UNION ALL instead of UNION.

Identify the query that would return expected results based on provided table structure and/or data
This requirement tests whether you can actually apply the knowledge of the previous requirements.
We’re not going to do that separately for the previous requirements; instead, we’ll be doing that in the
questions at the end of this chapter.
Query multiple tables by using joins
Up until this point we’ve been querying just one table to get a result set, except for the query to
retrieve the latest backup for each database. But a database usually consists of multiple tables, so you
need to know how to combine tables in a single query. We’ll explain how to join tables, and show
you different ways of joining tables.

Write queries with join statements based on provided tables, data, and requirements
This requirement is also a test of whether you can actually apply your knowledge. We’ll test you in
the examples to follow, and in the questions at the end of this chapter.

Determine proper usage of INNER JOIN, LEFT/RIGHT/FULL OUTER JOIN, and CROSS JOIN
In our WideWorldImporters database, a sales order is stored in two tables: one record in the
Sales.Orders table, and for each order, for each item, one record in the Sales.Orderlines table. Now
let’s assume you want to retrieve all order information for a customer called Tailspin Toys (Lytle
Creek, CA) on July 17, 2014 (if any). The simplest way to start would be to select from both orders
and order lines tables. It is also the wrong way; don’t run this query unless your machine is really,
really powerful:

SELECT *
FROM Sales.Orders
,Sales.OrderLines

The Orders table has a little over 70 thousand records, the OrderLines table 230 thousand records.
The result of this query is a Cartesian product, combining each member of the first set with each member of the second set. Therefore, this query should theoretically return a whopping 16 billion
rows; instead, on my machine with 64 GB RAM, it returns an error:

An error occurred while executing batch. Error message is: Exception of type 'System.OutOfMemoryException' was thrown.

Each row of the Orders table gets combined with each row of the OrderLines table, and that result set is just too much to handle for most systems. This type of join is called a cross join. So as an
alternative, we could have written our statement as follows:

SELECT *
FROM Sales.Orders
CROSS JOIN Sales.OrderLines

A cross join is not often used, however, and as you can see if you tried to run the query, it can kill the
performance of your SQL Server.
In our case, besides performance, it doesn’t make sense to combine an order with all order lines; we
only want to combine an order with the order lines that belong to that specific order. In order to do
that, we can use the WHERE clause:

SELECT *
FROM Sales.Orders
,Sales.OrderLines
WHERE Sales.Orders.OrderID = Sales.OrderLines.OrderID

Remember the table alias? We can use that to make the statement more readable:

SELECT *
FROM Sales.Orders o
,Sales.OrderLines ol
WHERE o.OrderID = ol.OrderID

Now, our result set contains a line for each order line and all information from the order it belongs to.
That makes this a different join type: an inner join instead of a cross join. We’re getting there, but this
is still not the way to write a proper join statement. Let’s add the restriction on order date:

SELECT *
FROM Sales.Orders o
,Sales.OrderLines ol
WHERE o.OrderID = ol.OrderID
AND o.OrderDate = '2014-07-17'

When we start adding more tables and more search criteria, it will become less apparent which
predicate is for joining the tables (the join predicate) and which predicate is for filtering (the search
predicate). To improve readability, and decrease the chance of forgetting a join predicate, you should
use the following syntax (called an ANSI style join):

SELECT *
FROM Sales.Orders o
JOIN Sales.OrderLines ol ON o.OrderID = ol.OrderID
WHERE o.OrderDate = '2014-07-17'

This way, you put the join predicates after the JOIN statement, and the search predicate(s) in the
WHERE clause.

We’ve now covered how to properly join two tables, and two types of join: the cross join and the inner
join. We still have to cover the third type: the outer join. Whereas the inner join returns rows from
one set with a match in the other set, the outer join includes records without a match. Let’s find some
data we can use to demonstrate this outer join.
In the example of the orders and order lines tables, that would mean an order without order lines, or
order lines without a corresponding order. Given proper database design, it should not be possible to
have an order line that does not belong to an order. Database design is covered in the next book, on
exam 70-762, but just a little glance ahead: the design of the order lines table in the
WideWorldImporters database does in fact prevent order lines from being added without a corresponding
orderID in the orders table. It does this by a construct that is called a foreign key constraint. This
foreign key is defined on the column OrderId in the OrderLines table; it references the OrderId
column in the Orders table, and dictates that you can only have an OrderId that actually exists in the
Orders table. Quite often, a foreign key references the primary key of the table it references. A
primary key is an attribute that is guaranteed to be unique, and can therefore be used to uniquely
identify each row. In the case of the Orders table, that is indeed the case; OrderId is the primary key
of the orders table.

The order lines table also contains a column StockItemID, which logic suggests would be related to
the StockItems table (in the Warehouse schema); it would not be strange to find an item in stock that
had never been sold, so let’s see if this combination of tables has the required data to demonstrate an
outer join. We’ve seen enough of SQL to construct a query to select records in the StockItems table
that have a StockItemId that is not present in the OrderLines table:

SELECT *
FROM warehouse.StockItems
WHERE StockItemID NOT IN ( SELECT StockItemID
FROM sales.Orderlines)

Unfortunately, there are no items in stock that have never been sold. Good news for the fictional
company WideWorldImporters, bad news for us. But we can easily make our own example of an
order table and a customer table with data suitable to demonstrate the outer joins. That means we
need a customer without an order, and an order without a customer.

CREATE TABLE dbo.Customers


(
CustomerID tinyint NOT NULL IDENTITY PRIMARY KEY
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,[Address] varchar(100) NOT NULL
);

CREATE TABLE dbo.Orders


(
OrderID tinyint NOT NULL IDENTITY PRIMARY KEY
,CustomerID tinyint NOT NULL
,OrderDate datetime NOT NULL
,SalesAmount decimal(18,2) NOT NULL
);

INSERT dbo.Customers VALUES ('Bob', 'Jackson', 'Main street 1, Dallas');


INSERT dbo.Customers VALUES ('Frank', 'Smith', 'Second street 2, Miami');
INSERT dbo.Customers VALUES ('Joe', 'Johnson', 'Third Avenue 3, New York');

INSERT dbo.Orders VALUES (1, '2011-01-01', 30.02);


INSERT dbo.Orders VALUES (2, '2012-02-02', 15.99);
INSERT dbo.Orders VALUES (4, '2014-04-04', 107.00);

As you can see, the Orders table has a column called CustomerID. When properly designed, this table
should have a foreign key constraint so we can only insert the CustomerID of customers that actually
exist in the Customers table, but that would defeat the purpose of this whole exercise. We did add a
primary key, for reasons that will become apparent later on.
So now we have the data, let’s look at the three possible types of outer join: the left outer join, the
right outer join and the full outer join. A left outer join selects all matching pairs, plus the records in
the first table (the left one if you write the statement on a single line) without a match in the second
table:
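For our two tables, that query could look like this:

SELECT *
FROM dbo.Customers c
LEFT OUTER JOIN dbo.Orders o ON c.CustomerID = o.CustomerID;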

As you can see, this query returns all records in the customer table, with the matching records in the
Orders table. Customer #3, Joe Johnson, has no order in the Orders table; therefore, all attributes for
the Orders table are returned as NULL for Joe.
The right outer join is the exact opposite of the left outer join: it selects all matching pairs, plus the
records in the second table (the right one if you write the statement on a single line) without a match
in the first table.

Now, the result shows all orders, with NULL values for the columns of the Customer table for the
order without a match in the customer table. Obviously, putting table Orders first in the join and
Customers second would have created the same effect as changing LEFT OUTER JOIN into RIGHT OUTER JOIN.

Now, for each customer, we have (at most) one order. What happens if you add more records with a
match in the other table? Let’s find out. First we’ll insert another order for Bob, CustomerId #1:

INSERT dbo.Orders VALUES (1, '2015-05-05', 230.02)

And now, let’s rerun the left outer join query:


Customer Bob now shows up twice, once for each of his orders.

The final outer join type is the full outer join. This returns all records with a match in both tables,
plus all records without a match:
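In our case, the full outer join could be written as:

SELECT *
FROM dbo.Customers c
FULL OUTER JOIN dbo.Orders o ON c.CustomerID = o.CustomerID;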

For now, we’ve only joined two tables. But you can easily join more tables. Let’s create a third table,
OrderLines, and join that table to the other two. Or to be more precise: join that table to the result of
the join of the first two tables.

CREATE TABLE dbo.OrderLines


(
OrderLineID tinyint NOT NULL
,OrderID tinyint NOT NULL
,Product varchar(100) NOT NULL
,Units tinyint NOT NULL
,UnitPrice decimal(18,2) NOT NULL
)

INSERT dbo.OrderLines VALUES (1, 1, 'ProductA', 1, 10.04);


INSERT dbo.OrderLines VALUES (2, 1, 'ProductB', 2, 9.99);

This only adds the order lines for one order; feel free to add more. Now, let’s write a query to select
all orders for all customers who’ve placed an order, with order lines when available. This will result
in the following query:

SELECT *
FROM dbo.Customers c
INNER JOIN dbo.Orders o ON c.CustomerID = o.CustomerID
LEFT OUTER JOIN dbo.OrderLines ol on ol.OrderID = o.OrderID

To select all customers with an order, we need the inner join; and to select only the order lines when
available (and NULL values when unavailable), we need the left outer join.
The result set will include all the ID columns, which we are not interested in and which make the screenshot completely illegible. So let’s only select the columns we need:
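For example (the exact choice of columns is up to you):

SELECT c.FirstName
,c.LastName
,o.OrderDate
,o.SalesAmount
,ol.Product
,ol.Units
,ol.UnitPrice
FROM dbo.Customers c
INNER JOIN dbo.Orders o ON c.CustomerID = o.CustomerID
LEFT OUTER JOIN dbo.OrderLines ol ON ol.OrderID = o.OrderID;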

This concludes our demonstration of the different join types. We’ve shown you:
* the cross join, which pairs each record of the left-hand table with all records of the right-hand table (which can lead to an enormous number of records being returned, as the number of records in the result set is always the number of records in the left-hand table multiplied by the number of records in the right-hand table);
* The inner join, which returns all pairs of records that have a matching value in the column (or
columns) you specify in the ON clause;
* Three different types of outer join which return all records of the inner join, plus records without a
match in the column(s) you specify in the ON clause. Which records without a match are returned,
depends upon the type of outer join. A left outer join will return unmatched records from the left-hand
table (but not unmatched records from the right-hand table); a right outer join will return unmatched
records from the right-hand table (but not unmatched records from the left-hand table); and a full outer
join will return both the unmatched records from the left-hand table and the unmatched records from
the right-hand table.
We’ve also mentioned the primary key and foreign key concepts, but only briefly, because those are
out of scope for this exam; just remember that tables are often joined on the foreign key of one table to the primary key of the table the foreign key references.

Now, let’s clean up the demo tables before we proceed:

DROP TABLE dbo.Customers;


DROP TABLE dbo.Orders;
DROP TABLE dbo.OrderLines;

Construct multiple JOIN operators using AND and OR


We’ve now seen how to join one or more tables. In our previous examples, we’ve joined tables
based on a match in one column, i.e. one join predicate. This is the most common way to join tables,
but at times, it will be necessary to join tables on more than one column. This occurs in situations
where you need more than one column to determine which record from the first table is related to a
record in the other table. We’ll limit our demonstration to joining on two columns, as it is very rare to
encounter joins using three or even more join operators. First, we’ll show multiple join operators
using AND, meaning a record in one table will need to have a match in two columns of each table.
After that, we’ll show the alternative: multiple join operators using OR, meaning that a record in one
table will need to have a match in either of the two columns that are used for joining the tables.

We’ve mentioned that tables are often joined on a column that is a foreign key in one table and the
primary key in the other table. In our examples above, we joined the Orders table to the Customers
table on the CustomerID column; this is the primary key of the Customers table (the CustomerID column
is not actually defined as a foreign key in the Orders table, because otherwise we could not have
inserted a record without a match).
A primary key, however, does not necessarily have to be a single column; some tables are designed
with a primary key that consists of two or even more columns (called a composite key). In that case,
it is the combination of columns that uniquely identifies each row. A common scenario that requires
joining two tables using multiple join operators is a composite primary key in one table, and, in the
other table, a foreign key referencing this primary key.

So let’s recreate our Customers and Orders table with a composite key. Instead of using an artificial
number CustomerID, we’ll use FirstName and LastName as our composite primary key. If you’ve
been paying attention, you might remark that this combination will not make a very good primary key,
as this combination will have to be unique in the table, and this will prevent you from having more
than one customer called, for example, James Smith. You’d be correct; but as this example is easy to
understand, we’ll use it just the same. And besides, not every database you’ll meet in the wild will
have a good design, either.

CREATE TABLE dbo.Customers


(
FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,[Address] varchar(100) NOT NULL
);

ALTER TABLE dbo.Customers ADD CONSTRAINT PK_Customers


PRIMARY KEY (FirstName, LastName)
GO

CREATE TABLE dbo.Orders


(
OrderID tinyint NOT NULL IDENTITY PRIMARY KEY
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,OrderDate datetime NOT NULL
,SalesAmount decimal(18,2) NOT NULL
);

INSERT dbo.Customers VALUES ('Bob', 'Jackson', 'Main street 1, Dallas');


INSERT dbo.Customers VALUES ('Frank', 'Smith', 'Second street 2, Miami');
INSERT dbo.Customers VALUES ('Joe', 'Johnson', 'Third Avenue 3, New York');

INSERT dbo.Orders VALUES ('Bob', 'Jackson', '2011-01-01', 30.02);


INSERT dbo.Orders VALUES ('Frank', 'Smith', '2012-02-02', 15.99);
INSERT dbo.Orders VALUES ('Bob', 'Smith', '2014-04-04', 107.00);

This code differs in two places from the code we’ve used to create & fill these tables the first time.
First, the primary key for the customers table is not declared in line, but as a separate statement.
When you use a primary key that consists of a single column, you can either declare this primary key
in the same line of code where you define the column, or create the primary key separately; when you
use a composite key, you must create it separately. Second, in the orders table, we now have to
specify the first name and last name of the customer, instead of just the CustomerID (we need both
columns to join on).
For the order without a match in the customer table, we used the same first name as customer #1, and
the same last name as customer #2, so you can see what goes wrong if you match on only one column.
You can test this for yourself.

Please make sure you fully understand why these changes are required for this scenario before
proceeding.

Now, in order to match all orders to the corresponding customer, we have to match on two columns,
because both first name and last name will have to match:

SELECT *
FROM dbo.Customers c
FULL OUTER JOIN dbo.Orders o
ON c.FirstName = o.FirstName AND c.LastName = o.LastName
This example illustrates how to join two tables with multiple join operators using AND.

Again, let’s clean up the demo tables before we proceed:

DROP TABLE dbo.Customers;


DROP TABLE dbo.Orders;

Using multiple join operators with OR is far less common than using AND. Obviously, in the above
example, you could just use OR instead of AND, but the result wouldn’t make any sense. Let’s create
an example that does make sense, something from the administration of a school: a student table and a
related parent table.

CREATE TABLE dbo.Student


(
StudentID tinyint NOT NULL IDENTITY PRIMARY KEY
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,FatherID tinyint NOT NULL
,MotherID tinyint NOT NULL
,IsCurrent bit NOT NULL
);

CREATE TABLE dbo.Parent


(
ParentID tinyint NOT NULL IDENTITY PRIMARY KEY
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,EmailAddress varchar(100) NULL
);

INSERT dbo.Student VALUES ('Bob jr', 'Jackson', 1,2, 0);


INSERT dbo.Student VALUES ('Joe jr', 'Johnson', 3,4, 1);

INSERT dbo.Parent VALUES ('Bob sr', 'Jackson', NULL);


INSERT dbo.Parent VALUES ('Betty', 'Jackson', 'bettyjackson@someprovider.com');
INSERT dbo.Parent VALUES ('Joe sr', 'Johnson', 'joejohnson@someprovider.com');
INSERT dbo.Parent VALUES ('Jane', 'Johnson', 'janejohnson@someprovider.com');
For each student, we’ve added a column called IsCurrent; 1 means that this student is still in school, 0
means this is a former student. The data type bit can only contain a 0 or a 1 (more on that later when
we discuss data types). For each parent, we’ve added an email address.
The assignment: to create a list of email addresses of all parents of current students, with the name of
their child. In order to do that, you have to find a record in the parent table that is listed as either a
mother or a father.
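One way to write that query:

SELECT p.EmailAddress
,s.FirstName
,s.LastName
FROM dbo.Student s
INNER JOIN dbo.Parent p
ON p.ParentID = s.FatherID OR p.ParentID = s.MotherID
WHERE s.IsCurrent = 1;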

This example illustrates how to join two tables with multiple join operators using OR.

By now, you should know how to remove our example code:

DROP TABLE dbo.Student


DROP TABLE dbo.Parent

Determine the correct results when presented with multi-table SELECT statements and source data
This requirement is also a test of whether you can actually apply your knowledge. We’ll test you in
the questions at the end of this chapter.

Write queries with NULLs on joins


In all the examples above, we’ve used the equal sign for join predicates. As stated earlier, when
comparing attributes, two attributes that each have a NULL value are not considered to be equal. So if
you join on a column that has NULL values in both tables, these records will not be matched, and
therefore will not be returned in the result of the JOIN statement. This is by design, because as stated,
a NULL value means unknown, and two unknown values may or may not be equal.
However, there may come a time when you need to join NULL values. For example, you may have
data that is incomplete. Or you’re dealing with hierarchies that are uneven. The concept of an uneven
hierarchy will probably require a bit of explanation. A sales company like WideWorldImporters
might divide its products into categories with the following hierarchy: product group/product
category/product subcategory/product. If all categories have subcategories, you have an even
hierarchy. If some product categories do not have a subcategory, you end up with an uneven hierarchy.
The same thing applies to the subdivision of countries. In some countries, you’d divide a country into
states, then counties, then cities; in others, you don’t have the same number of levels. In all of these
cases, you might have to deal with joining on NULL values.

There are two possible solutions for this problem: either use a replacement value or join using the IS
operator. We’ll show both solutions.
First, let’s create the required tables. We’ll use the Customer and Orders table we created earlier,
with a little difference to actually allow for NULL values. We’ll define CustomerID as NULL instead
of NOT NULL, and omit both the IDENTITY and PRIMARY KEY on this column:

CREATE TABLE dbo.Customers


(
CustomerID tinyint NULL
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,[Address] varchar(100) NOT NULL
);

CREATE TABLE dbo.Orders


(
OrderID tinyint NOT NULL IDENTITY PRIMARY KEY
,CustomerID tinyint NULL
,OrderDate datetime NOT NULL
,SalesAmount decimal(18,2) NOT NULL
);

INSERT dbo.Customers VALUES (1, 'Bob', 'Jackson', 'Main street 1, Dallas');


INSERT dbo.Customers VALUES (2, 'Frank', 'Smith', 'Second street 2, Miami');
INSERT dbo.Customers VALUES (NULL, 'Joe', 'Johnson', 'Third Avenue 3, New York');

INSERT dbo.Orders VALUES (1, '2011-01-01', 30.02);


INSERT dbo.Orders VALUES (2, '2012-02-02', 15.99);
INSERT dbo.Orders VALUES (NULL, '2014-04-04', 107.00);

Now, first we’ll verify that even though there is a Customer with an ID of NULL, and an Order for a
Customer with a CustomerID of NULL, these records will not match:
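For example:

SELECT *
FROM dbo.Customers c
INNER JOIN dbo.Orders o ON c.CustomerID = o.CustomerID;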

That’s because NULL does not equal NULL. As stated, the first solution is joining on a replacement
value. We haven’t covered this yet, but to replace a NULL value with something else, we can use the
function ISNULL:
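For example, with 99 as the replacement value:

SELECT CustomerID
,ISNULL(CustomerID, 99) AS ReplacedCustomerID
FROM dbo.Customers;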

If the attribute actually contains a value, the result of the function ISNULL is that value; if the attribute
contains NULL, the result of the function ISNULL is the replacement value (in this case, 99; be careful
though, that your replacement value is not a value that is actually in the data, or you’ll end up with a
mess). Now we can use this ISNULL function on both sides of the join predicate to join the unknown
attributes on the replacement value. This way, we’ll attribute all orders for an unknown customer to
the one customer whose Id is unknown:
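That join could look like this:

SELECT *
FROM dbo.Customers c
INNER JOIN dbo.Orders o
ON ISNULL(c.CustomerID, 99) = ISNULL(o.CustomerID, 99);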

The other solution is to use the keyword IS in the join predicate instead of the equal sign. However,
we also need two join predicates: we want to select all records that have matching customer id’s as
well as those that have an unknown customer id:
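One way to write it:

SELECT *
FROM dbo.Customers c
INNER JOIN dbo.Orders o
ON c.CustomerID = o.CustomerID
OR (c.CustomerID IS NULL AND o.CustomerID IS NULL);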

Again, let’s clean up the demo tables before we proceed:


DROP TABLE dbo.Customers;
DROP TABLE dbo.Orders;

So that’s it. Whenever you join two tables on a column that allows NULL values, you should think
carefully about what to do with these records: accept the default behavior, meaning these records will
not show up in the result of the join, or decide upon a way to include them in the result, either by using the combination of IS NULL and OR in the join predicate, or by using a replacement value (with the ISNULL function). Joining on NULL values is
not something you’ll do very often, though.
Implement functions and aggregate data
In the previous sections, we’ve already seen a number of functions, such as MAX and UPPER. In this
section, we’ll show you a lot more functions, go into more detail about different types of functions
and talk about the impact of the use of functions on performance.

Let’s start with an explanation of what a function is. A function is a database object that accepts a number of input parameters, performs an action using these parameters, and returns either a single value or a result set. Further on in this book, we’ll talk about stored procedures, which are a
little bit like functions. There are several differences, in usage, capabilities and performance between
functions and stored procedures. For now, the most important ones are:
* A function must return a value; a stored procedure may return a value, or even more than one.
* A function cannot be used to perform actions to change the database state (that is: you can’t perform
DML such as inserts, updates or deletes to change data, which we’ll cover later on).
* Stored procedures can be executed on their own (as we’ve seen) using “EXECUTE”, while
functions are executed as part of a SQL statement.

We’ll talk more about those differences when we talk about stored procedures.

There are two kinds of functions: user-defined functions and system functions (sometimes called
built-in functions). We’ll cover user-defined functions in a later requirement; for now, we’ll stick to
the system functions. However, we won’t try to cover them all. SQL Server has a very long list of
functions built in. For a complete list of all the functions, you can visit the Microsoft web site:
https://docs.microsoft.com/en-us/sql/t-sql/functions/functions

If you read that document, you might notice that it lists system functions as a subcategory of built-in
functions, while in Object Explorer, all built-in functions are listed under system functions. So even if
it is not exactly obvious which definition we should use, don’t worry; we’ll cover the most important
built-in functions from both the broader definition and the narrower definition. However, we
won’t cover them all; even the list of categories is too long to cover in detail:
Instead, we’ll cover the exam objectives, and along the way we’ll cover some of the functions that
are used most often. This is no guarantee, however, that you’ll only encounter functions on the exam
that are covered in this book. If you really want to be on the safe side, read all the definitions on the
previous link after you’ve completed this section of the book.

Construct queries using scalar-valued and table-valued functions


This requirement tests whether you know how to use a scalar-valued function, as well as a table-valued function.
As we said earlier, a scalar-valued function returns a single value; therefore, you can use a scalar-
valued function anywhere you can use a single value. We’ve already used the function UPPER, which
takes a string as input and returns the same string, but in all upper case letters. We’ve already seen
how you can use UPPER to display the FullName as all upper case characters:

SELECT UPPER(FullName)
FROM [Application].[People]
In this case, the attribute FullName is the input parameter. The alternative to UPPER, of course, is
LOWER, which takes a string as input and returns the same string but in all lower case letters. Here,
we’ve used a literal string as input parameter.
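For instance (any literal will do):

SELECT LOWER('New York');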

The output is another string. This seems like an appropriate time to point out that you can nest functions, i.e. use a function as input for another function:
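For instance, feeding the result of LOWER back into UPPER:

SELECT UPPER(LOWER('New York'));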

Not a very useful combination of functions, but the easiest way of demonstrating that you can nest
functions.

In the examples to follow, we’ll alternate between applying functions to columns, and to literal
values. With most functions, you can do both; we’ll just use the example that most clearly
demonstrates the function.

As mentioned, you can use a scalar function anywhere you need a single value, for example, in the
WHERE clause:

SELECT *
FROM [Application].[People]
WHERE LOWER(PreferredName) = 'isabelle'

To explain in which situation this particular example would be useful, we’ll have to explain a little
bit about collation. Every SQL Server and every database has a collation: a number of rules
governing which characters are allowed, how to sort characters, and what to do with different
variations of the same character. For example, my test database has a collation of
Latin1_General_100_CI_AS; this means, among other things, that it is case insensitive (CI) and
accent sensitive (AS). For more explanation of collations, see:
https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support
Here, we’ll just stick to an explanation of accent sensitivity and case sensitivity. In an accent
insensitive database, accents are ignored: “e” equals “é”, “è”, “ê” and “ë”, 'Daniel' therefore equals
'Daniël' ; in an accent sensitive database, “e”, “é”, “è” , “ê” and “ë” are considered to be five different
characters when comparing values, therefore 'Daniel' does not equal 'Daniël' .
In a case insensitive database, “i” equals “I”, but in a case sensitive database, “i” does not equal “I”;
therefore, 'isabelle' does not equal 'Isabelle' . So if you want to find the record(s) with PreferredName
'Isabelle' in a case sensitive database, regardless of whether it has been spelled as 'isabelle' , 'Isabelle' or
even 'isAbELLe' , you need to use the function LOWER (or UPPER) to achieve that. For the next
requirement, we’ll see the downside of this query when we talk about performance.

LOWER and UPPER are, therefore, scalar valued functions: they return a single value. A table valued
function returns a table, and can therefore be used where you’d use a table. There is a table valued
system function that is very useful for database administrators, which we’ll use as an example: the
function fn_virtualfilestats. This function takes two input parameters: the id of a database and the id of
a file of that database, and returns some information about that file; among that information is
performance data, such as the number of writes to that file, the number of bytes written to that file and
the total amount of milliseconds the system had to wait for writes on that file (also, the same info for
reads). You can find a complete description of this function at Microsoft Docs:
https://docs.microsoft.com/en-us/sql/relational-databases/system-functions/sys-fn-virtualfilestats-
transact-sql

This is not a function you necessarily need to know, but we’ll use it to demonstrate the use of a table
valued function. Just execute the following statement, we’ll give a bit more information later:
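For instance, using database id 1 and file id 1 (the data file of the master database):

SELECT *
FROM fn_virtualfilestats(1, 1);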

As you can see from this partial screenshot, the function returns a table (with only a single record, but
a table nonetheless). This is the basic idea of a table-valued function, so we could just end the
discussion of this requirement here, but we’ll elaborate a bit on this example, as it is an excellent
starting point to demonstrate more functions. We’ll use this table valued function to create a script that
shows the IO performance of all database files (since the last time SQL started, as these statistics are
reset at that time).
First, it is not necessary that you know the id of the database you’re interested in. The easiest way to
look up the id of a database is the system function DB_ID, as in:

SELECT DB_ID('WideWorldImporters')

So, if you’re interested in the performance of the database files of the WideWorldImporters database
on your system, you could now substitute the id of the database manually. But as we said, you can use
a scalar valued function anywhere you can use a single value; which means, that you can use the
DB_ID in the fn_virtualfilestats function:

SELECT *
FROM fn_virtualfilestats (DB_ID('WideWorldImporters'), 1);

If you’ve read the description of the fn_virtualfilestats function on Microsoft Docs, you may have
noticed that this function takes a default of NULL for both input parameters; if you use a NULL as the
first parameter, it will return info for all databases; if you use NULL as the second parameter, it will
return info for all files; if you use NULL for both parameters, it will return info for all files for all
databases.

SELECT *
FROM fn_virtualfilestats(NULL,NULL);

Still, that doesn’t tell us the name of the database, or the file. To get the name of the database in the
result set, we’ll use the opposite of the DB_ID function: DB_NAME. DB_NAME takes a database id
as input parameter, and returns the name. To get the name of the file, we can use a table called
sys.master_files. A table valued function can be used the same way as a table; therefore, you can join
a table-valued function to another table. To join fn_virtualfilestats to sys.master_files, we need to join
on two columns: database id and file id.
Now, we just need to divide the amount of milliseconds for reads & writes by the number of reads
and writes. That turns our query into:

SELECT DB_NAME(vfs.Dbid) as 'Database'


,mf.name as 'File name'
,mf.physical_name
,IoStallReadMS/NumberReads as 'Read latency'
,IoStallWriteMS/NumberWrites as 'Write latency'
FROM fn_virtualfilestats(NULL,NULL) vfs
INNER JOIN sys.master_files mf ON vfs.DbId = mf.database_id
AND vfs.Fileid = mf.file_id
ORDER BY DbId, file_id;

This works perfectly, unless one of your database files hasn’t been used since the last start of SQL
Server, in which case you’ll get the following error:

Msg 8134, Level 16, State 1, Line 1


Divide by zero error encountered.
You can’t divide by zero, so we need a solution for cases when either the number of reads or the
number of writes is zero. This is not something you need to know for the exam, but since we’re almost there, we might as well finish the example; the solution is the CASE expression. We need to
change this:

IoStallReadMS/NumberReads

into this:

CASE NumberReads
WHEN 0 THEN 'none'
ELSE IoStallReadMS/NumberReads
END

Basically this means: in case NumberReads is zero, return the string ‘none’; else return
( IoStallReadMS/NumberReads ). And the same for writes, of course.
But we’re still not there. Now, we get a different error:

Msg 8114, Level 16, State 5, Line 1


Error converting data type varchar to bigint.

SQL doesn’t want to mix data types in the column of a result set; the string ‘none’ is of the data type
varchar, and since both IoStallReadMS and NumberReads are of the data type bigint, the result of the expression (IoStallReadMS/NumberReads) is also bigint. This is basically the same restriction we
encountered when discussing the set operators (UNION, UNION ALL, INTERSECT & EXCEPT);
there too, every column in each set has to be the same data type as the corresponding columns in the
other set(s).
To solve that problem, we’ll use the function CAST. This function takes two parameters: an input
value and a data type to change the first parameter into. Obviously, we can’t turn ‘none’ into a number
(bigint is a sort of number, as we’ll see later on), but we can turn a number into a character string,
using the following piece of code:

CAST(IoStallReadMS/NumberReads AS varchar(100))

Now, the end result is:

SELECT DB_NAME(vfs.Dbid) as 'Database'


,mf.name as 'File name'
,mf.physical_name
,CASE NumberReads
WHEN 0 THEN 'none'
ELSE CAST(IoStallReadMS/NumberReads AS varchar(100))
END as 'Read latency'
,CASE NumberWrites
WHEN 0 THEN 'none'
ELSE CAST(IoStallWriteMS/NumberWrites AS varchar(100))
END as 'Write latency'
FROM fn_virtualfilestats(NULL,NULL) vfs
INNER JOIN sys.master_files mf ON vfs.DbId = mf.database_id
AND vfs.Fileid = mf.file_id
ORDER BY DbId, file_id;

This shows a very practical combination of the use of both a table-valued function and scalar
functions. And along the way, you’ve seen how to divide two numbers, learnt the CASE expression,
and seen a real world example of a join on multiple columns.

So to reiterate: a scalar-valued function takes zero, one or more input parameters, returns a
single (scalar) value and can be used everywhere you’d use a single value; a table-valued function
also takes either zero, one or more input parameters, returns a table and can be used everywhere
you’d use a table. Furthermore, we’ve discussed the functions UPPER, LOWER, fn_virtualfilestats,
DB_ID, DB_NAME and CAST.

Identify the impact of function usage to query performance and WHERE clause sargability
In this section, we’ll revisit the use of functions in the WHERE clause in code we used earlier, and
discuss the effect on performance.

When explaining the WHERE clause earlier, we mentioned a situation where two queries produced
the same result, but one was better for performance than the other one. We also said we’d get back to
this example later; now is the time for that. These were the two queries:

SELECT PreferredName
FROM [Application].[People]
WHERE LEFT(PreferredName,3) = 'Isa'

SELECT PreferredName
FROM [Application].[People]
WHERE PreferredName LIKE 'Isa%'

In the original state of the WideWorldImporters database, these queries will have identical
performance. Still, the second query might be considered better than the first. The reason for this: if we were to add an index on PreferredName to speed up these queries, SQL would use the index much more effectively for query #2. We’ll demonstrate this first, and afterwards give a detailed
explanation. For this demonstration, you’ll have to include the Actual Execution Plan as we saw
earlier (click the button encircled in the screenshot, or use CTRL + M). That way, you can see both
queries perform the same:
As you can see on the Execution Plan tab, both queries have the same cost: 50%. And it may not be
clearly visible on the screenshot, but if you execute the queries yourself (as we highly recommend),
you can see that SQL Server suggests that you add an index to this table:

CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>] ON [Application].[People] ([PreferredName])

Let’s add this index to PreferredName:

CREATE NONCLUSTERED INDEX [NonClusteredIndex-PerformanceDemo] ON [Application].[People]


(
[PreferredName] ASC
)

GO

Creating indexes is not an exam objective, so we won’t go into the details of this statement. Now, when we run both queries again, we see that one query is significantly faster than the other:
It would not be accurate to say that the second query uses the index and the first one doesn’t. For the
purpose of the exam, all you need to know is that the second query makes efficient use of the index,
and the first one does not (it uses the index in an inefficient way). However, a little more explanation
seems in order.
The first query does an index scan. This means that SQL reads the entire index, from Aahlada to
Zuzana. The table contains roughly a thousand records; if it would become ten times as big, so would
the index, and the index scan would take ten times as long. Reading the entire index is still faster than
the alternative (not using the index and instead reading the entire table), but this is not the best way to
use an index.
The second query does an index seek. This is a bit more like a human would use an index. When
looking for names starting with ‘Isa’ in an alphabetically ordered index of names, a human would start
somewhere in the middle. If the middle name is ‘Jack’, a human would back up a bit (maybe to
‘Ilse’), start reading the index again until it found the first name that matched the search pattern: ‘Isa’.
The human would then continue reading until it found the first name that did not match the search
pattern (‘Isidora’), knowing that any name further on the list wouldn’t match either (as the index is
sorted alphabetically).
SQL Server has an even better way of seeking for values in an index, especially for a large index. If
an index gets too large to fit on a single page, SQL puts another index level on top of that. This new
index level has a list of the minimum and maximum values in each page of the level beneath, such as:
page one has values Aahlada to Durjoy, page two has values Ebru to Fabrice, and so on. So this level
would be much, much smaller. And again: if that level of the index becomes too big to fit on a page,
SQL Server puts another index level on top of that; this way, the top level of an index is always a
single page. This way, SQL builds a structure called a B-tree (you can picture it as an inverted tree, with the single root page at the top). This is an extremely
efficient structure; even in an index for a table containing billions of records, SQL needs to read only
a few pages to find the right index page. Whereas the index scan would require ten times as much
work if the table became ten times as big, the index seek would require, at the most, one extra level of
the index to pass through, and therefore, one extra page to read.
Even this more elaborate explanation is a simplification of the actual structure of an index, as this
explanation is only included to explain why an index seek can perform so much better than an index
scan; for a more complete explanation, you can start reading here: https://technet.microsoft.com/en-
us/library/ms177443(v=sql.105).aspx

We’ve now demonstrated that the first query will not use an index seek, even though the correct index
is present. The reason for this is, that the index we’ve created is on PreferredName, not on the
leftmost three letters of PreferredName; therefore, SQL Server will not use the index (for an index
seek, that is).

Before we give a more general explanation, let’s first drop the index:

DROP INDEX [NonClusteredIndex-PerformanceDemo] ON [Application].[People]

The requirement we are currently discussing is phrased: “Identify the impact of function usage to
query performance and WHERE clause sargability”. We’ve demonstrated what the impact of a
function on a column in the WHERE clause can be on query performance: an index on that column
won’t be used for an index seek, but at best for an index scan (which is much slower). But what is
sargability? That is the ability for a filter predicate to be used as a search argument for an index.
Search argument is abbreviated as SARG; hence, sargability. Generally speaking, you’ll lose
sargability if you either perform a function on a column or search for a pattern that starts with a
wildcard (e.g. ‘%Isa%’).

If you do need to perform some computation in a filter, try to write it so that the function is applied to the literal or variable rather than to the column; that way, SQL can still use an index seek (if there is an index).

We’ll give you another example of different ways to write a query that impact sargability, because
proper index usage is important in the real world, and it will allow us to demonstrate another system
function. Suppose you work in the shipping department of WideWorldImporters, and you need to
know the order id’s of the orders that have to be delivered tomorrow.
For this query, you need a function we haven’t covered yet: DATEADD. It takes three parameters:
* An interval (such as minute, hour, day or week);
* The number of intervals to add (or a negative number, if you need to subtract);
* And a date & time value to add the number of intervals to.

So to add 1 week to August 25th, 2017:

SELECT DATEADD(week, 1, '2017-08-25')

And to get the date of yesterday, subtract one day from today:

SELECT DATEADD(day, -1, GETDATE())


First, we have to create an index, because otherwise it doesn’t matter whether the filter predicate is
sargable or not:

CREATE NONCLUSTERED INDEX [NCIX-ExpectedDeliveryDate] ON [Sales].[Orders]


(
[ExpectedDeliveryDate] ASC
)

Now, let’s construct the query. If we want to select the records with an ExpectedDeliveryDate of
tomorrow, we can choose to either apply the dateadd function to the column ExpectedDeliveryDate,
or to today’s date. Both are logically equivalent, but one is better for performance than the other. As
you may have guessed, applying the function to the column will make the filter predicate non-
sargable, whereas the alternative (applying the function to today’s date) preserves the ability to use an
index:
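The two versions could look something like this (GETDATE() returns a datetime, so we cast it to the data type date first to match the ExpectedDeliveryDate column):

-- non-sargable: the function is applied to the column
SELECT OrderID
FROM Sales.Orders
WHERE DATEADD(day, -1, ExpectedDeliveryDate) = CAST(GETDATE() AS date);

-- sargable: the function is applied to today's date
SELECT OrderID
FROM Sales.Orders
WHERE ExpectedDeliveryDate = DATEADD(day, 1, CAST(GETDATE() AS date));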

The query with the function on the ExpectedDeliveryDate column results in an index scan; the other
query results in an index seek, which in this case, is roughly 49 times faster (as you can see from the
relative query cost).

A more legible way of coding this, is first declaring a variable and setting this to tomorrow’s date:

DECLARE @tomorrow date


SET @tomorrow = DATEADD(day, 1, GETDATE())

SELECT OrderID
FROM Sales.Orders
WHERE ExpectedDeliveryDate = @tomorrow

We haven’t covered the use of data types or variables yet, but we’ll get back to this example later on,
when we talk about creating stored procedures.

Before we move on, don’t forget to drop the index:


DROP INDEX [NCIX-ExpectedDeliveryDate] ON [Sales].[Orders]

Identify the differences between deterministic and non-deterministic functions


The difference between a deterministic and a non-deterministic function is pretty easy: a deterministic
function always returns the same result (given the same input parameters and the same state of the
database), while a non-deterministic function does not necessarily produce the same result, even
when called with the same input parameters and the same state of the database. We’ll not go through
the entire list of built-in functions; instead, we’ll give you a couple of examples based on the
functions we’ve already covered. After that, for most functions, you can figure out for yourself if the
function is deterministic or not. Later on, when discussing data types, we’ll cover an exception, a
function for which it is not immediately clear that it is non-deterministic.

All of the string functions are deterministic. We’ve already seen some of these functions, such as
LEFT, RIGHT, UPPER and LOWER. The function ISNULL is deterministic as well. Another example
of a deterministic function we’ve covered is DATEADD.
We’ve also seen a non-deterministic function: GETDATE. This will return the current date & time,
which is a different value each time you run it.
A special case is the function RAND. This will return a random number between 0 and 1. Try
executing this function a couple of times:

select RAND()

It will produce a different number every time. Still, it is not completely non-deterministic. RAND
takes one optional input parameter: the seed. If you execute this function several times with the same
seed, it will return the same value every time. For example, try this:
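Any seed value will do; here we use 1:

SELECT RAND(1);
SELECT RAND(1);
SELECT RAND(1);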
The reason behind this is that the function RAND produces a pseudo-random number (which is not
really random). SQL simply has a list of numbers, and will rotate through that list, giving you the next
number on the list each time you call the RAND function without an input parameter. When you call
RAND using a seed, SQL will simply pick the pseudo random number in a position in the list that is
calculated based on the seed. And the next time, it will give you the next number. Really random
numbers are really hard to do for computer programs, and SQL is no exception.

Use built-in aggregate functions


An aggregate function takes, as its input, a set of rows, and returns a single value. SQL has a long list
of them built-in:

We saw one already in our query to determine the most recent backup for each database: MAX. MAX
returns the highest value from a set of records. For example, to find the last name in the alphabet from
the People table, you could use:
SELECT MAX(preferredname)
FROM Application.People

Which, by the way, would return the same result as:

SELECT TOP 1 PreferredName


FROM Application.People
ORDER BY PreferredName DESC

Obviously, MIN does the exact opposite: it returns the lowest value from the set. MIN and MAX work with most data types as input, and will return the same data type as output.

The function COUNT, well… it counts. For example, if you want to count the number of records in the
Sales.Orders table:

SELECT count(*)
FROM sales.orders

As an aside: if you just want to know the total number of records in a table, it is much faster to
achieve that with the following query:

EXEC sp_spaceused 'sales.orders'

This executes the system stored procedure sp_spaceused, which reads the number of rows from
metadata; the select statement actually reads the entire table, which can be a lot slower.

The function COUNT returns the record count as data type integer. The highest number that can fit in
an integer is roughly 2 billion. Should the number of sale orders of WideWorldImporters ever surpass
that value, you should use the function COUNT_BIG instead; this does the same as COUNT, but
instead it returns data type bigint, which can store a maximum value of 9,223,372,036,854,775,807.
Count is often used in conjunction with GROUP BY. This way, you can group records based on one or
more columns before counting the number of records in each group. For example, if you want to know
the number of sales per salesperson, you first group by the salesperson id, then count the number of
records in each group:
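In the Sales.Orders table, the salesperson is stored in the SalespersonPersonID column, so the query could look like this:

SELECT SalespersonPersonID
,COUNT(*) AS 'Number of orders'
FROM Sales.Orders
GROUP BY SalespersonPersonID
ORDER BY SalespersonPersonID;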
Note that when you group by a column (or a number of columns), you can only use those columns, or aggregate functions over the records in each group, in the SELECT and ORDER BY clauses. Try
ordering on CustomerID, for example, and you’ll get the following error message:

Msg 8127, Level 16, State 1, Line 5


Column "sales.orders.CustomerID" is invalid in the ORDER BY clause because it is not contained in either an aggregate
function or the GROUP BY clause.

The same way as COUNT counts the records, the function AVG calculates the average of a group of
values. Or, to be more precise: it calculates the average of all known values; NULL values are
ignored when calculating the average. This function, of course, can only be used with numbers. SUM
calculates the total sum of a group of values.

Here’s a screenshot of all these aggregate functions in action:
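A query along these lines produces that result (aggregating the Quantity column of the Sales.OrderLines table; the exact set of columns may differ):

SELECT AVG(Quantity) AS 'Average'
,AVG(CAST(Quantity AS decimal(5,2))) AS 'Average #2'
,SUM(Quantity) AS 'Total quantity'
,COUNT(*) AS 'Number of order lines'
,MIN(Quantity) AS 'Smallest quantity'
,MAX(Quantity) AS 'Largest quantity'
FROM Sales.OrderLines;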


Note the average, as calculated in line 1. The total number of items ordered is 9,310,904; divided by 231,412, this should return 40 dot something, but it doesn’t; it returns exactly 40. This is because “quantity” is an integer, and the result of an operation on values of a given data type is usually of that same data type. So, 9,310,904 divided by 231,412 becomes 40, unless we first change the data type by casting “quantity”, in this case to decimal(5,2). We’ve done just that for “Average #2”. You can easily verify this behavior by entering the calculation directly:

SELECT cast(9310904 as dec(18,2))/cast(231412 as dec(18,2))


SELECT 9310904/231412

The rest of the functions should be self-evident.

We’ll not go into the checksum or statistical functions to calculate standard deviation and variance,
and we’ll save the GROUPING and GROUPING_ID functions for later, when we discuss grouping in
more detail.

Use arithmetic functions, date-related functions, and system functions


We’ll start with the date-related functions. We’ve already seen two of them: GETDATE and
DATEADD. There are a lot more date-related functions:
Date and time functions are important, because they are used very often, and sometimes used
incorrectly, so we’ll cover all of them here. We’ll start with the functions DAY, MONTH and YEAR.
They take a datetime value as input and return, respectively, the number of the day, month or year (as
we’ll demonstrate).
DATEPART does the same thing, but is more flexible; it takes as input two parameters: an interval
and a datetime value. For the interval, you can choose which part of the datetime value you want
returned as an integer; e.g. seconds, quarter or day of the year. But you could also use month, and
effectively make DATEPART work as MONTH. For a complete list of the possible intervals, see:
https://technet.microsoft.com/en-us/library/ms174420(v=sql.105).aspx

Using the function DATENAME, you can turn the number for a day or month into the corresponding
name. The following screen print nicely captures all those functions.
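The screen print isn't included here; a sketch of the same demonstration:

DECLARE @d datetime = GETDATE();

SELECT DAY(@d) AS [Day]
      ,MONTH(@d) AS [Month]
      ,YEAR(@d) AS [Year]
      ,DATEPART(quarter, @d) AS [Quarter]
      ,DATEPART(dayofyear, @d) AS [Day of year]
      ,DATEPART(weekday, @d) AS [Day of week]
      ,DATENAME(weekday, @d) AS [Day name]
      ,DATENAME(month, @d) AS [Month name];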
Notice that DATEPART returns 7 as day of the week, because I’m writing this on a Saturday. You can
configure which day is considered the first day of the week. On my system, this is the default for US
English, which is Sunday. You can check what the first day of the week is using a system function:

SELECT @@DATEFIRST

On my system, this returns 7, meaning Sunday is considered the first day of the week, which makes Saturday day number 7. Let’s change this setting, and use a CASE expression
to reveal what day we’ve set it to:

SET DATEFIRST 1;
SELECT 'First day of the week' = CASE @@DATEFIRST
WHEN 1 THEN 'Monday'
WHEN 2 THEN 'Tuesday'
WHEN 3 THEN 'Wednesday'
WHEN 4 THEN 'Thursday'
WHEN 5 THEN 'Friday'
WHEN 6 THEN 'Saturday'
WHEN 7 THEN 'Sunday'
ELSE 'I do not know what went wrong here'
END
SET DATEFIRST 7;

I’ve included the CASE expression here because the previous time we used it (when
demonstrating the function fn_virtualfilestats), we only used one WHEN clause; that was
appropriate for the example, but felt a bit incomplete as a demonstration of the CASE expression, so
this example shows a fuller version.

The functions CURRENT_TIMESTAMP, SYSDATETIME, GETUTCDATE, SYSUTCDATETIME
and SYSDATETIMEOFFSET are all alternatives for GETDATE; the difference is that they either
have more precision, or take UTC time into consideration.
In order to understand the difference, we’ll have to jump ahead and cover the date and time data
types. There are six:
* date. Can only contain a date. E.g. ‘2017-08-26’
* time. Can only contain a time, with an accuracy of up to 100 nanoseconds. E.g. ’17:32:00.1234567’
* smalldatetime. Contains both date and time, with an accuracy of up to one minute. E.g. ‘2017-08-26
17:33’
* datetime. This is the date & time data type that is used most often. Contains both date and time, with
an accuracy of up to a third of a hundredth of a second. E.g.: ‘2017-08-26 17:33:00.123’. Because
this accuracy is a third of a hundredth of a second, when you use 3 digits after the period, the
rightmost digit is always zero, three or seven.
* datetime2. Contains both date and time, with an accuracy of up to 100 nanoseconds. E.g. ‘2017-08-
26 17:33:00.1234567’.
* datetimeoffset. Contains both date and time, with an accuracy of up to 100 nanoseconds, plus the
UTC offset (difference, in hours, between local time and UTC). E.g. ‘2017-08-26 17:33:00.1234567
+ 02:00’

Talking about UTC time: the function TODATETIMEOFFSET takes two input parameters, a date &
time value in datetime format, and a time zone in hours ranging from -14 to + 14, and returns the same
date & time but in a datetimeoffset format.

SELECT TODATETIMEOFFSET(GETDATE(), '+00:00')

If you want to know what the time is in another time zone, you can use the function SWITCHOFFSET.

SELECT 'Moscow time' = SWITCHOFFSET (Sysdatetimeoffset(), '+03:00')

Be careful: you need to provide a date & time value of the datetimeoffset data type; otherwise,
you’ll get a wrong result. Working with different time zones is error prone.

That leaves us with only two more functions: DATEDIFF and ISDATE. ISDATE takes as input
parameter a character string, and returns a 1 (true) if it can successfully convert this character string
to either a date, time or datetime data type; otherwise, it returns a zero (false). For example:

SELECT ISDATE('2017-02-29')

Since 2017 is not a leap year, this returns 0. ISDATE is useful if you need to convert text data and would like to do some error handling (and you
should; we’ll get back to error handling in the final chapter).
DATEDIFF is a bit like DATEADD. It calculates the difference between two datetime values. It
takes, therefore, two datetime values and an interval as input, and returns, as an integer, the number of
intervals between those datetime values. For example, to calculate how many days old I am, I’d use:

SELECT DATEDIFF(day,'1971-09-03', GETDATE())

DATEDIFF was the last of the date-related functions. We’ve covered them all, as they are
particularly error prone. Along the way, we also covered the system function @@DATEFIRST and
the various date & time data types.

Now, let’s move on to the arithmetic functions. These are either very easy or very difficult to
understand, depending on your knowledge of the arithmetic operation involved. In the documentation, the
arithmetic functions are listed under mathematical functions.
For example: the function ABS takes a number as input, and returns the absolute value of that number.
The absolute value of a positive number is that number itself; the absolute value of a negative number
is the same value, but positive. E.g.

SELECT ABS(7), ABS(-7)

Both will return 7.

Another example: the modulo operator (%) returns the remainder of a division. Seven divided by three
is two, with a remainder of one:

SELECT 7 % 3

I’m not going to try to explain the rest of the mathematical functions: if you know the mathematics
behind it, the SQL function is easy. There are just three of these functions that may not be that self-
evident: ROUND, FLOOR and CEILING. They all round numbers. FLOOR and CEILING both take
a numeric value as input; FLOOR rounds down to the nearest whole number, while CEILING rounds up to the nearest whole number.
ROUND also takes a second input parameter (and an optional third): an integer indicating the
accuracy, i.e. the number of positions either to the left or right of the decimal point to round to. A
screenshot to demonstrate these functions:
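Since the screenshot isn't reproduced here, a small sketch with illustrative values:

SELECT ROUND(123.456, 2) AS [Round to 2 decimals] -- 123.460
      ,ROUND(123.456, -1) AS [Round to tens] -- 120.000
      ,FLOOR(123.456) AS [Floor] -- 123
      ,CEILING(123.456) AS [Ceiling]; -- 124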

The reason to include these is that, on more than one occasion, I’ve encountered
overcomplicated custom code to achieve a result that these built-in functions would have
handled. So, as general advice: if you need a function, first make sure that there is no built-in
function that does the exact same thing. This might save you a lot of time writing custom code!

String functions
The next category of functions we’ll discuss, are the string functions. We’ve already covered
UPPER() and LOWER(), but still, a long list remains:
Of this list, we’ll cover the following functions:
* LEFT
* RIGHT
* SUBSTRING
* CHARINDEX
* PATINDEX
* REPLACE
* LEN
* ASCII
* CHAR
* REVERSE

Plus one that is not in the list: CONCAT.

CONCAT will concatenate a number of text values. It will take, as input, two or more text strings, and
it will return these strings glued together. For example, this will concatenate the strings “mon” and
“day”:

SELECT CONCAT('mon', 'day')

This obviously results in “monday”. The same can be achieved in another way:
SELECT 'mon' + 'day'

Using literal strings, the result will always be the same. However, when using columns of data from a
table, we have to be aware of the possibility of NULL values. That is where these functions differ.
Let’s say you executed the following queries:

SELECT firstname + ' ' + lastname


FROM dbo.parent

SELECT CONCAT(firstname, ' ', lastname)


FROM dbo.parent

If, for any given record, firstname and/or lastname is NULL, the result of the former query is NULL
(by default); in the result of the latter query, the NULL value will be replaced with an empty string.
For the former query, there is an execution setting that can change the default behavior of this query.
By default, concatenating strings using the + operator will result in NULL if one of the strings is
NULL; you can change this with the following command:

SET CONCAT_NULL_YIELDS_NULL OFF

For statements executed after this command in the same session, a NULL value will be converted to
an empty string when concatenating string values using the + operator. This setting will not affect the
result of concatenating using the CONCAT function. So, to reiterate: with
“CONCAT_NULL_YIELDS_NULL OFF”, both ways of concatenating will have the same result; with
“CONCAT_NULL_YIELDS_NULL ON”, results will differ for strings with NULL value.
To demonstrate this:
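A minimal sketch of that demonstration, using literal strings instead of the dbo.parent table:

SET CONCAT_NULL_YIELDS_NULL ON; -- the default setting

SELECT 'Bob' + ' ' + NULL AS [Plus operator] -- NULL
      ,CONCAT('Bob', ' ', NULL) AS [CONCAT]; -- 'Bob '

SET CONCAT_NULL_YIELDS_NULL OFF;

SELECT 'Bob' + ' ' + NULL AS [Plus operator] -- now 'Bob '
      ,CONCAT('Bob', ' ', NULL) AS [CONCAT]; -- still 'Bob '

SET CONCAT_NULL_YIELDS_NULL ON; -- restore the default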
Next up, the functions LEFT and RIGHT. LEFT takes, as input parameters, a string and the number
of characters to return, starting from the left; RIGHT does the opposite: it returns that number of characters
starting from the right.
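For example, taking three characters from each end of “monday”:

SELECT LEFT('monday', 3) AS [Leftmost three] -- 'mon'
      ,RIGHT('monday', 3) AS [Rightmost three]; -- 'day'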
SUBSTRING, like LEFT and RIGHT, returns a number of characters from a text string. But whereas
LEFT and RIGHT start at a fixed position, with SUBSTRING, this is variable. SUBSTRING takes
three input parameters: the expression, the position to start from and the number of characters you
want to return. So, to return the word “middle” from the text string “the middle part”, you need to start
at position 5, and fetch 6 characters:

SELECT SUBSTRING('The middle part', 5, 6)

So the equivalent of:

SELECT LEFT('monday', 3)

is:

SELECT SUBSTRING('monday', 1, 3)

In order to fetch the rightmost characters, and thus create the equivalent of the RIGHT function, we
need to combine SUBSTRING with REVERSE, which we’ll cover later on.
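As a preview of that combination: reverse the string, take the leftmost characters with SUBSTRING, then reverse the result back:

SELECT REVERSE(SUBSTRING(REVERSE('monday'), 1, 3)) -- 'day', same as RIGHT('monday', 3)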

CHARINDEX searches for an expression in another expression, and returns the starting position of
this expression (if the expression is found; otherwise, it returns 0). So, in order to find the word
“middle” in the text string “the middle part”:

SELECT CHARINDEX('middle', 'the middle part')

CHARINDEX takes an optional third input parameter: the position to start counting from. So the
function
SELECT CHARINDEX('t', 'the middle part')

will return 1; to ignore the first letter t, you can use the optional third parameter to start searching
somewhere after the first position:

SELECT CHARINDEX('t', 'the middle part', 2)

PATINDEX does something similar to CHARINDEX, but with PATINDEX, you can use wildcards.
The pattern you want to search for has to start and end with the % character (unless you are searching
for the very first or last characters of the string). For example:

SELECT PATINDEX('%m%e%', 'the middle part')

This will, once again, find the word “middle” in the text string “the middle part”.

Note that neither function provides an easy way to find the second occurrence of a string (or the third,
etc.); each only finds the first occurrence.
Using REPLACE, you can (as you may have guessed), replace an expression inside another
expression. The general syntax is:

REPLACE (expression, expression to be replaced, replacement value)

As an example:

SELECT REPLACE('the middle part', 'middle', 'best')

The function LEN returns the length of a string (or expression):

SELECT LEN('the middle part')

This will return 15.

Next up, two functions that will turn ASCII character codes into letters, and vice versa. Every letter,
number or character has a number in the ASCII code. For the list of ASCII characters and their
numbers, see: https://en.wikipedia.org/wiki/ASCII

For example, the letter A is character 65, so you can use the function ASCII to turn A into 65, and the
function CHAR to turn 65 into A:

SELECT CHAR(65)
SELECT ASCII('A')

Turning characters into ASCII numbers and vice versa can be quite useful when you encounter data
with special characters. On rare occasions, you might come across a string that appears to read “monday”,
but whose fourth character is actually a non-printable character; SQL Server Management Studio will display a placeholder symbol in its place. It is possible to analyze
what character this is by combining the ASCII and SUBSTRING functions:

DECLARE @strange_string VARCHAR(100)

SET @strange_string = 'mon' + CHAR(27) + 'day'

SELECT @strange_string

SELECT ASCII(SUBSTRING(@strange_string,4,1))

This is not something you’ll find very often, but it does happen, especially when importing data from
other systems.
Another use for the CHAR function is a task SQL DBAs perform a lot: using tables to generate code.
Let’s say you need a script to take a backup of every database. This is the code to make a single
backup of the model database to a file called “d:\model.bak”:

BACKUP DATABASE model TO DISK = 'd:\model.bak'

In the table sys.databases we have the names of all databases. We’ll use that table as the starting point
of our script. The first part is easy:

SELECT 'BACKUP DATABASE ' + name


FROM sys.databases
WHERE name <> 'tempdb'

You can’t make a backup of tempdb; that is why we included the WHERE clause. On my server, this
results in the following code:

BACKUP DATABASE master


BACKUP DATABASE model
BACKUP DATABASE msdb
BACKUP DATABASE TestDB
BACKUP DATABASE WideWorldImporters

We still need to add the location of the backup. Changing the first line to the following won’t be
enough:

SELECT 'BACKUP DATABASE ' + name + ' TO DISK = D:\' + name + '.bak'

If we want the result of the query to be actual T-SQL code we can use, we need the file location to be
surrounded by single quotes. There are two ways we can fix this problem.
The first way to do this, is including the single quotes in the string directly. In that case, we need to
add not one, but two single quotes surrounding the file name, like this:

SELECT 'BACKUP DATABASE ' + name + ' TO DISK = ''D:\' + name + '.bak'''

Adding just one single quote won’t work, as the single quote indicates the end of the string; the
second single quote acts as an escape character, telling SQL that this is meant to be
a literal single quote inside the string.
The second way to achieve this, is the CHAR function. The ASCII code for a single quote is 39; thus,
CHAR(39) will return this single quote:

SELECT 'BACKUP DATABASE ' + name +


' TO DISK = ' + CHAR(39) + 'D:\' + name + '.bak' + CHAR(39)

In this simple example, both ways will work; choose the way you feel results in the most readable
code. The same advice goes for using the + sign to concatenate the pieces together; if you prefer, you
can use the CONCAT function instead:

SELECT CONCAT('BACKUP DATABASE ', name,


' TO DISK = ', CHAR(39), 'D:\', name, '.bak', CHAR(39))

We’ll get back to this example in the questions at the end of this chapter.

The last string function we’ll cover is a simple one: REVERSE. It takes a string as input parameter,
and will return the same string in reverse:

SELECT REVERSE('monday')

This was the last of the string functions we needed to cover.

System functions
The final category we need to discuss for this requirement is the category system functions.
According to the narrower definition used in Microsoft Docs online, we’ve already covered one:
ISNULL. In a later chapter, on the requirements on error handling, we’ll cover some system functions
related to error handling, such as @@ERROR and @@TRANCOUNT. Now, we’ll cover some
system functions that are related to material we’ve already covered.
Let’s go back to one of the first select statements we’ve performed:

SELECT name
FROM sys.servers
WHERE server_id = 0

We needed this query to get the server part of the four-part table name. Actually, there is a system
function that serves the exact same purpose:

SELECT @@SERVERNAME

(A related function, HOST_NAME(), returns the name of the client machine you’re connecting from, which is not necessarily the same as the name of the server.)

A similar system function will tell you the name of the database user you’re connected with:

SELECT USER_NAME()

This may not seem useful when you are the only one accessing your test database, but later on in this
chapter, we’ll see how this system function can be used for auditing purposes.

The next function: @@IDENTITY. When discussing set operations, we needed to create a table to be
used as an example for the UNION statement:

CREATE TABLE dbo.Employees


(
EmployeeID tinyint NOT NULL IDENTITY
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,[Address] varchar(100) NOT NULL
);

We explained that, by adding the IDENTITY specification, SQL will automatically generate an
incrementing number, starting with 1, for this column whenever we insert a new record. We added
records using INSERT statements such as the following:

INSERT dbo.Employees VALUES ('Jack', 'Ford', 'Second street 2, Denver');


In this statement, we did not provide a value for EmployeeID; SQL generated this automatically for
us. As a matter of fact, you can’t even insert a value into this column yourself. The following
statement would cause an error:

INSERT dbo.Employees VALUES (1, 'Jack', 'Ford', 'Second street 2, Denver');

Msg 8101, Level 16, State 1, Line 10


An explicit value for the identity column in table 'dbo.Employees' can only be specified when a column list is used and
IDENTITY_INSERT is ON.

Actually, the error states that you can insert a value into this column yourself, by using a column list
(more on this when we cover the insert statement) and setting IDENTITY_INSERT on for this table.
To demonstrate:

SET IDENTITY_INSERT dbo.Employees ON;


INSERT dbo.Employees (EmployeeID, FirstName, LastName, Address)
VALUES (1, 'Donald', 'Charleston', 'Times Square, New York');
SET IDENTITY_INSERT dbo.Employees OFF;

We messed it up pretty nicely here, because now we’ve got two records with an EmployeeID of 1.
That leads us to the problem the function @@IDENTITY is meant to solve: when SQL automatically
inserts an identity value, exactly what value is being inserted?

If you use the function @@IDENTITY, immediately after inserting a record with an identity column,
SQL will tell you what identity was inserted. Usually, you do this when you need to perform another
action on this record, such as first inserting a sales order and then inserting one or more order lines
for this order. Let’s demonstrate that. We’ll recreate our order and order line tables, with an identity
column for the order id. I’m not sure if you actually dropped these tables, so we’ll drop them first (if
they exist). Then, we’ll insert a record in the order table, capture the automatically generated order id
in a variable, and use this variable to insert the correct order id in the order line record:

DROP TABLE IF EXISTS dbo.Orders


DROP TABLE IF EXISTS dbo.OrderLines

CREATE TABLE dbo.Orders


(
OrderID tinyint NOT NULL IDENTITY PRIMARY KEY
,CustomerID tinyint NOT NULL
,OrderDate datetime NOT NULL
,SalesAmount decimal(18,2) NOT NULL
);

CREATE TABLE dbo.OrderLines


(
OrderLineID tinyint NOT NULL IDENTITY PRIMARY KEY
,OrderID tinyint NOT NULL
,Product varchar(100) NOT NULL
,Units tinyint NOT NULL
,UnitPrice decimal(18,2) NOT NULL
);

DECLARE @OrderID tinyint


INSERT dbo.Orders VALUES (1,'2015-05-05', 230.02);
SELECT @OrderID = @@IDENTITY

INSERT dbo.OrderLines VALUES ( @OrderID, 'ProductA', 1, 10.04);

By the way: the DROP TABLE IF EXISTS statement is new in SQL 2016; in previous versions, you
had to use a more elaborate construction to check whether a table existed before dropping it,
something like:

IF EXISTS ( SELECT name


FROM sys.tables
WHERE name = 'Orders'
and schema_id=SCHEMA_ID('dbo'))
DROP TABLE dbo.Orders

We’ve now added a record to the orders table, let SQL add the identity, used the @@IDENTITY
function to retrieve the value SQL inserted and used this value to add a record to the table
dbo.orderlines with the correct order id. This demonstrates the use of the @@IDENTITY function.

Earlier on, we said that functions that don’t take any parameters still use the brackets to mark the
spot where the parameters would otherwise go, such as GETDATE(). That’s only true for newer
functions. The functions starting with @@ don’t take any parameters, and do not use the syntax with
the empty brackets. These functions were, in older SQL versions, referred to as global variables, but
since you can’t actually store anything in a global variable, Microsoft now refers to them as functions
instead of global variables.

There are other functions related to this one. @@IDENTITY works directly after inserting a new
record. To be more precise, it will return the last identity value generated in the same session. But in
every session, you can check the latest generated identity value for any table using the function
IDENT_CURRENT:
SELECT IDENT_CURRENT('dbo.Orders')

Well, to be more precise: any table you have the permission to view, but security is another topic
altogether, so we won’t go into that.

There is a third function that allows you to check the latest generated identity value:
SCOPE_IDENTITY. Like @@IDENTITY, this will return the latest generated identity value in the
same session, but unlike @@IDENTITY, SCOPE_IDENTITY will return the last generated identity
value in the same scope. Scope refers to a stored procedure, batch, trigger or function. Since
we’ll be creating stored procedures later on, that will be a good time to demonstrate scopes, and the
difference between @@IDENTITY and SCOPE_IDENTITY; for now, it is enough to know that
@@IDENTITY will return the latest generated identity value in the same session (regardless of
scope), and SCOPE_IDENTITY will return the latest generated identity value in the same session and
scope.
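For reference, the three functions can be queried side by side right after the insert above; this does not yet demonstrate the scope difference, but it shows the syntax:

SELECT @@IDENTITY AS [Last identity, this session]
      ,SCOPE_IDENTITY() AS [Last identity, this session and scope]
      ,IDENT_CURRENT('dbo.Orders') AS [Last identity for dbo.Orders, any session];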

In these examples, we’ve used a number as our primary key (the column that uniquely identifies every
row). To be more precise: we’ve used the data type tinyint, the numeric data type that takes up the least
amount of storage (more on that later when we cover data types). This is often a good choice
for a primary key. As an alternative, you’ll often see a globally unique identifier. In order to explain
some of the functions that deal with unique identifiers, we’ll have to touch upon some concepts that
deal with designing databases, such as the best choice for a primary key column, and default values;
designing databases is out of the scope of this exam, but without this extra knowledge, unique
identifiers won’t make sense, so neither will the functions dealing with unique identifiers.

A globally unique identifier is often referred to as a GUID. It is a 16-byte value, written as 32
hexadecimal digits separated by hyphens in the format xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, so it
looks like this: 4D120DA1-C394-4C2B-9809-00D94936D2F1. The corresponding data type is
called uniqueidentifier.
You can generate a new globally unique identifier using the function NEWID():
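For example (every call returns a different value):

SELECT NEWID() AS [A new GUID]
      ,NEWID() AS [Another new GUID];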

As the name suggests, it is guaranteed to be unique, even across different servers. While there are
definitely situations where you need a primary key to be unique across servers, that is usually not the
case; so you’re better off using a smaller data type for your primary key (bigger is not always better).
There is another reason why a unique identifier does not make a great choice for a primary key: it is
usually best to have a primary key that increases with each additional record, and if you create
globally unique identifiers using the function NEWID(), you’ll get IDs in a random order. There is a
function that avoids that problem: NEWSEQUENTIALID. You cannot use this function directly. The
following statement returns an error:

SELECT NEWSEQUENTIALID()

Msg 302, Level 16, State 0, Line 1


The newsequentialid() built-in function can only be used in a DEFAULT expression for a column of type 'uniqueidentifier' in a CREATE
TABLE or ALTER TABLE statement. It cannot be combined with other operators to form a complex scalar expression.

Fortunately, the error message tells you what the correct way of using the function
NEWSEQUENTIALID is. You have to use it as the default value for a column of the type
uniqueidentifier. A default value is a value SQL Server will automatically insert in a column,
whenever you insert a record without specifying a value for that particular column. This is quite like
the IDENTITY property we saw earlier; you don’t specify a value for a column with an IDENTITY
property, SQL does that for you. With the IDENTITY property, SQL will generate a number that is 1
higher than the number for the previously inserted record for that table; with the NEWSEQUENTIALID
function, SQL will generate a unique identifier that is greater than the one generated for the previously
inserted record in that table.
Two caveats here: for the IDENTITY property, you can optionally specify a seed and the increment.
Unless otherwise specified, the seed is 1, meaning the first value SQL will use is 1, but you can start
at a higher number if you like; and the increment is also 1 by default, meaning the next time SQL will
generate a number that is 1 higher than the previous time, but again, you can choose a higher number if
you like. So IDENTITY(100, 2) will generate 100 for the first record, 102 for the second record, then
104, etc.
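A minimal sketch of a custom seed and increment (the table name is just for illustration):

CREATE TABLE dbo.IdentityDemo
(
 DemoID int NOT NULL IDENTITY(100, 2)
 ,Payload varchar(100) NOT NULL
);

INSERT dbo.IdentityDemo (Payload) VALUES ('first'), ('second'), ('third');

SELECT DemoID, Payload FROM dbo.IdentityDemo; -- DemoID: 100, 102, 104

DROP TABLE dbo.IdentityDemo;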
For the NEWSEQUENTIALID function, SQL will choose a greater value every time, until Windows is
restarted; after a reboot, SQL may choose a lower value for the next NEWSEQUENTIALID, but it will
still be unique.

This is how you create and use a column using a default expression and the NEWSEQUENTIALID
function:
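A sketch of such a table (the table and column names are assumptions):

CREATE TABLE dbo.GuidDemo
(
 RowID uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY
 ,Payload varchar(100) NOT NULL
);

INSERT dbo.GuidDemo (Payload) VALUES ('first row'), ('second row');

SELECT RowID, Payload FROM dbo.GuidDemo; -- RowID was generated by the default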
For now, this is all you need to know about unique identifiers.

When talking about dates, we covered the function ISDATE. This takes one parameter, and returns 1 if
this parameter can be converted to a valid date; otherwise, it returns 0. The function ISNUMERIC
does basically the same; it takes one parameter, and returns 1 if this parameter can be converted to a
value of a numeric data type; otherwise, it returns 0.

Beware, however, that numeric data types can contain more than just plain numbers. For instance, there is a
data type called float, which contains floating-point numbers; an example of such a number is '1.23E2'. There
is also a data type called money; so the parameter '$20' can successfully be converted to a value of a
numeric data type, in this case money (or smallmoney, another numeric data type). The complete list
of numeric data types ISNUMERIC considers a valid number, is:
* tinyint, smallint, int and bigint;
* decimal, numeric, float, real;
* money, smallmoney (there is no data type called bigmoney, but feel free to insert your favorite Bill
Gates joke here).

This behavior of ISNUMERIC may not be what you want. If you want to know whether a value is of a
particular data type, instead of one of these 10 numeric data types, you can use the function
TRY_CAST. Earlier on, we’ve seen the function CAST. This will cast a value into another data type.
TRY_CAST is similar, but instead of returning an error when the CAST fails, TRY_CAST will return
NULL:
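A sketch of the comparison the screenshot showed:

SELECT TRY_CAST('$20' AS money) AS [To money] -- 20.00
      ,TRY_CAST('$20' AS int) AS [To int]; -- NULL

SELECT CAST('$20' AS money); -- 20.00
SELECT CAST('$20' AS int); -- raises a conversion error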

The value '$20' can be cast to data type money, but it cannot be cast to data type int. The difference
between CAST and TRY_CAST becomes clear; when unsuccessfully trying to cast '$20' to int,
TRY_CAST will return NULL, and CAST will return an error. We’ll return to this example in the
section on error handling.

By the way: to capture both the result of the successful TRY_CAST statements and the unsuccessful
CAST function, we’ve switched the query result tab to Text instead of the default, Grid. This can be
achieved using the button highlighted in the screenshot, or CTRL+T; you can revert back to Grid using
the button next to it, or CTRL+D.

An alternative to the CAST function is the CONVERT function. Both functions will try to change an
expression in one data type to another data type. So these will produce the same result:
SELECT CAST('20170914' AS datetime)
,CONVERT(datetime, '20170914')

The difference between the two functions is that CONVERT takes an optional third parameter: style.
Styles are most often used when converting date & time data to character strings (or vice versa). For
instance, check out these five different date time styles:
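The screenshot isn't reproduced here; a sketch with five commonly used styles:

SELECT CONVERT(varchar(30), GETDATE(), 101) AS [Style 101] -- mm/dd/yyyy (US)
      ,CONVERT(varchar(30), GETDATE(), 103) AS [Style 103] -- dd/mm/yyyy (British/French)
      ,CONVERT(varchar(30), GETDATE(), 104) AS [Style 104] -- dd.mm.yyyy (German)
      ,CONVERT(varchar(30), GETDATE(), 112) AS [Style 112] -- yyyymmdd (ISO)
      ,CONVERT(varchar(30), GETDATE(), 120) AS [Style 120] -- yyyy-mm-dd hh:mi:ss (ODBC canonical)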

There are a lot of different styles available. Not just for date time data types, but also for real, float,
money, XML and binary data. For the complete list, check out this link:

https://docs.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql

Similar to TRY_CAST, there is also a function called TRY_CONVERT. This will, as you may have
guessed, try to convert an expression to a data type, and if unsuccessful, return NULL.

The next function we want to explain is the function @@ROWCOUNT. This function takes no
parameters, and it returns the number of rows affected by the last statement in this session. This is
basically the same information you see in the Messages tab.
For example, try the following code:

SELECT *
FROM [Sales].[Invoices]

SELECT @@ROWCOUNT AS 'Number of records affected'

The select statement returns 70,510 rows, so @@ROWCOUNT is 70,510. It also works with insert,
update and delete statements. @@ROWCOUNT will return an integer value, and the maximum value
that can be stored in an integer is a little over 2 billion, so for very large databases, this may not be
big enough. In that case, there is an alternative function: ROWCOUNT_BIG(). This returns a value of
data type bigint; the biggest value that can be stored in a bigint is a lot bigger
(9,223,372,036,854,775,807), so you’re not likely to ever exceed that number. ROWCOUNT_BIG()
works exactly the same as @@ROWCOUNT:

SELECT *
FROM [Sales].[Invoices]

SELECT ROWCOUNT_BIG() AS 'Number of records affected'


Note that neither of these functions are affected by the statement “SET NOCOUNT ON”. The
statement “SET NOCOUNT ON” changes something that is called an execution setting (something we
saw earlier with “SET CONCAT_NULL_YIELDS_NULL OFF”). By default, SQL will always return
the number of affected rows by every statement to the client application (in the case of SSMS, this
number will show up on the Messages tab); i.e., the setting NOCOUNT is OFF. If you change this
execution setting to ON, SQL will no longer return that information to the client, but you can still
retrieve this number using the function @@ROWCOUNT or ROWCOUNT_BIG().
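A quick sketch to verify this:

SET NOCOUNT ON;

SELECT *
FROM [Sales].[Invoices];

SELECT @@ROWCOUNT AS 'Number of records affected'; -- still 70,510, even with NOCOUNT ON

SET NOCOUNT OFF;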
The reason to set NOCOUNT to ON, is to improve performance. It is not a lot of work for SQL to
report this number to the client application, but it is still work, and if you do not need this
information, it is best to SET NOCOUNT ON and let SQL do just a little less work.
One particular case where you can use this row count information, is error handling. Let’s suppose
you know that a certain update or delete statement should only affect 1 record. In that case, you can
check to see if @@ROWCOUNT = 1, and if not, undo the statement. We’ll see more on this in the
section on error handling.

In this section, we’ve covered a lot of functions. You need to be comfortable using these functions,
both for the exam and real life querying. Therefore, make sure you practice using these (and other)
functions.
Modify data

For the previous requirements, we focused on ways to get data out of a table, using the SELECT
statement. Now, we’ll focus on how to get data into a table, modify the data in a table and remove
data from a table, using the INSERT, UPDATE and DELETE statements, respectively.
The SELECT, INSERT, UPDATE and DELETE statements are elements of the Data Manipulation
Language (DML). DML is part of the SQL language. The other part of the SQL language is DDL: Data
Definition Language. This language contains elements that create, change and delete the structures
(such as tables and columns).

Write INSERT, UPDATE, and DELETE statements


Let’s start with the INSERT statement. We have already seen a bit of the INSERT statement, for the
examples where we had to create our own sample data. This is one of the INSERT statements we
used:
DROP TABLE IF EXISTS dbo.Employees

CREATE TABLE dbo.Employees


(
EmployeeID tinyint NOT NULL IDENTITY
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,[Address] varchar(100) NOT NULL
);

INSERT dbo.Employees VALUES ('Bob', 'Jackson', 'Main street 1, Dallas')

INSERT … VALUES is one of the four ways of inserting data into a table we’re going to cover. The
other three are:
* INSERT … SELECT
* SELECT INTO
* INSERT … EXEC

Using the INSERT…VALUES method, we supply the data to be inserted directly in the INSERT
statement. With the INSERT…SELECT method, you insert the result set of a select statement into an
existing table. With the SELECT INTO method, you insert the result set of a select statement into a
new table. And finally, with the INSERT…EXEC method, you insert the result of a stored procedure
into an existing table.

Let’s examine our insert statement in a bit more detail:

INSERT dbo.Employees VALUES ('Bob', 'Jackson', 'Main street 1, Dallas')

First, we’ve shortened the insert statement a bit. We’ve used INSERT instead of INSERT INTO.
Also, we’ve omitted the column listing. It is best practice to specify all columns in an insert
statement. This would turn our statement into:
INSERT INTO dbo.Employees (FirstName, LastName, Address)
VALUES ('Bob', 'Jackson', 'Main street 1, Dallas')

The reason that this is best practice is the same as the reason listing all columns in a select statement
is recommended: should someone add another column later on, your statement will still work as
intended (as long as the column that is added has a default or can be left empty, as we’ll see next).
If you do not list column names, you have to supply values for each column, except for columns with
an identity property (in our case, EmployeeID).
If you do list column names, you can omit the columns that are nullable, or have a default value (and
of course, you can omit a column with an identity property as well). Let’s start with nullable columns.
Our address column does not allow for null values, but if it did, there would be two ways to insert a
record with a null value for the address: either implicitly (by omitting the column) or explicitly (by
specifying NULL):

INSERT dbo.Employees (FirstName, LastName)


VALUES ('Bob', 'Jackson')

INSERT dbo.Employees (FirstName, LastName, Address)


VALUES ('Bob', 'Jackson', NULL)

We haven’t covered default values, so we’ll do that here. For every column in a table, you can add a
default value in the column definition. You can either do this when creating the table:

CREATE TABLE dbo.Employees


(
EmployeeID tinyint NOT NULL IDENTITY
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,[Address] varchar(100) NOT NULL DEFAULT 'Under the bridge'
)

Or, you can do this at a later stage:


ALTER TABLE dbo.Employees ADD DEFAULT ('Under the bridge') FOR [Address]

Now, if you want to insert records with the default value for address, you have two options similar to
a null value. Either you accept the default implicitly (by omitting the column), or explicitly (by
specifying the keyword DEFAULT):

INSERT dbo.Employees (FirstName, LastName)


VALUES ('Bob', 'Jackson')

INSERT dbo.Employees (FirstName, LastName, Address)


VALUES ('Bob', 'Jackson', DEFAULT)

In fact, if every column in the table would either be an identity column, allow a null value or have a
default, you could even insert a record with all default values:

INSERT dbo.Employees DEFAULT VALUES

By the way: if you want to, you can also insert more than one record at the same time, in the following
manner:

INSERT dbo.Employees (FirstName, LastName)


VALUES ('Bob', 'Jackson')
,('Bo', 'Jackson')
, ('Bill', 'Jackson')

The difference (between this single insert for three records and three insert statements for one record
each) is that in the former case, if something goes wrong with any of the rows, the whole statement fails and none of the three records is inserted.
This is all you need to know about the INSERT…VALUES method.

The second method to insert records is INSERT… SELECT. This method will insert all records of
the result set (into an existing table).

DROP TABLE IF EXISTS dbo.Employees_backup

CREATE TABLE dbo.Employees_backup


(
FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,[Address] varchar(100) NOT NULL
)

INSERT dbo.Employees_backup
SELECT FirstName, LastName, [Address]
FROM dbo.Employees

The select statement can be a simple select, or a more elaborate statement, for example, joining
multiple tables, or using set operators. For example:

INSERT dbo.Employees_backup
SELECT FirstName, LastName, [Address]
FROM dbo.Employees
UNION
SELECT 'Tim', 'Smith', 'Miami'

The third method to insert records is SELECT INTO. The difference between the SELECT … INTO
and the INSERT…SELECT method is that for the SELECT … INTO method, the table you insert into
will automatically be created (and therefore, must not already exist):

DROP TABLE IF EXISTS dbo.Employees_backup


SELECT FirstName, LastName, [Address]
INTO dbo.Employees_backup
FROM dbo.Employees

The columns from the newly created table will have the same data types, nullability and identity
property as the result set, but it will not have the same constraints such as defaults, primary keys or
foreign keys, or indexes. This is a very easy way to create an identical table for test purposes.
However, if you require a bit more control over the table you’re creating, it is usually best to create
the table explicitly.

The fourth and final method to insert records is INSERT… EXEC. As in the INSERT… SELECT
method, the result of the SELECT statement will be inserted into the table. The difference is that you
do not use a regular select statement, but either a stored procedure, or dynamic SQL. We haven’t
covered either stored procedures or dynamic SQL, so we will need to do that before giving
examples. Let’s start with stored procedures. We’ll cover stored procedures in more detail in chapter
3, so for the moment, we’ll keep it short.

A stored procedure is a small program of T-SQL code stored in the database. For example: you could
create a stored procedure to select all records from the Employee table:

CREATE PROC dbo.usp_get_employees


AS
BEGIN
SELECT FirstName, LastName, Address FROM dbo.Employees
END

Then, you could execute the stored procedure using the keyword EXEC:

EXEC usp_get_employees

This will execute the select statement in the stored procedure. Based on this simple example, the
advantages of using stored procedures will not be immediately apparent. Why not execute the select
statement directly? The reason for this is that stored procedures give database programmers total
control over the way users and programs access the databases. We’ll cover more ways you’d like to
exert control through the use of stored procedures in chapter 3; for now, we’ll just give you one
example.
Let’s say your company has strict privacy regulations controlling how you may access data, and wants
to keep an audit log of every access to the Employee table. This is a very common requirement
nowadays; even users whose work requires that they have access to data should only access it when they
actually need to. For instance, think of the medical records of a famous person in a hospital; staff that
is not involved in treating the patient should not access his or her records out of curiosity.
You can easily log all access to this table if all access to this table is done through the use of this
stored procedure (and only through stored procedures). We’ll just give you a general idea what
something like this would look like (don’t actually execute this code, as it is incomplete):

CREATE PROC dbo.usp_get_employees_with_logging


AS
BEGIN

INSERT tblLogging (account, time, table)


VALUES (USER_NAME(), GETDATE(), 'dbo.Employees')

SELECT FirstName, LastName, Address FROM dbo.Employees


END

Obviously, we’d have to create the tblLogging table first, but hopefully, you can understand the general idea
even without that table.

But back to our main topic, inserting data using INSERT… EXEC. When you execute the stored proc
usp_get_employees , you get a result set, which in turn, you can insert into a table:

INSERT dbo.Employees_backup
EXEC usp_get_employees

That’s it for using stored procedures to insert a result set into a table (for now).

Up next is dynamic SQL. First, we’ll explain what dynamic SQL is, and how to use it. Then, we’ll
show you how to use dynamic SQL to insert data (as this is the current topic). Next, we’ll go into
more detail on dynamic SQL, the considerations on when to use, and more importantly, when not to
use dynamic SQL. And finally, before we move on to updating & deleting data, we’ll cover the use of
cursors, which are often used with dynamic SQL.

So let’s start with what dynamic SQL is. The difference between dynamic SQL and regular SQL, is
that in a dynamic SQL statement, you store the SQL statement in a string variable, and then execute
that string. Like this:

DECLARE @sql nvarchar(1000)

SET @sql = 'SELECT FirstName, LastName, Address FROM dbo.Employees'

EXEC (@sql)

The last line of code can be substituted for this one:

EXEC sp_executesql @statement = @sql

The final EXEC statement can be used to insert data into a table, just like we did with the stored
procedure:

INSERT dbo.Employees_backup
EXEC (@sql)

Or alternatively:
INSERT dbo.Employees_backup
EXEC sp_executesql @statement = @sql

You now know enough about dynamic SQL to use it in the INSERT…EXEC method. But like we
mentioned, we would still like to elaborate on dynamic SQL, and explain cursors, before moving on
to updating & deleting data.

In more detail: dynamic SQL


The benefit of dynamic SQL is, well, that it is dynamic. You can write code that builds or changes your SQL
statement. This is particularly useful in situations where you need to issue a lot of similar SQL
statements against slightly different objects. We’ll give a simple and a more elaborate example of this.
The downside of dynamic SQL is that you potentially run into both security and performance issues.
We’ll talk about these concerns as well.
Because of the possible security and performance issues, dynamic SQL should not be your first
choice. It is still used a lot for database administration tasks, where, in many situations, the use of
dynamic SQL is still a valid choice, and at times, even the only choice. There are two main reasons
why dynamic SQL is more suitable for queries by database administrators, and less suitable for
queries by database developers:
* a database administrator usually has sysadmin permissions, and therefore can perform any action in
the database, while a database developer should write code that can run under the least amount of
privileges;
* a database administrator usually works with a large number of databases, and has limited control
over the design and structure of those databases, whereas a database developer works on a small
number of databases, and has (or should have) total control over the design and structure of the
database.

A common use for dynamic SQL for a database administrator would be a script to restore a database
backup to a different location. Restoring a database backup to a different location requires supplying
the new location for each database file. However, the number of database files may differ from one
database to the next (each database has at least two, a data file and a log file), and also the number of
backup files might be different for each backup (depending on your backup solution). So you can
select the relevant details of the backup from the msdb system database, and using dynamic SQL, add
the required statements to move each database file.
We won’t go into the particulars of writing such a script, as it would be an elaborate script, and many
such scripts can be found online. The point here is that, in such a situation, neither performance nor
permissions would be a reason not to use dynamic SQL (as executing the dynamic part of the restore
script wouldn’t take much time compared to the actual restore process, and the person running the
script would probably be a DBA with sysadmin permissions). Further on, after we’ve discussed
cursors, we’ll give a real life example of the use of dynamic SQL (combined with a cursor) for
database administrator scripts.

The dynamic part of dynamic SQL is that you can use all sorts of combinations of programming logic
and string manipulation to compose the string before you execute it. Let’s use the TOP statement as an
example. We haven’t covered this yet, so we’ll do that first. Using TOP, you limit the number of rows
you get back to the number (or percentage) of rows you specify. The following query will return just 3
rows from the Employees table:

SELECT TOP (3) *


FROM dbo.Employees

Unless, of course, there are fewer than three rows in the table. And this example returns half of the
table:

SELECT TOP 50 percent *


FROM dbo.Employees

TOP is almost always used with ORDER BY, in which case, the result set is ordered first before
returning the top number or percentage of rows. If you do not use ORDER BY when using TOP, you
can’t be sure which rows will be returned.

SELECT TOP 50 percent *


FROM dbo.Employees
ORDER BY LastName

Now suppose you want a dynamic number of rows to be returned. This is part of how you would do
that using dynamic SQL:

DECLARE @sql varchar(1000)


,@top int

SET @top = 3
SET @sql = 'SELECT TOP ' + CAST(@top AS varchar(10)) + ' FirstName, LastName, Address FROM dbo.Employees'

EXEC (@sql)

Both variables are concatenated to form the SQL statement, before executing the statement. That way,
whatever value you assign to @top, is the number of rows that will be returned.

As mentioned above, there is also the alternative sp_executesql, which is slightly different:
* sp_executesql requires the string to be declared as Unicode (ntext, nchar or nvarchar), instead of
non-Unicode.
* You can supply parameters separately, which allows for query plan reuse (and thereby, possibly
improves performance).
* Also, the use of explicit parameters in sp_executesql helps protect against SQL injection. SQL
injection is a hacker's method of executing SQL statements that are really different from what you
intended. The best way to explain SQL injection is to use an old joke. Look up “Bobby Tables” in
your favourite search engine, then come back to examine the following code. After that, we’ll return
to our TOP statement example.

So, we’ll create a student table, populate it with one record, and then create a dynamic SQL statement
to search for a record based on a parameter. Just imagine the school has an application where you can
enter a first name in a search box to find a student record from the database, and the application
would put whatever you enter into the parameter @Firstname. The most basic version of the code
would look like this:

-- create the table


CREATE TABLE Students (
FirstName varchar(100)
,LastName varchar(100))

INSERT INTO dbo.Students (FirstName, LastName) VALUES('Robert', 'Tables')

-- fetch the student record


DECLARE @sql nvarchar(1000)
,@Firstname nvarchar(1000)

SET @Firstname = 'Robert'


SET @sql = 'SELECT * FROM dbo.Students WHERE FirstName = ' + CHAR(39) + @Firstname + CHAR(39)

EXEC (@sql)

You can verify that this code would indeed return the correct record from the Students table if a user
enters the correct name. However, imagine that a malicious user would not enter a correct name, but
instead would enter SQL code in the search box, so the parameter would become:

SET @Firstname = 'Robert''; DROP TABLE dbo.Students--'

The two single quotes and the semicolon would properly terminate the first (intended) SQL statement;
next we have the malicious code, and the two hyphens at the end would make sure SQL treats
whatever comes next as comment. The application would now execute two statements, the second of
which would effectively drop the table (assuming that the application has the permissions to do so).

The alternative code using sp_executesql instead of EXEC() would be:

SET @sql = 'SELECT * FROM dbo.Students WHERE FirstName = @FirstName'

EXECUTE sp_executesql
@statement = @sql
,@parameters = N'@FirstName as varchar(1000)'
,@FirstName = @FirstName

Now, when a malicious user enters Robert'; DROP TABLE dbo.Students-- in the search box, nothing bad
will happen; SQL will only execute one statement, looking for a student whose first name exactly
matches the text entered in the search box. Therefore, using sp_executesql instead of EXEC() helps protect against SQL
injection.

Security is not a topic of this exam, but we’d still like to make an additional remark about this
example. Security should be a multi layered approach. The student application in this example should
probably not have been given permission to drop the table, but in real life, too many applications use
the permissions of database owner, or even worse, sysadmin, and thus, would have the permission to
drop a table. A proper design of the student application should not require dynamic SQL, and in
addition, the application should only be given the required permissions for the tasks it needs to
perform (and dropping a table would probably not be one such task).

So for both reasons of performance and security, it is better to use sp_executesql than EXEC() if you
do decide to use dynamic SQL. Now back to the previous example. The same code for our TOP
example using sp_executesql would be something like this:

DECLARE @sql nvarchar(1000)


,@top int

SET @sql = 'SELECT TOP (@top) FirstName, LastName, Address FROM dbo.Employees'
SET @top = 3

EXEC sp_executesql
@statement = @sql
,@parameters = N'@top as int'
,@top = @top

Either way, using sp_executesql or EXEC(), you would still need a method to make the number
dynamic, for example, by making a stored procedure that accepts @top as an input parameter, but this
basic example is enough to demonstrate the “dynamic” part of dynamic SQL.

We’ve deliberately chosen an example out of the archives, to demonstrate the following point: some
situations that used to require dynamic SQL in previous versions, no longer require using dynamic
SQL in newer versions. This is true for the variable for TOP. A few versions back, TOP was
improved to allow the direct use of variables:

SELECT TOP (@top) FirstName, LastName, Address


FROM dbo.Employees

This eliminates yet another scenario for dynamic SQL. The lesson here: if you can avoid dynamic
SQL, you probably should, but there are still situations (like database administrator tasks) where you
can make good use of dynamic SQL.

As promised, we’ll give you a real life example of dynamic SQL after we’ve covered our next topic:
cursors.

Cursors
In SQL Server, you want to manipulate data sets, not individual records, whenever possible. For
example: suppose you want to delete all records in a table. In that case, you want to write a single
SQL statement to delete all records, not deleting all records one by one until the table is empty. The
result will be the same, but deleting all records one by one will take a lot longer than deleting them
all at once.
There will, however, be situations where it is impossible to manipulate all records as a set, and you
will have to manipulate one record at a time. One such example would be when, for each record
individually, you need to call a stored procedure. For example: the WideWorldImporters database has
a special stored procedure for changing user passwords, but it does so only for one user at a time. In
situations like these, you can use a cursor.
The following are the required steps for using a cursor:
* Declare the cursor by specifying a select statement. The result set of that select statement is the set
of records that the cursor will go through one by one.
* Open the cursor.
* Fetch the first record from the result set.
* Perform some operation(s) on that record.
* Repeat fetching the next record and performing the operation, until all the work is done.
* Close the cursor.
* Deallocate the cursor (to free up all resources).

We’ll start with a very simple example. The operation we’ll perform will be to select the full name
and user preference of the first 50 records of the People table in the WideWorldImporters database.
This is not something you’d use a cursor for, but we’ll get to a more realistic example later on. To
perform the operation, we need to declare two variables, and for each record we fetch, store the
values for the full name and user preference in these variables. All the rest of the code is needed for
the cursor.
By the way: we need to limit the number of records (using TOP), otherwise SQL Server Management
Studio 2016 will throw a system out of memory error. This is an error in SSMS, not in SQL Server.
Another way to avoid this error is to install the most recent version of SSMS 2017.

USE WideWorldImporters

DECLARE crsLoop CURSOR FOR


SELECT TOP (50) FullName, UserPreferences
FROM [Application].[People]
ORDER BY FullName

DECLARE @FullName nvarchar(50)


,@UserPreferences nvarchar(max)

OPEN crsLoop

FETCH NEXT FROM crsLoop INTO @FullName, @UserPreferences

WHILE @@FETCH_STATUS = 0
BEGIN
SELECT @FullName AS [Full Name], @UserPreferences AS [User Preferences]
FETCH NEXT FROM crsLoop INTO @FullName, @UserPreferences
END

CLOSE crsLoop
DEALLOCATE crsLoop

The main part of the cursor is the loop between BEGIN and END. In this case a very simple SELECT
statement, but it could just as well be multiple SQL statements. This loop will be performed as long
as @@FETCH_STATUS = 0, meaning: as long as the previous fetch successfully fetched a record. So
always remember to end the loop by fetching the next record; otherwise, the loop will never fetch a
new record, will keep processing the first record, and will therefore run indefinitely.

Cursors support some options to allow faster processing. As mentioned, cursors often take longer
than set operations, so there are some options to limit the overall load on the system as a whole, as
well as the impact on other statements on the same data (concurrency). As locking and concurrency
are not topics for this exam, we won’t cover those cursor options here. Should you ever have to write
a cursor that runs for a significant amount of time, and you need to know the impact on other
statements running at the same time, we highly recommend going through the documentation on
Microsoft Docs.

The actual operation, between BEGIN and END, is where the fun stuff happens. In the first example,
we’ve covered a simple select statement. We haven’t covered updating and deleting data yet, so we’ll
get back to this later, but for now, we’d just like to mention that, in order to update or delete the
record at the current position of the cursor, you can use a special WHERE clause:

WHERE CURRENT OF crsLoop

This will save you from having to write an additional way of identifying this record (which may be
easy, if the table has a primary key, but in any way, will require more code). This WHERE
CURRENT OF clause won’t work with SELECT statements, though; therefore, we needed to declare
the variables @FullName and @UserPreferences to store those values.

As promised, we’ll give an example combining cursors and dynamic SQL.

Imagine a situation where you need to kill all connections to a database, for instance, to restore a
backup over the existing database. In order to restore a database over an existing database, there can
be no current connections to the database, so if there are any, you need to kill those connections in
order to be able to perform the restore. For completeness sake, we’d like to mention that there is an
easier way of killing all connections than the code we are about to explain, namely setting the
database to single user before the restore (and setting it back to multi user afterwards). Without the
full restore statement, that would look something like this:

USE [master]
ALTER DATABASE [TestDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
RESTORE DATABASE [TestDB] ...
ALTER DATABASE [TestDB] SET MULTI_USER

GO

Killing each connection using a cursor and dynamic SQL might not be the easiest way to kill all
connections, but it is a great way to demonstrate the use of both dynamic SQL and cursors.
First, we’ll demonstrate the KILL command. Open a new query window and execute the following
query:

SELECT @@SPID
The result will be the process ID of your connection; let’s say it is 58. Now, in the same window, kill
that connection using the following command (substitute your own process ID):

KILL 58

Oops, that won’t work; SQL will inform you that you can’t kill your own process. There are other
things to take into consideration before you start killing processes:
* don’t try to kill system processes. Any process with a process ID of 50 or less is a system process.
* when you kill a process, that process will be terminated immediately unless there is work by that
process that needs to be undone. For example, let’s say there is a process that tries to delete all
records in a large table, and it is doing this in a single transaction. A transaction is a single action that
has to either succeed or fail in its entirety. That means that when you kill the process before it is done,
all work will be undone (rolled back). So a delete statement that has been running for hours might
take hours more to rollback.

Let’s try once more to kill our connection (we know it is not a system process, and since it has no
work to undo, we can safely kill this connection). Open another query window and execute the same
KILL command. Now, it will succeed. For our script to kill all connections to a database, we first
need to know all processes connected to a database. We can accomplish this by querying the system
tables sysprocesses in the master database, and using the DB_ID() function we saw earlier:

SELECT *
FROM master..sysprocesses
WHERE dbid= DB_ID('TestDB')

In order to make this a real world example, we need to account for the possibility of a single process
having multiple records in this sysprocesses table (as will happen when parts of a query are executed
in parallel). We can eliminate duplicates using either DISTINCT or GROUP BY. We’ll cover GROUP
BY in more detail later on; for now, just know that it can do a lot more than eliminating duplicates.

SELECT spid
FROM master..sysprocesses
WHERE dbid= DB_ID('TestDB')
GROUP BY spid

SELECT DISTINCT spid
FROM master..sysprocesses
WHERE dbid = DB_ID('TestDB')

If you execute this query, you can see all process IDs connected to database TestDB (make some
connections to this database to follow along). Using string manipulation we saw earlier, we can now
create a set of KILL commands by changing the SELECT statement to:

SELECT 'KILL ' + CAST(spid AS VARCHAR(10))

This will be the result set we will use for our cursor operation. As before, we need to store the
outcome of the select statement in a variable (we’ll call it @SQL), loop through all records of the
result set one by one, and use the variable @SQL as the statement for sp_executesql. You should now
be able to put all the pieces together, and come up with code similar to this:

DECLARE @SQL NVARCHAR(100);

DECLARE crsKill CURSOR LOCAL FAST_FORWARD
FOR SELECT 'KILL ' + CAST(spid AS VARCHAR(10))
FROM master.dbo.sysprocesses
WHERE dbid = DB_ID('TestDB')
GROUP BY spid;

OPEN crsKill;

FETCH crsKill INTO @SQL;

WHILE @@FETCH_STATUS = 0
BEGIN
EXEC sp_executesql @SQL;

FETCH crsKill INTO @SQL;
END;

CLOSE crsKill;
DEALLOCATE crsKill;

“LOCAL” and “FAST_FORWARD” are two of the cursor options we suggested you’d look into,
should you ever need to write a cursor for production purposes, but the code will work fine if you
omit these options.
As long as you didn’t run this code from a connection to TestDB itself (remember, you can’t kill your own connection), we’ve now achieved our objective: killing all connections to a database.

Because SQL Server is written to handle complete result sets in a single action, most of the time, a set
based approach will be faster than operating on a data set one record at a time; therefore, the
processing of data by a cursor is often described as RBAR: row by agonizing row. On forums, you
might even read comments that “cursors are evil”. This might be a slight exaggeration. A cursor is just
another tool in the arsenal of a database developer, and should be used appropriately. If a set based
approach can be used, that will often be faster than a cursor, but there will be situations where a set
based approach isn’t possible.

We’ve now covered all four methods for inserting data: INSERT … VALUES, INSERT … SELECT, SELECT INTO and INSERT EXEC. Along the way, we’ve also explained cursors and dynamic SQL, SQL injection, the TOP and DISTINCT keywords, and given you a glimpse of stored procedures and grouping functions.

Update

In order to change values in a record, the UPDATE statement is used. In its simplest form, an
UPDATE statement looks like this:
UPDATE table
SET column = value
WHERE some sort of filter

For example, to change the last name of Bo in our Employees table, you’d use the following code:

UPDATE dbo.Employees
SET LastName = 'Didley'
WHERE FirstName = 'Bo'

The WHERE clause here is very important: it is used to filter which records to update. If you omit, or
forget, the WHERE clause, every record in the table will be updated. Tip: always use the same
WHERE clause in a SELECT statement to check which records will be updated before executing an
UPDATE statement.
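
As a minimal sketch of that tip, using the Employees example from earlier (this assumes the dbo.Employees table and the record for Bo from the previous sections):

-- First, check which records the filter matches...
SELECT EmployeeID, FirstName, LastName
FROM dbo.Employees
WHERE FirstName = 'Bo'

-- ...then run the update with the exact same WHERE clause
UPDATE dbo.Employees
SET LastName = 'Didley'
WHERE FirstName = 'Bo'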

We’ll start at the top, at the UPDATE clause, and work our way down from there.

The UPDATE clause has very few options. In our basic example, we simply listed the name of the table whose records we were going to update. This can only be one table; even if you use a join statement in order to filter rows, only rows from one table can be updated at a time.
As an alternative to updating a table, you could also update a table variable or a view; this works the
same as updating a table.
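
As a minimal sketch of updating a table variable (a hypothetical, self-contained example, not tied to the tables we’ve used so far):

DECLARE @t TABLE (id int, SomeText varchar(100));

INSERT @t VALUES (1, 'before'), (2, 'before');

UPDATE @t
SET SomeText = 'after'
WHERE id = 1;

SELECT * FROM @t;
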
One particularly interesting option in the UPDATE clause is the option to use TOP. As we saw in the
SELECT statement, this limits the number of records that will be updated. And again, as in the
SELECT statement, the records will be arbitrarily chosen unless you also supply an ORDER BY
clause. The funny thing, however, is that the UPDATE statement does not support an ORDER BY
clause. The following statement will result in an “incorrect syntax” error:

UPDATE TOP (1) dbo.Employees
SET LastName = 'Didley'
WHERE FirstName = 'Bo'
ORDER BY EmployeeID

It will work fine if you omit the ORDER BY clause; however, in that case, if more than one record matches the filter, you have no way of knowing which record will be updated. In more technical terms: this statement would be non-deterministic.
So unless this random behaviour is what you want, the solution is to add a subselect statement, add
the TOP clause plus an ORDER BY clause to that subselect statement, and then join the table you’re
going to update to that subselect statement. Such a statement would look like this:

UPDATE dbo.Employees
SET LastName = 'Didley'
FROM ( SELECT TOP (1) EmployeeID
FROM dbo.Employees
WHERE FirstName = 'Bo'
ORDER BY EmployeeID) AS e
WHERE e.EmployeeID = Employees.EmployeeID

In order to check this, add a few more employees with a first name of Bo to the table, and run the
UPDATE statements with the TOP clause with and without the subselect statement.

The SET clause is where the actual change is made. Here, you can supply one or more column names,
and assign each of them a new value. You can assign a new value by:
* assigning a new value for the column directly;
* setting the column to NULL or DEFAULT;
* assigning a value for the column based on an expression (for instance, based on the old value).

The following statement will update two columns, setting the address to the default, and trimming
leading and trailing spaces (if any) from the last name for all employees with first name Bo:

UPDATE dbo.Employees
SET LastName = LTRIM(RTRIM(LastName))
,Address = DEFAULT
WHERE FirstName = 'Bo'

When using expressions, you’re not limited to the old value of the column you’re updating; you can
also use the value from other columns of the same record:

UPDATE dbo.Employees
SET LastName = LTRIM(RTRIM(LastName))
,Address = 'Home of mr. ' + FirstName + ' ' + LastName
WHERE FirstName = 'Bo'

In this example, we’re changing both the values for last name (removing spaces using LTRIM and
RTRIM) and setting the value for address to an expression based on the old value for last name
(before removing the spaces).

In the examples above, we’ve used string data. Here’s an example with number data. Let’s add a
column and set a default (we’ll get back to changing the structure of a table in a later requirement, but
this code is pretty self-explanatory):

ALTER TABLE dbo.Employees ADD Salary decimal(18,2) DEFAULT 1000.00

UPDATE dbo.Employees
SET Salary = DEFAULT

Now, here are two statements to add 100 dollars to the salary of Bo:

UPDATE dbo.Employees
SET Salary = Salary + 100
WHERE FirstName = 'Bo'

UPDATE dbo.Employees
SET Salary += 100
WHERE FirstName = 'Bo'

The latter way of writing is called a compound assignment operator; use whatever syntax you feel
more comfortable with, but you should understand them both. There are also compound assignment
operators for subtracting, multiplying and dividing, as well as for the modulo and bitwise comparison
operators.
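
For illustration, here is a minimal sketch of two of the other compound assignment operators, using the Salary column we just added (the amounts are arbitrary):

-- subtract 50 from the salary
UPDATE dbo.Employees
SET Salary -= 50
WHERE FirstName = 'Bo'

-- multiply the salary by 1.1 (a 10% raise)
UPDATE dbo.Employees
SET Salary *= 1.1
WHERE FirstName = 'Bo'
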
We haven’t covered the bitwise comparison operators earlier, so let’s do that now (if you do not
know how binary logic works, this explanation may not make sense to you, but explaining binary
logic is way out of scope for this book).
There are three bitwise comparison operators: AND ( & ), OR ( | ) and exclusive OR ( ^ ). The
easiest way to explain these bitwise comparison operators is with examples where all of them work
on two input bits, and will return one bit. Remember: a bit is either 0 or 1.
* The AND function will return a 1 if both input bits are 1, otherwise it will return zero.
* The OR function will return a 1 if at least one of the input bits is a 1; if both of the input bits are 0, it
will return a 0.
* The exclusive OR (XOR) will return a 1 if one, and only one, of the input bits is 1; if both, or neither, of the input bits are 1, it will return a 0. This is what that looks like:
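
bit 1   bit 2   AND (&)   OR (|)   XOR (^)
  0       0        0         0        0
  0       1        0         1        1
  1       0        0         1        1
  1       1        1         1        0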

In reality, however, the bitwise comparison operators do not only work on bits; they will work on any
two whole numbers (e.g. bit, tinyint, smallint, int and bigint, but do check out Microsoft Docs for all
supported combinations), and will also return a whole number. Both inputs will be converted to a
binary number, after which the operation demonstrated above will be done on each bit (comparing the
rightmost bit of the first binary number to the rightmost bit of the second binary number, then the bit next to that, working all the way to the leftmost bit); the resulting binary number will be converted to
a whole number, and returned.
Let’s work out these operations for numbers 5 and 19. Turning them to binary results in 0101 and
10011. Notice that the first binary number has only 4 digits, while the second has 5 binary digits. As
you can check for yourself, in order to compare the 5th digit (counting from the right), the first number
will be padded with a zero. Working right to left, the AND comparison will return the binary number
00001 (decimal 1); the OR comparison will return 10111 (decimal 23); and the XOR comparison
will return 10110 (decimal 22). You can check this for yourself:

SELECT 5 & 19, 5 | 19, 5 ^ 19

We’ve now covered the SET clause. When discussing the TOP clause, we already demonstrated the
use of an optional FROM clause. This FROM clause is also used when, in order to update a table,
you filter by joining on a second table. For an example of updating records based on a join, let’s use the
tables [Sales].[Customers], [Sales].[Orders] and [Sales].[Orderlines] in the WideWorldImporters
database. If you want to make sure you do not alter the original tables, you can easily make copies of
these tables, using the SELECT…INTO statement we covered earlier, but we’ll provide the code to
undo the change as well. An alternative method to achieve this is to start a transaction, and instead of
committing the transaction, roll it back. But we’ll get to that in Chapter 3 when we talk about error
handling.

The table [Sales].[Customers] contains a column StandardDiscountPercentage. In our version of the WideWorldImporters database, the value for this percentage is 0 for every customer. Let’s say you want to set this to 5% for every customer that has ordered something after January 1st, 2016. The following query would achieve this:

UPDATE c
SET c.StandardDiscountPercentage = 5
FROM sales.customers c
INNER JOIN sales.Orders o on c.CustomerID = o.CustomerID
WHERE o.OrderDate > '2016-01-01'

To change this table back to its original state:

UPDATE Sales.Customers
SET StandardDiscountPercentage = 0
WHERE StandardDiscountPercentage = 5

But just like the case with the TOP statement, you have to make sure that the update is unambiguous,
otherwise, your update statement will become nondeterministic. A clear example of that would be an
update to the comment field of the Orders table based on the description field in the OrderLines table.
In our version of the WideWorldImporters database, the value for this comment is NULL for every
order. You could change the comment of an order to the description of an order line with the
following code:

UPDATE o
SET comments = ol.Description
FROM sales.orders o
INNER JOIN sales.orderlines ol ON o.orderid = ol.orderid

But as you may have guessed, an order can have multiple order lines (e.g. order 46 has two order lines), and therefore, multiple descriptions. This code gives SQL no instruction as to which order line to choose for a particular order, so it will just pick one (not randomly; SQL will pick the one that is easiest at the time, so different runs of the same query may have different results). Be aware of this when joining tables for an update statement: if the record in the table you’re going to update has multiple corresponding records in the table you’re joining it to, and you’re setting the value of a column of this record to a value of a column in the other table, you have no way of knowing which of the multiple matching records will be chosen by SQL.

To change this table back to its original state:

UPDATE Sales.Orders
SET comments = NULL

The last clause of the UPDATE statement is the WHERE clause. As in the SELECT statement, this
WHERE clause is used to filter which records will be updated. There is only one important thing
different about the WHERE clause of an UPDATE statement: when the UPDATE is part of the loop of
a cursor, you can refer to the row at the current position of the cursor using the syntax WHERE
CURRENT OF (followed by the name of the cursor). For example, using the example of the orders
table, the following cursor will update the column comments of every order to include the order id:

DECLARE crsOrders CURSOR FOR
SELECT orderid
FROM Sales.orders

DECLARE @orderid int

OPEN crsOrders

FETCH NEXT FROM crsOrders INTO @orderid

WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE Sales.Orders
SET Comments = 'This is order ' + CAST(@orderid AS VARCHAR(10))
WHERE CURRENT OF crsOrders

FETCH NEXT FROM crsOrders INTO @orderid
END

CLOSE crsOrders
DEALLOCATE crsOrders

Please verify that, after the update, the comment corresponds to the order id. By the way: the use of the cursor here is just to demonstrate the syntax of WHERE CURRENT OF; the actual update would have been better done with a (much faster) set based approach:

UPDATE Sales.Orders
SET Comments = 'This is order ' + CAST(OrderId as varchar(10))

And once again, to change this table back to its original state:

UPDATE Sales.Orders
SET comments = NULL

The last thing we want to talk about is a way to capture the new value for the column, after the update, for further processing. In later requirements, we’ll see more elaborate use of this, when we
talk about the MERGE statement and the OUTPUT statement. The MERGE and OUTPUT statements allow you to capture the value before the update, as well as the values of other columns from the
record(s) you’re updating. For now, we’ll stick to just the after value of the column you’re actually
updating. For this, we use a variable.
Earlier on, we saw how to select a value and assign that value to a variable. The SELECT clause for
this assignment was:

SELECT @FullName = FullName

Assigning the new value of an UPDATE statement works in a very similar way. Just change SET into
“SET @variable =”, like this:

DECLARE @Comments nvarchar(max)

UPDATE Sales.Orders
SET @Comments = Comments = 'This is order ' + CAST (OrderId as varchar(10))
WHERE OrderId = 1

SELECT @Comments

Note the double assignment in the SET clause. You assign both a new value to the column, and to the
variable. Like in the SELECT example, you have to make sure only one record is updated; otherwise,
SQL will store the last value in the variable, and you’ll have no way of knowing which record will
be updated last. If you remove the WHERE clause in the example above, the last record will
probably be the record with the highest OrderId, but this is not guaranteed, so don’t count on this.

This was the UPDATE statement. We’ve seen how to change the values for columns in a table, how to
filter which records are updated using the WHERE clause, the possible dangers of using
nondeterministic statements (in the cases of TOP, joins and variable assignments), and updating
through a cursor. Along the way, we’ve also covered the bitwise operators. We’ll get back to
updating in later requirements, when we cover the MERGE and OUTPUT statements. Up next, we’ll
start removing data from a table.

Delete
The DELETE statement is used to remove records from a table. In its simplest form, the syntax is:
DELETE
FROM table
WHERE some sort of filter

There is not much to discuss for the DELETE statement. There are some options that need to be
mentioned, but these options work the same as in the UPDATE statement:
* the TOP clause can be used, but ORDER BY is not supported, so in order to have control over which records are deleted, you need the ORDER BY in a subselect statement (see the sketch after this list);
* deletes can be based on joins;
* the WHERE clause is used to filter records;
* in a cursor loop, you can delete the current record using the WHERE CURRENT OF syntax.
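
As a minimal sketch of the first point, mirroring the earlier UPDATE example (this assumes the dbo.Employees table with more than one employee named Bo):

DELETE e
FROM dbo.Employees AS e
INNER JOIN ( SELECT TOP (1) EmployeeID
             FROM dbo.Employees
             WHERE FirstName = 'Bo'
             ORDER BY EmployeeID) AS t
    ON t.EmployeeID = e.EmployeeID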

Deleting records based on a join requires an example, because the syntax looks strange, as it contains
FROM twice. We’ll use an example similar to the UPDATE on a join statement. We’ll delete all
records of customers who ordered anything before January 1st, 2010. If you were to do this in real life, you probably wouldn’t check whether they had bought anything before some date, but whether their most recent order was before some date long ago, so this is not a real life example. But fortunately, this won’t delete
any records, as you can see for yourself:

SELECT *
FROM sales.customers c
INNER JOIN sales.Orders o on c.CustomerID = o.CustomerID
WHERE o.OrderDate < '2010-01-01'

This will return zero records. Based on this statement, we can construct the DELETE statement:

DELETE
FROM c
FROM sales.customers c
INNER JOIN sales.Orders o on c.CustomerID = o.CustomerID
WHERE o.OrderDate < '2010-01-01'

Always remember the WHERE clause; without it, all records will be deleted. If that is your intention,
under certain circumstances, you can use the alternative TRUNCATE statement, as in the following
example:

TRUNCATE TABLE dbo.Employees_backup

The TRUNCATE statement uses less transaction log space, and is usually faster than a DELETE of all
records. However, there are some limitations, such as the following:
* TRUNCATE requires more permissions than DELETE;
* you can’t perform a TRUNCATE on a table that is being referenced by a foreign key constraint, or is being replicated (the sketch below demonstrates the foreign key restriction).
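
To see the foreign key restriction in action, here is a minimal sketch with two hypothetical tables (the table names are made up for this example):

CREATE TABLE dbo.Parent (ParentID int PRIMARY KEY);
CREATE TABLE dbo.Child (ChildID int PRIMARY KEY, ParentID int REFERENCES dbo.Parent(ParentID));

-- This fails, even though both tables are empty: Parent is referenced by a foreign key
TRUNCATE TABLE dbo.Parent;

-- A DELETE of all records is still allowed
DELETE FROM dbo.Parent;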

Earlier on, when discussing cursors, we explained that SQL is optimized for set based operations,
and that it is therefore usually best to perform all actions in one set based operation instead of
splitting an operation into multiple steps. When deleting large amounts of data, you can easily run into
an exception to that rule. Deleting large amounts of data can either fill up the transaction log, causing
the delete statement to abort and rollback, or block other operations for extended periods of time.
This is a common scenario when a scheduled cleanup job didn’t run in a long time.
The solution is to split this large delete into smaller pieces. The following code runs a loop that deletes a batch of 100,000 records, waits ten seconds (to allow other processes to continue) and continues as long as records are being deleted.

DECLARE @rowcount int;

SET @rowcount = 1;

WHILE @rowcount > 0
BEGIN
DELETE TOP (100000)
FROM dbo.MyVeryLargeTable;

SET @rowcount = @@ROWCOUNT;

WAITFOR DELAY '00:00:10';
END;

In a real world scenario, you would have to tweak the number of records and the delay time, and
usually add a WHERE clause. As an alternative for the wait, you could perform a transaction log
backup. And by the way: in this case, we used the TOP clause without the ORDER BY, as we will
keep on running the loop until all records are deleted, so it does not matter that we can’t control
which records are deleted first.

Determine which statements can be used to load data to a table based on its structure and constraints

Basically, the last requirement showed you how to insert data; this requirement is about the
constraints that prevent you from successfully loading data. We’ve already covered these, so we’ll
just repeat them here:

* the data types of the column and the value you’re trying to insert should be compatible;
* if you want to insert an explicit value into a column with an identity property, you need to set IDENTITY_INSERT on (see the sketch after this list);
* you must provide a value for every column, unless it has a default value, is nullable or has an identity property;
* you can only insert a value into a primary key column if that value does not already exist in the table;
* you can only insert a value into a column with a foreign key constraint if that value exists in the column that is being referenced by the foreign key.
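
The following minimal sketch illustrates the identity rule with a hypothetical table (the table and values are made up for this example):

CREATE TABLE dbo.IdentityTest (ID int IDENTITY PRIMARY KEY, SomeText varchar(100) NULL);

-- The identity value is generated automatically
INSERT dbo.IdentityTest (SomeText) VALUES ('normal insert');

-- Supplying an explicit value for the identity column would fail...
-- INSERT dbo.IdentityTest (ID, SomeText) VALUES (10, 'explicit ID');

-- ...unless you switch IDENTITY_INSERT on (and supply an explicit column list)
SET IDENTITY_INSERT dbo.IdentityTest ON;
INSERT dbo.IdentityTest (ID, SomeText) VALUES (10, 'explicit ID');
SET IDENTITY_INSERT dbo.IdentityTest OFF;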

Construct Data Manipulation Language (DML) statements using the OUTPUT statement
When discussing the update statement, we demonstrated how to capture the new value for a column
into a variable. Here, we’ll discuss how to capture the old value for a column (before the update), or
any column for the record being updated, either before or after the update. For this, we use the
OUTPUT clause.
After discussing the OUTPUT clause, we’ll move on to the MERGE statement. As the MERGE statement uses the OUTPUT clause slightly differently than the INSERT, UPDATE and DELETE statements, we have to cover the MERGE statement last. The MERGE statement is intended for
scenarios where you want to update a value for a record if the record already exists, otherwise, insert
a new record.

Output
Before the OUTPUT clause was introduced in SQL 2005, this same functionality required two
actions: a SELECT and an UPDATE. That meant more code and more SQL transactions. Also, using
the OUTPUT clause instead of two separate statements ensures that no other transaction can change the
data between the first and second statement. This makes the OUTPUT clause very useful.
The values before the update for records being updated can be referenced using the keyword
DELETED; the values after the update for records being updated can be referenced using the keyword
INSERTED. We’ll demonstrate that INSERTED and DELETED act like temporary, hidden tables for
the UPDATE statement. The INSERT and DELETE statements also support an OUTPUT clause; for
obvious reasons, the INSERT statement only has the INSERTED table, and the DELETE statement
only has the DELETED table.

These hidden tables only exist in the same statement as the insert, update or delete you’re performing.
You can access a column of these tables using DELETED.<column_name>, or INSERTED.
<column_name>. Instead of using the column name, you can also use the * to reference all columns.
Remember that, when discussing the SELECT statement, we said using the asterisk is not best
practice, but we’ll use it here to keep our examples concise.

In its most simple form, this is how you use the OUTPUT clause. We’ll create a table, insert two
records, update the records and, in the same statement, return the values before & after the update.

DROP TABLE IF EXISTS output_test

CREATE TABLE output_test (id int, SomeText varchar(100))

INSERT output_test (id, SomeText) VALUES (1, 'before')
INSERT output_test (id, SomeText) VALUES (2, 'before')

UPDATE output_test
SET SomeText = 'after'
OUTPUT DELETED.*, INSERTED.*

DROP TABLE IF EXISTS output_test

To confirm that you can also use the OUTPUT clause on an INSERT statement, you can use the following code (re-create the output_test table first, since the previous example ended by dropping it):

INSERT output_test (id, SomeText) OUTPUT INSERTED.* VALUES (2, 'before')


And to reference a column by name, instead of using the asterisk:

INSERT output_test (id, SomeText) OUTPUT INSERTED.id VALUES (2, 'before')

Please test for yourself that the INSERT statement does not support the DELETED keyword, and that
this works the same way for a DELETE statement (which supports the DELETED keyword, but not
INSERTED).
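
For reference, a minimal sketch of the DELETE variant might look like this (assuming the output_test table exists and still contains a record with id 2):

DELETE output_test
OUTPUT DELETED.*
WHERE id = 2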

In the examples we’ve used above, the output is returned to the calling application. The alternative is to insert the output into another table. This can be achieved using the keyword INTO. Suppose we had a table called output_test_before, with a structure similar to the output_test table. To load the records before the update into this table, we’d change the OUTPUT clause to:

OUTPUT DELETED.* INTO output_test_before

This code pattern is great for auditing purposes. In that case, you might want to add, for example, a
timestamp. Here, we’ll add the date & time of the update into a column called ChangeTime:

OUTPUT DELETED.id, DELETED.SomeText, GETDATE() INTO output_test_before (id, Sometext, ChangeTime)

Notice the column listing after the name of the target table. If you omit this column listing, the order of
the columns in the statement has to match the column order of the target table.
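
Putting it all together, a minimal sketch of this audit pattern might look like this (assuming the output_test table from the earlier example exists and contains records; the audit table is created here just for illustration):

CREATE TABLE output_test_before (id int, SomeText varchar(100), ChangeTime datetime)

UPDATE output_test
SET SomeText = 'after'
OUTPUT DELETED.id, DELETED.SomeText, GETDATE() INTO output_test_before (id, SomeText, ChangeTime)

SELECT * FROM output_test_before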

There are some restrictions on the target table that you need to keep in mind: the target table cannot have a trigger defined on it, cannot be part of a foreign key relationship, and cannot have CHECK constraints.

Merge
Up next is the MERGE statement. As we said, we use the MERGE statement when we want to update
a value for a record if the record already exists, otherwise we want to add the record. This is a
common scenario when loading a data warehouse.

The MERGE statement works on two tables, called SOURCE and TARGET. The MERGE statement
uses a join predicate to match rows between the source and target table; following that, it defines
separate actions on:
* the records that are matched;
* the records in the source table that have no match in the target table;
* and the records in the target table that have no match in the source table.
So the structure looks something like this:

MERGE [tableA] AS TARGET
USING [tableB] AS SOURCE
ON TARGET.[columnA] = SOURCE.[columnB]
WHEN MATCHED THEN
...
WHEN NOT MATCHED BY TARGET THEN
...
WHEN NOT MATCHED BY SOURCE THEN
...;

The MERGE statement also exists in the SQL implementations of other relational database systems (where it is sometimes called an UPSERT), but usually without the possibility to act on records that have no match in the source table. The ANSI SQL standard lacks this possibility as well, and this is reflected in the Transact-SQL syntax. When referring to the records present in the source but absent in the target, you can use either WHEN NOT MATCHED or WHEN NOT MATCHED BY TARGET; when referring to the records present in the target but absent in the source, you have to use WHEN NOT MATCHED BY SOURCE (you can’t omit “BY SOURCE”). In our examples, for purposes of clarity, we’ll use the longer syntax: even in statements without an action for records with no match in the source table, we’ll write WHEN NOT MATCHED BY TARGET for actions on the records without a match in the target table.

Another note: a MERGE statement must be terminated with a semi-colon.

For our first example, we’ll create two tables for a monitoring system. Imagine a monitoring program that monitors free drive space. In it, you have a table with a threshold for each drive, the intention being that if a certain drive has less free space than the threshold, an alert will be created by the system. The target table already contains a record with a threshold for a number of drives, and we’ll add or update a record for drive Q.

In this first example, we’ll create a source table separately. Later on, we’ll build on this example by:
* using a select statement as a source table in the MERGE statement;
* adding an action for non-matched records in the target table;
* adding an output clause;
* adding an additional search condition.

First, let’s look at how this would work without the MERGE statement.

DROP TABLE IF EXISTS Params_target

CREATE TABLE Params_target (drive CHAR(1), threshold_in_GB tinyint)

INSERT Params_target VALUES ('C',10), ('D',10), ('E',10), ('Q',10)

DECLARE @drive CHAR(1) = 'Q'
       ,@threshold_in_GB tinyint = 1

IF EXISTS (SELECT * FROM Params_target WHERE drive = @drive)
BEGIN
UPDATE Params_target
SET threshold_in_GB = @threshold_in_GB
WHERE drive = @drive
END
ELSE
BEGIN
INSERT Params_target VALUES(@drive, @threshold_in_GB)
END

Please verify for yourself that, with or without a record for drive Q in the target table, the outcome is
the same: after this statement, the target table will contain a record for drive Q, with a threshold of 1
GB.

SELECT *
FROM Params_target
ORDER BY drive

The same can be achieved with the MERGE statement.

--setup
DROP TABLE IF EXISTS Params_target
DROP TABLE IF EXISTS Params_source

CREATE TABLE Params_target (drive CHAR(1), threshold_in_GB tinyint)

INSERT Params_target VALUES ('C',10), ('D',10), ('E',10), ('Q',10)

CREATE TABLE Params_source (drive CHAR(1), threshold_in_GB tinyint)

INSERT Params_source VALUES('Q',1)

--actual MERGE statement
MERGE Params_target AS TARGET
USING Params_source AS SOURCE
ON TARGET.drive = SOURCE.drive
WHEN MATCHED THEN
UPDATE SET threshold_in_GB = SOURCE.threshold_in_GB
WHEN NOT MATCHED BY TARGET THEN
INSERT (drive, threshold_in_GB) VALUES (SOURCE.drive, SOURCE.threshold_in_GB);

As stated, the example could have been achieved with just a single table (the target table). We only used the source table for two reasons: first, the resulting MERGE statement is easier to read, and second, MERGE statements are often used for the process of loading data warehouses from staging tables. But instead of creating a source table and inserting a record into it, we can use a select statement as the source as well:

USING (SELECT 'Q' AS drive,1 as threshold_in_GB) AS SOURCE
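
Putting that line in context, the full statement might look like this (a sketch, assuming the Params_target table from the setup above):

MERGE Params_target AS TARGET
USING (SELECT 'Q' AS drive, 1 AS threshold_in_GB) AS SOURCE
ON TARGET.drive = SOURCE.drive
WHEN MATCHED THEN
UPDATE SET threshold_in_GB = SOURCE.threshold_in_GB
WHEN NOT MATCHED BY TARGET THEN
INSERT (drive, threshold_in_GB) VALUES (SOURCE.drive, SOURCE.threshold_in_GB);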

In this example, we don’t have use for an action on the records with no match in the source table, but
we can easily add such an action:

WHEN NOT MATCHED BY SOURCE THEN
DELETE

As the merge is performed on the target table, there is no need to specify from which table. And just a
reminder: “BY SOURCE” is mandatory, whereas “BY TARGET” is optional.

The MERGE statement also supports the OUTPUT clause. You can’t add an OUTPUT clause to the
individual actions, just to the MERGE statement as a whole. This behaves exactly like in the INSERT,
UPDATE and DELETE statements, with one addition: $action. Like this:

OUTPUT $action, deleted.threshold_in_GB, inserted.threshold_in_GB;

For each record in the output, $action will be replaced by the action taken: INSERT, DELETE or
UPDATE. This is very useful for logging purposes. This can also be used to demonstrate the next item
on the list: adding an additional search condition.
As mentioned, the MERGE statement is often used in scenarios where a data warehouse is loaded
with additional data. As data warehouses can become pretty large, efficiency becomes very
important, so you don’t want to do unnecessary updates, such as setting a value to the same value it already has. To avoid that, you can add an additional search condition to the MATCHED
clause. If you change that line to:

WHEN MATCHED AND SOURCE.threshold_in_GB <> TARGET.threshold_in_GB THEN

And add the OUTPUT clause, you can test both scenarios. If you merge a record with a threshold for drive Q with a value of 1 into a target table that already has that same record, you can see the difference: without the additional condition, the record is updated (setting 1 to 1); with the additional condition, no action is taken. And while the end result in the table will be exactly the same, the latter way requires fewer resources (e.g. IO, time and transaction log space).

A final note on MERGE. If you choose to use the function @@ROWCOUNT after the MERGE
statement, it will return the total number of records affected from all branches (matched, not matched
by target & not matched by source). Please test this for yourself. As stated before: this exam requires
you to have a lot of experience writing code, and in order to follow along with this book, we expect
you to actually test the examples as we demonstrate them (and think of variations of your own). But as
the MERGE statement’s syntax can be somewhat confusing, we’ll provide the code here:

MERGE Params_target AS TARGET
USING Params_source AS SOURCE
ON TARGET.drive = SOURCE.drive
WHEN MATCHED AND SOURCE.threshold_in_GB <> TARGET.threshold_in_GB THEN
UPDATE SET threshold_in_GB = SOURCE.threshold_in_GB
WHEN NOT MATCHED BY TARGET THEN
INSERT (drive, threshold_in_GB) VALUES (SOURCE.drive, SOURCE.threshold_in_GB)
WHEN NOT MATCHED BY SOURCE THEN
DELETE
OUTPUT $action, deleted.threshold_in_GB, inserted.threshold_in_GB;

So to recap, the MERGE statement in SQL is used as an UPSERT: update a record if it already exists,
otherwise, insert it. It uses the SOURCE and TARGET table syntax to compare both tables, and the
WHEN MATCHED and two different WHEN NOT MATCHED branches to define actions.

Determine the results of Data Definition Language (DDL) statements on supplied tables and data

In this section, we’ll demonstrate how to add, modify or delete columns from existing tables. We’ll
cover the syntax and show some examples on how to do this.
The 70-761 exam does not (explicitly) require you to be able to create a table, so it might seem
strange that you should be able to modify the definition of a table. The big difference between
creating a new table, and modifying the definition of an existing table, is that an existing table may
already have data in it, and indexes and constraints defined on it; any change you make to the
definition of the table has to be compatible with the rest of the table definition, and the data in it. So
we’ll discuss how the existing structure of a table impacts the DDL changes you can make.
The third issue we’ll discuss is performance. Making changes to a big table can take a long time, during which the table will be locked, so there are some performance aspects to consider.

Add column
To add a column to a table, you supply (at a minimum) the name of the table, the name of the new
column, its data type, and whether or not it allows NULL values.

ALTER TABLE [dbo].[Employees] ADD Department varchar(100) NOT NULL

It is recommended, but not required, to explicitly state whether or not NULL values are allowed, and not rely on the default. In the DDL statement above, we want to add a column that does not allow NULL values. However, our Employees table still has three records from our last test, so this DDL statement is not allowed:

ALTER TABLE only allows columns to be added that can contain nulls, or have a DEFAULT definition specified, or the column being
added is an identity or timestamp column, or alternatively if none of the previous conditions are satisfied the table must be empty to allow
addition of this column. Column 'Department' cannot be added to non-empty table 'Employees' because it does not satisfy these
conditions.

SQL does not know what values to add to the column for the existing records, so unless the table is
empty, the column you’re going to add must either:
* allow NULL values;
* have a default;
* have an identity specification;
* or be a timestamp column (which we won’t cover).

This is the code to add a column that allows NULL values:

ALTER TABLE [dbo].[Employees] ADD Department varchar(100) NULL

This statement will add the column, with a value of NULL for all existing records. Every column
name in a table has to be unique, so before we demonstrate how to add a column with a default value,
we’ll need to drop this column (if it is already present):

ALTER TABLE [dbo].[Employees] DROP COLUMN Department

If we define a default value on the column, this will be used for the existing records. Earlier, we’ve
already seen how to add a default value to the definition of a column.

ALTER TABLE [dbo].[Employees] ADD Department varchar(100) NOT NULL DEFAULT 'Sales'

However, if we now try to drop the column with a default defined on it, we’ll get an error:

Msg 5074, Level 16, State 1, Line 12


The object 'DF__Employees__Depar__2F5453E4' is dependent on column 'Department'.
Msg 4922, Level 16, State 9, Line 12
ALTER TABLE DROP COLUMN Department failed because one or more objects access this column.

As the error message states, you can’t drop the column because an object still relies on it: the default constraint we created. This has a system generated and partially random name, in this case 'DF__Employees__Depar__2F5453E4'. So before we can actually drop the column, we have to drop the constraint:

ALTER TABLE [dbo].[Employees] DROP CONSTRAINT [DF__Employees__Depar__2F5453E4]

By the way: instead of letting the system generate a name for you, you can also define the name of the
constraint yourself. We’ll call the constraint DF_Department, using the following syntax:

ALTER TABLE [dbo].[Employees] ADD Department varchar(100) NOT NULL
CONSTRAINT DF_Department DEFAULT 'Sales'

By changing NOT NULL into NULL you add the column, with a NULL value for every existing
record, but with the default constraint (for new records). There is also a final option: if you want the
column to be nullable, but still have a default value for the existing records. This can be achieved by
specifying WITH VALUES:

ALTER TABLE [dbo].[Employees] ADD Department varchar(100) NULL
CONSTRAINT DF_Department DEFAULT 'Sales' WITH VALUES

Our table already has an identity column on it, and as a table only allows a single identity column, we
can’t add a second. But when we circumvent this problem, adding an identity column is pretty
straightforward, as the following example demonstrates. We’ll create a copy of the Employee table
(without the EmployeeID column), then add the identity column:

SELECT FirstName, LastName, Address, Salary
INTO Employees_test
FROM dbo.employees

ALTER TABLE [dbo].[Employees_test] ADD EmployeeID int IDENTITY

Alter Column
In order to change the definition of a column, the change has to be compatible with the data and
constraints on that column. You can’t:
* change the data type to a data type that is incompatible with the data already stored in that column in
existing records;
* change a column that is part of a primary key;
* change a column that is part of a foreign key;
* change a column that has a unique constraint (unless you only change the length, not the data type).

If this seems self-explanatory to you, just skip to the next section. If not, just follow along with these
demonstrations. There are some other limitations we won’t demonstrate:
* you can’t change a column from NULL to NOT NULL if there are NULL values currently stored in
the column;
* you can’t change a column that has a default constraint or a check constraint on it (unless you only
change the length, not the data type).

For the complete list of rules concerning altering the definition of a column, please consult Microsoft
Docs.

If you try to change the data type, all data in the column has to be converted to the new data type;
otherwise, altering the column fails. For example, the values for LastName can not be converted to an
integer, so the following will fail:

ALTER TABLE dbo.Employees ALTER COLUMN LastName int

Msg 245, Level 16, State 1, Line 14


Conversion failed when converting the varchar value 'Jackson' to data type int.
The statement has been terminated.

And you can’t change the length of the column LastName to 5, because some last names are longer
than 5 characters (in fact, all of them are, but it only takes one for the statement to fail):

ALTER TABLE dbo.Employees ALTER COLUMN LastName varchar(5)

Msg 8152, Level 16, State 14, Line 14


String or binary data would be truncated.
The statement has been terminated.

You can, however, change the length of the column to 50, because in this case, the data conversion will work:

ALTER TABLE dbo.Employees ALTER COLUMN LastName varchar(50)

Next, we’ll demonstrate that you can’t change the data type of a primary key column. We don’t have a
primary key column yet, so we’ll demonstrate that we can change the data type without the primary
key. The following will work:

ALTER TABLE dbo.Employees ALTER COLUMN EmployeeId tinyint

Now, change it back again to int, and add the primary key:

ALTER TABLE dbo.Employees ADD CONSTRAINT PK_Employeeid PRIMARY KEY (EmployeeId)

You can verify for yourself that after this, changing the data type of the column to tinyint will fail, because the column is defined as the primary key. The same thing applies when the column is defined as a foreign key. In order to demonstrate this, we will create a table to reference (Departments), and put a foreign key on the column Department that references the newly created table. We’ll also have to put a record in the Departments table with a value of Sales (remember that a foreign key on a column restricts the possible values in that column to the values present in the column in the other table that the foreign key references):

CREATE TABLE Departments (Department varchar(100) NOT NULL PRIMARY KEY)

INSERT Departments VALUES ('Sales')

ALTER TABLE dbo.Employees ADD CONSTRAINT FK_Department FOREIGN KEY (Department) REFERENCES
Departments(Department)

ALTER TABLE dbo.Employees ALTER COLUMN Department varchar(200)

The last statement will fail:

Msg 5074, Level 16, State 1, Line 26


The object 'FK_Department' is dependent on column 'Department'.
Msg 4922, Level 16, State 9, Line 26
ALTER TABLE ALTER COLUMN Department failed because one or more objects access this column.

The last limitation we’ll cover is the unique constraint. A unique constraint on a column ensures that
every value in the column is unique. There are two noticeable differences between a primary key and
a unique constraint:
* you can only have one primary key in a table, but multiple unique constraints;
* a primary key does not allow NULL values, a unique constraint allows 1 record with a NULL value.

Let’s demonstrate this. We’ll add a column called EmployeeNumber and update all records so that it
has a unique number in it. Then, we’ll add a record with a null value for EmployeeNumber (to
demonstrate that a unique constraint allows 1 record with a null value).

ALTER TABLE [dbo].[Employees] ADD EmployeeNumber int NULL
GO
UPDATE dbo.Employees SET EmployeeNumber = EmployeeID
ALTER TABLE [dbo].[Employees] ADD CONSTRAINT uq_employeenumber UNIQUE (EmployeeNumber)
GO
INSERT dbo.Employees (FirstName, LastName, Address, Salary, Department, EmployeeNumber)
VALUES('Lisa', 'Simpson','Springfield',1500.00,'Sales', NULL )

Now to demonstrate that dropping the column will fail, because of the unique constraint:

ALTER TABLE [dbo].[Employees] DROP COLUMN EmployeeNumber

Msg 5074, Level 16, State 1, Line 42


The object 'uq_employeenumber' is dependent on column 'EmployeeNumber'.
Msg 4922, Level 16, State 9, Line 42
ALTER TABLE DROP COLUMN EmployeeNumber failed because one or more objects access this column.

If you want, you can test this further and see that you can’t add a second record with a null value,
because you’ll receive this error:

Msg 2627, Level 14, State 1, Line 40


Violation of UNIQUE KEY constraint 'uq_employeenumber'. Cannot insert duplicate key in object 'dbo.Employees'. The duplicate key
value is (<NULL>).
The statement has been terminated.

If we drop the unique constraint first, we can drop the column:

ALTER TABLE [dbo].[Employees] DROP CONSTRAINT uq_employeenumber
ALTER TABLE [dbo].[Employees] DROP COLUMN EmployeeNumber

Another DDL action is changing the name of a column. In order to change the name of a column, we
use the system stored procedure sp_rename:

ALTER TABLE [dbo].[Employees] ADD EmployeeNumber int NULL
EXEC sp_rename 'Employees.EmployeeNumber', 'Employee_Number'
ALTER TABLE [dbo].[Employees] DROP COLUMN Employee_Number

Note: using sp_rename, you specify the old name of the column prefixed with the name of the table, and the new name without the name of the table. You obviously can’t move a column to a different table, so specifying the table in the new column name is redundant. In fact, if you do this, you’ll end up with a column name with the table name in it, like ‘Employees.Employee_Number’. Even if you do not make
this mistake, changing the name of a column is something to be careful with, and SQL Server will
warn you about this when you do:

Caution: Changing any part of an object name could break scripts and stored procedures.

We’ve now shown you how to alter the name, data type or the nullability of a column, and what
restrictions in the data or data definition apply.

Drop Column

We’ve already seen how to drop a column:


ALTER TABLE [dbo].[Employees] DROP COLUMN Department

And we’ve already seen that you can’t drop the column if a default constraint references the column;
you’ll have to drop the constraint first, before dropping the column. The same thing applies when the
column:
* has a primary key, foreign key, unique constraint or check constraint;
* has an index on it.

To demonstrate, after adding an index on the department column, you can’t drop the column (maybe
you’ll have to recreate the column first if you dropped it previously):

CREATE INDEX ix_department ON dbo.Employees(Department)

ALTER TABLE [dbo].[Employees] DROP COLUMN Department

This will fail:

The index 'ix_department' is dependent on column 'Department'.

We’re done here. Let’s drop the index:

DROP INDEX ix_department ON dbo.Employees

There is a long list of options for the ALTER TABLE statement that we won’t cover here. What we do want to discuss briefly are some of the options regarding performance. Adding a column to a big production table can literally take hours. If you need to make a change to the definition of a big table, here are some things to take into consideration:
* adding an empty, nullable column only changes the definition of the table, whereas adding a non-empty column has to actually change each record (therefore, taking much longer);
* many ALTER TABLE statements support the option WITH ONLINE = ON. By default, the ALTER
TABLE statement is done offline, blocking all access to the table. With ONLINE = ON, the alteration
is done on a copy of the table, after which the copy replaces the original table. This will allow some
other operations to continue while the alter table statement is running. Many restrictions apply, but it
is worth looking into.
* though tricky, you can disable foreign keys and check constraints temporarily, using WITH
NOCHECK. This is also a useful trick when importing a lot of data. The tricky part is that,
afterwards, you have to make sure you re-enable the check. Re-enabling the check will still force
SQL to validate the data against the foreign key or check constraints, but doing this separately can be
significantly faster than doing this while importing. You can (and should) re-enable the check using
WITH CHECK, or re-enable all checks on a table:

ALTER TABLE table WITH CHECK CHECK CONSTRAINT ALL
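
As a minimal sketch for a single constraint (assuming the FK_Department foreign key we created earlier in this chapter):

-- disable the foreign key before a large import
ALTER TABLE dbo.Employees NOCHECK CONSTRAINT FK_Department

-- ... perform the import here ...

-- re-enable the foreign key and validate the existing data against it
ALTER TABLE dbo.Employees WITH CHECK CHECK CONSTRAINT FK_Department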

And yes, the double CHECK is correct. Regarding performance when altering large tables: the most
important thing is to be aware of the possible impact of changes to your table, and thoroughly test
these changes before actually implementing them.

We’ve now demonstrated how to add, modify or drop a column. We’ve also demonstrated how these DDL statements can be blocked by existing data in the table, or by objects defined on the table (such as default constraints, primary keys, foreign keys and indexes). And we’ve briefly discussed some performance issues when altering large tables.
Summary

In this chapter, we’ve covered a lot of ground. We’ve shown how to insert data into a table using
INSERT, retrieve the data using SELECT, change the data in the table using UPDATE and remove
data from the table using DELETE and TRUNCATE. We’ve also explained how to combine the
results from more than one table, using joins and set operators. And we’ve seen a lot of functions
already present in SQL Server.
Questions
Like on the actual exam, not all information relevant to the question is covered in this book.
Unfortunately, we can’t cover all material you may receive questions on. But for most questions, if
you sufficiently understand the stated material in the exam objectives, and the material covered in this
book, you should be able to deduce the correct answer. And for the other questions: just eliminate the
wrong answers, and know that you don’t need to score 100% to pass the exam.

QUESTION 1
Which clauses of a SELECT statement are optional in order to retrieve rows for a table? Choose all
that apply.

A SELECT
B FROM
C WHERE
D ORDER BY

QUESTION 2
Which set operator will combine the result sets of two queries, without eliminating duplicates?
Choose all that apply.

A INTERSECT
B FULL OUTER JOIN
C UNION ALL
D COMBINE

QUESTION 3
The two following statements are logically equivalent:

SELECT *
FROM [MyTable]
WHERE Column1 = 1
OR Column1 = 2

SELECT *
FROM [MyTable]
WHERE Column1 = 1
EXCEPT
SELECT *
FROM [MyTable]
WHERE Column1 = 2

A TRUE
B FALSE
QUESTION 4
Consider the following pair of tables:

CREATE TABLE dbo.Customers


(
CustomerID tinyint NOT NULL IDENTITY PRIMARY KEY
,FirstName varchar(100) NOT NULL
,LastName varchar(100) NOT NULL
,[Address] varchar(100) NOT NULL
);

CREATE TABLE dbo.Orders


(
OrderID tinyint NOT NULL IDENTITY PRIMARY KEY
,CustomerID tinyint NOT NULL
,OrderDate datetime NOT NULL
,SalesAmount decimal(18,2) NOT NULL
);

You need to make a query to return the name, address and sales amount of all orders for New York
customers. Which of the following partial statements do you need to make the correct query?

A SELECT a.*, b.SalesAmount


B SELECT a.FirstName, a.LastName, a.Address, b.SalesAmount
C FROM Customers a
D FROM Orders a
E INNER JOIN Customers b
F INNER JOIN Orders b
G WHERE Address LIKE ‘%New York%’
H WHERE Address LIKE ‘%New York%’ AND a.CustomerID = b.CustomerID
I GROUP BY CustomerID
J GROUP BY Address

QUESTION 5
This question uses the same tables as the previous question.

Your task is to make a report containing the orders for all customers, including the customers without
orders. You plan to use the following T-SQL statement:

SELECT c.FirstName
, c.LastName
, c.Address
, o.SalesAmount
, o.OrderDate
FROM [dbo].[orders] o
LEFT OUTER JOIN [dbo].[customers] c ON c.CustomerID = o.CustomerID
ORDER BY c.CustomerID;

Will this statement create the correct report?

A TRUE
B FALSE

QUESTION 6
This question uses the same tables as the previous question.

Which SQL statement will retrieve the records for customers without an order? Choose all that apply.

A SELECT * FROM Customers


WHERE CustomerID NOT IN (SELECT CustomerID FROM Orders)
B SELECT * FROM Customers c
INNER JOIN Orders o ON o.CustomerID = c.CustomerID
WHERE o.CustomerID IS NULL
C SELECT * FROM Customers c
LEFT OUTER JOIN Orders o ON o.CustomerID = c.CustomerID
WHERE o.CustomerID IS NULL
D SELECT *
FROM Customers
EXCEPT
SELECT CustomerID
FROM Orders

QUESTION 7
This question uses the same tables as the previous question.

The following screenshot shows the partial result of a Transact-SQL statement. What type of join has
been used in this statement?

A LEFT FULL JOIN


B LEFT OUTER JOIN
C CROSS JOIN
D None. This result cannot be achieved with a JOIN statement.
QUESTION 8
Which two of the following string functions are equivalent?

DECLARE @text varchar(50) = 'SomeTextString'

A SELECT LEFT(@text,5)
B SELECT LEFT(5,@text)
C SELECT REVERSE(RIGHT(@text,5))
D SELECT REVERSE(RIGHT(REVERSE(@text),5))
E SELECT SUBSTRING(@text,LEN(@text),5)
F SELECT SUBSTRING(@text,0,5)

The next few questions will use the same tables and data for a fictional record store.

CREATE TABLE dbo.Medium (


MediumID tinyint IDENTITY PRIMARY KEY
,Medium varchar(100) NULL);
GO
INSERT dbo.Medium VALUES ('CD'), ('Cassette'), ('LP');
GO

CREATE TABLE dbo.Artist (


ArtistID int IDENTITY PRIMARY KEY
,Artist varchar(100) NOT NULL
,BandMembers varchar(1000) NULL)
GO
CREATE UNIQUE INDEX ncix_artist ON dbo.Artist (Artist)

CREATE TABLE dbo.Album (


AlbumID int IDENTITY PRIMARY KEY
,MediumID tinyint REFERENCES dbo.Medium(MediumID) NULL
,ArtistID int REFERENCES dbo.Artist(ArtistID) NOT NULL
,Title varchar(100)
,ReleaseDate datetime NULL
,Price money NOT NULL)
GO
CREATE INDEX ncix_artist ON dbo.Album (ArtistID)

INSERT Artist VALUES ('Queen', NULL)


INSERT Artist VALUES ('Michael Jackson', NULL)
INSERT Album VALUES(1,1,'Bohemian Rhapsody', '2016-01-01 00:00:00', '$20')
INSERT Album VALUES(NULL,2,'Thriller', '1984-01-01 00:00:00', '$20')
INSERT Album VALUES(1,2,'Thriller', '1984-01-01 00:00:00', '$10')

All questions relate to this data and these table definitions, not to changes made in other questions.
QUESTION 9
You’ve imported a lot of records. Afterwards, the following query takes too long. Which index is best to speed up this query?

SELECT al.Title, al.ReleaseDate, ar.Artist


FROM Artist ar
INNER JOIN Album al ON ar.ArtistID = al.ArtistID
WHERE al.title = 'Thriller' and ar.Artist = 'Michael Jackson'

A CREATE INDEX ncix_artist ON dbo.Album (ArtistID)


B CREATE INDEX ncix_artist ON dbo.Artist (Artist)
C CREATE INDEX ncix_title ON dbo.Album (Title, ArtistID)
D CREATE INDEX ncix_title_artist ON dbo.Album (Title, Artist)
E None of the above

QUESTION 10
Which of the following statements concerning the following query are true? Choose all that apply.

SELECT *
FROM dbo.Album al
INNER JOIN dbo.Medium m ON m.MediumID = al.MediumID
INNER JOIN dbo.Artist ar ON al.ArtistID = ar.ArtistID
WHERE UPPER(ar.Artist) = 'QUEEN'
A The UPPER function prevents proper use of an index
B The use of the UPPER function is only required if the database has an accent sensitive collation
QUESTION 11
What is the correct predicate to retrieve all albums released in 2015? Choose all that apply.

SELECT *
FROM dbo.Album
WHERE YEAR(ReleaseDate) = 2015

A WHERE ReleaseDate BETWEEN 2014 and 2015


B WHERE DATEPART(yyyy,ReleaseDate) BETWEEN '2015-01-01' and '2016-01-01'
C WHERE DATEPART(yyyy,ReleaseDate) = 2015
D WHERE ReleaseDate < '2014-12-31' AND ReleaseDate > '2016-01-01'
E WHERE YEAR(ReleaseDate) = 2015

QUESTION 12
You have a statement that updates the price for an album:

UPDATE dbo.Album
SET Price = @NewPrice
WHERE AlbumID = @AlbumID

You need to insert the old price in a separate table. What is the best way to achieve this?

A Use a MERGE statement


B Add a SELECT statement before the insert statement to capture @oldPrice. Use an INSERT
statement to insert @OldPrice into the separate table
C Add an OUTPUT clause
D Add a SELECT statement before the insert statement to capture @oldPrice. Change the UPDATE
statement to update both tables at once.

QUESTION 13
A user wants to add a record for the band Motörhead. However, the statement doesn’t work properly.
This is the failing statement:

INSERT dbo.Artist VALUES ('Motörhead', 'Lemmy, Phil and Eddy')

And this is the error:

Msg 2601, Level 14, State 1, Line 83


Cannot insert duplicate key row in object 'dbo.Artist' with unique index 'ncix_artist'. The duplicate
key value is (motörhead).
The statement has been terminated.

What is the best solution to this problem:

A Change the column definition in the Artist table to case sensitive.


B Don’t sell any albums by bands with funny names.
C Drop the unique index.
D Don’t insert the record.
E Change the column definition in the Artist table to accent insensitive.

QUESTION 14
Consider the following statement:

SELECT m.Medium, ar.artist, COUNT(*)


FROM Album al
INNER JOIN Medium m ON m.mediumID = al.MediumID
INNER JOIN artist ar ON ar.artistID = al.artistID
GROUP BY m.medium, ar.artist

This statement creates a report for all items per artist. The album table allows for special editions by
allowing for NULL values in the column medium.
You need to make sure the statement includes these specialty items. You propose to make the
following change:
SELECT m.Medium, ar.artist, COUNT(*)
FROM Album al
RIGHT OUTER JOIN Medium m ON m.mediumID = al.MediumID
INNER JOIN artist ar ON ar.artistID = al.artistID
GROUP BY m.medium, ar.artist

Does this proposed solution meet the requirement?

A Yes
B No

QUESTION 15
You try to use the following statement to insert a new artist & album. Will this work?

DECLARE @ArtistID int

INSERT dbo.Artist
VALUES ('Pink Floyd', 'David Gilmoure, Nick Mason, Richard Wright, Syd Barrett')

SELECT @ArtistID = @@IDENTITY

INSERT dbo.Album
VALUES ((SELECT Mediumid FROM dbo.Medium WHERE medium = 'LP')
, @ArtistID, 'Dark side of the moon', '1973-03-01', '$9')

A Yes
B No. You can’t use a subquery in an insert statement.
C No. You need to set identity insert on to insert ArtistID into dbo.Album.
D No. You need to supply the column names in order to insert the ArtistID into dbo.Album.
E No, for reasons C and D.

QUESTION 16
Which of the following INSERT statements are incorrect? Choose all that apply.

A INSERT Medium DEFAULT VALUES


B SET IDENTITY_INSERT dbo.Medium ON; INSERT Medium (MediumID, Medium) VALUES
(1, 'MP3');SET IDENTITY_INSERT dbo.Medium OFF;
C INSERT dbo.Album VALUES ((SELECT Mediumid FROM dbo.Medium WHERE medium =
'LP'), (SELECT ArtistID FROM Artist WHERE Artist = 'Queen'), 'Queen', 1973-06-13, '$9')
D INSERT dbo.Album VALUES ((SELECT Mediumid FROM dbo.Medium WHERE medium =
'CD'), (SELECT ArtistID FROM Artist WHERE Artist = 'Queen'), 'Queen')
E insert artist values ('Queen', 'Freddy Mercury, Brian May, John Deacon, Roger Taylor');
F INSERT [dbo].[Artist] (Artist, BandMembers) VALUES ('Kiss', 'Gene Simmons, Paul Stanley,
Ace Frehley, Peter Criss')
QUESTION 17
You propose to change the data type of the column ReleaseDate in the Album table to date, using the
following statement. Will this work?

ALTER TABLE Album ALTER COLUMN ReleaseDate date NULL

A Yes
B No

QUESTION 18
You propose to make BandMembers a required field, using the following statement. Will this work?

ALTER TABLE Artist ALTER COLUMN BandMembers varchar(1000) NOT NULL

A Yes
B No
Answers

This section contains the correct answers to the questions, plus an explanation of the wrong answers.
In addition to the correct answers, we’ll also give a few pointers which are useful on the actual exam.

QUESTION 1
The correct answers are C and D: WHERE and ORDER BY. SELECT and FROM are required;
WHERE and ORDER BY are optional.

QUESTION 2
The correct answer is C: UNION ALL. INTERSECT will return the rows that are present in both
result sets; FULL OUTER JOIN is a join operator, not a set operator; COMBINE is not a T-SQL
keyword.

QUESTION 3
The correct answer is B: false. The first statement will return all records where Column1 equals 1 or
2; the second statement will return all records where Column1 equals 1, not 2 (and since the value for
Column1 cannot be both 1 and 2, the second filter in the WHERE clause is useless).

QUESTION 4
The correct answer is B, C, F and H. SELECT statement A includes an additional column from the
customer table, which is not required; FROM statement D gives the wrong table alias, as does
INNER JOIN E; WHERE clause G doesn’t provide the filter to match the orders to the correct
customer, and as none of the JOIN operators provide this, the WHERE clause needs to do this; and
finally, the GROUP BY statements are not required at all.

QUESTION 5
The correct answer is B: false. The LEFT OUTER JOIN will, in this case, return orders without a
matching customer, not customers without an order; a RIGHT OUTER JOIN would be required in this
case.
Answering this question is relatively easy if you look back to question 4, but be aware: on the exam,
it is not always possible to return to a previous question!

QUESTION 6
The correct answer is A and C. B will not work, as the INNER JOIN will by definition not include
the non-matching row. D will not work either; the EXCEPT statement requires a matching number of
columns in both result sets.

QUESTION 7
The correct answer is C. There is no matching between any of the fields in the tables; only a cross
join will join a record in one set to all records in the other set. A is false, as there is no LEFT FULL
JOIN; B is false, as a LEFT OUTER JOIN would not combine records without a match. As answer C
is correct, D is false.
QUESTION 8
The correct answer is A & D. The function LEFT takes two input parameters: a string and a number
of characters. B is not correct, as the number of characters should be the second parameter, not the
first. C will return the same 5 characters as A, but in reverse. And the SUBSTRING function should
start at character 1, not 0 (or the end of the string).

QUESTION 9
The correct answer is C. Answers A and B are incorrect; adding an index to table Artist will have
less impact, as there is already an index on Artist. Answer D is incorrect, as there is no column Artist
in table Album.

QUESTION 10
The correct answer is A. Answer B is incorrect; the upper function is required if the database is a
case sensitive collation, not an accent sensitive collation.

QUESTION 11
The correct answers are C and E. Answer A is incorrect, because column ReleaseDate has a
datetime data type. Answer B is incorrect, because the function DATEPART will return an integer.
And answer D is incorrect, because the < and > signs have been reversed.

QUESTION 12
The correct answer is C, use an OUTPUT clause to store the old price into a variable (and store that
price into the other table with a separate INSERT statement). A MERGE statement could get the job
done, but its primary purpose is to either insert or update a record; therefore answer A is not the best
way to achieve this. Same goes for answer B; selecting the old price before the update can get the job
done, but requires two reads of the same table, and is therefore more work. Answer D is incorrect,
because you can’t have an update statement update two tables at once.
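To illustrate answer C, here is a minimal sketch; the table variable, the history table dbo.AlbumPriceHistory and the money data type are hypothetical names and assumptions, and @NewPrice and @AlbumID are assumed to be declared as in the question:

DECLARE @OldPrice TABLE (AlbumID int, Price money);

UPDATE dbo.Album
SET Price = @NewPrice
OUTPUT deleted.AlbumID, deleted.Price INTO @OldPrice (AlbumID, Price)
WHERE AlbumID = @AlbumID;

-- Store the captured old price in the separate (hypothetical) history table
INSERT dbo.AlbumPriceHistory (AlbumID, OldPrice)
SELECT AlbumID, Price
FROM @OldPrice;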

On the exam, a correct answer does not always cover every aspect of the solution, and incorrect
answers are not always 100% wrong.

QUESTION 13
The correct answer is D: don’t insert the record. The error states that there is already a record for the
band, but spelled with a lower case m. This spelling error should be corrected; inserting a second
record for the same entity is data pollution.
Answer A is incorrect, because it would allow two bands with the same name. Answer B is
obviously wrong. Answers C and E are incorrect, because they don’t solve the problem of two
records for the same band.

QUESTION 14
The correct answer is B: no. MediumID is nullable, and the join does not take NULL values into
account. The correct query is:
SELECT m.Medium, ar.artist, COUNT(*) as 'Total items'
FROM Album al
INNER JOIN Medium m ON ISNULL(m.mediumID, 1) = ISNULL(al.MediumID, 1)
INNER JOIN artist ar ON ar.artistID = al.artistID
GROUP BY m.medium, ar.artist

QUESTION 15
The correct answer is A: yes. Answer B is wrong, you can use a subquery in an insert statement.
Answers C, D and E are wrong; these are requirements for inserting an identity value in a table, but
ArtistID is an identity value in the Artist table, not in the Album table.

QUESTION 16
The correct answer is B, C, D and E. Only A and F will work properly. Statements B and E will run
into a primary key violation. Answer C will work, but the date is not surrounded by quotes, and will
therefore be wrongly interpreted by SQL Server (as a subtraction); and D will fail because the
non-nullable column Price is omitted.

QUESTION 17
The correct answer is A: yes, this statement will work.

QUESTION 18
The correct answer is B, no. There are records in the table with NULL values.
Chapter 2: Query data with advanced Transact-SQL components
Chapter overview
In this second chapter, we’ll build on the foundation from the first chapter and discuss more advanced
T-SQL components. We’ll talk about subqueries and the APPLY operator. We’ll discuss table
expressions, temporary tables and the differences between them. We’ll learn how to aggregate data,
by grouping and pivoting tables, and finally, we’ll handle temporal and non-relational data.

Exam objectives
For the exam, the relevant objectives are:

Query data with advanced Transact-SQL components (30–35%)


Query data by using subqueries and APPLY
Determine the results of queries using subqueries and table joins, evaluate performance
differences between table joins and correlated subqueries based on provided data and
query plans, distinguish between the use of CROSS APPLY and OUTER APPLY, write
APPLY statements that return a given data set based on supplied data
Query data by using table expressions
Identify basic components of table expressions, define usage differences between table
expressions and temporary tables, construct recursive table expressions to meet
business requirements
Group and pivot data by using queries
Use windowing functions to group and rank the results of a query; distinguish between
using windowing functions and GROUP BY; construct complex GROUP BY clauses
using GROUPING SETS, and CUBE; construct PIVOT and UNPIVOT statements to
return desired results based on supplied data; determine the impact of NULL values in
PIVOT and UNPIVOT queries
Query temporal data and non-relational data
Query historic data by using temporal tables, query and output JSON data, query and
output XML data
Query data by using subqueries and APPLY
In this section, we'll cover subqueries and the APPLY statement. Both are ways to combine two
queries, so in a way, both can be seen as advanced variations of joins. In chapter one, we've already
seen a few simple examples of subqueries, and we'll build on that. In a simple subquery, the subquery
can be evaluated separately from the outer query. The advanced subquery we’re going to cover is the
correlated subquery; this subquery cannot be evaluated separately from the outer query. The inner
query is executed once for every row of the result of the outer query.
We haven't covered the APPLY statement yet; the APPLY statement joins two query results, in which
the query on the right side is executed once for each row of the result of the query on the left side.
So as you’ll see, the similarity between joins and simple subqueries (which we have already
covered) and correlated subqueries and the APPLY statement is that they’re both ways of combining
two (or more) results; the difference is that with joins and subqueries, all results can be executed
separately, and with correlated subqueries and the APPLY statement, one depends on the other.

Determine the results of queries using subqueries and table joins


First, we’ll cover the simple subquery, and then the correlated subquery. We won’t cover table joins
again, as we’ve already done so in chapter 1, except to demonstrate that they can sometimes be used
to achieve the same result as a correlated subquery.

A subquery is a query within another query. This can be a SELECT statement in another SELECT
statement, but subqueries can also be used in INSERT, UPDATE and DELETE statements.

A simple subquery is a query that can be executed independent of the outer query. It can be placed in
the SELECT, FROM or WHERE clause. We’ve already seen some examples:

In the FROM clause (the derived table):

SELECT o.*
FROM (SELECT * FROM Sales.Orders) o

In the WHERE clause:

SELECT *
FROM [Application].[People]
WHERE PreferredName IN ( SELECT CustomerName
FROM sales.Customers )

In both cases, the inner query, or subquery, can be executed independently. Simply select the inner
query and hit F5, and it’ll run without problems.

The following shows an example of how to use a subquery in the SELECT clause. It will show the
average salary for every employee. It is unrelated to the rest of the query, and will show the same
value for each row returned by the outer query.
SELECT *
,(SELECT AVG(salary) FROM dbo.Employees) AS average_salary
FROM dbo.Employees

There are a lot of places where you can use a subquery. Basically, a subquery can be used anywhere
an expression is allowed.
Just be careful about the type of result that the subquery returns (a single (scalar) value, a list of
values or a result set). A subquery in the SELECT clause requires that the subquery returns a
scalar value. A subquery in the FROM clause requires that the subquery returns a set. And the
requirements for a subquery in the WHERE clause depend on the operator being used; in this case,
we used the operator IN, requiring the subquery to return a list of values. If we’d used the “greater
than” operator, this would require the subquery to return a scalar value, as in the following example:

SELECT *
FROM dbo.Employees
WHERE Salary > (SELECT AVG(salary) FROM dbo.Employees)

Should the subquery return the wrong type of result, for instance a result set instead of a scalar value,
your query may fail at runtime, so be careful that your logic ensures that the correct type of result will
be returned.
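For example, the following sketch (assuming the Employees table contains more than one row) uses a subquery that returns a result set where a scalar value is expected, and will fail at runtime with an error along the lines of “Subquery returned more than 1 value”:

SELECT *
FROM dbo.Employees
WHERE Salary > (SELECT Salary FROM dbo.Employees)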

A correlated subquery is a subquery that cannot be executed without the outer query, because the inner
query depends on the outer query. Select the inner query and try to execute it; you’ll receive an error.
When would you use a correlated subquery? The query above returned the records for each employee
with a higher-than-average salary. Let’s suppose you want all the records for employees with a salary
that is higher than the average for his (or her) department. In that case, the subquery would need to
calculate the average for the department of that particular employee. That means that the inner query
is related to the outer query; it is correlated.
A correlated subquery relates both queries by using a table alias in the outer query, and referencing
the same alias in the subquery, as you can see in the following example.

This is the query that will retrieve all employee records whose salary is higher than the average for
their department:

SELECT *
FROM dbo.Employees e1
WHERE Salary > ( SELECT AVG(salary)
FROM Employees e2
WHERE e1.department = e2.department)

The subquery refers to the table alias e1, which is defined in the outer query. This is a dead giveaway
for a correlated subquery. The inner query is, logically speaking, performed once for each record in
the outer query, each time with a different value for e1.department.
The same result could have been achieved without a correlated subquery. This, for example, is a
logically equivalent query, using a simple subquery and a join instead of a correlated subquery:
SELECT *
FROM dbo.Employees e1
INNER JOIN ( SELECT AVG(salary) AS avg_salary, department
FROM dbo.employees
GROUP BY department) e2 ON e1.Department = e2.Department
WHERE salary > e2.avg_salary

We haven’t covered GROUP BY yet; we’ll do that later in this chapter. This is just to illustrate that,
often, a correlated subquery can achieve the same result as a join. Sometimes, there is a performance
difference (which we’ll cover for the next exam objective); if they perform the same, choose
whatever syntax seems most logical to you. A lot of developers find the JOIN statement easier to
conceptualize than the correlated subquery.

As with the simple subquery, the use of the correlated subquery is not limited to the WHERE clause.
Here is the same query, but this time, we’ve added the average salary per department to the SELECT
clause:

SELECT *
,( SELECT AVG(salary)
FROM Employees e2
WHERE e1.department = e2.Department)
FROM Employees e1
WHERE Salary > ( SELECT AVG(salary)
FROM Employees e2
WHERE e1.Department = e2.department)

The subquery does not necessarily have to select from the same table, as the following example will
demonstrate. This query will retrieve all records from the Department table for departments that have
more than 2 employees:

SELECT *
FROM Departments d
WHERE ( SELECT count (*)
FROM Employees e
WHERE e.Department = d.Department) > 2

The Department table contains only a column Department, and therefore no additional information, so
in the real world, we would only need to access the Department table if it contained additional info
we needed; this example just illustrates how to use a correlated subquery with two different tables.
But just to make this example a little more life-like, we’ll add a column to the Departments table, and
fill it with values:
ALTER TABLE Departments ADD Location varchar(100) DEFAULT 'Houston' WITH VALUES;
GO
UPDATE Departments
SET Location = 'New York'
WHERE Department = 'Management'
One more note on subqueries: a subquery may only use ORDER BY when TOP is also used. This
makes sense; the combination of ORDER BY and TOP is used for filtering, whereas ORDER BY
without TOP is used for presentation purposes, and this is only useful for the final result set.
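For example, the following sketch uses TOP together with ORDER BY inside a derived table to keep only the five most recent orders; without the TOP, the ORDER BY inside the subquery would not be allowed:

SELECT o.*
FROM ( SELECT TOP (5) *
       FROM Sales.Orders
       ORDER BY OrderDate DESC) o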
We’ll see some more examples of correlated subqueries in the next section, when we compare the
performance of correlated subqueries to that of joins.

Evaluate performance differences between table joins and correlated subqueries based on provided
data and query plans
As stated earlier: a correlated subquery is a subquery that is performed one time for every record in
the outer query. This might lead to performance issues, and a logically equivalent join statement might
produce the same result faster. However, performance tuning depends on a lot of factors, and the
reverse might also be true; a JOIN statement may perform slower than a logically equivalent
correlated subquery. Or they may perform exactly the same; SQL is a declarative language, meaning that
the code you write expresses what the code should accomplish, not how the code should accomplish
it.

The best advice we can give you for the real world is to test multiple alternatives for queries that will
operate on large record sets; you might want to investigate using different queries, different indexes,
temporary tables etc. For the exam, however, this advice is not enough, as this exam objective
specifically requires you to be able to evaluate performance differences between table joins and
correlated subqueries, based on provided data and query plans.
There is one big caveat here: a full explanation of query plans would include all possible operators,
and entire books have been written on interpreting query plans. For the exam, this is where you need
years of experience writing and tuning queries. Our exam tip: the two important aspects of the query
plan to look at are the relative percentages in the query plan, and the number of times a table (or
supporting index) is used. The rest of this section will describe two examples, one in which a join is
better and one in which a correlated subquery is better.

The first example is one where a join will perform better than the correlated subquery. Let’s assume a
sales person wants to know, for each customer, how many times they have made an order, and what
the last order date was. For that, we’ll need the data in the customer table and the orders table. But
we’ll have to make a copy of the customer table first (the reason for this, is that there is a security
policy defined on the customer table, and a simple select statement on this table will result in a query
plan that is needlessly complicated for the purpose of this demonstration). We’ll also make a copy of
the orders table. This is the code to make both copies to our database TestDB:

SELECT *
INTO Customers_copy
FROM WideWorldImporters.sales.Customers

SELECT *
INTO orders_copy
FROM WideWorldImporters.sales.orders
Now, compare both queries and their execution plan:

SELECT c.CustomerName, o.lastorderdate, o.NumberOfOrders
FROM (SELECT customerID
, max(orderdate) AS LastOrderDate, count(*) AS NumberOfOrders
FROM orders_copy
GROUP BY CustomerID) o
INNER JOIN Customers_copy c on c.CustomerID = o.customerId

SELECT CustomerName
,( SELECT MAX(orderdate)
FROM orders_copy o
WHERE c.customerId = o.customerID ) AS LastOrderDate
,( SELECT COUNT(*)
FROM orders_copy o
WHERE c.customerId = o.customerID )AS NumberOfOrders
FROM Customers_copy c

Depending on the format you’re reading this book in, the picture below might be too small to read, but
that is just another reason why you should follow along with the examples. The query that uses the
join has to go through the orders table only once to calculate the last order date and the total number
of orders; the correlated subquery will have to go through this table twice, once for each time it is
used in the select statement. You can see this on the right hand side of the query plan, where the join
query performs one Table Scan, and the correlated subquery performs two Table Scans. This results
in the difference in percentage if you execute both queries at the same time: 34% for the join, 66% for
the correlated subquery.
If you’d also want to add the date of the first order, the correlated subquery would have to scan the
orders table a third time, making the difference even bigger. This is one usage pattern where a join
performs better than a correlated subquery.

Up next we’ll show an example where the correlated subquery performs better than the join. Let’s
assume the same sales person now wants to have a report of customers who’ve never made an order.
Using the same copies of the tables, these are the necessary queries, the former using a join and the
latter using a correlated subquery:

SELECT *
FROM Customers_copy c
LEFT OUTER JOIN orders_copy o on c.CustomerID = o.CustomerID
WHERE o.CustomerID IS NULL

SELECT *
FROM Customers_copy c
WHERE NOT EXISTS ( SELECT *
FROM orders_copy o
WHERE c.CustomerID = o.CustomerID)
As you can see, the cost of the correlated subquery is much lower (23% versus 77%). One note on the
EXISTS query: it will stop the Table Scan for a customer as soon as it has found the first order for
that customer. So depending on the data in the table, it might not need to scan the entire table.

We’ve now given examples of simple subqueries, correlated subqueries and demonstrated that a
correlated subquery might perform better, or worse, than a logically equivalent table join. Let’s move
on to the APPLY statement.

Distinguish between the use of CROSS APPLY and OUTER APPLY


As stated in the introduction of this section: the APPLY statement joins two query results, in which
the query on the right side is executed once for each row of the result of the query on the left side.
That makes it similar to a correlated subquery, where the inner query is executed once for every row
in the outer query. The differences: first, the correlated subquery is used to return a single column or
scalar value, whereas APPLY will return a set, and second, APPLY can be used with either a regular
select statement or a table valued function (that takes a column from the left result set as input for the
function). It is this last functionality of APPLY that makes it particularly useful (as we’ll see later on).

There are two variations of APPLY: the CROSS APPLY and the OUTER APPLY. This is similar to
the CROSS join and the LEFT OUTER JOIN we discussed earlier. CROSS APPLY will return all
records from the left hand set with a match in the right hand set. Remember that the right query is only
executed for rows of the left query, so there is no possibility of a row in the right result set without a
match in the left result set. This makes the CROSS APPLY more similar to an INNER JOIN (or, to be
more precise, a CROSS join without the possibility of a row in the right set without a match in the left set). The
OUTER APPLY is similar to the LEFT OUTER JOIN, and will return all records from the left hand
set, with or without a match in the right hand set.

First, an example to demonstrate the use of APPLY, with the equivalent JOIN statement:

SELECT *
FROM Departments d
CROSS APPLY
( SELECT *
FROM Employees e
WHERE e.Department = d.Department) table_alias

SELECT *
FROM Departments d
INNER JOIN Employees e ON d.Department = e.Department

Note that table_alias in the CROSS APPLY does not actually get referenced anywhere, but it is
required nonetheless (similar to the table alias when you use a subquery in the FROM clause).
And these are the results:
Obviously, as these are test tables that change constantly, your results may vary, but your results
should be similar, and more importantly: the results of both the APPLY and the JOIN statement will
be the same.
To demonstrate the use of the OUTER APPLY, we’ll add a record to the table referenced in the left
hand query (Departments) without a match in the right hand query (Employees):

INSERT Departments VALUES ('Research', 'Eindhoven')

SELECT *
FROM Departments d
OUTER APPLY
( SELECT *
FROM Employees e
WHERE e.Department = d.Department) table_alias

SELECT *
FROM Departments d
LEFT OUTER JOIN Employees e ON d.Department = e.Department

And once more, both results:


As you can see, the record for the left hand result set without a match in the right hand result set is
now returned (the research department).

Please also run the statements with the execution plan. You’ll see that in the CROSS APPLY vs.
INNER JOIN example, both query plans are identical, and in the OUTER APPLY vs. LEFT OUTER
JOIN, an operator will be added to filter rows from the Employees table. The extra operator is
Compute Scalar, which adds a negligible cost to the query plan; this might impact the performance in
other scenarios, but in our case the total cost of both queries is still almost identical.
So if the statements are logically equivalent, and the performance is the same, what is the benefit of
using APPLY? The answer: when you want to execute a table valued function for one of the values
from the left hand result.
Let’s give an example, and create a function that returns a table of employees and their salary, with
department name as input parameter. We’re jumping ahead a little bit, as this is an exam objective we
cover in chapter 3, but we really need a table valued function to demonstrate the APPLY statement.
Creating a function is easy, and this is the syntax to do it:

CREATE FUNCTION dbo.fn_GetListOfEmployees(@Department AS varchar(100))
RETURNS TABLE
AS
RETURN
(
SELECT FirstName + ' ' + LastName AS FullName, Salary
FROM Employees e
WHERE e.Department = @Department
)

A table valued function returns a table, so you can use the table valued function in the FROM clause.
You need to supply the input parameter. Using the engineering department as an example:
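A minimal sketch of what that direct call looks like:

SELECT FullName, Salary
FROM dbo.fn_GetListOfEmployees('Engineering')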
Now, we can use the table valued function on the right side of the APPLY operator, using a column
from the left hand side as the input variable:
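Here is a sketch of that APPLY query (the alias f is just an arbitrary name):

SELECT d.Department, f.FullName, f.Salary
FROM Departments d
CROSS APPLY dbo.fn_GetListOfEmployees(d.Department) f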

Please verify for yourself that using OUTER APPLY will include the Research department in the
output, with NULL values as output for FullName and Salary.

Let’s give another example. SQL Server comes with a lot of built-in tables, views and functions.
We’ll combine two: a Dynamic Management View called sys.dm_exec_requests and a Dynamic
Management Function sys.dm_exec_sql_text. You can forget about the Dynamic Management part, and
we haven’t covered views yet; we’ll do that in chapter 3. For now, just regard them as a table and a
function.
Sys.dm_exec_requests shows information on all currently running execution requests (including an
identifier for the execution plan called a plan handle), and the function sys.dm_exec_sql_text accepts
a plan handle as input, and returns a table with (among other columns) the T-SQL text of the query.
This means that you can combine both of them, using the APPLY statement:

SELECT session_id, DB_NAME(database_id) AS [database], start_time
, open_transaction_count, [text] AS [Query]
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.plan_handle) s
WHERE session_id > 50 -- Exclude system spids
ORDER BY session_id;

As output, you’ll get a list of all running user processes. This should at least contain your own
process. If you are the only user, this will also probably be the only process. To demonstrate that it
actually works, you can open another query window and start a long running query (for example, the
cartesian product we used in chapter 1, joining the Sales.Orders table to the Sales.Orderlines table
without a join predicate).

We’ve now explained and demonstrated the use of the APPLY statement, the difference between
CROSS APPLY and OUTER APPLY, and given examples of combining APPLY with table valued
functions.

Write APPLY statements that return a given data set based on supplied data
In the previous section, we’ve already covered this objective as well. In the questions, we’ll be sure
to come back to the APPLY statement, but as far as the theory goes, you now know all you need to
know for this objective.
Query data by using table expressions
In this section, we’re going to discuss common table expressions, and demonstrate how to use them.
We’ll compare the common table expression to a concept we’ve already seen (the derived table) and
a concept we haven’t covered yet, so we’ll have to explain that as well: the temporary table. We’ll
also compare it to a concept that is not a listed exam objective, but that we feel is too similar to
ignore in this discussion: the table variable. And we’re going to discuss the recursive table
expression, a solution for a very specific but common problem.

Identify basic components of table expressions


In chapter 1, we demonstrated the use of a derived table:

SELECT o.*
FROM (SELECT * FROM Sales.Orders) o

The result set is given a name, and the derived table can then be used by the outer query. A common
table expression is very similar: you define a query, give the result set a name and use it in the rest of
the query as if it were a table. And another similarity: for both the CTE and the derived table, every
column needs to have a name.
We’ll rewrite the query above as a common table expression first, before explaining the difference
between the common table expression and the derived table.

WITH o
AS ( SELECT *
FROM Sales.Orders)
SELECT *
FROM o

One note about the syntax: the statement immediately preceding the WITH statement has to be
properly terminated with a semi-colon, otherwise the common table expression will fail with a syntax
error (the reason for this: WITH is a clause that is used at the end of other statements to provide a list
of options, like backup and ALTER TABLE statements, so the semi-colon is needed to avoid possible
conflicting situations).

Like a derived table, the common table expression only exists for the duration of the query.
A common table expression has two advantages over a derived table: readability, and the recursive
CTE.
The first one is readability. In our very simple example, both queries are easy to read, but as queries
get more realistic, the difference in readability of the CTE over the derived table can become quite
significant. Let’s revisit our example where we compared the performance of a correlated subquery
to a join, and rewrite the JOIN as a CTE:

SELECT c.CustomerName, o.lastorderdate, o.NumberOfOrders
FROM (SELECT customerID, max(orderdate) AS LastOrderDate, count(*) AS NumberOfOrders
FROM orders_copy
GROUP BY CustomerID) o
INNER JOIN Customers_copy c on c.CustomerID = o.customerId;

WITH o AS (
SELECT customerID, max(orderdate) AS LastOrderDate, count(*) AS NumberOfOrders
FROM orders_copy
GROUP BY CustomerID)
SELECT c.CustomerName, o.lastorderdate, o.NumberOfOrders
FROM o
INNER JOIN Customers_copy c on c.CustomerID = o.customerId

This readability issue is most obvious with queries where you have to reference the same derived
table multiple times, or when you have nested derived tables. As an alternative for nested derived
tables, you can use multiple common table expressions, where each CTE can reference the ones
before (but not the ones that follow):

WITH cte_1 AS (
SELECT *
FROM MyTable
...
),
cte_2 AS (
SELECT *
FROM cte_1
...
)
SELECT *
FROM cte_2

We’ll see an example of that later on.

The second advantage of the CTE over the derived table is the recursive CTE; this is the next exam
objective on our list.

Construct recursive table expressions to meet business requirements


The recursive CTE is very useful in situations where you have a chain of parent-child relations, and
need to query this chain of relations from the bottom to the top, or the other way round. The most
commonly used scenario for this example is the relation of employees to their managers, all the way
from the top of the company to the bottom.
Our employee table does not have manager information in it, so we’ll need to add that. In order to do
that, we’ll need to:
* add a column to store the information for the manager of each employee;
* add some managers to our employee table;
* update the table to give everyone a manager, except for Frank, whom we’ll designate as the CEO of
our fictional company.

ALTER TABLE Employees ADD ManagerID tinyint NULL;
GO
INSERT Employees (FirstName,LastName, Salary, Department, ManagerID)
VALUES ('Harriet', 'Hughes', 1800.00,'Sales',7);
INSERT Employees (FirstName,LastName, Salary, Department, ManagerID)
VALUES ('Ronnie', 'Robertson', 1800.00,'Engineering',7);

UPDATE Employees
SET ManagerID = (SELECT EmployeeID FROM Employees WHERE FirstName = 'Harriet')
WHERE Department = 'Sales'
AND FirstName <> 'Harriet'

UPDATE Employees
SET ManagerID = (SELECT EmployeeID FROM Employees WHERE FirstName = 'Ronnie')
WHERE Department = 'Engineering'
AND FirstName <> 'Ronnie'

In a properly designed table, we’d add a foreign key constraint to the ManagerID column, to ensure
that it points to an actual employee, and since a foreign key needs to point to a column that is
guaranteed to be unique (either a primary key or a unique constraint), we’d add that too. This is not
actually required for our example, and database design is a topic for exam 70-762, so consider this as
bonus material.

ALTER TABLE Employees ADD CONSTRAINT PK_EmployeeID PRIMARY KEY (EmployeeID)

ALTER TABLE Employees ADD CONSTRAINT FK_EmployeeID FOREIGN KEY (ManagerID) REFERENCES
Employees(EmployeeID)
GO

This is the recursive CTE to list every employee at every management level:

WITH Employee_CTE AS (
SELECT EmployeeID, ManagerID, Firstname + ' ' + LastName AS Name, 0 as Level
FROM Employees
WHERE ManagerID IS NULL
UNION ALL
SELECT e1.EmployeeID, e1.ManagerID, Firstname + ' ' + LastName, e2.level + 1
FROM Employees e1
INNER JOIN Employee_CTE e2 ON e2.EmployeeID = e1.ManagerID
)
SELECT Name, Level
FROM Employee_CTE
ORDER BY Level, Name

As a lot of students struggle with the recursive common table expression, this code requires some
additional explanation.
The first part of a recursive common table expression is called the anchor member. This is a query
that can be executed separately. The second part of the query, called the recursive member, is joined
to the anchor member. In our case, the EmployeeID of the anchor is joined to the ManagerID of the
recursive query. In English: for each employee in the anchor query, the recursive query returns a list
of employees who report to him (or her).
A recursive common table expression can have multiple anchor members and multiple recursive members, but
this is very uncommon. The last anchor member and the first recursive member must be combined
using UNION ALL, and since we only use one anchor member and one recursive member, we’re
using UNION ALL to combine the anchor query and the recursive query.
This recursive query is repeated until the recursive query returns an empty set, or a maximum has
been reached. By default, this maximum is set at 100, but this can be changed using OPTION
(MAXRECURSION). This maximum is required, because otherwise, a recursive query might run into
an endless loop (which might occur if your data is not a true hierarchy). In our case, the recursive
query will be executed until there are no more employees who are listed as anyone’s managers.
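As a sketch of where the hint goes, OPTION (MAXRECURSION) is appended to the outer SELECT of the statement that contains the recursive CTE; the value 200 below is just an arbitrary example (MAXRECURSION 0 removes the limit entirely):

WITH Employee_CTE AS (
SELECT EmployeeID, ManagerID, Firstname + ' ' + LastName AS Name, 0 as Level
FROM Employees
WHERE ManagerID IS NULL
UNION ALL
SELECT e1.EmployeeID, e1.ManagerID, Firstname + ' ' + LastName, e2.level + 1
FROM Employees e1
INNER JOIN Employee_CTE e2 ON e2.EmployeeID = e1.ManagerID
)
SELECT Name, Level
FROM Employee_CTE
ORDER BY Level, Name
OPTION (MAXRECURSION 200)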

You can also traverse the hierarchy the other way around, from the bottom to the top, but that requires
a bit more work. The anchor query will list all employees of all levels of the hierarchy: top
managers, middle managers and bottom level employees. Each recursion will add the manager of
each employee in the previous list, until the level is reached with no more managers (the empty set).
That means that you’ll have to filter out the employees who are managers themselves from the initial
anchor member set. One way to achieve this is by selecting only employees in the anchor member
query whose EmployeeID is not present in the ManagerID column.
Each manager will appear once in the list for every employee in the chain underneath, so you have to
filter out the duplicates; we’ll do this using DISTINCT. So the query traversing the chain from bottom
to top becomes:

WITH Employee_CTE AS (
SELECT EmployeeID, ManagerID, Firstname + ' ' + LastName AS Name, 0 as Level
FROM Employees
WHERE EmployeeID NOT IN ( SELECT ManagerID
FROM Employees
WHERE ManagerID IS NOT NULL)
UNION ALL
SELECT e1.EmployeeID, e1.ManagerID, Firstname + ' ' + LastName, e2.level + 1
FROM Employees e1
INNER JOIN Employee_CTE e2 ON e1.EmployeeID = e2.ManagerID
)
SELECT DISTINCT Name, Level
FROM Employee_CTE
ORDER BY Level, Name

Two things to note here:


* in the anchor member query to determine the employees who themselves are not a manager, the
subquery in the WHERE clause contains a filter for IS NOT NULL. Otherwise, the subquery would
include a NULL record, and this will cause no records to be returned. As dealing with NULL values
can have unexpected results, and not handling NULL values correctly is a big source of bugs in real
life, we’ll clarify this. A NULL value is unknown; therefore, SQL can never positively determine that
any value is not equal to a NULL value. In the case of the exists statement, SQL cannot positively
determine that any of the values is not equal to the NULL value, and therefore will not return any
records. In more general terms, the following WHERE clause will never return any records:

WHERE Column NOT IN (SELECT a list of values that includes NULL)

* In the query from top to bottom, the ManagerID in the anchor member query was matched to the
EmployeeID in the recursive member query; in the query from bottom to top, it was the other way
round (the EmployeeID in the anchor member query was matched to the ManagerID in the recursive
member query).
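A minimal demonstration of this NULL behavior, using constant lists instead of subqueries:

-- Returns one row: 2 is not in the list (1, 3)
SELECT 'row returned' AS Result WHERE 2 NOT IN (1, 3)

-- Returns no rows: SQL cannot rule out that the NULL might be equal to 2
SELECT 'row returned' AS Result WHERE 2 NOT IN (1, 3, NULL)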

These examples leave us with the name of an employee and the EmployeeID of his or her manager.
What if we wanted the name of every employee and his or her manager? We promised to show an
example of the use of multiple common table expressions, and that is exactly what we’ll use here.
We’ll change the first CTE to only return the ID’s and the level, put the DISTINCT statement in a
second CTE, and join that second CTE to the Employees table to look up the name for each
EmployeeID and ManagerID (notice the LEFT OUTER JOIN to look up the name of the manager;
without it, CEO Frank would be eliminated from the final result as he himself has no manager).
WITH Employee_CTE AS (
SELECT EmployeeID, ManagerID, 0 as Level
FROM Employees
WHERE EmployeeID NOT IN ( SELECT ManagerID
FROM Employees
WHERE ManagerID IS NOT NULL)
UNION ALL
SELECT e1.EmployeeID, e1.ManagerID, e2.level + 1
FROM Employees e1
INNER JOIN Employee_CTE e2 ON e1.EmployeeID = e2.ManagerID
)
,
Distinct_CTE AS (
SELECT DISTINCT EmployeeID, ManagerID, Level
FROM Employee_CTE)
SELECT e2.FirstName + ' ' + e2.LastName AS Name
, e3.FirstName + ' ' + e3.LastName AS Manager, Level
FROM Distinct_CTE e1
INNER JOIN Employees e2 ON e1.employeeID = e2.EmployeeID
LEFT OUTER JOIN Employees e3 ON e1.ManagerID = e3.EmployeeID
ORDER BY Level, Name

The result:

We’ve now demonstrated the use of the common table expression as an alternative for the derived
table, how to use multiple common table expressions in a single statement, and how to construct a
recursive CTE.

As we said before: like a derived table, the common table expression only exists for the duration of
the query. If you want to use the same logic as basis for other queries, instead of using a derived table
or CTE in multiple queries, you can use something we’ve already touched upon, and will explain in
more detail in chapter 3: the view. This will actually store the query in the metadata of the database,
which can be very convenient for code reuse (or security).
From a performance standpoint, it usually makes no difference if you use views, derived tables or
common table expressions; SQL will work directly with the actual tables, and build its execution plan
based on those. An alternative to a CTE that can have a huge impact on performance, is the temporary
table; we’ll cover temporary tables next.

Define usage differences between table expressions and temporary tables


In this section, we’ll explain the difference in usage between common table expressions and
temporary tables. We’ll also compare the temporary table to the table variable. In order to do that,
we’ll first have to explain the temporary table, and then the table variable, before we can discuss
when to use either a table variable or a temporary table.

Temporary table
A temporary table is a table that is created in system database Tempdb, and only exists for the
duration of the session. In other words: you can use the same temporary table for different queries
within the same session, but after the session ends, the temporary table is automatically destroyed.
You create a temporary table like you would any other table, by explicitly creating a table (using
CREATE TABLE) or creating it implicitly using SELECT…INTO. The difference with a regular
table is that, in order to create a temporary table, you have to start the name with either a single pound
sign for a local temporary table, or a double pound sign for a global temporary table. We’ll start with
the local temporary table:

SELECT *
INTO #tmp_Employees
FROM Employees

This will create a table on disk, in TempDB. You can now read from it, and work with it in all sorts of
ways, just like a regular table. For example, you can select from it, add or drop a column and drop the
table:

SELECT *
FROM #tmp_Employees

ALTER TABLE #tmp_Employees ADD Description varchar(100) NULL

DROP TABLE #tmp_Employees

If, before dropping the table, you open another session, you cannot interact with this table. You can
even create your own temporary table #tmp_Employees in this other session, and both tables will be
completely unrelated. You can verify its existence, though. In every database, there is a table called
objects in the sys schema, with a record of all objects:

SELECT *
FROM tempdb.sys.objects
WHERE name LIKE '#tmp_Employees%'

You’ll see a record for each #tmp_Employees you’ve created, with a name that is followed by a list
of underscores and a number (for use by the system to differentiate between the different tables).
In order to demonstrate that the temporary table only exists for the duration of a session, you can kill
the session that created the temporary table from another window, and verify that the table no longer
exists in sys.objects, and that the temporary table is no longer accessible in the window that created
it:

Msg 208, Level 16, State 0, Line 7
Invalid object name '#tmp_Employees'.

Because even though this is the same window, it is no longer the same session, as it was killed & re-
established.

For the global temporary table, things work pretty much the same. The difference is, that this table is
accessible in other sessions. Please try these steps for yourself: create the table (using
##tmp_Employees as its name), and from another session, look it up in sys.objects, access the
data in the table, attempt to create another table with the same name, and verify that it is
destroyed for all sessions upon killing the session that created the global temporary table.
Global temporary tables are seldom used because of this behavior, that a global temporary table is
accessible to other sessions but is destroyed with the session that created it. Most temporary tables
are therefore local temporary tables.
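A minimal sketch of these steps, assuming two separate query windows (sessions):

-- Session 1: create the global temporary table
SELECT *
INTO ##tmp_Employees
FROM Employees

-- Session 2: unlike a local temporary table, it is visible here as well
SELECT * FROM tempdb.sys.objects WHERE name = '##tmp_Employees'
SELECT * FROM ##tmp_Employees

-- Session 2: attempting to create another ##tmp_Employees now fails,
-- because the name is already taken by the table created in session 1.
-- Kill session 1, and ##tmp_Employees is destroyed for all sessions.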

Table variables
A table variable is a table that, like a local temporary table, exists only temporarily; unlike the
temporary table, its scope is the batch rather than the session (more on this below). A table variable cannot be created implicitly (using SELECT…INTO) but has to be declared
explicitly:

DECLARE @Employees TABLE (
[EmployeeID] [tinyint] NOT NULL,
[FirstName] [varchar](100) NULL,
[LastName] [varchar](50) NULL,
[Address] [varchar](100) NULL,
[Salary] [decimal](18, 2) NULL,
[Department] [varchar](100) NULL,
[Employee_Number] [int] NULL,
[ManagerID] [tinyint] NULL)

There are some other noticeable differences between the temporary table and the table variable:
* the table variable exists only for the duration of the batch that created it, not the session (like every
other variable);
* the structure of a table variable cannot be changed after its declaration (e.g. you can’t add or drop
columns);
* you’re limited in which constraints you can add to a table variable.

Otherwise, a table variable acts like a regular table. So for instance, you can insert data into it:
INSERT INTO @Employees
SELECT *
FROM Employees

Provided of course, that you do it in the same batch. For a complete list of the do’s and don’ts of
using table variables, please read the documentation in Microsoft Docs.

In chapter 3, we’ll talk about table valued functions, and we’ll see the use of table variables there.
For now, we’ll leave table valued functions aside, and discuss the usage of table variables and
temporary tables in other code.

Usage of table variables and temporary tables


The two differences between, on the one hand, derived tables and common table expressions, and on
the other hand, table variables and temporary tables, are these: first, a derived table or common table
expression exists only for the query that defines it, whereas a table variable or temporary table exists
for the duration of a batch or session, which enables you to perform a series of different queries on
the object. And second, derived tables and common table expressions are not separate objects from
the actual underlying tables, and in performing the query, SQL Server will use those underlying
tables; whereas a temporary table or table variable actually is a separate object. Those two
differences will be reflected in the usage of table variables and temporary tables.

There are two reasons to use table variables and temporary tables. The first is simplicity, the second
is performance. We will discuss scenarios for both of these benefits in general terms, but won’t be
giving any examples here. The reason for this is that such examples would be, by their very nature,
quite complex, and require a lot of explanation of both the real world scenario and the query
optimizer, with limited applicability on the exam.

Let’s start with simplicity. SQL Server is very good at performing a lot of complex operations in a
single statement. However, humans are not, and programmers are humans. So if you need to perform a
number of complex operations, it is often easier to perform these operations step-by-step than in one
big statement. Performing operations step-by-step results in more readable code, with the added
benefit that you can check the intermediate results. You put all the data you need into a temporary
table, perform step 1 on that temporary table, check the results, and move on to step 2. And because
you’re working on a temporary table (or table variable), your code is isolated from the rest of the
processes on the database. Breaking complex logic into multiple steps might be bad for performance,
but this is a trade-off you might be willing to consider; code that is unreadable is unmaintainable
code, and therefore bad code.

The second potential benefit of temporary tables and table variables is performance. Putting data into
a temporary table or table variable is, in itself, overhead. In some situations, however, this might be
more than compensated by other performance benefits. One situation is when you have to perform an
expensive query, and then use the result of that query for several different queries. You might be better
of storing the result of the first query in a temporary table, or table variable, instead of performing the
expensive query multiple times.
Another situation that calls for the use of temporary tables or table variables is a situation where
SQL, for whatever reason, chooses a less-than-optimal plan.
What can cause a less-than-optimal plan? Well, for some queries, SQL Server has to choose from a
number of alternatives. It has to decide, for example:
* which table to operate on first (if there are multiple tables involved);
* whether to use an index, a table or a combination of indexes and the base table;
* what operators to use to join the various intermediate results;
* how much memory to allocate for the query, based on the estimated number of rows for each
operation.

All of these decisions are influenced by statistics, and statistics are not perfect. Usually, SQL Server
will choose a plan that is “good enough”, but there will be times when you have to intervene. Such a
situation may occur when you are joining several large tables; one less-than-optimal choice for the
execution plan may have a huge impact when these tables contain tens of gigabytes of data (as
opposed to our test database, which only has 2 GB of data in total). In such cases, it is worth
investigating what happens when you read all required data into temporary tables, and join those
temporary tables (or table variables).
Here, you need to know another important difference between a temporary table and a table variable.
For a table variable, SQL estimates that a table variable will always contain one record, and you
cannot create a nonclustered index on a table variable. For a temporary table, on the other hand, both
nonclustered indexes and statistics can be created, which can have a huge impact.
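As a sketch of that last point: the following works on a temporary table, whereas the equivalent is not possible on a table variable (the index name is arbitrary):

SELECT *
INTO #tmp_Employees
FROM Employees

-- A nonclustered index (with its accompanying statistics) on the temporary table
CREATE NONCLUSTERED INDEX ix_tmp_Employees_Department
ON #tmp_Employees (Department)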

As stated, we will not give any actual examples of the performance benefits of using either temporary
tables or table variables. Just remember that you can use either of them if you want to break up a
complex operation into several smaller queries for the sake of simplicity, if you need to perform
several different actions on the result of one expensive query, or in some situations where you are
operating on several very large tables all at once. And if you do decide to use either a table variable
or a temporary table: use table variables for a small number of rows, and temporary tables for either
a larger number of rows or when you need the constraints, statistics or nonclustered indexes that table
variables don’t support.
Group and pivot data by using queries
In this section, we’ll demonstrate three advanced query techniques: windowing functions, grouping
and pivoting. When using grouping functions, you group records together and perform operations on
the groups, not on the individual records. When using windowing functions, you also group records
together, but continue to work on the individual records.
Pivoting can best be described as turning rows into columns, and vice versa.

We’ve already seen some examples of grouping records (for example, in the example where we
calculated the last order date per customer), so we’ll start there.

Construct complex GROUP BY clauses using GROUPING SETS, and CUBE


In an earlier example, we’ve already seen the use of a few simple GROUP BY statements, but we
didn’t thoroughly explain these statements. The current exam objective deals with complex GROUP
BY statements, so we’ll have to revisit the simple GROUP BY statements before moving on to the
complex statements.
For the simple statements, we’ll demonstrate how to group based on one or more columns, the effect
of NULL values, revisit some aggregate functions and highlight the order of the various operations.
For the complex statements, we will discuss three operators that work to combine several groupings
into one result set: GROUPING SETS, ROLLUP and CUBE.

Demonstrating all the grouping elements requires a demo table with carefully selected data, so we’ll
create one.

DROP TABLE IF EXISTS UFO_Sightings;
GO
CREATE TABLE UFO_Sightings (
CountryID tinyint
, Country varchar(50)
, State varchar(50)
, City varchar(50)
, UFO_Sightings int );

INSERT INTO UFO_Sightings VALUES (1, 'Germany', NULL, 'Berlin', 1)
INSERT INTO UFO_Sightings VALUES (1, 'Germany', NULL, 'Frankfurt', 2)
INSERT INTO UFO_Sightings VALUES (1, 'Germany', NULL, 'Frankfurt', 3)
INSERT INTO UFO_Sightings VALUES (2, 'United States', 'Texas', 'Houston', 1)
INSERT INTO UFO_Sightings VALUES (2, 'United States', 'Texas', 'Paris', 2)
INSERT INTO UFO_Sightings VALUES (2, 'United States', 'Nevada', NULL, 300)
INSERT INTO UFO_Sightings VALUES (3, 'France', NULL, 'Paris', 1);

SELECT *
FROM UFO_Sightings
The general idea of grouping is to group all records of a certain category, and return some information
on that category (using an aggregate function). For instance, to calculate the total number of UFO
sightings per country:

SELECT Country, SUM(UFO_Sightings)
FROM UFO_Sightings
GROUP BY Country

Note that the aggregation functions also work if you group all records together, for example:

SELECT SUM(UFO_Sightings)
FROM UFO_Sightings

This implicitly groups all records. This is equivalent to:

SELECT SUM(UFO_Sightings)
FROM UFO_Sightings
GROUP BY ()

In this case, we explicitly grouped all records; however, this syntax is hardly ever used. We only
show it here because we’ll use the same syntax later on when we talk about grouping sets.

It is possible to group on more than one column. For example, to group on both country and city, you
can use the following code:

SELECT Country, City, SUM(UFO_Sightings)
FROM UFO_Sightings
GROUP BY Country, City
ORDER BY Country, City;

When discussing the aggregate functions (in chapter 1), we noted that when you group by a column (or
a number of columns), the SELECT and ORDER BY clauses can only use those columns, or aggregate
functions on the records in each group. This bears repeating. The clauses are processed
in the following order:
* FROM
* WHERE
* GROUP BY
* SELECT
* ORDER BY

After the grouping, all information from the individual records is no longer available. The WHERE
statement filters records before the GROUP BY statement. For example, this statement filters out all
records for cities that are not named Paris:

SELECT Country, SUM(UFO_Sightings)
FROM UFO_Sightings
WHERE City <> 'Paris'
GROUP BY Country

Please test this statement for yourself. If it did not return the results you expected, remember that SQL
does not know for sure that the unknown city NULL in Nevada isn’t called Paris, and therefore will
filter out that record. If that is not the result you’re after, use the ISNULL function to replace the NULL
value:

SELECT Country, SUM(UFO_Sightings)
FROM UFO_Sightings
WHERE ISNULL(City,'') <> 'Paris'
GROUP BY Country

After the GROUP BY statement, the SELECT clause can only use the columns you’ve grouped by, or
aggregate functions on these or other columns, and the same goes for the ORDER BY (but conversely,
a column that is used in the GROUP BY clause doesn’t need to be used in the SELECT clause). That
means that you can’t include CountryID in the SELECT clause. The following would result in an
error:

SELECT CountryID, Country, SUM(UFO_Sightings)
FROM UFO_Sightings
GROUP BY Country

Msg 8120, Level 16, State 1, Line 35
Column 'UFO_Sightings.CountryID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP
BY clause.

This statement fails, even though, logically speaking, there is a 1-1 relationship between CountryID
and country; the point is that SQL does not know that. If you do want to include this column in the
SELECT list, you’ll either have to add it to the GROUP BY column listing, or join the result back to
the original table.
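A sketch of both options:

-- Option 1: add CountryID to the GROUP BY listing
SELECT CountryID, Country, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY CountryID, Country

-- Option 2: group on Country only, and join the result back to the original table
SELECT DISTINCT u.CountryID, g.Country, g.TotalUFO_Sightings
FROM ( SELECT Country, SUM(UFO_Sightings) AS TotalUFO_Sightings
       FROM UFO_Sightings
       GROUP BY Country) g
INNER JOIN UFO_Sightings u ON u.Country = g.Country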

Similar to the WHERE clause that filters records before the GROUP BY, you can filter records after
the grouping. The statement for this is the HAVING clause. For example, to return only information for
countries with more than 100 UFO sightings, the statement would be:
SELECT Country, SUM(UFO_Sightings) AS TotalNumberOfSightings
FROM UFO_Sightings
GROUP BY Country
HAVING SUM(UFO_Sightings) > 100;

Note that we do not use the alias TotalNumberOfSightings in the HAVING clause; HAVING is
processed before SELECT, therefore this alias is not yet available to filter on. In this case, we’re
using a very simple function, so it’s no problem using the function twice (in the SELECT and
HAVING clause). But in real life, functions may get complex, and readability would suffer. In that
case, it would be easy to rewrite this query using a CTE, and filter on the column alias:

WITH ufo AS (
SELECT Country, SUM(UFO_Sightings) AS TotalNumberOfSightings
FROM UFO_Sightings
GROUP BY Country)
SELECT *
FROM ufo
WHERE TotalNumberOfSightings > 100

Please verify for yourself that both queries are logically equivalent.

As for all the available aggregate functions, there are quite a few: AVG, CHECKSUM_AGG, COUNT,
COUNT_BIG, GROUPING, GROUPING_ID, MAX, MIN, STDEV, STDEVP, SUM, VAR and VARP.
We’ve only used SUM in the examples above. You can test the rest of these if you like. We’ve already
covered most of these in chapter 1, and we’ll cover GROUPING and GROUPING_ID later. There is
one more thing we’d like to note on one of these functions, though: the aggregate function COUNT
will include NULL values in a column if you use COUNT(*), but it will omit NULL values if you use
COUNT(column). You can test this by adding a country without any sightings:

INSERT INTO UFO_Sightings VALUES (4, 'Belgium', NULL, 'Brussels', NULL);

SELECT Country, COUNT(UFO_Sightings) AS Count_1, COUNT(*) AS Count_2
FROM UFO_Sightings
GROUP BY Country

Another important note on NULL values: when grouping on a column with NULL values, SQL
considers all NULL values equal, and groups them together. Usually, NULL does not equal NULL, so
it is important to be aware of this different behavior when grouping. Even though Germany actually
has states (called Bundesländer), and Berlin and Frankfurt are not in the same state, their records are
grouped, because we’ve omitted the state (to demonstrate this point):

SELECT Country, State, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY Country, State

This covers the simple GROUP BY statement. Now for the more complex GROUP BY logic:
ROLLUP, CUBE and GROUPING SETS. These complex statements are all different ways of
combining several grouping results. Along the way, we’ll cover the two remaining aggregate
functions (GROUPING() and GROUPING_ID).

Let’s start with ROLLUP. After grouping records into categories and subcategories, ROLLUP
aggregates the subcategories into categories, and the categories into one grand total. In the case of our
example, this means the following. Suppose we group all records on country and city to calculate the
total number of sightings per city, just like we’ve done before:

SELECT Country, City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY Country, City
ORDER BY Country, City

If we include the ROLLUP, we’ll get the same result, plus:
* the total number of sightings per country (with NULL for the city);
* the grand total of sightings (with NULL for both the country and the city).

So the following statement should return 5 extra rows (1 per country, 1 for the grand total):

SELECT Country, City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY ROLLUP (Country, City);

And so it does:

This means that the ROLLUP statement is logically equivalent to this one:

SELECT Country, NULL AS City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY Country
UNION ALL
SELECT Country, City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY Country, City
UNION ALL
SELECT NULL, NULL, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY ()

The ordering will probably be a bit different, and as you can see in the execution plan (if you are still
following along), the ROLLUP statement is more efficient, as it is one statement, not three. But still,
the end result is the same. By the way: an alternative syntax for the ROLLUP is:

GROUP BY Country, City WITH ROLLUP;

The NULL values in this result bring us nicely to the next subject: how do you distinguish the NULL
values that are caused by the grouping from the NULL values that were present in the ungrouped data?
The answer is the function GROUPING. This function takes the name of a column as input parameter,
and will return a 0 for a row if the value in that column for that row is not the result of grouping, and
a 1 for a row if the value in that column for that row is the result of grouping. To clarify, let’s see this
function in action:

SELECT Country, GROUPING(Country) AS Is_result_of_grouping_countries
, State, GROUPING(State) AS Is_result_of_grouping_states
FROM UFO_Sightings
GROUP BY Country, State WITH ROLLUP

In a real world example, you might use this function for further filtering, or in a report, to display
something more informative, for example:

SELECT CASE
WHEN GROUPING(Country) = 1 THEN 'Grand Total'
WHEN GROUPING(State) = 1 THEN 'Total for ' + Country
ELSE ''
END AS Totals
, Country
, State
,SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY Country, State WITH ROLLUP
The other aggregate function we still have to discuss, GROUPING_ID, has a similar purpose. Like
the GROUPING function, this GROUPING_ID function will, in real life examples, probably be used
for further filtering. It is a bit harder to explain, so bear with me; the example will probably be
sufficient. This function takes, as input parameter, the exact same column listing you’ve used in the
GROUP BY listing, and returns a number as output. This number is the weighted sum of the result of
the GROUPING function for each of the columns. The GROUPING function returns either 0 or 1; the
weight is based on the position in the column list, counting from the right: the rightmost column has a
weight of 1 (so either 0*1 or 1*1), the column to its left a weight of 2 (so either 0*2 or 1*2), the next
a weight of 4 (so either 0*4 or 1*4), the one after that a weight of 8, etc.
To see this function in action:

SELECT Country, State, GROUPING_ID(Country, State)
FROM UFO_Sightings
GROUP BY Country, State WITH ROLLUP

The result:

As this is complicated, let’s see what happens when we add a third level to group on: continent.

ALTER TABLE UFO_Sightings ADD Continent varchar(100) NULL

GO --required batch separator, otherwise query will fail on missing column

UPDATE UFO_Sightings
SET Continent = 'Europe'
WHERE Country IN ('Germany', 'France', 'Belgium')

UPDATE UFO_Sightings
SET Continent = 'North America'
WHERE Country IN ('United States')
And if that still does not explain the GROUPING_ID function, maybe this will help. The last column
in the following SELECT statement calculates the exact same result as the GROUPING_ID function,
but using the GROUPING function:

SELECT Continent
, Country
, State
, GROUPING_ID(Continent, Country, State )
, GROUPING(Continent) * 4 + GROUPING(Country) * 2 + GROUPING(State)
FROM UFO_Sightings
GROUP BY Continent, Country, State WITH ROLLUP

Apart from 0 for the detail rows, the only values we get from the GROUPING_ID function are 1, 3 & 7.
That is because ROLLUP only aggregates from right to left: records for the same city (and in the same
country & same continent), records for the same country etc. There is no grouping for the same city in
different countries. In our case, that wouldn’t make sense, as Paris, Texas is definitely something
different from Paris, France. If we would like to group on the same city in different countries, we’d
use the CUBE keyword. So let’s try this, and see what happens in Paris:

SELECT Country, City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY CUBE (Country, City)
ORDER BY 1, 2;

As you can see, the numbers of sightings in Paris, Texas and Paris, France are added together. CUBE
can best be explained as follows. A grouping of columns A and B with CUBE is the total of 4 different
groupings: (A & B), (A), (B) and (). You can check this for yourself by running the following code:

-- CUBE dissected: A & NULL, NULL & B, A & B, NULL & NULL
SELECT Country, NULL AS City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY (Country)
UNION ALL
SELECT NULL, City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY ( City)
UNION ALL
SELECT Country, City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY Country, City
UNION ALL
SELECT NULL, NULL, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
ORDER BY 1, 2

In both of these examples, we order by column number instead of by column name; not a good
practice for finished code, but very convenient while writing code.

Now that we’ve demonstrated how to group on cities with the same name in different countries, we
should see more values for the GROUPING_ID function. So let’s try that, by grouping on
continent, country & city with CUBE:

SELECT Continent
, Country
, City
, SUM(UFO_Sightings) AS TotalUFO_Sightings
, GROUPING_ID(Continent,Country, City)
FROM UFO_Sightings
GROUP BY CUBE (Continent, Country, City)
ORDER BY 5;

The result is too big to show in a screenshot, but if you follow along, you’ll see all values from 0 to 7
(the highest possible number when grouping on 3 columns; 2^3 -1).

The last thing we have to talk about is grouping sets. When you use grouping sets, you combine
multiple GROUP BY clauses into a single GROUP BY list. The results are combined using UNION
ALL. Let’s take two sets: group by city, and group by country. The query would look like this:

SELECT Country, City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY GROUPING SETS ( Country, City)

This is the logical equivalent of:

SELECT Country, NULL, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY Country
UNION ALL
SELECT NULL, City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY City

If you’d like to add the grand total, you can use the syntax we discussed earlier, using only round
brackets () . That would make this query look like this:

SELECT Country, City, SUM(UFO_Sightings) AS TotalUFO_Sightings
FROM UFO_Sightings
GROUP BY GROUPING SETS ( Country, City, ())

This would combine three groupings: a grouping by city, a grouping by country and a grand total.

That concludes our tour of the more complex grouping operations: ROLLUP, CUBE and GROUPING
SETS. We’ll get back to grouping when we compare it to the windowing functions, but in order to do
that, we’ll first have to explain those windowing functions. That’s up next.

Use windowing functions to group and rank the results of a query


As we’ve just seen, when using GROUP BY constructions, only aggregated data is returned; data from
the individual records is not present in the result set. Using windowing functions, you can return both
the data from the individual records, and data about the grouped records. SQL Server has three types
of windowing functions: aggregate, ranking and analytic functions. As we’ve already covered the
aggregate functions for the GROUP BY construction, we’ll start there.
For the following demos, we’ll make a table in which three friends keep track of their running races.
As an aside: we’ll store the time it takes to complete the race in a column with data type time. This is
not technically correct, as the time data type refers to the time of day, not an amount of time. It would
be more correct to store this race time as a number of seconds. But since we’re not going to perform
many calculations on this column, this won’t be a problem, and it is easier to read the examples using
the time data type.

DROP TABLE IF EXISTS Runner_info

CREATE TABLE Runner_info (
Runner varchar(50)
,Nickname varchar(50)
,Race varchar(50)
,Race_date date
,Distance decimal(8,3)
,Finish_time time)
GO
INSERT Runner_info VALUES ('Jack', 'The Flash', 'New York marathon', '2016-11-06', 42.195, '2:53:00')
INSERT Runner_info VALUES ('Jack', 'The Flash', 'New York marathon', '2017-11-05', 42.195, '2:55:00')
INSERT Runner_info VALUES ('Jack', 'The Flash', 'Berlin marathon', '2016-09-25', 42.195, '2:45:00')

INSERT Runner_info VALUES ('Juan Gonzalez', 'Speedy', 'New York marathon', '2016-11-06', 42.195, '3:22:00')
INSERT Runner_info VALUES ('Juan Gonzalez', 'Speedy', 'New York marathon', '2017-11-05', 42.195, '3:18:30')

INSERT Runner_info VALUES ('Rudy', 'Roadrunner', 'New York marathon', '2016-11-06', 42.195, '4:01:00')
INSERT Runner_info VALUES ('Rudy', 'Roadrunner', 'New York marathon', '2017-11-05', 42.195, '4:05:30')
INSERT Runner_info VALUES ('Rudy', 'Roadrunner', 'Berlin marathon', '2016-09-25', 42.195, '3:59:59')
INSERT Runner_info VALUES ('Rudy', 'Roadrunner', 'Berlin marathon', '2017-09-24', 42.195, '4:13:00')
INSERT Runner_info VALUES ('Rudy', 'Roadrunner', 'Rotterdam half marathon', '2017-04-07', 21.1, '1:52')

SELECT *
FROM Runner_info

Aggregate window functions


After applying the grouping function, all information of individual records is lost in the result set.
Using windowing functions, that is not the case. In windowing functions, a group of records is called
a partition, and you apply a function to a partition (or a frame within a partition; more on this later).
For each row of the result set, a windowing function will return information on the partition that
particular row is a member of.
Let’s start off easy. We’ll define just one partition: all records. As you can see, the syntax for defining
a single partition is similar to grouping on all records. The following query will return the same
result set as above, plus a column with the total number of races:

SELECT *
,COUNT(*) OVER () AS Nr_of_races_total
FROM Runner_info

The next step is to create a partition per runner. In order to demonstrate that each windowing function
in a query can use different partitioning definitions, we’ve added this second window function. This
is how we do that:

SELECT *
,COUNT(*) OVER () AS Nr_of_races_total
,COUNT(*) OVER (PARTITION BY Runner) AS Nr_of_races_per_runner
FROM Runner_info

We like to keep examples as concise as possible, so we’re not using a WHERE clause. If we would,
you’d see that the WHERE clause is evaluated before the windowing function. So for instance, if
we’d want to filter out Rudy’s half marathon, we’d simply add a WHERE clause:

SELECT *
,COUNT(*) OVER (PARTITION BY Runner) AS Nr_of_races_per_runner
FROM Runner_info
WHERE Distance = 42.195

This also means that if we want to filter based on the results of the windowing function, we need
something extra. GROUP BY has the optional HAVING clause; there is no equivalent for windowing
functions. That means that, in order to filter based on the results of the windowing functions, you need
to use a common table expression (or one of the alternatives we discussed earlier), and apply the
filtering to a query on that CTE.
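As a minimal sketch (using the Nr_of_races_per_runner column from the query above; the threshold of 5 races is just an arbitrary example), that could look like this:

WITH races AS (
SELECT *
,COUNT(*) OVER (PARTITION BY Runner) AS Nr_of_races_per_runner
FROM Runner_info)
SELECT *
FROM races
WHERE Nr_of_races_per_runner >= 5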

You can GROUP BY more than one column; the same goes for partitioning. This is how you partition
on two columns. Let’s do that, to see the best result for each runner for each race. And to really
hammer in the point that you can mix data from the aggregated records as well as the individual
records, we’ll calculate the difference for each race (compared to the personal best for that course).

SELECT Runner
,Race
,Race_date
,Finish_time
,MIN(Finish_time) OVER (PARTITION BY Runner, Race) AS Race_PR
,DATEDIFF(mi,MIN(Finish_time) OVER (PARTITION BY Runner, Race), Finish_time) AS Slower_than_PR
FROM Runner_info
Here, we’ve used the function MIN to determine the lowest value, but we could have used any
aggregate function, except for GROUPING and GROUPING_ID; these can only be used in queries
with a GROUP BY clause, not in an OVER clause.

As stated, aggregate functions do not necessarily have to be applied to an entire partition; they can
also be applied to a frame within a partition. You define a lower and upper boundary for the frame
in the following manner:

OVER ( PARTITION BY column
ORDER BY column
ROWS BETWEEN … AND …)

In order for this to mean anything, you need to order the partition (the query will return an error if you
omit the ORDER BY clause). There is an alternative for ROWS: this is RANGE, which we’ll cover
later. But first, we have to explain the options for defining the lower and upper boundary of the frame.
For the lower boundary of the frame, you specify either:
* the current row (using CURRENT ROW);
* a number of rows before the current row (using n PRECEDING, where n is the number of rows);
* all rows before the current row (using UNBOUNDED PRECEDING).

For the upper boundary of the frame, you have similar options:
* the current row (using CURRENT ROW);
* a number of rows after the current row (using n FOLLOWING, where n is the number of rows);
* all rows after the current row (using UNBOUNDED FOLLOWING).

This functionality is very useful for calculating moving averages and cumulative totals. Let’s calculate
the running total distance for all races Rudy has run. In order to do this, we need to order on race
date, use UNBOUNDED PRECEDING as lower boundary for the frame, and CURRENT ROW as
upper boundary:

SELECT Runner, Race, Race_date, distance
,SUM(distance) OVER (PARTITION BY Runner
ORDER BY Race_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS Cumulative_distance
FROM Runner_info
WHERE Runner = 'Rudy'
ORDER BY Runner, Race_date

As you can see, the cumulative distance increases with each race. By the way: “ROWS
UNBOUNDED PRECEDING” can be used instead of “ROWS BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW”; the default for the upper boundary is CURRENT ROW.

The same way you can calculate a moving average. Usually, a moving average is calculated using a
number of preceding values, but we’ll use preceding and following values (just so we can show that
syntax as well):

SELECT Runner, Race, Race_date, distance
,AVG(distance) OVER (PARTITION BY Runner
ORDER BY Race_date
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)
AS Running_average
FROM Runner_info
WHERE Runner = 'Rudy'
ORDER BY Runner, Race_date

As we mentioned earlier, there is also an alternative for ROWS: RANGE. The difference is that
ROWS bases the window frame purely on row position, and treats rows with the same value in the
ORDER BY column as separate rows; RANGE treats such duplicate values as a single group, and
includes all of them in the frame (later on, we’ll see a similar thing when we look at the ranking
window functions).
We’ve ordered on race_date, and in Rudy’s running results, there are no duplicate values for race
date. So let’s create one, and demonstrate the difference between RANGE and ROWS.

INSERT Runner_info VALUES ('Rudy', 'Roadrunner', 'Ran home', '2017-11-05', 10, '1:10')
First, the exact same query we saw earlier, using ROWS:

SELECT Runner, Race, Race_date, distance
,SUM(distance) OVER (PARTITION BY Runner
ORDER BY Race_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS Cumulative_distance
FROM Runner_info
WHERE Runner = 'Rudy'
ORDER BY Runner, Race_date

Now, using RANGE instead of ROWS:

SELECT Runner, Race, Race_date, distance
,SUM(distance) OVER (PARTITION BY Runner
ORDER BY Race_date
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS Cumulative_distance
FROM Runner_info
WHERE Runner = 'Rudy'
ORDER BY Runner, Race_date

As you can see, using RANGE, the distance for row 6 is added to the cumulative distance for row 5,
as the race_date for both rows is the same (and we’re ordering by race_date), whereas using ROWS,
the distance for row 6 is not added to the cumulative distance for row 5.

We’ve now demonstrated how an aggregate function can be applied to a partition, or a frame within
an (ordered) partition. That’s all for the aggregate windowing functions.
Ranking window functions
Ranking functions return a number, based on the order of the row in a partition. There are 4 ranking
functions: ROW_NUMBER, RANK, DENSE_RANK and NTILE.
ROW_NUMBER will assign an incrementing number, starting with 1 for the first row, 2 for the
second row etc. RANK does something similar, with one difference: identical values in the row will
have an identical outcome for the RANK function (similar to what we saw with ROWS & RANGE
earlier). So if rows 2 and 3 have an equal value, the RANK function will return 1, 2, 2 and 4, for the
first 4 rows (whereas ROW_NUMBER will return 1, 2, 3, 4). DENSE_RANK will do the same as
RANK, except for the gap: using the same example where rows 2 and 3 have an equal value,
DENSE_RANK will return 1, 2, 2 and 3, for the first 4 rows. NTILE requires an input parameter: an
integer, representing a number of groups to divide the rows in. The NTILE function will return the
number of the group the record is in. So NTILE(5) will divide all records into 5 groups. If the number
of records cannot be divided neatly into groups, the remainder is divided by adding 1 record to each
group, starting with the lowest numbered group.

We’ll demonstrate the ranking functions based on the race time for the marathon.

SELECT Runner, Race, Race_date, Finish_time
,ROW_NUMBER() OVER (ORDER BY Finish_time) AS rownumber
,RANK() OVER (ORDER BY Finish_time) AS rank
,DENSE_RANK() OVER (ORDER BY Finish_time) AS denserank
,NTILE(5) OVER (ORDER BY Finish_time) AS ntile
FROM Runner_info
WHERE Distance = 42.195
ORDER BY rownumber

This is the result:

As you can see, NTILE cannot divide all records neatly into 5 groups. Nine divided by 5 (into whole
numbers) is 1, with a remainder of 4; therefore, all groups get 1 record, and the first 4 get 1 extra.
With the current data, ROW_NUMBER, RANK and DENSE_RANK produce equal results. As the
difference between these functions can only be demonstrated when there are duplicates, we’ll
introduce one:
INSERT Runner_info VALUES ('Fred', 'The Shadow', 'New York marathon', '2016-11-06', 42.195, '2:53:00')
This time is the same as the 2nd fastest time.

The added record has no effect on the result of the function ROW_NUMBER; for the RANK function,
there are now two rows with rank 2, but no row with rank 3; and for the DENSE_RANK function,
there are now two rows with rank 2, but the row with ROW_NUMBER 4 has DENSE_RANK 3 (no
gaps in the dense rank).

That’s all for the ranking functions; now on to the analytic functions.

Analytic window functions


SQL Server 2016 supports 8 analytic window functions:
* LAG & LEAD
* FIRST_VALUE & LAST_VALUE
* CUME_DIST, PERCENTILE_CONT, PERCENTILE_DISC & PERCENT_RANK.

For all of these examples, we’ll use the dbo.Employees table. To be more precise: the columns
FirstName, LastName, Salary and Department. Currently, this is the relevant content of my copy of the
Employees table:

SELECT FirstName, LastName, Department, Salary
FROM dbo.Employees

Make sure you have something similar so you can follow along.
We’ll start with the first two, LAG and LEAD. LAG will retrieve data from the previous row in a
given result set (or more accurately, a partition within a result set); LEAD will retrieve data from the
following row. As we’ve seen before, since a result set is not ordered without an ORDER BY clause,
we’ll have to specify one for the LAG (or LEAD) function (otherwise, “previous” or “following”
have no meaning). Retrieving data from a previous row in a given set is something that, without these
windowing functions, would require a self-join or cursor in older versions of SQL; the use of the
windowing function is more elegant, and probably faster.
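Just to illustrate the contrast, the old-style approach might look roughly like this (a sketch using a correlated subquery instead of a self-join; it ignores ties in Salary):

SELECT e1.FirstName, e1.LastName, e1.Department, e1.Salary
,( SELECT MAX(e2.Salary)
FROM Employees e2
WHERE e2.Salary < e1.Salary) AS previous_salary
FROM Employees e1
ORDER BY e1.Salary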

First, a simple example:

SELECT FirstName, LastName, Department, Salary
,LAG(Salary) OVER (ORDER BY Salary) as lag
,LEAD(Salary) OVER (ORDER BY Salary) as lead
FROM Employees

As you can see, for every row in the result set, the salary from the previous row is returned by the
LAG function, and the salary from the following row is returned by the LEAD function. For the first
row, there is no previous row, so LAG returns NULL; likewise, LEAD returns NULL for the last row.
It is important to note that the ORDER BY in the windowing function is separate from an ORDER BY
clause for the result set. For instance, we can order the result set on a different column:

SELECT FirstName, LastName, Department, Salary
,LAG(Salary) OVER (ORDER BY Salary) as lag
,LEAD(Salary) OVER (ORDER BY Salary) as lead
FROM Employees
ORDER BY LastName

LAG and LEAD functions have two optional parameters. The first one is offset: the distance to the
current row. By default, this is 1. This example shows how to retrieve not the row before the current
row, but the one before that (i.e. the offset is 2):

SELECT FirstName, LastName, Department, Salary
,LAG(Salary, 2) OVER (ORDER BY Salary) as lag
,LEAD(Salary, 2) OVER (ORDER BY Salary) as lead
FROM Employees

This also has the effect that for the two records with the lowest salaries, the LAG function will return
NULL.
The second optional parameter for LAG and LEAD is the replacement value (for these NULL values).

SELECT FirstName, LastName, Department, Salary
,LAG(Salary, 2, 'not applicable') OVER (ORDER BY Salary) as lag
,LEAD(Salary, 2, 'not applicable') OVER (ORDER BY Salary) as lead
FROM Employees

The data type of this replacement value, however, has to be compatible with the data type of the
column; therefore, this statement will return an error:

Msg 8114, Level 16, State 5, Line 45
Error converting data type varchar to numeric.

Using zero as a replacement value would work, but wouldn’t have meaning in this case. Hopefully,
you remember from chapter one how to make this statement work for yourself (tip: use the functions
CAST and ISNULL).
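One possible solution, as a sketch (other combinations of CAST and ISNULL work just as well): first let LAG or LEAD return its normal, possibly NULL, numeric result, then CAST that result to a string and replace NULL using ISNULL.

SELECT FirstName, LastName, Department, Salary
,ISNULL(CAST(LAG(Salary, 2) OVER (ORDER BY Salary) AS varchar(20)), 'not applicable') as lag
,ISNULL(CAST(LEAD(Salary, 2) OVER (ORDER BY Salary) AS varchar(20)), 'not applicable') as lead
FROM Employees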

The last thing we need to discuss for the LAG and LEAD functions is that they can also be applied to a
partition of the total result set. For example, let’s partition by department:

SELECT FirstName, LastName, Department, Salary
,LAG(Salary) OVER (PARTITION BY Department ORDER BY Salary) as lag
,LEAD(Salary) OVER (PARTITION BY Department ORDER BY Salary) as lead
FROM Employees
ORDER BY Department
The LAG and LEAD functions are not limited to columns; sub-queries or expressions can be used as
well, but as this works pretty much the same, we won’t give any examples of that. We’ve shown all
the required building blocks for the LAG and LEAD functions, so we’ll move on to the next two:
FIRST_VALUE & LAST_VALUE.

As the name suggests, the function FIRST_VALUE will return the first value of a given partition. The
function LAST_VALUE is a bit confusing: it will return the value of the current row, unless you specify
ROWS (or RANGE) BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. I like to think of it
this way: the default frame ends at the current row, and rows following the current one are not included
in the window unless you explicitly define the frame that way (consistent with the syntax of the
aggregate window functions).
So, to return the least and most earning employee per department:

SELECT FirstName, LastName, Department, Salary
,FIRST_VALUE(FirstName + ' ' + LastName)
OVER ( PARTITION BY Department
ORDER BY Salary)
as Least_earning_employee
,LAST_VALUE(FirstName + ' ' + LastName)
OVER ( PARTITION BY Department
ORDER BY Salary
RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
as Most_earning_employee
FROM Employees
ORDER BY Department, Salary

As far as the other analytic functions are concerned (CUME_DIST, PERCENTILE_CONT,
PERCENTILE_DISC & PERCENT_RANK): these are not commonly used, so you might not
encounter them on the exam. All four of them are used to calculate a statistical number representing
where a given value is in the distribution of values in that column. If you’re not interested in statistics,
or in learning about topics you might not encounter on the exam, you can move on to the next exam
objective; otherwise, just follow along with these examples.

CUME_DIST returns a number between 0 and 1 that represents the fraction of rows with a value less
than or equal to the value of the current row:

SELECT FirstName, LastName, Department, Salary
,CUME_DIST() OVER (ORDER BY Salary) as cumulative_distribution
,CUME_DIST() OVER (ORDER BY Salary) * COUNT(*) OVER () as equal_or_less
FROM Employees
ORDER BY Salary

In the last column, we’ve multiplied the cumulative distribution by the total number of rows (7) using an
aggregate windowing function we saw earlier.

The functions PERCENTILE_CONT and PERCENTILE_DISC require, as an input parameter, a
value between 0 and 1, and will return a value where the input parameter represents the fraction of
the values lower than the value returned. In other words: if you use 0.5 as an input parameter and
apply the function to a window on salary, these functions will return a salary where 50% of salaries
in the window are lower than the value returned. This will calculate the median value.
The difference between the functions PERCENTILE_CONT and PERCENTILE_DISC is that
PERCENTILE_DISC will return a value that is actually in the data set (a discrete value);
PERCENTILE_CONT treats the data set as continuous, and may return an interpolated value (for
example, with 0.5 as input value, it will return the average of the two middle values for an even
number of records).
To demonstrate the difference, we’ll add another record (so we’ll have an even number of records):

INSERT dbo.Employees (FirstName, LastName, Salary, Department)
VALUES ('James', 'Peterson', 900, 'Engineering')

Let’s use .5 as input parameter, to calculate a value for the salary where 50% of values are lower.

SELECT FirstName, LastName, Department, Salary
,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Salary) OVER () as median_salary_cont
,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Salary) OVER () as median_salary_disc
FROM Employees
ORDER BY Salary
As you can see, the PERCENTILE_DISC returned a value that is actually in the data set;
PERCENTILE_CONT did not.

The last one, PERCENT_RANK, will return a number between 0 and 1, representing the fraction of
values that are lower than the current value (excluding the current value). In the next example, we’ll
also include a percentage, as this will be more familiar to most readers:

SELECT FirstName, LastName, Department, Salary
,PERCENT_RANK() OVER (ORDER BY Salary) as fraction_lower_than_this
,CAST(PERCENT_RANK() OVER (ORDER BY Salary) * 100 AS dec(5,1))
as percentage_lower_than_this
FROM Employees
ORDER BY Salary

This covers the analytic windowing functions. To recap: LAG and LEAD will return the previous and
following value in the window of a result set for a given column; FIRST_VALUE and LAST_VALUE
will return the first and last value in the window of a result set for a given column; and CUME_DIST,
PERCENTILE_CONT, PERCENTILE_DISC & PERCENT_RANK will return a value that
represents where the current value is in the distribution of values in that column.

Distinguish between using windowing functions and GROUP BY


We’ve already mentioned this a few times, but as it is an exam objective, it bears repeating. When
using GROUP BY constructions, only aggregated data is returned; data from the individual records is
not present in the result set. Using windowing functions, you can return both the data from the
individual records, and data about the grouped records.

Construct PIVOT and UNPIVOT statements to return desired results based on supplied data
A PIVOT statement does a combination of two things. First, it turns rows into columns. Second, it
aggregates data (which is the reason why we discuss this PIVOT statement right after the grouping
and windowing functions). And the UNPIVOT statement does almost the exact opposite.
This explanation probably is not that clear. You can look up the definition in Microsoft Docs, but that
explanation is probably not that clear either, so we’ll start with a realistic example of a situation that
would benefit from using a PIVOT statement, the output and the actual PIVOT statement. After that,
we’ll give a better explanation of the PIVOT statement and its components, based on this example,
and start building a PIVOT statement from scratch based on a single, simple table.

Suppose your manager wants a list of the quantity of items sold for the top 100 products for each of
the top 100 customers. With the knowledge from the previous chapters, you can retrieve this data from
the WideWorldImporters database. For the sake of convenience, let’s limit ourselves to the top 5,
define our top 5 customers as the ones with the most orders (disregarding order amount) and the top 5
products as the products that appear on most order lines (disregarding quantity). Your query would
look something like this:

SELECT CustomerName, si.StockItemName, sum(quantity) as quantity
FROM sales.Customers c
INNER JOIN sales.Orders o on c.CustomerID = o.CustomerID
INNER JOIN sales.OrderLines ol on o.OrderID = ol.OrderID
INNER JOIN Warehouse.StockItems si on ol.StockItemID = si.StockItemID
WHERE c.CustomerID IN ( SELECT TOP 5 CustomerID
FROM Sales.Orders
GROUP BY CustomerID
ORDER BY COUNT(*) DESC)
AND si.StockItemID IN ( SELECT TOP 5 StockItemID
FROM Sales.Orderlines
GROUP BY StockItemID
ORDER BY COUNT(*) DESC)
GROUP BY CustomerName, si.StockItemName
ORDER BY CustomerName, si.StockItemName

And the result would look like this:

Your manager is satisfied with the data, but not with the format of the output. Just run the same query
for the top 100, and you can see why not: it is just not informative. What he wants is this:
There are three ways you can achieve this (without disappointing your manager). First, you can take
Excel and start cutting and pasting. Second, you can rewrite this statement using a lot of CASE or
JOIN statements. This will do the trick, but the resulting code will not be legible. The third option is
the topic of this exam objective: use the PIVOT statement.

This is the PIVOT statement to achieve this result:

WITH cte AS(
SELECT CustomerName, si.StockItemName, quantity
FROM sales.Customers c
INNER JOIN sales.Orders o on c.CustomerID = o.CustomerID
INNER JOIN sales.OrderLines ol on o.OrderID = ol.OrderID
INNER JOIN Warehouse.StockItems si on ol.StockItemID = si.StockItemID
WHERE si.StockItemID IN ( SELECT TOP 5 StockItemID
FROM Sales.Orderlines
GROUP BY StockItemID
ORDER BY COUNT(*) DESC)
)
SELECT *
FROM cte
PIVOT (SUM(quantity)
FOR CustomerName IN ([Anand Mudaliyar],[Jibek Juniskyzy]
,[Agrita Abele],[Kalyani Benjaree],[Jaroslav Fisar]))
AS SalesQuantity

So this is a PIVOT statement. Using a table (or in this case a derived table) as input, SQL will create
a result set where:
* a list of specified values from one column from the input will be turned into separate columns in the
output (in this case: the 5 listed names for the column CustomerName);
* a second column from the input will be used to calculate aggregate values for the new output
columns (in this case: the column quantity);
* the third column from the input will be used to group by (in our case: the column StockItemName).
Or to be more precise: all remaining columns from the input will be used to group by. Usually, you
only want to group on one column, so if your table has more than the three required columns, use a
derived table to return only those three (eliminating the rest), as we’ve done here.

Hopefully, this makes it clear what a PIVOT statement is, and what it is used for. Now, let’s start
building one from scratch to make clear how it works. For that, we’ll create a simpler version of a
Sales table, with just three people buying groceries:
USE TestDB
GO
DROP TABLE IF EXISTS Sales;
GO
CREATE TABLE Sales(
Customer varchar(50)
, Product varchar(50)
, Quantity int)

INSERT INTO Sales(Customer, Product, Quantity)
VALUES('John','Apples',4)
,('John','Bananas',7)
,('John','Cantaloupe',1)
,('John','Apples',5)
,('John','Beer',6)
,('Bill','Cantaloupe',3)
,('Bill','Beer',24)
,('Diana','Apples',6)
,('Diana','Bananas',5)
,('Diana','Cantaloupe',2)

SELECT *
FROM Sales

First, we’ll pivot by customer:

-- Pivot customers
SELECT Product, Bill, John, Diana
FROM Sales
PIVOT (SUM(Quantity)
FOR Customer IN (Bill, John, Diana)) AS some_alias
ORDER BY Product

Note the list of values: FOR Customer IN (). These will be turned into columns. Unfortunately, these
values have to be hard coded into the statement; there is no simple way of creating a statement that
fetches these values from the table. There is a hard way: creating a statement using dynamic SQL. On
the web site, you can find a sample script for this.
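To give a rough idea of that dynamic SQL approach (a sketch only, not the script from the web site; it uses FOR XML PATH and QUOTENAME to build the column list from the Sales table created above, and sp_executesql to run the generated statement):

DECLARE @cols nvarchar(max), @sql nvarchar(max);
-- Build a comma-separated, bracketed list of customer names
SELECT @cols = STUFF(( SELECT DISTINCT ',' + QUOTENAME(Customer)
FROM Sales
FOR XML PATH('')), 1, 1, '');
SET @sql = N'SELECT Product, ' + @cols + N'
FROM Sales
PIVOT (SUM(Quantity) FOR Customer IN (' + @cols + N')) AS some_alias
ORDER BY Product;';
EXEC sp_executesql @sql;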
Another thing worth mentioning: like a derived table, the PIVOT statement needs to have an alias.
You can reference this alias in the SELECT clause, but that is optional.

The result:

You can just as easily pivot by product. Let’s do that, and see what happens when we include a
product in the listing that none of these customers has bought:

-- Pivot product
SELECT Customer, Apples, Bananas, Cantaloupe , Beer, [Magic mushrooms]
FROM Sales
PIVOT (SUM(Quantity)
FOR Product IN (Apples, Bananas, Cantaloupe , Beer, [Magic mushrooms])) AS some_alias
ORDER BY Customer

As you can see, all values for total sales quantity for magic mushrooms are NULL.

We’ve now demonstrated the use of all three columns. We mentioned earlier on that a PIVOT
statement needs three columns: one from which a list of specified values will be turned into columns,
one on which an aggregate function will be applied (the result of this function will become the values
for the newly created columns), and that all remaining columns will be used to group on. Usually,
there is only one remaining column; let’s add one and find out what happens.

-- adding a column
ALTER TABLE Sales ADD CustomerID int NULL
GO
UPDATE Sales SET CustomerID = 1 WHERE Customer = 'Bill'
UPDATE Sales SET CustomerID = 2 WHERE Customer = 'John'
UPDATE Sales SET CustomerID = 3 WHERE Customer = 'Diana'

SELECT Product, Bill, John, Diana
FROM Sales
PIVOT (SUM(Quantity)
FOR Customer IN (Bill, John, Diana)) AS some_alias
ORDER BY Product

This is the result:


Effectively, you’ve now performed a GROUP BY (Product, CustomerID); this combination was
probably not useful. If you rerun the statement that pivots by product, you’ll effectively GROUP BY
(Customer, CustomerID), and since there is a 1-on-1 relation between them, the fourth column doesn’t
make a difference in that case.

Let’s drop the column, and move on to unpivoting:

ALTER TABLE Sales DROP COLUMN CustomerID

Unpivoting is basically the reverse. You take a list of columns, and store the values in a single
column. To demonstrate this, we’ll create a table by pivoting into a new table:

SELECT Customer, Apples, Bananas, Cantaloupe , Beer, [Magic mushrooms]
INTO Pivoted_sales_table
FROM Sales
PIVOT (SUM(Quantity)
FOR Product IN (Apples, Bananas, Cantaloupe , Beer, [Magic mushrooms])) AS some_alias
ORDER BY Customer

SELECT *
FROM Pivoted_sales_table

Now, let’s unpivot the table:

SELECT Customer, Product, Quantity
FROM Pivoted_sales_table
UNPIVOT( Quantity FOR
Product IN ( Apples, Bananas, Cantaloupe
, Beer, [Magic mushrooms])
) AS some_alias
And the result:

As you can see, the unpivot statement has no aggregate function. And there is no un-aggregate function
to completely reverse the effect of pivoting. In the original table, there were two records for John
buying apples, and SQL has no way of knowing that from the Pivoted_sales_table. So unpivoting is
not the exact opposite of pivoting.

Determine the impact of NULL values in PIVOT and UNPIVOT queries


It is important to be aware of the impact of NULL values in all statements. In the examples above, you
can see the resulting NULL values. For example, Bill did not buy apples; therefore, the
SUM(Quantity) of his apple sales is NULL. In chapter 1, we saw how to replace a NULL value (using
ISNULL or COALESCE). Note that you can’t do this directly in the PIVOT statement, but that you
have to do this in the SELECT statement.
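For example, to show 0 instead of NULL for the customers that did not buy a product, the replacement goes in the SELECT list (a small sketch based on the first PIVOT query on this table):

SELECT Product, ISNULL(Bill, 0) AS Bill, ISNULL(John, 0) AS John, ISNULL(Diana, 0) AS Diana
FROM Sales
PIVOT (SUM(Quantity)
FOR Customer IN (Bill, John, Diana)) AS some_alias
ORDER BY Product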
Also, different aggregation functions will produce different results. If we take our first PIVOT query
on this table, and use COUNT instead of SUM, no NULL values will be returned:

-- Note: COUNT(*) is not allowed in a PIVOT, only COUNT(column)
SELECT Product, Bill, John, Diana
FROM Sales
PIVOT (COUNT(Quantity)
FOR Customer IN (Bill, John, Diana)) AS some_alias
ORDER BY Product

We’ll leave it to the reader to test the effect of the other aggregate functions (MIN, MAX, AVG etc.)
on NULL values. You can enhance this example by adding a record to the sales table with a
NULL value for quantity (a small sketch follows below). But note that you cannot use two aggregate
functions in the same PIVOT statement, so the following won’t work:
SELECT Customer, Apples, Bananas, Cantaloupe , Beer, [Magic mushrooms]
FROM Sales
PIVOT (MIN(Quantity), MAX(Quantity)
FOR Product IN (Apples, Bananas, Cantaloupe , Beer, [Magic mushrooms])) AS some_alias
ORDER BY Customer
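If you want to experiment with a NULL value for quantity, as suggested above, a record like this will do (a small sketch); re-run the PIVOT with SUM, COUNT, MIN, MAX and AVG and compare the results:

INSERT INTO Sales(Customer, Product, Quantity) VALUES ('Diana', 'Beer', NULL)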

That’s all you need to know about pivoting. Before we move on, let’s drop these tables:

USE TestDB
GO
DROP TABLE IF EXISTS Sales;
DROP TABLE IF EXISTS Pivoted_sales_table;
Query temporal data and non-relational data
In this section, we’re going to cover the use of two types of non-relational data in SQL Server: JSON
and XML. We’ll briefly explain non-relational data, how to store non-relational data in a relational
database, how to query non-relational data inside a relational database, and how to query relational
tables to output JSON or XML data. Also, we’ll briefly discuss when not to use non-relational data.
A note up front: JSON and XML have a long history outside SQL Server, and are themselves the
topics of entire books. The implementation of XML in SQL Server, especially, is complex. In covering
these exam objectives, we’ve decided to limit our description to the most basic concepts and
examples, while still covering the most important topics for the exam and keeping the text
understandable for readers without experience using either XML or JSON; at times, this will be at the
expense of the use of precise terminology. If you feel you need a deeper understanding of either of
these topics, we encourage you to read books dedicated to these topics.

But first, we’ll start with a topic that is not related to non-relational data: temporal data.

Query historic data by using temporal tables


In the WideWorldImporters database, there is a table called Countries. This stores, among other
things, the last recorded population. So if you’d like to know the current population of Australia, this
is the SQL query that will answer that question:

USE WideWorldImporters
GO

SELECT CountryName, LatestRecordedPopulation
FROM Application.Countries

It’s a little over 21 million. But what if you would like to know what the population of Australia was
some years ago? This is what temporal tables are for. The Countries table is a system-versioned
temporal table, which means that SQL will automatically keep track of older versions of records, and
allows you to query those older versions. The following query will give you the population count of
Australia on January 1st, 2013:

SELECT CountryName, LatestRecordedPopulation
FROM Application.Countries
FOR SYSTEM_TIME AS OF '2013-01-01 00:00:00.0000000'
WHERE CountryName = 'Australia'

Before you start digging into the data to discover the population growth of your country: this is just
sample data, and unfortunately, there is not that much historical data in this table. But there is enough
data there for us to demonstrate:
* how to query temporal tables for a specific time (or time interval) in the past;
* how SQL uses a history table to store older versions of the records;
* how SQL uses two time columns to keep track of the time.

After that, we’ll create a temporal table of our own, so that we can demonstrate:
* how to create a temporal table from scratch, from an existing base table or from an existing base
table and history table;
* the requirements for a temporal table;
* the effect of inserts, updates and deletes on a temporal table.

There are a lot of business scenarios that require you to keep track of older versions of records. In the
past, you had to write your own code to achieve this, using stored procedures and/or triggers.
Triggers are not an exam objective, so we won’t cover triggers in this book, but in short, a trigger is a
bit of code defined on a table that runs automatically whenever someone performs an update, insert or
delete on that table. This allows you, among other things, to store the old version of the table before it
is changed. Note that temporal tables will not be sufficient for every scenario; besides the old version
of the record, temporal tables only store the begin and end time of when the record was valid, not
who changed the record; if you require this information, temporal tables alone will not solve your
problem (maybe in a future version of SQL). Temporal tables can also help you recover data in case
of erroneous data modifications, or if you want to check trends over time.

But let’s get back to our query of the temporal table. FOR SYSTEM_TIME AS OF will retrieve the
records of the table as they were at that specific time. Note that this is UTC time, not your local time
zone; you’ll have to get comfortable with date & time functions to query temporal tables. Besides a
specific point in time, you can query for an interval, in four different ways:
* FROM start_date_time TO end_date_time
* BETWEEN start_date_time AND end_date_time
* CONTAINED IN (start_date_time, end_date_time)
* ALL.

We’ll start with that last one, as it will show you all the records we have available. Later on, we’ll
create our own temporal tables, and you can experiment some more with these interval options.

SELECT CountryName, LatestRecordedPopulation, ValidFrom, ValidTo
FROM Application.Countries
FOR SYSTEM_TIME ALL
WHERE CountryName = 'Australia'

This query retrieved all available records. As you can see, there is not that much data to query. The
record that is currently valid is the first one. This has a ValidTo date of 9999-12-31 23:59:59.9999999,
the maximum value for a datetime2 column (a temporal table requires two datetime2 columns, as
we’ll see later on). For the currently valid record, the ValidTo time will always be this maximum
value. When converting your local time to UTC time for queries on temporal tables, remember to
leave this maximum value unchanged!

The interval options “FROM start_date_time TO end_date_time” and “BETWEEN start_date_time
AND end_date_time” are very similar. The difference is related to the boundary. When comparing
intervals, it is always important to be aware of whether the boundary values are included or
excluded.
Note: records that were never active, and therefore have the same ValidTo and ValidFrom time, will
never be included. We’ll demonstrate this later on.
The interval option “FROM start_date_time TO end_date_time” will return all records that were
valid between the supplied values for start time and end time, excluding the boundaries. The interval
option “BETWEEN start_date_time AND end_date_time” will do the same, except that it will
include records that became active on the upper boundary. We have three records, so we’ll
demonstrate this difference by supplying the ValidTo and ValidFrom times of the middle record (in the
screenshot above this is record #2, but remember that a set has no order, and we didn’t specify an
ORDER BY clause, so please verify this for yourself).

SELECT CountryName, LatestRecordedPopulation, ValidFrom, ValidTo
FROM Application.Countries
FOR SYSTEM_TIME FROM '2013-07-01 16:00:00.0000000'
TO '2015-07-01 16:00:00.0000000'
WHERE CountryName = 'Australia'

SELECT CountryName, LatestRecordedPopulation, ValidFrom, ValidTo
FROM Application.Countries
FOR SYSTEM_TIME BETWEEN '2013-07-01 16:00:00.0000000'
AND '2015-07-01 16:00:00.0000000'
WHERE CountryName = 'Australia'

As you can see, the BETWEEN query included the record that became valid on the upper boundary of
the interval we supplied, but it did not include the record that stopped being valid on the lower
boundary. You might think that there is an option to include records that become invalid on the lower
boundary, but no. The interval options “FROM start_date_time TO end_date_time” and “BETWEEN
start_date_time AND end_date_time” will return records that were valid within the supplied
interval; the CONTAINED IN option retrieves records that both became valid and invalid
within the supplied interval (including the lower and upper boundary). This means that records that
were valid for any period of time before or after this interval are not returned. To demonstrate this, we’ll
set the upper boundary to the present:

DECLARE @now datetime2 = SYSDATETIME()

SELECT CountryName, LatestRecordedPopulation, ValidFrom, ValidTo
FROM Application.Countries
FOR SYSTEM_TIME CONTAINED IN ('2013-07-01 15:00:00.0000000', @now)
WHERE CountryName = 'Australia'

As you can see, the currently valid record is not returned; for a record to be returned by
CONTAINED IN, it has to be valid only within that interval, not before or after.

We’ve seen how to retrieve older versions of records, using either a specific time or an interval.
Now, the next question is: how does SQL do that? Let’s look at the countries table in more detail.

Notice the little clock in the table icon, and the Countries_Archive history table underneath the
country table. Older versions of the records are stored in this history table. A history table will
automatically be created when you create a temporal table (as we’ll see later on). When you query a
temporal table using FOR SYSTEM_TIME, SQL will automatically retrieve the necessary older
versions from the history table. For this, the temporal table requires a primary key (in order to
uniquely identify which records from the history table correspond to a record in the temporal table),
but the history table will not have a primary key (otherwise, this would create conflicts when
inserting older versions in the history table).
You can also query this history table directly. In fact the following query is equivalent to the query
FOR SYSTEM_TIME ALL:

SELECT CountryName, LatestRecordedPopulation, ValidFrom, ValidTo
FROM Application.Countries_archive
WHERE CountryName = 'Australia'
UNION ALL
SELECT CountryName, LatestRecordedPopulation, ValidFrom, ValidTo
FROM Application.Countries
WHERE CountryName = 'Australia'

Equivalent, except for the records in the history table with the same ValidTo and ValidFrom date, as
we mentioned earlier.

So the primary key in the temporal table is used to look up older versions of a record in the history
table, and the ValidTo and ValidFrom columns are used to determine the time period a certain record
was valid. These are called period columns. The name is not important; what is important, is that
these columns have a datatype datetime2, and are created using a specific syntax we’ll see later on.
Optionally, these columns can be made hidden. That means that they will only be returned for select
statements that explicitly specify these columns, but not be returned for select statements that use
SELECT *. That means that you can change a regular table to a temporal table without having to
worry if existing select statements in applications will be impacted. Whether hidden or not, these
columns cannot be updated by users, only by the system.

With this information, let’s create our own temporal table. As always, we’ll first remove the table if
it already exists. But you can’t just drop a temporal table; first, you’ll have to change it to a regular
table, and then drop both the history table and the temporal table.

USE TestDB
GO
IF EXISTS ( SELECT *
FROM sys.tables
WHERE name = 'Countries'
AND temporal_type_desc = 'SYSTEM_VERSIONED_TEMPORAL_TABLE')
BEGIN
ALTER TABLE Countries SET ( SYSTEM_VERSIONING = OFF)
END
GO

DROP TABLE IF EXISTS Countries_archive


DROP TABLE IF EXISTS Countries
GO

Now, we can create the temporal table:

CREATE TABLE Countries (
CountryID int IDENTITY PRIMARY KEY
,CountryName varchar(50)
,Population int
,ValidFrom datetime2 GENERATED ALWAYS AS ROW START HIDDEN
,ValidTo datetime2 GENERATED ALWAYS AS ROW END HIDDEN
,PERIOD FOR SYSTEM_TIME (ValidFrom,ValidTo))
WITH(SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Countries_archive))
GO

Note the syntax of the two period columns, the fact that they are defined as hidden, and the fact that
we explicitly named the history table (otherwise, the history table would have gotten a system
generated name). Next, let’s put a record in, and update it. Obviously, the time stamp on your machine
will differ.
INSERT Countries (CountryName, Population) VALUES ('Neverland', 100)
UPDATE Countries SET Population = 101 WHERE CountryName = 'Neverland'

SELECT *, ValidFrom, ValidTo
FROM Countries
FOR SYSTEM_TIME ALL
ORDER BY ValidFrom

The time in the period column is set to the start time of the transaction that changes the record. Let’s
demonstrate that, by updating once, waiting ten seconds and updating once more:

BEGIN TRAN
UPDATE Countries SET Population = 102 WHERE CountryName = 'Neverland'
WAITFOR DELAY '00:00:10'
UPDATE Countries SET Population = 103 WHERE CountryName = 'Neverland'
COMMIT TRAN

SELECT *, ValidFrom, ValidTo
FROM Countries
FOR SYSTEM_TIME ALL
ORDER BY ValidFrom

Note that there is no 10 second gap between the ValidTo date of the first update, and the ValidFrom
date of the second; these are all set to the start time of the transaction. Also, in the result of this query,
there is no version of the record that sets the population to 102. That is because the value for both
period columns is set to the same time, and such records do not show up in temporal queries. We can
see this version in the history table, though:

SELECT *
FROM countries_archive

The version is there, only the temporal query will not return it, as both period columns are set to the
same time: the start time of the transaction. And this being a transaction, this version of the record
was never available for queries other than the transaction that created it.
Remember, when querying, to compensate for the difference between local and UTC time. As an
example:

DECLARE @local_time datetime2
,@utc_time datetime2
SET @local_time = '2018-07-04 13:40:00'

SET @utc_time = DATEADD(second, DATEDIFF(second, SYSDATETIME(), SYSUTCDATETIME()), @local_time)

SELECT *
FROM Countries
FOR SYSTEM_TIME AS OF @utc_time

This will return the version of the record that updated the population to 101. We won’t provide a
screenshot, as we would urge you to test this out with your own test data, as this requires you to use
your actual local time zone. You can also test the effect of deleting data from the temporal table, but it
is pretty easy to guess: the record will be removed from the temporal table, and moved to the history
table (with the ValidTo date set to the start time of the delete transaction).

It is important to note that you cannot change the data in the history table. Try this, to clean up data
after 7 years:
DELETE FROM Countries_archive WHERE ValidTo <DATEADD(yy, -7, SYSUTCDATETIME())

Msg 13560, Level 16, State 1, Line 121
Cannot delete rows from a temporal history table 'TestDB.dbo.Countries_archive'.

In SQL 2017, you can set a retention policy for temporal tables, but no such luck in SQL 2016. It is a
good idea to design a cleanup job whenever you design a table that might end up containing a lot of
data, so how would you do that for a temporal table? Besides retention policy, SQL 2017 offers
another option: stretched databases in Azure. I’ve heard people say that, on the exam, whenever an
answer contains Azure as a possible solution, this is probably the correct answer, but this won’t help
you if it is about SQL 2016. In this version, the only two options are:
* table partitioning (which is an objective for exam 70-762);
* or changing the table to a regular table, deleting the data and changing the table back to a temporal
table (a sketch of this follows below). We’ve already seen how to change a temporal table to a regular
table (before dropping it); as the last topic for this exam objective, we’ll change a regular table to a
temporal table.
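A minimal sketch of that second option, using the Countries and Countries_archive tables we created above (and an arbitrary retention period of 7 years):

ALTER TABLE Countries SET ( SYSTEM_VERSIONING = OFF)
GO
DELETE FROM Countries_archive WHERE ValidTo < DATEADD(yy, -7, SYSUTCDATETIME())
GO
ALTER TABLE Countries SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Countries_archive))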

But before that: what can you do to a history table? Not much, actually. There is a long list of
limitations and considerations on Microsoft Docs, which is worth checking out if you actually start
working with temporal tables, but the most important one of this: you can create indexes and statistics
on the history table, and if you expect heavy use of the temporal tables, you probably should.
You can also add or drop columns, but indirectly: these changes will be made to the history table
when you make the change to the temporal table. For instance, the following will add a column to
both the temporal and the history table, and drop it again:
ALTER TABLE Countries ADD Description varchar(100) NULL
GO
UPDATE Countries SET Description = 'Not an actual country' WHERE CountryName = 'Neverland'

GO
ALTER TABLE Countries DROP COLUMN Description

Just keep in mind that the change will be made to both tables, including nullability. So if you define a
column as NOT NULL and add a default value, this default will be added to all the records of the
history table as well, thereby actually changing the older versions of the records.
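For instance, something like the following sketch (the column name and default value are made up purely for illustration) would add IsRecognized = 0 to every record in the history table as well:

ALTER TABLE Countries ADD IsRecognized bit NOT NULL CONSTRAINT DF_Countries_IsRecognized DEFAULT (0)
GO
-- The older versions in Countries_archive now also have IsRecognized = 0,
-- even though that value was never part of those historical records. Clean up again:
ALTER TABLE Countries DROP CONSTRAINT DF_Countries_IsRecognized
GO
ALTER TABLE Countries DROP COLUMN IsRecognized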

Now for the last subject: changing an existing set of tables (with data) into a temporal table and
corresponding history table. This requires you to observe the same requirements as for creating
temporal tables from scratch, but SQL will also perform some checks on the data inside the tables
(for instance, the ValidFrom time for a record in the history table cannot be later than the ValidTo
date). All of these steps follow logically from what we’ve covered so far, and this is an excellent
way to recap all of that information. As an exercise, we suggest that you first try this on your own,
before following along.

We’ll start by dropping the tables:

USE TestDB
GO
IF EXISTS ( SELECT *
FROM sys.tables
WHERE name = 'Countries'
AND temporal_type_desc = 'SYSTEM_VERSIONED_TEMPORAL_TABLE')
BEGIN
ALTER TABLE Countries SET ( SYSTEM_VERSIONING = OFF)
END
GO

DROP TABLE IF EXISTS Countries_archive
DROP TABLE IF EXISTS Countries
GO
GO

Next, create the tables. Note that the temporal table requires a primary key, and the history table
cannot have a primary key. But as a side effect of declaring a column a primary key, it is also non-
nullable, and the CountryID column of the history table must have the same nullability. We’ll also put
a single record in (with a population of 103).

CREATE TABLE Countries (
CountryID int IDENTITY PRIMARY KEY NOT NULL
,CountryName varchar(50)
,Population int)
GO

INSERT Countries (CountryName, Population) VALUES ('Neverland', 103)

CREATE TABLE Countries_archive (
CountryID int NOT NULL --no identity or primary key
,CountryName varchar(50)
,Population int)
GO

Now, we’ll add the period columns (required) and manually insert an older version of the record in
the history table (not required for SQL, just for demonstration purposes).

ALTER TABLE Countries_archive ADD ValidFrom datetime2 NOT NULL
ALTER TABLE Countries_archive ADD ValidTo datetime2 NOT NULL
GO
INSERT Countries_archive (CountryID, CountryName, Population, ValidFrom,ValidTo)
VALUES (1, 'Neverland', 100, '2011-01-01', '2012-01-01')

Now, we need to add the period columns to the temporal table. We’ll add them as nullable, insert
the correct period values, and after that, make them non-nullable.
ALTER TABLE Countries ADD ValidFrom datetime2 NULL
ALTER TABLE Countries ADD ValidTo datetime2 NULL
GO
UPDATE Countries SET ValidFrom = '2012-01-01'
UPDATE Countries SET ValidTo = '9999-12-31 23:59:59.9999999'
GO

ALTER TABLE Countries ALTER COLUMN ValidFrom datetime2 NOT NULL


ALTER TABLE Countries ALTER COLUMN ValidTo datetime2 NOT NULL
GO

The batch separator GO is required between these different commands; otherwise, they will fail if you run
the script all at once.

Now, we’ve got the data (even if it is only one record) and all the requirements in place, so we can
change the regular table into a temporal table:

ALTER TABLE Countries ADD PERIOD FOR SYSTEM_TIME (ValidFrom,ValidTo)


ALTER TABLE Countries SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Countries_archive))

Note that this last statement will check for a correct sequence of period times, and will fail if the data
is wrong.
Using this script, you can create your own test scenarios: for example, to make sure you understand
the four different interval options that are available for temporal queries using FOR SYSTEM_TIME,
or to test how to calculate a trend using temporal tables. The following query checks the
population growth in the year 2011:

SELECT [2012].population - [2011].population AS 'population growth'


FROM Countries FOR SYSTEM_TIME AS OF '2011-01-01' AS [2011]
INNER JOIN Countries FOR SYSTEM_TIME AS OF '2012-01-01' AS [2012]
ON [2011].CountryID = [2012].CountryID
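
And as a quick sketch of one of the interval options: FOR SYSTEM_TIME ALL returns every version of a record, from both the base table and the history table:

SELECT CountryID, CountryName, Population, ValidFrom, ValidTo
FROM Countries FOR SYSTEM_TIME ALL
ORDER BY ValidFrom
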
This concludes our demonstration of temporal tables. We’ve demonstrated how SQL stores older
versions of records in a temporal table by using a history table and two period columns; how to query
older versions by specifying either a point in time or an interval (using FOR SYSTEM_TIME); and we’ve
also shown how to create a temporal table, either from scratch or using existing tables. Be sure to
drop the tables before you move on, and remember that temporal tables use UTC time for the period
columns, and that the time recorded is the transaction start time.

Query and output JSON data


In SQL Server 2016, Microsoft added support for JSON. JSON stands for JavaScript Object
Notation. JSON stores data in key-value pairs, and is a very popular format for exchanging data between
web sites and other systems.
We’ll show you what JSON data is exactly, how to format the results of a T-SQL query as JSON, and
how to query JSON data, by extracting data elements out of a JSON text string.

If you’ve dropped the Employees table, recreate it and insert some records, as we’ll be using it
in our examples. Also, add the Employee_Number column with NULL values. Something like this will
do:

CREATE TABLE [dbo].[Employees](
[EmployeeID] [tinyint] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](100) NULL,
[LastName] [varchar](50) NULL,
[Address] [varchar](100) NULL,
[Employee_Number] [int] NULL
)

INSERT dbo.Employees (FirstName, LastName, Address)


VALUES ('Bob', 'Jackson', 'Under the bridge')
,('Bo', 'Didley', 'Home of mr. Bo Didley')

About JSON
In a JSON text string, data is stored as objects, arrays or a combination of both. The
simplest JSON would represent one object, with a key and a value:

{"FirstName":"Bob"}

The key is FirstName, the value is Bob. Some syntax rules: the entire JSON text string must be put
between {curly brackets}, the name of the key must be surrounded with “double quotes”, and the
same goes for the value if its data type is a string.
In SQL Server, you can check if a string is a properly formed JSON string using the function ISJSON,
as the following example will demonstrate:

DECLARE @json NVARCHAR(MAX)


SET @json = '{"FirstName":"Bob"}'

SELECT ISJSON(@json) AS 'IsValidJSON?'


This will return 1 if the text is a valid JSON string, and 0 if it is not. As the data type in SQL Server,
we’ve chosen NVARCHAR; there is no special JSON data type in SQL Server 2016 (which, as we’ll
see, is unlike XML, which does have its own data type).
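
If you store JSON in a table, a common pattern is therefore an NVARCHAR column combined with an ISJSON check constraint; a minimal sketch (the table and column names are made up):

CREATE TABLE dbo.EmployeeDocuments (
DocumentID int IDENTITY PRIMARY KEY
,JsonData nvarchar(max) NULL
,CONSTRAINT CHK_EmployeeDocuments_JsonData CHECK (ISJSON(JsonData) = 1)
)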

In JSON, there are only 4 data types: string, Boolean (true or false), numeric or NULL. As mentioned
above, if the value is a string, it must be surrounded by double quotes; otherwise, it must not be.
As our examples get more complex, you’ll want to format the JSON string for readability. For this,
you can use an online formatter such as https://jsonformatter.org/ ; I’ve used my favourite editor,
Visual Studio Code.

Our first example was an object with a single property, expressed as a key-value pair, but an object
can have more than a single property. For example:

[
{
"FirstName": "Bob",
"lastname": "Jackson",
"address": "Under the bridge"
}
]

This, by the way, is the output of the following query:

SELECT FirstName, lastname, address


FROM employees
WHERE EmployeeID = 1
FOR JSON AUTO

We’ll cover outputting queries as JSON later, but for now, we would just like to mention that SQL
adds the [square brackets] to the output; that’s why we have them in this JSON string.

Besides key-value pairs, a JSON string can also contain arrays. An array is indicated with square
brackets. The square brackets in the example above indicate that this JSON string is an array with just
one object. If we add a second object to the array (by changing the filter in the WHERE statement),
we’d get the following JSON string:

[
{
"FirstName": "Bob",
"lastname": "Jackson",
"address": "Under the bridge"
},
{
"FirstName": "Bo",
"lastname": "Didley",
"address": "Home of mr. Bo Didley"
}
]
Objects and arrays can be nested. If we’d like to add an array of email addresses for an employee,
we’d just add an array:

{
"Employees":[
{
"FirstName": "Bob",
"lastname": "Jackson",
"address": "Under the bridge",
"emailaddress": ["1@example.com","2@example.com","3@example.com"]
},
{
"FirstName": "Bo",
"lastname": "Didley",
"address": "Home of mr. Bo Didley"
}
]
}

So now we have a single object. Its key is Employees, and its value is an array of objects; each
object has multiple properties, and one of these properties is itself an array. And yes, this can
get confusing pretty fast. Note that only Bob has the property “emailaddress”, Bo does not; this is
perfectly OK in JSON.

Now you know enough of JSON to start formatting T-SQL queries to output JSON.

Format query output as JSON


You can change the format of the output of a SQL query by adding the clause FOR JSON, which has
two modes. We’ll start by using the mode FOR JSON AUTO, which has no additional formatting
options, and then cover the alternative mode, FOR JSON PATH, which does have some additional
formatting options: ROOT, INCLUDE_NULL_VALUES and WITHOUT_ARRAY_WRAPPER.

We’ve already seen how to format the output of a SQL statement as JSON; we simply add the words
“FOR JSON AUTO”. Let’s demonstrate this with data for two employees:

SELECT FirstName, lastname, address


FROM employees
WHERE EmployeeID in (1,2)
FOR JSON AUTO

This is the output:


[
{
"FirstName": "Bob",
"lastname": "Jackson",
"address": "Under the bridge"
},
{
"FirstName": "Bo",
"lastname": "Didley",
"address": "Home of mr. Bo Didley"
}
]

The result is the same if we use PATH instead of AUTO.

SELECT FirstName, lastname, address


FROM employees
WHERE EmployeeID in (1,2)
FOR JSON PATH

Using PATH, however, we do have some formatting options. Let’s start with adding a ROOT object:

SELECT FirstName, lastname, address


FROM employees
WHERE EmployeeID in (1,2)
FOR JSON PATH, ROOT ('Employees')

{
"Employees": [
{
"FirstName": "Bob",
"lastname": "Jackson",
"address": "Under the bridge"
},
{
"FirstName": "Bo",
"lastname": "Didley",
"address": "Home of mr. Bo Didley"
}
]
}

If we omit the ROOT option, we can add the option WITHOUT_ARRAY_WRAPPER; these options
are mutually exclusive.

SELECT FirstName, lastname, address


FROM employees
WHERE EmployeeID in (1,2)
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER

This will remove the square brackets surrounding the JSON string:

{
"FirstName": "Bob",
"lastname": "Jackson",
"address": "Under the bridge"
},
{
"FirstName": "Bo",
"lastname": "Didley",
"address": "Home of mr. Bo Didley"
}

The third option is the option to include NULL values. By default, properties with a NULL value will
not be included in the JSON output. To demonstrate, let’s update the Employee_Number for one of the
employees:

UPDATE Employees
SET Employee_Number = employeeID
WHERE EmployeeID = 1

Now, let’s include Employee_Number in the query:

SELECT FirstName, lastname, address, Employee_Number


FROM employees
WHERE EmployeeID in (1,2)
FOR JSON PATH, INCLUDE_NULL_VALUES

[
{
"FirstName": "Bob",
"lastname": "Jackson",
"address": "Under the bridge",
"Employee_Number": 1
},
{
"FirstName": "Bo",
"lastname": "Didley",
"address": "Home of mr. Bo Didley",
"Employee_Number": null
}
]

Please verify for yourself that, without the option INCLUDE_NULL_VALUES, the key
Employee_Number for our second employee would not be included in the output.

We’ve now demonstrated the three options for the PATH mode, but we haven’t changed the path itself.
This can be done to create complex nestings.

In order to nest properties, we use the column alias. Let’s nest the first and last name under an object
called Name. This can simply be done by adding a dotted name for an alias:

SELECT FirstName AS 'Name.FirstName'


, lastname AS 'Name.LastName'
, address
FROM employees
WHERE EmployeeID in (1,2)
FOR JSON PATH
[
{
"Name": {
"FirstName": "Bob",
"LastName": "Jackson"
},
"address": "Under the bridge"
},
{
"Name": {
"FirstName": "Bo",
"LastName": "Didley"
},
"address": "Home of mr. Bo Didley"
}
]

As you can see, properties that have the same name before the dot get grouped together in the
same object.

These are all the options you have to format the output of SQL statements to create JSON strings.
Now, let’s walk the other way, and retrieve data from a JSON string.

Extract data from JSON


In order to retrieve data from a JSON string, we’ll use the functions OPENJSON, JSON_VALUE,
JSON_QUERY and JSON_MODIFY. In short: OPENJSON will return all of the data in the JSON
string in the form of a result set; JSON_VALUE will extract a single (scalar) value from a JSON
string; JSON_QUERY will extract either an object or an array from a JSON string; and
JSON_MODIFY will take a JSON string as input, modify a value and return the modified string.

OPENJSON has some interesting features we’ll need to cover that’ll make the other functions easier
to understand, so we’ll start there. The function OPENJSON takes a JSON string as input, and will
return a result set with three columns: key, value and type. The first two are obvious; type is a number
that indicates whether the value is an array, an object or, if the key is a regular property (instead of an
array or nested object), its data type.
In the following query, we’ll declare a variable (as NVARCHAR), assign a JSON string to it, and
extract data from it using the function OPENJSON. The result will nicely show all possible key types,
and the number representing that type (as mentioned, there are only 4 data types: string, Boolean (true
or false), numeric or NULL).

DECLARE @json NVARCHAR(MAX)

SET @json = '{


"A null value":null
,"A text string":"string"
,"A number":2
,"A boolean (true or false)":true
,"A boolean (true or false)":false
,"An array":[1,2,3]
,"An object": {"key1":"value1", "key2":"value2"}
}'

SELECT *
FROM OPENJSON(@json)

As the function returns a result set, you can treat it as a table, and apply all sorts of filters and
operators to it. For example, if you’re not interested in the NULL values, you’d use the following
query:

SELECT [key], value


FROM OPENJSON(@json)
WHERE type <> 0

Note that “key” is in the list of reserved Transact-SQL keywords, and therefore needs to be surrounded
by [square brackets]. By the way: a list of reserved keywords can be found here:
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/reserved-keywords-transact-sql?view=sql-server-2016

As the function returns a single table, only the first-level properties are converted to rows.
Therefore, in our result set, the property “An object” is returned as a single row (note that the double
quotes are required, as we’ve included a space in the key). If we want to convert the content of that
object to tabular form, we need to use the PATH argument:

SELECT [key], value


FROM OPENJSON(@json,'$."An object"')

For every property of the object, the key and value will be returned. And similarly, for an array:

SELECT [key], value


FROM OPENJSON(@json,'$."An array"')

For each element in the array, the key will represent the ordinal position (starting with zero), and the
value will represent the value of the element.

What happens if you specify a path that does not exist? That depends on whether you’ve used lax or
strict mode. Lax mode is the default, and will return an empty result set; strict mode will return an
error. You can verify this for yourself by trying the following queries against a path that doesn’t exist:

SELECT [key], value


FROM OPENJSON(@json,'lax $."n array"')

SELECT [key], value


FROM OPENJSON(@json,'strict $."n array"')

If you require a bit more control over the output, you can add a WITH clause. This way, you can
control the column name and the data type; additionally, if the key is not a property, but instead an object
or an array, you need to specify this (using AS JSON). For instance, to return the value for the
property “A number” as bigint, with a column name of NameOfTheColumn, and something similar for
the property “A text string”, you would use the following:

SELECT *
FROM OPENJSON(@json)
WITH (
NameOfTheColumn bigint '$."A number"'
,NameOfTheOtherColumn varchar(100) '$."A text string"'
)

This method can also be used to retrieve a combination of properties of the first level object and
second level object:

SELECT *
FROM OPENJSON(@json)
WITH (
NameOfTheColumn bigint '$."A number"'
,NameOfTheOtherColumn varchar(100) '$."A text string"'
,JSONColumn nvarchar(max) '$."An object"' AS JSON
,ObjectValue varchar(100) '$."An object".key1'
)

This demonstrates that OPENJSON returns a result set. The other functions, JSON_VALUE and
JSON_QUERY, return a single value. For instance, to return the value for the key “A number”,
you’d use the following query:

SELECT JSON_VALUE(@json, '$."A number"')

And to retrieve the object in the key “An object”, you’d use the following query:

SELECT JSON_QUERY(@json, '$."An object"')

If you use the function JSON_VALUE on a key that has the type array or object, or if you use the
function JSON_QUERY on a key that is a property, you’ll get no result. So the following will not
work:

SELECT JSON_QUERY(@json, '$."A number"')


SELECT JSON_VALUE(@json, '$."An object"')

Finally, to modify the JSON string, you can use the function JSON_MODIFY. This function takes three
input parameters: the JSON text string, the key you want to modify and the new value. For instance, to
change the number in our JSON string to 17:

SELECT JSON_MODIFY(@json,'$."A number"', 17)
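
JSON_MODIFY can also add a new key (by supplying a path that doesn’t exist yet) or remove one (by supplying NULL as the new value, in the default lax mode). A quick sketch, using the same @json variable:

--add a new key-value pair
SELECT JSON_MODIFY(@json, '$."A new number"', 42)

--remove an existing key by setting its value to NULL
SELECT JSON_MODIFY(@json, '$."A number"', NULL)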

This describes how the functions JSON_VALUE, JSON_QUERY and JSON_MODIFY work on a
JSON string that represents a single object. However, with a slight modification, the same can be
done if the JSON string contains an array of objects. All you need to do is to include the ordinal
position of the object within the array (again, starting from zero). For instance, to get the value for the
property address of the second employee in an array, you’d use:

SELECT JSON_VALUE(@json, '$.Employees[1].address')

This section has covered how to extract data from a JSON string, using the functions OPENJSON,
JSON_VALUE, and JSON_QUERY. We’ve also seen how to change a JSON text string, using the
JSON_MODIFY function. Next, we’ll cover XML.

Tip: if you want some extra JSON data to practice with, you can find this in the columns
UserPreferences, CustomFields and OtherLanguages in the Application.People table in the
WideWorldImporters database.
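
For instance, something like the following (a sketch; besides the columns mentioned above, we only assume the FullName column) lets you take a quick look at that data:

SELECT TOP (5) FullName, CustomFields, OtherLanguages
FROM Application.People
WHERE ISJSON(CustomFields) = 1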

Query and output XML data


In this section, we’re going to discuss XML. XML stands for eXtensible Markup Language and, like
JSON, XML is used to store data in the human-readable format of a text string. XML is used to store
data for web services and for data exchange between programs.
As we did with JSON, we’ll show you what XML data is exactly, how to format the results of a T-
SQL query as XML, and how to query XML data, by extracting data elements out of a XML text string.
Along the way, we’ll illustrate some of the similarities and differences between JSON and XML.

The first support for XML in SQL Server was in version 2000, whereas support for JSON is new in
SQL 2016; as you’ll see, this also means there is a lot more to discuss about XML. In fact, there is a
lot more to XML in SQL Server than we’ll discuss here; we’ll just stick to the exam objective, the
ability to query and output XML data.

About XML
In this section, we’ll talk about elements, which have values and attributes; how elements are identified
by opening and closing tags; and how XML text can be either a document or a fragment.

In XML, data is stored in elements. An element consists of a piece of data, surrounded by an opening
tag and a closing tag:

<FirstName>Bob</FirstName>

Because of the opening and closing tag, an XML document will be bigger than the equivalent JSON
document (where the key is given only once). In applications where speed is essential, this can be
important for developers.

A few remarks on syntax rules:


* the closing tag is preceded by a forward slash;
* XML is case sensitive, and the opening tag and closing tag must be an exact match (with the forward
slash as the only difference);
* no spaces are allowed in the tags;
* there are 5 characters that cannot be used directly in either a tag or a value: the ampersand (&), less
than (<), greater than (>), apostrophe (‘) and double quote (“); these should be replaced with the
combination of an ampersand, a code for the character to be replaced, and a semicolon, respectively:
* &amp;
* &lt;
* &gt;
* &apos;
* &quot;
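
For example, an ampersand inside a value has to be escaped like this:

<Company>Johnson &amp; Johnson</Company>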

SQL Server has an XML data type, so you can validate the XML by assigning it to a variable with an
XML data type. So the following will generate an error, because the N of Name is capitalized in
the opening tag, but not in the closing tag:

DECLARE @xml XML


SET @xml = '<FirstName>Bob</Firstname>'

SELECT @xml

Msg 9436, Level 16, State 1, Line 2


XML parsing: line 1, character 26, end tag does not match start tag

As an alternative, do you remember the function to determine if you can cast data as a certain data
type? If not, here it is:

SELECT TRY_CAST('<FirstName>Bob</Firstname>' as XML)

This is the equivalent of the ISJSON function. We’ll use variables to put XML into, but since XML is
a data type, we could just as easily create a table with an XML column:

CREATE TABLE tbl_XML (Column1 XML)

But we’ll stick to using XML variables, as these examples will be more concise.

An element does not have to contain a single value; elements can be empty, and elements can be
nested. First, an empty element. This can be represented in two ways:

<FirstName />

Or:
<FirstName></FirstName>

As an example of nested elements, we’ll describe two employees, in the form of nested FirstName
elements within an Employees element:

<Employees>
<FirstName>Bob</FirstName>
<FirstName>Bo</FirstName>
</Employees>

The nested elements are referred to as sub elements, or child elements. The outermost element (in this
case, Employees), is called the root element. A well-formed XML document will have one, and only
one, root element; a piece of XML without a root element is called a fragment.

Besides a value, an element can also have one or more attributes. These attributes are enclosed in the
opening tag. This shows the employee id as an attribute of the element employee:

<Employee Id="1">
<FirstName>Bob</FirstName>
<LastName>Jackson</LastName>
</Employee>
<Employee Id="2">
<FirstName>Bo</FirstName>
<LastName>Didley</LastName>
</Employee>

As you can see, attributes are surrounded by “double quotes”.


Just a glance ahead: this XML fragment was generated from SQL Server using the following query:

SELECT EmployeeID as '@Id'


,FirstName
,LastName
FROM Employees
WHERE EmployeeID in (1,2)
FOR XML PATH ('Employee')

The last thing to know about XML itself: an XML document can have a header, containing a declaration, a
schema reference and maybe some comments.

<?xml version="1.0" encoding="UTF-8"?>


<xs:schema xmlns:xs="http://a.website.that.describes.my.xml.schema/XMLSchema"></xs:schema>
<!-- My first employee list -->
<Employee Id="1">
<FirstName>Bob</FirstName>
<LastName>Jackson</LastName>
</Employee>
<Employee Id="2">
<FirstName>Bo</FirstName>
<LastName>Didley</LastName>
</Employee>

In this header, line 1 is the declaration, which references the XML version number (either 1.0 or 1.1);
line 2 is a reference to a web site that describes the schema of this particular XML document; and line
3 is a line of comment, surrounded by <!-- and -->. An XML document that is described by an XML
Schema Definition is called a typed XML document.
For the exam, you will probably not be asked to create XML headers or, even more complicated,
XML schemas, but you should at least be able to recognize them.
We’ve now covered what XML looks like, and described all the terms you need to know for the
rest of this exam objective.

Format query output as XML


In order to format the output of a T-SQL query as JSON, we used FOR JSON; for XML, we use
(surprise, surprise) FOR XML. Whereas FOR JSON has only two modes (AUTO and PATH), FOR
XML has four: RAW, AUTO, PATH and EXPLICIT. These four modes give you an increasing amount
of control. We’ll start with the easiest one: RAW.

RAW mode will return an XML fragment. Without additional options, every row of the result set will
be returned as an XML element called row, and every column will be returned as an attribute of that
element:

SELECT EmployeeID
,FirstName
,LastName
FROM Employees
WHERE EmployeeID in (1,2)
FOR XML RAW

<row EmployeeID="1" FirstName="Bob" LastName="Jackson" />


<row EmployeeID="2" FirstName="Bo" LastName="Didley" />

In order to replace “row”, RAW mode takes an optional argument:

SELECT EmployeeID
,FirstName
,LastName
FROM Employees
WHERE EmployeeID in (1,2)
FOR XML RAW ('GiveThisElementAnotherName')

<GiveThisElementAnotherName EmployeeID="1" FirstName="Bob" LastName="Jackson" />


<GiveThisElementAnotherName EmployeeID="2" FirstName="Bo" LastName="Didley" />

However, the columns are still returned as attributes of the element (the resulting XML fragment is
called attribute-centric). If you want to return the columns as nested elements, use the option
ELEMENTS. The resulting fragment will now be considered element-centric:

SELECT EmployeeID
,FirstName
,LastName
FROM Employees
WHERE EmployeeID in (1,2)
FOR XML RAW , ELEMENTS

<row>
<EmployeeID>1</EmployeeID>
<FirstName>Bob</FirstName>
<LastName>Jackson</LastName>
</row>
<row>
<EmployeeID>2</EmployeeID>
<FirstName>Bo</FirstName>
<LastName>Didley</LastName>
</row>

Similar to JSON, XML output will by default not include NULL values. We can verify this by
including the column Employee_number in our SELECT statement (remember that we did not assign
an Employee_Number to Bo Didley):

SELECT EmployeeID
,FirstName
,LastName
,Employee_Number
FROM Employees
WHERE EmployeeID in (1,2)
FOR XML RAW

<row EmployeeID="1" FirstName="Bob" LastName="Jackson" Employee_Number="1" />


<row EmployeeID="2" FirstName="Bo" LastName="Didley" />

You can include the NULL values by adding ELEMENTS XSINIL. This will also automatically add a
reference to an XML schema:

SELECT EmployeeID
,FirstName
,LastName
,Employee_Number
FROM Employees
WHERE EmployeeID in (1,2)
FOR XML RAW, ELEMENTS XSINIL

<row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<EmployeeID>1</EmployeeID>
<FirstName>Bob</FirstName>
<LastName>Jackson</LastName>
<Employee_Number>1</Employee_Number>
</row>
<row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<EmployeeID>2</EmployeeID>
<FirstName>Bo</FirstName>
<LastName>Didley</LastName>
<Employee_Number xsi:nil="true" />
</row>

And if you want to add a root element, just say so:

SELECT EmployeeID
,FirstName
,LastName
FROM Employees
WHERE EmployeeID in (1,2)
FOR XML RAW, ROOT ('Employees')

<Employees>
<row EmployeeID="1" FirstName="Bob" LastName="Jackson" />
<row EmployeeID="2" FirstName="Bo" LastName="Didley" />
</Employees>

We’ve kept the examples simple, but you can combine the options ROOT, ELEMENTS and the
optional argument to rename the row.
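
For instance, combining all three (a quick sketch), each employee is returned as an element-centric Employee element, wrapped in an Employees root:

SELECT EmployeeID
,FirstName
,LastName
FROM Employees
WHERE EmployeeID in (1,2)
FOR XML RAW ('Employee'), ROOT ('Employees'), ELEMENTS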

The next mode, FOR XML AUTO, gives you minimal control over the format of the output. However,
SQL Server will automatically try to nest elements using heuristics. For this automatic nesting to
work, it is important to have the columns and the rows in the correct order.

To have an example where nesting of elements makes sense, we’ll use the Orders and Customers
table of the WideWorldImporters database. If you’d want to generate an XML document containing the
orders of customers, you’d probably want to have all orders nested as sub elements of an element
customer; any other nesting of elements would be nonsensical. In order to limit the number of rows,
we’ll include only two customers and one month of sales:

SELECT c.Customername, o.orderid, OrderDate


FROM sales.Orders o
INNER JOIN sales.Customers c on c.CustomerID = o.CustomerID
WHERE c.customerId in (832, 803)
AND YEAR(OrderDate)=2013
AND MONTH(OrderDate)=1
--ORDER BY OrderDate
FOR XML AUTO

And this is the output:

<c Customername="Bala Dixit">


<o orderid="2" OrderDate="2013-01-01" />
<o orderid="46" OrderDate="2013-01-01" />
<o orderid="480" OrderDate="2013-01-09" />
<o orderid="499" OrderDate="2013-01-09" />
<o orderid="906" OrderDate="2013-01-17" />
<o orderid="1446" OrderDate="2013-01-29" />
</c>
<c Customername="Aakriti Byrraju">
<o orderid="1" OrderDate="2013-01-01" />
<o orderid="45" OrderDate="2013-01-01" />
<o orderid="495" OrderDate="2013-01-09" />
<o orderid="1318" OrderDate="2013-01-25" />
<o orderid="1387" OrderDate="2013-01-28" />
</c>

There are a few things to note here:


* the table alias becomes the name of the element;
* like RAW mode, AUTO mode supports ELEMENTS, to turn the columns into sub elements instead
of attributes;
* the use of the functions YEAR and MONTH is included here as a reminder of an earlier exam
objective (using functions in the WHERE clause might prevent the use of an index);
* the order of the rows and columns matters.

To illustrate this last point: if we uncomment the ORDER BY clause, the heuristics cannot properly
nest the orders under each customer. As XML, the output now looks like this:

<cust Customername="Bala Dixit">


<o orderid="2" OrderDate="2013-01-01" />
<o orderid="46" OrderDate="2013-01-01" />
</cust>
<cust Customername="Aakriti Byrraju">
<o orderid="1" OrderDate="2013-01-01" />
<o orderid="45" OrderDate="2013-01-01" />
<o orderid="495" OrderDate="2013-01-09" />
</cust>
<cust Customername="Bala Dixit">
<o orderid="480" OrderDate="2013-01-09" />
<o orderid="499" OrderDate="2013-01-09" />
<o orderid="906" OrderDate="2013-01-17" />
</cust>
<cust Customername="Aakriti Byrraju">
<o orderid="1318" OrderDate="2013-01-25" />
<o orderid="1387" OrderDate="2013-01-28" />
</cust>
<cust Customername="Bala Dixit">
<o orderid="1446" OrderDate="2013-01-29" />
</cust>

As you can see, the orders are only nested under the same customer element if adjacent rows in the
result set are for the same customer. Please confirm for yourself that, if you change the order of the
columns, the result will be similar; elements will not be nested in the most logical manner. By
controlling the order of the columns and rows, you have some control over the format of the resulting
XML; if you need more control over the nesting, use either PATH or EXPLICIT mode.

We’ve already seen an example of FOR XML PATH:

SELECT EmployeeID as '@Id'


,FirstName
,LastName
FROM Employees
WHERE EmployeeID in (1,2)
FOR XML PATH ('Employee')

The column alias ‘@Id’ causes the EmployeeID to be returned as an attribute, not a sub element, and
as we mentioned before, the argument after PATH causes “row” to be replaced by Employee. For our
simple Employee example, that is all you need to know. For our more complicated example of the
customer orders, you need to know that, in order to create nested elements in the output, you must use
subqueries in the select statement. This subquery must, in itself, return an XML fragment using FOR
XML PATH and TYPE.
We’ll start with the outer query:

SELECT c.CustomerID AS '@id'


,c.Customername
--subquery will be inserted here
FROM sales.customers c
WHERE c.customerId in (832, 803)
ORDER BY c.CustomerName
FOR XML PATH ('Customers'), ELEMENTS, ROOT ('CustomerOrders')

This produces the following result:

<CustomerOrders>
<Customers id="832">
<Customername>Aakriti Byrraju</Customername>
</Customers>
<Customers id="803">
<Customername>Bala Dixit</Customername>
</Customers>
</CustomerOrders>

Note that you can play around with sub elements and attributes, by using the @ in the column alias,
and the option ELEMENTS.

Now, in order to nest the orders, as we said, we need a subquery in the SELECT clause. In the
WHERE clause of this subquery, we’ll link the CustomerID of the subquery to that of the outer query
(c.customerID = c2.customerID). We’ll use the following subquery:

SELECT o.orderid AS '@OrderID'


,o.OrderDate
FROM sales.customers c2
INNER JOIN sales.orders o on c2.CustomerID = o.CustomerID
WHERE c.customerID = c2.customerID
AND YEAR(OrderDate)=2013
AND MONTH(OrderDate)=1
FOR XML PATH('Orders'), ROOT('CustomerOrderId'),TYPE

So the total query becomes:


SELECT c.CustomerID AS '@id'
,c.Customername
,(SELECT o.orderid AS '@OrderID'
,o.OrderDate
FROM sales.customers c2
INNER JOIN sales.orders o on c2.CustomerID = o.CustomerID
WHERE c.customerID = c2.customerID
AND YEAR(OrderDate)=2013
AND MONTH(OrderDate)=1
FOR XML PATH('Orders'), ROOT('CustomerOrderId'),TYPE)
FROM sales.customers c
WHERE c.customerId in (832, 803)
ORDER BY c.CustomerName
FOR XML PATH ('Customers'), ELEMENTS, ROOT ('CustomerOrders')

When we execute this statement without the FOR XML clause in the outer query, we get a regular
result set; because of the option TYPE in the subquery, the third column is displayed as XML.

With the FOR XML in the outer query, the total result will be the XML document we wanted (in
abbreviated form):

<CustomerOrders>
<Customers id="832">
<Customername>Aakriti Byrraju</Customername>
<CustomerOrderId>
<Orders OrderID="1">
<OrderDate>2013-01-01</OrderDate>
</Orders>
<Orders OrderID="45">
<OrderDate>2013-01-01</OrderDate>
</Orders>
...
</Orders>
</CustomerOrderId>
</Customers>
<Customers id="803">
<Customername>Bala Dixit</Customername>
<CustomerOrderId>
<Orders OrderID="2">
<OrderDate>2013-01-01</OrderDate>
</Orders>
<Orders OrderID="46">
<OrderDate>2013-01-01</OrderDate>
...
</Orders>
</CustomerOrderId>
</Customers>
</CustomerOrders>

Now, can you join the order lines to these orders as nested sub elements? Try it for yourself, before
reading on.

The solution is to add a subquery in the SELECT clause of the first subquery. This second, nested
subquery should join all three tables, and the order id of this subquery should be linked to the order id of
the other subquery. Something like this:

SELECT ol.orderlineId as '@OrderLineID'


,ol.Description
,ol.Quantity
FROM sales.orderlines ol
inner join sales.orders o2 on ol.OrderID = o2.OrderID
inner join sales.customers c3 on c3.CustomerID = o.CustomerID
WHERE o2.OrderID = o.orderid
FOR XML PATH ('OrderLines'), TYPE

This covers the option FOR XML PATH. The last option we’d like to mention is FOR XML
EXPLICIT. Even Microsoft Docs describes this as “cumbersome”, and advises FOR XML PATH as
the simpler alternative, so we won’t cover it here. Just know that this is the option that gives you full
control over the XML output, and is your last resort if FOR XML PATH won’t do.

We’ve now demonstrated how to format the result of SQL queries in XML format; now let’s go the
other way, and use XML as input to extract data from.

Extract data from XML


There are two methods to turn XML text into SQL data that we’d like to discuss: OPENXML and
XQUERY. Both use XPath to navigate through XML documents, in a directory-like manner.
Let’s start with OPENXML. It requires three steps:
* create an internal representation of the XML document;
* apply the OPENXML rowset function to this internal representation, including some parameters to
indicate what part of the XML you want included in the result;
* remove the internal representation of the XML document.

Here is an example, using the output from an earlier example:


DECLARE @xml varchar(1000)
DECLARE @xml_document_handle int

SET @xml = '


<CustomerOrders>
<Customers id="832">
<Customername>Aakriti Byrraju</Customername>
</Customers>
<Customers id="803">
<Customername>Bala Dixit</Customername>
</Customers>
</CustomerOrders>
'

-- Step 1
EXEC sp_xml_preparedocument @xml_document_handle OUTPUT, @xml;

--Step 2
SELECT *
FROM OPENXML (@xml_document_handle, '/CustomerOrders/Customers',11)
WITH ( id int,
Customername varchar(200));

--Step 3
EXEC sp_xml_removedocument @xml_document_handle

Steps 1 and 3 require only a little bit of explanation. As you can see, we’re using int and varchar as
data types. The stored procedure sp_xml_preparedocument takes some form of text data type (e.g. varchar,
nvarchar or xml) as its input parameter @xml. The output, @xml_document_handle, is a number that
points to a prepared XML document that SQL stores somewhere in memory; so, in order to free up
that memory when we’re done, we need to call sp_xml_removedocument to clean up that prepared
XML document.

Step 2 is the OPENXML function itself, and it is more complex. We’ve covered the document handle;
the other things we’re going to discuss are the row pattern ( '/CustomerOrders/Customers' ), the flag (11), the
WITH statement and an optional column pattern (which is not included in the statement above). For
the rest of the options of this statement, we’ll have to refer to Microsoft Docs.
Let’s start with the flag. The number 11 in the OPENXML function is a flag that determines whether values
from attributes, from elements or both are returned. 1 uses the attribute-centric mapping, meaning only
attribute values are returned; 2 uses the element-centric mapping, meaning only element values are
returned; and 11 means values from both attributes and elements are returned. Try this for yourself:
turn 11 into 1, and Customername will be returned as NULL; turn 11 into 2, and id will be returned as
NULL.

The row pattern provides the path to the XML elements. This is in the form of an XPath pattern. Only
elements from a particular level of nesting will be returned by the OPENXML function, unless you
also specify the column pattern (which we’ll cover along with the WITH statement). In the example
above, the XML has a root element CustomerOrders, and a nested element of Customers. Only data
from that particular nesting level will be returned, as specified by the row pattern
( '/CustomerOrders/Customers' ). Because we’ve specified flag 11, the sub elements of this level will also
be included, but, as we’ll see later, no levels under that level.

The WITH statement is pretty self-explanatory. It defines the result set that the OPENXML function
will return. The most interesting option about the WITH statement, however, is the optional column
pattern.
As stated above, the row pattern specifies which level of nested elements will be included in the
output. If you change the row pattern to '/CustomerOrders' , no data will be returned, as there is no
element called id or Customername at this level of nesting. So how would you handle a situation
where you have data at other levels of nesting? This is what you use the column pattern for.
In a previous example, we’ve created XML output of customers, orders and order lines. For this, we
used the following query:

SELECT c.CustomerID AS '@id'


,c.Customername
, (SELECT o.orderid AS '@OrderID'
,o.OrderDate
,(SELECT ol.orderlineId as '@OrderLineID'
,ol.Description
,ol.Quantity
FROM sales.orderlines ol
inner join sales.orders o2 on ol.OrderID = o2.OrderID
inner join sales.customers c3
on c3.CustomerID = o.CustomerID
WHERE o2.OrderID = o.orderid
for XML PATH ('OrderLines'), TYPE
)
FROM sales.customers c2
INNER JOIN sales.orders o on c2.CustomerID = o.CustomerID
WHERE c.customerID = c2.customerID
AND YEAR(OrderDate)=2013
AND MONTH(OrderDate)=1
FOR XML PATH('Orders'), ROOT('CustomerOrderId'),TYPE)
FROM sales.customers c
WHERE c.customerId in (832, 803)
ORDER BY c.CustomerName
FOR XML PATH ('Customers'), ELEMENTS, ROOT ('CustomerOrders')

Just a quick reminder: you can copy these queries from the web site,
http://www.rbvandenberg.com/70-761-querying-data-mcsa-sql-2016/

We’ve abbreviated the output of this query, to use it as input for our next example. How would you
return the customer id, customer name, order date, order id and description? This data is spread out in
the XML document over multiple levels. Customer info is located at /CustomerOrders/Customers, and
order line description is located at CustomerOrderId/Orders/OrderLines.

The trick is to use the column pattern. Here, using XPath, you can retrieve data from a higher nesting
level. As a row pattern, you provide the path to the lowest nesting level, and use two periods to
indicate the parent element of this level (note that you can also indicate a deeper nesting level, but
doing this will only return the first sub element; for instance, if you have several order lines under a
single order, and you start at the order level, only the first order line will be returned). In this path,
you have to distinguish between attributes and elements; for attributes, you have to specify the @
symbol.
So the OPENXML function to return the required data from this XML would be:
DECLARE @xml varchar(max)
DECLARE @xml_document_handle int

SET @xml = '


<CustomerOrders>
<Customers id="832">
<Customername>Aakriti Byrraju</Customername>
<CustomerOrderId>
<Orders OrderID="1">
<OrderDate>2013-01-02</OrderDate>
<OrderLines OrderLineID="2">
<Description>Ride on toy sedan car (Black) 1/12 scale</Description>
<Quantity>10</Quantity>
</OrderLines>
</Orders>
<Orders OrderID="45">
<OrderDate>2013-01-01</OrderDate>
<OrderLines OrderLineID="1">
<Description>32 mm Double sided bubble wrap 50m</Description>
<Quantity>50</Quantity>
</OrderLines>
</Orders>
</CustomerOrderId>
</Customers>
<Customers id="803">
<Customername>Bala Dixit</Customername>
<CustomerOrderId>
<Orders OrderID="2">
<OrderDate>2013-01-01</OrderDate>
<OrderLines OrderLineID="3">
<Description>Developer joke mug - old C developers never die (White)</Description>
<Quantity>9</Quantity>
</OrderLines>
<OrderLines OrderLineID="6">
<Description>USB food flash drive - chocolate bar</Description>
<Quantity>9</Quantity>
</OrderLines>
</Orders>
</CustomerOrderId>
</Customers>
</CustomerOrders>
'

-- Step 1
EXEC sp_xml_preparedocument @xml_document_handle OUTPUT, @xml;

--Step 2
SELECT *
FROM OPENXML (@xml_document_handle, '/CustomerOrders/Customers/CustomerOrderId/Orders/OrderLines',11)
WITH ( id int '../../../@id',
Customername varchar(1000) '../../../Customername',
OrderDate date '../OrderDate',
OrderID int '../@OrderID',
Description varchar(1000));

--Step 3
EXEC sp_xml_removedocument @xml_document_handle

As you can see: we use the double period (..) to go up one level, same as moving up a directory in
DOS.

That’s it for OPENXML. As you can see, it is far more complicated than OPENJSON. You have to
prepare an internal representation of the XML document, retrieve the required data from that internal
representation, using XPath to form a row pattern and column patterns, and finally drop the internal
representation.

As you’ll see, XPath is also used by XQuery. The XML Query language is defined by the W3C, and
SQL Server has implemented parts of it. It has done so by adding a number of methods to the XML
data type. The five most important methods are: query, value, exist, modify and nodes.

* query returns a fragment of XML;


* value returns a single value as a SQL data type;
* exist tests if a node or value exists, and returns true or false;
* nodes returns a result set of nodes;
* modify modifies the XML.

Let’s explain them one by one, starting with the query method. For each example, unless stated
otherwise, we’ll use the @xml variable we’ve declared in our last example, containing the XML
document with customer, order and order lines data.
The query method is applied to an XML column or variable, takes an XQuery as its argument and will
return a fragment of XML. The following will return all OrderLines elements and everything
nested below them:

SELECT @xml.query('/CustomerOrders/Customers/CustomerOrderId/Orders/OrderLines')

The output:

<OrderLines OrderLineID="2">
<Description>Ride on toy sedan car (Black) 1/12 scale</Description>
<Quantity>10</Quantity>
</OrderLines>
<OrderLines OrderLineID="1">
<Description>32 mm Double sided bubble wrap 50m</Description>
<Quantity>50</Quantity>
</OrderLines>
<OrderLines OrderLineID="3">
<Description>Developer joke mug - old C developers never die (White)</Description>
<Quantity>9</Quantity>
</OrderLines>
<OrderLines OrderLineID="6">
<Description>USB food flash drive - chocolate bar</Description>
<Quantity>9</Quantity>
</OrderLines>

You can filter this to a specific element by providing a number. Put the XPath between (parentheses), and
a number between [square brackets], all within the single quotes. So in order to retrieve the second
order line:

SELECT @xml.query('(/CustomerOrders/Customers/CustomerOrderId/Orders/OrderLines)[2]')

<OrderLines OrderLineID="1">
<Description>32 mm Double sided bubble wrap 50m</Description>
<Quantity>50</Quantity>
</OrderLines>

You can also use the number to filter out specific nodes of the XML path. For example, to take the
second order line of the second customer, you’d use:
SELECT @xml.query('(/CustomerOrders/Customers[2]/CustomerOrderId/Orders/OrderLines)[2]')

<OrderLines OrderLineID="6">
<Description>USB food flash drive - chocolate bar</Description>
<Quantity>9</Quantity>
</OrderLines>

The same can be achieved by filtering inside the XPath expression itself. For instance, to retrieve the
order lines for customer id 803:

SELECT @xml.query('(/CustomerOrders/Customers[@id=803]/CustomerOrderId/Orders/OrderLines)')
<OrderLines OrderLineID="3">
<Description>Developer joke mug - old C developers never die (White)</Description>
<Quantity>9</Quantity>
</OrderLines>
<OrderLines OrderLineID="6">
<Description>USB food flash drive - chocolate bar</Description>
<Quantity>9</Quantity>
</OrderLines>

Note that id is an attribute (which requires the @ in [@id=803] ). We can do the same by name.

SELECT @xml.query('(/CustomerOrders/Customers[Customername="Bala Dixit"]/CustomerOrderId/Orders/OrderLines)')

XQuery also supports something called a FLWOR statement. You can use this to perform more
complex filtering than simply by name or id. FLWOR is an acronym for FOR, LET, WHERE, ORDER BY,
RETURN. This is the XML equivalent of SELECT FROM WHERE ORDER BY. We’ll briefly cover the
parts of this FLWOR statement.

FOR and RETURN are mandatory, so we’ll start there. In the FOR clause, you assign one or more
variables to an input XPath. In the RETURN clause, you construct the result that will be returned.
The following statement will return all orders as XML:

SELECT @xml.query('
for $Orders in /CustomerOrders/Customers/CustomerOrderId/Orders
return ($Orders)
')

As mentioned, the RETURN clause can be used to construct the output; for instance, this would return
all orders as a single string:

SELECT @xml.query('
for $Orders in /CustomerOrders/Customers/CustomerOrderId/Orders
return string($Orders)
')

In the optional LET and WHERE clauses, you can perform filtering. For example, this will assign the
quantity of an order to the variable $Quantity, and filter for orders with a quantity over 20:

SELECT @xml.query('
for $Orders in /CustomerOrders/Customers/CustomerOrderId/Orders
let $Quantity :=$Orders/OrderLines/Quantity
where $Quantity > 20
return ($Orders)
')

And ORDER BY does exactly what you’d expect; it orders the output.
SELECT @xml.query('
for $Orders in /CustomerOrders/Customers/CustomerOrderId/Orders
order by $Orders/@OrderID
return $Orders
')

The FLWOR statement can get a lot more complicated, but we’ll leave it at this. For more information
on the FLWOR statement, please refer to Microsoft Docs:

https://docs.microsoft.com/en-us/sql/xquery/flwor-statement-and-iteration-xquery?view=sql-server-
2016

The second method, value, returns a single value as a SQL data type. That means that there are two
differences compared to the query method: we need to make sure that only a single atomic value can
be returned, and we need to supply a data type. For instance:
SELECT @xml.value('(/CustomerOrders/Customers[2]/CustomerOrderId/Orders/OrderLines)[1]', 'varchar(1000)')

If you supply the wrong data type, you’ll get an error. Let’s replace varchar(1000) with int:

SELECT @xml.value('(/CustomerOrders/Customers[2]/CustomerOrderId/Orders/OrderLines)[1]', 'int')

Msg 245, Level 16, State 1, Line 47


Conversion failed when converting the nvarchar value 'Developer joke mug - old C developers never die (White)9' to data type int.

And likewise, if you specify an XQuery that may return more than a single value. This is what
happens when we remove the ordinal number [1] at the end of the XQuery:

SELECT @xml.value('(/CustomerOrders/Customers[2]/CustomerOrderId/Orders/OrderLines)', 'varchar(1000)')

Msg 2389, Level 16, State 1, Line 47


XQuery [value()]: 'value()' requires a singleton (or empty sequence), found operand of type 'xdt:untypedAtomic *'

Note that it is not enough that only a single value will be returned with that particular input; the query
must enforce that only a single value will be returned, even with other input (with the same schema).

The next method to discuss is exist. This will return true if a value exists in the XML variable or
column, and false if the value does not exist. For instance, we can check to see if a fifth order line
exists:

SELECT @xml.exist('(/CustomerOrders/Customers/CustomerOrderId/Orders/OrderLines)[5]')

This will return false, as we’ve only got 4 order lines. And to check if there is data for customer id
832:
SELECT @xml.exist('/CustomerOrders/Customers[@id=832]')

The actual XQuery can get a lot more complicated than this, but we’ll stick to the basics here. The
last method that returns data is the nodes method. This is used in conjunction with the value or query
methods, to turn an XML document into a relational table. This process is sometimes called
shredding. This is illustrated by the following query:

SELECT XML_table.value('../../../@id', N'int') as customerid


,XML_table.value('../../../Customername[1]', 'varchar(100)') as customername
,XML_table.value('../@OrderID', 'varchar(100)') as Orderid
,XML_table.value('@OrderLineID', 'varchar(100)') as OrderLineid
,XML_table.value('Quantity[1]', 'varchar(100)') as quantity
,XML_table.value('Description[1]', 'varchar(100)') as description
FROM @xml.nodes('/CustomerOrders/Customers/CustomerOrderId/Orders/OrderLines') as t(XML_table)
ORDER BY customerid desc

Note that the XQueries used in the value methods are almost an exact match of our last OPENXML
example; but because the value method requires a singleton value, the ordinal number [1] has been
added for the elements.

The result is a relational table, with one row per order line containing the customer, order and order line data.

And to demonstrate the combination of methods value and query, we’ll retrieve the customer id, the
order id and, as XML, the order details:

SELECT XML_table.value('../../@id', N'int') as customerid


,XML_table.value('@OrderID', N'int') as orderID
,XML_table.query('.') as orderdetails
FROM @xml.nodes('/CustomerOrders/Customers/CustomerOrderId/Orders') as t(XML_table)

In the resulting table, the order details for each order are returned as XML.

The last method we need to discuss is the modify method. It allows you to insert, delete or modify
XML. For this, we’ll return to our simpler XML string:
<Employees>
<Employee Id="1">
<FirstName>Bob</FirstName>
<LastName>Jackson</LastName>
</Employee>
<Employee Id="2">
<FirstName>Bo</FirstName>
<LastName>Didley</LastName>
</Employee>
</Employees>

DECLARE @xml XML

SET @xml = '


<Employees>
<Employee Id="1">
<FirstName>Bob</FirstName>
<LastName>Jackson</LastName>
</Employee>
<Employee Id="2">
<FirstName>Bo</FirstName>
<LastName>Didley</LastName>
</Employee>
</Employees>
'

In order to insert data, either an element or an attribute, you need to supply the location where you
want to insert the new data. To insert an attribute, you supply the element you want to insert the
attribute into. For instance, to insert the attribute SSID for employee #2, you would use the following
syntax:

SET @xml.modify('
insert attribute SSID {"idontknow"}
into (/Employees/Employee)[2]
')

This would result in:

<Employees>
<Employee Id="1">
<FirstName>Bob</FirstName>
<LastName>Jackson</LastName>
</Employee>
<Employee Id="2" SSID="idontknow">
<FirstName>Bo</FirstName>
<LastName>Didley</LastName>
</Employee>
</Employees>

For an element, you need to specify if you want to insert this before, after or into another element. As
an alternative, you can also specify to insert it into the first or last position.
For example, to insert another employee at the last position, you have two options. Either you insert it
into the last position directly, or you take into account that there are currently two employee elements,
and insert the new element after the second element:

SET @xml.modify('
insert <Employee Id="3"><FirstName>Jeff</FirstName><LastName>Williams</LastName></Employee>
after (/Employees/Employee)[2]
')

SET @xml.modify('
insert <Employee Id="3"><FirstName>Jeff</FirstName><LastName>Williams</LastName></Employee>
into (/Employees)[1]
')

Both will result in:

<Employees>
<Employee Id="1">
<FirstName>Bob</FirstName>
<LastName>Jackson</LastName>
</Employee>
<Employee Id="2">
<FirstName>Bo</FirstName>
<LastName>Didley</LastName>
</Employee>
<Employee Id="3">
<FirstName>Jeff</FirstName>
<LastName>Williams</LastName>
</Employee>
</Employees>

Deleting data is straightforward: just supply the XPath. So to delete an attribute:

SET @xml.modify('
delete /Employees/Employee[2]/@SSID
')

And to delete an element:

SET @xml.modify('
delete /Employees/Employee[2]/Address
')

If the element or attribute isn’t present, no error will be returned.

The last data modification of the modify method, is replace. This will replace the value of an element
or attribute. First, to replace an attribute:

SET @xml.modify('
replace value of (/Employees/Employee[2]/@SSID)[1]
with "001-002-003"
')

And second, to replace the text of an element, you have to add text() to the path:

SET @xml.modify('
replace value of (/Employees/Employee[2]/Address/text())[1]
with "Beverly Hills"
')

That is it. We’ve now covered the 4 basic methods to turn XML into relational data (query, value,
nodes and exist), and the related method to modify XML data, modify. Along the way, we’ve seen the
FLWOR statement in XQuery: FOR, LET, WHERE, ORDER BY, RETURN. You’ve now learned the
basics of using XML in SQL Server.

When (not) to use XML or JSON


On a more practical note: when should you use non-relational data, in the form of either XML or
JSON, and when should you stay away from it? This is not a stated exam objective, but we’d like to
touch upon this subject anyway.

There are definitely some benefits to using XML or JSON. Both are very flexible. This flexibility
allows the exchange of data between systems. For example, you may receive data from a supplier in
the form of XML documents, and process the data into your own system; for this, you don’t need a
direct connection to their systems. And the self-describing nature of XML and JSON, with the ability
to nest data, is a great advantage over text based data files such as CSV files.

This flexibility also allows you to add new properties a lot more easily than in a relational database. Imagine
that you have a shop, and have a product table for the products you sell. In a relational design, every
property would be translated into a column in your product table. But what if you sell hundreds of different
types of products? Each of them might have distinct properties that other products don’t have. CDs
have a composer, and cars have a number of doors, but not vice versa. In a truly relational design,
you’d end up with countless columns in your product table; this would be unmanageable. With non-
relational data, you can easily add a property to one product, without adding it to every product. You
might create columns for the most common properties, and an additional text column to store all non-
standard properties in JSON format.
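
A minimal sketch of such a hybrid design (all names are made up for illustration):

CREATE TABLE dbo.Products (
ProductID int IDENTITY PRIMARY KEY
,ProductName varchar(100) NOT NULL
,Price decimal(18,2) NOT NULL
,ExtraProperties nvarchar(max) NULL --non-standard properties stored as JSON
)

INSERT dbo.Products (ProductName, Price, ExtraProperties)
VALUES ('Family sedan', 25000, '{"doors":4,"colour":"red"}')
,('Greatest hits CD', 10, '{"composer":"Bo Didley"}')

--query a non-standard property
SELECT ProductName, JSON_VALUE(ExtraProperties, '$.doors') AS doors
FROM dbo.Products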

This flexibility of JSON and XML also has a serious downside. The rigid requirements of primary
keys, foreign keys and other data constraints are simply essential to some systems. When selling
products for a web shop, it might not be a big deal if your cars have the property color, and your t-
shirts have a property colour. In other systems, this kind of data pollution might not be acceptable.

Another downside of non-relational data in SQL Server is that performance might suffer. SQL is
optimized to handle relational data. Searching for data in a relational format is far quicker than
searching for data inside XML or JSON strings. To alleviate this problem, you can put XML indexes
on XML columns. This is not an option for JSON. But even so: if you regularly need to scan large
tables for data inside XML columns, you might be better off by shredding the most frequently
searched attributes and elements into a relational format.
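
For example, a quick sketch of both approaches (reusing the hypothetical Products table from above; the XML table and index names are made up): an XML column can be indexed directly, while for JSON the usual workaround is to index a computed column based on JSON_VALUE:

--an XML column can get an XML index (the table needs a clustered primary key)
--CREATE PRIMARY XML INDEX XIdx_Documents ON dbo.ProductDocuments (XmlData)

--for JSON, add a computed column and index that
ALTER TABLE dbo.Products ADD Doors AS JSON_VALUE(ExtraProperties, '$.doors')
GO
CREATE NONCLUSTERED INDEX IX_Products_Doors ON dbo.Products (Doors)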

In a world where an ever increasing number of systems are connected, exchanging data between those
systems becomes more and more important, so even as a SQL developer, you need to know how to
handle non-relational data.
Summary
In this chapter, we’ve covered some advanced query techniques. We’ve seen how to turn a query into
a subquery, either a regular subquery or a correlated subquery, and demonstrated the differences
between subqueries and joins. We’ve also covered the APPLY operator, which combines a result set
and a table-valued function. We’ve talked about common table expressions, and the difference
between common table expressions and temporary tables; in chapter 3, we’ll talk about views, which,
like common table expressions, can be used to simplify querying. We’ve also covered different ways
to aggregate data: using grouping functions, windowing functions and the PIVOT operator. Finally, we’ve
seen how to query temporal data using temporal tables, and how to output and extract non-relational
data in the JSON and XML formats.
Questions

QUESTION 1
True or false. If you PIVOT a result set, and then UNPIVOT it, you’ll end up with the original result
set.

A True
B False

QUESTION 2
What statement is true, regarding the difference between XML and JSON? Choose all that apply.

A JSON documents will be smaller than XML documents containing the same data.
B SQL has special indexes for JSON, but not for XML.
C SQL has a JSON data type, but not for XML.
D Both JSON and XML support nesting of data elements.
E JSON is case sensitive, XML is not.

QUESTION 3
What technique is used to calculate a running average?

A Grouping functions
B Windowing function
C Recursive common table expression
D PIVOT function

QUESTION 4
True or false. A query on a temporal table using "FOR SYSTEM_TIME ALL" will return the
currently valid record from the base table, in addition to all records from the history table.

A True
B False

QUESTION 5
What statement is true, regarding the difference between a local temporary table and a global
temporary table? Choose all that apply.

A A global temporary table is created by using the prefix @@, a local temporary table is created
by adding the prefix @
B Data in a local temporary table is unavailable to other sessions, data in a global temporary table
is available to other sessions.
C A local temporary table is automatically deleted when the session that created it is ended; a
global temporary table must be deleted manually
D Both global and local temporary tables are deprecated. Use temporal tables instead.
QUESTION 6
Consider the following T-SQL code:

DECLARE @json NVARCHAR(MAX);


SET @json =
N'
{ "products":[
{ "id" : 1,"productinfo": { "product": "printer"
, "model": "i300" }
, "price": 300 },
{ "id" : 2,"productinfo": { "product": "scanner",
"model": "ScanXL" }
, "price": 500 }
]
}'

/* update price for scanner ScanXL to 400 */

SELECT @json;

Which update statement can be used to change the price for scanner ScanXL to 400? Choose all that
apply.

A SET @json = JSON_MODIFY(@json,'$.products[1].price',400);


B SET @json = JSON_MODIFY(@json,'$.products[2].price',400);
C SET @json = JSON_UPDATE(@json,'/Products[1]/price',400);
D SET @json = JSON_UPDATE(@json,'/Products[2]/price',400);
E SET @json = JSON_MODIFY(@json,'/Products[1]/price',400);
F SET @json = JSON_MODIFY(@json,'/Products[2]/price',400);

QUESTION 7
Given the same JSON string, what statement is needed to retrieve the productinfo object from product
id 1?

A SET @json = JSON_VALUE(@json, '$.products[0].productinfo')


B SET @json = JSON_VALUE(@json, '$_.products[0].productinfo')
C SET @json = JSON_QUERY(@json, '$.products[0].productinfo')
D SET @json = JSON_QUERY(@json, '$_.products[0].productinfo')

QUESTION 8
Which statements regarding temporal tables are true? Choose all that apply.

A A temporal table requires a primary key.


B A history table contains, in addition to the columns in the original record, two additional
columns: the time it was changed and the user who changed the record.
C To retrieve the currently valid record, use WHERE ValidTo is NULL
D The additional columns required for a temporal table are hidden by default.

QUESTION 9
True or false. In a windowing function, the GROUPING() function is used to indicate whether a
NULL value is part of the original data, or the result of aggregation.

A True
B False.

QUESTION 10
Given the following table definition:

CREATE TABLE [dbo].[Employees](


[EmployeeID] [tinyint] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](100) NULL,
[LastName] [varchar](50) NULL,
[Address] [varchar](100) NULL,
[Salary] [decimal](18, 2) NULL,
[Department] [varchar](100) NULL,
[Employee_Number] [int] NULL,
[ManagerID] [tinyint] NULL,
[json_column] [nvarchar](max) NULL,
CONSTRAINT [PK_EmployeeID] PRIMARY KEY CLUSTERED
(
[EmployeeID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

What type of query is needed to supply a list of all employees and the people who report to them
directly (if any)?

A Recursive table expression


B Inner join
C Windowing function
D Outer apply

QUESTION 11
For the scenario in question 10, you’ve decided to try an apply function. What code is needed to
solve this? Fill in only the essential code on the dots.

SELECT *
FROM Employees e
... ( SELECT FirstName + ' ' + LastName AS TeamMembers
FROM Employees e2 ...)
...

A CROSS APPLY
B OUTER APPLY
C WHERE e2.ManagerID = e.employeeID
D WHERE e.ManagerID = e2.employeeID
E table_alias
F WHERE e2.managerId IS NOT NULL

QUESTION 12
You need to make a report to provide the top 10 customers, based on the amount of sales in the year
2016. You have the following query:

SELECT TOP 10 CustomerName


,( SELECT SUM(OrderAmount)
FROM Orders o
WHERE o.CustomerID = c.CustomerID)
FROM Customers c
WHERE YEAR(o.order_date) = 2016

Will this query work?

A No, it will not work.


B It will work, but it will give incorrect results.
C It will work, but it will perform slowly.
D Nothing is wrong.
Answers

This section contains the correct answers to the questions, plus an explanation of the wrong answers.
In addition to the correct answers, we’ll also give a few pointers which are useful on the actual exam.

QUESTION 1
The correct answer is: B, false. The PIVOT statement will perform an aggregate on a column, which
unpivot cannot undo.

QUESTION 2
A: true, as XML requires an opening and closing tag.
B: not true, it is the other way around.
C: not true, it is the other way around.
D: true.
E: not true. Both are case sensitive.

QUESTION 3
The correct answer is B. A running average is also known as a moving average.

QUESTION 4
The correct answer is B. From the history table, only those records will be returned that have been
valid at a certain point in time; this excludes records with the same start and end time.

QUESTION 5
Answer A is incorrect. A global temporary table is created by using the prefix ##, a local temporary
table is created by adding the prefix #
Answer B is correct.
Answer C is incorrect. Both a local and a global temporary table are deleted automatically when the
session that created them ends.
Answer D is incorrect. Global and local temporary tables serve a completely different purpose than
temporal tables. Global and local temporary tables are intended to store data for a limited time; temporal
tables are intended to preserve older copies of records.

QUESTION 6
The correct answer is A. Answer B is false, as JSON starts counting at 0. Answers C through F are
false, as they mix JSON and XML syntax.

QUESTION 7
The correct answer is C. JSON_VALUE will only return a value, not an object (or array), therefore
answers A and B are false; answer D is false, as the underscore after the dollar sign is syntax used in
PowerShell, not JSON.

QUESTION 8
Answers A and D are correct. Answer B is incorrect, as the history table contains the same columns
as the temporal table. This includes a timestamp for the time at which the record became valid, and a
timestamp for when the record became invalid. A column for the user that performed the change is not
available. Answer C is false; to retrieve the currently valid record, no additional WHERE clause is
required.

QUESTION 9
The correct answer is B, false. The GROUPING() function is used with GROUP BY (for example with
GROUPING SETS, CUBE or ROLLUP), not with a windowing function.

QUESTION 10
The correct answer is D, an outer apply. Answer A is wrong; a recursive query would be needed if
all subordinate employees are needed (either reporting directly or indirectly to a manager).
Answer B is wrong; it would leave out the employees to whom nobody reports (an outer join would
work fine, though). Answer C is wrong; a windowing function is used for aggregating.

QUESTION 11
The correct answers are B, C and E. Answer A is wrong; CROSS APPLY leaves out the employees to
whom nobody reports. Answer D is wrong, as it would list the employee’s manager.
And answer F is wrong; no further filtering is needed.
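
Put together, the complete query would look roughly like this (the alias name team is made up; any alias satisfies answer E):

SELECT *
FROM Employees e
OUTER APPLY ( SELECT FirstName + ' ' + LastName AS TeamMembers
              FROM Employees e2
              WHERE e2.ManagerID = e.EmployeeID ) team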

QUESTION 12
Answer A is correct. The WHERE clause refers to a table alias that is used in the sub query, so this
won’t work. That makes all the other answers wrong.
However, if you moved the WHERE clause into the subquery, answer B would be correct: without an
ORDER BY clause on the second column, any 10 customers could be returned. And with the ORDER BY
clause added, answer C might be correct: because of the YEAR() function, an index on order_date would
not be used.
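
For reference, one way the corrected query might look (a sketch, assuming the table structure implied by the question, with the year filter moved into the subquery and the ORDER BY added):

SELECT TOP 10 CustomerName
     , ( SELECT SUM(OrderAmount)
         FROM Orders o
         WHERE o.CustomerID = c.CustomerID
           AND YEAR(o.order_date) = 2016 ) AS TotalSales
FROM Customers c
ORDER BY TotalSales DESC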
Chapter 3: Program databases by using Transact-SQL
Chapter overview
In this final chapter, we’ll dive deeper into subjects that we’ve already touched upon, or at least,
mentioned in the previous chapters.
We’ll explain stored procedures, functions and views: how to create them, use them and drop them.
Next, we’ll discuss transactions and error handling. It is important to write robust code, and to
consider up front what might go wrong, and how to prevent errors in the system or in the input from
resulting in errors in your data, or output.
And finally, we’ll talk about different data types, and nullability. We’ve already seen, and used, a lot
of data types, and we’ve seen how different Transact-SQL components are affected by NULL values,
so we will elaborate on the work we’ve already done.

Exam objectives
For the exam, the relevant objectives are:

Program databases by using Transact-SQL (25–30%)


Create database programmability objects by using Transact-SQL
Create stored procedures, table-valued and scalar-valued user-defined functions, and
views; implement input and output parameters in stored procedures; identify whether
to use scalar-valued or table-valued functions; distinguish between deterministic and
non-deterministic functions; create indexed views
Implement error handling and transactions
Determine results of Data Definition Language (DDL) statements based on transaction
control statements, implement TRY…CATCH error handling with Transact-SQL,
generate error messages with THROW and RAISERROR, implement transaction
control in conjunction with error handling in stored procedures
Implement data types and NULLs
Evaluate results of data type conversions, determine proper data types for given data
elements or table columns, identify locations of implicit data type conversions in
queries, determine the correct results of joins and functions in the presence of NULL
values, identify proper usage of ISNULL and COALESCE functions
Create database programmability objects by using Transact-SQL
In this part, we’ll explain stored procedures, functions and views: how to create them, use them and
drop them. This is material which is roughly the same as for the MTA 98-364 Database Fundamentals
exam, so at this point in your career, you should already know a lot of this by heart. And moreover,
we’ve already encountered each of these programmability objects in the first two chapters. However,
we’ll repeat all the basic material just the same, because we need to add a few points that were not
topics of the MTA 98-364 exam. We’ll give a quick summary up front of the additional points we’re
going to cover; if you are already familiar with these points, you might consider skipping ahead to the
error handling objectives.
For views, we’ll discuss the schema binding and encryption options, as well as the ability to insert
data into a view, and create indexed views; for functions, we’ll dive a little deeper in the
requirements for creating a function; and for stored procedures, we’ll look at the option
RECOMPILE.

Create stored procedures, table-valued and scalar-valued user-defined functions, and views;

Views
Let’s start with views. A view is a virtual table, whose contents are defined by a query. The view
itself does not store any data; this data is still stored in the underlying table(s). There are two main
reasons to use views:
* To allow for easier coding (better readability);
* To provide granular security.

Earlier, we talked about the common table expression. A common table expression is very similar to
a view. The difference is that you store the definition of the view in the database, for later reuse,
whereas the common table expression only exists for the query that defined it.

One of the first examples we gave of a common table expression was this statement:

WITH o
AS ( SELECT *
FROM Sales.Orders)
SELECT *
FROM o

This was obviously a contrived example; there is no point in selecting everything from a common
table expression without further manipulation. This example was meant to demonstrate the syntax of
the common table expression, not the usefulness.
Code containing common table expressions can be hard to read when the code gets more complex. To
make life easier for the developer, this SQL statement can be stored as a view. The following
example demonstrates how to create a simple view:

CREATE VIEW vwEmployeeFirstName


AS
SELECT FirstName
FROM Employees

Now, we can select from the view just as we would from a common table expression (or a regular
table, for that matter):

SELECT *
FROM vwEmployeeFirstName

With this simple example, using a view will not improve readability. Using views to improve
readability is best saved for complex queries, for example, when joining multiple tables, or shredding
XML; however, creating and using the view works just the same, so there is no need to use more
complex examples here.

By the way: starting with SQL Server 2016 Service Pack 1, you can create or alter a view without
dropping it first, using the following syntax:

CREATE OR ALTER VIEW vwEmployeeFirstName


AS
SELECT FirstName
FROM Employees;

This CREATE OR ALTER syntax also works for stored procedures, functions and triggers. Like the
DROP TABLE IF EXISTS statement we saw earlier, this means you don’t have to check if an object
already exists before creating it, making your code a little bit easier to read. Also, dropping and
recreating an object means that all permissions granted on the object will have disappeared; using
CREATE OR ALTER, you maintain the permissions.

The second reason to use a view is security. You can grant SELECT permissions on a view without
granting permissions on the underlying tables. Let’s take our employees table as an example. This
table contains salary information, so access to this table should be on a need-to-know basis. Suppose
you want to allow the manager from the Engineering department to see information about just the
employees from the Engineering department, not the employees of other departments. One way of
achieving that is to create a view, and grant the manager permissions on that view:

CREATE VIEW vwEmployeesEngineering


AS
SELECT *
FROM dbo.Employees
WHERE Department = 'Engineering'
GO
SELECT *
FROM vwEmployeesEngineering

There are some rules regarding views we’d like to mention:


* the CREATE VIEW statement must be in a batch of its own, that is why we needed to provide the
GO batch separator in the example;
* as security is not an exam objective, we’ve skipped the step of granting permissions on the view;
* every column in the view needs a (unique) name, so provide a column alias when using functions or
selecting from multiple tables that use the same column name;
* you can’t use an ORDER BY statement in the definition of the view, unless TOP is also specified
(but you can use ORDER BY when selecting from the view);
* we’ve used the two part name (schema.object), which is required for indexed views (as we’ll see
later on).

Basically, this is all knowledge you should already have from the Database Fundamentals exam. As
we said, there are a few additional options we’d like to discuss: schema binding, encryption, the check
option and DML statements on views.
Schema binding is used to protect the view against changes to the underlying schema. If the schema is
modified, this might prevent your view from working properly. With schema binding on a view, you
cannot drop any table or alter any column that the view depends upon; instead, you have to drop the
view first, then make the change to the underlying table, and recreate the view. The advantage of this
is that you have to be aware of the effect your change has; you cannot accidentally mess up the view.
We’ll demonstrate what happens to the view when somebody drops a column that the view depends
on, with and without schema binding. First, we’ll add a new column; second, we’ll change the view
definition to include the new column; then, we’ll select from the view, and see what happens before
and after we drop the additional column.

ALTER TABLE Employees ADD AdditionalColumn varchar(100) NULL


GO
CREATE OR ALTER VIEW vwEmployeesEngineering
AS
SELECT FirstName, LastName, AdditionalColumn
FROM dbo.Employees
WHERE Department = 'Engineering'
GO
SELECT *
FROM vwEmployeesEngineering
GO
ALTER TABLE Employees DROP COLUMN AdditionalColumn
GO
SELECT *
FROM vwEmployeesEngineering

The first select against the view will work just fine; the second will result in the following error:
Msg 207, Level 16, State 1, Procedure vwEmployeesEngineering, Line 3 [Batch Start Line 41]
Invalid column name 'AdditionalColumn'.
Msg 4413, Level 16, State 1, Line 42
Could not use view or function 'vwEmployeesEngineering' because of binding errors.

This error could have been prevented by adding the schema binding option to the view definition:
CREATE OR ALTER VIEW vwEmployeesEngineering
WITH SCHEMABINDING
AS
SELECT FirstName, LastName, AdditionalColumn
FROM dbo.Employees
WHERE Department = 'Engineering'

With the schema binding option, an attempt to drop the column would have been prevented, with the
following error:
Msg 5074, Level 16, State 1, Line 46
The object 'vwEmployeesEngineering' is dependent on column 'AdditionalColumn'.
Msg 4922, Level 16, State 9, Line 46
ALTER TABLE DROP COLUMN AdditionalColumn failed because one or more objects access this column.

This would have informed the developer that other code still relies on this object, and that dropping
this column would break other stuff.
Note that, in order to get this example to work, we changed the definition of the view to include the
new column; the old definition used a SELECT *, but even so, after adding the column there is no
guarantee that this new column would be included in the result set. We could have achieved the same
result by refreshing the definition of the view in another way:
EXEC sp_refreshview 'vwEmployeesEngineering'

This unpredictable effect on a view of adding a column to an underlying table is another example of
why you should not use SELECT * in production.
The second option we’d like to mention is encryption:
CREATE OR ALTER VIEW vwEmployeesEngineering
WITH ENCRYPTION
AS
SELECT FirstName, LastName, AdditionalColumn
FROM dbo.Employees
WHERE Department = 'Engineering'

This will prevent anybody from reading the definition of the view. Normally, any user with sufficient
privileges on a database can see the definition of an object, for example by right-clicking the object in
Object Explorer, and creating a script to create the object, or by selecting the definition directly from
a system table called sys.syscomments:
SELECT object_name(id), text
FROM sys.syscomments

These methods will not work for objects that have been created with the encryption option. Note,
however, that this protection is not airtight: there are step-by-step instructions on the internet on how
to circumvent this encryption option.
The third option we’d like to discuss is the check option. This option will prevent inserts or updates
to the table that would cause the record to fall outside of the scope of the view.
We haven’t covered DML statements against views yet, but performing updates, inserts or deletes
against a view is pretty straightforward. We’ll give some examples of DML against a view here.
There are two important restrictions when performing DML against a view: changes may only affect
one of the underlying tables, and SQL Server must be able to make the change unambiguously (which
basically means that updates against columns that are the result of a function aren’t allowed).
The check option will add a third restriction. As stated, this option will prevent inserts or updates to
the table that would cause the record to fall outside of the scope of the view. For example, changing
the first name of Bob to Bobby would be no problem, with or without the check option:
UPDATE vwEmployeesEngineering
SET FirstName = 'Bobby'
WHERE FirstName = 'Bob'

If this were done by mistake, the person making the update could easily revert the change. But
changing the department would cause the record to fall outside the view, and a person with
permissions only on this view would not be able to correct this mistake:
UPDATE vwEmployeesEngineering
SET Department = 'Sales'
WHERE FirstName = 'Bobby'

Let’s revert these two updates, and change the view definition to include the check option:
UPDATE dbo.Employees
SET FirstName = 'Bob', Department = 'Engineering'
WHERE FirstName = 'Bobby'
GO
CREATE OR ALTER VIEW vwEmployeesEngineering
AS
SELECT *
FROM dbo.Employees
WHERE Department = 'Engineering'
WITH CHECK OPTION

With this check option in place, updating the first name would still work, but updating the department
would fail with the following error:
Msg 550, Level 16, State 1, Line 80
The attempted insert or update failed because the target view either specifies WITH CHECK OPTION or spans a view that specifies
WITH CHECK OPTION and one or more rows resulting from the operation did not qualify under the CHECK OPTION constraint.
The statement has been terminated.

The same restriction applies when inserting records into a view: the record must be included in the
scope of the view, otherwise the insert is not allowed.
So with the check option in place, the following insert statement will fail:

INSERT INTO [dbo].[vwEmployeesEngineering]


([FirstName], [LastName], [Address], [Salary], [Department])
VALUES ('James' ,'Jameson' ,'Somewhere in Ireland' ,1000 ,'sales')
GO

Obviously, this restriction does not apply for deletes; by its very nature, a delete statement will cause
the record to no longer be returned by the view. So the following will work fine, with or without the
check option in the view definition:
DELETE FROM vwEmployeesEngineering WHERE FirstName = 'James'

We’ve now seen how to create and use a view. Now, let’s drop the view:

DROP VIEW IF EXISTS vwEmployeesEngineering

There is still one more thing we need to discuss about views though:

Create indexed views

On a view, you can create one or more indexes to speed up queries. Without indexes on the view, a
view is just a query stored in the database; SQL Server will attempt to use indexes on the underlying
tables to execute the query on the view, just as it would with a regular query.

When creating indexes on a view, there are some points to consider:


* the first index you create on a view must be a unique clustered index;
* after the first (unique clustered) index, you can create additional, nonclustered indexes;
* the view must be defined with schema binding;
* you cannot use nondeterministic functions in the view (not even in columns that are not included in
an index);
* all tables and functions referenced in the view must use the two part name, schema.object;
* as with an index on a regular table, an index on a view will cause delay in performing updates,
inserts and deletes, whether done directly on the underlying table or indirectly through the view.

With an index on a view, SQL Server will store the values for those columns as a separate object.
This is most useful with queries that require complex calculations, such as grouping statements.
However, these are also the cases in which updates on the base tables may be most expensive, so you need
to carefully weigh the pros and the cons before actually adding an index to a view.
Also, as with any index you want to add, you need to make sure that SQL Server will actually use the
index you’re going to add. We won’t cover all the reasons why SQL Server may or may not use an
index for a given query, as this is way out of scope for this exam; we do, however, need to mention
that in the case of indexed views, there is an extra aspect SQL Server takes into consideration when
deciding whether or not to use the index: the edition of SQL Server. If you use Enterprise Edition,
SQL Server will automatically consider using an indexed view, even if you perform a query on the
underlying table(s) directly; in other editions of SQL Server, you need to use the table hint WITH
(NOEXPAND) to force SQL Server to use the indexed view.
To be more precise: this extra consideration applies to previous versions of SQL Server, and to
versions of SQL Server prior to Service Pack 1. With SQL Server 2016 Service Pack 1 or later, the
difference between editions has been removed.
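
To tie these requirements together, here is a minimal sketch of an indexed view on our Employees table (the view and index names are made up for this example):

CREATE OR ALTER VIEW dbo.vwEmployeeNames
WITH SCHEMABINDING
AS
SELECT EmployeeID, FirstName, LastName
FROM dbo.Employees
GO
-- the first index on a view must be a unique clustered index
CREATE UNIQUE CLUSTERED INDEX IX_vwEmployeeNames
ON dbo.vwEmployeeNames (EmployeeID)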

That’s all for indexed views. Now, let’s move on to stored procedures.

Implement input and output parameters in stored procedures


A stored procedure is a small T-SQL program stored inside the database. This can be a single T-SQL
statement, or a collection of statements. There are lots of reasons to create a stored procedure:
* To allow for easier coding (especially code reuse);
* To enhance performance. Each time SQL Server executes a statement, it has to devise a plan for how to
execute this statement as fast as possible (this is called the execution plan). This plan will be stored in
memory. The execution plan of a stored procedure can be reused the next time the stored proc is
executed; this may not be the case for code executed outside a stored procedure. Reuse of an
execution plan is beneficial, since calculating an execution plan can be quite expensive (in terms of
CPU time).
* To provide granular security. As with a view, you can grant a user the permission to execute a
stored procedure without granting permissions on the underlying tables.
* To perform complex checks and error handling.

We’ll first show you an easy example of a stored procedure, before we introduce the option that
really causes stored procedures to shine: parameters.

Here is how to create a stored procedure that selects employee information:

CREATE OR ALTER PROC procGetEmployeeInformation


AS
BEGIN
--explanation of what the stored proc does
SET NOCOUNT ON;
SELECT *
FROM Employees;
END

“CREATE PROC” is short for “CREATE PROCEDURE”; both do the same. The “BEGIN” and
“END” keywords are optional; so is the “SET NOCOUNT ON”. Both, however, are best practices,
so we’ll include them here. The “SET NOCOUNT ON” prevents SQL from returning the message “x
rows affected” to the client; this information is almost always ignored, so “SET NOCOUNT ON”
prevents this network overhead (and thereby improves performance just a little bit). Later on, we’ll
explain why you should add the connection setting XACT_ABORT as well, when discussing error
handling.

The line of commentary may be optional from a technical perspective, but many developers consider it
mandatory. Because the stored procedure gets stored in the database, it can be especially
helpful to add comments. For example, you might put in things like:

* What the stored proc is supposed to do;


* Special attention to parts of the code that are not straightforward;
* The date the stored proc was written;
* The name of the developer who wrote it;
* The version of the stored proc;
* Changes in this version.
Our example stored procedure can be executed as follows:

EXEC procGetEmployeeInformation

“EXEC” is short for “EXECUTE”; both do the same. The DDL code to delete the stored procedure is
very familiar:

DROP PROC IF EXISTS procGetEmployeeInformation

All this is pretty straightforward. Things become really interesting when you start to use parameters.
A parameter is a variable that can be used to store a single value. This value can then be manipulated
and used to compare with data attributes, variables or other parameters. As with a column, a
parameter has to be declared with a data type. Parameters can be either optional or mandatory, and
used for either input into the stored proc, or output from the stored proc.

In the following example, we’ll create a stored proc to retrieve the ID of an employee.

CREATE PROC procGetEmployeeID


@FirstName varchar(100)
,@LastName varchar(50)
,@Address varchar(100)
AS
BEGIN
SELECT EmployeeID
FROM Employees
WHERE FirstName = @FirstName
AND LastName = @LastName
AND Address = @Address;
END;

Two things about the parameter declaration. First, note that we’ve used the same data type for the
parameters as the column definition. This decreases the chances of errors; if the column would allow
200 characters, and the corresponding parameter only 100, you could get errors (later in this chapter, we’ll
talk more about implicit conversions when data types do not match). Second, we’ve given each
parameter a descriptive name. It may be tempting to name your variables @1, @2 and @3, but avoid
doing this; otherwise, larger pieces of code will become unreadable.

To execute this stored proc, you have to supply a value for each parameter. While typing, Intellisense
will help you with the parameters you need to supply. There are two ways of supplying the values for
the parameters, explicit and implicit.

EXEC procGetEmployeeID @FirstName = 'Bob', @LastName = 'Jackson', @Address = 'Under the bridge'

EXEC procGetEmployeeID 'Bob', 'Jackson', 'Under the bridge'


In the first example, we’ve explicitly given the parameter names as well as the values; in the second
example we’ve only given the parameter values. If you supply a value for each parameter, in the
correct order, it is optional to supply the parameter names (however, for readability it is better to
make things explicit). In this case, all parameters are mandatory, so you need to supply a value;
otherwise, you’ll get an error, for example if you omit the value for address:

Msg 201, Level 16, State 4, Procedure procGetEmployeeID, Line 0 [Batch Start Line 127]
Procedure or function 'procGetEmployeeID' expects parameter '@Address', which was not supplied.

A parameter can be made optional by declaring it with a default value, as this (partial) statement
demonstrates:

CREATE PROC procGetEmployeeID


@FirstName varchar(100) = 'Bob'
,@LastName varchar(50) = NULL
...

The preceding parameters were all input parameters. Now let’s look at an output parameter. You can
define an output parameter simply by using the keyword “OUTPUT” (or “OUT”). Then, in the body of
the stored proc, you have to assign a value to this parameter (if you don’t supply a value to this output
parameter, it will still work, but it doesn’t make any sense to return an empty output parameter).

Let’s change our stored proc to add an output parameter. Because the stored procedure already exists,
we have to either drop it first, or use “CREATE OR ALTER” syntax we also saw for views. And as
with the views, the difference between ALTER and DROP & CREATE is that the former maintains all
permissions, and the latter does not.

CREATE OR ALTER PROC procGetEmployeeID


@FirstName varchar(100)
,@LastName varchar(50)
,@Address varchar(100)
,@EmployeeID tinyint OUTPUT
AS
BEGIN
SELECT @EmployeeID = EmployeeID
FROM Employees
WHERE FirstName = @FirstName
AND LastName = @LastName
AND Address = @Address;
END;

To execute the stored proc, you have to create a variable to capture the output:

DECLARE @EmployeeID tinyint


EXEC procGetEmployeeID 'Bob', 'Jackson', 'Under the bridge', @EmployeeID = @EmployeeID OUTPUT
SELECT @EmployeeID
This will, of course, return the ID for Bob Jackson. It might be confusing that @EmployeeID gets
declared twice: once in the stored proc, and once in the batch that calls the stored proc. To be more
precise: it is not one variable declared twice, but two different variables for the same purpose (and
therefore we’ve given them the same name). This has to do with the scope of the variables: the parameter
declared in the stored proc can only be used inside the stored procedure (and referenced by name in the
execute statement), while the variable declared in the calling batch only exists in that batch. It is said that
the variable is local to the stored proc (local scope).

We’ve now seen how to use parameters as input and output. Parameters can also be used to control
the flow of logic inside a stored procedure. An example of this would be the following:

CREATE PROC procLogicalTest


@parameter int
AS
BEGIN
IF @parameter = 1
BEGIN
EXEC proc1;
END
ELSE
BEGIN
EXEC proc2
END
END;

We haven’t covered logical flow, and the exam doesn’t cover it either, but this example is included anyway
because it begins to show the power of stored procedures in combination with parameters. The example is
probably self-explanatory: stored procedure “procLogicalTest” will call either “proc1” or
“proc2”, based on the value of “@parameter”.

This logical flow can also be used for input validation, and correcting minor errors. Let’s say you
want to use a stored procedure to insert the name of a new customer. Before inserting the name, you
can check whether the name has been entered with any leading or trailing spaces, and remove those
automatically (using the functions LTRIM and RTRIM).
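
A minimal sketch of such a stored procedure (the procedure name is made up, and we assume a dbo.Customers table with FirstName, LastName and Address columns, like the one we’ll create later in this chapter):

CREATE OR ALTER PROC procInsertCustomer
    @FirstName varchar(100)
   ,@LastName varchar(100)
   ,@Address varchar(100)
AS
BEGIN
    SET NOCOUNT ON;
    -- remove leading and trailing spaces before inserting
    INSERT dbo.Customers (FirstName, LastName, [Address])
    VALUES (LTRIM(RTRIM(@FirstName)), LTRIM(RTRIM(@LastName)), LTRIM(RTRIM(@Address)));
END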

There are two additional options we’d like to discuss: encryption and recompile. The option WITH
ENCRYPTION does the same thing as when used with a view: it will hide the definition of the stored
procedure. In addition to the disadvantage we already mentioned when discussing views (that this
security is not airtight), when using encryption for stored procs there is the added disadvantage that
query plans will not be visible; this is a serious problem when performance tuning.
The other option we’d like to discuss is the option WITH RECOMPILE. As stated above, one of the
benefits of using stored procedures is that SQL can re-use the execution plan. Calculating an
execution plan is CPU intensive, so in most situations, it is better when you can re-use existing
execution plans. However, there is also an exception to this rule you need to be aware of.
Sometimes, an execution plan may be suitable for one set of input parameters, but completely
unsuitable for a different set of input parameters. If you have several sets of input parameters that
require different execution plans for the same stored procedure, and performance suffers drastically if
the wrong plan is used, you can use the option WITH RECOMPILE to force SQL Server to calculate
a new execution plan each time the stored procedure is executed.
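
For example (a hypothetical procedure; apart from the WITH RECOMPILE option, this is a normal stored procedure):

CREATE OR ALTER PROC procGetEmployeesByDepartment
    @Department varchar(100)
WITH RECOMPILE
AS
BEGIN
    SET NOCOUNT ON;
    -- a new execution plan is calculated on every execution
    SELECT FirstName, LastName, Salary
    FROM dbo.Employees
    WHERE Department = @Department;
END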

That’s it. We’ve shown that a stored procedure is simply a collection of T-SQL statements, used with
or without parameters. Before we move on to functions, some last remarks about stored procedures:

* Do not use the prefix sp_ for stored procedures. This prefix is reserved for SQL Server system
stored procedures. In our examples we’ve used the prefix “proc” as naming convention. Having a
consistent naming convention helps to make code more readable.
* You can’t use “GO” in a stored procedure. This is a batch separator, and all statements inside a
stored procedure are executed in a single batch. Neither can you use the “USE database” clause to
switch database. If you do need to access another database, you should reference the table as
[database].[schema].[table] (with or without the square brackets).

We’ll revisit stored procedures when talking about error handling. Now, let’s move on to functions.

Identify whether to use scalar-valued or table-valued functions


Let’s revisit functions. In chapter 1, we’ve already discussed a lot of system functions; here, we are
going to make functions ourselves.
A function is a SQL statement that accepts an input parameter, performs an action using this parameter
and returns a value (either a single value or a result set). Based on this description, you might wonder
what the difference is between a function and a stored procedure. There are several differences, in
usage, capabilities and performance. For now, the most important ones are:
* A function must return a value; a stored procedure may return a value, or even more than one.
* A function cannot be used to perform actions to change the database state (that is: you can’t perform
DML such as inserts, updates or deletes to change data).
* Stored procedures can be executed on their own (as we’ve seen) using “EXECUTE”, while
functions are executed as part of a SQL statement.

As stated, a function cannot change the state of the database in any way. Changing data is not
allowed, but even more subtle changes are prohibited as well. Let’s use RAND as an example. As
stated in chapter 1, the RAND function will return a pseudo random number: it might appear random,
but in fact SQL Server just returns the next random number in a list with random numbers. This means
that, whenever you call the RAND function, you change where SQL was in that list of random
numbers; so calling RAND changes the state of the server, and therefore, you are not allowed to use
the function RAND inside another function. This will result in an error, as you can see here:

CREATE FUNCTION dbo.fn_rand ()


RETURNS INT
AS
BEGIN
RETURN RAND()
END

Msg 443, Level 16, State 1, Procedure fn_rand, Line 6 [Batch Start Line 0]
Invalid use of a side-effecting operator 'rand' within a function.

This syntax does, however, show the basic create statement for a function: a function takes zero or
more input parameters, and returns either a scalar value or a table.

Some remarks on functions:


* Like we’ve seen for system functions, a function that has no parameters still requires the
parentheses, both when declaring and calling the function;
* the maximum number of parameters for a function is 2100, a limit you’re not very likely to run into;
* as with views and stored procedures, functions can be declared using the SCHEMABINDING and
ENCRYPTION options;
* if you want to call a function with a default value for an input parameter, you need to use the
keyword DEFAULT (this behavior differs from a stored procedure, where you can simply omit the
parameter if a default has been declared), as shown in the sketch after this list;
* calling a function requires that you use the two-part name (schema name + object name), otherwise
you’ll receive a syntax error.
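
To illustrate the DEFAULT keyword, here is a minimal sketch (the function dbo.fn_add and its default value are made up for this example):

CREATE OR ALTER FUNCTION dbo.fn_add (@a int, @b int = 1)
RETURNS INT
AS
BEGIN
    RETURN @a + @b
END
GO
-- unlike with a stored procedure, you cannot simply omit @b; pass the keyword DEFAULT instead
SELECT dbo.fn_add(5, DEFAULT)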

We’ll show you some functions to return a scalar value, and some that return a table (one inline table
function, one multi-statement function).

Let’s start with a function that will return a scalar value. The following function will take a number as
input parameter, and return that number multiplied by two:

CREATE OR ALTER FUNCTION dbo.fn_doubler (@a int)


RETURNS INT
AS
BEGIN
RETURN @a * 2
END

And this is how you would use the function, with a constant as input:

SELECT dbo.fn_doubler (18)

But you can also use a variable or a column as input. And you can use a scalar function anywhere a
scalar expression is used, for example in a check constraint in a table declaration, or in a select
statement. The following would return double the salary for every employee by calling the function in
the select statement using the column salary as input:

SELECT FirstName
, LastName
, dbo.fn_doubler(salary) as TwiceTheSalary
FROM dbo.Employees

Please note, though, that in this last example, the function is executed once for every row in the result
set. Here, we’re doing a straightforward calculation, so that doesn’t matter. But in cases where the
function returns data from a table, performance might suffer; in that case, you might want to consider
using a join instead of a function.
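
To illustrate the difference (the function dbo.fn_GetManagerName is made up for this example; the join version usually performs better, because the scalar function is executed once for every row):

CREATE OR ALTER FUNCTION dbo.fn_GetManagerName (@ManagerID tinyint)
RETURNS varchar(151)
AS
BEGIN
    RETURN ( SELECT FirstName + ' ' + LastName
             FROM dbo.Employees
             WHERE EmployeeID = @ManagerID )
END
GO
-- the function is executed once for every employee:
SELECT FirstName, LastName, dbo.fn_GetManagerName(ManagerID) AS ManagerName
FROM dbo.Employees
-- an equivalent join, which SQL Server can usually optimize better:
SELECT e.FirstName, e.LastName, m.FirstName + ' ' + m.LastName AS ManagerName
FROM dbo.Employees e
LEFT OUTER JOIN dbo.Employees m ON e.ManagerID = m.EmployeeID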

The second type of functions we’d like to demonstrate is the inline table valued function. We’ve
already encountered one in chapter 2, when we needed one to demonstrate the APPLY operator:

CREATE FUNCTION dbo.fn_GetListOfEmployees(@Department AS varchar(100))


RETURNS TABLE
AS
RETURN
(
SELECT FirstName + ' ' + LastName AS FullName, Salary
FROM Employees e
WHERE e.Department = @Department
)

The difference between a scalar valued function and a table valued function is that the table valued
function returns a table. And of course, calling this function is also different. You can use this function
where you’d normally use a table:

SELECT * FROM dbo.fn_GetListOfEmployees('Sales')

This particular table valued function is an inline table valued function, because it only has a single
select statement. As such, it does not require a definition of the table that will be returned. The
alternative table valued function is a multi-statement table valued function, which is used where you
require more than a single statement to achieve the required result set; a multi-statement table valued
function does require a definition of the table that will be returned.
So an inline table valued function looks like this:

RETURNS TABLE
AS
RETURN
(
SELECT statement...
)

Whereas a multi-statement table valued function looks like this:

CREATE OR ALTER FUNCTION dbo.fn_name (...)


RETURNS @return_table TABLE (...)
AS
BEGIN
--some statements

--a statement to fill the table variable


INSERT @return_table
SELECT ...
--return the table variable to the calling statement
RETURN
END

So a complete statement would look something like this:

CREATE OR ALTER FUNCTION dbo.ManagersAllTheWayToTheTop (@employeeid tinyint)


RETURNS @tbl_managers TABLE
(
Name varchar(151)
, Manager varchar(151)
,Level tinyint
)
BEGIN
WITH Employee_CTE AS (
SELECT EmployeeID, ManagerID, 0 as Level
FROM Employees
WHERE EmployeeID = @employeeid
UNION ALL
SELECT e1.EmployeeID, e1.ManagerID, e2.level + 1
FROM Employees e1
INNER JOIN Employee_CTE e2 ON e1.EmployeeID = e2.ManagerID
)
,
Distinct_CTE AS (
SELECT DISTINCT EmployeeID, ManagerID, Level
FROM Employee_CTE)
INSERT INTO @tbl_managers
SELECT e2.FirstName + ' ' + e2.LastName AS Name
, e3.FirstName + ' ' + e3.LastName AS Manager, Level
FROM Distinct_CTE e1
INNER JOIN Employees e2 ON e1.employeeID = e2.EmployeeID
LEFT OUTER JOIN Employees e3 ON e1.ManagerID = e3.EmployeeID
ORDER BY Level, Name
RETURN
END

GO

SELECT * FROM dbo.managersallthewaytothetop(1)

DROP FUNCTION IF EXISTS dbo.ManagersAllTheWayToTheTop

This is basically the recursive CTE we used in chapter 2 to return a list of every employee and his or
her managers; in this function, we’ve added the parameter so we can return this list of managers for a
given employee.

We’ve now demonstrated how to create, use and drop user defined functions. We’ve demonstrated
three different types of functions: scalar valued functions, inline table valued functions and multi-
statement table valued functions. And we’ve mentioned some of the requirements of functions. There is one
more exam objective on functions, which we need to cover:

Distinguish between deterministic and non-deterministic functions


As stated in chapter 1, when talking about system functions: the difference between a deterministic
and a non-deterministic function is pretty easy. A deterministic function always returns the same result
(given the same input parameters and the same state of the database), while a non-deterministic
function does not necessarily produce the same result, even when called with the same input
parameters and the same state of the database.
Something similar applies to user defined functions. A user defined function is also either
deterministic or not. You cannot instruct SQL Server whether your function is deterministic;
SQL Server will determine this by following some rules. One of those rules is that the function cannot
call other non-deterministic functions; another, less obvious one, is that the function must be declared
with schema binding. Whether or not a function is deterministic is determined when you create the
function. If you change any of the underlying objects, a function might change from being deterministic
to being non-deterministic. So the only way to guarantee that a function will remain deterministic is to
create the function with schema binding.

Let’s demonstrate this. Our simple doubler function will always result in the same output. Even so,
SQL Server will consider this function non-deterministic. We can verify this by checking for the
extended property of the object:

CREATE OR ALTER FUNCTION dbo.fn_doubler (@a int)


RETURNS INT
AS
BEGIN
RETURN @a * 2
END
GO

SELECT OBJECTPROPERTYEX(OBJECT_ID('dbo.fn_doubler'), 'IsDeterministic')

The system function OBJECTPROPERTYEX needs two input parameters: an object id, and the
property we want to know. It will return 1 (true) if the function is deterministic, 0 (false) if it is not.
With the current definition, SQL will return zero: non-deterministic. Simply by adding the schema
binding option, this will change:

CREATE OR ALTER FUNCTION dbo.fn_doubler (@a int)


RETURNS INT
WITH SCHEMABINDING
AS
BEGIN
RETURN @a * 2
END
GO

SELECT OBJECTPROPERTYEX(OBJECT_ID('dbo.fn_doubler'), 'IsDeterministic')


Now, SQL will return a 1: deterministic.

Why is it so important to know if a function is deterministic or not, and why do we need to come back
to this topic? Simply this: we’ve now covered indexed views, and indexed views require all columns
to be deterministic.

That was all the information you need to know for the exam on stored procedures, views and
functions. You really need to know when to use one instead of the other; we’ll test you on that in the
questions following this chapter. Besides knowing all this information, you also need to be
comfortable actually applying this information in complex real life scenarios. Unfortunately, we can
only give you the pieces of the puzzle; there is no substitute for real life experience when it comes to
actually creating a whole image out of these puzzle pieces.
Implement error handling and transactions
In programming, error handling is important. Servers and networks will crash, and users will enter
input in ways you never intended. As a database developer, you should attempt to write code that is
as robust as possible. It should prevent errors during data processing from resulting in errors in the
actual data, allow the procedure to exit gracefully, and provide enough feedback to
determine exactly what went wrong.
In this set of exam objectives on error handling, you will learn:
* what a transaction is;
* that there are three types of transactions (auto-commit, implicit and explicit);
* that you can have transactions inside transactions (nested transactions);
* what constitutes an error;
* how to roll back changes in case of an error;
* how to purposely save some part of the transaction, but not the rest, using savepoints;
* how to perform additional actions before exiting, using TRY CATCH blocks;
* and how to return an error to the application in a controlled manner, using either THROW or
RAISERROR.

What is a transaction
Let’s start with the transaction. A transaction is a unit of work that either succeeds as a unit, or fails as
a unit; at the end of the transaction, either all changes of the transaction have been performed, or none
of the changes (leaving no trace behind). An example of a transaction would be transferring money
from one bank account to another. If everything goes well, the amount should be subtracted from the
first account, and added to the second account; if something goes wrong, no effect of either of these
two actions should remain. We’ll use customer sales orders for this set of exam objectives; in this
scenario, a transaction should either add a sales order including all sales order lines, or no sales
order at all.

In particular, a database transaction should have the following four properties: Atomicity,
Consistency, Isolation and Durability. Collectively, these four properties are known by the acronym
ACID.
Atomicity is a property we already described: a transaction is treated as a single unit, and should
either succeed as a whole, or fail as a whole. This should be true whether the transaction is made up
of a single statement or multiple statements.
Consistency means that a transaction can only result in a system that is consistent, that is, has a
correct state. After the transaction, all the defined rules of integrity should still apply. For instance,
neither a successfully committed transaction nor a failed transaction can result in a situation where a
primary key constraint is no longer valid, or a column contains a value that does not fit the data type.
This does not mean that a database has to be consistent from a logical point of view; if you haven’t
added the correct foreign key constraint, a sales order line without an accompanying sales order is
still considered consistent from a SQL Server point of view.
Isolation refers to the effect that concurrently running transactions have on each other. This property
is not as easily described as the other properties. Simply put, isolation means that incomplete parts of
a transaction should not be visible to other transactions (and vice versa), and the effect of two
transactions should be the same, regardless of whether the transactions are executed at the same time,
or sequentially.
However, complete isolation comes at a price: completely locking all parts of a database that can be
touched by a transaction, in order to ensure that other transactions will have no effect, can potentially
have an enormous impact on the performance of these other transactions. Therefore, SQL Server
supports various levels of isolation, so you can pick a suitable level, with the right balance of
performance on the one hand, and safety on the other hand. We’ll cover one such example later on.
However, a more complete explanation of isolation levels, concurrency and the possible data
conflicts is beyond the scope for this book. For more information on this topic, see:
https://docs.microsoft.com/en-us/sql/connect/jdbc/understanding-isolation-levels?view=sql-server-
2016

The last ACID property is durability. This simply means that, once a transaction has been committed,
it will persist in the database even in the case of a system crash. In SQL Server, this is guaranteed by
writing the transaction to the transaction log file.

Three types of transactions


Before we discuss the different types of transactions, let’s set up the tables we need for the code
we’ll use. We’re going to make three tables: a customer table, an order table and an order line table.

USE TestDB
GO

DROP TABLE IF EXISTS dbo.OrderLines


DROP TABLE IF EXISTS dbo.Orders
DROP TABLE IF EXISTS dbo.Customers
GO

CREATE TABLE dbo.Customers(


CustomerID int IDENTITY(1,1) NOT NULL,
FirstName varchar(100) NOT NULL,
LastName varchar(100) NOT NULL,
[Address] varchar(100) NOT NULL,
PRIMARY KEY CLUSTERED (CustomerID)
)

CREATE TABLE dbo.Orders(


OrderID int IDENTITY(1,1) NOT NULL,
CustomerID int NOT NULL,
OrderDate datetime NOT NULL,
SalesAmount money NOT NULL,
PRIMARY KEY CLUSTERED (OrderID )
)

CREATE TABLE dbo.OrderLines(


OrderID int NOT NULL,
OrderLineID int IDENTITY(1,1) NOT NULL,
Item varchar(100) NOT NULL,
Amount int NOT NULL,
LineAmount money NOT NULL,
PRIMARY KEY CLUSTERED (OrderID, OrderLineID )
)

GO

ALTER TABLE dbo.Orders ADD CONSTRAINT FK_customerID


FOREIGN KEY (CustomerID) REFERENCES dbo.Customers(CustomerID)

ALTER TABLE dbo.OrderLines ADD CONSTRAINT FK_OrderID


FOREIGN KEY (OrderID) REFERENCES dbo.Orders(OrderID)

In SQL Server, there are three types of transactions: auto-commit, implicit and explicit transactions.
The difference is whether they are started and/or stopped automatically or by code. By default, every
statement is a transaction that is automatically started and automatically committed. So the following
statement will either succeed as a whole, or fail as a whole.

INSERT dbo.Customers VALUES ('Bob', 'Jackson', 'Main street 1, Dallas')

And if the statement succeeds, it will automatically be committed. The same thing would apply to an
update or delete statement that affects more than one record: either all records are affected, or none.
Should the statement fail somewhere before the end, all records that have already been modified will
be rolled back.

The second type of transaction is an implicit transaction. The connection setting


IMPLICIT_TRANSACTIONS is off by default; if you turn this setting on, SQL Server will
automatically start a transaction for you for certain actions, among which are:
* DML actions INSERT, UPDATE, DELETE, TRUNCATE and some SELECT statements;
* DDL actions CREATE, ALTER and DROP;
* DCL actions GRANT, DENY and REVOKE;
* Cursor operations OPEN and FETCH.

The most important thing to know about implicit transactions, is that while SQL may implicitly start a
transaction, you explicitly have to finish it (by either committing or rolling back the transaction).

We’ll use the following example to illustrate three points (on transaction isolation, detecting open
transactions and on using implicit transactions):

SET IMPLICIT_TRANSACTIONS ON

INSERT dbo.Customers VALUES ('Frank', 'Smith', 'Second street 2, Miami')

The immediate result will be:

(1 row(s) affected)

You can verify that the record has been inserted by using the following query in the same query
window:
SELECT *
FROM Customers

This will return two records. However, the transaction is not completed. If you run this select
statement in a different query window, the query will not return a result; instead, it will wait until the
implicit transaction is explicitly finished.
This demonstrates the ACID property Isolation: the record that has been inserted in the open
transaction is not visible to another transaction. This is true under the default isolation level. The first
point we’d like to illustrate is the effect of the isolation level of a transaction. As mentioned before,
you can choose to use another isolation level than the default. One way of doing this is the following:

SELECT *
FROM Customers WITH (NOLOCK)

The default isolation level is called READ COMMITTED; with the NOLOCK hint, we’ve now read
an uncommitted record. We specifically mention this NOLOCK hint, because it is often used as a
performance enhancing technique without a proper understanding of the consequence: you’ve now
read a record that may never become valid, because it might be updated by the transaction before it is
committed, or it might be rolled back. Another possible effect of using the NOLOCK hint is that the
statement might not return all records. To properly understand this, you have to take into consideration
that every database action takes time, and numerous sub actions have to be performed during that
time.
To use a somewhat simplified example: suppose your query that uses the NOLOCK hint is reading an
entire table, from beginning to end. At the same time, another transaction is updating a record
somewhere in the middle of the table, and for some reason, the updated record ends up in a different
place in the table. Depending on the timing and the actual change, you might not read the record (if it
is moved to the beginning of the table before you reached the middle of the table) or you might read
the record twice (if it is moved to the end of the table after you’ve already read the record when it
was still in the middle of the table).
This is the result of the NOLOCK hint, and would not have happened using the default isolation level.
There are certainly scenarios where it would be perfectly acceptable to trade speed for accuracy by
using the NOLOCK hint, but there are also scenarios where this is not acceptable; the point is that you
should know what the impact is before changing the isolation level of a transaction or an entire
database.
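
As an aside: instead of a table hint on a single table, you can also change the isolation level for the whole session; for this SELECT, the following has the same effect as the NOLOCK hint shown above:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT *
FROM Customers;
-- switch back to the default isolation level afterwards
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;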

The second point we’d like to illustrate is how to check if a session has a transaction open. We’ll
give three commands for you to use. The first two can be used from any session, and are therefore
useful for troubleshooting purposes (i.e., to find out who started an implicit transaction just before
leaving for lunch). The first method is:

DBCC OPENTRAN();

This will return information on the oldest open transaction in a database. In our case, this returned the
following information:
Transaction information for database 'TestDB'.

Oldest active transaction:


SPID (server process ID): 52
UID (user ID) : -1
Name : implicit_transaction
LSN : (106:111448:2)
Start time : Sep 30 2018 1:20:12:263PM
SID : 0x010500000000000515000000c89c3d0ec56c09729aa241caea030000
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

The second method is to query a Dynamic Management View:

SELECT session_id, open_transaction_count


FROM sys.dm_exec_sessions
WHERE session_id > 50;

This will return all user sessions (session_id higher than 50) and the number of open transactions for
each session.

The third method is only available within the scope of the transaction itself:
SELECT @@TRANCOUNT

This will return the level of open transactions for a given session. If you execute this in the window
that has the open transaction, this will return 1; in the other window, this will return zero. This
function returns an integer, not a Boolean; when talking about nested transactions, we’ll see that this
number will be higher for nested transactions.

The third point we’d like to make is that implicit transactions are not a best practice, because the
transaction stays open until you explicitly either commit it or roll it back. Sooner or later, somebody
really will start an implicit transaction just before leaving for lunch, effectively blocking parts
of the database for everyone else.
Let’s commit the transaction:

COMMIT TRANSACTION

And don’t forget to turn implicit transactions off again:

SET IMPLICIT_TRANSACTIONS OFF

We’ve now discussed auto-committed transactions and implicit transactions. The third type is explicit
transactions. These are only useful for multi-statement transactions, but we’ll start with just a single
statement. An explicit transaction is a transaction that is started and ended using code:

BEGIN TRANSACTION
INSERT dbo.Customers VALUES ('Joe', 'Johnson', 'Third Avenue 3, New York')
COMMIT TRANSACTION
As shorthand for BEGIN TRANSACTION, you can use BEGIN TRAN; as shorthand for COMMIT
TRANSACTION, you can use COMMIT TRAN, or even COMMIT.

There are two options we’d like to discuss for the BEGIN TRAN statement: a name and the mark.
You can give a transaction a name, but this has little effect (as we’ll see later) other than that it allows
the second option, which is the option to mark a transaction:

BEGIN TRAN MyTran WITH MARK 'StartOfTransaction'

This will create a marker in the transaction log that allows you to restore a transaction log backup to
this point. The restore option STOPATMARK will restore the log including this transaction; the
option STOPBEFOREMARK will restore the log up until the start point of this transaction.
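
A sketch of what such a restore might look like (the backup file names here are made up; note that the mark name used by STOPATMARK is the transaction name, MyTran, not the description):

RESTORE DATABASE TestDB FROM DISK = 'C:\Backup\TestDB_full.bak' WITH NORECOVERY;
RESTORE LOG TestDB FROM DISK = 'C:\Backup\TestDB_log.trn' WITH STOPATMARK = 'MyTran';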

The alternative to committing the transaction, is rolling back the transaction. For this, you use the
following command:

ROLLBACK

Nesting transactions
As stated, naming transactions has little effect, besides the fact that this is required for marking
transactions in the transaction log. You might expect that naming transactions gives you the option to
determine which transaction is rolled back. This is not the case. Only the outer transaction can be
rolled back. We’ll demonstrate this by nesting transactions, i.e. starting a transaction within a
transaction. First, we’ll use unnamed transactions; then, we’ll demonstrate the same effect with named
transactions. To determine that we actually have nested transactions, we’ll use the system function
@@TRANCOUNT, which we saw earlier.
First, we’ll insert an order to work with:

INSERT Orders (CustomerID, OrderDate, SalesAmount)


VALUES (1, GETDATE(), 10.00)

Three times in a row, we’ll start a transaction, query the transaction level and increment the sales
amount for our only order (#1). Then, also three times in a row, we’ll commit a transaction and query the
transaction level.

UPDATE Orders SET SalesAmount = 10.00

BEGIN TRAN
SELECT @@Trancount AS 'Transaction level'
UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1
BEGIN TRAN
SELECT @@Trancount AS 'Transaction level'
UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1
BEGIN TRAN
SELECT @@Trancount AS 'Transaction level'
UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1
COMMIT TRAN
SELECT @@Trancount AS 'Transaction level'
COMMIT TRAN
SELECT @@Trancount AS 'Transaction level'
COMMIT TRAN

SELECT *
FROM Orders
WHERE orderid = 1

The result:

If we change the last COMMIT to a ROLLBACK, all updates will be rolled back; this is to be
expected. This, however, is also the case if we change one of the other commits into a rollback
instead. To illustrate let’s change the second COMMIT to a ROLLBACK:

UPDATE Orders SET SalesAmount = 10.00

BEGIN TRAN
SELECT @@Trancount AS 'Transaction level'
UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1
BEGIN TRAN
SELECT @@Trancount AS 'Transaction level'
UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1
BEGIN TRAN
SELECT @@Trancount AS 'Transaction level'
UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1
COMMIT TRAN
SELECT @@Trancount AS 'Transaction level'
ROLLBACK TRAN --Only this line has changed
SELECT @@Trancount AS 'Transaction level'
COMMIT TRAN

SELECT *
FROM Orders
WHERE orderid = 1

The effect: the complete transaction will be rolled back immediately. SalesAmount will be changed
back to 10. The remaining @@TRANCOUNT will return zero, and because no transactions are open
any longer, the remaining COMMIT will generate an error:

Msg 3902, Level 16, State 1, Line 16


The COMMIT TRANSACTION request has no corresponding BEGIN TRANSACTION.

Let’s give each transaction its own name, and rerun this example.

UPDATE Orders SET SalesAmount = 10.00

BEGIN TRAN MyFirstTransaction


SELECT @@Trancount AS 'Transaction level'
UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1
BEGIN TRAN MySecondTransaction
SELECT @@Trancount AS 'Transaction level'
UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1
BEGIN TRAN MyThirdTransaction
SELECT @@Trancount AS 'Transaction level'
UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1
COMMIT TRAN MyThirdTransaction
SELECT @@Trancount AS 'Transaction level'
ROLLBACK TRAN MySecondTransaction
SELECT @@Trancount AS 'Transaction level'
COMMIT TRAN MyFirstTransaction

SELECT *
FROM Orders
WHERE orderid = 1

The result? Still an error, but a different one this time:

Msg 6401, Level 16, State 1, Line 20


Cannot roll back MySecondTransaction. No transaction or savepoint of that name was found.

And because this statement didn’t work, we’ve now opened three transactions, and closed only two,
so we need to close the open transaction manually:

ROLLBACK TRAN

So to reiterate: you can nest transactions, and you can give each transaction its own name; and though
you can commit transactions by name, the only transaction that really matters is the outer one. It is not
until you commit this outer transaction that the changes are actually committed. And you can’t roll
back an inner transaction, not even by name; you can only roll back the outer transaction. The error
message we got when trying to roll back one of the inner transactions did hint at an alternative,
though: savepoints.
But before we move on to savepoints, there is something we need to explain about nesting
transactions. Nesting transactions is usually the result of calling stored procedures within a
transaction, not starting multiple transactions in the same script like we did here. So stored procedure
#1 starts a transaction, and within that transaction, calls stored procedure #2, which also starts a
transaction. Without all the required error handling, the outer proc would look something like this:

CREATE PROC uspOuterProc


AS
BEGIN
BEGIN TRAN
EXEC uspInnerProc
COMMIT TRAN
END

To make this code production ready, you’d obviously need to set XACT_ABORT on, add logic to
either commit or rollback, etc.; this just shows the basic coding pattern. We’ll revisit the effect of one
stored procedure calling another later on.
However, we still need to explain a few more pieces of the puzzle before putting it all together. So
let’s move on to savepoints.

Savepoints
A savepoint within a transaction is like a save point in a computer game: you can roll back to this point
without losing all the progress you’ve made up to it. This is particularly useful when performing long-running
operations that have already taken a lot of time and would also take a lot of time to roll back.
You create a savepoint by using the syntax “SAVE TRANSACTION <savepoint_name>”, and if
necessary, you roll back to the savepoint using the syntax “ROLLBACK TRANSACTION
<savepoint_name>”. We won’t use a long-running operation to demonstrate this, though. The
following example should suffice:

UPDATE Orders SET SalesAmount = 10.00

BEGIN TRAN
UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1

UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1

SAVE TRANSACTION MyFirstSavepoint

UPDATE Orders
SET SalesAmount = SalesAmount + 1
WHERE orderid = 1

ROLLBACK TRANSACTION MyFirstSavepoint

COMMIT TRAN

SELECT *
FROM Orders
WHERE orderid = 1

The result is that SalesAmount has been set to 12. The third update is undone by the ROLLBACK to the
savepoint; the first two updates are then committed by the COMMIT statement at the end.

Three notes on savepoints:


* issuing a savepoint has no effect on the level of transaction nesting, and neither does rolling back to
a savepoint (which is why we’ve omitted the @@TRANCOUNT queries in the example above);
* if the entire transaction is rolled back, all changes up to the savepoint will be rolled back as well;
* you can have multiple savepoints with the same name, but if you roll back to a name you’ve used
more than once, the transaction will be rolled back to the last savepoint with that name (see the sketch
after this list).
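
A minimal sketch of that last point, reusing the Orders table from the earlier examples: two savepoints share the same name, and the rollback returns to the most recent one, so only the third update is undone.

UPDATE Orders SET SalesAmount = 10.00

BEGIN TRAN
UPDATE Orders SET SalesAmount = SalesAmount + 1 WHERE orderid = 1
SAVE TRANSACTION MySavepoint        -- first savepoint with this name
UPDATE Orders SET SalesAmount = SalesAmount + 1 WHERE orderid = 1
SAVE TRANSACTION MySavepoint        -- second savepoint, same name
UPDATE Orders SET SalesAmount = SalesAmount + 1 WHERE orderid = 1
ROLLBACK TRANSACTION MySavepoint    -- rolls back to the last savepoint with this name
COMMIT TRAN

SELECT SalesAmount FROM Orders WHERE orderid = 1   -- 12: only the third update was undone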

So we’ve now explained what a transaction is, how to nest transactions, and how to use savepoints
within transactions so you can partially roll back a transaction. This brings us to the next question:
what, exactly, is rolled back, and more importantly: what is not?
Determine results of Data Definition Language (DDL) statements based on transaction control
statements
In general, all changes are rolled back when you issue the ROLLBACK command: Data Manipulation
Language statements such as INSERT, UPDATE and DELETE; Data Definition Language statements
such as CREATE, DROP and ALTER; and Data Control Language statements such as GRANT, DENY and
REVOKE. There are, however, some noteworthy exceptions. These are:
* operating system tasks;
* variable assignments;
* assigning identity values.

As Data Control Language statements are outside of the scope of this book, we won’t demonstrate the
effect of rolling back these statements, but you can easily test them for yourself. We will demonstrate
some of the DML and DDL statements, and how they will be undone by rolling back the transaction.
Before we do that, however, we’ll discuss the exceptions.
First, the operating system tasks. These are non-transactional. For example, if your system allows you
to perform xp_cmdshell commands, and you have a D-drive you can write to, you can create a
directory on the server using the following command:

EXEC xp_cmdshell 'mkdir d:\Test_directory'

If this doesn’t work, don’t worry; just take my word for it. The point is that rolling back this action
will not work. So the following code will create the directory, without undoing the action:

BEGIN TRAN
EXEC xp_cmdshell 'mkdir d:\Test_directory'
ROLLBACK

In this case, the initial command succeeded, but rolling back failed. In other cases, the command itself
will already fail, such as this one:

BEGIN TRAN
CREATE DATABASE TooBad
ROLLBACK

Msg 226, Level 16, State 5, Line 10


CREATE DATABASE statement not allowed within multi-statement transaction.

Creating a database is not merely an operating system task, but it does include operating system tasks (namely,
creating at least two files). Therefore, SQL Server could only partially roll back this statement, and does not
allow it inside an (explicit) transaction.

The second exception is the assignment of variables. This will not be undone by rolling back the
transaction. We’ll demonstrate this later on.

The third exception we’d like to point out is the incrementing of identity values. If you create a record
in a transaction and an identity value is assigned to a column, two changes are made: the record is
created and the identity value is incremented. Rolling back the transaction will undo the first change
(inserting the record), but not the second one (incrementing the identity value). We will use the
function IDENT_CURRENT to demonstrate this. This system function takes, as input, the name of a
table, and will return the current identity value.

So, to demonstrate, we’ll perform the following changes within a transaction, and see if they will be
rolled back:
* assign a value to a variable;
* create a table;
* add a column to a table;
* insert a record into a table;
* as a side-effect of inserting this record, increment an identity value.

This is our test code:


SET NOCOUNT ON
DECLARE @MyToughVariable INT
SET @MyToughVariable = 1
SELECT IDENT_CURRENT('Customers') AS 'Current identity'

BEGIN TRAN
SET @MyToughVariable = 2
SELECT @MyToughVariable AS 'Variable'
CREATE TABLE MysteryTable (MysterID INT IDENTITY, Mystery VARCHAR(100))
INSERT MysteryTable VALUES ('Who killed Kennedy?')
SELECT *
FROM MysteryTable
ALTER TABLE dbo.Customers ADD DateOfBirth DATETIME NULL
INSERT dbo.Customers
VALUES ('Serena', 'Baker', 'Broadway, New York', GETDATE())
SELECT * FROM Customers
ROLLBACK TRAN

SELECT @MyToughVariable AS 'Variable'

SELECT IDENT_CURRENT('Customers') AS 'Current identity'

SELECT *
FROM dbo.Customers

SELECT *
FROM MysteryTable

See if you can determine the output of the four statements after the rollback, based on the explanation
we’ve provided.

This is the result of our queries:


The first result is the current identity before the transaction. We’ve inserted three rows, so it is no
surprise that the current identity value is 3.
After this first SELECT statement, we start the transaction. The second result is the updated value for
the variable inside the transaction.
The third result is the one record we inserted into our Mystery table.
The fourth result is the Customer table, to demonstrate that we’ve successfully added the column and
inserted a record. Remember that these two changes are visible within the same transaction, but not
yet visible to other transactions (due to transaction isolation).
The fifth result is the first after the rollback. It demonstrates that the variable still has the value we
assigned to it within the transaction, so this change was not rolled back.
The sixth result is the most surprising. Inserting the record into the Customers table caused the identity
value to be incremented, and this effect persisted after the transaction was rolled back. However, the
value was not increased to 4, but to 1003. The actual value is not important; the jump is caused
by a performance optimization in the process of generating identity values. Had we skipped
adding the column to the Customers table, this identity value would have been incremented to 4, not
1003. What is important is that the incrementing of the identity value is not rolled back to 3 (if it
were, every other insert statement on this table would have to wait for this transaction to finish).
The seventh result demonstrates that adding the column DateOfBirth, as well as the INSERT
statement of a new customer, has been rolled back.
And the last one, to prove that the creating of the table has been rolled back:

Msg 208, Level 16, State 1, Line 27


Invalid object name 'MysteryTable'.

We’ve now demonstrated that some actions are transactional, and are therefore rolled back when you
issue a ROLLBACK statement, and that some actions aren’t. For the exam, it is important to memorize
this list of events that are transactional: Data Manipulation Language statements such as INSERT,
UPDATE and DELETE; Data Definition Language statements such as CREATE, DROP and ALTER;
and Data Control Language statements such as GRANT, DENY and REVOKE. And the list of tasks that are not
transactional:
* operating system tasks;
* variable assignments;
* assigning identity values.

You’ve now seen how to roll back an open transaction, and what happens to the different types of
statement when you do. The next question is: when does a transaction roll back?
In all of our examples up to this point, a rollback happened when we decided to do so. In real code,
however, you want to roll back whenever an error occurs. Either you as a developer decide that an
error has occurred, or SQL Server decides that an error has occurred, and the transaction rolls back.
Things, however, are not as simple as that. To clarify this, we first need to explain two concepts:
error levels and the XACT_ABORT setting. Both of these concepts influence whether SQL Server
will decide to roll back your transaction. At first, we’ll just explain these two concepts in theory;
later on, we’ll see them in action (after we’ve also explained THROW and RAISERROR, because
we’ll use those two statements in our examples).

Error levels
Every error in SQL Server has an error level. This error level is a number that indicates the severity
of the error, with zero being the least severe error, and 25 the most severe error. We won’t describe
every level here; you can read the complete list of errors on Microsoft Docs, if you like. For the
current discussion on error handling, we can group the error levels into 4 categories:
* 0-10. These are informational.
* 11-16. These are considered to be errors that the user can correct.
* 17-19. These are considered to be software errors that the user can’t correct.
* 20-25. These are considered to be system errors that the user can’t correct. The connection is
automatically closed, and the transaction is rolled back.

As you can see from this list, error levels 20-25 are the only ones that are, without considering other
factors, severe enough to cause SQL Server to roll back the transaction.

XACT_ABORT setting
By default, SQL Server will not roll back an entire transaction if an error occurs with an error level
of 0-19. If a single SQL statement fails, it will simply roll back that one statement, and continue
with the transaction. This default behavior can be changed by turning the XACT_ABORT setting on:
SET XACT_ABORT ON

With XACT_ABORT on, whenever SQL encounters a run-time error, the entire transaction will be
rolled back.
I’ll give you two examples where different XACT_ABORT settings cause different results: a foreign
key violation, and an application time-out. Later on, we’ll see another example where different
XACT_ABORT settings cause different results: the THROW statement.

First, the application timeout. The test scenario is this: in one window, with XACT_ABORT either on
or off, we’ll create a table and insert a record within a transaction. However, before the transaction is
committed, we’ll simulate an application time-out. In another query window, we’ll test if we can read
from the table we just created.

Before the test, make sure the table doesn’t exist:

DROP TABLE IF EXISTS MysteryTable;

We’ll simulate an application time-out by opening a new query window and, under Options, setting
the execution time-out to a value shorter than the one-minute delay used below.
In this new query window, run the following query:

USE TestDB
GO
SET XACT_ABORT ON
BEGIN TRAN
CREATE TABLE MysteryTable (MysterID INT IDENTITY, Mystery VARCHAR(100))
INSERT MysteryTable VALUES ('Who killed Kennedy?')

WAITFOR DELAY '00:01'


COMMIT TRAN

The table is created and the record inserted. However, the combination of waiting for 1 minute, and
the execution time-out, will cause the following error before the transaction can be committed:

(1 row(s) affected)
Msg -2, Level 11, State 0, Line 2
Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

To test this, in another query window, run the following query:

SELECT * FROM MysteryTable;

With XACT_ABORT on, this will lead to an error:

Msg 208, Level 16, State 1, Line 18


Invalid object name 'MysteryTable'.

Now run the test once more, with SET XACT_ABORT OFF. Now, the test query in the other window
will just keep waiting for the other transaction to finish. You’ll have to manually commit, or rollback,
the transaction that is still open.

The other example we’ll use is a foreign key error. We’ll create an order with two order lines: one
with the correct order id, and one with a non-existent order id:

SET XACT_ABORT ON

BEGIN TRAN
DECLARE @Order_id INT

INSERT Orders (CustomerID, OrderDate, SalesAmount)


VALUES (2, GETDATE(), 50.00)

SET @Order_id = @@IDENTITY

INSERT OrderLines (OrderID, Item, Amount, LineAmount)


VALUES (@Order_id , 'Bag of screws', 4, 10.00)
INSERT OrderLines (OrderID, Item, Amount, LineAmount)
VALUES (@Order_id + 1, 'Screwdriver', 1, 10.00)

COMMIT

The result:

(1 row(s) affected)

(1 row(s) affected)

Msg 547, Level 16, State 0, Line 22


The INSERT statement conflicted with the FOREIGN KEY constraint "FK_OrderID". The conflict occurred in database "TestDB",
table "dbo.Orders", column 'OrderID'.

To clarify: the record for the order is inserted, the first order line is inserted, and the attempt to insert
the second order line fails with a foreign key violation. This error causes the transaction to be aborted
and rolled back.

Now run the test once more, with SET XACT_ABORT OFF. Once more, the record for the order is
inserted, the first order line is inserted, and the attempt to insert the second order line fails with a
foreign key violation. With XACT_ABORT off, this error will not cause the transaction to be rolled
back; we’ve now inserted an incomplete order.

This code still does not contain all the elements of proper error handling, but it does demonstrate the
importance of setting XACT_ABORT on: with this setting on, some errors will cause transactions to
be rolled back that would otherwise cause serious problems.

Implement TRY…CATCH error handling with Transact-SQL


If an error occurs, we do not simply want to roll back the transaction. Sometimes, you want some
form of control. You may want to perform some actions before terminating; this is what we’ll talk
about for this exam objective. For the next exam objective, we’ll talk about formatting a specific
error message to return to the application before terminating.
If you want to perform additional actions, instead of just rolling back, you can use the TRY… CATCH
block. A basic TRY… CATCH block looks like this:

BEGIN TRY
...some SQL statements...
...some more SQL statements...
...some more SQL statements...
END TRY
BEGIN CATCH
...some other SQL statements...
END CATCH

When a statement inside the TRY block raises an error with an error level of 11-19, execution of the
TRY block ends, and execution continues inside the CATCH block. Remember that an error level of 10
or lower indicates an informational message, and 20 or higher terminates the connection; so only error
levels 11-19 will pass control to the CATCH block. This is where you handle any open issues,
roll back the transaction and send an error message to the application.
Let’s start with a simple example, causing a divide-by-zero error:

BEGIN TRY
DECLARE @MyVariable INT
SET @MyVariable = 5
SET @MyVariable = @MyVariable/0
PRINT 'This line of code will not be executed'
END TRY
BEGIN CATCH
PRINT 'A useful error message'
END CATCH

The result:

A useful error message

Now, execute only the statements within the TRY block, so you can see what happens. Dividing 5 by
zero will generate an error:

Msg 8134, Level 16, State 1, Line 4


Divide by zero error encountered.

Without a TRY block, the PRINT statement following the error will still be executed. With a TRY
block it will not be executed; instead, execution will continue in the CATCH block (as you can see,
this is an error with error level 16).

To improve the output of our PRINT statement, we can add some additional information. Within a
CATCH block (and only within a CATCH block), the following functions are available:
* ERROR_NUMBER;
* ERROR_SEVERITY;
* ERROR_STATE;
* ERROR_PROCEDURE;
* ERROR_LINE;
* ERROR_MESSAGE.

The old error function, @@ERROR, had to be called directly after the statement that caused the error,
otherwise the error information was lost; these new functions are available in the CATCH block
wherever you need them. This allows for easier coding. Using these functions, we can enhance the
example above:

BEGIN TRY
DECLARE @MyVariable INT;
SET @MyVariable = 5;
SET @MyVariable = @MyVariable/0;
PRINT 'This line of code will not be executed';
END TRY
BEGIN CATCH
SELECT
ERROR_NUMBER() AS ErrorNumber,
ERROR_SEVERITY() AS ErrorSeverity,
ERROR_STATE() AS ErrorState,
ISNULL(ERROR_PROCEDURE(), 'Not inside a stored proc') AS ErrorProcedure,
ERROR_LINE() AS ErrorLine,
ERROR_MESSAGE() AS ErrorMessage;
END CATCH

At this point, the TRY… CATCH block still lacks a transaction, so we still need to add that. But
first, we need to get rid of the ugly PRINT statement, as this is not done in production code. Using
PRINT statements, or selecting output to the screen, works fine while testing; in production code,
however, to alert the application that something has gone wrong you should use either THROW or
RAISERROR.

Generate error messages with THROW and RAISERROR


There are two commands to generate error messages: the older RAISERROR and the newer THROW.
We’ll start with the newer command, THROW. Afterwards we’ll show that older command,
RAISERROR, still has its uses.

To start, just a quick reminder: in chapter 1, we told you that the statement before THROW needs to
be properly terminated with a semicolon; otherwise, unexpected things may happen. So in the following
code, we’ll terminate all statements.
The THROW statement can be used with or without parameters. Without parameters, it can only be
used inside a CATCH block, like this:

BEGIN TRY
DECLARE @MyVariable INT;
SET @MyVariable = 5;
SET @MyVariable = @MyVariable/0;
PRINT 'This line of code will not be executed';
END TRY
BEGIN CATCH
PRINT 'A useful error message';
THROW;
PRINT 'This line of code will not be executed either';
END CATCH

This is the result:

A useful error message


Msg 8134, Level 16, State 1, Line 4
Divide by zero error encountered.

The divide-by-zero error will cause the execution to continue in the CATCH block. There, the error
you’ve just caught will be thrown again, which will cause the code to be terminated (as you can see,
the last PRINT statement wasn’t executed).

The THROW statement can also be used outside of a CATCH block. In that case, it must be used with
the following three parameters:
* error_number, an integer that is equal to, or greater than, 50,000;
* message, a string with data type nvarchar(2048);
* state, a number between 0 and 255.

You can supply values for these parameters directly, or use variables for any of these parameters.
This is the way to supply them directly:

THROW 50000, 'A useful error message', 0;

And this is the way to use variables:

DECLARE @msg_number INT = 50000


,@msg_text NVARCHAR(2048) = N'A useful error message'
,@msg_state TINYINT = 0;

THROW @msg_number, @msg_text, @msg_state;

Either way, the result is the same:

Msg 50000, Level 16, State 0, Line 19


A useful error message

The benefit of using variables is that you can fill the message with something more useful, for
instance the time the error occurred. Inside a CATCH block, you could use any of the ERROR
functions we discussed earlier, but since we want to use similar examples both with and without a CATCH
block, we’ll use the time and process id instead.
The easy way to do this is simple string concatenation; the more difficult way would be to format an
error message. We’ll demonstrate both. Using simple string concatenation:

SET @msg_text = 'This error occurred on '


+ CONVERT(varchar(19),getdate(), 20) + ', process id '
+ CONVERT(varchar(10),@@SPID);

The alternative is to format an error message using SQL Server messages. This gives the advantage
that the error will occur in the correct language for the client. This process involves three steps: first,
you have to add the message to the sys.messages table using the system stored procedure
sp_addmessage; next, you have to format the message variable using the FORMATMESSAGE
function; and finally, throw the error message. We’ll demonstrate using two languages.

Strangely enough, while THROW accepts 50,000 as the lowest number, sp_addmessage accepts 50,001
as the lowest number. The reason the error number has to be at least 50,000 is that anything less
is considered a system error. So we’ll use 50,001 and add our message in two languages, US
English and German:

USE master;
GO
EXEC sp_addmessage @msgnum = 50001
,@severity = 16
,@msgtext = 'This error occurred on %s , process id %s.'
,@lang = 'us_english';

EXEC sp_addmessage @msgnum = 50001


,@severity = 16
,@msgtext = 'Dieser Fehler ist aufgetreten %1!,Prozess ID %2!'
,@lang = 'German';
GO

There is one point in this syntax that requires a bit of explanation: the percent sign. It is used
for parameter substitution. When we format our message using FORMATMESSAGE, we’ll need to
supply two values that will be inserted at the locations of the placeholders starting with the percent
sign, and this has to be done in the correct order, as you can see in the following example:

DECLARE @msg_number INT = 50001


,@msg_text NVARCHAR(2048)
,@msg_state TINYINT = 0;

SET LANGUAGE German;

SET @msg_text = FORMATMESSAGE(50001, CONVERT(varchar(19),getdate(), 20), CONVERT(varchar(10),@@SPID));

THROW @msg_number, @msg_text, @msg_state;

The result:

Die Spracheneinstellung wurde auf Deutsch geändert.


Msg 50001, Level 16, State 0, Line 58
Dieser Fehler ist aufgetreten 2018-10-01 17:52:46,Prozess ID 55

And if we set the language back to US English:

DECLARE @msg_number INT = 50001


,@msg_text NVARCHAR(2048)
,@msg_state TINYINT = 0;

SET LANGUAGE us_english;

SET @msg_text = FORMATMESSAGE(50001, CONVERT(varchar(19),getdate(), 20), CONVERT(varchar(10),@@SPID));

THROW @msg_number, @msg_text, @msg_state;

The result:

Changed language setting to us_english.


Msg 50001, Level 16, State 0, Line 57
This error occurred on 2018-10-01 18:03:24 , process id 55.

This covers THROW; now let’s move on to RAISERROR. Along the way, we’ll point out some
differences between the two statements.
RAISERROR takes, as input, at least three parameters:
* either an error number or a message string;
* the error level, indicating the severity;
* state, a number between 1 and 255.

If the first value requires parameters, these are added as well.

For the first value, you can choose to use an error number, in which case you need to have added this
error to sys.messages using sp_addmessage, as we saw with the THROW statement. If you choose a
message text (either a string variable or a string), the substitution of the percent sign we saw earlier
works just the same. So all three of these statements will have the same result:

DECLARE @date varchar(19) = CONVERT(varchar(19),getdate(), 20)


,@process_id varchar(10) = CONVERT(varchar(10),@@SPID);

RAISERROR (50001, 16, 1, @date, @process_id);


GO

DECLARE @date varchar(19) = CONVERT(varchar(19),getdate(), 20)


,@process_id varchar(10) = CONVERT(varchar(10),@@SPID);

RAISERROR ('This error occurred on %s , process id %s.', 16, 1, @date, @process_id);


GO

DECLARE @date varchar(19) = CONVERT(varchar(19),getdate(), 20)


,@process_id varchar(10) = CONVERT(varchar(10),@@SPID)
,@msg_text nvarchar (100) = N'This error occurred on %s , process id %s.';

RAISERROR (@msg_text, 16, 1, @date, @process_id);


GO

As with THROW, you can freely choose the state number; this has no outside consequences.
A big difference between RAISERROR and THROW, however, is that you can set the error level
with RAISERROR; you can’t do this with THROW. If you choose an error level higher than 10, and
lower than 20 in a TRY block, the RAISERROR command will cause execution to be passed on to the
CATCH block. This allows you to choose an error level of 10 or lower, if you only want an
informational message (without going to the CATCH block), or an error level of 20 or higher, which
will terminate the connection completely. If you want to use an error level of 20 or higher, you need to
have sysadmin permission, and add the keywords WITH LOG. This will cause the error to be written
to the SQL error log (and as a consequence, to the Windows Application Log).

DECLARE @date varchar(19) = CONVERT(varchar(19),getdate(), 20)


,@process_id varchar(10) = CONVERT(varchar(10),@@SPID)
,@msg_text nvarchar (100) = N'This error occurred on %s , process id %s.';

RAISERROR (@msg_text, 20, 1, @date, @process_id) WITH LOG;


GO

The result:

Msg 2745, Level 16, State 2, Line 17


Process ID 62 has raised user error 50000, severity 20. SQL Server is terminating this process.
Msg 50000, Level 20, State 1, Line 17
This error occurred on 2018-10-02 09:45:28 , process id 62.
Msg 596, Level 21, State 1, Line 11
Cannot continue the execution because the session is in the kill state.
Msg 0, Level 20, State 0, Line 11
A severe error occurred on the current command. The results, if any, should be discarded.

Don’t forget to clean up our mess, and remove the message from sys.messages:

EXEC sp_dropmessage @msgnum = 50001, @lang = 'all';

There is one more thing we need to mention: THROW honors the XACT_ABORT setting,
RAISERROR does not. Let’s take the following example:

DROP TABLE IF EXISTS MysteryTable;

DECLARE @date varchar(19) = CONVERT(varchar(19),getdate(), 20)


,@process_id varchar(10) = CONVERT(varchar(10),@@SPID);

SET XACT_ABORT OFF;


BEGIN TRAN

CREATE TABLE MysteryTable (MysterID INT IDENTITY, Mystery VARCHAR(100));


INSERT MysteryTable VALUES ('Who killed Kennedy?');

THROW 50000, 'A useful error message', 0;


--RAISERROR (50001, 16, 1, @date, @process_id);

COMMIT
GO
SELECT * FROM MysteryTable;

Use this to test the result of the combinations of XACT_ABORT on or off with either
THROW or RAISERROR. We’ll go ahead and tell you the outcome. If you use RAISERROR, the
XACT_ABORT setting doesn’t make a difference: the table creation will not be rolled back. The
outcome is the same if you use THROW with XACT_ABORT off. Only if XACT_ABORT is on and
you use THROW will the transaction that created the table be rolled back (causing the SELECT
to fail).

As you can see, both commands can be used to compose detailed error messages. The advantage of
THROW is that it can be used to rethrow the original error, is somewhat easier to use (because you
do not need to use a message from sys.messages) and will honor the XACT_ABORT setting; the
advantage of RAISERROR is that you can choose the error level.

As an alternative to using either THROW or RAISERROR (or PRINT in test code), you can use code
to insert a record detailing the error into a custom error table. Just remember to add this code after
the ROLLBACK in the CATCH block, or write the data to a (table) variable; otherwise this insert
would be rolled back as well.
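
A minimal sketch of that approach, assuming a hypothetical logging table dbo.ErrorLog; the rollback comes before the insert, so the logged row survives the rolled-back transaction.

CREATE TABLE dbo.ErrorLog
    (ErrorTime DATETIME2 DEFAULT SYSDATETIME(), ErrorNumber INT, ErrorMessage NVARCHAR(2048));
GO
BEGIN TRY
    BEGIN TRAN
    SELECT 5/0;                          -- force a divide-by-zero error
    COMMIT TRAN
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRAN;    -- roll back first...
    INSERT dbo.ErrorLog (ErrorNumber, ErrorMessage)
    VALUES (ERROR_NUMBER(), ERROR_MESSAGE());   -- ...then log, so this insert is not rolled back
END CATCH
GO
SELECT * FROM dbo.ErrorLog;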

Implement transaction control in conjunction with error handling in stored procedures


We’ve now described and demonstrated all the required components for error handling. How
transactions work, how to use TRY…CATCH blocks, how to create an error message and how the
XACT_ABORT setting will influence what happens when an error occurs. When you put all the
components together, a blueprint for a stored procedure would look something like this:
CREATE PROCEDURE usp_my_stored_proc @MyVariable <data_type>
AS
BEGIN
SET XACT_ABORT, NOCOUNT ON;

BEGIN TRY
BEGIN TRAN
--Validate input parameter(s)
IF @MyVariable ...

--Actual work
...
COMMIT TRAN
END TRY
BEGIN CATCH
IF @@TRANCOUNT > 0 ROLLBACK TRAN
-- error message, either THROW, RAISERROR or custom error logging code
END CATCH
END

You start a stored procedure by setting the connection settings XACT_ABORT and NOCOUNT ON.
Next, you enter a TRY block, and start a transaction. The first thing to do here is check your input
variables. Then, you go to work. Within the TRY block, you can use savepoints if you’ve performed
some long-running operations that you do not want to undo, should part of the transaction have to be
rolled back. At the end, if nothing has gone wrong, you commit the transaction; if something has gone
wrong, execution will proceed in the CATCH block. Here, you roll back the transaction and, if there
is an error, return the error to the application (or the stored procedure that called this one).

There is one check in this code that we have talked about, but not yet demonstrated. When discussing
named transactions, we demonstrated that a rollback rolls back the outer transaction; any
commits that followed caused an error, because they were issued without a corresponding open
transaction. This is the reason to check whether @@TRANCOUNT is non-zero before rolling back in the
CATCH block. Let’s see this in action.
Suppose your actual work contains a call to another stored procedure. This inner stored procedure
should also have logic to rollback in case of errors. We’ll use an inner stored proc that generates a
divide-by-zero error, causing it to roll back:

CREATE PROC usp_inner_proc


AS
BEGIN
SET XACT_ABORT ON
BEGIN TRY
BEGIN TRAN
SELECT 5/0
COMMIT TRAN
END TRY
BEGIN CATCH
ROLLBACK;
THROW
END CATCH
END
GO
CREATE PROC usp_outer_proc
AS
BEGIN
SET XACT_ABORT ON
BEGIN TRY
BEGIN TRAN
EXEC usp_inner_proc
COMMIT TRAN
END TRY
BEGIN CATCH
ROLLBACK;
THROW
END CATCH
END
GO
EXEC usp_outer_proc

The result will be that the inner procedure rolls back the transaction started by the outer procedure.
Since the inner stored procedure fails inside the TRY block, execution will continue in the CATCH
block, with an attempt to roll back the transaction that has just been closed. This will generate an
error:

Msg 3903, Level 16, State 1, Procedure usp_outer_proc, Line 11 [Batch Start Line 15]
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION.

Therefore, you need to check if the transaction hasn’t been closed by another stored procedure, before
rolling back:

ALTER PROC usp_outer_proc


AS
BEGIN
SET XACT_ABORT ON
BEGIN TRY
BEGIN TRAN
EXEC usp_inner_proc
COMMIT TRAN
END TRY
BEGIN CATCH
IF @@TRANCOUNT > 0 ROLLBACK;
THROW
END CATCH
END

This will prevent the error in the outer stored procedure. If we execute the outer stored procedure
once more, this is the result:

Msg 8134, Level 16, State 1, Procedure usp_inner_proc, Line 7 [Batch Start Line 32]
Divide by zero error encountered.

Now, the THROW statement will return the error of what went wrong during the actual work, not
what went wrong in the error control logic. Let’s clean up:

DROP PROC IF EXISTS usp_inner_proc


DROP PROC IF EXISTS usp_outer_proc

Two final notes for the exam:


* input validation is not an exam objective, but you should check input parameters whenever possible.
If you know a value should be, for example, between 1 and 7, test this, and if it’s not OK, quit the
stored proc with a detailed error. And the same holds true after the real work, before committing the
transaction: test the outcome whenever it makes sense. If you know, for example, that only one record
should be affected, test this outcome, and if it’s not OK, roll back & quit the stored proc with a
detailed error.
* Microsoft recommends using THROW instead of RAISERROR; when writing your own code, I
recommend knowing all available tools of the particular version of SQL Server you’re working with,
and making an appropriate choice.

Knowing all the components should help you pass the exam, but it is not enough to write your own
fault tolerant code, and be a successful database programmer. For more information about error
handling, I recommend reading the eBook Defensive Database Programming with SQL Server, by
Alex Kuznetsov, available for free at: https://www.red-gate.com/library/defensive-database-
programming
Implement data types and NULLs
We are almost there. The last set of exam objectives will be relatively easy, mostly due to the fact that
we’ve already touched upon these subjects in the previous parts of this book. Besides, you should
already have a lot of experience working with data types in real life.

For this last set of objectives, we’ll discuss the various SQL Server data types, and how to choose
the correct data type; and what happens when you convert one data type to another, either explicitly or
implicitly. We’ll also discuss how to handle NULL values, and what the effect of joins and functions
is when dealing with these NULL values.

Data types
As we’ve already seen in the previous chapters: for every attribute, variable or parameter, you have
to tell SQL explicitly what type of data it is: for example, some sort of number, a text string, or a date.
Making SQL aware of the data type has several advantages. By defining the data type, you have some
defense against users putting data in the wrong field, because you can’t put a name in a field that has
been defined as a date field. This way, a data type in SQL Server acts as a constraint. You also can’t
insert a non-existing date like “February 31, 2000”; SQL would prevent this, and raise an error:

The conversion of a varchar data type to a datetime data type resulted in an out-of-range value.

In some data type categories, there is a difference in the accuracy between the different data types.
For instance: smalldatetime is accurate up to the minute, whereas datetime is (almost) accurate up to
the thousandth of a second.
In addition: SQL has specific system functions for different data types. For example, SQL allows you
to easily add one month to “January 31, 2000” if you store this as a date, but if you store it as text, you
have to write your own logic to achieve this; in that case, you’d have to make sure that you don’t end
up with “February 31, 2000”. In chapter 1, we’ve already seen that there is a system function for this:
DATEADD.
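
For instance, DATEADD handles month-end dates for you. A quick check (the result here is February 29, because 2000 was a leap year):

DECLARE @d date = '2000-01-31';
SELECT DATEADD(MONTH, 1, @d) AS NextMonth;   -- 2000-02-29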
Another example of how SQL handles different data types differently, is the Boolean. This is
basically a field that is either true or false; think of it as a checkbox. An example in an employee
database would be an attribute on an employee to indicate whether an employee is a current or a past
employee. You could define a text field for this, call it “IsCurrentEmployee”, and store either the text
string “true” or the text string “false”. The better alternative is storing it as a Boolean. As with the
date example above, choosing the Boolean as data type limits the possible values to true and false; if
you choose a text data type, any value could be entered, including values that don’t make sense for
that particular attribute.
Also: storing the text string “false” takes at least 5 bytes; storing a Boolean takes 1 bit or at the most,
one byte. One byte is 8 bits, so storing “false” as a text string takes up 5 to 40 times as much space as
storing it as a Boolean.

So in discussing the data types, we’ll discuss the range of values the data type allows, the accuracy,
and the storage requirements. We’ll only cover a few system functions specific to these data types, as
system functions were the topic of a separate exam objective in chapter 1.

Exactly which data type should you choose? SQL has a long list of them. If there is no data type that
meets your needs, you can even create your own data type (using CREATE TYPE); this is often done
for attributes like telephone number or zip code. We won’t cover user defined data types, though;
we’ll stick to the default ones. This is a list of all data types in SQL Server 2016:

Category: Data types

Exact numerics: tinyint, smallint, int, bigint; bit; money, smallmoney; decimal, numeric
Approximate numerics: float, real
Date and time: date, time; datetime, datetime2, smalldatetime; datetimeoffset
Character strings: char, varchar, text; nchar, nvarchar, ntext
Binary strings: binary, varbinary, image
Other: cursor; spatial geography and geometry types; hierarchyid; rowversion; XML; sql_variant; uniqueidentifier; table

Note that these are the data types for SQL Server 2016; new versions may add data types, and
they are well worth checking out. For instance, date and time were added in SQL Server 2008; before
that, only datetime was available. So in order to store just a date, most designs chose the datetime
data type, doubling the storage requirements.

Exact numerics
The data types tinyint, smallint, int & bigint are exactly the same, except for the range of numbers
they can store, and the storage requirements. They are used to store whole numbers. These are the
ranges and storage requirements of these four data types:
* a tinyint takes up 1 byte of storage, and therefore can only store numbers between 0 and 255;
* a smallint takes up 2 bytes, and can store numbers between -32,768 and 32,767
* an int takes up 4 bytes, and can store numbers between (roughly) -2 billion and 2 billion
* a bigint takes up 8 bytes, and can store numbers between -2^63 and 2^63 - 1.
Choose the smallest data type you can get away with. If you need an attribute “week number”,
a tinyint will suffice; you know up front that there will never be more than 52 weeks in a year. But do
look ahead, and take growth into account.
A bit is simply a Boolean, either 0 or 1 (true or false). The amount of storage required depends on the
number of bit columns. SQL Server stores bit columns together; so the storing of the first 1-8 bits
takes 1 byte, storing bits 9-16 takes another byte, etc. Therefore, every non-nullable bit column takes
between 1 bit and 1 byte of storage.

Money and smallmoney are intended to store monetary values. Both have an accuracy of one ten-
thousandth. Smallmoney can store values between -214,748.3648 and 214,748.3647, and requires 4
bytes of storage; money can store values between (roughly) plus and minus 922 trillion, and requires 8 bytes of
storage. Neither money data type stores the currency, though; this, and possible rounding errors,
seriously limits the applicability of these data types.

The data type decimal stores a number, where you define the precision and the scale. Numeric is just
another name for decimal. Precision is the total number of digits that can be stored; if unspecified,
precision is 18. Scale indicates the number of digits to the right of the period; it can only be specified
if precision is also specified. The maximal precision is 38, in which case a decimal can store
numbers between -10^38 and 10^38. This is the syntax to declare a decimal with a total of 5 digits,
with two on the right side of the period:

DECLARE @MyVariable DECIMAL(5,2)

DECIMAL is often abbreviated as DEC. You can also use NUMERIC.

The higher the precision, the more storage is required. A decimal requires 1 byte of storage, plus 4
for the first 10 digits, another 4 for the second 10 digits etc.; therefore, a decimal requires a minimum
of 5 bytes and a maximum of 17 bytes.

Approximate numerics
Data types float and real are used to store approximate data. Real is the synonym for float(24). This
is the syntax to define a float with a precision of n, with a maximum precision of 53:
DECLARE @MyVariable FLOAT(n)

Regardless of what you enter, anything from 1-24 will result in float(24); anything above that will
result in float(53). The former will take 4 bytes of storage; the latter 8.
Float and real are approximate numbers; they cannot contain all possible values, so be aware that
rounding may occur when you have a lot of digits to the right side of the period.
For converting data from the float or real data type to a string data type, there is a special function you
can use: STR. This allows you to specify the total number of characters (including the period) and the
number of digits to the right of the period. Here’s an example:

DECLARE @MyVariable real = 100.123456789

SELECT STR(@MyVariable, 6,2)

The result is: 100.12.


Date and time
In chapter 1, we already covered the date and time data types. We’ll repeat them here, plus the
required storage:
* date. Can only contain a date. E.g. ‘2017-08-26’. Storage size: 3 bytes.
* time. Can only contain a time, with an accuracy of up to 100 nanoseconds. E.g.
’17:32:00.1234567’. Storage size: 5 bytes.
* smalldatetime. Contains both date and time, with an accuracy of up to one minute. E.g. ‘2017-08-26
17:33’. Storage size: 4 bytes.
* datetime. This is the date & time data type that is used most often. Contains both date and time, with
an accuracy of up to one three-hundredth of a second. E.g.: ‘2017-08-26 17:33:00.123’. Because of
this accuracy, when you use 3 digits after the period, the rightmost digit is always zero, three or seven.
Storage size: 8 bytes.
* datetime2. Contains both date and time, with an accuracy of up to 100 nanoseconds. E.g. ‘2017-08-
26 17:33:00.1234567’. Storage size: 8 bytes.
* datetimeoffset. Contains both date and time, with an accuracy of up to 100 nanoseconds, plus the
UTC offset (difference, in hours, between local time and UTC). E.g. ‘2017-08-26 17:33:00.1234567
+ 02:00’. Storage size: 10 bytes.

Microsoft recommends using the new date & time data types (date, time, datetime2 and
datetimeoffset) instead of the older ones (datetime & smalldatetime).
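
As a quick illustration of the precision difference between the older and newer types (a small sketch; the exact values will of course depend on when you run it):

DECLARE @old datetime  = SYSDATETIME();   -- implicitly converted; rounded to 1/300 of a second
DECLARE @new datetime2 = SYSDATETIME();   -- keeps the full 100-nanosecond precision
SELECT @old AS [datetime], @new AS [datetime2];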

There is a long list of date & time system functions, as we’ve seen in chapter 1; make sure you are
familiar with them.

Character strings
There are Unicode and non-Unicode character string data types. The non-Unicode data types are
char, varchar and text; the Unicode equivalents are nchar, nvarchar and ntext. The data type text
will be removed in a future version of SQL Server, so we won’t cover it here (the same goes for ntext and
image). Char is a fixed-length character string, with a maximum of 8000 characters. If you need to
exceed 8000 characters, use varchar(max).
As we’ve already seen many times before, this is used to declare a variable with a fixed length of 4:

DECLARE @MyVariable CHAR(4) = 'Text'

The char data type takes up one byte per character. The varchar equivalent requires only one byte per
character actually stored, plus a little overhead (2 bytes) to keep track of the length. So if the length of
the data you’re going to store is fixed, use char; if not, use varchar.
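
A small sketch using DATALENGTH to show the difference in storage: the char variable is padded to its declared length, while the varchar variable stores only the four characters.

DECLARE @fixed    char(10)    = 'Text';
DECLARE @variable varchar(10) = 'Text';
SELECT DATALENGTH(@fixed) AS char_bytes,        -- 10 (padded with trailing spaces)
       DATALENGTH(@variable) AS varchar_bytes;  -- 4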

The equivalent Unicode data types are nchar and nvarchar. The differences are that you can store
Unicode data, that the storage requirement is two bytes per character instead of one, and that you need
to use the N prefix when assigning a string literal:

DECLARE @MyVariable NCHAR(4) = N'Text'

Binary strings
The data types binary and varbinary store binary data. Binary is a fixed-length data type, with a
maximum of 8000 bytes. If you need to exceed 8000 bytes, use varbinary(max). The binary data
types can be used to store anything in a database (e.g. ZIP files), should the need arise.

Other
The other data types are: cursor, spatial geography and geometry types, hierarchyid, rowversion,
XML, sql_variant, uniqueidentifier and table. Of these, we’ve already covered XML, cursor and
table extensively, so we won’t repeat that here. For the rest: let’s start with the uniqueidentifier.

The uniqueidentifier is a 16-byte GUID. You can create one using the functions NEWID or
NEWSEQUENTIALID, like this:

DECLARE @MyVariable uniqueidentifier = NEWID()

SELECT @MyVariable

The result:

6B52550D-EAAF-4BE2-9BF1-3C28510480B2
They are guaranteed to be unique. They are also partially random. Using the function
NEWSEQUENTIALID guarantees that each newly generated unique identifier is higher than the one
before.
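
Note that NEWSEQUENTIALID can only be used as a column default, not in a SELECT or variable assignment. A minimal sketch (the table name is just for illustration):

CREATE TABLE dbo.SequentialDemo
(
    ID      uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
    Payload varchar(100)
);
INSERT dbo.SequentialDemo (Payload) VALUES ('first'), ('second');
SELECT * FROM dbo.SequentialDemo;   -- the second ID is higher than the first
DROP TABLE dbo.SequentialDemo;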

Unique identifiers can also be easily created on application servers, and still be guaranteed to be
unique; that is why they are often chosen as primary key and clustered index. There are two
downsides to this:
* the clustered index key is included in every other index, so this makes the indexes bigger and
slower;
* unless sequential ID’s are used, new records will be randomly distributed over the table,
potentially causing page splits and fragmentation.

For most tables, choosing a uniqueidentifier as primary key is not a good idea. In most cases, for
the primary key and clustered index key, choose something that is small, unchanging and ever
increasing; this translates to the smallest variation of an integer (i.e. tinyint, smallint, int or bigint)
that will fit the maximum number of records you estimate your table is ever going to contain.

Rowversion is an automatically generated, unique 8-byte binary number. It can be used to track changes
within a table. If you add a rowversion column to a table, every update to a record will cause its
rowversion to be incremented. Here is a short example:

CREATE TABLE Table1 (Column1 smallint, Column2 rowversion)

INSERT Table1 (Column1) VALUES (1)


SELECT * FROM Table1
UPDATE Table1 SET Column1 = 2
SELECT * FROM Table1
The result:

However, this rowversion column only tells you that something has changed; it does not provide a
mechanism to keep track of what actually changed (you might want to use Change Data Capture for
that, but that is a different topic altogether).

The sql_variant data type is a strange one. You can store data from almost every other data type in
it, and the sql_variant data type will store the data as well as the data type of the original data.
Database design best practices dictate that you should describe the data type of your data up front.
Therefore, I have not (yet) come across a good scenario in which to recommend the use of this data type.

Spatial geography and geometry types, as well as the hierarchyid data type, are special data types for
very special circumstances. They require either a lot of explanation, or none at all. Since they are not
explicitly named as exam objectives, we won’t cover them at all.

Evaluate results of data type conversions


In chapter 1, we’ve already demonstrated the use of the explicit data type conversions, using the
functions CAST and CONVERT, and their alternatives TRY_CAST and TRY_CONVERT. Do you
recall the differences between these functions? If not, you will want to reread chapter 1. For the
particular use of converting float data to a string character, we’ve also discussed the STR function.

The alternative to an explicit conversion is an implicit conversion. An implicit conversion is what


happens when using constants, or when SQL needs to compare or operate on two values with
different data types. In the following example, SQL will implicitly convert the string ‘45’ to a number;
that’s why it can perform the addition. And the comparison between the variable and the string also
works:

DECLARE @MyVariable INT = '45'

SET @MyVariable = @MyVariable + 10

IF @MyVariable > '50'


PRINT 'Yep, it''s bigger.'

This can be deduced from the results:

Yep, it's bigger.


The same thing would happen if we had declared MyVariable as a text string, e.g.
VARCHAR(100). We try to add two numbers, and even though one is declared as a text string, SQL
automatically “gets it”.
However, when we try to use a similar example to concatenate two strings, this will fail:

DECLARE @MyVariable VARCHAR(100) = 'My favourite number is '

SET @MyVariable = @MyVariable + 16

SELECT @MyVariable

The result:

Msg 245, Level 16, State 1, Line 3


Conversion failed when converting the varchar value 'My favourite number is ' to data type int.

Why doesn’t SQL Server automatically understand what we’re trying to do here? Because in both
situations, SQL Server uses the same, simple rule. Each data type is ranked in a data type precedence
list, with user-defined data types having the highest precedence, and binary having the lowest
precedence.
Whenever SQL needs to compare two values with a different data type, the data type with the lower
precedence in this list will be converted to the data type with the higher precedence. And since the
data type integer has a higher precedence than the data type varchar, SQL will try to convert the text
string to an integer. In the first example, this is what we intended; in the second example, this failed.
This string could of course not be converted to an integer, causing the error.
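
If concatenation is what you want, an explicit conversion of the number to the lower-precedence type solves it (a quick sketch):

DECLARE @MyVariable VARCHAR(100) = 'My favourite number is ';
SET @MyVariable = @MyVariable + CONVERT(varchar(10), 16);
SELECT @MyVariable;   -- My favourite number is 16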

You can find the complete data type precedence list here:
https://docs.microsoft.com/en-us/sql/t-sql/data-types/data-type-precedence-transact-sql?view=sql-
server-2016

Please read the list for yourself. The quick summary: from highest to lowest precedence, the list is:
* user-defined data types;
* SQL_variant, as anything can be converted to a SQL_variant;
* XML;
* date & time data types;
* number data types;
* text data types;
* binary.

Microsoft also has a list of what data type can be implicitly converted to what other data type. We’ll
provide the link for reference, should you ever need it:
https://docs.microsoft.com/en-us/sql/t-sql/data-types/data-type-conversion-database-engine?
view=sql-server-2016

Furthermore, there are a few rules you need to keep in mind:


* SQL will convert any whole number that will fit into an integer, into an integer (not a tinyint or
smallint if that would fit as well);
* anything bigger will be converted to a decimal;
* a constant with a decimal point will be converted to a decimal data type, with minimum precision
and scale required;
* if an operator combines two expressions with the same data type, the result will be that same data
type.

We already saw that last rule in action when talking about arithmetic functions. The result of an
operation on two integers will be an integer. That explains the unexpected result of the following
simple math problems:

SELECT 2/5
SELECT 2000000000 + 2000000000

All of these constants will be implicitly converted to integers. The first will return zero, as 0.4 is
truncated; the second will cause an arithmetic overflow, as 2 billion will fit into an integer, but 4
billion won’t.
So theoretically, if we replaced one of these numbers with 4 billion, that constant would be
implicitly converted to a decimal data type, and because SQL now needs to operate on two values
with different data types, it will convert the value with the lower-precedence data type to the
higher-precedence data type. In this case, that means the integer will be converted to decimal before adding the
two numbers; and after this conversion, the addition will succeed:

SELECT 2000000000 + 4000000000

The result is, of course, 6 billion.
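
Both problems from the earlier example can be avoided by making one of the operands the desired data type up front, so the implicit conversion works in your favour (a quick sketch):

SELECT 2.0/5;                                      -- 0.400000: a decimal operand forces a decimal result
SELECT CAST(2000000000 AS bigint) + 2000000000;    -- 4000000000: no arithmetic overflow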


The following simple trick can be used to check for yourself into what data type SQL will convert a
given constant. We’ll provide the constant in the SELECT statement, create a column in a new table,
and look up the definition in the system tables sys.columns and sys.types:
DROP TABLE IF EXISTS Type_check_table

SELECT 4.63 AS column1


INTO Type_check_table

SELECT c.name, t.name, c.max_length, c.precision, c.scale, c.is_nullable


FROM sys.columns c
INNER JOIN sys.types t ON c.system_type_id = t.system_type_id
WHERE object_id = object_id('Type_check_table')

DROP TABLE IF EXISTS Type_check_table


Remember that numeric and decimal are essentially the same data type.

In the output, max_length refers to the storage size (in bytes). Note that the column is created as non-
nullable. Had we used a variable, instead of a constant, the column would have been nullable.

DECLARE @MyVariable numeric(3,2) = 4.63

SELECT @MyVariable AS column1


INTO Type_check_table

These examples, the explanations and the two links provided should give you enough information to understand what will happen when SQL needs to implicitly convert one data type to another. And in theory, the “when” is pretty simple: whenever SQL needs to operate on, or compare, two values with different data types, it will try to implicitly convert the value with the lowest-precedence data type to the data type with the highest precedence. Should this fail, or if you want more control over the conversion, perform an explicit conversion instead, by using one of the conversion functions: CONVERT, CAST, TRY_CONVERT or TRY_CAST.
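
As a quick, hedged illustration of these four functions (the values are arbitrary): CAST and CONVERT raise an error when the conversion fails, whereas TRY_CAST and TRY_CONVERT return NULL instead.

SELECT CAST('123' AS int)          -- 123
SELECT CONVERT(int, '123')         -- 123
SELECT TRY_CAST('abc' AS int)      -- NULL, no error
SELECT TRY_CONVERT(int, 'abc')     -- NULL, no error
SELECT CAST('abc' AS int)          -- fails with a conversion error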

Determine proper data types for given data elements or table columns
Choosing the correct data type for a data element is actually not that difficult. First off, you need to
know all data types that are available in SQL Server. In particular, you need to know the type of data
it can store, the accuracy, the range of values and the storage requirements.
Then, for every specific data element, you have to consider what type of data you want to store in it.
This tells you what data type category you need. Next, determine the accuracy you need, as well as
the smallest and largest possible value you need to store now, as well as in the foreseeable future.
And then, if that still leaves you with multiple choices: pick the one that requires the least amount of
storage (mainly for performance reasons). And if this still doesn’t give you a good fit between what
SQL offers, and what you need, you might consider creating your own data type.
If I were to say that, nine times out of ten, this short checklist is sufficient, that would still be
overestimating the complexity of choosing the most appropriate data type. Certainly, there are
situations where you need to actually analyze access patterns and use performance tests to determine
the most appropriate data type. For instance, choosing the data type for the primary key of a large
table with a high transaction volume and rigorous performance requirements does warrant deeper
analysis. Still, in most situations, the choice for the correct data type can be pretty straightforward.

There are, however, two considerations to be made here. The first is that, far too often, developers do not spend the time needed to pick the appropriate data type, regardless of how much (or how little) time that would take. In production databases, we still see columns that are used to store only a date, but have a datetime data type; columns that store only non-Unicode text, but have a Unicode data type; or columns that need to store only small, whole numbers, but have an integer data type. Determining the appropriate data type would not have taken much time; apparently, even that time was not taken.
The second consideration is that, for the exam, you are more likely to be tested on the more difficult
situations. That means that you need to memorize the list of data type categories, the data types in each
category and the differences between the data types.
So we’ll repeat the basic rules.
1) Choose the smallest data type you can get away with. If you need an attribute “week number”, a tinyint will suffice; you know up front that a year never has more than 53 weeks. But do look ahead, and take growth into account: converting an integer column that is the clustered index of a large table to a bigint will take a long time.
2) Choosing the correct data type is even more important for primary and foreign key columns, and columns that will be included in indexes, as these values are stored more than once. A good starting point is a key column with a data type that is small, static and ever increasing.
3) Only allow NULL when necessary. There is a slight overhead to having nullable columns in terms of storage, and the query optimizer can sometimes make better decisions for non-nullable columns than for nullable columns.
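
To make these rules concrete, here is a minimal sketch; the table and all column names are made up for illustration only:

CREATE TABLE dbo.WeeklySales (
     WeeklySalesID int IDENTITY(1,1) NOT NULL PRIMARY KEY  -- small, static, ever increasing
    ,SalesYear     smallint NOT NULL       -- a year fits easily; tinyint would be too small
    ,WeekNumber    tinyint NOT NULL        -- 1 through 53 fits in a tinyint
    ,SalesAmount   decimal(19,4) NOT NULL
    ,Remark        varchar(500) NULL       -- nullable only because a remark is genuinely optional
)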

Identify locations of implicit data type conversions in queries


We’ve covered all the theory of implicit data type conversions previously. Locating implicit data type
conversions is a matter of meticulously checking every query you need to work on.
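
A hedged sketch of the most common place to look (the table and column names are hypothetical): a constant or parameter of the wrong type in the WHERE clause. The actual execution plan will show a CONVERT_IMPLICIT, often accompanied by a warning that the conversion may affect the cardinality estimate.

-- Assume CustomerCode is defined as varchar(10)
SELECT CustomerName
FROM dbo.Customers
WHERE CustomerCode = 112       -- the varchar column is implicitly converted to int; not sargable

SELECT CustomerName
FROM dbo.Customers
WHERE CustomerCode = '112'     -- no conversion: the constant matches the column's data type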

Determine the correct results of joins and functions in the presence of NULL values
This is also a subject we have already covered. We discussed how to handle joins on nullable columns in chapter 1, and for every other function, we pointed out the impact of NULL values whenever it deserved particular attention.
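
As a quick refresher, a minimal sketch with made-up table variables: a NULL in a join column never matches anything, not even another NULL, because NULL = NULL evaluates to UNKNOWN.

DECLARE @a TABLE (id int NULL)
DECLARE @b TABLE (id int NULL)

INSERT @a VALUES (1), (NULL)
INSERT @b VALUES (1), (NULL)

SELECT *
FROM @a a
INNER JOIN @b b ON a.id = b.id   -- returns only the row with id = 1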

Identify proper usage of ISNULL and COALESCE functions


We’ve already covered the function ISNULL, but not the alternative: COALESCE. So we’ll do that
here.
The ISNULL function takes two input parameters: a value and a replacement value. If the first value is
not NULL, the function will return this value; otherwise, it will return the replacement value. In code:

DECLARE @MyNULLVariable VARCHAR(100)
       ,@MyNonNULLVariable VARCHAR(100) = 'Original value'

SELECT ISNULL(@MyNULLVariable, 'Replacement value')
      ,ISNULL(@MyNonNULLVariable, 'Replacement value')

In other words: ISNULL will return the first non-null value. The biggest difference between the ISNULL function and COALESCE is that COALESCE accepts two or more input parameters, and out of those parameters, it will return the first non-null value.

DECLARE @MyNULLVariable VARCHAR(100)
       ,@MyNonNULLVariable VARCHAR(100) = 'Original value'

SELECT COALESCE(@MyNULLVariable, @MyNonNULLVariable, 'Replacement value')

COALESCE is an expression, not a function. Under the covers, it is actually a CASE expression:

SELECT CASE
WHEN @MyNULLVariable IS NOT NULL THEN @MyNULLVariable
WHEN @MyNonNULLVariable IS NOT NULL THEN @MyNonNULLVariable
ELSE 'Replacement value'
END

There are also a few more subtle differences, with regards to:
* the ANSI standard;
* the resulting data type;
* the resulting nullability;
* performance.

First off: ISNULL is proprietary T-SQL, whereas COALESCE is part of the ANSI standard. If you need to work with other database systems, or plan to move your database to another platform, you might want to use COALESCE for this reason.

Second: the resulting data type is different. ISNULL will return the data type of the first value;
COALESCE will return the data type with the highest precedence of all supplied input parameters
(the same behavior as the underlying CASE expression).
This behavior can lead to truncation and conversion errors, or unexpected results if not all input
parameters have the same data type. For instance, datetime has a higher precedence than date. So if
the first input parameter has data type date, and the second datetime, ISNULL will return a date, and
COALESCE will return a datetime:

DECLARE @MyDateVariable date = '2018-10-05'
       ,@MyDateTimeVariable datetime = '2018-10-05 11:07'

SELECT ISNULL(@MyDateVariable, @MyDateTimeVariable)
      ,COALESCE(@MyDateVariable, @MyDateTimeVariable)

If we do not assign a value to the date variable, the resulting values will be different, but the data types of the outcome stay the same: ISNULL still returns a date, and COALESCE still returns a datetime.

The third difference between ISNULL and COALESCE is the nullability of the outcome. Earlier on,
we used code to create a table using a SELECT INTO statement to determine the resulting data type
of an implicit conversion. We can use the same code here, to determine the nullability of the outcome
of the ISNULL and COALESCE function using different input.
The outcome of ISNULL is non-nullable, unless both input parameters are nullable. Let’s demonstrate this, before moving on to the nullability of the outcome of COALESCE.
Remember: the string constant is non-nullable, the variable is nullable even though we provide a
value for it:

DROP TABLE IF EXISTS Type_check_table

DECLARE @MyDateVariable date = '2018-10-05'

SELECT ISNULL(@MyDateVariable, '2018-10-05 11:07') AS column1
      ,ISNULL('2018-10-05 11:07', @MyDateVariable) AS column2
      ,ISNULL(@MyDateVariable, @MyDateVariable) AS column3
      ,ISNULL('2018-10-05 11:07', '2018-10-05 11:07') AS column4
INTO Type_check_table

SELECT c.name, t.name, c.max_length, c.precision, c.scale, c.is_nullable
FROM sys.columns c
INNER JOIN sys.types t ON c.system_type_id = t.system_type_id
WHERE object_id = object_id('Type_check_table')

DROP TABLE IF EXISTS Type_check_table

The outcome: only column3, for which both input parameters are nullable, is flagged as nullable; the other three columns are non-nullable. As stated: only when both input parameters are nullable is the outcome nullable.

If we replace all four ISNULL functions with COALESCE (see the sketch below), the result is different: now the outcome is nullable if any of the input parameters is nullable; only if all input parameters are non-nullable will the outcome be non-nullable.
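
A sketch of the modified check script (the same pattern as before, with only the function name changed):

DROP TABLE IF EXISTS Type_check_table

DECLARE @MyDateVariable date = '2018-10-05'

SELECT COALESCE(@MyDateVariable, '2018-10-05 11:07') AS column1
      ,COALESCE('2018-10-05 11:07', @MyDateVariable) AS column2
      ,COALESCE(@MyDateVariable, @MyDateVariable) AS column3
      ,COALESCE('2018-10-05 11:07', '2018-10-05 11:07') AS column4
INTO Type_check_table

SELECT c.name, t.name, c.max_length, c.precision, c.scale, c.is_nullable
FROM sys.columns c
INNER JOIN sys.types t ON c.system_type_id = t.system_type_id
WHERE object_id = object_id('Type_check_table')

DROP TABLE IF EXISTS Type_check_table

With this version, column1 through column3 are nullable, because each has at least one nullable input parameter; only column4, with two non-nullable constants, is non-nullable.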

The last difference is, under certain circumstances, performance. As stated, the COALESCE expression is actually rewritten as a CASE expression. When a subquery is used as input, this makes a difference. Consider the following query:

SELECT COALESCE (subquery, replacement_value)

Under the covers, this is:

SELECT CASE
WHEN (subquery) IS NOT NULL THEN (subquery)
ELSE replacement_value
END

In this example, the subquery is performed twice. Even though both subqueries are identical, word for word, technically they are two separate queries; nothing in the CASE expression mandates that they return the same result. Depending on the requested isolation level, it is not even guaranteed that both executions of the same subquery produce identical results, as the data might have been changed by another transaction between the first and second execution. In a case like this, it is better to use ISNULL instead of COALESCE.
As an example, we’ll use the Customers and Orders tables. In chapter 2, we made a copy of these tables in our TestDB database (because a security policy on the original versions of these tables complicated the execution plan). Here, we’ll use the same copies. If you have already dropped these copies, here is the code to recreate them:

SELECT *
INTO Customers_copy
FROM WideWorldImporters.Sales.Customers

SELECT *
INTO Orders_copy
FROM WideWorldImporters.Sales.Orders

As a subquery, we’ll look for the date of the most recent order for each customer; if that is NULL, we’ll replace the date with January 1st, 1900. The two alternatives, using ISNULL and COALESCE, look like this:
SELECT CustomerName
,ISNULL(( SELECT MAX(OrderDate)
FROM Orders_copy o
WHERE o.CustomerID = c.CustomerID), '1900-1-1')
FROM Customers_copy c

SELECT CustomerName
,COALESCE(( SELECT MAX(OrderDate)
FROM Orders_copy o
WHERE o.CustomerID = c.CustomerID), '1900-1-1')
FROM Customers_copy c

We’ll spare you the result of both queries, as the results are identical. The interesting thing is the execution plan:

In case the execution plan is too small to read: the Orders_copy table is scanned twice in the COALESCE example, but only once in the ISNULL example. This is why we mentioned that the COALESCE expression is actually a CASE expression:

SELECT CustomerName
,CASE WHEN ( SELECT MAX(OrderDate)
FROM Orders_copy o
WHERE o.CustomerID = c.CustomerID) IS NOT NULL
THEN (
SELECT MAX(OrderDate)
FROM Orders_copy o
WHERE o.CustomerID = c.CustomerID)
ELSE '1900-1-1'
END
FROM Customers_copy c

So to summarize: the ISNULL function and the COALESCE expression are very similar. They are used to replace possible NULL values with a replacement value. The differences are:
* ISNULL takes exactly two input parameters, COALESCE two or more;
* ISNULL is a proprietary Transact-SQL function, COALESCE is part of the ANSI standard;
* ISNULL will return the data type of the first input parameter, COALESCE will return the data type with the highest precedence of all the input parameters;
* the output of the ISNULL function will be nullable only if both input parameters are nullable, whereas the output of COALESCE will be nullable unless all input parameters are non-nullable;
* performance might be different when a subquery is used as input.
Summary

In this final chapter, we’ve elaborated on some subjects that we’ve already encountered in previous
chapters.
We’ve explained stored procedures, functions and views: how to create them, use them and drop
them.
After that, we’ve discussed transactions and error handling. We’ve talked about the ACID properties of transactions, why you should set XACT_ABORT on, and demonstrated the use of TRY…CATCH blocks. We’ve also seen that, in the case of nested transactions, only the outermost COMMIT truly commits the transaction, and a rollback in an inner transaction will roll back all transactions, including the outer one.
And finally, we’ve talked about different data types, and nullability. We’ve discussed the various data
type categories and the data types, with their precision, scale, storage requirements and data type
precedence. We’ve also demonstrated how implicit conversions work, and what can go wrong when
implicitly converting values. Finally, we’ve elaborated on the effect of NULLs: on functions and
joins, and how to replace NULL values.
Questions

The last set of questions in this book. These questions will test your knowledge of all three chapters.

For some bonus questions, visit the web site:


http://www.rbvandenberg.com/books/mcsa-sql-2016-70-761/bonus-questions-for-mcsa-70-761/

QUESTION 1
You need to insert records for three new customers. You need to make sure that either all inserts
succeed, or none.

You use the following script:

INSERT dbo.Customers VALUES ('Bob', 'Jackson', 'Main street 1, Dallas');
INSERT dbo.Customers VALUES ('Frank', 'Smith', 'Second street 2, Miami');
INSERT dbo.Customers VALUES ('Joe', 'Johnson', 'Third Avenue 3, New York');

Will this script satisfy the requirements?

A True
B False

QUESTION 2
Consider the following code:

DECLARE @StringOne varchar(100)
       ,@StringTwo varchar(100) = 'Non-empty string'

SELECT @StringOne + @StringTwo

What is the result?

A NULL
B 'Non-empty string'
C It depends

QUESTION 3
You are deleting millions of rows from a large table with a single delete statement. Halfway through this delete, the server crashes. After the server reboots, you find that none of the records have been deleted. Which of the ACID properties is responsible for this?

A Atomicity
B Consistency
C Isolation
D Durability
QUESTION 4
Which connection settings are considered best practices when writing stored procs?

A SET IMPLICIT_TRANSACTIONS OFF
B SET NOCOUNT ON
C SET XACT_ABORT OFF
D SET ANSI_NULLS ON

QUESTION 5
Consider the following statement:

SELECT s.Name, s.PostalCode, s.City
FROM [dbo].[Suppliers] s
WHERE s.SupplierID BETWEEN 3 AND 33;

Will the records with SupplierID 3 and 33 be included in the result set (if present in the Suppliers
table)?

A Yes
B No

QUESTION 6
Consider the following table create statement:

1 CREATE OR ALTER TABLE [dbo].[Customers](
2 [CustomerID] [int] IDENTITY(100,1) NOT NULL,
3 [FirstName] [nvarchar](100) NOT NULL,
4 [LastName] [nvarchar](100) NOT NULL,
5 [PostalCode] [PostalCode] NULL,
6 [Location] [geography] NULL,
7 [BirthDay] [date] NULL,
8 PRIMARY KEY CLUSTERED
9 ([CustomerID] ASC)
10 )

Which line, if any, contains an error?

QUESTION 7
You want to reuse a complex query for further processing. The availability of the query must be
limited to the execution scope of a single, outer query. Which of the following SQL components
should you use?

A View
B Common table expression
C Table variable
D Local temporary table
E Global temporary table
F Stored procedure

QUESTION 8
You want to reuse a complex query for further processing. You need to store the results of this query
for reuse inside this session. For further processing, you need to be able to add one or more indexes.
Which of the following SQL components should you use?

A View
B Common table expression
C Table variable
D Local temporary table
E Global temporary table
F Stored procedure

QUESTION 9
You want to reuse a complex query for further processing. For the duration of the session, you need to
store the results of this query for reuse in this session as well as other sessions. For further
processing, you need to be able to update the stored results. Which of the following SQL components
should you use?

A View
B Common table expression
C Table variable
D Local temporary table
E Global temporary table
F Stored procedure

SCENARIO: server maintenance


The following questions all relate to the same scenario. You are a system administrator, responsible
for monitoring servers. You have a table with server information, and a table with maintenance
information. This is the definition of these tables:

DROP TABLE IF EXISTS tblServers
DROP TABLE IF EXISTS tblServerMaintenance

CREATE TABLE tblServers (
     ServerID INT IDENTITY PRIMARY KEY
    ,ServerName VARCHAR(100)
    ,ServerFQDN VARCHAR(100)
    ,Description VARCHAR(1000))

CREATE TABLE tblServerMaintenance (
     ServerMaintenanceID INT IDENTITY
    ,ServerID INT NULL
)

You want to make a lot of improvements to the code.

QUESTION 10
You’ve decided to start by creating a stored procedure to put servers in maintenance. This is the code
you intend to implement:

CREATE OR ALTER PROC usp_put_server_in_maintenance
    @servername VARCHAR(100)
AS
BEGIN
SET XACT_ABORT ON;
SET @servername = '%' + @servername + '%';
BEGIN TRY
BEGIN TRAN
INSERT tblServerMaintenance
SELECT ServerID
FROM tblServers
WHERE ServerName LIKE @servername
OR ServerFQDN LIKE @servername;
IF @@ROWCOUNT > 1 THROW 55500, 'too many servers match search pattern', 1
COMMIT TRAN
END TRY
BEGIN CATCH
ROLLBACK;
THROW;
END CATCH
END

A requirement for this stored procedure is that it may only put one server in maintenance mode at the
same time. If the wildcard matches more than a single server, no server may be put into maintenance
mode, and an error must be generated.

Does the proposed code meet this requirement?

A Yes
B No

QUESTION 11
The maintenance table currently has a lot of invalid entries, due to manual updates to this table.
Which query would delete only the records with an invalid ServerID?

--A
DELETE
FROM tblServerMaintenance
WHERE ServerID IN (
SELECT ServerID
FROM tblServerMaintenance
EXCEPT
SELECT ServerID
FROM tblServers
)

--B
DELETE
FROM tblServerMaintenance
WHERE ServerID IN (
SELECT ServerID
FROM tblServerMaintenance
INTERSECT
SELECT ServerID
FROM tblServers)

--C
DELETE
FROM tblServerMaintenance
WHERE ServerID IN (
SELECT sm.ServerID
FROM tblServerMaintenance sm
LEFT OUTER JOIN tblServers s on s.ServerID = sm.ServerID
WHERE s.ServerID IS NULL)

--D
DELETE sm
FROM tblServerMaintenance sm
LEFT OUTER JOIN tblServers s ON s.ServerID = sm.ServerID
WHERE s.ServerID IS NULL
QUESTION 12
You want to add a foreign key relation to the maintenance table, to prevent further errors. Which of
the following steps do you need to take, and in which order?

A Change the column definition to NOT NULL
B Remove the NULL and non-matching values
C Add the foreign key
D Drop and recreate the tblServerMaintenance table with the foreign key
E Add the foreign key WITH NOCHECK
F Save the matching values to a temporary table
G Import the matching values back into the newly created table
H Add a primary key to the tblServerMaintenance

QUESTION 13
You want to change the column ServerID to non-nullable. What is the correct syntax for this?
A ALTER TABLE tblServerMaintenance ALTER ServerID NOT NULL
B ALTER TABLE tblServerMaintenance ALTER ServerID int NOT NULL
C ALTER TABLE tblServerMaintenance ALTER COLUMN ServerID NOT NULL
D ALTER TABLE tblServerMaintenance ALTER COLUMN ServerID int NOT NULL

QUESTION 14
You want to add two columns, to indicate the start and end time of the maintenance interval. These columns must be accurate to the minute. What would be the best data type?

A SMALLDATETIME
B DATETIME
C DATETIME2
D DATETIMEOFFSET

QUESTION 15
Which statement can you use to add the required start column? Substitute <data_type> with the correct
answer from the previous question.

A ALTER TABLE tblServerMaintenance ADD Maintenance_start <data_type> NOT NULL
B ALTER TABLE tblServerMaintenance ADD Maintenance_start <data_type> NULL
C ALTER TABLE tblServerMaintenance ADD COLUMN Maintenance_start <data_type> NOT NULL
D ALTER TABLE tblServerMaintenance ADD COLUMN Maintenance_start <data_type> NULL

QUESTION 16
As the next step, you now need to change the maintenance stored procedure, to allow for the
maintenance start & end window. In addition, you want maintenance_start to have a default value. If
the maintenance_start is not specified, it should be replaced with the current date & time.

CREATE OR ALTER PROC usp_put_server_in_maintenance
    @servername VARCHAR(100)
    ,@maintenance_start smalldatetime = NULL
    ,@maintenance_end smalldatetime
AS
BEGIN
SET XACT_ABORT ON;
SET @servername = '%' + @servername + '%';
BEGIN TRY
BEGIN TRAN
INSERT tblServerMaintenance (ServerID, Maintenance_start, Maintenance_end)
SELECT ServerID, @maintenance_start, @maintenance_end
FROM tblServers
WHERE ServerName LIKE @servername
OR ServerFQDN LIKE @servername;
IF @@ROWCOUNT > 1 THROW 55500, 'too many servers match search pattern', 1
COMMIT TRAN
END TRY
BEGIN CATCH
ROLLBACK;
THROW;
END CATCH
END

In the select statement, you need to replace @maintenance_start with:

A ISNULL(@maintenance_start, GETDATE())
B IIF(@maintenance_start IS NULL, GETDATE(), @maintenance_start)
C COALESCE(@maintenance_start, CURRENT_TIMESTAMP)
D Any of the above

QUESTION 17
The parameter @maintenance_start now has a default value. What is the correct way to call the stored
proc with the default for @maintenance_start? Choose all that apply.

A EXEC dbo.usp_put_server_in_maintenance @servername = 'DC01', @maintenance_end = '2019-01-01 00:00'
B EXEC dbo.usp_put_server_in_maintenance @servername = 'DC01', @maintenance_start = DEFAULT, @maintenance_end = '2019-01-01 00:00'
C EXEC dbo.usp_put_server_in_maintenance 'DC01', '2019-01-01 00:00'
D EXEC dbo.usp_put_server_in_maintenance 'DC01', DEFAULT, '2019-01-01 00:00'

QUESTION 18
Which statement will return the name of the server and its description, plus the end of the maintenance
period if applicable?

--A
SELECT ServerName, s.Description, sm.Maintenance_end
FROM tblServers s
INNER JOIN tblServerMaintenance sm on s.ServerID = sm.ServerID
WHERE CURRENT_TIMESTAMP BETWEEN sm.maintenance_start AND sm.Maintenance_end

--B
SELECT ServerName, s.Description, sm.Maintenance_end
FROM tblServers s
CROSS JOIN tblServerMaintenance sm
WHERE CURRENT_TIMESTAMP BETWEEN sm.maintenance_start AND sm.Maintenance_end

--C
SELECT ServerName, s.Description, sm.Maintenance_end
FROM tblServers s
CROSS APPLY ( SELECT *
FROM tblServerMaintenance sm
WHERE s.ServerID = sm.ServerID
AND CURRENT_TIMESTAMP BETWEEN
sm.maintenance_start AND sm.Maintenance_end) sm

--D
SELECT ServerName, s.Description, sm.Maintenance_end
FROM tblServers s
OUTER APPLY ( SELECT *
FROM tblServerMaintenance sm
WHERE s.ServerID = sm.ServerID
AND CURRENT_TIMESTAMP BETWEEN
sm.maintenance_start AND sm.Maintenance_end) sm

QUESTION 19
For reporting purposes, you want to replace potential NULL values for the column maintenance_end
with the text ‘n/a’. You intend to do this by changing line 1 of the previous query into the following
code:

SELECT ServerName, s.Description, ISNULL(sm.Maintenance_end, 'n/a')

Does the proposed code meet this requirement?

A Yes
B No

QUESTION 20
You want to make a further improvement to the usp_put_server_in_maintenance stored procedure.
You want to have the error message written to the Windows application log. To achieve this, which
change do you need to make to the following code?

THROW 55500, 'too many servers match search pattern', 1

A Change the value for the last parameter to 20 or higher
B Add the keywords WITH LOG
C Move the THROW to the CATCH block
D This can’t be done
E None of the above

END OF SCENARIO: server maintenance

QUESTION 21
Consider the following tables:
CREATE TABLE [dbo].[Customers](
[CustomerID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](100) NOT NULL,
[LastName] [varchar](100) NOT NULL,
[Address] [varchar](100) NOT NULL,
[PostalCode] [PostalCode] NULL,
[City] [varchar](100) NULL)

CREATE TABLE [dbo].[Orders](
[OrderID] [int] IDENTITY(1,1) NOT NULL,
[CustomerID] [int] NOT NULL,
[OrderDate] [datetime] NOT NULL,
[SalesAmount] [money] NOT NULL)

Which lines, if any, in the following code contain an implicit conversion?

1 DECLARE @CustomerID int = '112'
2
3 SELECT FirstName
4 , LastName
5 , Address
6 , CONCAT(CAST(PostalCode AS VARCHAR(10)), ', ', City) AS AddressLine2
7 , o.OrderDate
8 , o.SalesAmount
9 FROM Customers c
10 INNER JOIN Orders o ON o.CustomerID = c.CustomerID
11 WHERE c.CustomerID = @CustomerID
12 AND o.OrderDate > DATEFROMPARTS(2016, 12, 31)

The last set of questions all share the same answer options; only the questions are different.

QUESTION 22
You need to calculate the last day of a given month.

In this scenario, should you use (choose all that apply):

A View
B Scalar function
C Table valued function
D Stored procedure
E Temporary table

QUESTION 23
You need to retrieve an ordered list of best customers, parameterized for a given city. Access to the
list must be logged to a dedicated table.
In this scenario, should you use (choose all that apply):

A View
B Scalar function
C Table valued function
D Stored procedure
E Temporary table

QUESTION 24
You need to retrieve an ordered list of best customers, parameterized for a given city.

In this scenario, should you use (choose all that apply):

A View
B Scalar function
C Table valued function
D Stored procedure
E Temporary table

QUESTION 25
You need to retrieve a list of best customers.

In this scenario, should you use (choose all that apply):

A View
B Scalar function
C Table valued function
D Stored procedure
E Temporary table

QUESTION 26
For new functionality, you need to make major structural changes to a table, while maintaining
backwards compatibility for older code that still requires the structure of the old table.
Therefore, you implement a new table for the new structure, and in order to maintain backwards
compatibility, you implement a ... (choose all that apply):

A View
B Scalar function
C Table valued function
D Stored procedure
E Temporary table
Answers

This section contains the correct answers to the questions, plus an explanation of the wrong answers.
In addition to the correct answers, we’ll also give a few pointers which are useful on the actual exam.

QUESTION 1
The correct answer is B: false. There is no transaction involved, so if any of the three statements fail
(for whatever reason), the others might still be executed.

QUESTION 2
The correct answer is C. With connection setting CONCAT_NULL_YIELDS_NULL ON, the result is
NULL; when OFF, the result is 'Non-empty string'.
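
For reference, a quick sketch of both settings (note that turning CONCAT_NULL_YIELDS_NULL off is deprecated):

DECLARE @StringOne varchar(100)
       ,@StringTwo varchar(100) = 'Non-empty string'

SET CONCAT_NULL_YIELDS_NULL ON
SELECT @StringOne + @StringTwo      -- NULL

SET CONCAT_NULL_YIELDS_NULL OFF
SELECT @StringOne + @StringTwo      -- 'Non-empty string'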

QUESTION 3
The correct answer is A, atomicity.

QUESTION 4
The correct answer is A and B.
The setting IMPLICIT TRANSACTIONS handles whether or not a transaction is implicitly started by
SQL; it is off by default, and should be turned off.
The setting NOCOUNT handles the message to the client of the number of rows affected by a query; it
is off by default, and should be turned on.
The setting XACT_ABORT handles what errors cause a transaction to be terminated; it is off by
default, but should be turned on.
The setting ANSI_NULLS handles concatenation of string values with null values; it is off by default,
and should only be turned on if you need it.

QUESTION 5
The correct answer is A: yes. The border values for the operator BETWEEN will be included in the result set. If you do not want the border values to be included, use a combination of greater than & less than (> & <).

QUESTION 6
The correct answer is line 1. The CREATE OR ALTER syntax was added in SQL Server 2016 Service Pack 1, but only for functions, views, stored procedures and triggers; it cannot be used for tables.

QUESTION 7
The correct answer is B, a common table expression. All other components will last longer than the
duration of a single query.

QUESTION 8
The correct answer is D, a local temporary table. This allows you to add indexes after creating it. A global temporary table would also be available to other sessions, which is not needed. A table variable does not allow you to add indexes after it has been declared. The other components do not store results.
QUESTION 9
The correct answer is D, a global temporary table. A local temporary table will only be available for
the session that created it. Other components will not meet the requirements.

QUESTION 10
The correct answer is A: yes. The THROW command will cause the transaction to abort.

QUESTION 11
The correct answer is D. Answers A and C will work, except for the NULL values; the IN operator
used in the where clause will not match the NULL records.
Answer B will remove all the records you need to keep.

QUESTION 12
The correct answer is B, C. You need to remove any values that conflict with the foreign key
relationship. However, NULL values are allowed, so step A is not required. Answer H is incorrect;
while adding a primary key may be a good idea, it is not required. All the other steps are incorrect, as
these are not required.

QUESTION 13
The correct answer is D. The keyword COLUMN is required, and so is specifying the data type (even
though this does not change).

QUESTION 14
The correct answer is A: smalldatetime. All other data types have a higher accuracy than required,
therefore requiring more storage than needed.

QUESTION 15
The correct answer is B. Adding the keyword COLUMN is incorrect syntax. Requiring the column to
be non-nullable (without adding a default) would conflict with existing records.

QUESTION 16
The correct answer is D. To replace a NULL value with a replacement value, either ISNULL or
COALESCE will do fine. The IIF function is not covered in this book, but since A and C are correct,
you can deduce that the correct answer is D. You cannot expect to be familiar with all keywords you
may encounter on the exam.
Look up the details for IIF on Microsoft Docs. In short: it takes three parameters: a logical test that is
either true or false, a value to return if the logical test is true and a value to return if the logical test is
false.
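
A minimal sketch of IIF, just to illustrate the three parameters:

SELECT IIF(2 > 1, 'larger', 'smaller or equal')   -- returns 'larger'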

QUESTION 17
The correct answer is A, B, D. C is incorrect, as only two values are supplied positionally; even though the second parameter is now optional, you can only skip it by passing DEFAULT in its place, or by calling the other parameters by name.
QUESTION 18
The correct answer is D. Answer A is incorrect: the inner join will leave out the servers that are not in maintenance. Answer B is incorrect: the cross join will return each server record with each maintenance record. Answer C is incorrect: the cross apply will return the same information as answer A.

QUESTION 19
The correct answer is B: no. Though the syntax for the replacement is correct, ISNULL returns the data type of its first parameter (here smalldatetime), and SQL can’t convert the string ‘n/a’ to a smalldatetime; therefore, the query will fail.

QUESTION 20
The correct answer is E: none of the above. To achieve this, you need to use RAISERROR instead of
THROW.
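
A hedged sketch of what that change could look like (the message text and severity are just examples; severity 16 indicates a user-correctable error, and writing to the log requires the ALTER TRACE permission or sysadmin membership):

RAISERROR('too many servers match search pattern', 16, 1) WITH LOG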

QUESTION 21
The correct answer is 1 and 12. The string is converted to an integer, and the function
DATEFROMPARTS returns, like the name implies, a date (which is converted to datetime). The
CAST in line 6 is an explicit conversion. Note that, besides DATEFROMPARTS, SQL also has the
functions DATETIMEFROMPARTS and TIMEFROMPARTS.

QUESTION 22
The correct answer is B, a scalar function. This function already exists: it is called EOMONTH.

QUESTION 23
The correct answer is D, a stored procedure (due to the logging requirement).

QUESTION 24
The correct answer is C and D, a table valued function or a stored procedure.

QUESTION 25
The correct answer is A, C and D. Either a view, a table valued function or a stored procedure.

QUESTION 26
The correct answer is A: a view. This can be used anywhere the old table can be used.
