
Step 4: Create Data Features using T-SQL (In-Database Advanced Analytics Tutorial)


SQL Server 2016 and later

Updated: April 19, 2016


Applies To: SQL Server 2016
After several rounds of data exploration, you have collected some insights from the data and are ready to move on to feature
engineering. This process of creating features from the raw data is a critical step in advanced analytics modeling.

In this step, you'll learn how to create features from raw data by using a Transact-SQL function. You'll then call that function from
a stored procedure to create a table that contains the feature values.

Define the Function


The distance values reported in the original data are based on the reported meter distance, and don't necessarily represent
geographical distance or distance traveled. Therefore, you'll need to calculate the direct distance between the pickup and
drop-off points by using the coordinates available in the source NYC Taxi dataset. You can do this by using the Haversine
formula in a custom Transact-SQL function.

You'll use one custom T-SQL function, fnCalculateDistance, to compute the distance using the Haversine formula, and a
second custom T-SQL function, fnEngineerFeatures, to create a table containing all the features.

To calculate trip distance using fnCalculateDistance


1. The function fnCalculateDistance should have been downloaded and registered with SQL Server as part of the preparation
for this walkthrough. Take a minute to review the code:
   In Management Studio, expand Programmability, expand Functions, and then expand Scalar-valued functions.
   Right-click fnCalculateDistance, and select Modify to open the Transact-SQL script in a new query window.

    CREATE FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)
    -- User-defined function that calculates the direct distance between two geographical coordinates.
    RETURNS float
    AS
    BEGIN
      DECLARE @distance decimal(28, 10)
      -- Convert to radians
      SET @Lat1 = @Lat1 / 57.2958
      SET @Long1 = @Long1 / 57.2958
      SET @Lat2 = @Lat2 / 57.2958
      SET @Long2 = @Long2 / 57.2958
      -- Calculate distance
      SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1))
      -- Convert to miles
      IF @distance <> 0
      BEGIN
        SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) / @distance);
      END
      RETURN @distance
    END
    GO

   The function is a scalar-valued function, returning a single data value of a predefined type.
   It takes latitude and longitude values as inputs, obtained from the trip pickup and drop-off locations. The function
   converts the coordinates from degrees to radians and uses those values to compute the direct distance, in miles,
   between the two locations.
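For reference, the calculation in fnCalculateDistance can be written out as follows. This is a reading of the code above, not an official derivation; the symbols \varphi (latitude in radians), \lambda (longitude in radians), c, and d are introduced here for illustration only.

```latex
\begin{aligned}
\varphi_i &= \mathrm{Lat}_i / 57.2958, \qquad
\lambda_i = \mathrm{Long}_i / 57.2958 \qquad (i = 1, 2) \\
c &= \sin\varphi_1 \sin\varphi_2 + \cos\varphi_1 \cos\varphi_2 \cos(\lambda_2 - \lambda_1) \\
d &= 3958.75 \cdot \arctan\!\left(\frac{\sqrt{1 - c^2}}{c}\right)
   = 3958.75 \cdot \arccos(c) \qquad (0 < c \le 1)
\end{aligned}
```

Here 57.2958 approximates 180/\pi (degrees per radian) and 3958.75 is the Earth's radius in miles. The quantity c is the cosine of the central angle between the two points, so, as written, the expression is the spherical law of cosines form of the great-circle distance; in exact arithmetic it gives the same distance as the haversine form. When the two points coincide, c = 1 and d = 0. The arctangent and arccosine agree only for 0 < c \le 1, that is, for points less than a quarter of the Earth's circumference apart, which comfortably covers trips within New York City.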
To add the computed value to a table that can be used for training the model, you'll use another function, fnEngineerFeatures.

To save the features using fnEngineerFeatures


1. Take a minute to review the code for the custom T-SQL function, fnEngineerFeatures, which should have been created for
you as part of the preparation for this walkthrough.
   This function is a table-valued function that takes multiple columns as inputs, and outputs a table with multiple feature
   columns. The purpose of this function is to create a feature set for use in building a model. The function
   fnEngineerFeatures calls the previously created T-SQL function, fnCalculateDistance, to get the direct distance between
   pickup and drop-off locations.

    CREATE FUNCTION [dbo].[fnEngineerFeatures] (
    @passenger_count int = 0,
    @trip_distance float = 0,
    @trip_time_in_secs int = 0,
    @pickup_latitude float = 0,
    @pickup_longitude float = 0,
    @dropoff_latitude float = 0,
    @dropoff_longitude float = 0)
    RETURNS TABLE
    AS
      RETURN
      (
      -- Add the SELECT statement with parameter references here
      SELECT
        @passenger_count AS passenger_count,
        @trip_distance AS trip_distance,
        @trip_time_in_secs AS trip_time_in_secs,
        [dbo].[fnCalculateDistance](@pickup_latitude, @pickup_longitude, @dropoff_latitude, @dropoff_longitude) AS direct_distance
      )
    GO

2. To verify that this function works, you can use it to calculate the geographical distance for those trips where the metered
distance was 0 but the pickup and drop-off locations were different.

    SELECT tipped, fare_amount, passenger_count, (trip_time_in_secs/60) as TripMinutes,
      trip_distance, pickup_datetime, dropoff_datetime,
      dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS direct_distance
    FROM nyctaxi_sample
    WHERE pickup_longitude != dropoff_longitude and pickup_latitude != dropoff_latitude and trip_distance = 0
    ORDER BY trip_time_in_secs DESC

As you can see, the distance reported by the meter doesn't always correspond to geographical distance. This is why
feature engineering is so important.
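To see the feature set that fnEngineerFeatures produces for actual trips, one option is to apply the table-valued function to each row with CROSS APPLY. The query below is a minimal sketch, assuming the nyctaxi_sample table and both functions exist as described in this walkthrough; the aliases t and f and the TOP 10 limit are illustrative choices, not part of the tutorial.

```sql
-- Sketch only: preview engineered features for a few trips.
-- Assumes nyctaxi_sample, fnEngineerFeatures, and fnCalculateDistance
-- were created during the preparation for this walkthrough.
SELECT TOP 10
    t.tipped,               -- label column, taken from the source table
    f.passenger_count,
    f.trip_distance,        -- metered distance from the raw data
    f.trip_time_in_secs,
    f.direct_distance       -- computed by fnCalculateDistance inside the function
FROM nyctaxi_sample AS t
CROSS APPLY dbo.fnEngineerFeatures(
    t.passenger_count,
    t.trip_distance,
    t.trip_time_in_secs,
    t.pickup_latitude,
    t.pickup_longitude,
    t.dropoff_latitude,
    t.dropoff_longitude) AS f;
```

CROSS APPLY evaluates the function once per row of nyctaxi_sample and joins the resulting single-row table back to that trip, so each trip's raw columns can sit alongside its computed direct_distance.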
In the next step, you'll learn how to use these data features to train a machine learning model using R.

Next Step
Step 5: Train and Save a Model using T-SQL

Previous Step
Step 3: Explore and Visualize the Data

See Also
In-Database Advanced Analytics for SQL Developers Tutorial
SQL Server R Services Tutorials
© 2016 Microsoft
