SCD Type 6 Implementation
By Abdelrahman Omar
bbi.ai
Contents
Introduction
Authenticate Kerberos
Developing SCD Type 6 Script
Type 6 Logic Diagram
Developing SCD Type 2 Script
Type 2 Logic Diagram
Introduction
This document describes how I implemented Slowly Changing Dimension (SCD) Type 6 in Spark. This was an exciting project that allowed me to apply my knowledge of big data and Spark to a real-world problem.
SCD Type 6 is a complex data warehousing technique that tracks changes to dimension data over time by combining the Type 1, Type 2, and Type 3 approaches: each record carries both its historical and its current attribute values, along with effective dates and a current-row flag. The system keeps track of all the changes made to a record, even after it has been updated or deleted. This is useful in scenarios where historical data is important, such as financial reporting or compliance.
To implement SCD Type 6 in Spark, I first had to identify the fields in the data that needed to be tracked for changes. I then designed a schema to store the historical data in a Hive table, created a temporary table to hold the current data, and used PySpark DataFrames to compare the two tables and identify any changes.
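The core comparison logic can be illustrated without a Spark cluster. The sketch below uses plain Python dictionaries to show the Type 6 behavior; the column names (customer_id, city, current_city, start_date, end_date, is_current) are hypothetical placeholders, not the project's actual schema:

```python
from datetime import date

def apply_scd_type6(dim_rows, incoming, today, key="customer_id", attr="city"):
    """Apply SCD Type 6 logic: expire the changed row (Type 2),
    insert a new current version, and overwrite the 'current_*'
    column on every historical version of that key (Type 1)."""
    out = list(dim_rows)
    for new in incoming:
        current = next((r for r in out
                        if r[key] == new[key] and r["is_current"]), None)
        if current is None:
            # Brand-new key: insert its first version.
            out.append({key: new[key], attr: new[attr],
                        "current_" + attr: new[attr],
                        "start_date": today, "end_date": None,
                        "is_current": True})
        elif current[attr] != new[attr]:
            # Changed attribute: close the old version...
            current["is_current"] = False
            current["end_date"] = today
            # ...insert a new current version...
            out.append({key: new[key], attr: new[attr],
                        "current_" + attr: new[attr],
                        "start_date": today, "end_date": None,
                        "is_current": True})
            # ...and refresh the Type 1 'current' column on every
            # row for this key, including the historical ones.
            for r in out:
                if r[key] == new[key]:
                    r["current_" + attr] = new[attr]
    return out
```

In the actual script the same comparison is expressed with PySpark DataFrame joins between the Hive dimension table and the staging table, but the row-level outcome is the one shown here: expired rows keep their historical attribute value while their "current" column reflects the latest value.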
Connect to VM
1. Open MobaXterm on your local machine.
2. Click on the "Session" button in the top left corner of the window.
3. In the "Session Settings" window, select "SSH" as the protocol.
4. Enter the IP address of the VM in the "Remote host" field.
If the connection is successful, you will see a terminal window with a command prompt for the VM.
Authenticate Kerberos
In the context of Spark and Hive, Kerberos authentication can be used to secure
access to sensitive data and ensure that only authorized users are able to perform
certain actions. For example, Kerberos can be used to authenticate users who
want to access a Hive table or run a Spark job, and to ensure that only authorized
users can read or modify that data.
To use Kerberos authentication with Spark and Hive, several configuration steps
are required. These steps typically involve configuring Kerberos on the cluster,
configuring Kerberos authentication for the Spark and Hive services, and
configuring user accounts to use Kerberos authentication. Additionally, users may
need to obtain Kerberos tickets and set up the appropriate environment variables
in order to authenticate with the cluster.
1. Identify the Spark service principal name (SPN) that you want to use for Kerberos authentication. The SPN typically takes the form "spark/[hostname]@REALM", for example:
spark/bbi-gateway.test.local@TEST.LOCAL
2. Create the principal and generate a Kerberos keytab file for it using the kadmin command. For example, "kadmin -q 'addprinc -randkey spark/[hostname]@REALM'" creates the "spark/[hostname]@REALM" principal with a random key, and "kadmin -q 'ktadd -k spark.keytab spark/[hostname]@REALM'" then exports its key into a keytab file.
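Taken together, the Kerberos steps above can be sketched as the following commands. The principal name reuses the document's example; the keytab path is a hypothetical placeholder, and exact kadmin invocation details can vary between KDC setups:

```shell
# Create the Spark service principal with a random key (requires admin rights).
kadmin -q 'addprinc -randkey spark/bbi-gateway.test.local@TEST.LOCAL'

# Export the principal's key into a keytab file (path is illustrative).
kadmin -q 'ktadd -k /etc/security/keytabs/spark.keytab spark/bbi-gateway.test.local@TEST.LOCAL'

# Obtain a Kerberos ticket from the keytab before running Spark jobs.
kinit -kt /etc/security/keytabs/spark.keytab spark/bbi-gateway.test.local@TEST.LOCAL

# Verify that a valid ticket was granted.
klist
```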
Developing SCD Type 6 Script
1. Open MobaXterm to write the Python script.
2. Before writing the ETL logic, we must first import all the libraries we will use in our program.
3. Create a Spark session with the JDBC connector and Hive support to connect to MySQL and the Hive table.
4. Submit the script and start the ETL job using the "spark-submit" command. When submitting the script, we must specify the path to the MySQL connector JAR using the --jars argument.
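The submission might look like the following; the script name and connector path are placeholders, not the project's actual files:

```shell
# --jars makes the MySQL JDBC driver available to the job.
# Both paths below are illustrative placeholders.
spark-submit \
  --jars /path/to/mysql-connector-java.jar \
  scd_type6.py
```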
Type 6 Logic Diagram
Developing SCD Type 2 Script
Developing the Type 2 script follows the same steps as developing the Type 6 script, and the diagram below shows the logic for the script.
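As with Type 6, the Type 2 logic can be illustrated in plain Python. The difference is that Type 2 has no "current value" column: expired rows simply keep their historical values. The column names are hypothetical placeholders:

```python
from datetime import date

def apply_scd_type2(dim_rows, incoming, today, key="customer_id", attr="city"):
    """Apply SCD Type 2 logic: when a tracked attribute changes,
    expire the current row and insert a new current version.
    Unlike Type 6, historical rows keep their old values as-is."""
    out = list(dim_rows)
    for new in incoming:
        current = next((r for r in out
                        if r[key] == new[key] and r["is_current"]), None)
        if current is not None and current[attr] == new[attr]:
            continue  # no change for this key, nothing to do
        if current is not None:
            # Close out the old version.
            current["is_current"] = False
            current["end_date"] = today
        # Insert the new current version.
        out.append({key: new[key], attr: new[attr],
                    "start_date": today, "end_date": None,
                    "is_current": True})
    return out
```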
Type 2 Diagram