
Spark SCD Type 2 and Type 6 Implementation

By Abdelrahman Omar


Contents
Introduction
Authenticate Kerberos
Developing SCD Type 6 Script
Type 6 Logic Diagram
Developing SCD Type 2 Script
Type 2 Logic Diagram

Introduction

This project implements Slowly Changing Dimension (SCD) Type 6 in Spark. It was an exciting project that allowed me to apply my knowledge of big data and Spark to a real-world problem.
SCD Type 6 is a hybrid data warehousing technique, combining the Type 1, Type 2, and Type 3 approaches, that tracks changes to data over time. In this type of dimension, the system keeps a history of all the changes made to a record, even after it has been updated or deleted. This is useful in scenarios where historical data is important, such as financial reporting or compliance.
To implement SCD Type 6 in Spark, I first had to identify the fields in the data that needed to be tracked for changes. I then designed a schema to store the historical data in a Hive table, created a temporary table to hold the current data, and used PySpark DataFrames to compare the two tables and identify any changes.

Connect to the VM
1. Open MobaXterm on your local machine.
2. Click on the "Session" button in the top left corner of the window.

3. In the "Session Settings" window, select "SSH" as the protocol.

4. Enter the IP address of the VM in the "Remote host" field.

5. Double-click on the session in the "Sessions" tab to connect to the VM.

If the connection is successful, you will see a terminal window with a command prompt for
the VM.
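
The same connection can also be made from any plain SSH client; a minimal equivalent command, with placeholders standing in for the actual username and VM IP address:

ssh <username>@<vm-ip-address>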

Authenticate Kerberos

Kerberos is a widely used authentication protocol that can be used to secure distributed computing environments, such as Apache Hadoop and its related projects, including Spark and Hive. Kerberos provides a way for users to securely authenticate with a distributed system, and to access resources on that system based on their authenticated identity.

In the context of Spark and Hive, Kerberos authentication can be used to secure
access to sensitive data and ensure that only authorized users are able to perform
certain actions. For example, Kerberos can be used to authenticate users who
want to access a Hive table or run a Spark job, and to ensure that only authorized
users can read or modify that data.

To use Kerberos authentication with Spark and Hive, several configuration steps
are required. These steps typically involve configuring Kerberos on the cluster,
configuring Kerberos authentication for the Spark and Hive services, and
configuring user accounts to use Kerberos authentication. Additionally, users may
need to obtain Kerberos tickets and set up the appropriate environment variables
in order to authenticate with the cluster.

Overall, Kerberos authentication provides an important layer of security for distributed computing environments like Spark and Hive, helping to ensure that sensitive data is only accessible by authorized users and reducing the risk of data breaches and unauthorized access.

1. Identify the Spark service principal name (SPN) that you want to use for Kerberos authentication. The SPN typically takes the form of "spark/[hostname]@REALM".

spark/bbi-gateway.test.local@TEST.LOCAL

2. Create the principal and generate a Kerberos keytab file for the SPN using the kadmin command. The command "kadmin -q 'addprinc -randkey spark/[hostname]@REALM'" creates the "spark/[hostname]@REALM" principal with a random key; the keytab itself is exported in a separate step, shown after the example below.

kadmin -q 'addprinc -randkey spark/bbi-gateway.test.local@TEST.LOCAL'
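
Exporting the keytab is done with the "ktadd" subcommand; a sketch, assuming /etc/security/keytabs/spark.keytab as the output path (the path is an assumption, not necessarily the one used on this cluster):

kadmin -q 'ktadd -k /etc/security/keytabs/spark.keytab spark/bbi-gateway.test.local@TEST.LOCAL'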

3. Save the keytab file in a secure location on the Spark server.

4. Configure the Spark service to use Kerberos authentication by setting the "spark.authenticate" configuration property to "true" in the Spark configuration file, as sketched below.

5. Obtain a Kerberos ticket for the Spark principal by running kinit with the keytab:

kinit -kt /var/run/cloudera-scm-agent/process/738-spark_on_yarn-SPARK_YARN_HISTORY_SERVER/spark_on_yarn.keytab spark/bbi-gateway.test.local@TEST.LOCAL
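
Once kinit succeeds, you can confirm that a ticket was obtained by listing the credential cache:

klist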

Developing SCD Type 6 Script
1. Open MobaXterm to write the Python script.

2. Before starting to write the ETL logic, we must import all the libraries the program will use; a sketch of a typical import block follows.
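
The exact imports depend on the logic in the script, which appears only as a screenshot in the original document; a minimal sketch of a typical set:

# Spark session plus the column helpers used to compare current and historical rows
from pyspark.sql import SparkSession
from pyspark.sql import functions as F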

3. Create a Spark session with the JDBC connector and Hive support to connect to MySQL and the Hive table, as sketched below.
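
A sketch of this step, assuming hypothetical connection details (the host, database, table names, and credentials below are placeholders, not the values from the original script):

# Build a Spark session with Hive support so tables resolve against the metastore
spark = (SparkSession.builder
         .appName("scd_etl")
         .enableHiveSupport()
         .getOrCreate())

# Read the current source data from MySQL over JDBC
source_df = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://localhost:3306/source_db")
             .option("driver", "com.mysql.jdbc.Driver")
             .option("dbtable", "customers")
             .option("user", "etl_user")
             .option("password", "etl_password")
             .load())

# Read the existing dimension table from Hive
dim_df = spark.table("dwh.dim_customers")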

4. Implement the Type 6 logic; a minimal sketch of the pattern follows.
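
Since the script itself appears only as a screenshot, here is a minimal sketch of the Type 6 pattern in PySpark. It assumes a key column id, one tracked attribute city, and dimension columns city_current, start_date, end_date, and is_current; all of these names, and the source_df/dim_df DataFrames from the previous step, are assumptions rather than the original script.

# Source rows whose tracked attribute differs from the current dimension row
changed = (source_df.alias("s")
           .join(dim_df.filter("is_current = 1").alias("d"), "id")
           .filter(F.col("s.city") != F.col("d.city"))
           .select("s.*"))

# Type 1 part of Type 6: overwrite the current-value column on every
# historical row of a changed key, and close the open version (Type 2 part)
new_vals = changed.select("id", F.col("city").alias("new_city"))
closed = (dim_df.join(new_vals, "id")
          .withColumn("city_current", F.col("new_city"))
          .withColumn("end_date",
                      F.when(F.col("is_current") == 1, F.current_date())
                       .otherwise(F.col("end_date")))
          .withColumn("is_current", F.lit(0))
          .drop("new_city"))

# Open a new current version carrying the latest value in both columns
opened = (changed.select("id", "city")
          .withColumn("city_current", F.col("city"))
          .withColumn("start_date", F.current_date())
          .withColumn("end_date", F.lit(None).cast("date"))
          .withColumn("is_current", F.lit(1)))

# Rows for unchanged keys pass through untouched; in practice the result is
# written to a staging table first, since Spark cannot overwrite a table
# while reading from it
untouched = dim_df.join(changed.select("id"), "id", "left_anti")
result = untouched.unionByName(closed).unionByName(opened)

Brand-new keys that do not yet exist in the dimension would be appended in the same shape as the opened rows.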

5. Submit the script and start the ETL job using the "spark-submit" command. When submitting the script, we must specify the path of the MySQL connector JAR using the --jars argument, as in the example below.
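
A sketch of the command, assuming the script is saved as scd_type6.py and the connector JAR sits in /usr/share/java (both paths are assumptions):

spark-submit --jars /usr/share/java/mysql-connector-java.jar scd_type6.py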

Type 6 Logic Diagram

Developing SCD Type 2 Script

Developing the Type 2 script follows the same steps as the Type 6 script; the logic is sketched below.
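
For reference, a minimal sketch of how the Type 2 logic differs from the Type 6 sketch above: the dimension has no current-value overwrite column, so only the versioned rows are maintained (same assumed column names as before).

# Close the open version for each changed key; older versions keep their dates
closed = (dim_df.join(changed.select("id"), "id")
          .withColumn("end_date",
                      F.when(F.col("is_current") == 1, F.current_date())
                       .otherwise(F.col("end_date")))
          .withColumn("is_current", F.lit(0)))

# Open a new current version; no Type 1 overwrite is applied
opened = (changed.select("id", "city")
          .withColumn("start_date", F.current_date())
          .withColumn("end_date", F.lit(None).cast("date"))
          .withColumn("is_current", F.lit(1)))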

Type 2 Logic Diagram

