
Question 2 : Player session service

We need to build an ingestion pipeline service that takes the event data produced by the game
application, writes it to a database, and runs analytics on it to derive metrics.

Requirements:
Types of ingestion:

Event batches: time (ts) is used as the ingestion condition.

Session batches:
1. Sessions for the last x hours for each country
2. The last 20 complete sessions for a given player
Data older than a year is omitted.
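For reference, the sketch below shows a minimal record shape that would satisfy these requirements. The field names (player_id, session_id, country, event, ts) are assumptions inferred from the requirements and the queries later in this answer, not a confirmed schema.

// Minimal sketch of the assumed event/session record (illustrative only;
// field names are inferred from the requirements, not a confirmed schema).
case class GameEvent(
  player_id: String,     // player that produced the event
  session_id: String,    // session the event belongs to
  country: String,       // player country, needed for the per-country batch
  event: String,         // e.g. "start", "end", or a gameplay event name
  ts: java.sql.Timestamp // event time, used as the ingestion condition
)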

An on-prem solution is proposed with the architecture below.

Game server/application → NoSQL database → Hadoop (Spark / HDFS) → Hive/Impala/Presto

The gaming application gives us the event/session data, and the data is written into a NoSQL
database. We can consider MongoDB, Redis, or Cassandra for effective handling of the event data
(here we assume the data is written into MongoDB). The data stored in MongoDB is what we consume
for all our analytics purposes. The data is duplicated here, which adds some storage cost, but it
also protects against data loss.

Once the data is in MongoDB, we connect to it using the Spark-Mongo connector (Spark is deployed
in cluster mode on-prem, and the ingested data is stored in HDFS in Parquet format, which is
efficient for reads). After the data ingestion is done, we point external tables at these raw
files and create tables on top of them, which can then be used by query engines
(Hive/Impala/Presto) and by data visualization tools (Tableau/Pentaho).
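As an illustration of the external-table step, a table over the ingested Parquet files could be created as below, using the Spark session with Hive support from the ingestion job further down. The database name, table name, columns, and HDFS path are hypothetical.

// Illustrative only: expose the ingested Parquet files to Hive/Impala/Presto.
// Database, table, columns, and location are assumptions, not the actual DDL.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS game_analytics.player_sessions (
    player_id  STRING,
    session_id STRING,
    country    STRING,
    event      STRING,
    ts         TIMESTAMP
  )
  STORED AS PARQUET
  LOCATION '/hdfslocation/sessions'
""")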

DATA INGESTION
Approach used for data ingestion: the ingestion is written in Scala in batch mode, using the
Spark DataFrame API, and is built into a jar with Maven. The jar is wrapped in an input-driven
bash script that can be called from cron jobs or deployed in Airflow using the BashOperator.
We could also use Spark Structured Streaming, which is explained with the cloud solution.

We connect to MongoDB using the spark-mongo connector, and only one year of data is loaded into
the DataFrame by filtering on the ingestion timestamp (records older than one year from the
current date are dropped). The use cases are split into three modules, and every load is written
in overwrite mode, so the old data is refreshed with the new data while the external tables keep
pointing at the same HDFS location.
Several optimization techniques are used in the Scala code, such as persisting DataFrames with
appropriate storage levels, and the output is repartitioned into a single file and written in
Parquet format.

Bash script: driver.sh

#!/bin/bash
# Arguments: <feed start date> <collection> <user hours | NULL> <player id> <ingestion type>
FEED_ST_DAY=$1
collection=$2
user_hour=$3
player_id=$4
TYPE=$5
# $db (the Mongo database name) is assumed to be set in the environment

connect_string="mongodb://username:password@server.com:27017/$db.$collection"

spark2-submit \
  --jars /home/jar/mongo-spark-connector_2.10-2.2.0.jar,/home/jar/mongodb-driver-3.4.3.jar \
  --driver-class-path /home/jar/mongo-spark-connector_2.10-2.2.0.jar,/home/jar/mongodb-driver-3.4.3.jar \
  --conf "spark.mongodb.input.uri=$connect_string" \
  --packages org.mongodb.spark:mongo-spark-connector_2.10:2.2.0 \
  --master yarn --deploy-mode cluster \
  --executor-memory 24G --num-executors 16 --executor-cores 2 \
  --class com.esurance.ingestion.spark2_mongo \
  /home/target/scala2.11/spark_ingestion_pipe_2.11-0.1.jar \
  "$FEED_ST_DAY" "$collection" "$user_hour" "$player_id" "$TYPE"

We can tweak the above parameters based on the timing and resources needed.

Example :
We can run the scripts using the below sample inputs
sh driver.sh 2019-07-01 events NULL 1234 SESSION
sh driver.sh 2019-07-01 events 4 1234 SESSION
sh driver.sh 2019-07-01 events 4 1234 EVENT

Scala script: spark2_mongo.scala

package com.esurance.ingestion

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel
object spark2_mongo {

  // Ingestion timestamp column ("ts" per the requirement)
  val incr_column = "ts"

  val spark = SparkSession.builder()
    .appName("Spark2_Mongo")
    .enableHiveSupport()
    .getOrCreate()
  import spark.implicits._

  spark.conf.set("hive.exec.dynamic.partition", "true")
  spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

  def main(args: Array[String]): Unit = {
    val feeddate    = args(0)
    val collection  = args(1)
    val user_hour   = args(2)
    val player_id   = args(3)
    val ingest_type = args(4)

    // The Mongo collection is picked up from spark.mongodb.input.uri set by driver.sh
    val df_1 = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

    // Filter condition so that only one year's worth of data is loaded into the data frame
    val data1 = df_1.filter(to_date(col(incr_column)) >= add_months(current_date(), -12))
    data1.persist(StorageLevel.MEMORY_AND_DISK_SER)

    val fileLoc = "/hdfslocation/" + collection
    ingest(feeddate, user_hour, player_id, ingest_type, data1, fileLoc)
  }

  def ingest(feeddate: String, user_hour: String, player_id: String, ingest_type: String,
             data1: DataFrame, fileLoc: String): Unit = {
    if (ingest_type == "EVENT") {
      event_session_data(feeddate, user_hour, player_id, ingest_type, data1, fileLoc)
    }
    if (ingest_type == "SESSION") {
      if (user_hour == "NULL") {
        player_data(feeddate, user_hour, player_id, ingest_type, data1, fileLoc)
      } else {
        hour_session_data(feeddate, user_hour, player_id, ingest_type, data1, fileLoc)
      }
    }
  }

  // Event batches: all records on or after the given feed date
  def event_session_data(feeddate: String, user_hour: String, player_id: String,
                         ingest_type: String, data1: DataFrame, fileLoc: String): Unit = {
    val filter_data = data1.filter(to_date(col(incr_column)) >= to_date(lit(feeddate)))
    filter_data.persist(StorageLevel.MEMORY_AND_DISK_SER)
    filter_data.repartition(1).write.mode("overwrite").parquet(fileLoc)
    filter_data.unpersist()
  }

  // Session batches: records from the last <user_hour> hours
  def hour_session_data(feeddate: String, user_hour: String, player_id: String,
                        ingest_type: String, data1: DataFrame, fileLoc: String): Unit = {
    // Filter condition keeps the user-specified number of hours behind the current timestamp
    val hour_behind = expr(s"current_timestamp() - INTERVAL $user_hour HOUR")
    val filter_data = data1.filter(col(incr_column) >= hour_behind)
    filter_data.persist(StorageLevel.MEMORY_AND_DISK_SER)
    filter_data.repartition(1).write.mode("overwrite").parquet(fileLoc)
    filter_data.unpersist()
  }

  // Session batches: last 20 complete (start/end paired) sessions for the given player
  def player_data(feeddate: String, user_hour: String, player_id: String,
                  ingest_type: String, data1: DataFrame, fileLoc: String): Unit = {
    val filter_data = data1.filter(col("player_id") === player_id)
    filter_data.persist(StorageLevel.MEMORY_AND_DISK_SER)
    filter_data.createOrReplaceTempView("players")
    val player_session = spark.sql(
      """with fun_next as (
        |  select *, lead(event) over (partition by session_id order by ts desc) as next_event
        |  from players
        |), session_country as (
        |  select player_id, session_id, ts
        |  from fun_next
        |  where event = 'start' and next_event = 'end'
        |)
        |select * from session_country order by ts desc limit 20""".stripMargin)
    player_session.repartition(1).write.mode("overwrite").parquet(fileLoc)
    filter_data.unpersist()
  }
}
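With the batches landed in HDFS and exposed through external tables, the per-country requirement can be answered directly in the query engine. The sketch below is illustrative only and reuses the hypothetical game_analytics.player_sessions table and country column assumed earlier.

// Illustrative only: distinct sessions per country over the ingested
// "last x hours" session batch (assumes the hypothetical table above).
val sessionsPerCountry = spark.sql("""
  SELECT country, COUNT(DISTINCT session_id) AS sessions
  FROM game_analytics.player_sessions
  GROUP BY country
""")
sessionsPerCountry.show()

The same table can also back Tableau/Pentaho dashboards through the Hive/Impala/Presto endpoints.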
