Requirement: Hive/Impala/Presto, Hadoop (Spark/HDFS), NoSQL database, game server/application
We need to build an ingestion pipeline service for a game application that produces event
data, write that data to a database, and run analytics and derive metrics from it.
Requirement:
Types of ingestion:
The gaming application emits events/sessions data, which is written into a NoSQL
database. MongoDB, Redis, or Cassandra would all handle the event data effectively; here
we assume the data is written into MongoDB. Data lands in MongoDB, and we consume it
from there for all our analytics purposes. The data is duplicated here, which adds some
storage cost, but it protects against data loss.
Once the data is in MongoDB, we connect to it using the Spark-Mongo connector (Spark is
deployed in cluster mode on-prem, and ingested data is stored in HDFS in Parquet format,
which is efficient for reads). After ingestion is done, we point our external tables at
these raw files and create tables on top of them, which can then be used by query
engines (Hive/Impala/Presto) and by data visualization tools (Tableau/Pentaho).
DATA INGESTION
Approaches used in data ingestion: the ingestion is written in Scala (batch mode) using
the Spark DataFrame API and built into a jar with Maven. The jar is wrapped in a bash
script that is input-driven and can be called from cron jobs or deployed in Airflow
using the BashOperator. We can also use Spark Structured Streaming, which is explained
with the cloud solution.
We connect to MongoDB using the spark-mongo connector, and only one year of data is
loaded into the DataFrame using a year(current_date)-1 clause. The use cases are split
into three modules, and every load runs in overwrite mode, so old data is refreshed with
the new data, while the external tables continue to point at HDFS.
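The year(current_date)-1 window can be expressed as a small pure function; this sketch (names are illustrative) keeps any event dated on or after January 1 of the previous year:

```scala
import java.time.LocalDate

object RetentionWindow {
  // Inclusive lower bound of the one-year load window:
  // Jan 1 of (current year - 1), mirroring year(current_date)-1.
  def cutoff(today: LocalDate): LocalDate =
    LocalDate.of(today.getYear - 1, 1, 1)

  // True if the event falls inside the load window.
  def inWindow(eventDate: LocalDate, today: LocalDate): Boolean =
    !eventDate.isBefore(cutoff(today))
}
```

In the actual job, the same predicate would be pushed down as a filter on the DataFrame so the connector avoids pulling older documents out of MongoDB.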
We use several data optimization techniques in the Scala code, such as DataFrame
persistence at various storage levels, and we repartition the output into a single file
before writing it in Parquet format. These parameters can be tuned based on the runtime
and resources required.
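Those knobs might look like the sketch below: persist the frame while it feeds multiple modules, then coalesce to a single Parquet file on write. The helper names and the storage-level policy are assumptions for illustration, not the pipeline's actual code.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object WriteHelpers {
  // Pure helper: a simplified (assumed) policy for picking a storage level
  // depending on whether executor memory is tight.
  def storageLevelName(memoryTight: Boolean): String =
    if (memoryTight) "MEMORY_AND_DISK" else "MEMORY_ONLY"

  def writeSingleFile(df: DataFrame, path: String): Unit = {
    // Cache once so the three modules reuse the same scan.
    val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
    cached
      .repartition(1)       // one output file, as described in the text
      .write
      .mode("overwrite")    // refresh old data with the new load
      .parquet(path)
    cached.unpersist()
  }
}
```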
Example:
We can run the scripts using the sample inputs below:
sh driver.sh 2019-07-01 events NULL 1234 SESSION
sh driver.sh 2019-07-01 events 4 1234 SESSION
sh driver.sh 2019-07-01 events 4 1234 EVENTS
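The positional arguments in those calls could be received on the Scala side roughly as follows. The field names (run date, collection, an optional numeric flag where NULL means absent, a job id, and the module SESSION or EVENTS) are guesses from the samples; the real script's argument meanings are not documented here.

```scala
// Hypothetical mirror of the five positional arguments passed by driver.sh.
case class JobArgs(runDate: String, collection: String,
                   flag: Option[Int], jobId: String, module: String)

object JobArgs {
  def parse(args: Array[String]): JobArgs = {
    require(args.length == 5, s"expected 5 arguments, got ${args.length}")
    // "NULL" in the third position means the flag is not supplied.
    val flag = if (args(2).equalsIgnoreCase("NULL")) None
               else Some(args(2).toInt)
    JobArgs(args(0), args(1), flag, args(3), args(4))
  }
}
```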
Scala script:
import scala.io.Source
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import java.util.Calendar
import org.apache.spark.storage.StorageLevel
import java.io.IOException
import java.util.TimeZone
import org.apache.hadoop.hive.ql.io.parquet.timestamp._
import sys.process._
import org.apache.hadoop.fs._
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import scala.util.Try
object spark2_mongo {
val spark = SparkSession.builder().appName("Spark_2 Mongo").enableHiveSupport().getOrCreate()
import spark.implicits._
// Allow dynamic Hive partition inserts without requiring a static partition key
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")