
PYSPARK LEARNING HUB : ARTICLE - 11

WWW.LINKEDIN.COM/IN/AKASHMAHINDRAKAR

Agenda:
1. Standard way of creating schema (scan whole data)
2. Standard way of creating schema (scan 10% of the data)
3. How do we enforce schema, style 1, schema DDL
4. How do we enforce schema, style 2, StructType

from pyspark.sql import SparkSession

# Create the Spark session (YARN master, Hive support enabled)
spark = SparkSession. \
    builder. \
    config('spark.shuffle.useOldFetchProtocol', 'true'). \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
    enableHiveSupport(). \
    master('yarn'). \
    getOrCreate()


1. Standard way of creating schema (scan whole data)

df=spark.read\
.format("csv")\
.option("header","true")\
.option("inferSchema","true")\
.load("/public/yelp-dataset/yelp_user.csv")
df.show(1)

df.printSchema()


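2. Standard way of creating schema (scan 10% of the data)

# 2. Standard way of creating schema (scan 10% of the data)
# samplingRatio=0.1 tells Spark to infer the schema from a sample
# of roughly 10% of the rows instead of scanning the whole file.
df_2=spark.read\
.format("csv")\
.option("header","true")\
.option("inferSchema","true")\
.option("samplingRatio",0.1)\
.load("/public/yelp-dataset/yelp_user.csv")

# Print the schema of the sampled read (df_2, not df from step 1)
df_2.printSchema()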

3. How do we enforce schema, style 1, schema DDL

# 3. How do we enforce schema, style 1
# Define the schema as a DDL string: "column_name type" pairs separated by commas
orders_schema='order_id long, order_date date, cust_id long, order_status string'


df=spark.read\
.format("csv")\
.schema(orders_schema)\
.load("/public/trendytech/datasets/orders_sample1.csv")
df.show(5)

df.printSchema()


4. How do we enforce schema, style 2, StructType

# 4. How do we enforce schema, style 2, StructType
from pyspark.sql.types import (
    StructType, StructField, LongType, DateType, IntegerType, StringType
)

# One StructField per column: name and Spark data type
order_schema_struct=StructType([
    StructField("orderid",LongType()),
    StructField("Orderdate",DateType()),
    StructField("Custid",IntegerType()),
    StructField("OrderStatus",StringType()),
])

df=spark.read\
.format("csv")\
.schema(order_schema_struct)\
.load("/public/trendytech/datasets/orders_sample1.csv")

df.show(5)


df.printSchema()
