I used the following URLs to install and set up Spark on my desktop:
https://www.youtube.com/watch?v=IQfG0faDrzE
https://www.youtube.com/watch?v=WQErwxRTiW0
http://media.sundog-soft.com/spark-python-install.pdf
I saved the following data in a CSV file and processed it with a decision tree
algorithm.
outlook,temperature,humidity,wind,playball
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rain,mild,high,weak,yes
rain,cool,normal,weak,yes
rain,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rain,mild,normal,weak,yes
sunny,mild,normal,strong,yes
overcast,mild,high,strong,yes
overcast,hot,normal,weak,yes
rain,mild,high,strong,no
This script takes the last column of the CSV file as the CLASSIFIER and the
remaining columns as CANDIDATE columns, identifies the ROOT NODE, and splits
further. There is further scope to tune the code and improve the formatting.
I was curious to convert my PL/SQL code for the decision tree algorithm into
Spark SQL. I ran it on the data from the CSV file and got the output rules.
Happy scripting.
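For reference, the quantities the script derives with Spark SQL are the Shannon entropy of the class column and the information gain of each candidate column. A minimal pure-Python sketch over the same 14 rows (variable names here are illustrative, not taken from the script):

```python
import math
from collections import Counter

# The 14 (outlook, playball) pairs from the CSV above; other columns omitted.
rows = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rain", "yes"),
        ("rain", "yes"), ("rain", "no"), ("overcast", "yes"), ("sunny", "no"),
        ("sunny", "yes"), ("rain", "yes"), ("sunny", "yes"), ("overcast", "yes"),
        ("overcast", "yes"), ("rain", "no")]

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over the class frequencies."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

labels = [c for _, c in rows]
base = entropy(labels)  # entropy of playball: 9 yes / 5 no -> ~0.9403

# Information gain of outlook = base entropy minus the weighted
# entropy of the subset for each outlook value.
subsets = {}
for v, c in rows:
    subsets.setdefault(v, []).append(c)
gain = base - sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
print(round(base, 4), round(gain, 4))
```

The base entropy matches the 0.9403 in the output below; the gain comes out near the 0.2469 the script prints for outlook (small differences are presumably rounding in the SQL).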
"""
#################################SCRIPT STARTS HERE
#cd /c/python-spark-tutorial
#spark-submit.cmd python/pysparkDecisionTreeEntropy.py  #to run the script from a gitbash command prompt
from pyspark.sql import SparkSession, functions as F  # imports implied by the use of spark and F below

spark = SparkSession.builder.appName("DecisionTreeEntropy").getOrCreate()
df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep", ",").load("e:/data/decisiontree.csv")
df.printSchema()  # printSchema prints directly and returns None, so no print() wrapper is needed
df.show()
column_list = df.columns
print(column_list)
df.createOrReplaceTempView("dtree")
CLASSIFIER = df.columns[-1]
print(CLASSIFIER)
column_list.remove(CLASSIFIER)  #remove classifier from columns' list
print(column_list)
# NOTE: the definition of rowCount is not included in this excerpt.
print(rowCount)
print(rowCount.first())
spark.sql("select " + CLASSIFIER + ", cast(count(*) as float) val from dtree group by " + CLASSIFIER).createOrReplaceTempView("clscount")  #tbl clsfcnt
MGAIN = float(0)
# NOTE: the SQL text assigned to QUERY (the entropy / information-gain query)
# is not included in this excerpt; the name is reused for several queries below.
spark.sql(QUERY).show()
spark.sql(QUERY).filter(F.col("gain") >= MGAIN).show()
GAIN = spark.sql(QUERY).collect()[0][0]
print(GAIN)
# NOTE: ROOTNODE (the candidate column with the highest gain) is assigned in
# code missing from this excerpt.
print(MGAIN, ROOTNODE)
column_list.remove(ROOTNODE)  #remove rootnode from columns' list
print(column_list)
spark.sql(QUERY).createOrReplaceTempView("rtree")
# NOTE: LVAL, LLVAL and numRows come from the elided rule-extraction loop.
print(ROOTNODE, "->", LLVAL[0], " : ", LVAL[3], "->", LLVAL[1], "(", LLVAL[2], ") : ", CLASSIFIER, "->", LLVAL[3], "(", LLVAL[4], ")")
numRows -= LLVAL[2]  #reduce numRows with classified rows' count
spark.stop()
#################################SCRIPT ENDS HERE
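The gain loop and rule extraction are only partially visible in the excerpt above. As a rough pure-Python sketch of the same idea (the data and column names come from the CSV; the helper names and rule format are illustrative, not the script's actual code):

```python
import math
from collections import Counter

header = ["outlook", "temperature", "humidity", "wind", "playball"]
data = [r.split(",") for r in """sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rain,mild,high,weak,yes
rain,cool,normal,weak,yes
rain,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rain,mild,normal,weak,yes
sunny,mild,normal,strong,yes
overcast,mild,high,strong,yes
overcast,hot,normal,weak,yes
rain,mild,high,strong,no""".splitlines()]

CLASSIFIER = header[-1]  # last column is the classifier

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    """Information gain of splitting the full dataset on one candidate column."""
    i = header.index(col)
    parts = {}
    for row in data:
        parts.setdefault(row[i], []).append(row[-1])
    return entropy([row[-1] for row in data]) - sum(
        len(p) / len(data) * entropy(p) for p in parts.values())

# Pick the candidate column with the highest gain as the root node.
ROOTNODE = max(header[:-1], key=gain)
print(ROOTNODE)  # outlook

# Emit a rule wherever a root value leaves a pure subset, echoing the
# script's "rootnode -> value : classifier -> class ( count )" print format.
i = header.index(ROOTNODE)
for val in sorted({row[i] for row in data}):
    labels = [row[-1] for row in data if row[i] == val]
    if len(set(labels)) == 1:
        print(ROOTNODE, "->", val, ":", CLASSIFIER, "->", labels[0],
              "(", len(labels), ")")
```

On this data, outlook wins (gain ~0.2469, as in the output below) and overcast is the first pure branch; the impure branches (sunny, rain) are what the script then splits further on the remaining candidate columns.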
"""
References:
http://www.orafaq.com/node/3163
https://www.coursehero.com/file/17335804/Tutorial02/
+--------+-----------+--------+------+--------+
| outlook|temperature|humidity| wind|playball|
+--------+-----------+--------+------+--------+
| sunny| hot| high| weak| no|
| sunny| hot| high|strong| no|
|overcast| hot| high| weak| yes|
| rain| mild| high| weak| yes|
| rain| cool| normal| weak| yes|
| rain| cool| normal|strong| no|
|overcast| cool| normal|strong| yes|
| sunny| mild| high| weak| no|
| sunny| cool| normal| weak| yes|
| rain| mild| normal| weak| yes|
| sunny| mild| normal|strong| yes|
|overcast| mild| high|strong| yes|
|overcast| hot| normal| weak| yes|
| rain| mild| high|strong| no|
+--------+-----------+--------+------+--------+
....
....
....
+------+
| val|
+------+
|0.9403|
+------+
....
....
....
0.24690000712871552 outlook
....
....
....