PythonForDataScience Cheat Sheet Duplicate Values GroupBy
>>> df = [Link]()
PySpark - SQL Basics
>>> [Link]("age")\ Group by age, count the members
.count() \ in the groups
Queries .show()
>>> from [Link] import functions as F
Select
Filter
>>> [Link]("firstName").show() Show all entries in firstName column >>> [Link](df["age"]>24).show() Filter entries of age, only keep
>>> [Link]("firstName","lastName") \
PySpark & Spark SQL .show()
>>> [Link]("firstName", Show all entries in firstName, age
those
records of which the values are >24
Spark SQL is Apache Spark's module for "age", and type
Sort
working with structured data. explode("phoneNumber") \
.alias("contactInfo")) \
>>> [Link]([Link]()).collect()
.select("[Link]",
Initializing SparkSession "firstName",
>>> [Link]("age", ascending=False).collect()
>>> [Link](["age","city"],ascending=[0,1])\
"age") \ .collect()
A SparkSession can be used create DataFrame, register DataFrame as tables, .show()
execute SQL over tables, cache tables, and read parquet files. >>> [Link](df["firstName"],df["age"]+ 1) Show all entries in firstName and
age, .show() add 1 to the entries of age
>>> from [Link] import SparkSession
>>> spark = SparkSession \
>>> [Link](df['age'] > 24).show()
When
Show all entries where age >24 Missing & Replacing Values
.builder \ >>> [Link]("firstName",
.appName("Python Spark SQL basic example") \ Show firstName and 0 or 1depending >>> [Link](50).show() Replace null values
[Link]([Link] > 30, 1) \ on age >30 >>> [Link]().show() Return new df omi5ing rows with null values
.config("[Link]", "some-value") \ .otherwise(0)) \
.getOrCreate() >>> [Link] \ Return new df replacing one value with
.show() .replace(10, 20) \ another
>>> df[[Link]("Jane","Boris")] Show firstName if in the given options .show()
Creating DataFrames Like
.collect()
FromRDDs
>>> [Link]("firstName", Show firstName, and lastName is
[Link]("Smith")) \ TRUE if lastName is like Smith
Repartitioning
.show()
>>> from [Link] import * Startswith - Endswith >>> [Link](10)\ df with 10 partitions
>>> [Link]("firstName", Show firstName, and TRUE if .rdd \
InferSchema .getNumPartitions()
>>> sc = [Link] [Link] \ lastName starts with Sm
.startswith("Sm")) \ >>> df with 1 partition
>>> lines = [Link]("[Link]")
.show() [Link](1).[Link]()
>>> parts = [Link](lambda l: [Link](",")) >>> [Link]([Link]("th"))\ Show last names ending in
>>> people = [Link](lambda p: Row(name=p[0],age=int(p[1]))) th .show() Running SQL Queries Programmatically
>>> peopledf = [Link](people) Substring
SpecifySchema
>>> people = [Link](lambda p: Row(name=p[0],
>>> [Link]([Link](1, 3) \
.alias("name")) \
Return substrings of firstName Registering DataFrames asViews
age=int(p[1].strip()))) .collect() >>> [Link]("people")
>>> schemaString = "name age" Between >>> [Link]("customer")
>>> [Link]([Link](22, 24)) \ Show age: values are TRUE if between >>> [Link]("customer")
>>> fields = [StructField(field_name, StringType(), True) .show() 22 and 24
for field_name in [Link]()]
>>> schema = StructType(fields) QueryViews
>>> [Link](people, schema).show()
+--------+---+
Add, Update & Remove Columns >>> df5 = [Link]("SELECT * FROM customer").show()
| name|age| >>> peopledf2 = [Link]("SELECT * FROM global_temp.people")\
+--------+---+
| Mine| 28|
Adding Columns .show()
| Filip| 29|
|Jonathan| 30| >>> df = [Link]('city',[Link]) \
+--------+---+ .withColumn('postalCode',[Link]) \
From Spark DataSources .withColumn('state',[Link]) \
.withColumn('streetAddress',[Link]) \
Output
JSON
.withColumn('telePhoneNumber',
explode([Link])) \ DataStructures
>>> df = [Link]("[Link]") .withColumn('telePhoneType',
>>> [Link]() >>> rdd1 = [Link] Convert df into an RDD
explode([Link])) >>> [Link]().first()
+--------------------+---+---------+--------+--------------------
| address|age|firstName |lastName| Convert df into a RDD of string
+ phoneNumber| >>> [Link]() Return the contents of df as Pandas
+--------------------+---+---------+--------+--------------------
+
UpdatingColumns DataFrame
|[New York,10021,N...| 25| John| Smith|[[212 555-1234,ho...|
|[New York,10021,N...| 21| Jane| Doe|[[322 888-1234,ho...| >>> df = [Link]('telePhoneNumber', 'phoneNumber') Write & Save to Files
+--------------------+---+---------+--------+--------------------
+
>>> df2 = [Link]("[Link]", format="json")
RemovingColumns >>> [Link]("firstName", "city")\
.write \
Parquetfiles >>> df = [Link]("address", "phoneNumber") .save("[Link]")
>>> df3 = [Link]("[Link]") >>> df = [Link]([Link]).drop([Link]) >>> [Link]("firstName", "age") \
.write \
TXT files .save("[Link]",format="json")
Inspect Data
>>> df4 = [Link]("[Link]")
>>> [Link] Return df column names and data types >>> [Link]().show() Compute summary statistics Stopping SparkSession
>>> [Link]() Display the content of df >>> [Link] Return the columns of df
>>> [Link]() Count the number of rows in df >>> [Link]()
>>> [Link]() Return first n rows
>>> [Link]() Return first row >>> [Link]().count() Count the number of distinct rows in df
>>> [Link](2) Return the first n rows >>> [Link]() Print the schema of df
>>> [Link] Return the schema of df >>> [Link]() Print the (logical and physical) plans