
The advantages of Scala for data science

- [Instructor] Data scientists can use virtually any language for analytics, but Python and
R are the most popular. Scala is a valuable language for data science because it is designed
for scalability. This is especially important when working with large
datasets. Scala runs on the Java Virtual Machine and therefore it can run anywhere Java
runs. It uses both functional and object-oriented programming paradigms. Functional
programming is a style of computation that uses functions to compute values and
reduce the amount of state information that has to be maintained. Scala also employs
object-oriented techniques such as structuring programs around data and
methods. Scala programs can work with relational databases using SQL from Scala. JDBC
drivers that are used with Java can also be used with Scala programs for querying
data and issuing database commands. Scala is designed to take advantage of multiple
cores. Abstractions like parallel collections make it easy to parallelize computations over
large datasets. Apache Spark is a widely used big data analytics platform that's written in
Scala. Although Spark supports Java, Python and R programs, Scala is a popular
language for Spark applications that want to take full advantage of fast execution
times. Now this concludes our brief look at the advantages of Scala for data science.

Installing Scala

- [Instructor] Scala is freely available for download at scala-lang.org. I've opened the
browser and navigated to that website, selected Download, and I'm simply going to click to
download Scala. The download package is a compressed file. Now that it's done downloading,
I'm going to switch over to a terminal window, and I'm going to change my default directory
to where I downloaded the Scala file. The commands I'm going to execute will work on a Mac
or Linux. Similar steps will work in a Windows environment, but the commands and tools will
be slightly different. So, the first thing I want to do is list the download. I have a tar
file, which means I'm going to use the tar command to uncompress it, and that creates a
folder for me. I'm going to move that folder from my downloads directory and put it in
/usr/local. Now I'm going to cd over to /usr/local. I want to be able to refer to this
directory as scala, so I'm going to create something called a link. This will essentially
create a link between the name scala and the scala-2.11.11 directory. The ln command is
specific to Linux and Mac. In a Windows environment, you can simply rename scala-2.11.11
to scala. The last thing I want to do is make sure that when I type scala at the command
line, my command line interpreter is able to find it. To do that, I'm going to edit a file
called .bash_profile in my home directory, so I'll open ~/.bash_profile. What I'm going to
do is add a command which says export PATH=/usr/local/scala/bin. We may already have a
PATH variable defined, so let's append whatever exists on the PATH variable there. I'll
exit, save the changes, and write that out. Now we have Scala installed. Once I start a new
terminal window, the scala command will be available. Okay, let's start using Scala.

Scala data types

- [Instructor] Here is a list of basic Scala data types. These are scalar data types which
means that they store a single value. We'll discuss arrays, lists, and other non-scalar data
types in an upcoming lesson. If you've used other programming languages, this list of
data types probably looks familiar. Byte, as the name implies, is used to store an 8-bit
value. Short, int, long, float, and double are numeric data types. Char is used to store a
single character. Let's take a look at how to work with these data types in Scala. If you
followed along with the previous lesson on installing the Scala IDE, you will have
installed a Scala command line tool known as a REPL, which is short for Read Eval Print
Loop. Let's switch to the terminal window. I'll enter the Scala command to start the
REPL. It will print a welcome message and display the Scala prompt. The first thing I want
to do is define a variable. We define variables with the var keyword followed by a variable
name. In this case I'll use a, and since I'm going to use an integer I'll include the data
type in the name, but that's really not necessary. Then I'll follow the name with an
optional data type. In this case, I'll create an integer. Then I'll assign it a value by
specifying the equal sign and the value, and then I just hit return. As the print part of
the Read Eval Print Loop, the Scala REPL will indicate the name of the variable as well as
the data type and the value after it has evaluated the var command. Let's create another
variable, a character; we'll call it a_char and it's of type Char, and we'll assign it the
letter d. Again, we have our variable name, our data type, and the value. And let's work
with a long. Let's create a var and we'll call this a_long, so this is a Long value, and
let's give it a value of 8345679, and that's a long. Now, we can opt to not include a
specific data type. If we don't include a data type, Scala will infer the data type. So for
example, let's create a variable b and just assign it the value 3. In this case, it inferred
that b is an integer. Let's create another variable. We'll call this b_char and we'll assign
it the value of lowercase d, and again, it inferred correctly. Now sometimes Scala might
infer a different data type than the one we want. So for example, we might want to define a
long. I'll use var b_long; I'm including long in the variable name just to keep track of it,
but again that's not necessary, strictly for tracking here in the exercise. And we'll assign
8345679489. Now here, we got something unexpected. We got an error message. We got an error
because, to Scala, this number looks like an Int at first, but it's too large to be stored
in an Int. In cases like this, we simply append an L to the end of the number to indicate a
long data type. So I'll just type the up arrow, add an L to the end of the number, and
assign it, and it's correctly determined to be a Long. Let's try a float data type. Let's
type var c_float and we'll assign the value 12345.0. Now, we'll notice here that I got a
double. By default, Scala assigns a double data type to decimal numbers. If you prefer to
use a float, you can explicitly indicate that by attaching an f to the end of the number.
And that redefines the variable c_float as a Float value. Now, I do want to point out that
in addition to declaring variables with the var keyword, we can define constant values with
the val keyword. We follow a very similar pattern, but instead of saying var, we say val,
and then we assign some variable name such as z to a value; let's give it a long value,
8345678489L. And that creates z as a Long value. And that wraps it up for our quick
introduction to scalar data types.
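
As a rough sketch, here are the declarations from this lesson as they might be entered in
the REPL. The variable names follow the lesson; where the lesson doesn't state a value, the
value below is just an example.

    var a_int: Int = 3            // example value
    var a_char: Char = 'd'
    var a_long: Long = 8345679
    var b = 3                     // type inferred as Int
    var b_char = 'd'              // type inferred as Char
    var b_long = 8345679489L      // trailing L marks a Long literal
    var c_float = 12345.0f        // trailing f marks a Float literal
    val z = 8345678489L           // val defines a constant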

Scala collections

- [Instructor] In data science, we often have to work with collections of data such as
arrays of numbers or sets of labels. Scala provides a comprehensive set of collection
types that include sequences, sets, and maps. Each of these has several variants. Sequences,
known as Seqs for short, include things like streams, lists, and queues. Sets can be sorted,
tree based, or based on a hash. Maps include HashMaps, SortedMaps, and ListMaps. We won't go
into the details of each of these here, but we will describe the basic characteristics of
each type and discuss when to use each kind of collection. It's worth noting that
collections are either mutable or immutable. Mutable collections can
be changed. For example, by adding or removing an item. Immutable collections do not
change once they are created. Scala simulates changes to an immutable collection by
making a new version of the collection with the change. For example, deleting an
element of an immutable set will result in a new set identical to the original except that
an element is removed. Now, let's take a look at how to work with different types of
collections.
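
As a small sketch of the mutable versus immutable distinction described above (the set
contents here are just an example):

    val s1 = Set(1, 2, 3)            // an immutable set
    val s2 = s1 - 2                  // "removing" 2 builds a new set; s1 is unchanged
    // s1 is still Set(1, 2, 3); s2 is Set(1, 3)

    import scala.collection.mutable
    val s3 = mutable.Set(1, 2, 3)
    s3 -= 2                          // a mutable set changes in place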

Scala sets

Scala arrays, vectors, and ranges

- [Instructor] Let's talk about Scala arrays, vectors and ranges. First, we'll start the
Scala REPL. Scala arrays are indexed collections of values, much like you'll find in other
programming languages. Arrays are a mutable collection type. Let's create an array of
temperatures. We'll use val temps and we'll create an array, and we'll ignore units of
measure for now and simply work with integers: 50, 51, 56, 53 and 40. Arrays in Scala are
zero based, so getting the array value at index one actually returns the second item in the
array. For example, if we say temps of one we'll get 51. If we want the first item, we
should use zero. Scala is an object-oriented language. Variables, including arrays, are
objects and have methods associated with them. To find the length of an array, we can use
the length method, which is invoked by specifying a value name followed by a period and the
word length; temps.length returns five, which is the length of the array. The values in an
array can be updated by referencing the array variable and specifying an index. For example,
to change the first value in the temps array from 50 to 52, we can use this command: temps
of zero equals 52. We can explicitly create an array using variable declaration syntax and
specify the type stored in the array. Here's how to create an array of 10 integers. We can
start with val, or var if you choose, temps2, and we're going to say it's an array of type
Int and assign it a value of a new Array of type Int and length 10, or size 10. Now, we can
create multidimensional arrays as well. We use Array along with the ofDim method. So, let's
create val temps3 and define that as an Array.ofDim holding the Int, or integer, type, and
its dimensions are 10 by 10. Scala organizes language functionality into different packages.
When we want to access that functionality, we import the corresponding package. When working
with arrays, we can import the Array package using this command, import Array._, and that'll
import the Array package. My screen's getting a little full, so I'm going to hit Control+L
and clear the screen. Now that we have imported Array, this allows us to use functions such
as concat to concatenate arrays, so now we can concat temps and temps2, and that produces a
concatenation of the two arrays. When using the REPL and IDEs, sometimes we like to see all
the methods associated with an object. In the Scala REPL, we can list the methods associated
with an array by entering the name of the array variable and pressing the Tab key. So for
example, we can type temps, a period, and then press Tab, and this lists the methods that
are associated with an array. For example, we can get the max and the min value in an array.
Tab completion works with other collections as well, so see the Scala documentation for
details on each of the operations and calculations that can be performed by these methods.
Okay, let's consider another sequence data type called vectors. I'll clear the screen.
Vectors are like arrays in that they support indexed access to data in a collection. Unlike
arrays, vectors are immutable. So let's define val vec1, short for vector one, storing Ints,
or integers, and we can say we would like the vector to contain the values one, two, three,
four, five, six, seven, eight, nine, and 10. Now, as with arrays, we can reference a value
by giving the name of the value or variable and an index, like vec1 of two, and that'll
return a three. Now, let's look at one more sequence type, the range. A range is a data
structure for representing integer values in a range from a certain start value to an end
value. The range includes the start and end values, and by default a range has a step value
of one. So, let's declare myRange to be a range from one to 10. It's an immutable collection
and it includes the numbers one through 10. We can specify a step value when explicitly
specifying a new range. For example, I can create a value myRange2 of type Range and create
it using a new Range, making it go from one to 101 with a step value of two. This concludes
our discussion of arrays, vectors and ranges.
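
A rough sketch of the array, vector, and range examples from this lesson, using the values
described above (this assumes the Scala 2.11 collections installed earlier in the course):

    val temps = Array(50, 51, 56, 53, 40)
    temps(1)                               // 51; arrays are zero based
    temps.length                           // 5
    temps(0) = 52                          // update the first element

    val temps2 = new Array[Int](10)        // an array of 10 integers
    val temps3 = Array.ofDim[Int](10, 10)  // a 10 x 10 two-dimensional array

    import Array._
    val combined = concat(temps, temps2)   // concatenate two arrays

    val vec1 = Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    vec1(2)                                // 3; vectors are immutable

    val myRange = 1 to 10                  // includes 1 through 10
    val myRange2 = new Range(1, 101, 2)    // start, end (exclusive), step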

Scala maps 

- [Instructor] Let's talk about Scala maps. Maps are Scala collections used for groups of
key value pairs. Let's start by creating a collection of country names and country capital
cities so we'll create a val called capitals and I would like to create a map and I would
like to list some countries and their capitals. We'll start with Argentina and the capital is
Buenos Aires. We'll add Canada and its capital Ottawa. And Egypt and its capital
Cairo. Liberia and the capital of Liberia is Monrovia. And the Netherlands and the capital
there is Amsterdam. And we'll conclude with the United States and its capital
Washington D.C. Now, maps are collections of keys and values. To get the list of keys from a
map, we specify the name of the value or variable, capitals, followed by the keys method. So
let's clear the screen and do some operations on capitals. If we specify capitals.keys, that
returns a list of all the keys. Similarly, we can say capitals.values to
get the list of values. If you have used maps in Java, dictionaries in Python or hashes in
Ruby, Scala maps may seem familiar. The get operator is used to look up a value. For
example, capitals get Argentina will return the capital Buenos Aires. Now, if a value is
not found then none is returned. Okay, so let's clear the screen. So that's how we look
up a value using the get command. Now, if we specify something that's not there like
capitals get Mexico, we'll get the value none. Now, if we want to do a lookup, we can
also use a shorter notation by specifying capitals and then an open parenthesis and
then a value like Canada and that returns Ottawa. We can also test if a map contains a
key using the contains operator. So for example, we can ask if capitals contains
Egypt, and it does. Now, if you want to return a default value when a key value pair is not
found, you can use the getOrElse operation. So let's just quickly clear the screen
again so we work with a blank slate here, capitals getOrElse China and if China is not
found, return the string no capitals found and the value that returned is the string no
capitals found because China is not a member of the map of capitals. Now, we can add
a key value pair using the plus operator. So for example, we could say capitals plus
Ireland and its capital Dublin. That'll give us the map collection of capitals, but now it
includes Ireland and Dublin. Similarly, we can remove a value. Let's clear using Control+L
and now let's remove something. We just need to specify the key when we remove something:
we take capitals and use the minus operator to remove Liberia. We'll get our
collection of capitals, but with Liberia and its value Monrovia removed. This wraps up
our look at maps in Scala.
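
A rough sketch of the map operations described above, using the same countries and capitals:

    val capitals = Map(
      "Argentina" -> "Buenos Aires",
      "Canada" -> "Ottawa",
      "Egypt" -> "Cairo",
      "Liberia" -> "Monrovia",
      "Netherlands" -> "Amsterdam",
      "United States" -> "Washington D.C.")

    capitals.keys                          // all keys
    capitals.values                        // all values
    capitals.get("Argentina")              // Some(Buenos Aires)
    capitals.get("Mexico")                 // None
    capitals("Canada")                     // Ottawa
    capitals.contains("Egypt")             // true
    capitals.getOrElse("China", "no capitals found")
    capitals + ("Ireland" -> "Dublin")     // a new map that includes Ireland
    capitals - "Liberia"                   // a new map with Liberia removed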

Scala expressions 

- [Instructor] Now let's discuss Scala expressions.  Scala expressions are computable
statements. Numeric expressions are some of the easiest to understand so we'll start
with them. Scala's arithmetic expressions are similar to other programming
languages. Here are some examples. We can say two plus two and that gives us four. We can
use subtraction using the minus operator. We can do multiplication using the asterisk and
division using the slash sign, which with integer operands gives us integer division. And we
can also use the percent sign for modulo arithmetic, which gives us the remainder from
division. Now, in addition to arithmetic expressions, we can also work with relational
expressions for comparing values and variables. So, let's clear the screen and
now let's look at a relational expression like three is greater than four. Well, that's
obviously false. We can say something else like five is less than or equal to 10, which is
obviously true. Now, we can put relational expressions together using logical
operators like AND and OR. So for example, we could say three is greater than four and
five is less than or equal to 10 and that'll obviously evaluate to false. We can use Or
operators by using the double pipe symbol so we could say for example three is greater
than four or five is less than or equal to 10. And if we want to use the not operation, we
can use the exclamation point. So for example, we can negate three is greater than
four by using an exclamation in front of that expression and get a true value. Now, let's
take a look at ways to shorten assignment operators. Let's clear the screen and let's
create three variables. We'll create variable a which is equal to 10, variable b which is
equal to 20, variable c which is equal to 30. Now, we can add to the value of c the value
of a using a shorthand notation. This is the equivalent of saying c is assigned the value of
c plus a. We have an equivalent for subtraction and multiplication. So for example, we
could say c multiply equals a. This will multiply the value of c which is currently 40 by the
value of a which is 10. And if we evaluate c, we get 400. When evaluating expressions
with multiple operations, Scala uses the same order of precedence that is found in most
programming languages. Consult the Scala documentation if you have questions. Now,
one more thing I'd like to look at is multiple expressions and treating those as a single
block. We can do that by using braces. So for example, let's print out a line and let's
have multiple expressions in here. Let's first of all create a val, we'll call it a, and we'll set
a equal to two times three and then let's add the number four to a and then we'll close
off our block and we'll close off the parenthesis for the print line and that evaluates
those two lines, the value definition and then the operation of a plus four together as a
single unit. So, if you have multiple expressions you want to treat as a single expression,
use a block, which is defined with braces, like we've done here. That's our look at Scala
expressions.
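
A rough sketch of the expressions walked through above (operand values not given in the
lesson are just examples):

    2 + 2                    // 4
    10 - 4                   // subtraction
    6 * 7                    // multiplication
    10 / 3                   // integer division gives 3
    10 % 3                   // remainder gives 1

    3 > 4                    // false
    5 <= 10                  // true
    (3 > 4) && (5 <= 10)     // false
    (3 > 4) || (5 <= 10)     // true
    !(3 > 4)                 // true

    var a = 10
    var b = 20
    var c = 30
    c += a                   // c is now 40
    c *= a                   // c is now 400

    println { val x = 2 * 3; x + 4 }   // the block evaluates to 10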

Scala functions

- [Instructor] Let's talk about Scala functions. I've started the Scala REPL here. Now Scala
functions are expressions that can be called with parameters to compute a value. If
you've worked with other programming languages that use functions, you've probably
worked with a syntax that allows you to define a function, give it a name, and optionally
pass in some parameters. We can do this in Scala as well. Let's look at the structure. We
use the def keyword to define a function. Give it a name like myFunction, and tell it what
parameters we want to pass in, like the variable a, which is a type integer, and b, which is
also a type integer. Then I want to indicate that this function is going to return a
particular value. In this case, it'll return an integer. And I want to compute a new
value within this function called c, and c is going to be the value a times b. Then I simply
want to return a value of c. Then I will close off my definition and now I have my
function. Now once we've defined a function, we can call it using the function name and
pass in any needed parameters. So I can call myFunction and pass in two and three, and
I'll get the value six. Now in Scala, a function is not required to return a value. These
functions are called procedures, and are used primarily to produce side effects, such as
printing messages or writing information to a log file. Let's create one such
procedure. First, we'll clear the screen. Now we'll use def myProcedure, and we'll pass in
something. We'll call it inString, and it's of type String. Now this is a procedure. It's
not going to return a value, so I will specify that the return type of this function is
Unit, and I'll specify that my procedure is defined with the following statements. We're
just going to have one side effect, which is to println whatever string we passed in. Now
again, notice that we specified the return type of Unit. This is the Scala equivalent of
void, which you may have used if you've programmed in Java. Now we can
invoke myProcedure just like we invoke a function by specifying the name, and passing
in any values that we need for parameters. So I'll pass in this is a log message and that
will execute the side effect which is to print a log message. Functions are useful for
grouping related expressions into logical units of work and specifying computations that
you may want to use throughout your data analysis work in Scala.
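
A minimal sketch of the function and procedure from this lesson (inString is an assumed
spelling of the parameter name):

    def myFunction(a: Int, b: Int): Int = {
      val c = a * b
      return c               // the explicit return is optional in Scala
    }

    myFunction(2, 3)         // 6

    def myProcedure(inString: String): Unit = {
      println(inString)      // just a side effect; Unit is Scala's equivalent of void
    }

    myProcedure("This is a log message")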

Scala objects

- [Instructor] Now we're going to discuss Scala classes and objects. I'm going to start the
Scala REPL. Classes are definitions of structures and operations on those
structures. Recall that Scala is an object-oriented language and variables are all objects.
When we create a variable or value of a particular data type, we can access the
operations that are defined on that type. For example, when we create an array we get
access to a method, which can sort the elements of the array. Let's create a value, we'll
call it y, and we'll say that it's an Array. And let's give it some names of countries, like
England, Liberia, Haiti, Australia, and Sweden. Now the Array, of course, will order the
elements according to how we entered them, but I can say y and call the sorted method
on the Array y and it will return a sorted version, in this case alphabetical order
version, because it's strings. We get this because the sorted method is defined for Arrays.
Now, often when working with data sets it can help to define our own classes. Classes define
a structure. We then use the new operator to create instances, or objects, of those classes.
Let's assume we're working with sensor data collected from different locations around the
world. Each sensor has a latitude, longitude, and altitude associated with it. Latitude and
longitude are defined by an integer and a character indicating a direction, such as N, S, E,
or W. Latitude is expressed relative to north or south; longitude is expressed relative to
east and west. Now we can define a class to
create an object using the following definition. First of all, let's start with a clear
screen and let's define a class and we'll call it location. And location is defined as
having a variable called latitude, and that's an Integer value, a latitude direction or
lat_direction, which is of type Char, a longitude, which is also an Integer. A
long_direction, which is a Char. And finally, the altitude, which is an Int. And we're
ignoring units of measure with regards to altitude here. So now we've defined our class
location. Let's create an object from the class location. Specify val loc1, and we'll say that
that's a new location, and it's located at 45 degrees north and 120 degrees west, and
300 meters, for example, is the altitude. If we look at loc1 and specify altitude we get the
value 300. And if we look at loc1 latitude direction we get the Character N for north. So
that's as we expected. Now to see the attributes of a class, these are known as
members, you can type the name of the value or variable followed by the Tab key in the
REPL. So if I type loc1. + Tab I can see the members. By default members are available
outside the class. They're considered public. If you want to have a member, but not
make it accessible outside the class's methods, you can declare it private. So let's do that
first by clearing the screen, and now let's define a new class. And let's call it
myPublicPrivate, and let's define three values. We'll have x, which is an Integer, and we'll
initialize that to 0, and we'll have a val of y, which is an Int, and we'll define that to
default to 0, and then we'll specify private value z, which is an Int and defaults to 0. So
now we have the myPublicPrivate class defined and we'll create myPP, which is equal to a new
myPublicPrivate. Now we have an instance of my class, and if we look at that
instance and then press the Tab key in the REPL we'll see that x and y are listed, but z is
not. The private member z is not listed and can only be referenced inside the
class. Okay, now let's create a class and define an operation on that class. We'll clear the
screen. Now I'm going to create a class called Point2D and this is for two-dimensional
points. And this will have two coordinates, coordinate 1, which is an Int, and coordinate
2, which is also an Integer. And I'm going to define this as having a variable a, which is
an Integer, and it's going to be set to the value that we pass into coordinate 1. Variable
b is an Integer, which will be set to the value we pass in as coordinate 2. So now I have
my members. And now I want to define a function or a method and I want to define this
function to be able to move. And so I want to be able to move one of my points, and so
I'll specify a delta or an amount to move, and I'll specify it for one coordinate, we'll call
that delta_a, and that's going to be an Int. And the other coordinate value will be
delta_b, which is also an Int. And now we're just defining a function. And I can reference
the variables I have, so I'm going to say that a is equal to a + delta_a, and b after the
move will be set to the value of b + delta_b. And then I'll use the closing brace to
close off the definition of move, and then one more to close off the definition of the
class. Now let's create a point. We'll call this val point1, then we'll say it's a new
Point2D, with coordinates 10 and 20. And let's just check on that: point1.a. Oh, as you'll
notice, I pressed a and then I must have hit the Tab key by mistake. This is a useful
mistake on my part. In the past when I've been pressing Tab with regards to values and
variables, I've been listing all of the members. It's helpful to know that if you type in
the first few characters, it'll narrow down the list. So in this case, point1 has two
different methods or members that we can work with that begin with a: a itself and
asInstanceOf. So again, if you want to shorten the list when you're working with Tab, just
type one or two letters. Okay, so we have point1.a, the value is 10, which makes sense, and
point1.b is 20. So now if I take this point, point1, and apply the move operation with 5 and
15, we will see point1.a has been moved by 5, point1.b has
been moved by 15. Again, as expected. Classes and objects are useful for
organizing structure definitions and functions on those structures and can help organize
and simplify code used for data science operations.
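
A rough sketch of the classes described in this lesson (class names are capitalized here
following Scala convention):

    class Location(var latitude: Int, var lat_direction: Char,
                   var longitude: Int, var long_direction: Char,
                   var altitude: Int)

    val loc1 = new Location(45, 'N', 120, 'W', 300)
    loc1.altitude            // 300
    loc1.lat_direction       // N

    class MyPublicPrivate {
      val x: Int = 0
      val y: Int = 0
      private val z: Int = 0 // visible only inside the class
    }

    val myPP = new MyPublicPrivate

    class Point2D(coordinate1: Int, coordinate2: Int) {
      var a: Int = coordinate1
      var b: Int = coordinate2
      def move(delta_a: Int, delta_b: Int): Unit = {
        a = a + delta_a
        b = b + delta_b
      }
    }

    val point1 = new Point2D(10, 20)
    point1.move(5, 15)       // point1.a is now 15, point1.b is now 35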

Advantages of parallel collections

- [Instructor] Let's consider the advantages of parallel collections. Multi-core processors are
common today. Many desktop machines have two or four cores and servers typically have
multiple times as many. Scala makes it easy to take advantage of multiple cores and hyper-
threaded processors with the use of parallel collections. A common programming practice is to
use for-loops to process each element of a collection, one at a time. This works well for small
collections, but when we have thousands or more items in a collection, the processing time can
begin to add up. Like an assembly line, we can process data in a collection faster if we work on
multiple elements at a time. A parallel collection is a collection that allows us to do just
that. Let's consider a case where we have an array of 1000 numbers and we need to multiply each
number by two. Let's say we use a for-loop and multiply each number one at a time. Then it will
take, let's say, a thousand units of time. Now if we split the array in two and process both
halves at once, we could finish in 500 units of time. On a quad-core processor with hyper-
threading, we could run eight processes in parallel and finish the task in 125 units of time. The
primary advantage of using parallel collections is that it allows us to finish computation
faster than we would with sequentially processed collections. Another advantage is ease of
use. Other programming languages have support for parallel processing, but Scala makes parallel
processing as easy as sequential processing. The overhead of using Scala parallel collections is
fairly low. For some collection types, using the parallel collection version does not incur any
noticeable overhead when compared to using the sequential version. Scala has a variety of
parallel collection types, including the parallel array, or ParArray, ParVector, ParHashMap, and
ParSet. Additional parallel collections are described in the Scala documentation. In our
discussion here, we'll focus on using parallel arrays and parallel vectors.

Creating parallel collections

- [Narrator] Let's create some parallel collections. We'll start the Scala REPL. Now, there are
two ways to create a parallel collection. We can convert a sequential collection into a parallel
collection, or we can create a variable or value with a parallel collection type. We'll look at
examples of both. So first, let's create a range of a hundred integers, and I'll call that val
rng100, short for range 100, and set that to one to 100. Now, I want to create a parallel version
using the par method. I'll do that by creating a parallel range 100, which is simply equal to the
range we just created, with the par method applied. Notice the type of this object is
scala.collection.parallel.immutable.ParRange. ParRange is the parallel version of the sequential
Range object. Let's type the name of the parallel range followed by a period and then hit the Tab
key. Notice the list of methods includes things like BuilderOps, par, range, iterator, SSCTask,
SignallingOps, and TaskOps. These are operations that are not in the sequential version of the
Range object. They are used to implement parallel collections. Let's clear by typing Control+L.
Let's look at the methods available on the sequential version of the range. Those operations we
discussed, like par, range, iterator, and SignallingOps, are not included here, because they're
not needed for sequential operations. Now, let's take a look at an example of how to create a
parallel vector using an explicit definition. First, we'll clear the screen, and now what I want
to do is import scala.collection.parallel.immutable.ParVector. Now that I've imported the class
for a parallel vector, I can create an instance of it. I'll call this pvec200, for parallel
vector 200, and I'm going to create a ParVector with a range of zero to 200. So, I'm defining
this new value, pvec200, and it's going to be of type ParVector, and the range will be zero to
200. Creating parallel collections is as easy as creating sequential collections. It's also easy
to convert between sequential and parallel collections.
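
A rough sketch of the two approaches described above (the value names are interpretations of
the ones spoken in this lesson, and this assumes the Scala 2.11 built-in parallel collections):

    val rng100 = 1 to 100                   // a sequential Range
    val prng100 = rng100.par                // converted to a ParRange via the par method

    import scala.collection.parallel.immutable.ParVector
    val pvec200 = ParVector.range(0, 200)   // an explicitly parallel vector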

Mapping functions over parallel collections

- [Teacher] Let's work with functions over parallel collections. We'll start the Scala REPL. Now I
want to create a value v and I'm going to set this to be a range of one to 100. I'm going to want to
convert it to an array. We can convert an array into a parallel collection using the par
method, and I'll call it pv, for parallel version of v, and it's v.par. Now I have a parallel array. The
value of pv is the same length and has the same values as v, however the pv value allows for
parallel operations. Let's start by multiplying each member of the v array by two. First, I'll clear
the screen. Now I have v. Now to apply an operation to every member of a collection I can use
the map function. I'll use the underscore as an alias for each member of the collection. I'll say, for
each member of the collection, multiply by two. That doubles everything in the array. Similarly,
for parallel version, I can use the same code. I can apply the map function, use the anonymous
variable, and then multiply every member by two. What's going on here, is that I'm getting the
same results but if we were working with very large arrays we would be able to finish these
calculations faster with the parallel version. In both examples the underscore is an anonymous
variable that gets bound to each member of the collection. When working with a sequential
collection, the multiplication is applied to one member at a time. But with the parallel
collection, a number of members are processed at the same time. We use the map method on
collections; when we talk about mapping functions over a collection, we're talking about something
different from Map collections. Map collections are groups of key value pairs. The map method is a
functional programming construct which allows us to apply a function to each member of a collection. We
could define a custom function to apply to members of a collection. For example, let's define a
function called square. I'll just clear the screen. We'll define square as a function which takes one
integer as a parameter, it returns an integer type, and it is defined as returning x times x. We
can test that. Square of four should be 16, correct. Okay, now we can apply square using
map. We could say, v, which is our sequential array and apply the map function and the function
square. We want to apply that to each member of the collection. Now I have an array of squares
of that original collection. Again, I can do it in parallel simply by referencing a parallel
collection, applying map, specifying the square function, and saying I want this to apply to each
member of the collection. I'll return and again, the results are the same because we're working
with the same inputs but these calculations were done in parallel and the results were merged at
the end. The map method is a simple way to apply computations in parallel and take advantage
of multi-core processors.
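
A rough sketch of the mapping examples from this lesson:

    val v = (1 to 100).toArray       // a sequential array
    val pv = v.par                   // parallel version of v

    v.map(_ * 2)                     // doubles each member sequentially
    pv.map(_ * 2)                    // same result, computed in parallel

    def square(x: Int): Int = x * x
    square(4)                        // 16

    v.map(square(_))                 // squares, computed sequentially
    pv.map(square(_))                // squares, computed in parallel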

Filtering parallel collections

- [Instructor] Sometimes, when we have large collections we want to filter them. Scala makes it
easy to filter collections so you can find all the members of a collection that meet some
criteria. So for example, let's create an array of numbers. We'll create val v, and we'll make this
one to 10,000. And let's make it an array. And let's create a parallel version by using the par
method. Now let's just check the length of the collections: v.length and pv.length. OK, they're
the same. Numbers appear to be the same. So we'll just clear the screen, and we'll move on to our
next step. So we have a collection of 10,000 elements. What I'd like to do now is create another
value that has the elements from pv, the parallel collection, that are greater than 5,000. So I'm going
to make a new value, and I'll call it pvf for the filtered version of pv. And I'm going to define that
as pv.filter. So I'll apply a filter. So for each element of the collection, I want to do a test and see
if it is greater than 5,000. And now I have a value called pvf, which contains all the values
greater than 5,000. And if we check the length, we'll see that it's 5,000 long, which is what we
would expect. Scala parallel collections also have a filterNot method for applying the negation of
the filter. So let's create another val; we'll call it pvf2. And for this we will use pv.filterNot,
applied to all the members of the collection, and the condition is greater than 5,000. So we would
expect all the values that are 5,000 or less, which it appears we have. Let's just double check the
length to make sure we got them all. Great, we have 5,000. The filter and filterNot methods can
take custom functions that return a boolean value. Let's define a boolean function that takes
integers as an input and returns a boolean value. So first I'll clear the screen so I have a little
room to work. Now I'm going to create this function called div3 and I'm going to pass in a single
integer. We'll call that x. Now the function itself will return a boolean, so we'll specify that.
And that will define the function, and we will say that we have a value called y, which is an int,
and it's defined to be x, the parameter we passed in, modulo 3. So that essentially gives us the
remainder of division by three. And then we return the value of a relational check. We want to know
if y is equal to zero, because if it is, then we have a number that's divisible by three. So let's
just check that. Let's call div3 with three. True, good. That's divisible by three. div3 of nine
should also be true. Yup. But div3 of five should not. Great. So div3 seems to
be working correctly. Now let's apply it to our value pv and let's filter using div3. And again,
want to apply this to each member of the collection. So we'll use the anonymous symbol,
underscore, and then execute and you'll notice all of the values that are returned in the par array
are multiples of three. So the filter and filterNot methods are handy ways of selecting a subset of
elements from a parallel collection.
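
A rough sketch of the filtering examples above:

    val v = (1 to 10000).toArray
    val pv = v.par

    val pvf = pv.filter(_ > 5000)        // the values greater than 5,000
    val pvf2 = pv.filterNot(_ > 5000)    // the values that are 5,000 or less
    pvf.length                           // 5000
    pvf2.length                          // 5000

    def div3(x: Int): Boolean = {
      val y = x % 3                      // remainder of division by three
      y == 0
    }

    pv.filter(div3(_))                   // keeps only the multiples of three
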
When and when not to use parallel collections

- [Instructor] Here are some things to keep in mind when considering the use of parallel
collections. First of all, parallel collections should be considered only when you have at least
thousands, possibly tens of thousands of elements. For some types of collections, converting
between the sequential and parallel type requires copying data,so keep that in mind. Now you
want to avoid side effects. It's best to avoid applying procedures with side effects in parallel
collections. Side effects can lead to nondeterminism; that means different times you execute the
operation, you may get different orderings of results, and the side effects could take effect in a
different order each time the operation is executed. Also, you want to avoid non-associative
operations when working with parallel collections. In associative operations, the order of
operations doesn't matter. Now if your computation depends on state information as you go
through the processing of a collection, and the order of those operations matters, then you should
not use parallel collections. So those are some tips to keep in mind when you work with parallel
collections.
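
As a small illustration of the side-effect caution above, printing from a parallel collection is a
side effect whose ordering can change from run to run (a sketch, not something you would normally
want to rely on):

    val nums = (1 to 10).par

    // The elements are processed on multiple threads, so the printed
    // order can differ each time this line runs.
    nums.foreach(n => println(n))

    // A pure operation like map has no such problem; the result matches
    // the sequential version.
    val doubled = nums.map(_ * 2)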

Installing PostgreSQL

- [Instructor] We will be using PostgreSQL as our relational database. I've opened a
browser and navigated to enterprisedb.com. From here I can download a version of
PostgreSQL. I'm going to select the latest version, and I'm going to select the operating
system. I'm using a Mac OS X, but the installation works similarly on Linux and
Windows.And I will download now. And that will start the download process. Now, I'm
opening the image that I downloaded. And that displays a folder with, in this
case, "PostGres," a version, dash "OSX." So, I'm just going to double-click on that. And I
get a warning message, and I'll enter that. And I'll enter my password. And when the
setup wizard starts, I can just select Next. I'm going to select defaults, for the most
part. And I'll enter a password. And of course, it's important to remember that. We'll
need that when we connect to the database. And I'll use the default port, and the
default locale. And I'll install. Now at this point, we're offered the option of installing
some additional packages. I'm going to deselect that. We don't need to use
Stackbuilder. So I will click Finish. And we have installed Postgres. The EnterpriseDB
installation package will start Postgres for you. If it doesn't start automatically for
you, check the Postgres documentation.

Loading data into PostgreSQL


[Instructor] We've installed the PostgreSQL software. So now let's set it up by creating a
database and a user, and loading some data into it. Now Postgres provides a couple of
handy commands. One is called createdb, and I'll give it a name scala_db. This will create
a database called scala_db. Now I'll also create a user, with createuser, and give that user the same
name, scala_db. Now what I'd like to do is create some tables and load some data into
those tables. Now I've created a script called emps.sql. If you have access to the exercise
files you can download emps.sql and follow along. And what I'm going to do is issue a
command using Psql, a command line for working with Postgres. And I'm going to
indicate that I want to log in as a user scala_db, and I want to use the database
scala_db. And now I want to execute a script and to do that we pass in the dash a, dash f
options, and then we specify our file. Now I have emps.sql in my download directory, so
I'll just specify the path to that location. And the command executes. So I'll clear the
screen here. So, now that I've cleared the screen I'm going to execute Psql, which is my
command line for working with Postgres. I'm going to indicate that I want to use the
scala_db user and connect to the scala_db database. Now basically what I want to do is
just double check and make sure that I've actually loaded up my employees. So I'll select
count star from emps, and there's a thousand. And let's just select star from emps
limit five; that should be enough. And so what we have here is some basic
information about employees such as their name, their email, the department they work
in, their start date and so forth. So we've installed Postgres, we've created a
database, and we've loaded some data for us to work with. So now we're ready to work
with Scala and SQL together.

Connecting to PostgreSQL

- [Instructor] Now we'll start Scala and we'll use JDBC to connect to our database. So,
we'll specify scala to start the REPL, but this time we're going to do something a little
different. We're going to include a classpath option and I'm going to specify the
directory where I downloaded my JDBC driver and for me I downloaded it to this
directory. Now, your classpath may be different depending on where you downloaded
the JDBC driver and be sure to include the name of the jar file in the classpath. First
thing I want to do is import a couple of classes that will allow us to work with SQL from
within Scala. So, we'll import java.sql.DriverManager, and I'm just hitting Tab here to
show you the different things that we have available to us. I'll import DriverManager, and
I also want to import java.sql.Connection. So, now we have our two classes imported, so I'm
just going to clear the screen and now I want to define a couple of strings that we'll use
when we connect to the database. The first I'll call driver and that's simply the name of
the driver class within the JDBC driver package that you're going to be using, and in our
case it's org.postgresql.Driver. And the other string we want to create is a URL which
points to the database and that is jdbc:postgresql and our postgres database is running
locally so I'll specify a localhost and the database I want to connect to is scala_db and I
want to connect as user scala_db. Okay, now the next thing I want to do is actually load
the driver and I do that by calling Class.forName and then specify the driver. So, that
loaded the JDBC driver for us. Now, I want to create a variable for a connection and we'll
just call it connection and its type is connection and we'll just set it to null for
now. Okay, let's clear the screen again. Now, let's set our connection by calling
DriverManager.getConnection and let's specify our URL. Great, so now we have our
connection. Now, the next thing we want to do is also create a value where we can store
a statement, so we'll create a new value and we'll call it statement and this will be
connection.createStatement, and you'll notice here that as I was typing I hit the Tab key
and that showed me the options that I have, including things like createStatement or
createStruct. So, now we have a statement. Now, we can execute a statement and to do
that we specify another val and let's get a resultSet and we'll take our statement and
we'll execute a query using that statement and let's do something simple like select star
from emps. Now, that gave us a resultSet. Now, the resultSet is basically a cursor if
you're familiar with that so it's an iterator which allows us to get values. So, the first
thing we'll do is call resultSet.next, and that will advance the cursor to the next row, so
I have the next row available in resultSet, and now I can look up a value from resultSet,
such as the string that's associated with last_name. In this
case, it's someone named Kelley. So, those are the basic steps of creating a connection
and doing a basic select query. In our next video we'll spend more time looking at select
statements.
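
A rough sketch of the connection steps described above, assuming the PostgreSQL JDBC jar was
passed on the REPL's classpath and the scala_db database and user from the earlier lessons
exist:

    // started with: scala -classpath <path to the PostgreSQL JDBC jar>
    import java.sql.DriverManager
    import java.sql.Connection

    val driver = "org.postgresql.Driver"
    val url = "jdbc:postgresql://localhost/scala_db?user=scala_db"

    Class.forName(driver)                       // load the JDBC driver

    var connection: Connection = null
    connection = DriverManager.getConnection(url)

    val statement = connection.createStatement()
    val resultSet = statement.executeQuery("select * from emps")

    resultSet.next()                            // advance to the first row
    resultSet.getString("last_name")            // e.g., Kelley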

Querying with SQL strings

- [Narrator] So, let's continue our discussion about how we select using SQL strings. Now,
recall we had defined a value called resultSet, and resultSet was set from the statement
object, which executed a query, and the query we had executed was "Select * from
emps." This returned a cursor, so we wanted to move it along to the next item
in the list, and then we're able to look up individual columns on this particular row. So,
for example, we could look up the resultSet, get string, and I'm just going to show all of
the options with getS, and we'll look up department. And this first person works in the
Computer Department. Let's change that, let's just up arrow and change department to
last name so we can see who we're talking about. Ah, okay, that's right, it's Kelley, and
let's just check the start date. Okay, so we figured out that the person whose last name
is Kelley, we know what department they work in and what their start date is, so that's
great. It's a little inconvenient to work with just one column at a time, especially if we
want to work with multiple rows. Fortunately, it's fairly easy to work with a cursor
because we can just use a while loop, so let's take a look at that. First, I'll clear the screen
so we have some space to work. Okay, so we'll create resultSet2, and we're going to
execute a statement. And we'll execute the query, and the query we'll execute is "Select
* from company_divisions." Now, I want to iterate over each row in the resultSet, so to
do that, I'm going to use a while loop, and basically, I'm going to iterate as long
as resultSet2 continues to have values. So, each time there's a next, I'm going to keep
iterating. And then, for each row, I want to execute a block of code, and that block will
define a value department, which is from resultSet2. I'm going to get the string for the
column department. Then, I'll also create a value called comp_div, and that is also from
resultSet2. This is company_division. Now, I want to print these out, and I'll simply print
out the department, plus a couple spaces, and comp_div. And that's all I want to do, so
I'll close off the block, and we'll iterate through. And what you'll notice here is we're
printing out each of the departments followed by their division. So, this is how we
typically work when we're doing relatively simple strings or very ad hoc strings. In the
next lesson, we'll look at how to use prepared statements, which are useful when we
want to execute the same statements over again.
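
A rough sketch of the while loop described above, reusing the statement object from the
previous lesson (column names follow the ones used here):

    val resultSet2 = statement.executeQuery("select * from company_divisions")

    while (resultSet2.next()) {
      val department = resultSet2.getString("department")
      val comp_div = resultSet2.getString("company_division")
      println(department + "  " + comp_div)
    }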

Querying with prepared statements

- [Instructor] Now let's take a look at working with prepared statements. A prepared
statement allows us to execute a query repeatedly without forcing the database to
parse and build a query execution plan each time we execute that statement. So
prepared statements are especially useful when we're using a statement repeatedly. So
the first thing I want to do is define a query string. I'll just type in val queryStr, and
I'll say that this statement is select star from company_regions where region_id is greater
than some number. And I want to put in, essentially, a parameter here. So
I'll be able to change it. So we'll use a question mark for that. Now, what I will do is
actually to create, using that query string, a prepared statement. And I'll just use PS for
the name of the value for that. And to do that we specify our connection, and use the
prepare statement method. And we pass in our query string. Okay so now we have a
prepared statement. And that means the database has parsed the string, built a query
execution plan, and now we're ready to pass in a parameter. And we do that by
specifying our prepared statement, and using a set operator. So we can say, in this case I
want to set an integer. And the integer is the first question mark we find in the
string. And I want to set that to five. Now, that I've specified my parameter, I can actually
execute this and generate a result set. And I'll just create a value to hold that. And I'll
call the result set RS, just to keep the typing down to a minimum. So I'll take my prepared
statement and I'll execute the query. Now, RS is a result set. So as before I'll go
to next in my result set. Now let's get a region ID. And that's an int, so I'll get region
ID from this row. And we'll see it's number six, which makes sense since our query was
to select region IDs greater than five. Now let's also try another one. Let's get a
string. Let's get company_regions from this row, and that's Quebec. So if we get the
string for country, we should see Canada, which we do. Okay, so things are working as
expected. Now let's change this. Let's execute it with our prepared statement again, but
this time we'll set our parameter. And it's the first question mark in the string. We'll
replace that with the number three. And now we'll go to the next row in the result
set. And let's get the region ID. And now we have another region ID. Now this one
happens to be seven. And if we get the string for the region again, we'll find that we
have Nova Scotia. And again we should get a country. Let's get the string for country on
this row. And of course we have Canada. So that's a look at working with prepared
statements.
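
A rough sketch of the prepared statement flow above (the company_regions column name is an
assumption based on how it's referred to in this lesson):

    val queryStr = "select * from company_regions where region_id > ?"
    val ps = connection.prepareStatement(queryStr)

    ps.setInt(1, 5)                    // bind 5 to the first ? placeholder
    var rs = ps.executeQuery()
    rs.next()
    rs.getInt("region_id")             // e.g., 6
    rs.getString("company_regions")    // e.g., Quebec
    rs.getString("country")            // e.g., Canada

    ps.setInt(1, 3)                    // re-run with a different parameter
    rs = ps.executeQuery()
    rs.next()
    rs.getInt("region_id")             // e.g., 7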

Summary of SQL in Scala

- [Instructor] So to summarize our work with SQL and JDBC, some things I'd like to point
out is that first of all, it requires the installation of a JDBC driver, so we have to have Java
installed, but we also have to install a specific JDBC driver for the particular kind of
database that you're working with. So if you're using MySQL, that would be one driver. If
using Postgres, that would be another, and so on. Another thing to keep in mind is that
when we work with queries they return a structure called a cursor, so you have to move
through those using the next method, and SQL statements can be constructed as strings
and executed but prepared statements are more efficient if we're repeatedly using those
statements.

Introduction to Spark

- [Instructor] There are many reasons to use Scala for data science and analytics. Scala is
a functional programming language and those languages are well-suited for applying
computations to data. It's also an object-oriented language. That allows us to create
objects and methods that keep our data organized according to the structure of the
business problem we're working on. Features like parallel collections help when we're
working with large data sets. They allow us to take advantage of multiple CPUs that are
found in contemporary desktops and laptops. When you start working with big data, that is,
data that cannot be processed in a reasonable amount of time on a single server, then it's
time to consider a distributed processing framework like Spark. Spark is a distributed
processing framework written in Scala. It's known for its fast processing; it's faster than
Hadoop, the first popular big data analytics platform. It has libraries for analytics and
stream processing for near real-time analysis. It's fault-tolerant, so servers can fail and
processing can still continue. And finally, it's scalable. One can easily add servers to a
Spark cluster. This is especially useful in a cloud environment where it's easy to add
nodes or servers to a cluster. Spark provides packages for distributed processing that
allow us to do data science over big data-sized data sets. Much of what we have learned
about Scala and using SQL in Scala is applicable to work with Spark. Just as parallel
collections gave us easy access to parallel processing capabilities of a single server,
Spark gives us access to the parallel and distributed processing capabilities of an entire
cluster of servers.

Installing Spark

- [Instructor] Now it's time to install Spark. Here I've opened a browser and navigated to
spark.apache.org. And from the main page, I'm going to select the Download
option, and that will bring me to another page where I can choose from among several
different downloads. And I'm just going to accept all of the defaults in terms of the
Spark release,the package type, and the download type. And I'm going to click on the
link that is provided here to download Spark. Now that Spark has finished
downloading, I'm going to go over to my download directory and I have a compressed
file here, so I'm going to open that. Now that creates a new folder, and this has a fairly
long name so I'm going to rename this and I'm just going to call it Spark, which is
sufficient for our purposes. Now also, Spark is now in my download directory. I'm just
going to drag it over to my home directory. Okay, so now I have in my home directory, I
have the Spark folder, so I'm going to move to my terminal window and I'm going to
cd to my home directory. And then also I'll cd into the Spark subdirectory. And I'll list
the contents. And you'll notice, we have things that you would expect like the
license, the notice, the release notes, and a bin directory. So I'm going to cd into the bin
directory, and I'll list the contents there. Now there are a number of different commands
in here. The one that we're interested in is spark-shell. So I'm going to execute spark-
shell. Now Spark can take a minute or two to start up. There's quite a few things that are
going on here. And by default, the spark-shell issues a fairly verbose set of log messages
and warnings. So if you want you can change your default log level from WARN to
ERROR and that'll reduce the number of messages that are displayed during
startup. And there are instructions here. These first lines give instructions on how to
change the logging level. Okay, so we have Spark installed.

Getting Started with Spark RDDs

- [Instructor] Spark has a data structure called the Resilient Distributed Dataset, or RDD
for short. These are immutable distributed collections. They're organized into logical
partitions, and they're a form of fault-tolerant collection. Data in resilient distributed
datasets may be kept in memory or persisted to disk. RDDs are like parallel collections in
a lot of ways. They're groups of data of the same type or structure, the data is processed
in parallel, and RDDs are generally faster than working with sequential operations. Now,
there are some differences between RDDs and parallel collections. RDDs are partitioned by a hash function, while parallel collections are broken into subsets and distributed across cores or threads within a single server at run time. RDDs are distributed across multiple servers; parallel collections work within a single server. With RDDs, the data
can be easily persisted to permanent storage while working with the RDD. RDDs are
broken up again into partitions, and here is an example of a set of four partitions, each
partition storing pairs, in this case, pairs of strings and integers. Now let's switch over to
a terminal window and start our Spark REPL, or the read-eval-print loop. Now, I've
navigated to the location where I have installed Spark, and I'm in the bin directory of
Spark, and I'm going to start the Spark Scala REPL, and I should point out there is a REPL
for Python as well. We'll be using the Scala REPL in this course. Now Spark, by default,
prints out some warning and information messages as it starts up, but you know you're
in good shape when you see the Spark text banner, and you'll notice that the prompt is
the Scala prompt, so this is going to feel a lot like working with the Scala REPL. So the
first thing I want to do is I want to create a large group, or a large set of random
numbers. So the first thing I want to do is import a library to help with that, and we'll
use the Scala util Random package. Okay, so we've just imported scala dot util dot
random, so let's create a value, val. Let's call it big range, and we'll call Scala util Random
shuffle, and we'll create one to 100,000. And there we have 100,000 random
integers. Okay, we've just created our value called big range. Let's convert that to a
resilient distributed data set, an RDD, and we'll do that by typing val big parallelized
RNG, then we'll specify SC for the Spark context. Then we'll call parallelize, then we'll
pass in big range, and now what we have is that we have a new data type and it's called
an RDD, in this case, it's an RDD of integers. Let's take a look at some of the methods that are available for bigPRng. And I'm doing this by typing the name of the value, followed by a period, followed by the Tab key. And what we'll notice is we have some operations that give us statistics on these numbers, because it's an RDD of integers. So for example, we could start with bigPRng and ask for the mean. And the mean is about 50,000, which is what we'd expect for a shuffled list of the numbers one to 100,000. And let's find the min. So we'll type bigPRng.min, and the min is one, and let's try the max as well. Max is 100,000. Since the shuffle only reorders the range, the min and max are equal to the smallest and largest values allowed. Now also, if you're familiar with statistics, you've probably heard of the statistic called the population standard deviation. And we can look at bigPRng.popStdev, short for population standard deviation. And this gives us a description of the spread, how spread out the values are. And in this case, it's about 28,867 or so. So that concludes our
introduction to resilient distributed data sets.

Mapping Functions over RDDs

- [Instructor] Let's take a look at how we can apply mapping functions over RDDs or
Resilient Distributed Datasets. So let's start the spark-repl, and I'm running the spark-
repl from the bin directory of the spark package that I installed. So you just need to
navigate to whatever directory you installed it and navigate to the bin subdirectory and
then start spark-shell. Okay, I am going to work with a list of random numbers, so I'm
going to import a helper package, import scala.util.Random, and I'm going to create a
value called big range or bigRng for short, and I'm going to call scala.util.Random, and
from that package, I'm going to get the shuffle method, and I'm going to specify that I
want a range of one to 100000, and this'll generate random numbers for me. Great, so
now I have a collection. We'll see here it's a collection of random numbers, so I'm just
going to hit Ctrl+L to clear the screen. Now what I want to do is map this into an
RDD. So I'll specify val bigPRng for parallel, and I'll reference the SparkContext called
parallelize,and I'll pass in the data that I want to parallelize, which in this case is
bigRng, and what we see here is we now have an RDD, which is a type of parallel
collection. Let's just take a quick look at some of the members of bigPRng, and the way
you can do that is to use the take function, which is kind of like head if you worked with
that with the Linux or Mac command line. I'm going to say, take the first 25 and show
them to me, and so what we have here is we have 25 different random numbers. So we
know that our bigPRng has random numbers. Now what I'd like to do, is I'd like to
double those. So I want to create a new value called bigPRng2, which is simply bigPRng,
which is our RDD, and I want to apply the map function, and if you saw the videos on
parallel collection, you're familiar with this. So I want to apply map and I want to map to
all members of the collection and I want to multiply it by two. Now what we have, so
we've defined a new value, bigPRng2, and if we just scroll back and take a look at the
first 25 of these, we'll notice, that they've doubled the values that we had before. So our
mapping actually did double each element of the list, but of course, we can only look at
25 of them. When we work with large datasets and big data, we often want to use
statistics to give us a sense of what our data is doing. So let's look at bigPRng2 and let's get the mean, or the average. It should be around 100,000 now; our previous mean was about 50,000, so we'd expect it to double. So, we can work with
maps to apply arithmetic operations or apply functions to integers; we can also work with Booleans. So let's create a function, and I'm going to create a function called divThree, which takes a parameter, which is an integer. The function itself returns a Boolean value, and this function is defined with a val y, which is an integer equal to x, the parameter we pass in, modulo three, and then I want to return the evaluation of y==0. So that's my Boolean function. So if I call divThree and pass in the number two, I should get a false. If I call divThree and pass in the value three, I should get true. divThree and eight should be false. divThree and nine should be true, and so there we have it. Our divThree function returns whether the number is evenly divisible by three. So let's just clear the screen. Now let's apply our divThree function, and we can do that by creating another value. Let's call this big Boolean, or bigBool for short, and I want to apply this to bigPRng2, which is our doubled RDD. I want to apply the map function, and the function I want to apply is divThree, and I want to apply it to all of the elements of the RDD. So now
our bigBool is a Boolean RDD. So if we look at bigBool and take the first 25, just to get a
feel for what things look like, we should see a list of trues and falses. So that's how we
can create RDDs of different types. We worked with RDDs of integers and RDDs of Booleans. Let's take a quick look at working with strings or text values. Now I had
downloaded a text version of the Republic by Plato, and this is available at the
Gutenberg Project website, www.gutenberg.org, and the link is displayed on the
screen. Now I have downloaded that and I have it in a local directory, so I'm going to
create a value called republic, 'cause this is Plato's Republic, and I'm going to reference
the sc or the SparkContext. This is the same context we used when we called
parallelize, except now I'm going to call textFile, and now I'm going to give it the name
of the file, which is simply /Users/danielsullivan/Downloads/pg1497.txt, and really any
text file will do. Great, so now what we have done is we have created an RDD of
string. So the first thing I want to do is just clear the screen here. So now, republic, is an
RDD of string. Let's take a look at it, and let's take 25. Now for each of these, let's print it
out, make it a little easier to read. And so what you notice here is we have the first 25
lines of the text file, and this includes information about Project Gutenberg, and the fact
that it's the Republic by Plato, and it was translated by B. Jowett, and so on. Now another
thing I want to point out is that you'll notice, I was able to string together a set of
operations. I started with the republic RDD, I take 25 elements, and then each of those
elements, I call the foreach method, which essentially loops over things, and for each of
those, I call println, and this is a common pattern in Scala of stringing together
operations like that. Now let's do a little something different. Let's clear the screen, and
now we have this RDD, republic. So we have this RDD of strings. Let's filter this, for
say, lines that have the word Socrates in them. So I'm going to create a val called
linesWithSocrates, and I'm going to reference the republic, because that's the name of
my RDD. We're going to call the filter operation, and I want to pass in a function, this is
an anonymous function, and it's going to take a line, and for each line, I'm going to
apply an operation. Now the line will be a string, so I can do string operations. In this
case, I can do contains, and I want to see does it contain the word Socrates. Now what I
have, is linesWithSocrates is yet another RDD. So RDDs can create other RDDs. This one
is a subset. So let's look at linesWithSocrates, and let's take 10 lines, and let's print them
out, foreach(println), and let's see what we have. Yep, and we'll notice each of these
lines, here's 10 of them, each of them contains the word Socrates in them. So that
concludes our look at mapping functions over Resilient Distributed Datasets or RDDs.

Statistics over RDDs

- [Instructor] If we're working with Spark, we're probably working with big data, and if we're working with big data we probably want to use statistics. There are two general kinds of statistics. There are descriptive statistics, which help us understand the shape of our data and how the numbers fall in various ways. Then there's also the other branch of statistics, which helps us test hypotheses and make predictions. Let's take a quick look at some of the statistics functionality that comes with Spark and RDDs. The first thing I want to do is import some packages that we'll be using. So, I want to import scala.util.Random, which helps us with random number generation, and I'll import org.apache.spark.mllib.stat.Statistics. Okay, so I've imported a couple of packages that we'll need. I'll clear the screen so we can start at the top. The first thing I want to do is work with a big parallel dataset, and as you may remember from our previous video, we can create val bigRng, which we'll define as scala.util.Random.shuffle(1 to 100000), and let's parallelize that. And again, parallelizing means converting it into an RDD. Great, so now we have our big range, and let's create a second one, val bigPRng2. This one is equal to the first parallel range, mapped with multiplication by two. So, now we have the big ranges we're going to be working with. I'll clear the screen
once again so we can focus. What I'd like to do is take a sample. So, a sample is a subset
that's randomly selected from our collection or in this case our RDD. So, I'll create a
value. I'll just call it X and I'm going to go reference big parallel range two and I'm going
to apply the method takeSample and I'm going to indicate that when I'm sampling, I
want to sample with replacement, so I may draw a value twice. Once I draw it or pick
it, it's assumed to be put back in and it's available for drawing again, and let's draw 1,000. Great, now X, my sample, is actually an array, it's not an RDD, but it's 1,000 integers that I pulled out of bigPRng2. Now, let's run it one more time. What you'll notice
here is we have different results. That's because this is a random sample. Now, if for
some reason you want to get the same results each time, for example, you're doing
testing and you want to be able to compare results, you can specify a seed value which
is a number that the random number uses to get itself going and it's the first one it
starts with. If we always start with the same number, we'll always get the same
values. So, let's take a sample from the big range again, but now we want to specify a seed of 1234. Now when we run it twice, we get the same values
each time. So again, if you want to take a sample and work with subsets, the takeSample
method works well and you can decide if you want to sample with replacements and if
so, use true as the first option and if you want consistent results over multiple runs, you
can apply a seed. Now, let's clear the screen. As I mentioned earlier, some statistics are
descriptive, so let's take a look at some of those. Let's look at our big parallel range
two and let's look at the mean, or the average. This is a simple descriptive statistic, and some others are the min and the max. If you know the min, the max, and the mean, you're starting to get a sense of what the shape of the data is like. Now, sometimes you want a whole bunch of these common statistics. Well, it's easy to do that in Scala by specifying bigPRng2 and then just saying stats, and that'll return a number of useful statistics like the count, the mean, the standard deviation, and the max and the min that we just saw. If you're not familiar with statistics, that's fine, but we'll mention some additional things that may be useful to people who need some more statistical functionality. Let's start by looking at correlations. So, I'm going to clear
the screen. What I want to do is determine if two lists of numbers are correlated in some way; for example, do they both rise and fall at the same time? Well, let's create a couple of values to work with, and we'll call these series1 and series2. series1 is simply an array that's filled with 100,000 random doubles. And what I can do is create another value called series2, and again I call the array fill function, I fill it with 100,000 values, and the value that I assign is a random number, which happens to be a double data type. I could have said nextInt if I wanted an integer, but in this case I wanted a double. So, now I have two random series. Now, if they're truly random they really shouldn't be correlated, so let's see if that's the case. Clear the screen. The next thing I want to do is parallelize my series. I'll create val pseries1 and I'll set that equal to the Spark context's parallelize, and I'll use series1, okay. And next I'll create pseries2 and I'll parallelize
series2. Great, now we have two RDDs. Now, we can call a function in the statistics
package. Well, first I'll create a value called myCorrelation and I'll explicitly declare it a double, and then I'll make a call to the statistics package, Statistics.corr, short for correlation. Then I'll pass in pseries1, pseries2, and then I can tell it how I want to test correlation, and I want to use a test called the Pearson correlation. Now, this correlation has a value that will range from negative one if the series are negatively correlated to positive one if they're highly positively correlated. So if, as one number goes up, the other goes up, that's positive correlation. If, as one number goes up, the other number goes down, that's negative correlation. When there's no correlation at all, they appear to just change at random relative to each other, and the number should be close to zero. Well, our number is .006, so that's pretty close to zero, so our statistical measure says yeah, these two
series are not really correlated, so that's the kind of statistical tests we can do as
well. Now, another test that's sometimes useful is taking a look at your data to see how normal it is: does it follow a bell curve shape? Well, there's a way to do that, and it's called the Kolmogorov-Smirnov test. And so, let's just take a look. Let's create a new value for our distribution test, and we'll call the statistics package, and we're going to call kolmogorovSmirnovTest, and we're going to pass in pseries1. I want to know if this is normally distributed, so we pass in the name of the distribution along with zero and one, the mean and standard deviation. So, here we have our results of the Kolmogorov-Smirnov test, and it has a verbal description: a very strong presumption against the null hypothesis that the sample follows the theoretical distribution. That theoretical distribution we're talking about is the bell curve, or the normal distribution, so the test is telling us that this uniformly generated data is not normally distributed. So, that's a look at some of the things we can do with descriptive statistics as well as inferential statistics in Spark.
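
The statistics walkthrough condenses to roughly the following. The value names mirror the ones spoken in the demo; the correlation and test statistics you get will vary from run to run since the data is random.

    import scala.util.Random
    import org.apache.spark.mllib.stat.Statistics

    val bigPRng  = sc.parallelize(Random.shuffle(1 to 100000))
    val bigPRng2 = bigPRng.map(_ * 2)

    // sample 1,000 values with replacement; pass a seed for repeatable results
    val x         = bigPRng2.takeSample(true, 1000)
    val xRepeated = bigPRng2.takeSample(true, 1000, 1234)

    // a bundle of descriptive statistics: count, mean, stdev, max, min
    bigPRng2.stats()

    // two independent series of uniform random doubles, parallelized into RDDs
    val series1  = Array.fill(100000)(Random.nextDouble)
    val series2  = Array.fill(100000)(Random.nextDouble)
    val pseries1 = sc.parallelize(series1)
    val pseries2 = sc.parallelize(series2)

    // Pearson correlation: close to zero for unrelated series
    val myCorrelation: Double = Statistics.corr(pseries1, pseries2, "pearson")

    // Kolmogorov-Smirnov test against a normal distribution with mean 0 and stdev 1
    val ksResult = Statistics.kolmogorovSmirnovTest(pseries1, "norm", 0, 1)
    println(ksResult)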

Summary of Scala and Spark RDDs

- [Instructor] Let's summarize some of the key facts about Scala and Spark RDDs. RDDs
are distributed data structures. That means they run across multiple nodes. Now single-
node clusters, like we're using here, are useful for development and test, but in big data
production environments, you should consider running multiple nodes in your Spark
cluster.

Creating DataFrames

- [Instructor] In this video, I'm going to create data frames. Now, data frames, they're
kind of like relational tables. They're a data structure that's organized into rows, and
they have named columns. Now you may have heard of data frames before if you've
worked with R or the pandas package in Python. The data frames in Spark are very similar. Now, what I'd like to do here is create three data frames using some text files that are available as exercise files. So if you have access to those exercise files, you can download the text files and follow along with me. The first thing I'll do is create a value, or a local variable, called spark, which is a Spark session. So the first thing I'm going to do for that is import a package that we'll be using, and that package is called org.apache.spark.sql.SparkSession, and then I'll create a value called spark, and I'll
assign that to a session, and I'll call the builder function, and I'll give this session a
name, I'll call it DataFrameExercise, and then we'll ask Spark to get it, or create the
session. So now what I've done is, I've started the Spark shell, imported a helper
package, and then created a context, or a session, that I can work with. So I'll clear the
screen, so I have some room to work here, and now the next thing I want to do is define
a value called df_emps, and that's short for data frame about employees. And I'm going
to reference my session, which is called spark, and I want to read a file, so I'm going to
use the read operator, and I'm going to specify an option about reading this file, and my
option is that I have a header in this file. So I'm going to specify header as true. Now, I'm
reading a comma-separated file, so I'm going to use the csv method, and then I'm going
to pass in the name of the file. Now, what we've done is we've created a data frame, and
that's kind of like a relational table, and it's based on the file we just specified. And we
can take a look at the data frame called df_emps, and we can say, "Take first ten." So
you can start to see a little bit what the structure looks like. Now, if you want to know
exactly what the structure looks like, we could take df_emps and specify schema, and that'll tell us exactly what the contents are, in terms of the structure. Now
another useful operator or function that comes with a data frame, is called show. This is
a little more table like. Now, it doesn't show all of the rows. This data frame has about a
thousand rows in it, so it's just showing the first 20. Okay, let's clear the screen, and load
another data frame. I'll use Control-L, and I'll define another data frame, and we'll call
this one df_cr, because this deals with countries and regions. And I'll specify spark, which is
my context. I want to read a file. I want to specify the option that I have a header in that
file. And it's a csv file, and I'll give it the name. And it's called country_region.txt. Now,
again, let's just take a look at the contents. Okay, it's a fairly simple table, so let's use
show, to get a more structured layout. Okay. So basically, what we have here is a table which has some IDs that identify a region, the name of each region, and an indication of which country that particular region is in. Now another thing we can do with data
frames is list out the columns. So if you ever want to know, just, what are the names of
the columns, that's an easy way to do it. Let's clear the screen, and load one more
table. We'll call this value df_dd, short for department and division, and this pattern will
look familiar by now, spark.read.option("header","true"), and I'll specify the file
name, and then I'll reference df_dd and use the show function to see what we have
here. And again, it's a simple table. In this case, it's just two columns. Has a name of a
department, and then a higher level company division. So that's our introduction to
creating data frames. I'm going to leave this session running, because we'll be using
those data frames in later lessons.
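
As a recap, the data-frame loading steps look something like the sketch below. Only country_region.txt is named in the video; the other two file names are placeholders for the employee and department/division exercise files, so adjust the paths to wherever you saved them.

    import org.apache.spark.sql.SparkSession

    // spark-shell already provides a session named spark; the video recreates it explicitly
    val spark = SparkSession.builder.appName("DataFrameExercise").getOrCreate()

    // read comma-separated files that include a header row
    val df_emps = spark.read.option("header", "true").csv("employees.txt")    // placeholder name
    val df_cr   = spark.read.option("header", "true").csv("country_region.txt")
    val df_dd   = spark.read.option("header", "true").csv("dept_div.txt")     // placeholder name

    df_emps.take(10)   // first ten rows as an array of Row objects
    df_emps.schema     // column names and types
    df_emps.show()     // first 20 rows in a tabular layout
    df_cr.columns      // just the column names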

Grouping and filtering on DataFrames

- [Instructor] One of the nice things about working with data frames is that we can use SQL to perform grouping and filtering operations. I'm starting in the Spark session that I left off in our last video. So I'm just going to clear the screen, and I'm going to show a data frame called df_emps, which I created in a previous video, using df_emps.show. So this is a table of about 1,000 employees. Now I'd like to be able to use SQL with this, so I'm going to perform an operation that allows me to essentially create a temporary view. And to do that I'm going to specify df_emps, which is the name of the data frame, then createOrReplaceTempView, and I'm going to call that temporary view employees. So now what I've done is essentially told Spark to create a data structure called employees and allow me to use SQL on it. So let's clear the screen. Now let's define a new data
frame with a select statement. So I'll create val, let's call it sqldf_emps. And I'm going to
simply call it spark.sql, and now I'm going to pass in a select statement. Now what I've
done is I've executed that select statement and the results are now available in a new
data frame called sqldf_emps. Now let's perform an aggregation. First I'm going to clear the screen, and I'm going to create a new data frame. This is a SQL data frame, so I'll
prefix it with sqldf, and we'll call this emps_by_dept. And this is going to be a data
frame that's created by Spark SQL. And the SQL statement for this is select
department, then the count of the employees in that department, from employees
group by department. So this is the same kind of SQL statement with an aggregate
function, and a group by statement. Now let's take a look at that. And what we'll see
is, we have a data frame which has two columns, which is what we would expect. There's
the department column, and the count column. Okay, let's clear the screen. And now,
let's make a new aggregation. Instead of listing employees by department, let's list
employees by department and gender. So I will modify the select statement, so that it
includes in the output department, gender, and count. And my group by statement
should say department, comma, gender. And now let's show the results of that. And
what you see is, we've aggregated on department and gender, and gotten our
counts. Let's imagine we want to get a list of department names. Let's clear the screen,
and let's create a new data frame with just a list of departments. And to do that, we'll
create a new SQL data frame, which we'll simply call SQL data frame departments. And
we'll execute a Spark SQL statement. And that statement will be select distinct
department from employees. If we show the contents of that data frame, this is what we
get. We get a list of individual department names, which is exactly what we
wanted. Now let's clear the screen, and show one more kind of SQL operation that's
quite useful when we're working with data frames, and that is filtering. We'll call it
sqldf_emps_100. And we'll issue a Spark SQL command, and it will be SELECT * FROM
employees WHERE. We could put any criteria in here that's a valid SQL criteria
statement, but I'm going to just say WHERE id < 100. And now I'm going to look at
sqldf_emps_100 and show the results of that. So what we have is the top 20 rows of
sqldf_emps_100. So that's a look at how to work with SQL for filtering and aggregation.
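
Put together, the SQL operations from this demo look roughly like this; the column names department, gender, and id are taken from the employees data shown on screen, and the value names are as I've rendered them.

    // expose the data frame to Spark SQL under the name "employees"
    df_emps.createOrReplaceTempView("employees")

    // a plain select
    val sqldf_emps = spark.sql("SELECT * FROM employees")

    // aggregation: counts by department, then by department and gender
    val sqldf_emps_by_dept = spark.sql(
      "SELECT department, COUNT(*) FROM employees GROUP BY department")
    val sqldf_emps_by_dept_gender = spark.sql(
      "SELECT department, gender, COUNT(*) FROM employees GROUP BY department, gender")

    // distinct department names
    val sqldf_departments = spark.sql("SELECT DISTINCT department FROM employees")

    // filtering
    val sqldf_emps_100 = spark.sql("SELECT * FROM employees WHERE id < 100")

    sqldf_emps_by_dept.show()
    sqldf_emps_100.show()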

Joining DataFrames

- [Instructor] Now another common operation is joining tables. Now with Spark SQL we
can join DataFrames. So I'm going to pick up where I left off in the previous lesson with
my Scala REPL active here. I'm going to just clear the screen. And just as a refresher I'm
going to show the contents of a DataFrame called df_emps. And there's the first 20 rows of df_emps. I also have another DataFrame that deals with countries and regions. And that's
called df_cr. And let's show the contents of those. Now you'll notice both the employees
table and the country region table have a column called region ID. That means I can join
these two DataFrames. Joining is really simple in Spark SQL. Let's clear the screen. And
let's take a quick look at how to do that. Let's create a new DataFrame; we'll simply call it df_joined. And it's going to be a join of the DataFrame called df_emps. I'm going to apply the join operation to it, and I'm going to tell Spark to join df_emps to the DataFrame called df_cr, for country regions. And I want to join on the region ID column. And now let's look at the columns in the DataFrame df_joined. What you'll notice here is that these are the same columns that one finds in df_emps, plus the columns that are found in df_cr for country region, which is what we would expect with a join. Let's take a look at the contents of df_joined. Now the contents are running over lines here because each row is a little bit long. But what you'll notice is this is the same content that we listed when we showed the employees table, plus we now have in each row the country and region information. So that just shows how simple it is to perform joins with Spark SQL.
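
A minimal sketch of the join, assuming the shared column is named region_id in both files (check the headers in your own copies and adjust accordingly):

    // join on the column the two data frames have in common
    val df_joined = df_emps.join(df_cr, "region_id")

    df_joined.columns   // the columns of df_emps plus those of df_cr
    df_joined.show()
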
Working with JSON files

- [Narrator] In this session I'd like to show how easy it is to work with JSON files when we're working with Spark. Now I've started my Spark REPL. Now if you've viewed previous
lessons, you're probably familiar with the idea of creating a Spark session, so I'm simply
going to import org.apache.spark.sql.SparkSession and I'm going to create a context
variable called Spark and I'm going to make a Spark session. I'm going to call the
builder and I'll give this an app name. And we're going to call this
"DataFrameExercise" and I'm going to simply ask Spark to get or create this Spark
session. Great, so now I have a Spark session that I can work with to load my JSON
file. So I'm going to clear the screen and I'm going to load up a JSON file. And let me just
flip over to a text editor to show you what that looks like. This is the same data that
we've worked with in the department division table. Instead of using a comma-
separated file, I formatted this as a JSON file. So let's load that into Spark. The way we'll
do this is we'll create a value for the data frame, and in its name I'll just note that JSON is the source of this, and I'll use the shorthand "dd" for department and division. And I'll reference the Spark session and I'll say I want to read a file. Now it's a JSON file, so I'll specify json and
I'll give it the file name. Now that creates a data frame. Let's show it. And what we have
here is a data frame that has exactly the same structure as the data frame we created in
previous exercises, but where we used a text file instead of a JSON file. So what we have
here is the ability to load data from text or from JSON. And actually there are a number
of other file formats that Spark works with, and I would just suggest you consult the
Spark documentation for that. So that concludes our look at loading JSON files into
Spark data frames.
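
Loading the JSON version boils down to one read call. The value and file names here are placeholders for the department/division JSON file shown in the video:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("DataFrameExercise").getOrCreate()

    // read a JSON file directly into a data frame
    val df_json_dd = spark.read.json("dept_div.json")   // placeholder file name
    df_json_dd.show()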

Summary of Scala and Spark DataFrames

- [Instructor] DataFrames are a really useful data structure for data scientists working with
Spark and Scala. DataFrames are table-like data structures and in Spark it's very easy to
load data from either Comma Separated Value files or JSON files, and in fact several
other formats are supported as well. One of the especially useful features about
DataFrames is that we can use SQL statements to filter and aggregate the data. We can
also join DataFrames to create new DataFrames based on data that we already have in
existing DataFrames.
