
DPLYR AND IMPORT DATA

SORTING DATA FRAMES
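A minimal sketch of sorting with dplyr, assuming the package is installed. The `cars_df` table is made up for illustration and does not come from the slides. `arrange()` orders rows by a column, and wrapping the column in `desc()` reverses the order.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
cars_df <- data.frame(
  model = c("A", "B", "C", "D"),
  cyl   = c(4, 6, 4, 8),
  hp    = c(90, 110, 65, 200)
)

# Rows ordered by horsepower, lowest first
by_hp <- arrange(cars_df, hp)

# Rows ordered by horsepower, highest first
by_hp_desc <- arrange(cars_df, desc(hp))
```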


NESTED SORTING
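Nested sorting means ordering by one column and breaking ties with another. As a hedged sketch on a made-up table (the data and the name `cars_df` are assumptions, not taken from the slides), passing several columns to `arrange()` sorts by the first and then by the second within ties:

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
cars_df <- data.frame(
  model = c("A", "B", "C", "D"),
  cyl   = c(4, 6, 4, 8),
  hp    = c(90, 110, 65, 200)
)

# Sort by cylinders (ascending), then by horsepower (descending) within each
# cylinder count
nested <- arrange(cars_df, cyl, desc(hp))
```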
THE TOP N
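One way to pick the rows with the largest values of a column is `top_n()`, sketched here on a made-up table (the data and names are illustrative assumptions). `top_n()` keeps the original row order, so an `arrange()` step is added to sort the result. In recent dplyr versions `top_n()` is superseded by `slice_max()`, but it still works.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
cars_df <- data.frame(
  model = c("A", "B", "C", "D"),
  cyl   = c(4, 6, 4, 8),
  hp    = c(90, 110, 65, 200)
)

# Keep the 2 rows with the highest horsepower, then sort them
top2 <- cars_df %>%
  top_n(2, hp) %>%
  arrange(desc(hp))
```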
GROUP
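Grouped summaries in dplyr usually combine `group_by()` with `summarize()`. The sketch below uses a made-up table (the data and the column names `mean_hp` and `n_cars` are assumptions for illustration) and computes one summary row per group:

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
cars_df <- data.frame(
  model = c("A", "B", "C", "D"),
  cyl   = c(4, 6, 4, 8),
  hp    = c(90, 110, 65, 200)
)

# One row per cylinder count, with the mean horsepower and the group size
stats <- cars_df %>%
  group_by(cyl) %>%
  summarize(mean_hp = mean(hp), n_cars = n())
```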
EXERCISE

• Load the dataset mtcars.


• Sort the mtcars data by weight (wt) from heavy to light. What is the heaviest
car?
• Group the data by number of cylinders (cyl) and find the mean, sd, min and
max of miles per (US) gallon (mpg) for each group.
• What are the 5 cars with the highest horsepower (hp)?
IMPORTING DATA
• We have been using data sets already stored as R objects.
• A data scientist will rarely have such luck and will have to import data into R
from either a file, a database, or other sources.
• Currently, one of the most common ways of storing and sharing data for
analysis is through electronic spreadsheets.
• A spreadsheet stores data in rows and columns.
• It is basically a file version of a data frame.
• When saving such a table to a computer file, one needs a way to mark where
one row or column ends and the next begins. This in turn defines the cells
in which single values are stored.
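To make the idea of row and column separators concrete, this small sketch writes a made-up data frame to a temporary CSV file and reads the raw text back. In a CSV file a comma ends a column and a line break ends a row. The data and the variable names are illustrative assumptions.

```r
# Hypothetical toy data, made up for illustration
df <- data.frame(name = c("a", "b"), value = c(1.5, 2))

# Save the table as a comma-separated file in a temporary location
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

# Each element is one line of the file: a header line, then one line per row
raw_lines <- readLines(path)
print(raw_lines)
```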
WORKING DIRECTORY
Getting working directory
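In base R, `getwd()` returns the current working directory as a character string. A minimal sketch:

```r
# Path of the directory R currently reads from and writes to by default
wd <- getwd()
print(wd)
```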

Changing working directory
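In base R, `setwd()` changes the working directory. The sketch below switches to the session's temporary directory, used here only as a stand-in for a project folder, and then switches back:

```r
# Remember where we started so we can return afterwards
old_wd <- getwd()

# Move to another directory (tempdir() is only a placeholder target)
setwd(tempdir())
new_wd <- getwd()

# Restore the original working directory
setwd(old_wd)
```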


READ FILE
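A hedged sketch of reading a CSV file with base R's `read.csv()`. Since the slides' file is not available here, the example first creates a small made-up file in a temporary location. In practice `path` would point to your own file.

```r
# Create a small made-up CSV file to read (stand-in for a real data file)
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "x,1", "y,2"), path)

# Read the file into a data frame; the first line is used as column names
dat <- read.csv(path)
str(dat)
```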

Additional functions

For txt files
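For plain-text files, one option in base R is `read.table()`, where you state the separator yourself. This sketch assumes a tab-separated file with a header line. The file is made up and written to a temporary location for illustration.

```r
# Create a small made-up tab-separated text file
path <- tempfile(fileext = ".txt")
writeLines(c("name\tscore", "x\t1", "y\t2"), path)

# header = TRUE: first line holds column names; sep = "\t": columns split on tabs
dat <- read.table(path, header = TRUE, sep = "\t")
str(dat)
```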


DOWNLOADING FILES

• Another common place for data to reside is on the internet.


• When these data are in files, we can download them and then import them or
even read them directly from the web.
• For example, we note that because our dslabs package is on GitHub, the file
we downloaded with the package has a url:
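The URL from the slide is not reproduced here, so the sketch below uses a stand-in: a made-up local file exposed through a `file://` URL (this assumes a Unix-like path). With a real `http(s)://` address the same two approaches apply: download with `download.file()` and then import, or pass the URL straight to `read.csv()`.

```r
# Stand-in for a remote file: a made-up local CSV addressed by a file:// URL
src <- tempfile(fileext = ".csv")
writeLines(c("state,total", "AA,10", "BB,20"), src)
url <- paste0("file://", normalizePath(src, winslash = "/"))

# Option 1: download the file to disk, then import it
dest <- tempfile(fileext = ".csv")
download.file(url, destfile = dest, quiet = TRUE)
dat <- read.csv(dest)

# Option 2: read directly from the URL without saving a copy
dat_direct <- read.csv(url)
```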
EXERCISE

• Using a for loop, simulate flipping a coin twenty times, keeping track of the
individual outcomes (1 = heads, 0 = tails) in a vector that you preallocate.
• Use a while loop to investigate the number of terms required before the
product 1⋅2⋅3⋅4⋅… rises above 10 million.
GGPLOT2
The first step in learning ggplot2 is to be able to break a graph apart into components. Let’s break down the plot
above and introduce some of the ggplot2 terminology.
The three main components to note are:

• Data: The US murders data table is being summarized. We refer to this as the data component.
• Geometry: The plot above is a scatterplot. This is referred to as the geometry component. Other possible
geometries are barplot, histogram, smooth densities, qqplot, and boxplot.
• Aesthetic mapping: The plot uses several visual cues to represent the information provided by the dataset. The two
most important cues in this plot are the point positions on the x-axis and y-axis.
Color is another visual cue, which we map to region. We refer to this as the aesthetic mapping component.
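The three components can be seen in code. This is a minimal sketch on a made-up table rather than the murders data (the data frame `toy` and its columns are assumptions for illustration), assuming ggplot2 is installed:

```r
library(ggplot2)

# Hypothetical toy data, made up for illustration
toy <- data.frame(
  x      = c(1, 2, 3, 4),
  y      = c(2.1, 3.9, 6.2, 8.1),
  region = c("East", "West", "East", "West")
)

# Data component: the data frame passed to ggplot()
# Aesthetic mapping: x and y positions, plus color mapped to region
# Geometry: geom_point() draws a scatterplot
p <- ggplot(data = toy, aes(x = x, y = y, color = region)) +
  geom_point()
```

Calling `print(p)` draws the plot. Swapping `geom_point()` for another geometry changes the plot type while the data and mapping stay the same.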
EXERCISE

• Use the “mtcars” dataset (built-in) to make a scatterplot of “weight” and
“mpg”.
• Set the color of the points to the “cyl” variable.
CLASSWORK
1. Load in the dataset movies.csv.
2. Find a subset of the movies produced after 2010. Save the subset in ‘movies.sub’
variable.
3. Keep columns ‘name’, ‘director’, ‘year’, ‘country’, ‘genre’, ‘budget’, ‘gross’,
‘score’ in the ‘movies.sub’.
4. Find the profit for each movie in ‘movies.sub’ as a fraction of its budget. Convert
the ‘budget’ and ‘gross’ columns to units of millions of dollars, rounded to the
first decimal place. Use round() to round numbers.
5. Count the number of movies in ‘movies.sub’ in each genre, and order
them by count in descending order.
6. Now group the movies in ‘movies.sub’ by country and genre. Then, for each
group, count the number of movies (use the function n()) and compute the
median fractional profit and the mean and standard deviation of the movie
score.
7. Using chaining and pipes, for each genre find the three directors with the top
mean movie scores received for the movies produced after 2000, but do not
include the directors with fewer than 3 movies in total. Hint: Use top_n()
function to select top n from each group.
8. Pick your favourite genre and the top 3 directors to find movie
recommendations for your next movie night!
