Excel Basics Cheat Sheet

Learn Excel online at www.DataCamp.com

> Definitions

This cheat sheet describes the behavior of the Microsoft 365 version of Excel; slight differences exist between Excel versions.

Spreadsheet: An application, like Microsoft Excel, where you can store data, perform calculations, and organize information.
Workbook: A file containing a collection of one or more worksheets.
Worksheet: A single page in a workbook. It is a grid of cells arranged in rows and columns.
Cell: A rectangular box in a worksheet that can store a data value, a formula, or other content.
Formula: A piece of code to perform a calculation. Formulas start with an equals sign (=), and contain functions, mathematical operators, values, and cell references.
Cell reference: The location of a cell. The column is described with letters and the row is described with numbers. For example, the cell in the 4th column, 7th row would be denoted D7.
Cell range: A group of adjacent cells in a worksheet. A cell range is typically referred to by its upper-left and lower-right cells, such as A1:C3, referring to the cells in columns A, B, and C and rows 1 through 3. You can use cell ranges to perform calculations on multiple cells at once or to apply formatting to a group of cells.

-  A        B        C
1  Cell A1  Cell B1  Cell C1
2  Cell A2  Cell B2  Cell C2
3  Cell A3  Cell B3  Cell C3

> Getting help

You can get help by accessing the help menu:

Open Microsoft Excel
Click on the "Help" menu at the top of the screen
In the Help menu, you will see various options for getting help, including a search bar where you can enter keywords to search for specific topics
You can also click on the "Help" button to open the Help pane, where you can browse through various topics and find answers to common questions

How to add a comment to a cell

Click on the cell where you want to add a comment
Right-click or CTRL+click on the cell and select the "New Comment" option from the context menu. You can also click on the Insert menu, then "New Comment"
This will open a small text box next to the cell, where you can type your comment
Once you have entered your comment, click the green arrow button to save it

> Cells and ranges

Specifying cell locations with column letter, row number format
=B2 Here we refer to the cell in column B, row 2.

Specifying absolute cell references with $ prefixes
The $ symbol before the column letter and/or row number tells Excel that the reference is absolute and should not change when the formula is copied or moved to another cell. The following examples all specify column B, row 2.
=$B$2 Column and row references are both absolute
=$B2 Column reference is absolute, row reference is relative
=B$2 Column reference is relative, row reference is absolute

Specifying ranges with the start:end format
The start:end format is a convenient way to specify a range of cells in a formula.
Here is an example of start:end format when using the SUM() formula:
=SUM(B2:B5)

> Example dataset

Throughout most of this cheat sheet, we'll be using this dummy dataset of 5 columns and 6 rows.

-  A        B   C    D                      E
1  1        2   6    World                  1/1/2023
2  3        3   21   Solar System           1/2/2023
3  6        5   28   Milky Way              1/3/2023
4  10       7   301  Local Group            1/4/2023
5  (blank)  11  325  Laniakea Supercluster  1/5/2023
6  21       13  496  Universe               1/6/2023

> Operators

Arithmetic operators
=A2 + A3 Add two values with +. This example returns 3 + 6 = 9
=A4 - B4 Subtract a value from another with -. This example returns 10 - 7 = 3
=A6 * B1 Multiply two values with *. This example returns 21 * 2 = 42
=C3 / B4 Divide two values with /. This example returns 28 / 7 = 4
=C5% Convert a value to a percentage with %. This example returns 3.25
=B1 ^ C1 Raise a value to a power with ^. This example returns 2 ^ 6 = 64

Numeric comparison operators

Test for equality with =
=A1 = B1 Returns 1 = 2 which is FALSE
=A2 = B2 Returns 3 = 3 which is TRUE

Test for inequality with <>
=A1 <> B1 Returns 1 <> 2 which is TRUE
=A2 <> B2 Returns 3 <> 3 which is FALSE

Test greater than with >
=A3 > B3 Returns 6 > 5 which is TRUE
=A2 > B2 Returns 3 > 3 which is FALSE

Test greater than or equal to with >=
=A3 >= B3 Returns 6 >= 5 which is TRUE
=A2 >= B2 Returns 3 >= 3 which is TRUE

Test less than with <
=A1 < B1 Returns 1 < 2 which is TRUE
=A2 < B2 Returns 3 < 3 which is FALSE

Test less than or equal to with <=
=A1 <= B1 Returns 1 <= 2 which is TRUE
=A2 <= B2 Returns 3 <= 3 which is TRUE

> Logical functions

Logical NOT with NOT()
=NOT(A1 = B1) Returns NOT(1 = 2) which is TRUE
=NOT(A2 = B2) Returns NOT(3 = 3) which is FALSE

Logical AND with AND()
=AND(A1 > 10, B1 < 20) Returns AND(1 > 10, 2 < 20) which is FALSE
=AND(A1 < 2, B1 < 20) Returns AND(1 < 2, 2 < 20) which is TRUE

Logical OR with OR()
=OR(A1 > 10, B1 < 20) Returns OR(1 > 10, 2 < 20) which is TRUE
=OR(A1 < 2, B1 < 20) Returns OR(1 < 2, 2 < 20) which is TRUE

Logical XOR with XOR()
=XOR(A1 > 10, B1 < 20) Returns XOR(1 > 10, 2 < 20) which is TRUE
=XOR(A1 > 10, B1 > 20) Returns XOR(1 > 10, 2 > 20) which is FALSE

> Flow control

Use a logical condition to determine the return value with IF()
=IF(cond, return_if_true, return_if_false)
=IF(ISBLANK(A5), "A5 is blank", "A5 is not blank") Returns "A5 is blank"
Takes a logical condition, cond, as its first argument. If cond is TRUE, IF() returns the value specified in the second argument (return_if_true); if cond is FALSE, IF() returns the value specified in the third argument (return_if_false).

Use multiple logical conditions to determine the return value with IFS()
=IFS(cond1, return1, cond2, return2)
=IFS(A1 > B1, "1st", A2 > B2, "2nd", A3 > B3, "3rd") Returns "3rd"
Similar to IF(), but allowing multiple pairs of logical conditions and return values. If the first condition, cond1, is TRUE then the function returns the first return value, return1. If the second condition, cond2, is TRUE, the function returns the second return value; and so on.

Provide a default value in case of errors with IFERROR()
=IFERROR(value, value_if_error)
=IFERROR(A5 / A5, 1) Division of two missing values gives an error; this returns 1
If the first input does not result in an error then it is returned. If it does result in an error, the second input is returned.

Choose a return value based on a table of inputs with SWITCH()
=SWITCH(value, choice1, return1, choice2, return2, ...)
=SWITCH(MID(D3, 1, 5), "World", "planet", "Solar", "planetary system", "Milky", "galaxy", "Local", "galaxy group") Returns "galaxy"
Takes a value as its first argument, followed by pairs of choices and return values. If the value matches the first choice, the function returns the first return value; if the value matches the second choice, the function returns the second return value; and so on. If no values match, the function returns an error.

> Data types

=ISNUMBER(A1) Checks if a cell is a number. Returns TRUE
=ISTEXT(D1) Checks if a cell is text. Returns TRUE
=ISLOGICAL(A1) Checks if a cell is a boolean. Returns FALSE
=ISLOGICAL(A1=A1) Checks if a value is a boolean. Returns TRUE
=N(E1) Converts to number. Returns 44927: the serial date - the date as a number, counting Dec 31st 1899 as 0
=N(D1) Converts to number. Returns 0, since the cell contains text rather than a number
=VALUETOTEXT(A1) Convert to text. Returns "1"
=TEXT(C6, "0.00E+0") Convert to formatted text. Returns "4.96E+2"
=DATEVALUE("1/1/2023") Convert text to serial. Returns 44927: the serial date

> Text functions and operators

Basics
=LEN(D5) Returns the length of a string in characters. This example returns 21.

Combining and splitting strings
="Hello " & D1 & "!" Returns "Hello World!"
=REPT(D6, 3) Repeats text. This example returns "UniverseUniverseUniverse"
=TEXTSPLIT(D4, "o") Splits a string on a delimiter. This example returns "L", "cal Gr", "up" in 3 cells: "Local Group" split on the letter "o"
=TEXTSPLIT(D5, {"a","u"}) Splits a string on a delimiter. This example returns "L", "ni", "ke", " S", "percl", "ster" in 6 cells: "Laniakea Supercluster" split on the letter "a" or the letter "u"

Mutating strings
=MID(text, start, [length]) Extracts a substring starting at the position specified in the second argument and with the length specified in the third argument. For example =MID(D6, 4, 5) Returns "verse"
=UPPER(text) Converts the text to uppercase. For example =UPPER(D3) Returns "MILKY WAY"
=LOWER(text) Converts the text to lowercase. For example =LOWER(D3) Returns "milky way"
=PROPER(text) Converts the text to title case. For example =PROPER("milky way") Returns "Milky Way"

> Data manipulation

=FILTER(A1:B6, C1:C6>100) Gets a subset of the cell range in the first input that meets the condition in the second input
=SORT(A1:E6, 4) Sorts the rows of the data according to values in specified columns. This example returns the dataset with rows in alphabetical order of the fourth column
=SORTBY(A1:E6, D1:D6) Alternate, more flexible, syntax for sorting. Rather than specifying the column number, you specify an array to sort by. This example returns the same as the SORT() example
=UNIQUE(A1:A6) Gets a list of unique values from the specified data
=SEQUENCE(5, 1, 3, 2) Generates a sequence of numbers, starting at the specified start value and with the specified step size. This example returns 5 rows and 1 column containing the values 3, 5, 7, 9, 11

> Counting data

=COUNT(A5:E5) Returns 3: the number of cells in the range containing numbers, dates and currencies
=COUNTA(A5:E5) Returns 4: the number of cells in the range that aren't empty
=COUNTBLANK(A5:E5) Returns 1: the number of cells that are empty or contain the empty string ("")

> Math functions

=LOG(100, 10) Returns 2: the base 10 logarithm of 100
=EXP(2) Returns e^2 = 7.39
=MAX(A1:A6, C1:C3, 12) Returns 28: the largest value in all cell ranges or values inputted
=MIN(A1:A6, C1:C3, 12) Returns 1: the smallest value in all cell ranges or values inputted
=MAXA(A1:A6, C1:C3, FALSE) Works like MAX(), except TRUE is valued at 1 and FALSE is valued at 0
=MINA(A1:A6, C1:C3, FALSE) Works like MIN(), except TRUE is valued at 1 and FALSE is valued at 0
=SUM(A1:A6, C1:C3, 12) Returns 108: the total of all cell ranges or values inputted
=AVERAGE(A1:A6, C1:C3, 12) Returns 12: the mean of all cell ranges or values inputted
=MEDIAN(A1:A6, C1:C3, 12) Returns 10: the median of all cell ranges or values inputted
=PERCENTILE.INC(C1:C6, 0.25) Returns 22.75: the 25th percentile of the cell range
=ROUND(PI(), 2) Returns 3.14: pi rounded to 2 decimal places
=CEILING(PI(), 0.1) Returns 3.2: pi rounded upwards to the nearest 0.1
=FLOOR(PI(), 0.1) Returns 3.1: pi rounded downwards to the nearest 0.1
=VAR.S(B1:B6) Returns 19.37: sample variance of the cell range
=STDEV.S(B1:B6) Returns 4.40: sample standard deviation of the cell range

> Conditional computation

Get the number of cells that meet a condition with COUNTIF()
=COUNTIF(A1:A6, ">5") Returns 3: the number of cells greater than 5, ignoring blanks
=COUNTIF(D1:D6, "Milky Way") Returns 1: the number of cells equal to "Milky Way"

Calculate the total of cells meeting conditions with SUMIF() and SUMIFS()
=SUMIF(A1:A6, ">5") Returns 37: the sum of the values in A1 to A6 that are greater than 5
=SUMIF(A1:A6, ">5", B1:B6) Returns 25: the sum of elements in B1 to B6 corresponding to values in A1 to A6 that are greater than 5
=SUMIFS(B1:B6, A1:A6, ">5", D1:D6, "<>Local Group") Returns 18: the sum of B1:B6 where A1:A6 is greater than 5 and D1:D6 is not equal to "Local Group"

Calculate the mean of cells meeting conditions with AVERAGEIF() & AVERAGEIFS()
=AVERAGEIF(A1:A6, ">5") Returns 12.33: the mean of the values in A1 to A6 that are greater than 5
=AVERAGEIF(A1:A6, ">5", B1:B6) Returns 8.33: the mean of elements in B1 to B6 corresponding to values in A1 to A6 that are greater than 5
=AVERAGEIFS(B1:B6, A1:A6, ">5", D1:D6, "<>Local Group") Returns 9: the mean of B1:B6 where A1:A6 is greater than 5 and D1:D6 is not equal to "Local Group"
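The conditional computation and math results quoted in this cheat sheet can be sanity-checked outside Excel. The sketch below is only an illustration in Python, not how Excel works internally: the helper names (matching_rows, countifs, sumifs, averageifs) are hypothetical, criteria are passed as Python functions rather than strings like ">5", and cell A5 is assumed to be blank, as the ISBLANK(A5) and COUNTBLANK(A5:E5) examples imply.

```python
from statistics import mean, median, quantiles, stdev, variance

# Columns A-D of the example dataset. None stands for the blank cell A5,
# an assumption based on the ISBLANK(A5) and COUNTBLANK(A5:E5) examples.
A = [1, 3, 6, 10, None, 21]
B = [2, 3, 5, 7, 11, 13]
C = [6, 21, 28, 301, 325, 496]
D = ["World", "Solar System", "Milky Way", "Local Group",
     "Laniakea Supercluster", "Universe"]


def matching_rows(*criteria):
    """Row indexes where every (cells, test) pair holds; blank cells never match."""
    row_count = len(criteria[0][0])
    return [
        i for i in range(row_count)
        if all(cells[i] is not None and test(cells[i]) for cells, test in criteria)
    ]


def countifs(*criteria):
    """Hypothetical stand-in for COUNTIF(): count the rows meeting the criteria."""
    return len(matching_rows(*criteria))


def sumifs(values, *criteria):
    """Hypothetical stand-in for SUMIF()/SUMIFS(): sum values on matching rows."""
    return sum(values[i] for i in matching_rows(*criteria))


def averageifs(values, *criteria):
    """Hypothetical stand-in for AVERAGEIF()/AVERAGEIFS()."""
    return mean(values[i] for i in matching_rows(*criteria))


def greater_than_5(x):
    return x > 5


def not_local_group(text):
    return text != "Local Group"


count_a = countifs((A, greater_than_5))                       # COUNTIF(A1:A6, ">5")
sum_a = sumifs(A, (A, greater_than_5))                        # SUMIF(A1:A6, ">5")
sum_b = sumifs(B, (A, greater_than_5))                        # SUMIF(A1:A6, ">5", B1:B6)
sum_b_filtered = sumifs(B, (A, greater_than_5), (D, not_local_group))
avg_a = averageifs(A, (A, greater_than_5))                    # AVERAGEIF(A1:A6, ">5")
avg_b_filtered = averageifs(B, (A, greater_than_5), (D, not_local_group))

# Inputs of SUM/AVERAGE/MEDIAN(A1:A6, C1:C3, 12); the blank cell is skipped.
mixed_inputs = [a for a in A if a is not None] + C[:3] + [12]
total = sum(mixed_inputs)
average = mean(mixed_inputs)
middle = median(mixed_inputs)

# PERCENTILE.INC interpolates inclusively, like method="inclusive" here.
percentile_25 = quantiles(C, n=4, method="inclusive")[0]
sample_variance = variance(B)   # VAR.S(B1:B6)
sample_stdev = stdev(B)         # STDEV.S(B1:B6)

print(count_a, sum_a, sum_b, sum_b_filtered, round(avg_a, 2), avg_b_filtered)
print(total, average, middle, percentile_25,
      round(sample_variance, 2), round(sample_stdev, 2))
```

Passing each criterion as a (range, test) pair keeps the single-condition and multi-condition cases in one code path, which mirrors how SUMIFS() generalizes SUMIF().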
SQL for Data Science

SQL Basics Cheat Sheet

Learn SQL online at www.DataCamp.com

What is SQL?

SQL stands for "structured query language". It is a language used to query, analyze, and manipulate data from databases. Today, SQL is one of the most widely used tools in data.

> The different dialects of SQL

Although SQL languages all share a basic structure, some of the specific commands and styles can differ slightly. Popular dialects include MySQL, SQLite, SQL Server, Oracle SQL, and more. PostgreSQL is a good place to start, since it's close to standard SQL syntax and is easily adapted to other dialects.

> Sample Data

Throughout this cheat sheet, we'll use the columns listed in this sample table of airbnb_listings

airbnb_listings

id  city      country  number_of_rooms  year_listed
1   Paris     France   5                2018
2   Tokyo     Japan    2                2017
3   New York  USA      2                2022

> Querying tables

1. Get all the columns from a table
SELECT *
FROM airbnb_listings;

2. Return the city column from the table
SELECT city
FROM airbnb_listings;

3. Get the city and year_listed columns from the table
SELECT city, year_listed
FROM airbnb_listings;

4. Get the listing id, city, ordered by the number_of_rooms in ascending order
SELECT id, city
FROM airbnb_listings
ORDER BY number_of_rooms ASC;

5. Get the listing id, city, ordered by the number_of_rooms in descending order
SELECT id, city
FROM airbnb_listings
ORDER BY number_of_rooms DESC;

6. Get the first 5 rows from the airbnb_listings table
SELECT *
FROM airbnb_listings
LIMIT 5;

7. Get a unique list of cities where there are listings
SELECT DISTINCT city
FROM airbnb_listings;

> Filtering Data

Filtering on numeric columns

1. Get all the listings where number_of_rooms is greater than or equal to 3
SELECT *
FROM airbnb_listings
WHERE number_of_rooms >= 3;

2. Get all the listings where number_of_rooms is greater than 3
SELECT *
FROM airbnb_listings
WHERE number_of_rooms > 3;

3. Get all the listings where number_of_rooms is exactly equal to 3
SELECT *
FROM airbnb_listings
WHERE number_of_rooms = 3;

4. Get all the listings where number_of_rooms is less than or equal to 3
SELECT *
FROM airbnb_listings
WHERE number_of_rooms <= 3;

5. Get all the listings where number_of_rooms is less than 3
SELECT *
FROM airbnb_listings
WHERE number_of_rooms < 3;

6. Get all the listings with 3 to 6 rooms
SELECT *
FROM airbnb_listings
WHERE number_of_rooms BETWEEN 3 AND 6;

Filtering on text columns

7. Get all the listings that are based in 'Paris'
SELECT *
FROM airbnb_listings
WHERE city = 'Paris';

8. Get the listings based in the 'USA' and in 'France'
SELECT *
FROM airbnb_listings
WHERE country IN ('USA', 'France');

9. Get all the listings where the city starts with 'j' and where the city does not end in 't'
SELECT *
FROM airbnb_listings
WHERE city LIKE 'j%' AND city NOT LIKE '%t';

Filtering on multiple columns

10. Get all the listings in 'Paris' where number_of_rooms is greater than 3
SELECT *
FROM airbnb_listings
WHERE city = 'Paris' AND number_of_rooms > 3;

11. Get all the listings in 'Paris' OR the ones that were listed after 2012
SELECT *
FROM airbnb_listings
WHERE city = 'Paris' OR year_listed > 2012;

Filtering on missing data

12. Return the listings where number_of_rooms is missing
SELECT *
FROM airbnb_listings
WHERE number_of_rooms IS NULL;

13. Return the listings where number_of_rooms is not missing
SELECT *
FROM airbnb_listings
WHERE number_of_rooms IS NOT NULL;

> Aggregating Data

Simple aggregations

1. Get the total number of rooms available across all listings
SELECT SUM(number_of_rooms)
FROM airbnb_listings;

2. Get the average number of rooms per listing across all listings
SELECT AVG(number_of_rooms)
FROM airbnb_listings;

3. Get the highest number of rooms across all listings
SELECT MAX(number_of_rooms)
FROM airbnb_listings;

4. Get the lowest number of rooms across all listings
SELECT MIN(number_of_rooms)
FROM airbnb_listings;

Grouping, filtering, and sorting

5. Get the total number of rooms for each country
SELECT country, SUM(number_of_rooms)
FROM airbnb_listings
GROUP BY country;

6. Get the average number of rooms for each country
SELECT country, AVG(number_of_rooms)
FROM airbnb_listings
GROUP BY country;

7. Get the maximum number of rooms per country
SELECT country, MAX(number_of_rooms)
FROM airbnb_listings
GROUP BY country;

8. Get the lowest number of rooms per country
SELECT country, MIN(number_of_rooms)
FROM airbnb_listings
GROUP BY country;

9. For each country, get the average number of rooms per listing, sorted in ascending order
SELECT country, AVG(number_of_rooms) AS avg_rooms
FROM airbnb_listings
GROUP BY country
ORDER BY avg_rooms ASC;

10. For Japan and the USA, get the average number of rooms per listing in each country
SELECT country, AVG(number_of_rooms)
FROM airbnb_listings
WHERE country IN ('USA', 'Japan')
GROUP BY country;

11. Get the number of cities per country, where there are listings
SELECT country, COUNT(city) AS number_of_cities
FROM airbnb_listings
GROUP BY country;

12. Get all the years where there were more than 100 listings per year
SELECT year_listed
FROM airbnb_listings
GROUP BY year_listed
HAVING COUNT(id) > 100;
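To try a few of these queries without installing a database server, one option is Python's built-in sqlite3 module. The sketch below is an assumption-laden illustration: it uses the SQLite dialect rather than PostgreSQL, guesses the column types, loads only the three sample rows, and the run helper is a hypothetical convenience function. The extra ORDER BY terms are added here only to make the row order deterministic.

```python
import sqlite3

# In-memory SQLite database holding the three sample rows (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE airbnb_listings (
        id INTEGER PRIMARY KEY,
        city TEXT,
        country TEXT,
        number_of_rooms INTEGER,
        year_listed INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO airbnb_listings VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Paris", "France", 5, 2018),
        (2, "Tokyo", "Japan", 2, 2017),
        (3, "New York", "USA", 2, 2022),
    ],
)


def run(sql):
    """Hypothetical helper: execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Filtering on a numeric column.
at_least_3 = run(
    "SELECT id, city FROM airbnb_listings WHERE number_of_rooms >= 3;"
)

# Filtering on a text column with IN.
usa_or_france = run(
    "SELECT city FROM airbnb_listings "
    "WHERE country IN ('USA', 'France') ORDER BY id;"
)

# Simple aggregation.
total_rooms = run("SELECT SUM(number_of_rooms) FROM airbnb_listings;")

# Grouping and sorting; country is added as a tie-breaker for equal averages.
avg_by_country = run(
    "SELECT country, AVG(number_of_rooms) AS avg_rooms "
    "FROM airbnb_listings GROUP BY country "
    "ORDER BY avg_rooms ASC, country ASC;"
)

# HAVING filters groups after aggregation; no year has more than 100 listings here.
busy_years = run(
    "SELECT year_listed FROM airbnb_listings "
    "GROUP BY year_listed HAVING COUNT(id) > 100;"
)

print(at_least_3)       # [(1, 'Paris')]
print(usa_or_france)    # [('Paris',), ('New York',)]
print(total_rooms)      # [(9,)]
print(avg_by_country)   # [('Japan', 2.0), ('USA', 2.0), ('France', 5.0)]
print(busy_years)       # []
```

With only three rows, the HAVING query returns nothing; lowering the threshold is an easy way to see groups pass the filter.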

Learn Data Skills Online at www.DataCamp.com


Power BI for Business Intelligence

Power BI Cheat Sheet

Learn Power BI online at www.DataCamp.com

What is Power BI?

Power BI is a business intelligence tool that allows you to effectively report insights through easy-to-use customizable visualizations and dashboards.

> Why use Power BI?

Easy to use: no coding involved
Integrates seamlessly with any data source
Fast and can handle large datasets

> Power BI Components

There are three components to Power BI, each of them serving different purposes

Power BI Desktop: Free desktop application that provides data analysis and creation tools.
Power BI service: Cloud-based version of Power BI with report editing and publishing features.
Power BI mobile: A mobile app of Power BI, which allows you to author, view, and share reports on the go.

> Getting started with Power BI

There are three main views in Power BI

Report view: This view is the default view, where you can visualize data and create reports
Data view: This view lets you examine datasets associated with your reports
Model view: This view helps you establish different relationships between datasets

> Visualizing your first dataset

Upload datasets into Power BI

Underneath the Home tab, click on Get Data
Choose any of your datasets and double click
Click on Load if no prior data processing is needed
If you need to transform the data, click Transform, which will launch Power Query. Keep reading this cheat sheet for how to apply transformations in Power Query
Inspect your data by clicking on the Data View

Create relationships in Power BI

If you have different datasets you want to connect, first upload them into Power BI
Click on the Model View from the left-hand pane
Connect key columns from different datasets by dragging one to another (e.g., EmployeeID in the Employee Database table to SalesPersonID in the Sales Performance table)

Create your first visualization

Click on the Report View and go to the Visualizations pane on the right-hand side
Select the type of visualization you would like to plot your data on. Keep reading this cheat sheet to learn the different visualizations available in Power BI
Under the Field pane on the right-hand side, drag the variables of your choice into Values or Axis
Values let you visualize aggregate measures (e.g. Total Revenue)
Axis lets you visualize categories (e.g. Sales Person)

Aggregating data

Power BI sums numerical fields when visualizing them under Values. However, you can choose a different aggregation

Select the visualization you just created
Go to the Visualizations section on the right-hand side
Go to Values; the visualized column should be there
On the selected column, click on the dropdown arrow and change the aggregation (e.g., AVERAGE, MAX, COUNT, etc.)

> Data Visualizations in Power BI

Power BI provides a wide range of data visualizations. Here is a list of the most useful visualizations you have in Power BI

Bar Charts: Horizontal bars used for comparing specific values across categories (e.g. sales by region)
Column Charts: Vertical columns for comparing specific values across categories
Line Charts: Used for looking at a numeric value over time (e.g. revenue over time)
Area Chart: Based on the line chart with the difference that the area between the axis and line is filled in (e.g. sales by month)
Scatter: Displays one set of numerical data along the horizontal axis and another set along the vertical axis (e.g. relation age and loan)
Combo Chart: Combines a column chart and a line chart (e.g. actual sales performance vs target)
Treemaps: Used to visualize categories with colored rectangles, sized with respect to their value (e.g. product category based on sales)
Pie Chart: Circle divided into slices representing a category's proportion of the whole (e.g. market share)
Donut Chart: Similar to pie charts; used to show the proportion of sectors to a whole (e.g. market share)
Maps: Used to map categorical and quantitative information to spatial locations (e.g. sales per state)
Cards: Used for displaying a single fact or single data point (e.g. total sales)
Table: Grid used to display data in a logical series of rows and columns (e.g. all products with sold items)

> Power Query Editor in Power BI

Power Query is Microsoft's data transformation and data preparation engine. It is part of Power BI Desktop, and lets you connect to one or many data sources, shape and transform data to meet your needs, and load it into Power BI.

Open the Power Query Editor

While loading data
Underneath the Home tab, click on Get Data
Choose any of your datasets and double click
Click on Transform Data

When data is already loaded
Go to the Data View
Under Queries in the Home tab of the ribbon, click on the Transform Data drop-down, then on the Transform Data button

Using the Power Query Editor

Removing rows

You can remove rows depending on their location and properties
Click on the Home tab in the Query ribbon
Click on Remove Rows in the Reduce Rows group
Choose which option to remove, whether Remove Top Rows, Remove Bottom Rows, etc.
Choose the number of rows to remove
You can undo your action by removing it from the Applied Steps list on the right-hand side

Adding a new column

You can create new columns based on existing or new data
Click on the Add Column tab in the Query ribbon
Click on Custom Column in the General group
Name your new column by using the New Column Name option
Define the new column formula under the custom column formula using the available data

Replace values

You can replace one value with another value wherever that value is found in a column
In the Power Query Editor, select the cell or column you want to replace
Click on the column or value, and click on Replace Values under the Home tab under the Transform group
Fill the Value to Find and Replace With fields to complete your operation

Appending datasets

You can append one dataset to another
Click on Append Queries under the Home tab under the Combine group
Select to append either Two tables or Three or more tables
Add tables to append under the provided section in the same window

Merge Queries

You can merge tables based on a related column
Click on Merge Queries under the Home tab under the Combine group
Select the first table and the second table you would like to merge
Select the columns you would like to join the tables on by clicking on the column from the first dataset, and from the second dataset
Select the Join Kind that suits your operation: Left outer, Right outer, Full outer, Inner, Left anti, Right anti
Click on Ok; new columns will be added to your current table

Data profiling

Data Profiling is a feature in Power Query that provides intuitive information about your data
Click on the View tab in the Query ribbon
In the Data Preview tab, tick the options you want to visualize
Tick Column Quality to see the amount of missing data
Tick Column Distribution to see the statistical distribution under every column
Tick Column Profile to see summary statistics and more detailed frequency information of columns

> DAX Expressions

Data Analysis Expressions (DAX) is a calculation language used in Power BI that lets you create calculations and perform data analysis. It is used to create calculated columns, measures, and custom tables. DAX functions are predefined formulas that perform calculations on specific values called arguments.

Sample data

Throughout this section, we'll use the columns listed in this sample table of `sales_data`

deal_size  sales_person        date        customer_name
1,000      Maria Shuttleworth  30-03-2022  Acme Inc.
3,000      Nuno Rocha          29-03-2022  Spotflix
2,300      Terence Mickey      13-04-2022  DataChamp

Simple aggregation

SUM(<column>) adds all the numbers in a column
AVERAGE(<column>) returns the average (arithmetic mean) of all numbers in a column
MEDIAN(<column>) returns the median of numbers in a column
MIN(<column>) / MAX(<column>) returns the smallest / biggest value in a column
COUNT(<column>) counts the number of cells in a column that contain non-blank values
DISTINCTCOUNT(<column>) counts the number of distinct values in a column

EXAMPLE
Sum of all deals: SUM('sales_data'[deal_size])
Average deal size: AVERAGE('sales_data'[deal_size])
Distinct number of customers: DISTINCTCOUNT('sales_data'[customer_name])

Logical functions

IF(<logical_test>, <value_if_true>[, <value_if_false>]) checks the result of an expression and creates conditional results

EXAMPLE
Create a column called large_deal that returns "Yes" if deal_size is bigger than 2,000 and "No" otherwise
large_deal = IF('sales_data'[deal_size] > 2000, "Yes", "No")

Text functions

LEFT(<text>, <num_chars>) returns the specified number of characters from the start of a text
LOWER(<text>) converts a text string to all lowercase letters
UPPER(<text>) converts a text string to all uppercase letters
REPLACE(<old_text>, <start_num>, <num_chars>, <new_text>) replaces part of a text string with a different text string

EXAMPLE
Change column customer_name to be only lower case
customer_name = LOWER('sales_data'[customer_name])

Date and time functions

CALENDAR(<start_date>, <end_date>) generates a column of continuous sets of dates
DATE(<year>, <month>, <day>) returns the specified date in the datetime format
WEEKDAY(<date>, <return_type>) returns a number from 1 to 7 corresponding to the day of the week of a date (return_type indicates week start and end: 1 for Sunday-Saturday, 2 for Monday-Sunday)

EXAMPLE
Return the day of week of each deal
week_day = WEEKDAY('sales_data'[date], 2)
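To make the DAX examples concrete, the sketch below reproduces them on the three sample rows in plain Python. It is only an approximation for illustration, not DAX itself: the dictionary layout and the customer_name_lower key are hypothetical, the dates are assumed to be in day-month-year order, and isoweekday() is used because it numbers Monday as 1 and Sunday as 7, matching WEEKDAY with return_type 2.

```python
from datetime import datetime
from statistics import mean

# The sample sales_data table; dates are assumed to be day-month-year.
sales_data = [
    {"deal_size": 1000, "sales_person": "Maria Shuttleworth",
     "date": "30-03-2022", "customer_name": "Acme Inc."},
    {"deal_size": 3000, "sales_person": "Nuno Rocha",
     "date": "29-03-2022", "customer_name": "Spotflix"},
    {"deal_size": 2300, "sales_person": "Terence Mickey",
     "date": "13-04-2022", "customer_name": "DataChamp"},
]

deal_sizes = [row["deal_size"] for row in sales_data]

# Simple aggregations: SUM, AVERAGE, DISTINCTCOUNT.
sum_of_deals = sum(deal_sizes)
average_deal_size = mean(deal_sizes)
distinct_customers = len({row["customer_name"] for row in sales_data})

# Calculated columns, evaluated row by row.
for row in sales_data:
    # IF(deal_size > 2000, "Yes", "No")
    row["large_deal"] = "Yes" if row["deal_size"] > 2000 else "No"
    # LOWER(customer_name), stored under a separate hypothetical key
    row["customer_name_lower"] = row["customer_name"].lower()
    # WEEKDAY(date, 2): Monday = 1 ... Sunday = 7
    parsed_date = datetime.strptime(row["date"], "%d-%m-%Y")
    row["week_day"] = parsed_date.isoweekday()

large_deal = [row["large_deal"] for row in sales_data]
week_day = [row["week_day"] for row in sales_data]

print(sum_of_deals, average_deal_size, distinct_customers)  # 6300 2100 3
print(large_deal)  # ['No', 'Yes', 'Yes']
print(week_day)    # [3, 2, 3]
```

The split between the aggregate values and the per-row loop mirrors the DAX distinction between measures, which summarize a column, and calculated columns, which produce one value per row.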
Tableau for Business Intelligence

Tableau Basics Cheat Sheet

Learn Tableau online at www.DataCamp.com

What is Tableau?

Tableau is a business intelligence tool that allows you to effectively report insights through easy-to-use customizable visualizations and dashboards

> Why use Tableau?

Easy to use: no coding involved
Integrates seamlessly with any data source
Fast and can handle large datasets

> Tableau Versions

There are two main versions of Tableau

Tableau Public: A free version of Tableau that lets you connect to limited data sources, create visualizations and dashboards, and publish dashboards online
Tableau Desktop: A paid version of Tableau that lets you connect to all types of data sources, save work locally, and work with unlimited data sizes

> Getting started with Tableau

When working with Tableau, you will work with Workbooks. Workbooks contain sheets, dashboards, and stories. Similar to Microsoft Excel, a Workbook can contain multiple sheets. A sheet can be any of the following and can be accessed on the bottom left of a workbook

Worksheet: A worksheet is a single view in a workbook. You can add shelves, cards, legends, visualizations, and more in a worksheet
Dashboard: A collection of multiple worksheets used to display multiple views simultaneously
Story: A story is a collection of multiple dashboards and/or sheets that describe a data story

3. Measures: A measure is a type of field that contains quantitative values (e.g. revenue, costs, and market sizes). When dragged into a view, this data is aggregated, which is determined by the dimensions in the view
4. Data types: Every field has a data type which is determined by the type of information it contains. The available data types in Tableau include text, date values, date & time values, numerical values, boolean values, geographical values, and cluster groups

The Canvas

The canvas is where you'll create data visualizations

1. Tableau Canvas: The canvas takes up most of the screen on Tableau and is where you can add visualizations
2. Rows and columns: Rows and columns dictate how the data is displayed in the canvas. When dimensions are placed, they create headers for the rows or columns while measures add quantitative values
3. Marks card: The marks card allows users to add visual details such as color, size, labels, etc. to rows and columns. This is done by dragging fields from the data pane into the marks card

> Visualizing Your First Dataset

Upload a dataset to Tableau

Launch Tableau
In the Connect section, under To a File, press on the file format of your choice
For selecting an Excel file, select .xls or .xlsx

Creating your first visualization

Once your file is uploaded, open a Worksheet and click on the Data pane on the left-hand side
Drag and drop at least one field into the Columns section, and one field into the Rows section at the top of the canvas
To add more detail, drag and drop a dimension into the Marks card (e.g. drag a dimension over the color square in the marks card to color visualization components by that dimension)
To add a summary insight like a trendline, click on the Analytics pane and drag the trend line into your visualization
You can change the type of visualization for your data by clicking on the Show Me button on the top right

> Data Visualizations in Tableau

Tableau provides a wide range of data visualizations to use. Here is a list of the most useful visualizations you have in Tableau

Bar Charts: Horizontal bars used for comparing specific values across categories (e.g. sales by region)
Stacked Bar Chart: Used to show categorical data within a bar chart (e.g., sales by region and department)
Side-by-Side Bar Chart: Used to compare values across categories in a bar chart format (e.g., sales by region comparing product types)
Line Charts: Used for looking at a numeric value over time (e.g., revenue over time)
Scatter Plot: Used to identify patterns between two continuous variables (e.g., profit vs. sales volume)
Histogram: Used to show a distribution of data (e.g., distribution of monthly revenue)
Box-and-Whisker Plot: Used to compare distributions between categorical variables (e.g., distribution of revenue by region)
Heat Map: Used to visualize data in rows and columns as colors (e.g., revenue by marketing channel)
Highlight Table: Used to show data values with conditional color formatting (e.g., site-traffic by marketing channel and year)
Symbol Map: Used to show geographical data (e.g., market size opportunity by state)

Aggregating data

When data is dragged into the Rows and Columns on a sheet, it is aggregated based on the dimensions in the sheet. This is typically a summed value. The default aggregation can be changed using the steps below:

Right-click on a measure field in the Data pane
Go down to Default properties, Aggregation, and select the aggregation you would like to use

Changing colors

Color is a critical component of visualizations. It draws attention to details. Attention is the most important component of strong storytelling. Colors in a graph can be set using the marks card.

Create a visualization by dragging fields into the Rows and Columns section at the top of the screen
Drag dimensions into the Marks field, specifically into the Color square
To change from the default colors, go to the upper-right corner of the color legend and select Edit Colors. This will bring up a dialog that allows you to select a different palette

Changing fonts

Fonts can help with the aesthetic of the visualization or help with consistent branding. To change the workbook's font, use the following steps

In the Format menu on the top ribbon, press on Select Workbook. This will replace the Data pane and allow you to make formatting decisions for the Workbook
From here, select the font, font size, and color

> Creating dashboards with Tableau

Dashboards are an excellent way to consolidate visualizations and present data to a variety of stakeholders. Here is a step by step process you can follow to create a dashboard.

1. Launch Tableau
2. In the Connect section under To a File, press on your desired file type
3. Select your file
4. Click the New Sheet at the bottom to create a new sheet
5. Create a visualization in the sheet by following the steps in the previous sections of this cheat sheet
6. Repeat steps 4 and 5 until you have created all the visualizations you want to include in your dashboard
7. Click the New Dashboard at the bottom of the screen
8. On the left-hand side, you will see all your created sheets. Drag sheets into the dashboard
9. Adjust the layout of your sheets by dragging and dropping your visualizations

[Figure: Dashboard examples in Tableau]

> Creating stories with Tableau

A story is a collection of multiple dashboards and/or sheets that describe a data story

Click the New Story at the bottom of the screen
Change the size of the story to the desired size in the bottom left-hand corner of the screen under Size
Edit the title of the story by renaming the story. To do this, right-click on the story sheet at the bottom
and press Renam

> The Anatomy of a Worksheet M ap: Used to show geographical data with color formatting (e.g., Covid cases by state)

Treemap: Used to show hierarchical data (e.g., Show how much revenue subdivisions generate relative to
A story is made of story points, which lets you cycle through different visualizations and dashboard
To begin adding to the story, add a story point from the left-hand side. You can add a blank story poin
To add a summary text to the story, click Add a caption and summarize the story poin
the whole department within an organization)

Add as many story points as you would like to finalize your data story

When opening a worksheet, you will work with a variety of tools and interfaces
Dual Co bination: Used to show two visualizations within the same visualization (e.g., pro it or a store each
m f f

month as a bar chart with in entory o er time as a line chart)

v v

The Sidebar
In the sidebar, you’ll find useful panes for working with data
Data: The data pane on the left-hand side contains all of the fields in the currently selected data source
Analytics: The analytics pane on the left-hand side lets you add useful insights like trend lines, error bars, and other useful summaries to visualizations

Tableau Data Definitions
When working with data in Tableau, there are multiple definitions to be mindful of
Fields: Fields are all of the different columns or values in a data source or that are calculated in the workbook. They show up in the data pane and can either be dimension or measure fields
Dimensions: A dimension is a type of field that contains qualitative values (e.g. locations, names, and departments). Dimensions dictate the amount of granularity in visualizations and help reveal nuanced details in the data

> Customizing Visualizations with Tableau
Tableau provides a deep ability to filter, format, aggregate, customize, and highlight specific parts of your data visualizations

Filtering data with highlights
1. Once you’ve created a visual, click and drag your mouse over the specific portion you want to highlight
2. Once you let go, you will have the option to Keep Only or Exclude the data
3. Open the Data pane on the side bar. Then, you can drag-and-drop a field into the Filters card just to the left of the pane.

Filtering data with filters
1. Open the Data pane on the left-hand side
2. Drag-and-drop a field you want to filter on and add it to the Filters card
3. Fill out in the modal how you would like your visuals to be filtered on the data

[Figure: Stories examples in Tableau]
Learn Data Skills Online at www.DataCamp.com
Data Storytelling & Communication Cheat Sheet

> What is data storytelling?
Data storytelling is often called the last mile of analytics. Sound communication skills allow data professionals to drive action out of their insights. According to Brent Dykes, author of Effective Data Storytelling: How to Drive Change with Data, Narrative, and Visuals, data storytelling is a combination of data, visuals, and narrative.

[Figure: The three elements of data storytelling: Data, Visuals, Narrative]
(Source: Effective Data Storytelling: How to Drive Change with Data, Narrative, and Visuals by Brent Dykes)

> Crafting effective visuals

Choose the best visualization for your story
Each plot type is suited for communicating specific things about specific types of data. Start by choosing an appropriate plot type.
Line plot: Show changes in numeric values over time.
Bar plot: Visualizes numeric values by categories. It can be ranked or unranked values.
Scatter plot: Show the relationship between two numeric values.
Histogram: Show the distribution of numeric values.
To learn about all the types of visualizations you can use, check out our Data Visualization Cheat Sheet.

Keep visualizations minimal and avoid clutter
Ruthlessly edit your plots to remove or minimize elements that distract from the message of the plot. In particular, make non-data elements (parts of the plot that don't directly represent a data value, like the grid lines) less distracting. A great example comes from Darkhorse Analytics, which showcases exactly the value of decluttering visualizations.

[Figure: "Calories per 100g" bar chart for French Fries, Potato Chips, Bacon, Pizza, and Chili Dog. Decluttering a visualization in action]
(Source: Darkhorse Analytics)

Data visualization decluttering best practices
Use just enough white space to keep the visualization from looking busy
Remove chart borders when applicable
Remove or minimize gridlines or axes when applicable
Clean up axis labels when applicable
Label data directly (as opposed to using a legend)
Remove data markers when applicable
Use special effects (bold, underline, italics, shadows) sparingly

Use text appropriately
While too much text can add clutter, text can also be an extremely effective tool at highlighting insights within your visualizations. Cole Nussbaumer Knaflic, author of Storytelling with Data, provides an excellent example with the following visualization.

[Figure: "Ticket volume over time", two line charts of the number of tickets Received and Processed per month, Jan to Dec 2014. The annotated version is titled "Please approve the hire of 2 FTEs to backfill those who quit in the past year" and carries the note "2 employees quit in May. We nearly kept up with incoming volume in the following two months, but fell behind with the increase in Aug and haven’t been able to catch up since." Footnote: "Data source: XYZ Dashboard, as of 12/31/2014 | A detailed analysis on tickets processed per person and time to resolve issues was undertaken to inform this request and can be provided if needed."]
How text can be a useful visual tool when crafting effective visuals
(Source: Storytelling with Data by Cole Nussbaumer Knaflic)

Using text in data visualizations
When applicable, label axes and titles for clarity
Label important data points when necessary
Provide useful context around insights within the title or subtitle
Adjust font size when highlighting specific messages within your labels
When applicable, try to answer common audience questions with labels

Use colors effectively
The fundamentals of color theory in data visualization
Color is one of the most powerful tools available for emphasizing different aspects of your data visualization. Here are different properties to keep in mind when choosing an appropriate color palette for your visualization.
Hue represents the range of possible colors, from red, through orange, green and blue, to purple and back to red.
Chroma is the intensity of the color, from grey to a bright color.
Luminance is the brightness of the color, from black to white.

There are three common types of color palettes, which depend on these dimensions.
Type | Purpose | What to vary | Example
Qualitative | Distinguish unordered categories | Hue | A bar chart of 2022 smartphone sales for different smartphone manufacturers
Sequential | Showcase intensity of a single variable | Chroma or luminance | A map showcasing Covid-19 vaccination prevalence
Diverging | Compare between two groups | Chroma or luminance with two hues | Voter registration prevalence by political party in the USA

Do not mislead with data stories
The fastest way to lose credibility when presenting data stories is to inadvertently (or intentionally) mislead with your data insights. Here are top best practices to avoid misleading with data stories.

[Figure: "Same Data, Different Y-Axis", two line charts of interest rates for 2008 to 2012, one with a y-axis from 3.140% to 3.154%, the other with a y-axis from 0.00% to 3.50%]
Starting the y-axis at the smallest value or at zero dramatically changes the story told by the plot

Best practices to avoid misleading with data stories
If you are visualizing time series data, make sure your time horizons are large enough to truly represent the data
If the relative size of each value is important, then ensure that your axes start with zero
Ensure that axes scales are appropriate given the data you’re treating
If you are sampling data for descriptive purposes, make sure the sample is representative of the broader population
Use centrality measures such as mean or median to provide context around your data

> Crafting effective narratives with data

Know the audience
To communicate effectively, you need to know who your audience is, and what their priorities are. There is a range of possible audiences you may encounter when presenting, and crafting an audience-specific message will be important. Examples of audiences you may present to are:
Executive: Basic data literacy skills. Prioritizes outcomes & decisions. Cares much more about business impact than a 1% incremental gain in a machine learning model accuracy or a new technique you’re using
Data Leader: Data expert. Prioritizes rigour & insights. Cares much more about how you arrived at your insights and to battle test them for rigour
Business Partner: Advanced data literacy skills. Prioritizes tactical next steps. Cares much more about how your analysis impacts their workflow, and what should be their main takeaway from the data story

Considerations when crafting audience-specific messaging
Aspect | What do you need to consider?
Prior knowledge | What context do they have about the problem? What is their level of data literacy?
Priorities | What does the audience care about? How does your message relate to their goals? Who is driving decision-making within your audience?
Constraints | What is the audience’s preferred format? How much time does an audience have to consume a data story?

Choose the best medium to share your story
There are different ways you can deliver a data story. The importance of each is different depending on the audience of your data story and the setting you’re delivering your data story in.
Type | Important considerations
Presentation | Ensure the length of your presentation is appropriate. Leave any highly technical details to the appendix. Ensure there is a narrative arc to your presentation
Long-form report | Be extra diligent about providing useful context around data visualizations and insights. Leave any highly technical details to the appendix
Notebook | Ensure that you provide useful context on how you arrived at a certain conclusion
Dashboard | Make use of the dashboard grid layout. Organize data insights from left to right, top to bottom. Provide useful summary text of key visualizations in your dashboard
The Data Visualization Cheat Sheet

How to use this cheat sheet
Use this cheat sheet for inspiration when making your next data visualizations. For more data visualization cheat sheets, check out our cheat sheets repository here.

> Part-to-whole charts
Pie chart: One of the most common ways to show part to whole data. It is also commonly used with percentages. Use cases: Voting preference by age group; Market share of cloud providers
Donut pie chart: The donut pie chart is a variant of the pie chart, the difference being it has a hole in the center for readability. Use cases: Android OS market share; Monthly sales by channel
Heat maps: Heatmaps are two-dimensional charts that use color shading to represent data trends. Use cases: Average monthly temperatures across the year; Departments with the highest amount of attrition over time
Stacked column chart: Best to compare subcategories within categorical data. Can also be used to compare percentages. Use cases: Quarterly sales per region; Total car sales by producer
Treemap charts: 2D rectangles whose size is proportional to the value being measured and can be used to display hierarchically structured data. Use cases: Grocery sales count with categories; Stock price comparison by industry and company

> Capture a trend
Line chart: The most straightforward way to capture how a numeric variable is changing over time. Use cases: Revenue in $ over time; Energy consumption in kWh over time; Google searches over time
Multi-line chart: Captures multiple numeric variables over time. It can include multiple axes allowing comparison of different units and scale ranges. Use cases: Apple vs Amazon stocks over time; Lebron vs Steph Curry searches over time; Bitcoin vs Ethereum price over time
Area chart: Shows how a numeric value progresses by shading the area between line and the x-axis. Use cases: Total sales over time; Active users over time
Stacked area chart: Most commonly used variation of area charts, the best use is to track the breakdown of a numeric value by subgroups. Use cases: Active users over time by segment; Total revenue over time by country
Spline chart: Smoothened version of a line chart. It differs in that data points are connected with smoothed curves to account for missing values, as opposed to straight lines. Use cases: Electricity consumption over time; CO2 emissions over time

> Visualize a single value
Card: Cards are great for showing and tracking KPIs in dashboards or presentations. Use cases: Revenue to date on a sales dashboard; Total sign-ups after a promotion
Table chart: Best to be used on small datasets, it displays tabular data in a table. Use cases: Account executive leaderboard; Registrations per webinar
Gauge chart: This chart is often used in executive dashboard reports to show relevant KPIs. Use cases: NPS score; Revenue to target

> Capture distributions
Histogram: Shows the distribution of a variable. It converts numerical data into bins as columns. The x-axis shows the range, and the y-axis represents the frequency. Use cases: Distribution of salaries in an organization; Distribution of height in one cohort
Box plot: Shows the distribution of a variable using 5 key summary statistics: minimum, first quartile, median, third quartile, and maximum. Use cases: Gas efficiency of vehicles; Time spent reading across readers
Violin plot: A variation of the box plot. It also shows the full distribution of the data alongside summary statistics. Use cases: Time spent in restaurants across age groups; Length of pill effects by dose
Density plot: Visualizes a distribution by using smoothing to allow smoother distributions and better capture the distribution shape of the data. Use cases: Distribution of price of hotel listings; Comparing NPS scores by customer segment

> Visualize relationships
Bar chart: One of the easiest charts to read which helps in quick comparison of categorical data. One axis contains categories and the other axis represents values. Use cases: Volume of Google searches by region; Market share in revenue by product
Column chart: Also known as a vertical bar chart, where the categories are placed on the x-axis. These are preferred over bar charts for short labels, date ranges, or negatives in values. Use cases: Brand market share; Profit analysis by region
Scatter plot: Most commonly used chart when observing the relationship between two variables. It is especially useful for quickly surfacing potential correlations between data points. Use cases: Display the relationship between time-on-platform and churn; Display the relationship between salary and years spent at company
Connected scatterplot: A hybrid between a scatter plot and a line plot, the scatter dots are connected with a line. Use cases: Cryptocurrency price index; Visualizing timelines and events when analyzing two variables
Bubble chart: Often used to visualize data points with 3 dimensions, namely visualized on the x-axis, y-axis, and with the size of the bubble. It tries to show relations between data points using location and size. Use cases: Adwords analysis: CPC vs Conversions vs Share of total conversions; Relationship between life expectancy, GDP per capita, & population size
Word cloud chart: A convenient visualization for visualizing the most prevalent words that appear in a text. Use cases: Top 100 used words by customers in customer service tickets

> Visualize a flow
Sankey chart: Useful for representing flows in systems. This flow can be any measurable quantity. Use cases: Energy flow between countries; Supply chain volumes between warehouses
Chord chart: Useful for presenting weighted relationships or flows between nodes. Especially useful for highlighting the dominant or important flows. Use cases: Export between countries to showcase biggest export partners; Supply chain volumes between the largest warehouses
Network chart: Similar to a graph, it consists of nodes and interconnected edges. It illustrates how different items have relationships with each other. Use cases: How different airports are connected worldwide; Social media friend group analysis
Descriptive Statistics Cheat Sheet

> Key Definitions
Throughout this cheat sheet, you’ll find terms and specific statistical jargon being used. Here’s a rundown of all the terms you may encounter.
Variable: In statistics, a variable is a quantity that can be measured or counted. In data analysis, a variable is typically a column in a data frame
Descriptive statistics: Numbers that summarize variables. They are also called summary statistics or aggregations
Categorical data: Data that consists of discrete groups. The categories are called ordered (e.g., educational levels) if you can sort them from lowest to highest, and unordered otherwise (e.g., country of origin)
Numerical data: Data that consists of numbers (e.g., age).

> Categorical Data: Trail Mix
To illustrate statistical concepts on categorical data, we’ll be using an unordered categorical variable, consisting of different elements of a trail mix. Our categorical variable contains 15 almonds, 13 cashews, and 25 cranberries.

Counts and Proportions
Counts and proportions are measures of how much data you have. They allow you to understand how many data points belong to different categories in your data.
A count is the number of times a data point occurs in the dataset
A proportion is the fraction of times a data point occurs in the dataset.

Food category | Count | Proportion
Almond | 15 | 15 / 53 = 0.283
Cashew | 13 | 13 / 53 = 0.245
Cranberry | 25 | 25 / 53 = 0.472

Visualizing Categorical Variables
Bar plot: One of the easiest charts to read which helps in quick comparison of categorical data. One axis contains categories and the other axis represents values
Stacked bar chart: Best to compare subcategories within categorical data. Can also be used to compare proportions
Treemap chart: 2D rectangles whose size is proportional to the value being measured and can be used to display hierarchically structured data

> Numerical Dataset: Glasses of Water
To illustrate statistical concepts on numerical data, we’ll be using a numerical variable, consisting of the volume of water in different glasses.
300ml 60ml 300ml 120ml 180ml 180ml 300ml

Measures of Center
Measures of center allow you to describe or summarize your data by capturing one value that describes the center of its distribution.
Measure | Definition | How to find it | Result
Arithmetic mean | The total of the values divided by how many values there are | (300 + 60 + 300 + 120 + 180 + 180 + 300) / 7 | 205.7 ml
Median | The middle value, when sorted from smallest to largest | Middle of 60, 120, 180, 180, 300, 300, 300 | 180 ml
Mode | The most common value | 300 ml occurs three times | 300 ml

Other Measures of Location
There are other measures that you can use, that can help better describe or summarize your data.
Measure | Definition | How to find it | Result
Minimum | The lowest value in your data | 60ml | 60 ml
Maximum | The highest value in your data | 300ml | 300 ml
Percentile: Cut points that divide the data into 100 intervals with the same amount of data in each interval (e.g., in the water cup example, the 100th percentile is 300 ml)
Quartile: Similar to the concept of percentile, but with four intervals rather than 100. The first quartile is the same as the 25th percentile, which is 120 ml. The third quartile is the same as the 75th percentile, which is 300 ml.

Measures of Spread
Sometimes, rather than caring about the size of values, you care about how different they are.
Measure | Definition | How to find it | Result
Range | The highest value minus the lowest value | 300 ml - 60 ml | 240 ml
Variance | The sum of the squares of the differences between each value and the mean, all divided by one less than the number of data points | ((300 ml - mean)² + ... + (60 ml - mean)²) / (7 - 1) | 9428.6 ml²
Inter-quartile range | The third quartile minus the first quartile | 300 ml - 120 ml | 180 ml

Visualizing Numeric Variables
There are a variety of ways of visualizing numerical data; here are a few of them in action:
Histogram: Shows the distribution of a variable. It converts numerical data into bins as columns. The x-axis shows the range, and the y-axis represents the frequency
Box plot: Shows the distribution of a variable using 5 key summary statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum

> Correlation
Correlation is a measure of the linear relationship between two variables. That is, when one variable goes up, does the other variable go up or down? There are several algorithms to calculate correlation, but it is always a score between -1 and +1.

[Figure: five scatter plots labeled Strong negative, Weak negative, No correlation, Weak positive, Strong positive]

For two variables, X and Y, correlation has the following interpretation:
Correlation score | Interpretation
-1 | When X increases, Y decreases. Scatter plot forms a perfect straight line with negative slope
Between -1 and 0 | When X increases, Y tends to decrease
0 | There is no linear relationship between X and Y, so the scatter plot looks like a noisy mess
Between 0 and +1 | When X increases, Y tends to increase
+1 | When X increases, Y increases. Scatter plot forms a perfect straight line with positive slope
Note that correlation does not account for non-linear effects, so if X and Y do not have a straight-line relationship, the correlation score may not be meaningful.
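
As a rough check, the figures quoted in this cheat sheet for the glasses-of-water and trail-mix examples can be reproduced with Python's standard library. This is only a sketch: the variable names and the `pearson` helper are illustrative assumptions, and the quartiles rely on the default "exclusive" method of `statistics.quantiles`, which happens to give the sheet's 120 ml and 300 ml (other quartile conventions give different cut points).

```python
from collections import Counter
from math import sqrt
from statistics import mean, median, mode, quantiles, variance

# Volumes (ml) of the seven glasses used as the numerical example
glasses = [300, 60, 300, 120, 180, 180, 300]

# Measures of center
center = {
    "mean": round(mean(glasses), 1),
    "median": median(glasses),
    "mode": mode(glasses),
}

# Quartiles, using the default "exclusive" method (an assumption about the sheet's convention)
q1, q2, q3 = quantiles(glasses, n=4)

# Measures of spread; variance() is the sample variance (divides by n - 1)
spread = {
    "range": max(glasses) - min(glasses),
    "variance": round(variance(glasses), 1),
    "iqr": q3 - q1,
}

# Counts and proportions for the trail mix (53 pieces in total)
trail_mix = Counter(almond=15, cashew=13, cranberry=25)
total = sum(trail_mix.values())
proportions = {food: round(count / total, 3) for food, count in trail_mix.items()}


def pearson(xs, ys):
    """Pearson correlation, one common way to score a linear relationship."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


print(center, spread, proportions)
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # points on a rising straight line
```

Note that the three trail-mix proportions only come out as 0.283, 0.245, and 0.472 when dividing by the full total of 53 pieces.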
Introduction to Probability Rules Cheat Sheet

> Definitions
The following pieces of jargon occur frequently when discussing probability.
Event: A thing that you can observe whether it happens or not.
Probability: The chance that an event happens, on a scale from 0 (cannot happen) to 1 (always happens). Denoted P(event).
Probability universe: The probability space where all the events you are considering can either happen or not happen.
Mutually exclusive events: If one event happens, then the other event cannot happen (e.g., you cannot roll a dice that is both 5 and 1).
Independent events: If one event happens, it does not affect the probability that the other event happens (e.g., the weather does not affect the outcome of a dice roll).
Dependent events: If one event happens, it changes the probability that the other event happens (e.g., the weather affects traffic outcomes).
Conjunctive probability (a.k.a. joint probability): The probability that all events happen.
Disjunctive probability: The probability that at least one event happens.
Conditional probability: The probability that one event happens, given another event happened.

[Figure: Venn diagrams of mutually exclusive (disjoint) events, the intersection A ∩ B, the union A ∪ B, and the complement of A: A']

> Complement Rule: Probability of events not happening
Definition: The complement of A is the event that A does not happen. It is denoted A' or Aᶜ.
Formula: P(A')=1 - P(A)
Example: The probability of basketball player Stephen Curry successfully shooting a three-pointer is 0.43. The complement, the probability that he misses, is 1 - 0.43 = 0.57.

> Odds: Probability of event happening versus not happening
Definition: The odds of event A happening is the probability that the event happens divided by the probability that the event doesn't happen.
Formula: Odds(A)=P(A)/P(A')=P(A)/(1-P(A))
Example: The odds of basketball player Stephen Curry successfully shooting a three-pointer is the probability that he scores divided by the probability that he misses, 0.43 / 0.57 = 0.75.

> Multiplication Rules: Probability of two events happening

Mutually exclusive events
Definition: The probability of two mutually exclusive events happening is zero.
Formula: P(A ∩ B)=0
Example: If the probability of it being sunny at midday is 0.3 and the probability of it raining at midday is 0.4, the probability of it being sunny and rainy is 0, since these events are mutually exclusive.

Independent events
Definition: The probability of two independent events happening is the product of the probabilities of each event.
Formula: P(A ∩ B)=P(A) P(B)
Example: If the probability of it being sunny at midday is 0.3 and the probability of your favorite soccer team winning their game today is 0.6, then the probability of it being sunny at midday and your favorite soccer team winning their game today is 0.3 * 0.6 = 0.18.

The conjunctive fallacy
Definition: The probability of both events happening is always less than or equal to the probability of one event happening. That is P(A ∩ B) ≤ P(A), and P(A ∩ B) ≤ P(B). The conjunctive fallacy is when you don't think carefully about probabilities and estimate that the probability of both events happening is greater than the probability of one of the events.
Example: A famous example known as "The Linda problem" comes from a 1980s research experiment. A fictional person was described:
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice and also participated in anti-nuclear demonstrations.
Participants had to choose which statement had a higher probability of being true:
1. Linda is a bank teller
2. Linda is a bank teller and is active in the feminist movement.
Many participants fell for the conjunctive fallacy and chose option 2, even though it must be less likely than option 1 using the multiplication rule.

> Addition Rules: Probability of at least one event happening

Mutually exclusive events
Definition: The probability of at least one mutually exclusive event happening is the sum of the probabilities of each event happening.
Formula: P(A ∪ B)=P(A) + P(B)
Example: If the probability of it being sunny at midday is 0.3 and the probability of it raining at midday is 0.4, the probability of it being sunny or rainy is 0.3 + 0.4 = 0.7, since these events are mutually exclusive.

Independent events
Definition: The probability of at least one of two independent events happening is the sum of the probabilities of each event happening minus the probability of both events happening.
Formula: P(A ∪ B)=P(A) + P(B) - P(A ∩ B)
Example: If the probability of it being sunny at midday is 0.3 and the probability of your favorite soccer team winning their game today is 0.6, then the probability of it being sunny at midday or your favorite soccer team winning their game today is 0.3 + 0.6 - (0.3 * 0.6) = 0.72.

The disjunctive fallacy
Definition: The probability of at least one event happening is always greater than or equal to the probability of one event happening. That is P(A ∪ B) ≥ P(A), and P(A ∪ B) ≥ P(B). The disjunctive fallacy is when you don't think carefully about probabilities and estimate that the probability of at least one event happening is less than the probability of one of the events.
Example: Returning to the "Linda problem", consider having to rank these two statements in order of probability:
1. Linda is a bank teller
2. Linda is a bank teller or is active in the feminist movement.
The disjunctive fallacy would be to think that choice 1 had a higher probability of being true, even though that is impossible because of the additive rule of probabilities.

> Bayes Rule: Probability of an event happening given another event happened
Definition: For dependent events, the probability of event B happening given that event A happened is equal to the probability that both events happen divided by the probability that event A happens. Equivalently, it is equal to the probability that event A happens given that event B happened times the probability of event B happening divided by the probability that event A happens.
Formula: P(B|A)=P(A ∩ B)/P(A)=P(A|B)P(B)/P(A)
Example: Suppose it's a cloudy morning and you want to know the probability of rain today. If the probability of a cloudy morning given that it rains that day is 0.6, and the probability of it raining on any day is 0.1, and the probability of it being cloudy on any day is 0.3, then the probability of it raining given a cloudy morning is 0.6 * 0.1 / 0.3 = 0.2. That is, if you have a cloudy morning it is twice as likely to rain as on an arbitrary day, due to the dependence of the events.
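
The worked examples above can be checked numerically with a few one-line helpers. This is a minimal sketch, not a library API: the function names are hypothetical, and the Bayes example reads the 0.6 figure as P(cloudy morning | rain), which is the quantity Bayes' rule needs on the right-hand side.

```python
from math import isclose


def complement(p_a):
    """P(A') = 1 - P(A)"""
    return 1 - p_a


def odds(p_a):
    """Odds(A) = P(A) / (1 - P(A))"""
    return p_a / (1 - p_a)


def both_independent(p_a, p_b):
    """Multiplication rule for independent events: P(A and B) = P(A) * P(B)"""
    return p_a * p_b


def at_least_one(p_a, p_b, p_both=0.0):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B).

    For mutually exclusive events, P(A and B) is 0 (the default).
    """
    return p_a + p_b - p_both


def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(B|A) = P(A|B) * P(B) / P(A)"""
    return p_a_given_b * p_b / p_a


p_sunny, p_rain, p_win = 0.3, 0.4, 0.6

p_miss = complement(0.43)                       # Curry misses a three-pointer
curry_odds = odds(0.43)                         # odds of scoring
p_sunny_and_win = both_independent(p_sunny, p_win)
p_sunny_or_rain = at_least_one(p_sunny, p_rain)  # mutually exclusive
p_sunny_or_win = at_least_one(p_sunny, p_win, p_sunny_and_win)

# Assumed reading: P(cloudy | rain) = 0.6, P(rain) = 0.1, P(cloudy) = 0.3
p_rain_given_cloudy = bayes(0.6, 0.1, 0.3)

# The conjunction can never be more likely than either event on its own,
# and the disjunction can never be less likely.
assert p_sunny_and_win <= min(p_sunny, p_win)
assert p_sunny_or_win >= max(p_sunny, p_win)

print(round(p_miss, 2), round(curry_odds, 2), round(p_sunny_and_win, 2))
print(round(p_sunny_or_rain, 2), round(p_sunny_or_win, 2), round(p_rain_given_cloudy, 2))
```

The two assertions in the middle of the sketch are the inequalities behind the conjunctive and disjunctive fallacies.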
Python Basics
Python Cheat Sheet for Beginners
Learn Python online at www.DataCamp.com

> How to use this cheat sheet
Python is the most popular programming language in data science. It is easy to learn and comes with a wide array of powerful libraries for data analysis. This cheat sheet gives beginners and intermediate users a guide to getting started with Python. Use it to jump-start your journey with Python. If you want more detailed Python cheat sheets, check out the following cheat sheets: Importing data in python, Data wrangling in pandas.

> Getting started with characters and strings
# Create a string with double or single quotes
"DataCamp"

# Embed a quote in string with the escape character \
"He said, \"DataCamp\""

# Create multi-line strings with triple quotes
"""
A Frame of Data
Tidy, Mine, Analyze It
Now You Have Meaning
Citation: https://mdsr-book.github.io/haikus.html
"""

str[0] # Get the character at a specific position
str[0:2] # Get a substring from starting to ending index (exclusive)

Combining and splitting strings
"Data" + "Framed" # Concatenate strings with +, this returns 'DataFramed'
3 * "data " # Repeat strings with *, this returns 'data data data '
"beekeepers".split("e") # Split a string on a delimiter, returns ['b', '', 'k', '', 'p', 'rs']

> Getting started with lists
A list is an ordered and changeable sequence of elements. It can hold integers, characters, floats, strings, and even objects.

Creating lists
# Create lists with [], elements separated by commas
x = [1, 3, 2]

List functions and methods
sorted(x) # Return a sorted copy of the list e.g., [1,2,3]
x.sort() # Sorts the list in-place (replaces x)
reversed(x) # Return an iterator over x in reverse order; list(reversed(x)) gives e.g., [2,3,1]
x.reverse() # Reverse the list in-place
x.count(2) # Count the number of element 2 in the list

Selecting list elements
Python lists are zero-indexed (the first element has index 0). For ranges, the first element is included but the last is not.
# Define the list
x = ['a', 'b', 'c', 'd', 'e']
x[0] # Select the 0th element in the list
x[-1] # Select the last element in the list
x[1:3] # Select 1st (inclusive) to 3rd (exclusive)
x[2:] # Select the 2nd to the end
x[:3] # Select 0th to 3rd (exclusive)
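To make the difference between the copying functions and the in-place methods concrete, here is a small sketch using the same example lists. The variable names (`sorted_copy`, `reversed_copy`, `letters` and so on) are illustrative only.

```python
x = [1, 3, 2]

# sorted() and reversed() leave x untouched
sorted_copy = sorted(x)              # new list: [1, 2, 3]
reversed_copy = list(reversed(x))    # reversed() returns an iterator, so wrap it in list()

# .sort() and .reverse() modify x itself and return None
x.sort()
after_sort = list(x)                 # snapshot of x after sorting: [1, 2, 3]
x.reverse()
after_reverse = list(x)              # snapshot of x after reversing: [3, 2, 1]

# Zero-based indexing and slicing: start is included, end is excluded
letters = ['a', 'b', 'c', 'd', 'e']
first = letters[0]
last = letters[-1]
middle = letters[1:3]
tail = letters[2:]
head = letters[:3]

print(sorted_copy, reversed_copy, after_sort, after_reverse)
print(first, last, middle, tail, head)
```

A common slip is writing `x = x.sort()`, which rebinds `x` to `None` because the in-place methods do not return the list.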

Concatenating lists
# Define the x and y lists
x = [1, 3, 6]
y = [10, 15, 21]
x + y # Returns [1, 3, 6, 10, 15, 21]
3 * x # Returns [1, 3, 6, 1, 3, 6, 1, 3, 6]

Mutate strings
str = "Jack and Jill" # Define str
str.upper() # Convert a string to uppercase, returns 'JACK AND JILL'
str.lower() # Convert a string to lowercase, returns 'jack and jill'
str.title() # Convert a string to title case, returns 'Jack And Jill'
str.replace("J", "P") # Replaces matches of a substring with another, returns 'Pack and Pill'

> Accessing help and getting object types
1 + 1 # Everything after the hash symbol is ignored by Python
help(max) # Display the documentation for the max function
type('a') # Get the type of an object, this returns str

> Getting started with dictionaries
A dictionary stores data values in key-value pairs. That is, unlike lists which are indexed by position, dictionaries are indexed by their keys, the names of which must be unique.

Creating dictionaries
# Create a dictionary with {}
{'a': 1, 'b': 4, 'c': 9}

Dictionary functions and methods
x = {'a': 1, 'b': 2, 'c': 3} # Define the x dictionary
x.keys() # Get the keys of a dictionary, returns dict_keys(['a', 'b', 'c'])
x.values() # Get the values of a dictionary, returns dict_values([1, 2, 3])

Selecting dictionary elements
x['a'] # Get a value from a dictionary by specifying the key, returns 1

> Importing packages
Python packages are a collection of useful tools developed by the open-source community. They extend the capabilities of the python language. To install a new package (for example, pandas), you can go to your command prompt and type in pip install pandas. Once a package is installed, you can import it as follows.
import pandas # Import a package without an alias
import pandas as pd # Import a package with an alias
from pandas import DataFrame # Import an object from a package

> The working directory
The working directory is the default file path that python reads or saves files into. An example of the working directory is "C://file/path". The os library is needed to set and get the working directory.
import os # Import the operating system package
os.getcwd() # Get the current directory
os.chdir("new/working/directory") # Set the working directory to a new file path

> Getting started with DataFrames
Pandas is a fast and powerful package for data analysis and manipulation in python. To import the package, you can use import pandas as pd. A pandas DataFrame is a structure that contains two-dimensional data stored as rows and columns. A pandas series is a structure that contains one-dimensional data.

Creating DataFrames
# Create a dataframe from a dictionary
pd.DataFrame({
    'a': [1, 2, 3],
    'b': np.array([4, 4, 6]),
    'c': ['x', 'x', 'y']
})

# Create a dataframe from a list of dictionaries
pd.DataFrame([
    {'a': 1, 'b': 4, 'c': 'x'},
    {'a': 1, 'b': 4, 'c': 'x'},
    {'a': 3, 'b': 6, 'c': 'y'}
])

Selecting DataFrame Elements
Select a row, column or element from a dataframe. Remember: all positions are counted from zero, not one.
# Select the 3rd row (counting from zero)
df.iloc[3]


# Select one column by name
df['col']
# Select multiple columns by names
df[['col1', 'col2']]
# Select the 2nd column (counting from zero)
df.iloc[:, 2]
# Select the element in the 3rd row, 2nd column (counting from zero)
df.iloc[3, 2]

Manipulating DataFrames
# Concatenate DataFrames vertically
pd.concat([df, df])
# Concatenate DataFrames horizontally
pd.concat([df, df], axis="columns")
# Get rows matching a condition
df.query('logical_condition')
# Drop columns by name
df.drop(columns=['col_name'])
# Rename columns
df.rename(columns={"oldname": "newname"})
# Add a new column
df.assign(temp_f=9 / 5 * df['temp_c'] + 32)
# Calculate the mean of each column
df.mean()
# Get summary statistics by column
df.agg(aggregation_function)
# Get unique rows
df.drop_duplicates()
# Sort by values in a column
df.sort_values(by='col_name')
# Get rows with largest values in a column
df.nlargest(n, 'col_name')

> NumPy arrays
NumPy is a python package for scientific computing. It provides multidimensional array objects and efficient operations on them. To import NumPy, you can run this Python code: import numpy as np

Creating arrays
# Convert a python list to a NumPy array
np.array([1, 2, 3]) # Returns array([1, 2, 3])
# Return a sequence from start (inclusive) to end (exclusive)
np.arange(1,5) # Returns array([1, 2, 3, 4])
# Return a stepped sequence from start (inclusive) to end (exclusive)
np.arange(1,5,2) # Returns array([1, 3])
# Repeat each value n times
np.repeat([1, 3, 6], 3) # Returns array([1, 1, 1, 3, 3, 3, 6, 6, 6])
# Repeat the whole array n times
np.tile([1, 3, 6], 3) # Returns array([1, 3, 6, 1, 3, 6, 1, 3, 6])

Math functions and methods
All functions take an array as the input.
np.log(x) # Calculate logarithm
np.exp(x) # Calculate exponential
np.max(x) # Get maximum value
np.min(x) # Get minimum value
np.sum(x) # Calculate sum
np.mean(x) # Calculate mean
np.quantile(x, q) # Calculate q-th quantile
np.round(x, n) # Round to n decimal places
np.var(x) # Calculate variance
np.std(x) # Calculate standard deviation

> Operators
Arithmetic operators
102 + 37 # Add two numbers with +
102 - 37 # Subtract a number with -
4 * 6 # Multiply two numbers with *
22 / 7 # Divide a number by another with /
22 // 7 # Integer divide a number with //
3 ** 4 # Raise to the power with **
22 % 7 # Get the remainder after division with %, returns 1

Assignment operators
a = 5 # Assign a value to a
x[0] = 1 # Change the value of an item in a list

Numeric comparison operators
3 == 3 # Test for equality with ==
3 != 3 # Test for inequality with !=
3 > 1 # Test greater than with >
3 >= 3 # Test greater than or equal to with >=
3 < 4 # Test less than with <
3 <= 4 # Test less than or equal to with <=

Logical operators
not (2 == 2) # Logical NOT with not (use ~ for element-wise NOT on NumPy and pandas booleans)
(1 != 1) & (1 < 1) # Logical AND with &
(1 >= 1) | (1 < 1) # Logical OR with |
(1 != 1) ^ (1 < 1) # Logical XOR with ^
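Two operator pitfalls are worth checking in plain Python: exponentiation is written `**`, while `^` is bitwise XOR, and `~` flips the bits of an integer rather than negating a truth value. The sketch below uses only built-in types, and the variable names are illustrative. The element-wise behaviour of `~`, `&` and `|` on NumPy arrays or pandas columns is not shown here.

```python
# Arithmetic operators
power = 3 ** 4        # exponentiation: 81
xor_bits = 3 ^ 4      # bitwise XOR of 0b011 and 0b100: 7
quotient = 22 // 7    # integer division: 3
remainder = 22 % 7    # remainder: 1

# Logical NOT versus bitwise inversion
logical_not = not (2 == 2)     # False
bitwise_not = ~int(2 == 2)     # -2, because ~1 == -2 (and -2 is truthy)

# Logical AND / OR on plain booleans
both = (1 != 1) and (1 < 1)    # False
either = (1 >= 1) or (1 < 1)   # True

print(power, xor_bits, quotient, remainder)
print(logical_not, bitwise_not, both, either)
```

For single True/False values, `not`, `and` and `or` are the usual choice; `&`, `|` and `^` also work on plain booleans but bind more tightly than comparisons, which is why the comparisons in this section are wrapped in parentheses.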
R for Data Science
Getting started with R Cheat Sheet
Learn R online at www.DataCamp.com

> How to use this cheat sheet
R is one of the most popular programming languages in data science and is widely used across various industries and in academia. Given that it's open-source, easy to learn, and capable of handling complex data and statistical manipulations, R has become the preferred computing environment for many data scientists today.
This cheat sheet will cover an overview of getting started with R. Use it as a handy, high-level reference for a quick start with R. For more detailed R Cheat Sheets, see the xts Cheat Sheet and the data.table Cheat Sheet.

> Accessing help
Accessing help files and documentation
?max #Shows the help documentation for the max function
?tidyverse #Shows the documentation for the tidyverse package
??"max" #Returns documentation associated with a given input

> Getting started with vectors
Vectors are one-dimension arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data.

Creating vectors
Input                     Output             Description
c(1,3,5)                  1 3 5              Creates a vector using elements separated by commas
1:7                       1 2 3 4 5 6 7      Creates a vector of integers between two numbers
seq(2,8,by = 2)           2 4 6 8            Creates a vector between two numbers, with a specified interval between each element.
rep(c(2,8),times = 4)     2 8 2 8 2 8 2 8    Creates a vector of given elements repeated a number of times.
rep(c(2,8),each = 3)      2 2 2 8 8 8        Creates a vector of given elements repeating each element a number of times.

Vector functions
These functions perform operations over a whole vector.
sort(my_vector) #Returns my_vector sorted
rev(my_vector) #Reverses order of my_vector
table(my_vector) #Count of the values in a vector
unique(my_vector) #Distinct elements in a vector

Selecting vector elements
These functions allow us to refer to particular parts of a vector.
my_vector[6] #Returns the sixth element of my_vector
my_vector[-6] #Returns all but the sixth element
my_vector[2:6] #Returns elements two to six
my_vector[-(2:6)] #Returns all elements except those between the second and the sixth
my_vector[c(2,6)] #Returns the second and sixth elements
my_vector[my_vector == 5] #Returns elements equal to 5
my_vector[my_vector < 5] #Returns elements less than 5
my_vector[my_vector %in% c(2, 5, 8)] #Returns elements in the set {2, 5, 8}

> Getting started with Data Frames in R
A data frame has the variables of a data set as columns and the observations as rows.
#This creates the data frame df, shown below
df <- data.frame(x = 1:3, y = c("h", "i", "j"), z = 12:14)
x y z
1 h 12
2 i 13
3 j 14
#This selects all rows of the second column
df[ ,2]
#This selects all columns of the third row
df[3, ]
#This selects the third column of the second row
df[2,3]
#This selects the column z
df$z

> Manipulating Data Frames in R
dplyr allows us to easily and precisely manipulate data frames. To use the following functions, you should install and load dplyr using install.packages("dplyr") and library(dplyr)
#Takes a sequence of vector, matrix or data-frame arguments and combines them by columns
bind_cols(df1, df2)
#Takes a sequence of vector, matrix or data frame arguments and combines them by rows
bind_rows(df1, df2)
#Moves columns to a new position
relocate(df, x, .after = last_col())
#Renames columns
rename(df, "age" = z)
#Orders rows by values of a column from high to low
arrange(df, desc(x))
#Extracts rows that meet logical criteria
filter(df, x == 2)

#Removes rows with duplicate values
distinct(df, z)
#Computes table of summaries
summarise(df, total = sum(x))
#Selects rows by position
slice(df, 10:15)
#Selects rows with the highest values
slice_max(df, z, prop = 0.25)
#Extracts column values as a vector, by name or index
pull(df, y)
#Extracts columns as a table
select(df, x, y)
#Use group_by() to create a "grouped" copy of a table grouped by columns (similarly to a pivot table in spreadsheets). dplyr functions will then manipulate each "group" separately and combine the results
df %>%
  group_by(z) %>%
  summarise(total = sum(x))

Information about objects
str(my_df) #Returns the structure and information of a given object
class(my_df) #Returns the class of a given object

> Using packages
R packages are collections of functions and tools developed by the R community. They increase the power of R by improving existing base R functionalities, or by adding new ones.
install.packages("tidyverse") #Lets you install new packages (e.g., tidyverse package)
library(tidyverse) #Lets you load and use packages (e.g., tidyverse package)

> The working directory
The working directory is a file path that R will use as the starting point for relative file paths. That is, it's the default location for importing and exporting files. An example of a working directory looks like "C://file/path"
getwd() #Returns your current working directory
setwd("C://file/path") #Changes your current working directory to a desired filepath

> Operators
R has multiple operators that allow you to perform a variety of tasks. Arithmetic operators let you perform arithmetic such as addition and multiplication. Relational operators are used to compare between values. Logical operators are used for Boolean operations.

Arithmetic Operators
a + b #Sums two variables
a - b #Subtracts two variables
a * b #Multiply two variables
a / b #Divide two variables
a ^ b #Exponentiation of a variable
a %% b #Remainder of a variable
a %/% b #Integer division of variables

Relational Operators
a == b #Tests for equality
a != b #Tests for inequality
a > b #Tests for greater than
a < b #Tests for lower than
a >= b #Tests for greater than or equal to
a <= b #Tests for less than or equal to

Logical Operators
! #Logical NOT
& #Element-wise logical AND
&& #Logical AND
| #Element-wise logical OR
|| #Logical OR

Assignment Operators
x <- 1 #Assigns a value to x
x = 1 #Assigns a value to x

Other Operators
%in% #Identifies whether an element belongs to a vector
$ #Allows you to access objects stored within an object
%>% #Part of magrittr package, it's used to pass objects to functions

> Math functions
These functions enable us to perform basic mathematical operations within R
log(x) #Returns the logarithm of a variable
exp(x) #Returns exponential of a variable
max(x) #Returns maximum value of a vector
min(x) #Returns minimum value of a vector
mean(x) #Returns mean of a vector
sum(x) #Returns sum of a vector
median(x) #Returns median of a vector
quantile(x) #Percentage quantiles of a vector
round(x, n) #Round to n decimal places
rank(x) #Rank of elements in a vector
signif(x, n) #Round off n significant figures
var(x) #Variance of a vector
cor(x, y) #Correlation between two vectors
sd(x) #Standard deviation of a vector

> Getting started with strings
The "stringr" package makes it easier to work with strings in R - you should install and load this package to use the following functions.

Find Matches
#Detects the presence of a pattern match in a string
str_detect(string, pattern, negate = FALSE)
#Detects the presence of a pattern match at the beginning of a string
str_starts(string, pattern, negate = FALSE)
#Finds the index of strings that contain pattern match
str_which(string, pattern, negate = FALSE)
#Locates the positions of pattern matches in a string
str_locate(string, pattern)
#Counts the number of pattern matches in a string
str_count(string, pattern)

Subset
#Extracts substrings from a character vector
str_sub(string, start = 1L, end = -1L)
#Returns strings that contain a pattern match
str_subset(string, pattern, negate = FALSE)
#Returns first pattern match in each string as a vector
str_extract(string, pattern)
#Returns first pattern match in each string as a matrix with a column for each group in the pattern
str_match(string, pattern)

Mutate
#Replaces substrings by identifying the substrings with str_sub() and assigning them to the results.
str_sub() <- value
#Replaces the first matched pattern in each string.
str_replace(string, pattern, replacement)
#Replaces all matched patterns in each string
str_replace_all(string, pattern, replacement)
#Converts strings to lowercase
str_to_lower(string)
#Converts strings to uppercase
str_to_upper(string)
#Converts strings to title case
str_to_title(string)

Join and Split
#Repeats strings n times
str_dup(string, n)
#Splits a vector of strings into a matrix of substrings
str_split_fixed(string, pattern, n)

Order
#Returns the vector of indexes that sorts a character vector
str_order(x)
#Sorts a character vector
str_sort(x)
