> Operators
Arithmetic operators
=A2 + A3 Add two values with +. This example returns 3 + 6 = 9
=A4 - B4 Subtract a value from another with -. This example returns 10 - 7 = 3
=A6 * B1 Multiply two values with *. This example returns 21 * 2 = 42
=C3 / B4 Divide two values with /. This example returns 28 / 7 = 4
=C5% Convert a value to a percentage with %. This example returns 3.2
=B1 ^ C1 Raise a value to a power with ^. This example returns 2 ^ 6 = 64
=A1 = B1 Returns 1 = 2 which is FALSE
3 which is FALSE
Test greater than with >
=A3 > B3 Returns 6 > 5 which is TRUE
=A2 > B2 Returns 3 > 3 which is FALSE
Test greater than or equal to with >=
=A3 >= B3 Returns 6 >= 5 which is TRUE
=A2 >= B2 Returns 3 >= 3 which is TRUE
Test less than with <
=A1 < B1 Returns 1 < 2 which is TRUE
=A2 < B2 Returns 3 < 3 which is FALSE
Test less than or equal to with <=
=A1 <= B1 Returns 1 <= 2 which is TRUE
=A2 <= B2 Returns 3 <= 3 which is TRUE
Use multiple logical conditions to determine the return value with IFS()
=IFS(cond1, return1, cond2, return2)
=IFS(A1 > B1, "1st", A2 > B2, "2nd", A3 > B3, "3rd") Returns "3rd"
Similar to IF(), but allowing multiple pairs of logical conditions and return values. If the first condition, cond1, is TRUE then the function returns the first return value, return1. If the second condition, cond2, is TRUE, the function returns the second return value; and so on.
Provide a default value in case of errors with IFERROR()
=IFERROR(value, value_if_error)
=IFERROR(A5 / A5, 1) Division of two missing values gives an error; this returns 1
Choose a return value based on a table of inputs with SWITCH()
=SWITCH(value, choice1, return1, choice2, return2, ...)
=SWITCH(MID(D3, 1, 5), "World", "planet", "Solar", "planetary system", "Milky", "galaxy", "Local", "galaxy group") Returns "galaxy"
Takes a value as its first argument, followed by pairs of choices and return values. If the value matches the first choice, the function returns the first return value; if the value matches the second choice, the function returns the second return value; and so on. If no values match, the function returns an error.
> Definitions
This cheat sheet describes the behavior of the Microsoft 365 version of Excel; slight differences exist between Excel versions.
Spreadsheet: An application, like Microsoft Excel, where you can store data, perform calculations, and organize information.
Workbook: A file containing a collection of one or more worksheets.
Worksheet: A single page in a workbook. It is a grid of cells arranged in rows and columns.
Cell: A rectangular box in a worksheet that can store a data value, a formula, or other content.
Formula: A piece of code to perform a calculation. Formulas start with an equals sign (=), and contain functions, mathematical operators, values, and cell references.
Cell reference: The location of a cell. The column is described with letters and the row is described with numbers. For example, the cell in the 4th column, 7th row would be denoted D7.
Cell range: A group of adjacent cells in a worksheet. A cell range is typically referred to by its upper-left and lower-right cells.
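The column-letter, row-number scheme described under "Cell reference" can be sketched in a few lines of Python. This is only an illustration of the naming rule: `column_letters` and `cell_reference` are hypothetical helper names, not part of Excel or of any library.

```python
def column_letters(col: int) -> str:
    """Convert a 1-based column number to spreadsheet letters (1 -> A, 27 -> AA)."""
    if col < 1:
        raise ValueError("column numbers start at 1")
    letters = ""
    while col > 0:
        # Shift to 0-based before dividing so that 26 maps to Z rather than A0.
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def cell_reference(col: int, row: int) -> str:
    """Build a reference such as D7 from a column number and a row number."""
    return f"{column_letters(col)}{row}"


print(cell_reference(4, 7))  # 4th column, 7th row
```

Calling `cell_reference(4, 7)` reproduces the D7 example from the definition.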
-  A        B        C
1  Cell A1  Cell B1  Cell C1
> Getting help
You can get help by accessing the help menu.
You can also click on the "Help" button to open the Help pane, where you can browse through various topics and find answers to common questions.
How to add a comment to a cell
Click on the cell where you want to add a comment.
Right-click or CTRL+click on the cell and select the "New Comment" option from the context menu. You can also click on the Insert menu, then "New Comment".
This will open a small text box next to the cell, where you can type your comment.
Once you have entered your comment, click the green arrow button to save it.
Logical NOT with NOT()
=NOT(A1 = B1)
Logical AND with AND()
=AND(A1 > 10, B1 < 20)
Returns OR(1 > 10, 2 < 20) which is TRUE
Returns OR(1 < 2, 2 < 20) which is TRUE
Returns XOR(1 > 2, 2 > 20) which is FALSE
Get the number of cells that meet a condition with COUNTIF()
=SUMIF(A1:A6, ">5") Returns 37: the sum of elements in A1 to A6 filtered with values greater than 5
=SUMIF(A1:A6, ">5", B1:B6) Returns 25: the sum of elements in B1 to B6 corresponding to values in A1 to A6 that are greater than 5
=SUMIFS(B1:B6, A1:A6, ">5", D1:D6, "<>Local Group") Returns 18: the sum of B1:B6 where A1:A6 is greater than 5 and D1:D6 is not equal to "Local Group"
=AVERAGEIF(A1:A6, ">5", B1:B6) Returns 8.33: the mean of elements in B1 to B6 corresponding to values in A1 to A6 that are greater than 5
=AVERAGEIFS(B1:B6, A1:A6, ">5", D1:D6, "<>Local Group") Returns 9: the mean of B1:B6 where A1:A6 is greater than 5 and D1:D6 is not equal to "Local Group"
=ISNUMBER(A1) Checks if a cell is a number. Returns TRUE
=ISTEXT(D1) Checks if a cell is text. Returns TRUE
=ISLOGICAL(A1) Checks if a cell is a boolean. Returns FALSE
=ISLOGICAL(A1=A1) Checks if a cell is a boolean. Returns TRUE
=N(E1) Converts to number. Returns 44927: the serial date - the date as a number, counting Dec 31st 1899 as 0
=N(D1) Converts to number. Returns 0, since it's not a number
=VALUETOTEXT(A1) Convert to text. Returns "1"
=TEXT(C6, "0.00E+0") Convert to formatted text. Returns "4.96E+2"
=DATEVALUE("1/1/2022") Convert text to serial. Returns 44562: the serial date
> Text functions and operators
Basics
=LEN(D5) Returns the length of a string in characters. This example returns 21.
Combining and splitting strings
="Hello " & D1 & "!" Returns "Hello World!"
=REPT(D6, 3) Repeats text. This example returns "UniverseUniverseUniverse"
=TEXTSPLIT(D4, "o") Splits a string on a delimiter. This example returns "L", "cal Gr", "up" in 3 cells: "Local Group" split on the letter "o"
=TEXTSPLIT(D5, {"a","u"}) Splits a string on a delimiter. This example returns "L", "ni", "ke", "S", "percl", "ster" in 6 cells: "Laniakea Supercluster" split on the letter "a" or the letter "u".
Mutating strings
=MID(text, start, [length]) Extracts a substring starting at the position specified in the second argument and with the length specified in the third argument. For example =MID(D6, 4, 5) Returns "verse"
=UPPER(text) Converts the text to uppercase. For example =UPPER(D3) Returns "MILKY WAY"
=LOWER(text) Converts the text to lowercase. For example =LOWER(D3) Returns "milky way"
=PROPER(text) Converts the text to title case. For example =PROPER("milky way") Returns "Milky Way"
> Cells and ranges
Specifying cell locations with column letter, row number format
=B2 Here we refer to the cell in column B, row 2.
Specifying absolute cell references with $ prefixes
The $ symbol before the column letter and/or row number tells Excel that the reference is absolute and should not change when the formula is copied or moved to another cell. The following examples all specify column B, row 2.
=$B$2 Column and row references are both absolute
=$B2 Column reference is absolute, row reference is relative
=B$2 Column reference is relative, row reference is absolute
Specifying ranges with the start:end format
The start:end format is a convenient way to specify a range of cells in a formula.
Here is an example of start:end format when using the SUM() formula:
=SUM(B2:B5)
> Example dataset
Throughout most of this cheat sheet, we'll be using this dummy dataset of 5 columns and 6 rows.
-  A   B  C    D             E
1  1   2  6    World         1/1/2023
2  3   3  21   Solar System  1/2/2023
3  6   5  28   Milky Way     1/3/2023
4  10  7  301  Local Group   1/4/2023
> Counting data
=COUNT(A5:E5) Returns 3: the number of cells in the range containing numbers, dates and currencies
=COUNTA(A5:E5) Returns 4: the number of cells in the range that aren't empty
=COUNTBLANK(A5:E5) Returns 1: the number of cells that are empty or contain the empty string ("")
> Math functions
=LOG(100, 10) Returns 2: the base 10 logarithm of 100
=EXP(2) Returns e^2 = 7.39
=MAX(A1:A6, C1:C3, 12) Returns 28: the largest value in all cell ranges or values inputted
=MIN(A1:A6, C1:C3, 12) Returns 1: the smallest value in all cell ranges or values inputted
=MAXA(A1:A6, C1:C3, FALSE) Returns same as MAX(), except TRUE is valued at 1 and FALSE is valued at 0
=MINA(A1:A6, C1:C3, FALSE) Returns same as MIN(), except TRUE is valued at 1 and FALSE is valued at 0
=SUM(A1:A6, C1:C3, 12) Returns 108: the total of all cell ranges or values inputted
=AVERAGE(A1:A6, C1:C3, 12) Returns 12: the mean of all cell ranges or values inputted
=MEDIAN(A1:A6, C1:C3, 12) Returns 10: the median of all cell ranges or values inputted
=PERCENTILE.INC(C1:C6, 0.25) Returns 22.75: the 25th percentile of the cell range
=ROUND(PI(), 2) Returns 3.14: pi rounded to 2 decimal places
=CEILING(PI(), 0.1) Returns 3.2: pi rounded upwards to the nearest 0.1
=FLOOR(PI(), 0.1) Returns 3.1: pi rounded downwards to the nearest 0.1
=VAR.S(B1:B6) Returns 19.37: sample variance of the cell range
=STDEV.S(B1:B6) Returns 4.40: sample standard deviation of the cell range
> Flow control
Use a logical condition to determine the return value with IF()
=IF(cond, return_if_true, return_if_false)
=IF(ISBLANK(A5), "A5 is blank", "A5 is not blank") Returns "A5 is blank"
> Data manipulation
=FILTER(A1:B6, C1:C6>100) Gets a subset of the cell range in the first input that meets the condition in the second input
=SORT(A1:E6, 4) Returns the dataset with rows in alphabetical order of the fourth column. Sorts the rows of the data according to values in specified columns
=SORTBY(A1:E6, D1:D6) Returns the same as the SORT() example. Alternate, more flexible, syntax for sorting. Rather than specifying the column number, you specify an array to sort by
=UNIQUE(A1:A6) Gets a list of unique values from the specified data
=SEQUENCE(5, 1, 3, 2) Returns 5 rows and 1 column containing the values 3, 5, 7, 9, 11. Generates a sequence of numbers, starting at the specified start value and with the specified step size.
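To make the =SEQUENCE(5, 1, 3, 2) example from the data manipulation formulas more concrete, here is a minimal Python sketch of the same idea: a grid of `rows` by `columns` filled row by row, starting at `start` and increasing by `step`. The function name `sequence` is a hypothetical stand-in written for this illustration; it is not how Excel implements SEQUENCE().

```python
def sequence(rows: int, columns: int = 1, start: float = 1, step: float = 1) -> list:
    """Mimic SEQUENCE(rows, columns, start, step) as a list of rows."""
    # Generate all values in order, then cut them into rows of `columns` items.
    values = [start + step * i for i in range(rows * columns)]
    return [values[r * columns:(r + 1) * columns] for r in range(rows)]


grid = sequence(5, 1, 3, 2)
print(grid)  # [[3], [5], [7], [9], [11]]
```

Each inner list is one row of the result, so the five single-item rows correspond to the 5 rows and 1 column described in the sheet.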
> The different dialects of SQL
Although SQL languages all share a basic structure, some of the specific commands and styles can differ slightly. Popular dialects include MySQL, SQLite, SQL Server, Oracle SQL, and more. PostgreSQL is a good place to start, since it's close to standard SQL syntax and is easily adapted to other dialects.
> Sample Data
Throughout this cheat sheet, we'll use the columns listed in this sample table of airbnb_listings
id | city | country | number_of_rooms | year_listed
1 | Paris | France | 5 | 2018

1. Get all the columns from the table
SELECT *
FROM airbnb_listings;
2. Return the city column from the table
SELECT city
FROM airbnb_listings;
3. Get the city and year_listed columns from the table
SELECT city, year_listed
FROM airbnb_listings;
4. Get the listing id, city, ordered by the number_of_rooms in ascending order
SELECT id, city
FROM airbnb_listings
ORDER BY number_of_rooms ASC;
5. Get the listing id, city, ordered by the number_of_rooms in descending order
SELECT id, city
FROM airbnb_listings
ORDER BY number_of_rooms DESC;

1. Get all the listings where number_of_rooms is more than or equal to 3
SELECT *
FROM airbnb_listings
WHERE number_of_rooms >= 3;
2. Get all the listings where number_of_rooms is more than 3
SELECT *
FROM airbnb_listings
WHERE number_of_rooms > 3;
3. Get all the listings where number_of_rooms is exactly equal to 3
SELECT *
FROM airbnb_listings
WHERE number_of_rooms = 3;
4. Get all the listings where number_of_rooms is lower or equal to 3
SELECT *
FROM airbnb_listings
WHERE number_of_rooms <= 3;
5. Get all the listings where number_of_rooms is lower than 3
SELECT *
FROM airbnb_listings
WHERE number_of_rooms < 3;
7. Get all the listings that are based in 'Paris'
SELECT *
FROM airbnb_listings
WHERE city = 'Paris';
Get all the listings where city starts with 'j' and does not end with 't'
SELECT *
FROM airbnb_listings
WHERE city LIKE 'j%' AND city NOT LIKE '%t';
Filtering on multiple columns
10. Get all the listings in 'Paris' where number_of_rooms is bigger than 3
SELECT *
FROM airbnb_listings
WHERE city = 'Paris' AND number_of_rooms > 3;
11. Get all the listings in 'Paris' OR the ones that were listed after 2012
SELECT *
FROM airbnb_listings
WHERE city = 'Paris' OR year_listed > 2012;
12. Return the listings where number_of_rooms is missing
SELECT *
FROM airbnb_listings
WHERE number_of_rooms IS NULL;

2. Get the average number of rooms per listing across all listings
SELECT AVG(number_of_rooms)
FROM airbnb_listings;
3. Get the listing with the highest number of rooms across all listings
SELECT MAX(number_of_rooms)
FROM airbnb_listings;
4. Get the listing with the lowest number of rooms across all listings
SELECT MIN(number_of_rooms)
FROM airbnb_listings;
Grouping, filtering, and sorting
5. Get the total number of rooms for each country
SELECT country, SUM(number_of_rooms)
FROM airbnb_listings
GROUP BY country;
6. Get the average number of rooms for each country
SELECT country, AVG(number_of_rooms)
FROM airbnb_listings
GROUP BY country;
7. Get the listing with the highest amount of rooms per country
SELECT country, MAX(number_of_rooms)
FROM airbnb_listings
GROUP BY country;
8. Get the listing with the lowest amount of rooms per country
SELECT country, MIN(number_of_rooms)
FROM airbnb_listings
GROUP BY country;
11. Get the number of cities per country, where there are listings
SELECT country, COUNT(city) AS number_of_cities
FROM airbnb_listings
GROUP BY country;
12. Get all the years where there were more than 100 listings per year
SELECT year_listed
FROM airbnb_listings
GROUP BY year_listed
HAVING COUNT(id) > 100;
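The queries in this section can be tried out without installing a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory table; SQLite is one of the dialects mentioned earlier, and other dialects may differ in small details. The first row comes from the sample data; the other three rows are made up purely for illustration, as is the assumption that the columns are `id`, `city`, `country`, `number_of_rooms`, and `year_listed`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE airbnb_listings ("
    "id INTEGER PRIMARY KEY, city TEXT, country TEXT, "
    "number_of_rooms INTEGER, year_listed INTEGER)"
)
rows = [
    (1, "Paris", "France", 5, 2018),     # row from the sample data
    (2, "Tokyo", "Japan", 2, 2017),      # hypothetical row
    (3, "Lyon", "France", 3, 2020),      # hypothetical row
    (4, "Osaka", "Japan", None, 2012),   # hypothetical row with rooms missing
]
conn.executemany("INSERT INTO airbnb_listings VALUES (?, ?, ?, ?, ?)", rows)

# Total number of rooms for each country (NULL values are skipped by SUM).
rooms_per_country = conn.execute(
    "SELECT country, SUM(number_of_rooms) FROM airbnb_listings "
    "GROUP BY country ORDER BY country"
).fetchall()

# Listings in Paris OR listed after 2012.
paris_or_recent = conn.execute(
    "SELECT id FROM airbnb_listings "
    "WHERE city = 'Paris' OR year_listed > 2012 ORDER BY id"
).fetchall()

# Listings where number_of_rooms is missing.
missing_rooms = conn.execute(
    "SELECT id FROM airbnb_listings WHERE number_of_rooms IS NULL"
).fetchall()

print(rooms_per_country, paris_or_recent, missing_rooms)
```

Note that a missing value is tested with `IS NULL` rather than `= NULL`, since a comparison with NULL never evaluates to true.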
Power BI Cheat Sheet
Learn Data Skills Online at www.DataCamp.com
> Why use Power BI?
Easy to use, no coding involved
Integrates seamlessly with any data source
Fast and can handle large datasets
There are three components to Power BI, each of them serving different purposes
Power BI Desktop: Free desktop application that provides data analysis and creation tools.
Power BI service: Cloud-based version of Power BI with report editing and publishing features.
Power BI mobile: A mobile app of Power BI, which allows you to author, view, and share reports on the go.
There are three main views in Power BI
Report view: This view is the default view, where you can visualize data and create reports
Data view: This view lets you examine datasets associated with your reports
Model view: This view helps you establish different relationships between datasets
Upload datasets into Power BI
Underneath the Home tab, click on Get Data
Choose any of your datasets and double click
If you need to transform the data, click Transform which will launch Power Query. Keep reading this cheat sheet for how to apply transformations in Power Query
Inspect your data by clicking on the Data View
Create relationships in Power BI
Click on the Model View from the left-hand pane
Connect key columns from different datasets by dragging one to another (e.g., EmployeeID to SalesPersonID)
Create your first visualization
Click on the Report View and go to the Visualizations pane on the right-hand side
Select the type of visualization you would like to plot your data on. Keep reading this cheat sheet to learn about different visualizations
> Data Visualizations in Power BI
Line Charts: Used for looking at a numeric value over time (e.g. revenue over time)
Scatter: Displays one set of numerical data along the horizontal axis and another set along the vertical axis (e.g. relation age and loan)
Combo Chart: Combines a column chart and a line chart (e.g. actual sales performance vs target)
Treemaps: Used to visualize categories with colored rectangles, sized with respect to their value
Maps: Used to map categorical and quantitative information to spatial locations (e.g. sales per state)
Cards: Used for displaying a single fact or single data point (e.g. total sales)
Table: Grid used to display data in a logical series of rows and columns (e.g. all products with sold items)
Using the Power Query Editor
In the Power Query Editor, you connect to one or many data sources, shape and transform data to meet your needs, and load it into Power BI.
Open the Power Query Editor
While loading data
Underneath the Home tab, click on Get Data
Choose any of your datasets and double click
Click on Transform Data
Under Queries in the Home tab of the ribbon, click on Transform Data drop-down, then on the Transform Data button
You can remove rows dependent on their location, and properties
Click on the Home tab in the Query ribbon
Click on Remove Rows in the Reduce Rows group
Choose which option to remove, whether Remove Top Rows, Remove Bottom Rows, etc.
Choose the number of rows to remove
You can undo your action by removing it from the Applied Steps list on the right-hand side
Click on the Add Column tab in the Query ribbon
Click on Custom Column in the General group
Name your new column by using the New Column Name option
Define the new column formula under the custom column formula using the available data
Replace values
You can replace one value with another value wherever that value is found in a column
In the Power Query Editor, select the cell or column you want to replace
Click on the column or value, and click on Replace Values under the Home tab under the Transform group
Fill the Value to Find and Replace With fields to complete your operation
You can append one dataset to another
Click on Append Queries under the Home tab under the Combine group
Select to append either Two tables or Three or more tables
Add tables to append under the provided section in the same window
Merge Queries
Click on Merge Queries under the Home tab under the Combine group
Select the first table and the second table you would like to merge
Select the columns you would like to join the tables on by clicking on the column from the first dataset, and from the second dataset
Data profiling
Data Profiling is a feature in Power Query that provides intuitive information about your data
Data Analysis Expressions (DAX) is a calculation language used in Power BI that lets you create calculations and perform data analysis. It is used to create calculated columns, measures, and custom tables.
Throughout this section, we'll use the columns listed in this sample table of `sales_data`
deal_size | sales_person | date | customer_name
1,000 | Maria Shuttleworth | 30-03-2022 | Acme Inc.
3,000 | Nuno Rocha | 29-03-2022 | Spotflix
2,300 | Terence Mickey | 13-04-2022 | DataChamp
Aggregating data
SUM(<column>) adds all the numbers in a column
AVERAGE(<column>) returns the average (arithmetic mean) of all the numbers in a column
MEDIAN(<column>) returns the median of the numbers in a column
MIN(<column>) / MAX(<column>) returns the smallest / biggest value in a column
COUNT(<column>) counts the number of cells in a column that contain non-blank values
DISTINCTCOUNT(<column>) counts the number of distinct values in a column.
EXAMPLE
Sum of all deals: SUM('sales_data'[deal_size])
Average deal size: AVERAGE('sales_data'[deal_size])
Distinct number of customers: DISTINCTCOUNT('sales_data'[customer_name])
Logical function
Create a column called large_deal that returns "Yes" if deal_size is bigger than 2,000 and "No" otherwise
large_deal = IF('sales_data'[deal_size] > 2000, "Yes", "No")
Text Functions
LOWER(<text>) converts a text string to all lowercase letters
REPLACE(<old_text>, <start_num>, <num_chars>, <new_text>) replaces part of a text string with a different text string.
EXAMPLE
Change column customer_name to be only lower case
customer_name = LOWER('sales_data'[customer_name])
WEEKDAY(<date>, <return_type>) returns a number from 1 to 7 corresponding to the day of the week of a date (return_type indicates week start and end; 1: Sunday-Saturday, 2: Monday-Sunday)
EXAMPLE
Return the day of week of each deal
week_day = WEEKDAY('sales_data'[date], 2)
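As a cross-check on what the DAX examples in this section should return, the following Python sketch recomputes them on the three `sales_data` rows from the sample table. It is not DAX and does not run inside Power BI; it only mirrors the logic of SUM, AVERAGE, DISTINCTCOUNT, IF, LOWER, and WEEKDAY with return type 2. The dates are assumed to be in day-month-year order, and the key `customer_name_lower` is a name chosen for this illustration.

```python
from datetime import datetime
from statistics import mean

# Rows copied from the sales_data sample table.
sales_data = [
    {"deal_size": 1000, "sales_person": "Maria Shuttleworth",
     "date": "30-03-2022", "customer_name": "Acme Inc."},
    {"deal_size": 3000, "sales_person": "Nuno Rocha",
     "date": "29-03-2022", "customer_name": "Spotflix"},
    {"deal_size": 2300, "sales_person": "Terence Mickey",
     "date": "13-04-2022", "customer_name": "DataChamp"},
]

deal_sizes = [row["deal_size"] for row in sales_data]
total_deals = sum(deal_sizes)                      # like SUM
average_deal = mean(deal_sizes)                    # like AVERAGE
distinct_customers = len({row["customer_name"] for row in sales_data})  # like DISTINCTCOUNT

for row in sales_data:
    # like IF(deal_size > 2000, "Yes", "No")
    row["large_deal"] = "Yes" if row["deal_size"] > 2000 else "No"
    # like LOWER(customer_name)
    row["customer_name_lower"] = row["customer_name"].lower()
    # isoweekday() counts Monday as 1 and Sunday as 7, like WEEKDAY(date, 2)
    row["week_day"] = datetime.strptime(row["date"], "%d-%m-%Y").isoweekday()

print(total_deals, average_deal, distinct_customers)
print([(row["large_deal"], row["week_day"]) for row in sales_data])
```

With these three rows, the total is 6300, the average is 2100, and there are 3 distinct customers.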
Tableau Basics Cheat Sheet
Tableau for Business Intelligence
Learn Tableau online at www.DataCamp.com
Tableau is a business intelligence tool that allows you to effectively report insights through easy-to-use customizable visualizations and dashboards
> Why use Tableau?
Easy to use, no coding involved
Integrates seamlessly with any data source
Fast and can handle large datasets
> Tableau Versions
> Getting started with Tableau
When working with Tableau, you will work with Workbooks. Workbooks contain sheets, dashboards, and stories. Similar to Microsoft Excel, a Workbook can contain multiple sheets. A sheet can be any of the following:
Worksheet: A worksheet is a single view in a workbook.
Dashboard: A collection of multiple worksheets.
Story: A story is a collection of multiple dashboards.
> The Anatomy of a Worksheet
When opening a worksheet, you will work with a variety of tools and interfaces
The Canvas
The canvas is where you'll create data visualizations
1. Tableau Canvas: The canvas takes up most of the screen on Tableau and is where you can add visualizations
2. Rows and columns: Rows and columns dictate how the data is displayed in the canvas. When dimensions are placed, they create headers for the rows or columns while measures add quantitative values
3. Marks card: The marks card allows users to add visual details such as color, size, labels, etc. to rows and columns. This is done by dragging fields from the data pane into the marks card
The Sidebar
In the sidebar, you'll find useful panes for working with data
Data: The data pane on the left-hand side contains all of the fields in the currently selected data source
Analytics: The analytics pane on the left-hand side lets you add useful insights like trend lines, error bars, and other useful summaries to visualizations
3. Measures: A measure is a type of field that contains quantitative values (e.g. revenue, costs, and market sizes). When dragged into a view, this data is aggregated, which is determined by the dimensions in the view
4. Data types: Every field has a data type which is determined by the type of information it contains. The available data types in Tableau include text, date values, date & time values, numerical values, boolean values, geographical values, and cluster groups
Upload a dataset to Tableau
Launch Tableau
In the Connect section, under To a File, press on the file format of your choice
For selecting an Excel file, select .xlsx or .xls
Creating your first visualization
Once your file is uploaded, open a Worksheet and click on the Data pane on the left-hand side
Drag and drop at least one field into the Columns section, and one field into the Rows section at the top of the canvas
To add more detail, drag and drop a dimension into the Marks card (e.g. drag a dimension over the color square in the marks card to color visualization components by that dimension)
To add a summary insight like a trendline, click on the Analytics pane and drag the trend line into your visualization
You can change the type of visualization for your data by clicking on the Show Me button on the top right
> Data Visualizations in Tableau
Stacked Bar Chart: Used to show categorical data within a bar chart (e.g., sales by region and department)
Side-by-Side Bar Chart: Used to compare values across categories in a bar chart format (e.g., sales by region comparing product types)
Line Charts: Used for looking at a numeric value over time (e.g., revenue over time)
Scatter Plot: Used to identify patterns between two continuous variables (e.g., profit vs. sales volume)
Box-and-Whisker Plot: Used to compare distributions between categorical variables (e.g., distribution of revenue by region)
Heat Map: Used to visualize data in rows and columns as colors (e.g., revenue by marketing channel)
Map: Used to show geographical data with color formatting (e.g., Covid cases by state)
Treemap: Used to show hierarchical data (e.g., show how much revenue subdivisions generate relative to the whole department within an organization)
Dual Combination: Used to show two visualizations within the same visualization
> Customizing Visualizations with Tableau
Tableau provides a deep ability to filter, format, aggregate, customize, and highlight specific parts of your data visualizations
Drag-and-drop a field you want to filter on and add it to the Filters card
Fill out in the modal how you would like your visuals to be filtered on the data
Aggregating data
When data is dragged into the Rows and Columns on a sheet, it is aggregated based on the dimensions in the sheet. This is typically a summed value. The default aggregation can be changed using the steps below:
Right-click on a measure field in the Data pane
Go down to Default properties, Aggregation, and select the aggregation you would like to use
Changing colors
Color is a critical component of visualizations. It draws attention to details. Attention is the most important component of strong storytelling. Colors in a graph can be set using the marks card.
Create a visualization by dragging fields into the Rows and Columns section at the top of the screen
Drag dimensions into the Marks field, specifically into the Color square
To change from the default colors, go to the upper-right corner of the color legend and select Edit Colors. This will bring up a dialog that allows you to select a different palette
Changing fonts
Fonts can help with the aesthetic of the visualization or help with consistent branding.
> Creating dashboards with Tableau
Dashboards are an excellent way to consolidate visualizations and present data to a variety of stakeholders. Here is a step by step process you can follow to create a dashboard.
1. Launch Tableau
2. In the Connect section under To A File, press on your desired file type
3. Select your file
4. Click the New Sheet at the bottom to create a new sheet
5. Create a visualization in the sheet by following the steps in the previous sections of this cheat sheet
6. Repeat steps 4 and 5 until you have created all the visualizations you want to include in your dashboard
7. Click the New Dashboard at the bottom of the screen
8. On the left-hand side, you will see all your created sheets. Drag sheets into the dashboard
9. Adjust the layout of your sheets by dragging and dropping your visualizations
A story is made of story points, which let you cycle through different visualizations and dashboards
To begin adding to the story, add a story point from the left-hand side. You can add a blank story point
To add a summary text to the story, click Add a caption and summarize the story point
Add as many story points as you would like to finalize your data story
Data Storytelling & Communication Cheat Sheet
Learn more online at www.DataCamp.com
Data storytelling is often called the last mile of analytics. Sound communication skills allow data professionals to drive action out of their insights. According to Brent Dykes, author of Effective Data Storytelling: How to Drive Change with Data, Narrative, and Visuals, data storytelling is a combination of data, visuals, and narrative.
[Figure: Data, Visuals, Narrative]
The three elements of data storytelling
(Source: Effective Data Storytelling: How to Drive Change with Data, Narrative, and Visuals by Brent Dykes)
> Crafting effective visuals
Choose the best visualization for your story
Each plot type is suited for communicating specific things about specific types of data. Start by choosing an appropriate plot type.
Line plot, Bar plot, Scatter plot, Histogram
Ruthlessly edit your plots to remove or minimize elements that distract from the message of the plot. In particular, make non-data elements (parts of the plot that don't directly represent a data value, like the grid lines) less distracting. A great example comes from Darkhorse Analytics, which showcases exactly the value of decluttering visualizations.
[Figure: "Calories per 100g" bar chart comparing French Fries, Potato Chips, Bacon, Pizza, and Chili Dog]
Decluttering a visualization in action
Use text appropriately
While too much text can add clutter, text can also be an extremely effective tool at highlighting insights within your visualizations. Cole Nussbaumer Knaflic, author of Storytelling with Data, provides an excellent example with the following visualization.
[Figure: "Please approve the hire of 2 FTEs to backfill those who quit in the past year". Line chart of ticket volume over time (number of tickets received and processed, Jan to Dec 2014), annotated: "2 employees quit in May. We nearly kept up with incoming volume in the following two months, but fell behind with the increase in Aug and haven't been able to catch up since." Footnote: "Data source: XYZ Dashboard, as of 12/31/2014 | A detailed analysis on tickets processed per person and time to resolve issues was undertaken to inform this request and can be provided if needed."]
How text can be a useful visual tool when crafting effective visuals
(Source: Storytelling with Data by Cole Nussbaumer Knaflic)
Using text in data visualizations
When applicable, label axes and titles for clarity
Label important data points when necessary
Provide useful context around insights within the title or subtitle
Adjust font size when highlighting specific messages within your labels
When applicable, try to answer common audience questions with labels
Use colors effectively
The fundamentals of color theory in data visualization
Color is one of the most powerful tools available for emphasizing different aspects of your data visualization. Here are different properties to keep in mind when choosing an appropriate color palette for your visualization.
Hue represents the range of possible colors, from red, through orange, green and blue, to purple and back to red.
Chroma is the intensity of the color, from grey to a bright color.
Luminance is the brightness of the color, from black to white.
There are three common types of color palettes that depend on these dimensions.
Do not mislead with data stories
The fastest way to lose credibility when presenting data stories is to inadvertently (or intentionally) mislead with your data insights. Here are top best practices to avoid misleading with data stories.
[Figure: "Same Data, Different Y-Axis". Two "Interest rates" charts of the same data, one with a y-axis from 3.142% to 3.154% and one with a y-axis from 0.50% to 3.50%]
> Crafting effective narratives with data
Know the audience
To communicate effectively, you need to know who your audience is, and what their priorities are. There is a range of possible audiences you may encounter when presenting, and crafting an audience specific message will be important. Examples of audiences you may present to are:
Executive: Basic data literacy skills. Prioritizes outcomes & decisions. Cares much more about business impact than a 1% incremental gain in a machine learning model accuracy or a new technique you're using
Data Leader: Data expert. Prioritizes rigour & insights. Cares much more about how you arrived at your insights and to battle test them for rigour
Business Partner: Advanced data literacy skills. Prioritizes tactical next steps. Cares much more about how your analysis impacts their workflow, and what should be their main takeaway from the data story
Considerations when crafting audience specific messaging
Aspect | What do you need to consider?
Prior knowledge | What context do they have about the problem? What is their level of data literacy?
Priorities | What does the audience care about? How does your message relate to their goals? Who is driving decision-making within your audience?
Constraints | What is the audience's preferred format? How much time does an audience have to consume a data story?
Leave any highly technical details to the appendix
Ensure there is a narrative arc to your presentation
Long-form report | Be extra diligent about providing useful context around data visualizations and insights. Leave any highly technical details to the appendix
Notebook | Ensure that you provide useful context on how you arrived at a certain conclusion
Dashboard | Make use of the dashboard grid layout. Organize data insights from left to right, top to bottom. Provide useful summary text of key visualizations in your dashboard
Learn more about data storytelling at
Learn Data Visualization online at www.DataCamp.com One of the most common ways to
show part to whole data. It is also
The donut pie chart is a variant of the
pie chart, the difference being it has a
Heatmaps are two-dimensional charts
that use color shading to represent
Best to compare subcategories within
categorical data. Can also be used to
2D rectangles whose size is
proportional to the value being
commonly used with percentages hole in the center for readability data trends. compare percentages measured and can be used to display
hierarchically structured data
Use cases Use cases Use cases Use cases Use cases
> Capture a trend
Line chart: The most straightforward way to capture how a numeric variable is changing over time. Use cases: Revenue in $ over time; Energy consumption in kWh over time; Google searches over time.
Multi-line chart: Captures multiple numeric variables over time. It can include multiple axes allowing comparison of different units and scale ranges. Use cases: Apple vs Amazon stocks over time; Lebron vs Steph Curry searches over time; Bitcoin vs Ethereum price over time.
Area chart: Shows how a numeric value progresses by shading the area between line and the x-axis. Use cases: Total sales over time; Active users over time.
Stacked area chart: Most commonly used variation of area charts, the best use is to track the breakdown of a numeric value by subgroups. Use cases: Active users over time by segment; Total revenue over time by country.
Spline chart: Smoothened version of a line chart. It differs in that data points are connected with smoothed curves to account for missing values, as opposed to straight lines. Use cases: Electricity consumption over time; CO2 emissions over time.

> Visualize a single value
Card: Cards are great for showing and tracking KPIs in dashboards or presentations. Example card: $7.47M Total Sales. Use cases: Revenue to date on a sales dashboard; Total sign-ups after a promotion.
Table chart: Best to be used on small datasets, it displays tabular data in a table. Use cases: Account executive leaderboard; Registrations per webinar.
Gauge chart: This chart is often used in executive dashboard reports to show relevant KPIs. Use cases: NPS score; Revenue to target.

> Capture distributions
Histogram: Shows the distribution of a variable. It converts numerical data into bins as columns. The x-axis shows the range, and the y-axis represents the frequency. Use cases: Distribution of salaries in an organization; Distribution of height in one cohort.
Box plot: Shows the distribution of a variable using 5 key summary statistics: minimum, first quartile, median, third quartile, and maximum. Use cases: Gas efficiency of vehicles; Time spent reading across readers.
Violin plot: A variation of the box plot. It also shows the full distribution of the data alongside summary statistics. Use cases: Time spent in restaurants across age groups; Length of pill effects by dose.
Density plot: Visualizes a distribution by using smoothing to allow smoother distributions and better capture the distribution shape of the data. Use cases: Distribution of price of hotel listings; Comparing NPS scores by customer segment.
One of the easiest charts to read which helps in quick comparison of categorical data. One axis contains categories and the other axis represents values. Use cases: Volume of Google searches by region; Market share in revenue by product.
Also known as a vertical bar chart, where the categories are placed on the x-axis. These are preferred over bar charts for short labels, date ranges, or negatives in values. Use cases: Brand market share; Profit Analysis by region.
Most commonly used chart when observing the relationship between two variables. It is especially useful for quickly surfacing potential correlations between data points. Use cases: Display the relationship between time-on-platform and churn; Display the relationship between salary and years spent at company.
A hybrid between a scatter plot and a line plot, the scatter dots are connected with a line. Use cases: Cryptocurrency price index; Visualizing timelines and events when analyzing two variables.
Often used to visualize data points with 3 dimensions, namely visualized on the x-axis, y-axis, and with the size of the bubble. It tries to show relations between data points using location and size. Use cases: Adwords analysis: CPC vs Conversions vs Share of total conversions; Relationship between life expectancy, GDP per capita, & population size.
A convenient visualization for visualizing the most prevalent words that appear in a text. Use cases: Top 100 used words by customers in customer service tickets.
Useful for representing flows in systems. This flow can be any measurable quantity. Use cases: Energy flow between countries; Supply chain volumes between warehouses.
Useful for presenting weighted relationships or flows between nodes. Especially useful for highlighting the dominant or important flows. Use cases: Export between countries to showcase biggest export partners; Supply chain volumes between the largest warehouses.
Similar to a graph, it consists of nodes and interconnected edges. It illustrates how different items have relationships with each other. Use cases: How different airports are connected worldwide; Social media friend group analysis.
Learn Data Skills Online at www.DataCamp.com
Descriptive Statistics Cheat Sheet
Learn more online at www.DataCamp.com

> Key Definitions
Throughout this cheat sheet, you'll find terms and specific statistical jargon being used. Here's a rundown of all the terms you may encounter.
Variable: In statistics, a variable is a quantity that can be measured or counted. In data analysis, a variable is typically a column in a data frame.
Descriptive statistics: Numbers that summarize variables. They are also called summary statistics or aggregations.
Categorical data: Data that consists of discrete groups. The categories are called ordered (e.g., educational levels) if you can sort them from lowest to highest, and unordered otherwise (e.g., country of origin).
Numerical data: Data that consists of numbers (e.g., age).

> Numerical Dataset: Glasses of Water
To illustrate statistical concepts on numerical data, we'll be using a numerical variable, consisting of the volume of water in different glasses.
300 ml, 60 ml, 300 ml, 120 ml, 180 ml, 180 ml, 300 ml

Measures of Center
Measures of center allow you to describe or summarize your data by capturing one value that describes the center of its distribution.
Measure | Definition | How to find it | Result
Arithmetic mean | The total of the values divided by how many values there are | (300 + 60 + 300 + 120 + 180 + 180 + 300) / 7 | 205.7 ml
Median | The middle value, when sorted from smallest to largest | 60, 120, 180, 180, 300, 300, 300 | 180 ml
Mode | The most common value | 300 ml occurs three times | 300 ml

Other Measures of Location
Minimum | The lowest value in your data | 60 ml
Maximum | The highest value in your data | 300 ml
Percentile: Cut points that divide the data into 100 intervals with the same amount of data in each interval (e.g., in the water cup example, the 100th percentile is 300 ml).
Quartile: Similar to the concept of percentile, but with four intervals rather than 100. The first quartile is the same as the 25th percentile, which is 120 ml. The third quartile is the same as the 75th percentile, which is 300 ml.

Measures of Spread
Sometimes, rather than caring about the size of values, you care about how different they are.
Measure | Definition | How to find it | Result
... | ... all divided by one less than the number of data points | ... / (7 - 1) | ...
Inter-quartile range | The third quartile minus the first quartile | 300 ml - 120 ml | 180 ml

Visualizing Numeric Variables
There are a variety of ways of visualizing numerical data; here are a few of them in action:
Histogram: Shows the distribution of a variable. It converts numerical data into bins as columns. The x-axis shows the range, and the y-axis represents the frequency.
Box plot: Shows the distribution of a variable using 5 key summary statistics: minimum, first quartile, median, third quartile, and maximum.

> Categorical Data: Trail Mix
To illustrate statistical concepts on categorical data, we'll be using an unordered categorical variable, consisting of different elements of a trail mix.

Counts and Proportions
Counts and proportions are measures of how much data you have. They allow you to understand how many data points belong to different categories in your data.
A count is the number of times a data point occurs in the dataset.
A proportion is the fraction of times a data point occurs in the dataset.
Food category | Count | Proportion
Almond | 15 | 15 / 53 = 0.283
Cashew | 13 | 13 / 53 = 0.245
Cranberry | 25 | 25 / 53 = 0.472

Visualizing Categorical Variables
One of the easiest charts to read which helps in quick comparison of categorical data. One axis contains categories and the other axis represents values.
Best to compare subcategories within categorical data. Can also be used to compare proportions.
2D rectangles whose size is proportional to the value being measured and can be used to display hierarchically structured data.

> Correlation
Correlation is a measure of the linear relationship between two variables. That is, when one variable goes up, does the other variable go up or down? There are several algorithms to calculate correlation, but it is always a score between -1 and +1.
[Figure: scatter plots labeled Strong negative, Weak negative, No correlation, Weak positive, Strong positive.]
-1: When X increases, Y decreases. Scatter plot forms a perfect straight line with negative slope.
0: There is no linear relationship between X and Y, so the scatter plot looks like a noisy mess.
Between 0 and +1: When X increases, Y increases.
+1: When X increases, Y increases. Scatter plot forms a perfect straight line with positive slope.
Note that correlation does not account for non-linear effects, so if X and Y do not have a straight-line relationship, the correlation score may not be meaningful.
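The summary statistics for the glasses of water and the trail mix proportions can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names and the `pearson` helper are illustrative assumptions, not part of the cheat sheet, and the quartiles rely on the default "exclusive" method of `statistics.quantiles`, which happens to reproduce the 120 ml and 300 ml values quoted above (other quartile conventions can give different numbers).

```python
import math
import statistics

# Volumes of water (ml) in the seven glasses of the numerical dataset
glasses_ml = [300, 60, 300, 120, 180, 180, 300]

# Measures of center
mean_ml = statistics.mean(glasses_ml)      # total divided by the number of values
median_ml = statistics.median(glasses_ml)  # middle value after sorting
mode_ml = statistics.mode(glasses_ml)      # most common value

# Quartiles (default "exclusive" method) and inter-quartile range
q1, q2, q3 = statistics.quantiles(glasses_ml, n=4)
iqr_ml = q3 - q1

# Sample variance and standard deviation (both divide by n - 1)
variance_ml2 = statistics.variance(glasses_ml)
stdev_ml = statistics.stdev(glasses_ml)

# Counts and proportions for the trail mix
counts = {"Almond": 15, "Cashew": 13, "Cranberry": 25}
total = sum(counts.values())
proportions = {food: n / total for food, n in counts.items()}


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


print(round(mean_ml, 1), median_ml, mode_ml, q1, q3, iqr_ml)
print(round(variance_ml2, 2), round(stdev_ml, 1))
print({food: round(p, 3) for food, p in proportions.items()})
```

Calling `pearson` on two made-up sequences that lie on a rising straight line gives a score of about +1, and on a falling straight line about -1, which matches the interpretation of the correlation score described above.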
Introduction to Probability Rules Cheat Sheet
Learn statistics online at www.DataCamp.com

Event: A thing that you can observe whether it happens or not.
Probability: The chance that an event happens, on a scale from 0 (cannot happen) to 1 (always happens). Denoted P(event).
Probability universe: The probability space where all the events you are considering can either happen or not happen.
Mutually exclusive events: If one event happens, then the other event cannot happen (e.g., you cannot roll a dice that is both 5 and 1).
Independent events: If one event happens, it does not affect the probability that the other event happens (e.g., the weather does not affect the outcome of a dice roll).
Dependent events: If one event happens, it changes the probability that the other event happens (e.g., the weather affects traffic outcomes).
Conjunctive probability (a.k.a. joint probability): The probability that all events happen.
Disjunctive probability: The probability that at least one event happens.
Conditional probability: The probability that one event happens, given another event happened.

> Multiplication Rules: Probability of two events happening
Intersection A ∩ B
Mutually exclusive (disjoint)
Example: If the probability of it being sunny at midday is 0.3 and the probability of it raining at midday is 0.4, the probability of it being sunny and rainy is 0, since these events are mutually exclusive.
Example: If the probability of it being sunny at midday is 0.3 and the probability of your favorite soccer team winning their game today is 0.6, then the probability of it being sunny at midday and your favorite soccer team winning their game today is 0.3 * 0.6 = 0.18.

The conjunctive fallacy
Definition: The probability of both events happening is always less than or equal to the probability of one event happening. That is, P(A ∩ B) ≤ P(A), and P(A ∩ B) ≤ P(B). The conjunctive fallacy is when you don't think carefully about probabilities and estimate that the probability of both events happening is greater than the probability of one of the events.
Example: A famous example known as "The Linda problem" comes from a 1980s research experiment. A fictional person was described:
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice and also participated in anti-nuclear demonstrations.
Participants had to choose which statement had a higher probability of being true:
1. Linda is a bank teller.
2. Linda is a bank teller and is active in the feminist movement.
Many participants fell for the conjunctive fallacy and chose option 2, even though by the multiplication rule it cannot be more likely than option 1.

> Addition Rules: Probability of at least one event happening
Union A ∪ B
Mutually exclusive (disjoint)
Example: If the probability of it being sunny at midday is 0.3 and the probability of it raining at midday is 0.4, the probability of it being sunny or rainy is 0.3 + 0.4 = 0.7, since these events are mutually exclusive.
Formula: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Example: If the probability of it being sunny at midday is 0.3 and the probability of your favorite soccer team winning their game today is 0.6, then the probability of it being sunny at midday or your favorite soccer team winning their game today is 0.3 + 0.6 - (0.3 * 0.6) = 0.72.

The disjunctive fallacy
Definition: The probability of at least one event happening is always greater than or equal to the probability of one event happening. That is, P(A ∪ B) ≥ P(A), and P(A ∪ B) ≥ P(B). The disjunctive fallacy is when you don't think carefully about probabilities and estimate that the probability of at least one event happening is less than the probability of one of the events.
Example: Returning to the "Linda problem", consider having to rank these two statements in order of probability:
1. Linda is a bank teller.
2. Linda is a bank teller or is active in the feminist movement.
The disjunctive fallacy would be to think that choice 1 had a higher probability of being true, even though that is impossible because of the additive rule of probabilities.

That is, if you have a cloudy morning it is twice as likely to rain as if you didn't have a cloudy morning, due to the dependence of the events.
Example: The odds of basketball player Stephen Curry successfully shooting a three-pointer is the probability that he scores divided by the probability that he misses, 0.43 / 0.57 = 0.75.
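The worked examples above (sunny and win, sunny or win, sunny or rainy, and the three-pointer odds) can be reproduced with a few small helpers. This is a minimal sketch; the function names are illustrative assumptions, and the multiplication helper only applies when the two events are independent.

```python
def p_both_independent(p_a, p_b):
    """Multiplication rule for independent events: P(A and B) = P(A) * P(B)."""
    return p_a * p_b


def p_at_least_one(p_a, p_b, p_both):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_both


def odds(p):
    """Odds of an event: probability it happens divided by probability it does not."""
    return p / (1 - p)


p_sunny, p_rainy, p_win = 0.3, 0.4, 0.6

# Sunny and the team winning are treated as independent events
sunny_and_win = p_both_independent(p_sunny, p_win)

# At least one of sunny / team winning
sunny_or_win = p_at_least_one(p_sunny, p_win, sunny_and_win)

# Sunny and rainy are mutually exclusive, so P(sunny and rainy) = 0
sunny_or_rainy = p_at_least_one(p_sunny, p_rainy, 0.0)

curry_odds = odds(0.43)

print(round(sunny_and_win, 2), round(sunny_or_win, 2),
      round(sunny_or_rainy, 2), round(curry_odds, 2))
```

Note that `sunny_and_win` is no larger than either input probability and `sunny_or_win` is no smaller than either, which is exactly what the conjunctive and disjunctive fallacies get wrong.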
Learn Python online at www.DataCamp.com

> How to use this cheat sheet
Python is the most popular programming language in data science. It is easy to learn and comes with a wide array of powerful libraries for data analysis. This cheat sheet provides beginners and intermediate users a guide to starting using Python. Use it to jump-start your journey with Python. If you want more detailed Python cheat sheets, check out the following cheat sheets:
Importing data in python
Data wrangling in pandas

type('a') # Get the type of an object; this returns str

> Importing packages
Python packages are a collection of useful tools developed by the open-source community. They extend the capabilities of the Python language. To install a new package (for example, pandas), you can go to your command prompt and type in pip install pandas. Once a package is installed, you can import it as follows.
import pandas # Import a package without an alias

> The working directory

> Operators
a = 5 # Assign a value to a
(1 != 1) & (1 < 1) # Logical AND with &
(1 != 1) ^ (1 < 1) # Logical XOR with ^

> Getting started with lists
A list is an ordered and changeable sequence of elements. It can hold integers, characters, floats, strings, and even objects.
List functions and methods
reversed(x) # Reverse the order of elements in x e.g., [2,3,1]
Selecting list elements
Python lists are zero-indexed (the first element has index 0). For ranges, the first element is included but the last is not.
Concatenating lists
x = [1, 3, 6]
3 * x # Returns [1, 3, 6, 1, 3, 6, 1, 3, 6]

> Getting started with dictionaries
A dictionary stores data values in key-value pairs. That is, unlike lists which are indexed by position, dictionaries are indexed by their keys, the names of which must be unique.
Creating dictionaries
# Create a dictionary with {}
{'a': 1, 'b': 4, 'c': 9}
x.values() # Get the values of a dictionary, returns dict_values([1, 2, 3])

NumPy is a Python package for scientific computing. It provides multidimensional array objects and efficient operations on them. To import NumPy, you can run this Python code: import numpy as np
# Return a stepped sequence from start (inclusive) to end (exclusive)
np.repeat([1, 3, 6], 3) # Returns array([1, 1, 1, 3, 3, 3, 6, 6, 6])
np.mean(x) # Calculate mean

> Getting started with characters and strings
# Create a string with double or single quotes
"""
A Frame of Data
"""
str[0:2] # Get a substring from starting to ending index (exclusive)
Combining and splitting strings
Mutate strings
str = "Jack and Jill" # Define str
str.lower() # Convert a string to lowercase, returns 'jack and jill'

> Getting started with DataFrames
Pandas is a fast and powerful package for data analysis and manipulation in Python. To import the package, you can use import pandas as pd. A pandas DataFrame is a structure that contains two-dimensional data stored as rows and columns. A pandas series is a structure that contains one-dimensional data.
Creating DataFrames
# Create a dataframe from a dictionary
pd.DataFrame({ ... })
# Create a dataframe from a list of dictionaries
pd.DataFrame([ ... ])
df['col'] # Select a single column
df[['col1', 'col2']] # Select multiple columns
df.iloc[:, 2] # Select all rows of the third column, by position
df.iloc[3, 2] # Select the value in the fourth row, third column, by position
pd.concat([df, df]) # Concatenate DataFrames vertically
df.mean() # Calculate the mean of each column
# Get rows matching a condition
# Get unique rows
# Rename columns
df.sort_values(by='col_name') # Sort rows by the values in a column
df.nlargest(n, 'col_name') # Get the n rows with the largest values in a column
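The list, string, and dictionary snippets above can be tried together in one short script. This is a minimal sketch using only built-in types; the variable names are illustrative assumptions, and the string is stored in `text` rather than `str` so that the built-in `str` type is not overwritten.

```python
# Lists are zero-indexed; slices include the start index but exclude the end
x = [1, 3, 6]
first = x[0]
first_two = x[0:2]
repeated = 3 * x                 # repeats the whole list three times
reversed_x = list(reversed(x))   # reversed() returns an iterator, so wrap it in list()

# Strings support the same indexing and slicing rules
text = "Jack and Jill"
prefix = text[0:2]
lowered = text.lower()

# Dictionaries are indexed by unique keys rather than by position
squares = {'a': 1, 'b': 4, 'c': 9}
value_for_b = squares['b']
values = list(squares.values())

print(first, first_two, repeated, reversed_x)
print(prefix, lowered, type(text))
print(value_for_b, values)
```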
Learn R online at www.DataCamp.com

> How to use this cheat sheet
... manipulations, R has become the preferred computing environment for many data scientists today.
This cheat sheet will cover an overview of getting started with R. Use it as a handy, high-level reference for a quick start with R. For more detailed R Cheat Sheets, follow the highlighted cheat sheets below.
xts Cheat Sheet
data.table Cheat Sheet

?max #Shows the help documentation for the max function
??"max" #Returns documentation associated with a given input

Information about objects
class(my_df) #Returns the class of a given object

> Using packages
R packages are collections of functions and tools developed by the R community. They increase the power of R by improving existing base R functionalities, or by adding new ones.
library(tidyverse) #Lets you load and use packages (e.g., tidyverse package)

> The working directory
The working directory is a file path that R will use as the starting point for relative file paths. That is, it's the default location for importing and exporting files. An example of a working directory looks like "C://file/path"
getwd() #Returns your current working directory

R has multiple operators that allow you to perform a variety of tasks. Arithmetic operators let you perform arithmetic such as addition and multiplication. Relational operators are used to compare between values. Logical operators are used for Boolean operations.
Arithmetic Operators
a ^ b #Exponentiation of a variable
Relational Operators
a < b #Tests for lower than
a >= b #Tests for greater than or equal to
Logical Operators
| #Element-wise OR
|| #Logical OR

> Getting started with vectors
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data.
c(1,3,5) #Creates a vector using elements separated by commas; returns 1 3 5
seq(2,8,by = 2) #Creates a vector between two ...; returns 2 4 6 8
... elements repeated a number of times.
rep(c(2,8),each = 3) #Creates a vector of given elements repeating each element a number of times; returns 2 2 2 8 8 8

Vector functions
These functions perform operations over a whole vector.

Selecting vector elements
These functions allow us to refer to particular parts of a vector.
my_vector[c(2,6)] #Returns the second and sixth elements
my_vector[x == 5] #Returns elements equal to 5

Math functions
These functions enable us to perform basic mathematical operations within R
max(x) #Returns maximum value of a vector
min(x) #Returns minimum value of a vector
mean(x) #Returns mean of a vector
median(x) #Returns median of a vector
sum(x) #Returns sum of a vector
var(x) #Variance of a vector
sd(x) #Standard deviation of a vector
cor(x, y) #Correlation between two vectors
signif(x, n) #Round off n significant figures

The "stringr" package makes it easier to work with strings in R - you should install and load this package to use the following functions.
Find Matches
#Detects the presence of a pattern match at the ...
#... matches in a string
#... with a column for each group in the pattern
str_match(string, pattern)
#Counts the number of pattern matches in a string
str_count(string, pattern)
Subset
#Returns strings that contain a pattern match
Mutate
#Replaces substrings by identifying the substrings ...
str_replace_all(string, pattern, replacement)
Join and Split
#Repeats strings n times
Order
str_order(x)

> Getting started with Data Frames in R
A data frame has the variables of a data set as columns and the observations as rows.
#This creates the data frame df, seen on the right
x y z
1 h 12
2 i 13
3 j 14
df[ ,3] #Third column, all rows
df[2,3] #Value in the second row, third column
#This selects all columns of the third row
#This selects the third column of the ...
#This selects the column z

> Manipulating Data Frames in R
dplyr allows us to easily and precisely manipulate data frames. To use the following functions, you should install dplyr using install.packages("dplyr") and load it using library(dplyr).
#Takes a sequence of vector, ... and combines them by rows
bind_rows(df1, df2)
#Moves columns to a new position
arrange(df, desc(x)) #Orders rows by x, in descending order
filter(df, x == 2) #Keeps rows where x equals 2
#Removes rows with duplicate values
distinct(df, z)
#Computes table of summaries
summarise(df, total = sum(x))
slice(df, 10:15) #Selects rows 10 to 15 by position
#Selects rows with the highest values
slice_max(df, z, prop = ...)
#Use group_by() to create a "grouped" copy of a table grouped by columns (similarly to a pivot table in spreadsheets). dplyr functions will then manipulate ...
group_by(z) %>%
#... vector, by name or index
pull(df, y)
select(df, y) #Keeps only the column y