Week 1
Data Analytics: The umbrella that contains everything else (e.g., data analysis, the data
ecosystem).
Data Analysis: To prepare, process, and transform raw data to draw conclusions, make
predictions, and drive informed decision-making.
In addition, try asking yourself these questions about a project to help find the perfect
balance:
SUBJECT-MATTER EXPERTS are very familiar with the business problem and can look
at the results of data analysis to validate the choices being made. Including insights
from people who are familiar with the business problem is an example of data-driven
decision-making.
https://www.informit.com/articles/article.aspx?p=2473128&seqNum=11&ranMID=24808
Week 2
Understanding context: The analytical skill that has to do with how you group
things into categories
Technical Mindset: The analytical skill that involves breaking processes down into
smaller steps and working with them in an orderly, logical way
Data design: the analytical skill that involves how you organize information
Data strategy: The analytical skill that involves the management of the people,
processes, and tools used in data analysis
5 Why’s: The Five Whys process is used to reveal the root cause of a problem
through the answer to the fifth question.
Gap Analysis: It lets you compare where you are currently on a process, and where
you want to be in the future.
What did we not consider before? - Another good question to ask as a data
analyst for problem solving.
I am always thinking of ways to automate tasks I do on a daily basis, and for that purpose I
often divide those tasks into subtasks where I can spot possibilities for automation or
optimization. I can benefit from that habit on my way to becoming a data analyst.
I also like to play puzzle games like sudoku, which is basically about deciphering patterns,
and that relates closely to understanding context.
In general, I think my current ways of thinking are close to those of a data analyst. It is a
question of continuing to study to polish them, and of making a habit of thinking like a professional data analyst.
● Identifying the motivation behind the collection of a dataset helps clarify the context.
● Gathering extra information about data to understand the broader picture provides
context.
● Adding descriptive headers to columns of data in a spreadsheet provides context about
the data below.
Week 3
Act: Put your insights to work in order to solve the original problem
Data Life Cycle
1. Plan: Decide what kind of data is needed, how it will be managed, and who
will be responsible for it.
2. Capture: Collect or bring in data from a variety of different sources.
3. Manage: Care for and maintain the data. This includes determining how and
where it is stored and the tools used to do so.
4. Analyze: Use the data to solve problems, make decisions, and support
business goals. (This is the step where Data Analysts shine).
5. Archive: Keep relevant data stored for long-term and future reference.
6. Destroy: Remove data from storage and delete any shared copies of the
data.
Warning: Be careful not to mix up or confuse the six stages of the data life cycle
(Plan, Capture, Manage, Analyze, Archive, and Destroy) with the six phases of the
data analysis life cycle (Ask, Prepare, Process, Analyze, Share, and Act). They
shouldn't be used or referred to interchangeably.
To sum it up, although data life cycles vary, one data management principle is
universal. Govern how data is handled so that it is accurate, secure, and available to
meet your organization's needs.
Plan: a business decides what kind of data it needs, how it will be managed, who will
be responsible for it, and the optimal outcomes.
Capture: gathering data from various sources and bringing it into the organization.
Manage: how data is cared for, how and where it’s stored, the tools used to keep it
safe and secure, and the actions taken to make sure it’s maintained properly.
Analyze: using data to make smart decisions and support business goals.
Archive: Archiving involves storing files in a place where they are still available.
Destroy: Erasing or shredding files describes the destroy phase of the data life cycle.
Query languages
Spreadsheets Databases
The data life cycle deals with the stages that data goes through during its useful life;
data analysis involves following a process to analyze data.
While the data analysis process will drive your projects and help you reach your
business goals, you must understand the life cycle of your data in order to use that
process. To analyze your data well, you need to have a thorough understanding of it.
Similarly, you can collect all the data you want, but the data is only useful to you if
you have a plan for analyzing it.
The Plan and Ask phases both involve planning and asking questions, but they
tackle different subjects. The Ask phase in the data analysis process focuses on
big-picture strategic thinking about business goals. However, the Plan phase
focuses on the fundamentals of the project, such as what data you have access to,
what data you need, and where you’re going to get it.
https://sfmagazine.com/post-entry/july-2018-the-data-life-cycle/
Week 4
Spreadsheets
SELECT #2
FROM #1
WHERE #3
The previous represents the suggested order in which we should write SQL queries.
Start big (datasets, tables) and go small (columns, specific conditions).
To avoid syntax errors and to simplify writing SQL queries, we should always start our
SQL queries by typing:
SELECT
FROM
WHERE
Then fill in each clause starting from FROM, then SELECT, and finally WHERE. We
start from FROM since this defines where our data is located. Then SELECT to
specify which columns we are interested in. And finally WHERE, to add any
additional conditions to filter the data.
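A minimal sketch of the SELECT / FROM / WHERE template above, run through Python's built-in sqlite3 module so it is self-contained. The table customers, its columns, and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ana", "Lima", 31), ("Ben", "Quito", 45), ("Cho", "Lima", 27)],
)

# Thinking order: FROM (where the data lives), then SELECT (which columns),
# then WHERE (which rows to keep). The written order stays SELECT/FROM/WHERE.
query = """
SELECT name, age
FROM customers
WHERE city = 'Lima';
"""
basic_rows = conn.execute(query).fetchall()
print(basic_rows)
```

Only the two rows whose city is 'Lima' come back, and only the two selected columns are included.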
The semicolon (;) is a statement terminator and is part of the American National
Standards Institute (ANSI) SQL-92 standard. It goes at the end of the whole statement,
which in a simple query is usually right after the WHERE clause.
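The LIKE operator: this is a very powerful operator that helps filter data in bulk by
matching text patterns. It is used inside a WHERE clause together with wildcards:
% stands for any sequence of characters and _ stands for exactly one character.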
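As a small illustration of LIKE, here is a sketch using Python's built-in sqlite3 module. The customers table and the email addresses are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("Ana", "ana@example.com"),
        ("Ben", "ben@example.org"),
        ("Cho", "cho@example.com"),
    ],
)

# % matches any run of characters, so this keeps every email ending in .com
like_rows = conn.execute(
    "SELECT name FROM customers WHERE email LIKE '%.com' ORDER BY name;"
).fetchall()
print(like_rows)
```

One pattern filters many rows at once, which is what makes LIKE useful for bulk filtering.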
Adding comments: To add comments to our SQL queries, we have two methods to
choose from:
It is recommended to use the double dash (--) for making comments in SQL, since it
is supported by nearly all SQL dialects.
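A short sketch of comments inside a query, using Python's built-in sqlite3 module. The notes above do not show which second comment style was meant, so the block form /* ... */ used here is an assumption; some dialects also accept # for single-line comments. The orders table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 20.0), (2, 75.5)])

query = """
/* Block comment: it can span
   several lines. */
SELECT id, total   -- double-dash comment: runs to the end of the line
FROM orders
WHERE total > 50;
"""
commented_rows = conn.execute(query).fetchall()
print(commented_rows)
```

The comments are ignored by the database, so the query behaves exactly as it would without them.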
Aliases: Aliases are used to rename a column or table inside a specific SQL query.
This is helpful for keeping track of the query’s purpose without having to add
comments.
[table] AS [Alias_B]
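A sketch of column and table aliases, again through Python's built-in sqlite3 module. The employees table, the alias e, and the names employee and department are hypothetical choices for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, dept_code TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", "FIN"), ("Ben", "OPS")],
)

cursor = conn.execute(
    "SELECT e.first_name AS employee, e.dept_code AS department "
    "FROM employees AS e "
    "WHERE e.dept_code = 'FIN';"
)
# cursor.description holds the column names the query reports back.
alias_columns = [col[0] for col in cursor.description]
alias_rows = cursor.fetchall()
print(alias_columns, alias_rows)
```

The aliases only exist for this query; the underlying table and column names are unchanged.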
Spreadsheets Training:
https://support.google.com/a/users/answer/9282959?visit_id=637361702049227170-1815413770&rd=1
https://support.google.com/a/users/answer/9300022
https://support.microsoft.com/en-us/office/excel-for-windows-training-9bc05390-e94c-46af-a5b3-d7c22f6990bb
SQL Training:
SQL Tutorial from W3SCHOOLS
https://www.w3schools.com/sql/default.asp
https://towardsdatascience.com/sql-cheat-sheet-776f8e3189fa
Week 5
Fairness means ensuring that your analysis doesn't create or reinforce bias.
Case Study #1
To improve the effectiveness of its teaching staff, the administration of a high school
offered the opportunity for all teachers to participate in a workshop. They were not
required to attend; instead, the administration encouraged teachers to sign up. Of
the 43 teachers on staff, 19 chose to take the workshop.
At the end of the academic year, the administration collected data on teacher
performance for all teachers on staff. The data was collected via student survey. In
the survey, students were asked to rank each teacher's effectiveness on a scale of 1
(very poor) to 6 (very good).
The administration compared data on teachers who attended the workshop to data
on teachers who did not. The comparison revealed that teachers who attended the
workshop had an average score of 4.95, while teachers who did not attend had an
average score of 4.22. The administration concluded that the workshop was a
success.
Answer
The workshop might have been effective, but other explanations for the differences in the ratings
cannot be ruled out. For example, another explanation could be that the staff volunteering for the
workshop were the better, more motivated teachers. This group of teachers would be rated
higher whether or not the workshop was effective.
It’s also notable that there is no direct connection between student survey responses and
workshop attendance. The data analyst could correct this by asking for the teachers to be
selected randomly to participate in the workshop. They could also collect data that measures
something more directly related to workshop attendance, such as the success of a technique the
teachers learned in that workshop.
Case Study #2
An automotive company tests the driving capabilities of its self-driving car prototype.
They carry out the tests on various types of roadways—specifically, a race track, trail
track, and dirt road.
The researchers only test the prototype during the daytime. They collect two types of
data: sensor data from the car during the drives and video data of the drives from
cameras on the car.
They review the data after the initial tests. The results illustrate that the new
self-driving car meets the performance standards across each of the roadways. As a
result, the car can progress to the next phase of testing, which will include driving in
various weather conditions.
Answer
This case study shows an
unfair practice. While the researchers test the prototype on three different tracks, they only
conduct tests during the day.
Conditions on each track may be very different during the day and night and this could change
the results significantly. The data analyst should correct this by asking the test team to add in
nighttime testing to get a full perspective of how the prototype performs at any time of the day on
the tracks.
Case Study #3
An amusement park plans to add new rides to their property. First, they need to
determine what kinds of new rides visitors want the park to build. In order to
understand their visitors’ interests, the park develops a survey.
They decide to distribute the survey near the roller coasters because the lines are
long enough that visitors will have time to answer all of the questions. After collecting
this survey data, they find that most of the respondents want more roller coasters at
the park. They conclude that they should add more roller coasters, as most of their
visitors prefer them.
Answer
This case study
contains an unfair practice. While the decision to distribute surveys in places where
visitors would have time to respond makes sense, it accidentally introduces sampling
bias.
The only respondents to the survey are people waiting in line for the roller coasters.
This may unfairly bias survey results, because respondents might prefer roller
coasters. A data analyst could reduce sampling bias by distributing the survey at the
entrance and exit of the amusement park. This would avoid targeting roller coaster
fans and provide results from the park’s general audience.
Think first about your interests. There are a lot of fields where you can work as a data analyst,
but if you combine your passion with your skills you’ll find something you will love to do.
Potential employers will want to know why you're interested in their company, and how you can
address their needs, so if you can speak about your motivation to work in data analytics during
interviews, you'll make yourself stand out in a great way.
CHAPTER 2:
Week 1
Step 1: Ask
It’s impossible to solve a problem if you don’t know what it is. These are some things to
consider:
Step 2: Prepare
You will decide what data you need to collect in order to answer your questions and how to
organize it so that it is useful. You might use your business task to decide:
Step 3: Process
Clean data is the best data and you will need to clean up your data to get rid of any possible
errors, inaccuracies, or inconsistencies. This might mean:
Step 4: Analyze
You will want to think analytically about your data. At this stage, you might sort and format your
data to make it easier to:
● Perform calculations
● Combine data from multiple sources
● Create tables with your results
Step 5: Share
Everyone shares their results differently, so be sure to summarize your results with clear and
enticing visuals of your analysis using data visualization tools like graphs or dashboards. This is your
chance to show the stakeholders you have solved their problem and how you got there. Sharing
will certainly help your team:
1. How can I make what I present to the stakeholders engaging and easy to understand?
2. What would help me understand this if I were the listener?
Step 6: Act
Now it’s time to act on your data. You will take everything you have learned from your data
analysis and put it to use. This could mean providing your stakeholders with recommendations
based on your findings so they can make data-driven decisions.
1. How can I use the feedback I received during the share phase (step 5) to actually meet
the stakeholder’s needs and expectations?
Structured thinking: The process of recognizing the current problem or situation,
organizing available information, revealing gaps and opportunities, and identifying options.
Making predictions: Using data to make informed decisions about how things may
be in the future.
Spotting something unusual: Identifying data that is different from the norm. (e.g.,
the problem type of spotting something unusual could involve a data analyst
examining why a dataset has a surprising and rare data point.)
Finding patterns: Using historical data about what happened in the past to
understand how likely it is to happen again. (e.g., recognizing broader concepts and
trends from categorized data.)
Highly effective questions are SMART questions: Specific, Measurable, Action-oriented,
Relevant, and Time-bound.
Here's an example that breaks down the thought process of turning a problem
question into one or more SMART questions using the SMART method: What
features do people look for when buying a new car?
Questions should be open-ended. This is the best way to get responses that will
help you accurately qualify or disqualify potential solutions to your specific problem.
So, based on the thought process, possible SMART questions might be:
● On a scale of 1-10 (with 10 being the most important) how important is your
car having four-wheel drive?
● What are the top five features you would like to see in a car package?
● What features, if included with four-wheel drive, would make you more
inclined to buy the car?
● How much more would you pay for a car with four-wheel drive?
● Has four-wheel drive become more or less popular in the last three years?
The first question is giving you a bias on how to answer it. A better question might
be, “What is your opinion of this product?”.
Now, if your problem is actually focused on pricing, you could ask a question like
“What price (or price range) would make you consider purchasing this
product?” This question will provide a lot of different measurable responses.
Closed-ended questions: questions that ask for a one-word or brief response only.
This question can be answered with a mere yes or no. It does not help collect
useful data.
A better question might be, “What did you learn about customer experience from
the trial?” This encourages people to provide more detail besides “It went well.”
This question does not provide us with context. It is not possible to establish whether
the question is asking about a comparison between an old tool and a new one, or
about the general performance of the said tool.
A better inquiry might be, “When it comes to data entry, is the new tool faster,
slower, or about the same as the old tool? If faster, how much time is saved? If
slower, how much time is lost?”
Quantitative data is a specific and objective measure, such as a number, quantity or
range.
Dashboards monitor live, incoming data from multiple datasets and organize the
information into one central location. Reports are static collections of data.
Often, businesses will tailor a dashboard for a specific purpose. The three most
common categories are:
● Strategic: focuses on long term goals and strategies at the highest level of
metrics
● Operational: short-term performance tracking and intermediate goals
● Analytical: consists of the datasets and the mathematics used in these sets
Strategic dashboards
A wide range of businesses use strategic dashboards when evaluating and aligning
their strategic goals. These dashboards provide information over the longest time
frame—from a single financial quarter to years.
Operational dashboards
This allows businesses to track and maintain their immediate operational processes
in light of their strategic goals. The operational dashboard below focuses on
customer service.
Resolutions are divided between first call resolution (61%) and unresolved calls (9%)
Analytical dashboards
Analytic dashboards contain a vast amount of data used by data analysts. These
dashboards contain the details involved in the usage, analysis, and predictions made
by data scientists.
Certainly the most technical category, analytic dashboards are usually created and
maintained by data science teams and rarely shared with upper management as
they can be very difficult to understand. The analytic dashboard below focuses on
metrics for a company’s financial performance.
Some differences include the time frame described in each dashboard. The
operational dashboard has a timeframe of days and weeks, while the strategic
dashboard displays the entire year. The analytic dashboard skips a specific
timeframe. Instead, it identifies and tracks the various KPIs that may be used to
assess strategic and operational goals.
Dashboards can help companies perform many helpful tasks, such as:
While almost every company can benefit in some way from using a dashboard,
larger companies and companies with a wider range of products or services will
likely benefit more. Companies operating in volatile or swiftly changing markets like
marketing, sales, and tech also tend to gain insights and make data-informed
decisions more quickly.
Small data
Big data
● Volume: The amount of data
● Variety: The different kinds of data
● Velocity: How fast the data can be processed
● Veracity: The quality and reliability of the data
Week 3:
Data analysts use structured thinking to recognize the current situation, organize
information, and identify opportunities.
At this point, try not to confuse Statement of work with Scope of work, which are
both abbreviated as SOW. Although they both are used to define deliverables and a
timeline, they aren't the same and shouldn't be used interchangeably.
A statement of work is a document that clearly identifies the products and services
a vendor or contractor will provide to an organization. It includes objectives,
guidelines, deliverables, schedule, and costs.
As a junior data analyst, it's more typical to be asked to create a scope of work
than a statement of work.
A data analyst asks who, what, when, where, why, and how in order to put
information into context.
Context can turn raw data into meaningful information. It is very important for data
analysts to contextualize their data. This means giving the data perspective by
defining it. To do this, you need to identify:
● Who: The person or organization that created, collected, and/or funded the
data collection
● What: The things in the world that data could have an impact on
● Where: The origin of the data
● When: The time when the data was created or collected
● Why: The motivation behind the creation or collection
● How: The method used to create or collect it
Week 4:
Ning’s report.
The VP of sales provides strategic and operational direction but is less interested in specific
details. Ning prepares questions ahead of time to focus on the key findings that the company
expects from an annual sales report.
Sales team
Members of the sales team have direct interactions with customers and are highly attuned to
how the company performed over the past year. They can provide detailed information on the
types of data that will matter most to the company’s customers.
The data analysts on Ning’s team each have a dataset that they focus on and can help pull the
various types of data that Ning needs to satisfy the other stakeholders. Ning collaborates with
them to complete the report.
Data science managers
The data science managers oversee all of the company’s datasets and can help Ning prioritize
the types of data and analyses required for the annual report. They can also advise on making
an effective presentation.
Communication is key
When a stakeholder comes to you with a request that does not seem ideal,
do not be afraid to say no.
Try to reframe the problem they present to you, and find solutions to it.
Take the example of the HR VP who asked why many new employees did not
complete the training course. He wanted to act right away and just cancel the
training program. But you, as a data analyst, instead of going right into the existing
data and confirming that there is indeed a high non-completion rate for the program,
asked for more time to conduct a survey and ask the new employees directly
why they did not complete the course. This would give additional insights that
could help revamp the training program instead of just canceling it.
Instead of saying, "There's no way I can do that in this time frame," try to reframe it
by saying, "I would be happy to do that, but it will take this amount of time. Let's take
a step back so I can better understand what you'd like to do with the data, and we
can work together to find the best path forward."
Think: There may be some other important things I should consider. I’m going to
look into that.
Say: I’d like to help you reach your goal. Let’s discuss how I can do that. (When
someone asks you to do a task on a tight deadline)
I would be happy to do this project. I will consider the necessary steps and get back
to you soon with a time estimate.
Doing additional research and asking questions are effective ways to determine how
to proceed with a new project.
Discussion is the key to conflict resolution. If you find yourself in the middle of a
conflict, start a conversation so you can each explain your concerns and figure out
the best path forward.
Who is my audience?
What do they already know?
What do they need to know?
And how can I communicate effectively with them?
Additional material to look into:
https://www.thinkwithgoogle.com/future-of-marketing/digital-transformation/crate-and-barrel-digital-customer-experiences/
https://www.thinkwithgoogle.com/marketing-strategies/data-and-measurement/pepsi-digital-transformation/
https://www.inc.com/magazine/201809/jason-fried/illusion-agreement-team-project.html
http://www.kaushik.net/
CHAPTER 3:
Week 1
Content:
1. Understanding data types and structures
2. Understanding bias, credibility, privacy, ethics, and access
3. Databases: Where data lives
4. Organizing and protecting your data
5. Engaging in the data community (optional)
Decide if you will collect the data using your own resources or receive (and possibly
purchase) it from another party. Data that you collect yourself is called first-party
data.
Data sources
If you don’t collect the data using your own resources, you might get data from
second-party or third-party data providers. Second-party data is collected directly by
another group and then sold. Third-party data is sold by a provider that didn’t collect
the data themselves. Third-party data might come from a number of different
sources.
Datasets can show a lot of interesting information. But be sure to choose data that
can actually help solve your problem question. For example, if you are analyzing
trends over time, make sure you use time series data — in other words, data that
includes dates.
If you are collecting your own data, make reasonable decisions about sample size. A
random sample from existing data might be fine for some projects. Other projects
might need more strategic data collection to focus on certain criteria. Each project
has its own needs.
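As a rough sketch of drawing a random sample from existing data, the snippet below uses Python's standard random module. The population of 1,000 record IDs, the sample size of 50, and the seed are all arbitrary assumptions for illustration; a real project would choose the sample size based on its own needs.

```python
import random

# Hypothetical existing data: 1,000 record IDs.
population = list(range(1, 1001))

rng = random.Random(42)  # fixed seed so the draw is repeatable
sample = rng.sample(population, k=50)  # 50 distinct records, each equally likely
print(len(sample), sample[:5])
```

Sampling without replacement like this gives every record the same chance of being picked, which helps keep the sample representative of the population.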
Time frame
If you are collecting your own data, decide how long you will need to collect it,
especially if you are tracking trends over a long period of time. If you need an
immediate answer, you might not have time to collect new data. In this case, you
would need to use historical data that already exists.
Use the flowchart below if data collection relies heavily on how much time you
have:
Some definitions:
In data analytics, a population refers to all possible data values in a certain dataset.
The data-collection process involves deciding what data to use, determining how
much data to collect, selecting the right type of data, determining the time
frame, and choosing data sources.
Data Format Definition Examples
Classification
- Temperature
- Fashion preferences of
young adults
- Population of elephants in
Africa
- Store inventory
My response:
I would say Quick, Draw! doodles are a form of structured data since each doodle is
categorized and quantified.
They are similar to other types of data in the sense that each category could
represent an attribute in a table and each doodle could be a cell in an observation,
whereas they differ in what they describe, which is people's ability to represent a
certain object.
The fact that this data is not organized as columns and rows might make it
unstructured.
Google’s:
The data on Quick, Draw! is organized loosely based on the category, but not
beyond that. Within each category, there is no organization. Unstructured data
also has no established rule about how to compare two different pieces of data. On
the other hand, structured data conforms to organizational rules.
In the example of the elephant: there are no rules that make one picture more
elephant-like than any other. Rules are one way to structure data, as they can act as
a test to help determine if a data point (in this case, an image) should or shouldn’t be
considered a picture of an elephant.
When discussing structured databases, data analysts refer to the data contained in a
row as a record and to the data contained in a column as a field.
Long data is data where each row contains a single data point for a particular item.
Wide data is data where each row contains multiple data points for the particular
items identified in the columns.
Wide data subjects can have data in multiple columns. Long data subjects can have
multiple rows that hold the values of subject attributes.
Wide data is preferred when: creating tables and charts with a few variables about each
subject.
Long data is preferred when: storing a lot of variables about each subject. For example,
60 years' worth of interest rates for each bank.
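To make the wide versus long distinction concrete, here is a small sketch in plain Python that reshapes wide rows into long rows. The bank names, years, and interest rates are invented values.

```python
# Wide data: one row per subject, one column per year (hypothetical values).
wide = [
    {"bank": "A", "2020": 1.5, "2021": 1.7},
    {"bank": "B", "2020": 2.0, "2021": 2.1},
]

# Long data: one row per (subject, year) pair, with a single value each.
long_rows = [
    {"bank": row["bank"], "year": year, "rate": rate}
    for row in wide
    for year, rate in row.items()
    if year != "bank"
]

for r in long_rows:
    print(r)
```

The two wide rows become four long rows: the same information, with each row now holding a single data point.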
Week 2:
Sampling Bias: When a sample is not representative of the whole population
● Sampling Bias
● Observer Bias: The tendency for different people to observe things differently
● Interpretation Bias: The tendency to always interpret ambiguous situations in
a positive or negative way.
● Confirmation Bias: The tendency to search for or interpret information in a
way that confirms pre-existing beliefs. (e.g., following a certain blogger
because the writer shares our beliefs)
● Reliable: Good data sources are reliable, meaning the data is accurate, complete, and
unbiased.
● Original: When collecting data from a second or third party source, we should
always try to validate this data with the original source.
● Comprehensive: To make sure that the data contains all the critical
information to answer our question or find a solution to the problem.
● Current: Usefulness of data decreases as time passes, the best data sources
are current and relevant to the task at hand.
● Cited: It is important that your data source is cited, so you know it is credible.
Who created the dataset, is it part of a credible organization, when was the data last
refreshed? Those are questions to make sure your data is Good Data.
Make sure your data agrees with all these aspects so it will ROCCC!
And what about Bad Data? Well, Bad Data is exactly the opposite of Good Data.
Data ethics refers to well-founded standards of right and wrong that dictate how
data is collected, shared, and used.
Data can only be considered open when it meets all three of these standards.
Week 3:
Databases
Relational databases
A relational database is a database that contains a series of tables that can be
connected to show relationships. Basically, they allow data analysts to organize and
link data based on what the data has in common.
In a non-relational table, you will find all of the possible variables you might be
interested in analyzing all grouped together. This can make it really hard to sort
through. This is one reason why relational databases are so common in data
analysis: they simplify a lot of analysis processes and make data easier to find and
use across an entire database.
Primary key:
Foreign key:
A foreign key is a field within a table that is a primary key in another table. A table
can have only one primary key, but it can have multiple foreign keys. These keys are
what create the relationships between tables in a relational database, which helps
organize and connect data across multiple tables in the database.
A primary key may also be constructed using multiple columns of a table. This type
of primary key is called a composite key. For example, if customer_id and
location_id are the two columns of a composite key for a customer table, the combination
of values assigned to those fields in any given row must be unique within the entire table.
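The key concepts above can be sketched with Python's built-in sqlite3 module. The customer_id and location_id columns come from the example in the notes; the locations table, the name and city columns, the sample rows, and the helper try_insert are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE locations (location_id INTEGER PRIMARY KEY, city TEXT)")
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER,
        location_id INTEGER REFERENCES locations (location_id),  -- foreign key
        name TEXT,
        PRIMARY KEY (customer_id, location_id)                   -- composite key
    )
    """
)
conn.execute("INSERT INTO locations VALUES (1, 'Lima')")
conn.execute("INSERT INTO customers VALUES (10, 1, 'Ana')")


def try_insert(values):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO customers VALUES (?, ?, ?)", values)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_pair_ok = try_insert((10, 1, "Ana again"))  # same (customer_id, location_id)
unknown_location_ok = try_insert((11, 99, "Ben"))     # location 99 does not exist
print(duplicate_pair_ok, unknown_location_ok)
```

Both inserts are rejected: the first repeats an existing composite key, and the second points to a location that is not in the locations table.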
Metadata:
Structural metadata indicates exactly how many collections data lives in. It provides
information about how a piece of data is organized and whether it’s part of one, or
more than one, data collection.
Administrative metadata indicates the technical source and details for a digital
asset. The date and time a photo was taken is an example of administrative
metadata.
● Getting started with MySQL: This is a guide to setting up and using MySQL.
● Getting started with Microsoft SQL Server: This is a tutorial to get started
using SQL Server.
● Getting started with PostgreSQL: This is a tutorial to get started using
PostgreSQL.
● Getting started with SQLite: This is a quick start guide for using SQLite.
Clauses
ORDER BY: Used for sorting query results. Generally applied with DESC or ASC.
LIMIT: Used for limiting the number of results from a query. It is followed by an
integer.
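A minimal sketch combining ORDER BY and LIMIT, run through Python's built-in sqlite3 module. The sales table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("pen", 120), ("ink", 45), ("pad", 300), ("cap", 80)],
)

# Sort from most to fewest units sold, then keep only the first two rows.
top_two = conn.execute(
    "SELECT product, units FROM sales ORDER BY units DESC LIMIT 2;"
).fetchall()
print(top_two)
```

Sorting happens first and the limit is applied afterwards, so this pattern is a common way to ask for a "top N" list.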
Functions:
COUNT: Counts the rows returned by the query (or, when given a column, the non-null
values in that column). The result is returned as an extra field that can be
named using AS followed by the desired name.
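A short sketch of COUNT with AS, using Python's built-in sqlite3 module. The tickets table, its rows, and the result names total_tickets and assigned_tickets are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, agent TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [(1, "Ana"), (2, None), (3, "Ben"), (4, "Ana")],
)

# COUNT(*) counts every row; COUNT(agent) skips rows where agent is NULL.
cursor = conn.execute(
    "SELECT COUNT(*) AS total_tickets, COUNT(agent) AS assigned_tickets FROM tickets;"
)
count_columns = [col[0] for col in cursor.description]
count_row = cursor.fetchone()
print(count_columns, count_row)
```

The ticket with no agent is included in total_tickets but not in assigned_tickets, which shows the difference between counting rows and counting values.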
https://dataedo.com/blog/basic-data-modeling-techniques
Learn about who pioneered Boolean logic in this historical article: Origins of
Boolean Algebra in the Logic of Classes.
Find more information about using AND, OR, and NOT from these tips for
searching with Boolean operators.
https://www.thedataschool.co.uk/anna-prosvetova/web-scraping-made-easy-import-html-tables-or-lists-using-google-sheets-and-excel
Sites and resources for open data
Luckily for data analysts, there are lots of trustworthy sites and resources
available for open data. It is important to remember that even reputable data
needs to be constantly evaluated, but these websites are a useful starting
point:
Google Sheets:
Find & Replace: Works for finding data in columns and replacing it with another equivalent
code or number assigned to it.
To keep a header row at the top of a spreadsheet, highlight the row and select freeze
from the View menu.
Excel:
Regex tutorial (if you don’t know what regular expressions are)