
CHAPTER 1:

Week 1

Data Analysis Fundamentals

Data Analytics: The umbrella term that contains everything else (e.g., data analysis, the data
ecosystem, etc.)

Data Analysis: To prepare, process, and transform raw data to draw conclusions, make
predictions, and drive informed decision-making.

STEPS FOR THE DATA ANALYSIS PROCESS:

1. Ask questions and define the problem.
2. Prepare data by collecting and storing the information.
3. Process data by cleaning and checking the information.
4. Analyze data to find patterns, relationships, and trends.
5. Share data with your audience.
6. Act on the data and use the analysis results.

QUESTIONS TO ASK WHILE TACKLING A PROJECT:

Analysts often ask, “How do I define success for this project?”

In addition, try asking yourself these questions about a project to help find the perfect
balance:

● What kind of results are needed?
● Who will be informed?
● Am I answering the question being asked?
● How quickly does a decision need to be made?

SUBJECT-MATTER EXPERTS are very familiar with the business problem and can look
at the results of data analysis to validate the choices being made. Including insights
from people who are familiar with the business problem is an example of data-driven
decision-making.

Additional material to look into:

4 Examples of business analytics in action: ✅


https://online.hbs.edu/blog/post/business-analytics-examples
Ebook: Data Science & Big Data Analytics.
https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781119183686

SAS iterative cycle ✅


https://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/manage-analytical-life-cycle-continuous-innovation-106179.pdf

Data Analytics Project Life cycle


http://pingax.com/understanding-data-analytics-project-life-cycle/

Big Data Adoption and Planning considerations

https://www.informit.com/articles/article.aspx?p=2473128&seqNum=11&ranMID=24808

Week 2

Analytical skills: the qualities and characteristics associated with solving problems using facts

Curiosity: wanting to learn something

Understanding context: The analytical skill that has to do with how you group
things into categories

Technical Mindset: The analytical skill that involves breaking processes down into
smaller steps and working with them in an orderly, logical way

Data design: the analytical skill that involves how you organize information

Data strategy: The analytical skill that involves the management of the people,
processes, and tools used in data analysis

Asking Questions Techniques & Processes:

5 Why’s: The Five Whys process is used to reveal the root cause of a problem
through the answer to the fifth question.

Gap Analysis: Lets you compare where you currently are in a process with where
you want to be in the future.

What did we not consider before? - Another good question to ask as a data
analyst for problem solving.
I am always thinking of ways of automating tasks I do on a daily basis, and for that purpose I
often tend to divide those tasks into subtasks where I spot possibilities for automation, or
optimization. I can benefit from that habit on my way to becoming a data analyst.

I also like to play puzzle games like sudoku, which is basically about deciphering patterns,
and that relates closely to understanding context.

In general, I think my current ways of thinking align with those of a data analyst. It is a
matter of continuing to study, polishing those skills, and making a habit of thinking like a pro data analyst.

Tips for giving context to data

● Identifying the motivation behind the collection of a dataset helps clarify the context.
● Gathering extra information about data to understand the broader picture provides
context.
● Adding descriptive headers to columns of data in a spreadsheet provides context about
the data below.

Additional material to look into:

Article about 5 Why’s ✅


https://www.fastcompany.com/1669738/to-get-to-the-root-of-a-hard-problem-just-ask-why-five-times

Week 3

Phases of data analysis

Ask: Define the problem and confirm stakeholder expectations

Prepare: Collect and store data for analysis

Process: Clean and transform data to ensure integrity

Analyze: Use data analysis tools to draw conclusions

Share: Interpret and communicate results to others to make data-driven decisions

Act: Put your insights to work in order to solve the original problem
Data Life Cycle

1. Plan: Decide what kind of data is needed, how it will be managed, and who
will be responsible for it.
2. Capture: Collect or bring in data from a variety of different sources.
3. Manage: Care for and maintain the data. This includes determining how and
where it is stored and the tools used to do so.
4. Analyze: Use the data to solve problems, make decisions, and support
business goals. (This is the step where Data Analysts shine).
5. Archive: Keep relevant data stored for long-term and future reference.
6. Destroy: Remove data from storage and delete any shared copies of the
data.

Warning: Be careful not to mix up or confuse the six stages of the data life cycle
(Plan, Capture, Manage, Analyze, Archive, and Destroy) with the six phases of the
data analysis life cycle (Ask, Prepare, Process, Analyze, Share, and Act). They
shouldn't be used or referred to interchangeably.

To sum it up, although data life cycles vary, one data management principle is
universal. Govern how data is handled so that it is accurate, secure, and available to
meet your organization's needs.

Plan: a business decides what kind of data it needs, how it will be managed, who will
be responsible for it, and the optimal outcomes.

Capture: gathering data from various sources and bringing it into the organization.

Manage: how data is cared for, how and where it’s stored, the tools used to keep it
safe and secure, and the actions taken to make sure it’s maintained properly.

Analyze: using data to make smart decisions and support business goals.

Archive: Archiving involves storing files in a place where they are still available.

Destroy: Erasing or shredding files describes the destroy phase of the data life cycle.

Data Analyst Tools:

Query Language: A query language is a computer programming language that
enables data analysts to retrieve and manipulate data from a database.

Spreadsheets structure data in a meaningful way by letting you:

● Collect, store, organize, and sort information
● Identify patterns and piece the data together in a way that works for each
specific data project
● Create excellent data visualizations, like graphs and charts.

Query languages

● Allow analysts to isolate specific information from one or more databases
● Make it easier for you to learn and understand the requests made to
databases
● Allow analysts to select, create, add, or download data from a database for
analysis

Spreadsheets vs. Databases

● Spreadsheets are software applications; databases are data stores accessed using a query language (e.g., SQL).
● Spreadsheets structure data in a row-and-column format; databases structure data using rules and relationships.
● Spreadsheets organize information in cells; databases organize information in complex collections.
● Spreadsheets provide access to a limited amount of data; databases provide access to huge amounts of data.
● Spreadsheets rely on manual data entry; databases enforce strict and consistent data entry.
● Spreadsheets generally serve one user at a time; databases serve multiple users.
● Spreadsheets are controlled by the user; databases are controlled by a database management system.
Data Life Cycle & Data Analysis:

The data life cycle deals with the stages that data goes through during its useful life;
data analysis involves following a process to analyze data.

While the data analysis process will drive your projects and help you reach your
business goals, you must understand the life cycle of your data in order to use that
process. To analyze your data well, you need to have a thorough understanding of it.
Similarly, you can collect all the data you want, but the data is only useful to you if
you have a plan for analyzing it.

The Plan and Ask phases both involve planning and asking questions, but they
tackle different subjects. The Ask phase in the data analysis process focuses on
big-picture strategic thinking about business goals. However, the Plan phase
focuses on the fundamentals of the project, such as what data you have access to,
what data you need, and where you’re going to get it.

Additional material to look into:

Data life cycle in Financial Institutions

https://sfmagazine.com/post-entry/july-2018-the-data-life-cycle/

Week 4

Data Analysis Tools

Spreadsheets

In a table, an attribute is a characteristic or quality of data used to label a column.

In a data table, a row is called an observation. An observation includes all of the
attributes for what is contained in the row.
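As a minimal sketch in Python, the relationship between attributes (column labels) and observations (rows) looks like this; the tiny table and its values are invented for illustration:

```python
# Attributes label the columns; each tuple is one observation (a row)
# holding a value for every attribute. The data itself is made up.
attributes = ["name", "subject", "years_teaching"]

observations = [
    ("Avery", "Math", 7),
    ("Blake", "History", 3),
]

# Pairing an observation with the attribute labels yields one record.
for obs in observations:
    record = dict(zip(attributes, obs))
    print(record)
```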
SQL:

Basic structure of a SQL query:

SELECT (#2)
[choose the column(s) you want]

FROM (#1)
[from the appropriate table]

WHERE (#3)
[a certain condition is met]

The numbers represent the suggested order in which we should write SQL queries:
start big (datasets, tables) and go small (columns, specific conditions).

To avoid syntax errors and to simplify writing SQL queries, we should always start our
queries by typing:

SELECT

FROM

WHERE

And filling in each clause starting with FROM, then SELECT, and finally WHERE. We
start with FROM since it defines where our data is located. Then SELECT specifies
which columns we are interested in. Finally, WHERE adds any additional condition to
filter the data.
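To make the SELECT/FROM/WHERE structure concrete, here is a small runnable sketch using Python's built-in sqlite3 module; the customers table and its contents are invented for the example:

```python
import sqlite3

# Throwaway in-memory database; table and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ana", "Lima"), ("Bruno", "Quito"), ("Carla", "Lima")],
)

# SELECT: the column(s) we want; FROM: the table; WHERE: the condition.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = 'Lima';"
).fetchall()
print(rows)  # [('Ana',), ('Carla',)]
conn.close()
```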

Considerations About coding in SQL

The semicolon (;) is a statement terminator and is part of the American National
Standards Institute (ANSI) SQL-92 standard. It is placed at the end of the query,
which commonly means after the WHERE clause.

The LIKE clause: a very powerful clause that helps filter data in bulk. It is used as
follows:

Syntax: WHERE [column_name] LIKE [pattern]

Example: WHERE field1 LIKE 'Ch%'


In the case of our example, the LIKE clause would match every value that starts
with the letters “Ch”, like chess, chief, chop, etc. The % symbol is used as a wildcard
that matches zero or more characters. Some databases use an asterisk (*) instead of a
percent sign as the wildcard.
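A quick runnable check of the wildcard behavior, again against a made-up SQLite table. Note that SQLite's LIKE is case-insensitive for ASCII letters by default, which is why the lowercase values match 'Ch%' here; other databases may be case-sensitive:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (field1 TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("chess",), ("chief",), ("banana",), ("Ch",)],
)

# % matches zero or more characters, so the bare value 'Ch' matches too.
matches = conn.execute(
    "SELECT field1 FROM words WHERE field1 LIKE 'Ch%';"
).fetchall()
print(matches)  # [('chess',), ('chief',), ('Ch',)]
conn.close()
```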

Adding comments: To add comments to our SQL queries we can choose between two
methods:

● The double dash: -- this would be interpreted as a comment in SQL
● Placing text between /* this would be interpreted as a comment in SQL */

It is recommended to use the double dash for making comments in SQL because it
is supported by virtually all SQL versions.
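Both comment styles can be seen in one query; this sketch runs against a throwaway SQLite table with invented names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")

query = """
-- double-dash comment: supported by virtually all SQL versions
SELECT x
FROM t  /* block comment:
           can span multiple lines */
WHERE x = 1;
"""
result = conn.execute(query).fetchall()
print(result)  # [(1,)]
conn.close()
```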

Aliases: Aliases are used to rename a column or table inside a specific SQL query.
This is helpful for keeping track of the query’s purpose without having to add
comments.

Aliases are made using the SQL clause AS.

Syntax for AS: [column] AS [Alias_A]
[table] AS [Alias_B]
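The AS clause in action, with a made-up employees table; the aliases exist only for the duration of the query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (full_name TEXT, dept TEXT)")
conn.execute("INSERT INTO employees VALUES ('Dana', 'Sales')")

# [column] AS [alias] and [table] AS [alias] in one query.
cur = conn.execute(
    "SELECT e.full_name AS name FROM employees AS e WHERE e.dept = 'Sales';"
)
col_names = [d[0] for d in cur.description]  # the column comes back as 'name'
rows = cur.fetchall()
print(col_names, rows)  # ['name'] [('Dana',)]
conn.close()
```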

Spreadsheets Training:

Google Sheets Training

https://support.google.com/a/users/answer/9282959?visit_id=637361702049227170-1815413770&rd=1

Google Sheets Cheat Sheet

https://support.google.com/a/users/answer/9300022

Microsoft Excel Training

https://support.microsoft.com/en-us/office/excel-for-windows-training-9bc05390-e94c-46af-a5b3-d7c22f6990bb

SQL Training:
SQL Tutorial from W3SCHOOLS

https://www.w3schools.com/sql/default.asp

SQL Cheat SHEET (More advanced stuff!)

https://towardsdatascience.com/sql-cheat-sheet-776f8e3189fa

Week 5

An issue is a topic or subject to investigate.
A question is designed to discover information.
A problem is an obstacle or complication that needs to be worked out.

A problem is an obstacle or complication to be solved, whereas a question is
designed to discover information. These two things are the foundation of business
tasks.

Fairness means ensuring that your analysis doesn't create or reinforce bias.

Case Study #1

To improve the effectiveness of its teaching staff, the administration of a high school
offered the opportunity for all teachers to participate in a workshop. They were not
required to attend; instead, the administration encouraged teachers to sign up. Of
the 43 teachers on staff, 19 chose to take the workshop.

At the end of the academic year, the administration collected data on teacher
performance for all teachers on staff. The data was collected via student survey. In
the survey, students were asked to rank each teacher's effectiveness on a scale of 1
(very poor) to 6 (very good).

The administration compared data on teachers who attended the workshop to data
on teachers who did not. The comparison revealed that teachers who attended the
workshop had an average score of 4.95, while teachers who did not attend had an
average score of 4.22. The administration concluded that the workshop was a
success.

Consider this scenario:

● What are the examples of fair or unfair practices?
● How could a data analyst correct the unfair practices?
Answer
This is an example of unfair practice. It is tempting to conclude, as the administration did, that
the workshop was a success. However, since the workshop was voluntary and not random, it is
not appropriate to infer a causal relationship between attending the workshop and the higher
rating.

The workshop might have been effective, but other explanations for the differences in the ratings
cannot be ruled out. For example, another explanation could be that the staff volunteering for the
workshop were the better, more motivated teachers. This group of teachers would be rated
higher whether or not the workshop was effective.

It’s also notable that there is no direct connection between student survey responses and
workshop attendance. The data analyst could correct this by asking for the teachers to be
selected randomly to participate in the workshop. They could also collect data that measures
something more directly related to workshop attendance, such as the success of a technique the
teachers learned in that workshop.

Case Study #2

An automotive company tests the driving capabilities of its self-driving car prototype.
They carry out the tests on various types of roadways—specifically, a race track, trail
track, and dirt road.

The researchers only test the prototype during the daytime. They collect two types of
data: sensor data from the car during the drives and video data of the drives from
cameras on the car.

They review the data after the initial tests. The results illustrate that the new
self-driving car meets the performance standards across each of the roadways. As a
result, the car can progress to the next phase of testing, which will include driving in
various weather conditions.

Answer
This case study shows an unfair practice. While the researchers test the prototype on three
different tracks, they only conduct tests during the day.

Conditions on each track may be very different during the day and at night, and this could
change the results significantly. The data analyst should correct this by asking the test team to
add nighttime testing to get a full picture of how the prototype performs at any time of day on
the tracks.
Case Study #3

An amusement park plans to add new rides to their property. First, they need to
determine what kinds of new rides visitors want the park to build. In order to
understand their visitors’ interests, the park develops a survey.

They decide to distribute the survey near the roller coasters because the lines are
long enough that visitors will have time to answer all of the questions. After collecting
this survey data, they find that most of the respondents want more roller coasters at
the park. They conclude that they should add more roller coasters, as most of their
visitors prefer them.

Answer
This case study contains an unfair practice. While the decision to distribute
surveys in places where visitors would have time to respond makes sense, it
accidentally introduces sampling bias.

The only respondents to the survey are people waiting in line for the roller coasters.
This may unfairly bias survey results, because respondents might prefer roller
coasters. A data analyst could reduce sampling bias by distributing the survey at the
entrance and exit of the amusement park. This would avoid targeting roller coaster
fans and provide results from the park’s general audience.
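The sampling-bias point can be illustrated with a toy simulation (all numbers invented): surveying only the coaster line over-samples roller-coaster fans, while a random sample taken at the gate reflects the whole park.

```python
import random

random.seed(42)  # reproducible toy data

# Invented 1-10 preference scores for roller coasters.
fans = [random.randint(7, 10) for _ in range(200)]   # people in the coaster line
others = [random.randint(1, 6) for _ in range(800)]  # everyone else in the park
all_visitors = fans + others

in_line_sample = fans[:100]                       # biased: coaster line only
random_sample = random.sample(all_visitors, 100)  # random sample at the gate

biased_avg = sum(in_line_sample) / len(in_line_sample)
fair_avg = sum(random_sample) / len(random_sample)
print(round(biased_avg, 1), round(fair_avg, 1))  # biased average is higher
```

Because the line-only sample contains only fans, its average is inflated; the random sample recovers the park-wide picture.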

ADVICE ON HOW TO FIND A JOB

Think first about your interests. There are a lot of fields where you can work as a data analyst,
but if you combine your passion with your skills you’ll find something you will love to do.

Potential employers will want to know why you're interested in their company, and how you can
address their needs, so if you can speak about your motivation to work in data analytics during
interviews, you'll make yourself stand out in a great way.
CHAPTER 2:
Week 1

Step 1: Ask

It’s impossible to solve a problem if you don’t know what it is. These are some things to
consider:

● Define the problem you’re trying to solve


● Make sure you fully understand the stakeholder’s expectations
● Focus on the actual problem and avoid any distractions
● Collaborate with stakeholders and keep an open line of communication
● Take a step back and see the whole situation in context

Questions to ask yourself in this step:

1. What are my stakeholders saying their problems are?
2. Now that I’ve identified the issues, how can I help the stakeholders resolve their
questions?

Step 2: Prepare

You will decide what data you need to collect in order to answer your questions and how to
organize it so that it is useful. You might use your business task to decide:

● What metrics to measure
● Where to locate data in your database
● What security measures to create to protect that data

Questions to ask yourself in this step:

1. What do I need to figure out how to solve this problem?
2. What research do I need to do?

Step 3: Process

Clean data is the best data, and you will need to clean up your data to get rid of any possible
errors, inaccuracies, or inconsistencies. This might mean:

● Using spreadsheet functions to find incorrectly entered data
● Using SQL functions to check for extra spaces
● Removing repeated entries
● Checking as much as possible for bias in the data

Questions to ask yourself in this step:

1. What data errors or inaccuracies might get in my way of getting the best possible answer
to the problem I am trying to solve?
2. How can I clean my data so the information I have is more consistent?

Step 4: Analyze

You will want to think analytically about your data. At this stage, you might sort and format your
data to make it easier to:

● Perform calculations
● Combine data from multiple sources
● Create tables with your results

Questions to ask yourself in this step:

1. What story is my data telling me?
2. How will my data help me solve this problem?
3. Who needs my company’s product or service? What type of person is most likely to use
it?

Step 5: Share

Everyone shares their results differently, so be sure to summarize your results with clear and
enticing visuals of your analysis, using tools like graphs or dashboards. This is your
chance to show the stakeholders that you have solved their problem and how you got there. Sharing
will certainly help your team:

● Make better decisions
● Make more informed decisions
● Lead to stronger outcomes
● Successfully communicate your findings

Questions to ask yourself in this step:

1. How can I make what I present to the stakeholders engaging and easy to understand?
2. What would help me understand this if I were the listener?

Step 6: Act

Now it’s time to act on your data. You will take everything you have learned from your data
analysis and put it to use. This could mean providing your stakeholders with recommendations
based on your findings so they can make data-driven decisions.

Questions to ask yourself in this step:

1. How can I use the feedback I received during the share phase (step 5) to actually meet
the stakeholder’s needs and expectations?
Structured thinking: The process of recognizing the current problem or situation,
organizing available information, revealing gaps and opportunities, and identifying options.

Data analysts typically work with six problem types

Making predictions: Using data to make informed decisions about how things may
be in the future.

Categorizing things: Grouping data based on common features. (e.g., a data
analyst identifying and classifying keywords from customer reviews to improve
customer satisfaction.)

Spotting something unusual: Identifying data that is different from the norm. (e.g.,
a data analyst examining why a dataset has a surprising and rare data point.)

Identifying themes: Recognizing broader concepts and trends from categorized
data.

Discovering connections: Identifying similar challenges across different
entities, and using data and insights to find common solutions.

Finding patterns: Using historical data about what happened in the past to
understand how likely it is to happen again. (e.g., a data analyst using sales data
from previous holiday seasons to anticipate demand in the next one.)
Highly effective questions are SMART questions:

In the SMART methodology, measurable questions can be quantified and assessed.
This might include a 1-to-5 scale or questions with yes-or-no responses.

Here's an example that breaks down the thought process of turning a problem
question into one or more SMART questions using the SMART method: What
features do people look for when buying a new car?

● Specific: Does the question focus on a particular car feature?
● Measurable: Does the question include a feature rating system?
● Action-oriented: Does the question influence creation of different or new
feature packages?
● Relevant: Does the question identify which features make or break a potential
car purchase?
● Time-bound: Does the question validate data on the most popular features
from the last three years?

Questions should be open-ended. This is the best way to get responses that will
help you accurately qualify or disqualify potential solutions to your specific problem.
So, based on the thought process, possible SMART questions might be:

● On a scale of 1-10 (with 10 being the most important) how important is your
car having four-wheel drive?
● What are the top five features you would like to see in a car package?
● What features, if included with four-wheel drive, would make you more
inclined to buy the car?
● How much more would you pay for a car with four-wheel drive?
● Has four-wheel drive become more or less popular in the last three years?

Things to avoid when asking questions

Leading questions: questions that only have a particular response

● Example: This product is too expensive, isn’t it?

This question biases the respondent toward a particular answer. A better question might
be, “What is your opinion of this product?”.

Now, if your problem is actually focused on pricing, you could ask a question like
“What price (or price range) would make you consider purchasing this
product?” This question will provide a lot of different measurable responses.

Closed-ended questions: questions that ask for a one-word or brief response only.

● Example: Were you satisfied with the customer trial?

This question can be answered with a mere yes or no, which does not yield
useful data.

A better question might be, “What did you learn about customer experience from
the trial?” This encourages people to provide more detail besides “It went well.”

Vague questions: questions that aren’t specific or don’t provide context

● Example: Does the tool work for you?

This question does not provide context. It is not possible to tell whether
the question is asking about a comparison between an old tool and a new one, or
about the general performance of the tool in question.

A better inquiry might be, “When it comes to data entry, is the new tool faster,
slower, or about the same as the old tool? If faster, how much time is saved? If
slower, how much time is lost?”
Quantitative data is a specific and objective measure, such as a number, quantity or
range.

Qualitative data is a subjective and explanatory measure of a quality or
characteristic.

Data is a collection of facts.

Metrics are quantifiable data types used for measurement.

Dashboards monitor live, incoming data from multiple datasets and organize the
information into one central location. Reports are static collections of data.

A dashboard is a single point of access for managing a business's information. It
allows analysts to pull key information from data in a quick review by visualizing the
data in a way that makes findings easy to understand.

Often, businesses will tailor a dashboard for a specific purpose. The three most
common categories are:

● Strategic: focuses on long term goals and strategies at the highest level of
metrics
● Operational: short-term performance tracking and intermediate goals
● Analytical: consists of the datasets and the mathematics used in these sets

Strategic dashboards

A wide range of businesses use strategic dashboards when evaluating and aligning
their strategic goals. These dashboards provide information over the longest time
frame—from a single financial quarter to years.

They typically contain information that is useful for enterprise-wide decision-making.
Below is an example of a strategic dashboard which focuses on key performance
indicators (KPIs) over a year.
Operational dashboards

Operational dashboards are, arguably, the most common type of dashboard.
Because these dashboards contain information on a time scale of days, weeks, or
months, they can provide performance insight almost in real time.

This allows businesses to track and maintain their immediate operational processes
in light of their strategic goals. The operational dashboard below focuses on
customer service.

(Example dashboard: resolutions divided between first-call resolution, 61%, and unresolved calls, 9%.)
Analytical dashboards

Analytic dashboards contain a vast amount of data used by data analysts. These
dashboards contain the details involved in the usage, analysis, and predictions made
by data scientists.

Certainly the most technical category, analytic dashboards are usually created and
maintained by data science teams and rarely shared with upper management as
they can be very difficult to understand. The analytic dashboard below focuses on
metrics for a company’s financial performance.

DIFFERENCES BETWEEN THE THREE TYPES OF DASHBOARDS:

Some differences include the time frame described in each dashboard. The
operational dashboard has a timeframe of days and weeks, while the strategic
dashboard displays the entire year. The analytic dashboard skips a specific
timeframe. Instead, it identifies and tracks the various KPIs that may be used to
assess strategic and operational goals.

Dashboards can help companies perform many helpful tasks, such as:

● Track historical and current performance.
● Establish both long-term and short-term goals.
● Define key performance indicators or metrics.
● Identify potential issues or points of inefficiency.

While almost every company can benefit in some way from using a dashboard,
larger companies and companies with a wider range of products or services will
likely benefit more. Companies operating in volatile, or swiftly changing markets like
marketing, sales, and tech also tend to more quickly gain insights and make
data-informed decisions.

Small data vs. Big data

● Small data describes a data set made up of specific metrics over a short, well-defined time period; big data describes large, less-specific data sets that cover a long time period.
● Small data is usually organized and analyzed in spreadsheets; big data is usually kept in a database and queried.
● Small data is likely to be used by small and midsize businesses; big data is likely to be used by large organizations.
● Small data is simple to collect, store, manage, sort, and visually represent; big data takes a lot of effort to collect, store, manage, sort, and visually represent.
● Small data is usually already a manageable size for analysis; big data usually needs to be broken into smaller pieces in order to be organized and analyzed effectively for decision-making.

The three (or four) V words for big data


When thinking about the benefits and challenges of big data, it helps to think about
the three Vs: volume, variety, and velocity. Volume describes the amount of data.
Variety describes the different kinds of data. Velocity describes how fast the data can
be processed. Some data analysts also consider a fourth V: veracity. Veracity refers
to the quality and reliability of the data. These are all important considerations
related to processing huge, complex data sets.
● Volume: the amount of data
● Variety: the different kinds of data
● Velocity: how fast the data can be processed
● Veracity: the quality and reliability of the data

Week 3:

Data analysts use structured thinking to recognize the current situation, organize
information, and identify opportunities.

At this point, try not to confuse Statement of work with Scope of work, which are
both abbreviated as SOW. Although they both are used to define deliverables and a
timeline, they aren't the same and shouldn't be used interchangeably.

A statement of work is a document that clearly identifies the products and services
a vendor or contractor will provide to an organization. It includes objectives,
guidelines, deliverables, schedule, and costs.

A scope of work is project-based and sets the expectations and boundaries of a
project. A scope of work may be included in a statement of work to help define
project outcomes.

As a junior data analyst, it's more typical to be asked to create a scope of work
than a statement of work.

A data analyst asks who, what, when, where, why, and how in order to put
information into context.

Context can turn raw data into meaningful information. It is very important for data
analysts to contextualize their data. This means giving the data perspective by
defining it. To do this, you need to identify:

● Who: The person or organization that created, collected, and/or funded the
data collection
● What: The things in the world that data could have an impact on
● Where: The origin of the data
● When: The time when the data was created or collected
● Why: The motivation behind the creation or collection
● How: The method used to create or collect it
Week 4:

Common Examples of Stakeholders

Meet the stakeholders


Here is how each stakeholder informs Ning’s report.

Vice president of sales

The VP of sales provides strategic and operational direction but is less interested in specific
details. Ning prepares questions ahead of time to focus on the key findings that the company
expects from an annual sales report.
Sales team

Members of the sales team have direct interactions with customers and are highly attuned to
how the company performed over the past year. They can provide detailed information on the
types of data that will matter most to the company’s customers.

Data analytics team

The data analysts on Ning’s team each have a dataset that they focus on and can help pull the
various types of data that Ning needs to satisfy the other stakeholders. Ning collaborates with
them to complete the report.
Data science managers

The data science managers oversee all of the company’s datasets and can help Ning prioritize
the types of data and analyses required for the annual report. They can also advise on making
an effective presentation.

Communication is key

When a stakeholder comes to you with a request that does not seem ideal,
do not be afraid to say no.

Try to reframe the problem they present to you, and find solutions to it.
Take the example of the HR VP who asked why many new employees did not
complete the training course. He wanted to act right away and just cancel the
training program. But as a data analyst, instead of going straight to the existing
data and confirming that there is indeed a high non-completion rate for the program,
you asked for more time to conduct a survey and ask the new employees directly
why they did not complete the course. This would give additional insights that
could help revamp the training program instead of just canceling it.

What to say and what to think when conflict arises at work

Instead of saying, "There's no way I can do that in this time frame," try to
reframe it by saying, "I would be happy to do that, but it will take this amount
of time. Let's take a step back so I can better understand what you'd like to do
with the data, and we can work together to find the best path forward."

Think: There may be some other important things I should consider. I’m going to
look into that.

Say: "I'd like to help you reach your goal. Let's discuss how I can do that."
(when someone asks you to do a task on a tight deadline)

Say: "I would be happy to do this project. I will consider the necessary steps
and get back to you soon with a time estimate."

Doing additional research and asking questions are effective ways to determine how
to proceed with a new project.

Discussion is the key to conflict resolution. If you find yourself in the middle of a
conflict, start a conversation so you can each explain your concerns and figure out
the best path forward.

Questions to ask yourself before addressing team members (effective
communication):

● Who is my audience?
● What do they already know?
● What do they need to know?
● How can I communicate effectively with them?
Additional material to look into:

CRATE & BARREL’S DATA STRATEGY

https://www.thinkwithgoogle.com/future-of-marketing/digital-transformation/crate-and
-barrel-digital-customer-experiences/

How PepsiCo is delivering a more personal and valuable experience to customers using data

https://www.thinkwithgoogle.com/marketing-strategies/data-and-measurement/pepsi-
digital-transformation/

The Illusion of Agreement on team projects

https://www.inc.com/magazine/201809/jason-fried/illusion-agreement-team-project.ht
ml

Blog with good insights on data analysis

http://www.kaushik.net/
CHAPTER 3:

Week 1

Content:
1. Understanding data types and structures
2. Understanding bias, credibility, privacy, ethics, and access
3. Databases: Where data lives
4. Organizing and protecting your data
5. Engaging in the data community (optional)

How the data will be collected

Decide if you will collect the data using your own resources or receive (and possibly
purchase it) from another party. Data that you collect yourself is called first-party
data.

Data sources

If you don’t collect the data using your own resources, you might get data from
second-party or third-party data providers. Second-party data is collected directly by
another group and then sold. Third-party data is sold by a provider that didn’t collect
the data themselves. Third-party data might come from a number of different
sources.

Solving your business problem

Datasets can show a lot of interesting information. But be sure to choose data that
can actually help solve your problem question. For example, if you are analyzing
trends over time, make sure you use time series data — in other words, data that
includes dates.

How much data to collect

If you are collecting your own data, make reasonable decisions about sample size. A
random sample from existing data might be fine for some projects. Other projects
might need more strategic data collection to focus on certain criteria. Each project
has its own needs.
Time frame

If you are collecting your own data, decide how long you will need to collect it,
especially if you are tracking trends over a long period of time. If you need an
immediate answer, you might not have time to collect new data. In this case, you
would need to use historical data that already exists.

Use the flowchart below if data collection relies heavily on how much time you
have:

Some definitions:

In data analytics, a population refers to all possible data values in a certain dataset.

A sample is a part of a population that is representative of that population.

The data-collection process involves deciding what data to use, determining how
much data to collect, selecting the right type of data, determining the time
frame, and choosing data sources.
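As a rough sketch of sampling, Python's standard library can draw a simple random sample from a population (the satisfaction scores below are invented for illustration):

```python
import random

# A hypothetical population: satisfaction scores for all 1,000 customers.
random.seed(42)  # fixed seed so the example is reproducible
population = [random.randint(1, 10) for _ in range(1000)]

# A simple random sample of 50 customers drawn from that population.
# A representative sample lets us estimate population statistics.
sample = random.sample(population, 50)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(len(sample))  # 50
print(round(population_mean, 2), round(sample_mean, 2))
```

If the sample is drawn randomly, its mean should land close to the population mean; a sample drawn in a biased way (say, only the first 50 rows) may not.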
Data classifications, with definitions and examples:

Primary data: collected by a researcher from first-hand sources.
● Data from an interview you conducted
● Data from a survey returned from 20 participants
● Data from questionnaires you got back from a group of workers

Secondary data: gathered by other people or from other research.
● Data you bought from a local data analytics firm's customer profiles
● Demographic data collected by a university
● Census data gathered by the federal government

Internal data: data that lives inside a company's own systems.
● Wages of employees across different business units tracked by HR
● Sales data by store location
● Product inventory levels across distribution centers

External data: data that lives outside of a company or organization.
● National average wages for the various positions throughout your organization
● Credit reports for customers of an auto dealership

Continuous data: data that is measured and can have almost any numeric value.
● Height of kids in third grade classes (52.5 inches, 65.7 inches)
● Runtime markers in a video
● Temperature

Discrete data: data that is counted and has a limited number of values.
● Number of people who visit a hospital on a daily basis (10, 20, 200)
● Room's maximum capacity allowed
● Tickets sold in the current month

Qualitative data: subjective and explanatory measures of qualities and
characteristics.
● Exercise activity most enjoyed
● Favorite brands of most loyal customers
● Fashion preferences of young adults

Quantitative data: specific and objective measures of numerical facts.
● Percentage of board-certified doctors who are women
● Population of elephants in Africa
● Distance from Earth to Mars

Nominal data: a type of qualitative data that isn't categorized with a set
order.
● First-time customer, returning customer, regular customer
● New job applicant, existing applicant, internal applicant
● New listing, reduced-price listing, foreclosure

Ordinal data: a type of qualitative data with a set order or scale.
● Movie ratings (number of stars: 1 star, 2 stars, 3 stars)
● Ranked-choice voting selections (1st, 2nd, 3rd)
● Income level (low income, middle income, high income)

Structured data: data organized in a certain format, like rows and columns.
● Expense reports
● Tax returns
● Store inventory

Unstructured data: data that isn't organized in any easily identifiable manner.
● Social media posts
● Emails
● Videos
Doodle Draw! Exercise:

My response:

I would say Quick, Draw! doodles are a form of structured data, since each
doodle is categorized and quantified.

They are similar to other types of data in the sense that each category could
represent an attribute in a table and each doodle could be a cell in an
observation; they differ in what they describe, which is people's ability to
represent a certain object.

The fact that this data is not organized as columns and rows, however, might
make it unstructured.

Google’s:

Great work reinforcing your learning with a thoughtful self-reflection! A good
reflection on this topic would address that these doodles are unstructured data:
data that is either not organized or organized in a highly superficial manner.

The data on Quick, Draw! is organized loosely based on the category, but not
beyond that. Within each category, there is no organization. Unstructured data
also has no established rule about how to compare two different pieces of data. On
the other hand, structured data conforms to organizational rules.

In the example of the elephant: there are no rules that make one picture more
elephant-like than any other. Rules are one way to structure data, as they can act as
a test to help determine if a data point (in this case, an image) should or shouldn’t be
considered a picture of an elephant.

● Structured data: Organized in a certain format, such as rows and columns.
● Unstructured data: Not organized in any easy-to-identify way.

When discussing structured databases, data analysts refer to the data contained
in a row as a record and to the data contained in a column as a field.

Long data is data where each row contains a single data point for a particular item.
Wide data is data where each row contains multiple data points for the particular
items identified in the columns.
Wide data subjects can have data in multiple columns. Long data subjects can have
multiple rows that hold the values of subject attributes.

Wide data is preferred when:
● Creating tables and charts with a few variables about each subject
● Comparing straightforward line graphs

Long data is preferred when:
● Storing a lot of variables about each subject (for example, 60 years' worth
of interest rates for each bank)
● Performing advanced statistical analysis or graphing
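The difference can be sketched in Python with plain dictionaries (the banks and interest rates below are made up for illustration):

```python
# Wide data: one row per bank, one column per year.
wide = [
    {"bank": "Bank A", "2019": 2.1, "2020": 1.5, "2021": 1.8},
    {"bank": "Bank B", "2019": 2.4, "2020": 1.7, "2021": 2.0},
]

# Long data: one row per bank-year combination, with a single value column.
long_rows = [
    {"bank": row["bank"], "year": year, "rate": rate}
    for row in wide
    for year, rate in row.items()
    if year != "bank"  # skip the identifier column itself
]

for r in long_rows:
    print(r)
```

Each wide row becomes several long rows, which is why long data suits storing many variables per subject while wide data suits quick side-by-side charts.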

Week 2:
Sampling Bias: When a sample is not representative of the whole population

Unbiased Sample: A sample that is representative of the whole population.

Types of data bias:

● Sampling bias: when a sample is not representative of the whole population.
● Observer bias: the tendency for different people to observe things
differently.
● Interpretation bias: the tendency to always interpret ambiguous situations in
a positive or negative way.
● Confirmation bias: the tendency to search for or interpret information in a
way that confirms pre-existing beliefs (e.g., following a certain blog because
its writer shares our beliefs).

Good Data is data that "ROCCCs", which means it is:

● Reliable: good data sources are reliable, meaning the data is accurate,
complete, and unbiased.
● Original: when collecting data from a second- or third-party source, always
try to validate the data against the original source.
● Comprehensive: make sure the data contains all the critical information
needed to answer the question or find a solution to the problem.
● Current: the usefulness of data decreases as time passes; the best data
sources are current and relevant to the task at hand.
● Cited: it is important that your data source is cited, so you know it is
credible.

Who created the dataset? Is it part of a credible organization? When was the
data last refreshed? Ask these questions to make sure your data is Good Data.

Make sure your data meets all of these criteria so it will ROCCC!

And what about Bad Data? Well, Bad Data is exactly the opposite of Good Data.

Bad Data is:

● Unreliable: it is inaccurate, incomplete, and/or biased.
● Unoriginal: if you cannot locate the original data source and have to rely
solely on second- or third-party information, that is a red flag.
● Not comprehensive: bad data sources are missing important information needed
to answer the question or find the solution. What's worse, they may contain
human error, too.
● Not current: bad data sources are out of date and irrelevant. Many respected
sources refresh their data regularly, giving you confidence that it's the most
current info available.
● Not cited: if your source hasn't been cited or vetted, it's a no-go.

So, basically Bad Data is data that does not ROCCC.

Data ethics refers to well-founded standards of right and wrong that dictate
how data is collected, shared, and used.

Personally identifiable information (PII) is information that can be used by
itself or with other data to track down a person's identity.

Data anonymization is the process of protecting people's private or sensitive
data by eliminating that kind of information. Typically, data anonymization
involves blanking, hashing, or masking personal information, often by using
fixed-length codes to represent data columns, or hiding data with altered
values.
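A minimal sketch of hashing and masking with Python's standard library. The record, helper names, and 12-character truncation are illustrative choices, not a production scheme — real anonymization typically also requires salting and re-identification analysis:

```python
import hashlib

def hash_pii(value: str) -> str:
    """Replace a PII value with a fixed-length code derived from a hash."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def mask_email(email: str) -> str:
    """Hide most of an email address with altered values."""
    name, _, domain = email.partition("@")
    return name[0] + "***@" + domain

record = {"name": "Ada Lovelace", "email": "ada@example.com", "city": "London"}
anonymized = {
    "name": hash_pii(record["name"]),      # hashed to a fixed-length code
    "email": mask_email(record["email"]),  # masked with altered values
    "city": record["city"],                # non-identifying field kept as-is
}
print(anonymized)
```

The same input always hashes to the same code, so analysts can still join or count records without seeing the underlying identity.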

What is open data?

In data analytics, open data is part of data ethics, which has to do with using
data ethically. Openness refers to free access, usage, and sharing of data. But
for data to be considered open, it has to:

● Be available and accessible to the public as a complete dataset
● Be provided under terms that allow it to be reused and redistributed
● Allow universal participation so that anyone can use, reuse, and redistribute
the data

Data can only be considered open when it meets all three of these standards.

Week 3:

Databases

Relational databases
A relational database is a database that contains a series of tables that can be
connected to show relationships. Basically, they allow data analysts to organize and
link data based on what the data has in common.

In a non-relational table, you will find all of the possible variables you might be
interested in analyzing all grouped together. This can make it really hard to sort
through. This is one reason why relational databases are so common in data
analysis: they simplify a lot of analysis processes and make data easier to find and
use across an entire database.

Primary key:

An identifier that references a column in which each value is unique. In other
words, it's a column of a table that is used to uniquely identify each record
within that table. The value assigned to the primary key in a particular row
must be unique within the entire table. For example, if customer_id is the
primary key for the customer table, no two customers will ever have the same
customer_id.

Foreign key:

A foreign key is a field within a table that is a primary key in another table. A table
can have only one primary key, but it can have multiple foreign keys. These keys are
what create the relationships between tables in a relational database, which helps
organize and connect data across multiple tables in the database.

A primary key may also be constructed using multiple columns of a table. This type
of primary key is called a composite key. For example, if customer_id and
location_id are two columns of a composite key for a customer table, the values
assigned to those fields in any given row must be unique within the entire table.
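These relationships can be sketched with SQLite from Python's standard library. The table and column names are invented for illustration, and note that SQLite only enforces foreign keys when the pragma is enabled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when on

conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,  -- primary key: unique per row
        name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id)  -- foreign key
    )
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # OK: customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (101, 99)")  # no customer 99
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```

The foreign key is what links the two tables: an order can only reference a customer_id that already exists in the customer table.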
Metadata:

Descriptive metadata describes a piece of data or can be used to identify it at
any time. ID numbers are an example of descriptive metadata.

Structural metadata indicates how many collections a piece of data lives in. It
provides information about how the data is organized and whether it's part of
one, or more than one, data collection.

Administrative metadata indicates the technical source and details for a
digital asset. The date and time a photo was taken is an example of
administrative metadata.

Some Databases 101:

● Getting started with MySQL: This is a guide to setting up and using MySQL.
● Getting started with Microsoft SQL Server: This is a tutorial to get started
using SQL Server.
● Getting started with PostgreSQL: This is a tutorial to get started using
PostgreSQL.
● Getting started with SQLite: This is a quick-start guide for using SQLite.

SQL newly learned clauses and commands:

Clauses:

ORDER BY: used for sorting query results. Generally applied with DESC or ASC.

LIMIT: used for limiting the number of results from a query. It is followed by
an integer.

Functions:

COUNT: counts the values matched by the query, returning an extra field that
can be named using AS followed by the desired name.

DISTINCT: returns only the distinct (unique) values in a column.

AS: used alongside COUNT (and other functions) to name the new field.

DESC: used with the ORDER BY clause to sort results in descending order.

ASC: used with the ORDER BY clause to sort results in ascending order.

SUM: used for totaling the values in a column.

Syntax for these functions:

SUM(column_name) AS cool_given_name

COUNT(column_name) AS cool_given_name

COUNT(DISTINCT column_name) AS cool_given_name /* for counting only distinct
values */
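These clauses and functions can be combined in one query. Here is a sketch using SQLite from Python's standard library — the sales table is made up, and GROUP BY (not covered in these notes) is added so the aggregates are computed per store:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100), ("North", 250), ("South", 300), ("East", 300)],
)

# SUM ... AS, COUNT ... AS, ORDER BY ... DESC, and LIMIT used together.
query = """
    SELECT store,
           SUM(amount) AS total_sales,
           COUNT(*) AS num_sales
    FROM sales
    GROUP BY store
    ORDER BY total_sales DESC
    LIMIT 2
"""
for row in conn.execute(query):
    print(row)

# COUNT(DISTINCT ...) returns the number of different values in a column.
distinct_stores = conn.execute(
    "SELECT COUNT(DISTINCT store) AS num_stores FROM sales"
).fetchone()[0]
print(distinct_stores)  # 3
```

ORDER BY ... DESC puts the highest-selling store first, and LIMIT 2 keeps only the top two rows of the result.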

Additional material to look into:

https://dataedo.com/blog/basic-data-modeling-techniques

Levels of data modeling
https://www.1keydata.com/datawarehousing/data-modeling-levels.html

Learn about who pioneered Boolean logic in this historical article: Origins of
Boolean Algebra in the Logic of Classes.

Find more information about using AND, OR, and NOT from these tips for
searching with Boolean operators.

Web Scraping Made Easy

https://www.thedataschool.co.uk/anna-prosvetova/web-scraping-made-easy-import-html-tables-or-lists-using-google-sheets-and-excel
Sites and resources for open data

Luckily for data analysts, there are lots of trustworthy sites and resources
available for open data. It is important to remember that even reputable data
needs to be constantly evaluated, but these websites are a useful starting
point:

1. U.S. government data site: Data.gov is one of the most comprehensive
data sources in the US. This resource gives users the data and tools
that they need to do research, and even helps them develop web and
mobile applications and design data visualizations.
2. U.S. Census Bureau: This open data source offers demographic
information from federal, state, and local governments, and commercial
entities in the U.S. too.
3. Open Data Network: This data source has a really powerful search
engine and advanced filters. Here, you can find data on topics like
finance, public safety, infrastructure, and housing and development.
4. Google Cloud Public Datasets: There is a selection of public datasets
available through the Google Cloud Public Dataset Program that you
can find already loaded into BigQuery.
5. Dataset Search: The Dataset Search is a search engine designed
specifically for data sets; you can use this to search for specific data
sets.
6. Finally, BigQuery hosts 150+ public datasets you can access and use.

Public health datasets


1. Global Health Observatory data: You can search for datasets from this
page or explore featured data collections from the World Health
Organization.
2. The Cancer Imaging Archive (TCIA) dataset: Just like the earlier dataset,
this data is hosted by the Google Cloud Public Datasets and can be
uploaded to BigQuery.
3. 1000 Genomes: This is another dataset from the Google Cloud Public
resources that can be uploaded to BigQuery.

Public climate datasets


1. National Climatic Data Center: The NCDC Quick Links page has a
selection of datasets you can explore.
2. NOAA Public Dataset Gallery: The NOAA Public Dataset Gallery
contains a searchable collection of public datasets.
Public social-political datasets
1. UNICEF State of the World’s Children: This dataset from UNICEF
includes a collection of tables that can be downloaded.
2. CPS Labor Force Statistics: This page contains links to several available
datasets that you can explore.
3. The Stanford Open Policing Project: This dataset can be downloaded as
a .CSV file for your own use.

Importing Data to spreadsheets:

Google Sheets:

Find & Replace: finds data in columns and replaces it with another equivalent
code or number assigned to it.

IMPORTRANGE: imports data from another spreadsheet, for example
=IMPORTRANGE("spreadsheet_url", "Sheet1!A1:B10") (the URL and range here are
placeholders).

IMPORTHTML: imports data from a table or list on a web page, for example
=IMPORTHTML("page_url", "table", 1).

To keep a header row at the top of a spreadsheet, highlight the row and select
Freeze from the View menu.

Excel:

Text Editor Material:

You can begin with these resources:

Search and replace in Sublime Text

Regex tutorial (if you don’t know what regular expressions are)

Regex cheat sheet
