
Introduction to Data Analytics


Module One

Data analytics in daily life

 Examples of data analytics in everyday life:


o Identifying patterns and relationships to make predictions
o Businesses use data to improve processes, identify opportunities and trends,
launch new products, serve customers, and make decisions
o Turning data into insights and sharing with others to make decisions and take
action
 Data analytics can help organizations rethink something they do or point
them in a new direction

The analysts organized those tasks and activities around the six phases of the
data analysis process:

1. Ask
2. Prepare
3. Process
4. Analyze
5. Share
6. Act

The analysts asked questions to define both the issue to be solved and what
would count as a successful result. Next, they prepared by building a timeline and
collecting data with employee surveys that were designed to be inclusive. They
processed the data by cleaning it to make sure it was complete, correct, relevant,
and free of errors and outliers. They analyzed the clean employee survey data.
Then the analysts shared their findings and recommendations with team leaders.
Afterward, leadership acted on the results and focused on improving key areas.

Data Ecosystem

 Data ecosystems are made up of various elements that interact with one
another in order to produce, manage, store, organize, analyze, and share
data.
 These elements include hardware and software tools, and the people who
use them.
 Data can also be found in the cloud, which is a virtual location accessed
over the internet.
 As a data analyst, it is your job to harness the power of the data
ecosystem, find the right information, and provide your team with analysis
that helps them make smart decisions.
 Examples of data ecosystems include retail stores, human resources
departments, and agricultural companies.
 Data scientists create new questions using data, while analysts find
answers to existing questions by creating insights from data sources.
 Data analysis is the collection, transformation, and organization of data in
order to draw conclusions, make predictions, and drive informed decision-
making.
 Data analytics is the science of data and encompasses everything from
the job of managing and using data to the tools and methods that data
workers use each and every day.

How data informs better decisions

 Data can be used in everyday life (e.g. fitness trackers, product reviews)
and in business (e.g. learning about customers, improving processes,
helping employees)
 Data-driven decision-making is using facts to guide business strategy
 First step is to define the business need (e.g. brand recognition, product
improvement, employee satisfaction)
 Data analyst finds data, analyzes it, and uses it to uncover trends,
patterns, and relationships
 Examples of data-driven decision-making: music/movie streaming
services, e-commerce, mobile phones
 Data analysts play a critical role in their companies' success, but data
alone is not as powerful as data combined with human experience,
observation, and intuition
 Subject matter experts can identify inconsistencies, make sense of gray
areas, and validate choices being made

data analysis life cycle—the process of going from data to decision. Data goes
through several phases as it gets created, consumed, tested, processed, and
reused.

The process presented as part of the Google Data Analytics Certificate is one
that will be valuable to you as you keep moving forward in your career:
1. Ask: Business Challenge/Objective/Question
2. Prepare: Data generation, collection, storage, and data management
3. Process: Data cleaning/data integrity
4. Analyze: Data exploration, visualization, and analysis
5. Share: Communicating and interpreting results
6. Act: Putting your insights to work to solve the problem

EMC's data analysis life cycle

EMC Corporation's data analytics life cycle is cyclical with six steps:

1. Discovery
2. Pre-processing data
3. Model planning
4. Model building
5. Communicate results
6. Operationalize

EMC Corporation is now Dell EMC. This model, created by David Dietrich, reflects
the cyclical nature of real-world projects. The phases aren’t static milestones; each
step connects and leads to the next, and eventually repeats. Key questions help
analysts test whether they have accomplished enough to move forward and ensure
that teams have spent enough time on each of the phases and don’t start modeling
before the data is ready.

SAS's iterative life cycle

An iterative life cycle was created by a company called SAS, a leading data
analytics solutions provider. It can be used to produce repeatable, reliable, and
predictive results:

1. Ask
2. Prepare
3. Explore
4. Model
5. Implement
6. Act
7. Evaluate

SAS emphasizes the cyclical nature of its model by visualizing it as an infinity
symbol. Its life cycle has seven steps, many of which we have seen in the
other models, like Ask, Prepare, Model, and Act. But this life cycle is also a little
different; it includes a step after the act phase designed to help analysts evaluate
their solutions and potentially return to the ask phase again.

Project-based data analytics life cycle

A project-based data analytics life cycle has five simple steps:

1. Identifying the problem


2. Designing data requirements
3. Pre-processing data
4. Performing data analysis
5. Visualizing data

This data analytics project life cycle was developed by Vignesh Prajapati. It doesn’t
include the sixth phase, or what we have been referring to as the Act phase.
However, it still covers a lot of the same steps as the life cycles we have already
described. It begins with identifying the problem, preparing and processing data
before analysis, and ends with data visualization.

Big data analytics life cycle

Authors Thomas Erl, Wajid Khattak, and Paul Buhler proposed a big data
analytics life cycle in their book, Big Data Fundamentals: Concepts, Drivers &
Techniques. Their life cycle suggests phases divided into nine steps:

1. Business case evaluation


2. Data identification
3. Data acquisition and filtering
4. Data extraction
5. Data validation and cleaning
6. Data aggregation and representation
7. Data analysis
8. Data visualization
9. Utilization of analysis results

This life cycle appears to have three or four more steps than the previous life cycle
models. But in reality, they have just broken down what we have been referring to as
Prepare and Process into smaller steps. It emphasizes the individual tasks required
for gathering, preparing, and cleaning data before the analysis phase.
Glossary of terms

Data: A collection of facts

Data analysis: The collection, transformation, and organization of data in order to


draw conclusions, make predictions, and drive informed decision-making

Data analyst: Someone who collects, transforms, and organizes data in order to
drive informed decision-making

Data analytics: The science of data

Data-driven decision-making: Using facts to guide business strategy

Data ecosystem: The various elements that interact with one another in order to
produce, manage, store, organize, analyze, and share data

Data science: A field of study that uses raw data to create new ways of modeling
and understanding the unknown

Dataset: A collection of data that can be manipulated or analyzed as one unit

Module 2
Embrace your data analyst skills

Key data analyst skills:

 Analytical skills are qualities and characteristics associated with solving problems using facts
 Five essential points of analytical skills: curiosity, understanding context,
having a technical mindset, data design, and data strategy
 Curiosity is all about wanting to learn something and seeking out new
challenges and experiences
 Understanding context is the condition in which something exists or
happens
 Having a technical mindset involves the ability to break things down into
smaller steps or pieces and work with them in an orderly and logical way
 Data design is how you organize information
 Data strategy is the management of the people, processes, and tools used
in data analysis

 Thinking About Analytical Thinking

All about thinking Analytically:


 Analytical thinking involves identifying and defining a problem and then
solving it by using data in an organized, step-by-step manner
 Five key aspects to analytical thinking: visualization, strategy, problem-
orientation, correlation, and big-picture and detail-oriented thinking
 Visualization is the graphical representation of information and helps data
analysts understand and explain information more effectively
 Strategizing helps data analysts stay focused and on track and improves
the quality and usefulness of the data collected
 Problem-orientation involves keeping the problem top of mind throughout
the entire project
 Correlation is the relationship between two or more pieces of data, but
correlation does not equal causation
 Big-picture thinking is being able to see the big picture as well as the
details, while detail-oriented thinking is about figuring out all of the aspects
that will help execute a plan
 Most people are naturally better at one or the other, but you can develop
the skills to fit both pieces together

 Exploring core analytical skills

 Recap of 5 key aspects of analytical thinking: visualization, strategy, problem-orientation, correlation, and using big-picture and detail-oriented thinking
 Different people naturally use certain types of thinking, but can grow and
develop skills that don't come as easily
 Becoming a versatile thinker is important for data analysis
 Thinking critically to find the right questions to ask, and thinking creatively
to get new and unexpected answers
 Examples of questions data analysts ask: root cause of a problem (Five
Whys), gaps in process (gap analysis), what was not considered before
 Analytical thinking and understanding how to ask the right questions can
have a huge impact on the success of a business.

Other

1. Curiosity: a desire to know more about something, asking the right questions
2. Understanding context: understanding where information fits into the “big picture”
3. Having a technical mindset: breaking big things into smaller steps
4. Data design: thinking about how to organize data and information
5. Data strategy: thinking about the people, processes, and tools used in data analysis

Gap analysis is used to examine and evaluate how a process currently works
with the goal of getting to where you want to be in the future.

 Thinking about outcomes

Using data to drive successful outcomes


 Data-driven decision-making involves using facts to guide business
strategy
 Data analysts can use data to gain valuable insights, verify
theories/assumptions, better understand opportunities/challenges, support
an objective, and help make a plan
 Benefits of data-driven decision-making include improved results, greater
confidence in decisions, and more proactive approach to opportunities
 Five essential analytical skills for data-driven decision-making are
curiosity, understanding context, having a technical mindset, data design,
and data strategy
 Curiosity and context help analysts to make predictions, research
answers, and draw conclusions
 Technical mindset involves using facts to explore gut feelings
 Data design helps to make data easy to access, understand, and make
the most of
 Data strategy incorporates people, processes, and tools to solve a
problem
 Data-driven decision-making is more likely to be successful if everyone is
on board and on the same page

Real world data magic

 Google used data-driven decision-making to determine the value of managers in their organization
 Data was plotted on a graph and quartiles were used to further analyze
the data
 The data revealed that teams with the best managers were significantly
happier, more productive, and more likely to stay at Google
 To identify what makes a great manager, Google launched an awards
program and interviewed managers from the top and bottom quartiles
 Nonprofits and journalists can use data-driven decision-making to make a
meaningful impact and motivate people to work together
 Data analysts use tools and processes to pinpoint problems and ask the
right questions to solve them
 Data analysts can test their understanding of concepts by collecting data
and reviewing videos and readings

Glossary of terms

 Analytical skills: Qualities and characteristics associated with using facts to solve problems
 Analytical thinking: The process of identifying and defining a problem, then
solving it by using data in an organized, step-by-step manner
 Context: The condition in which something exists or happens
 Data: A collection of facts
 Data analysis: The collection, transformation, and organization of data in
order to draw conclusions, make predictions, and drive informed decision-
making
 Data analyst: Someone who collects, transforms, and organizes data in order
to draw conclusions, make predictions, and drive informed decision-making
 Data analytics: The science of data
 Data design: How information is organized
 Data-driven decision-making: Using facts to guide business strategy
 Data ecosystem: The various elements that interact with one another in order
to produce, manage, store, organize, analyze, and share data
 Data science: A field of study that uses raw data to create new ways of
modeling and understanding the unknown
 Data strategy: The management of the people, processes, and tools used in
data analysis
 Data visualization: The graphical representation of data
 Dataset: A collection of data that can be manipulated or analyzed as one unit
 Gap analysis: A method for examining and evaluating the current state of a
process in order to identify opportunities for improvement in the future
 Root cause: The reason why a problem occurs
 Technical mindset: The ability to break things down into smaller steps or
pieces and work with them in an orderly and logical way

Module 3

Stages of data life cycle

 Data has its own life cycle, which consists of the following stages: plan,
capture, manage, analyze, archive and destroy.
 Planning involves deciding what kind of data is needed, how it will be
managed, who will be responsible for it, and the optimal outcomes.
 Capture involves collecting data from a variety of sources and bringing it
into the organization.
 Manage involves caring for the data, storing it, keeping it safe and secure,
and taking actions to maintain it properly.
 Analyze involves using the data to solve problems, make decisions, and
support business goals.
 Archive involves storing data in a place where it is still available, but may
not be used again.
 Destroy involves using secure data erasure software to delete data from
hard drives and shredding paper files.
Variations of the data life cycle

The U.S. Fish and Wildlife Service uses the following data life cycle:

1. Plan
2. Acquire
3. Maintain
4. Access
5. Evaluate
6. Archive

The USGS uses the data life cycle below:

1. Plan
2. Acquire
3. Process
4. Analyze
5. Preserve
6. Publish/Share

Several cross-cutting or overarching activities are also performed during each stage of their life cycle:

1. Describe (metadata and documentation)
2. Manage Quality
3. Backup and Secure

Financial Institutions

1. Capture
2. Qualify
3. Transform
4. Utilize
5. Report
6. Archive
7. Purge

Harvard Business School

1. Generation
2. Collection
3. Processing
4. Storage
5. Management
6. Analysis
7. Visualization
8. Interpretation

Historical data is important to both the U.S. Fish and Wildlife Service and the USGS,
so their data life cycle focuses on archiving and backing up data. Harvard's interests
are in research and teaching, so its data life cycle includes visualization and
interpretation even though these are more often associated with a data analysis life
cycle. The HBS data life cycle also doesn't call out a stage for purging or destroying
data. In contrast, the data life cycle for finance clearly identifies archive and purge
stages. To sum it up, although data life cycles vary, one data management principle
is universal. Govern how data is handled so that it is accurate, secure, and available
to meet your organization's needs.

Six Phases of data analysis

 Data analysis is the process of analyzing data; it is not the same as the data life cycle
 This program is split into six courses based on the steps of data analysis:
ask, prepare, process, analyze, share, and act
 The ask phase involves defining the problem to be solved and
understanding stakeholder expectations
 The prepare phase involves collecting and storing data, and identifying
which kinds of data are most useful
 The process phase involves finding and eliminating errors and
inaccuracies, and cleaning data
 The analyze phase involves using tools to transform and organize data to
draw useful conclusions
 The share phase involves interpreting results and sharing them with
others to help stakeholders make decisions
 The act phase involves putting insights to work to solve the original
business problem and preparing for a job search

Example of data Process

 Ask the right questions to understand the scope of the analysis


 Prepare for the data collection process, considering what type of data is
needed and how it will be collected
 Process the data by cleaning it and running quality assurance checks
 Analyze the data objectively and without bias
 Share the data and insights with stakeholders
 Take action on the data-driven insights, introducing interventions at the
organizational and team level

Data Analyst Tools

 Introduction to the tools data analysts use: spreadsheets, query languages, and visualization tools
 Story of how the speaker learned to use these tools together
 Spreadsheets: digital worksheet to store, organize, and sort data; formulas
and functions to perform calculations and tasks
 Query language: computer programming language to retrieve and
manipulate data from a database; most widely used is SQL
 Data visualization: graphical representation of information; tools like
Tableau and Looker to create visuals that are easy to understand
 Overview of the data life cycle and data analysis process

 Encouragement to review the videos and readings and test out what has been learned

Depending on which phase of the data analysis process you’re in, you will need to use
different tools. For example, if you are focusing on creating complex and eye-catching
visualizations, then the visualization tools we discussed earlier are the best choice. But if you
are focusing on organizing, cleaning, and analyzing data, then you will probably be choosing
between spreadsheets and databases using queries. Spreadsheets and databases both
offer ways to store, manage, and use data. The basic content for both tools is sets of values.
Yet, there are some key differences, too:

Spreadsheets | Databases
Software applications | Data stores accessed using a query language (e.g. SQL)
Structure data in a row and column format | Structure data using rules and relationships
Organize information in cells | Organize information in complex collections
Provide access to a limited amount of data | Provide access to huge amounts of data
Manual data entry | Strict and consistent data entry
Generally one user at a time | Multiple users
Controlled by the user | Controlled by a database management system

Glossary of terms

 Analytical skills: Qualities and characteristics associated with using facts to solve
problems.
 Analytical thinking: The process of identifying and defining a problem, then solving it
by using data in an organized, step-by-step manner
 Data: A collection of facts
 Data analysis: The collection, transformation, and organization of data in order to
draw conclusions, make predictions, and drive informed decision-making
 Data analyst: Someone who collects, transforms, and organizes data in order to
draw conclusions, make predictions, and drive informed decision-making
 Data analytics: The science of data
 Data design: How information is organized
 Data-driven decision-making: Using facts to guide business strategy
 Data ecosystem: The various elements that interact with one another in order to
produce, manage, store, organize, analyze, and share data
 Data science: A field of study that uses raw data to create new ways of modeling
and understanding the unknown
 Data strategy: The management of the people, processes, and tools used in data
analysis
 Data visualization: The graphical representation of data
 Database: A collection of data stored in a computer system
 Data set: A collection of data that can be manipulated or analyzed as one unit
 Formula: A set of instructions used to perform a calculation using the data in a
spreadsheet
 Function: A preset command that automatically performs a specified process or task
using the data in a spreadsheet
 Query: A request for data or information from a database
 Query language: A computer programming language used to communicate with a
database
 Stakeholders: People who invest time and resources into a project and are
interested in its outcome
 Structured Query Language: A computer programming language used to
communicate with a database
 Spreadsheet: A digital worksheet
 SQL: (Refer to Structured Query Language)

What is a query?

A query is a request for data or information from a database. When you query
databases, you use SQL to communicate your question or request. You and the
database can always exchange information as long as you speak the same
language.

Every programming language, including SQL, follows a unique set of guidelines
known as syntax. Syntax is the predetermined structure of a language that
includes all required words, symbols, and punctuation, as well as their proper
placement. As soon as you enter your search criteria using the correct syntax,
the query starts working to pull the data you’ve requested from the target
database.

The syntax of every SQL query is the same:

 Use SELECT to choose the columns you want to return.


 Use FROM to choose the tables where the columns you want are located.
 Use WHERE to filter for certain information.
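
For illustration, here is a minimal query that follows that structure. The table and column names (employee_data, first_name, last_name, role, office_location) are hypothetical stand-ins, not from the course:

    -- Return the names and roles of employees in one office.
    SELECT
      first_name,                    -- columns you want to return
      last_name,
      role
    FROM
      employee_data                  -- table where those columns are located
    WHERE
      office_location = 'Chicago';   -- filter for certain information

The query returns only the rows that satisfy the WHERE condition, and only the columns listed after SELECT.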

Becoming a Data Viz Whiz

 Data visualization is the graphical representation of information


 Florence Nightingale used data visualization to show the number of
preventable deaths during the Crimean War

Take a moment to appreciate all the work you have done in this course. You
identified a question to answer, and systematically worked your way through the
data analysis process to answer that question—just like professional data
analysts do every day!

In reviewing the data analysis process so far, you have already performed a lot of
these steps. Here are some examples to think about before you begin writing
your learning log entry:

 You asked an interesting question and defined a problem to solve through data analysis to answer that question.
 You thought deeply about what data you would need and how you would
collect it in order to prepare for analysis.
 You processed your data by organizing and structuring it in a table and then
moving it to a spreadsheet.
 You analyzed your data by inspecting and scanning it for patterns.
 You shared your first data visualization: a bar chart.
 Finally, after completing all the other steps, you acted: You reflected on your
results, made decisions, and gained insight into your problem--even if that
insight was that you didn't have enough data, or that there were no obvious
patterns in your data.

Module 5
Data analytics helps businesses make better decisions. It all starts with a business task and the
question it's trying to answer. With the skills you'll learn throughout this program, you'll be able to
ask the right questions, plan out the best way to gather and analyze data, and then present it visually to
arm your team so they can make an informed, data-driven decision. That makes you critical to the
success of any business you work for. Data is a powerful tool.

The power of data in business

 Data analysts help companies use data to tackle business tasks


 Business tasks are questions or problems data analysis answers for a business
 Examples of business tasks include Coca-Cola's question about new
products and the City Zoo and Aquarium's problem with staffing
 Data-driven decision making is when facts discovered through data
analysis are used to guide business strategy
 Data helps businesses make better decisions by providing a complete
picture of the problem and its causes
 Data analysts are responsible for gathering, analyzing, and presenting
data in a way that is fair to the people being represented by that data

Understanding Data and Fairness

 Data analysts have a responsibility to make sure their analyses are fair
 Fairness means ensuring that analysis does not create or reinforce bias
 There is no one standard definition of fairness
 Conclusions based on data can be true and unfair
 Example of a company that is notorious for being a boys’ club and wants to
see which employees are doing well
 Data shows that men are the only people succeeding at this company
 The conclusion that they should hire more men is true but unfair because it
ignores other systemic factors that are contributing to this problem
 Ethical data analyst can look at the data and conclude that the company
culture is preventing some employees from succeeding
 Harvard data scientists developing a mobile platform to track patients at
risk of cardiovascular disease in the Stroke Belt
 Team of analysts and social scientists to provide insights on human bias
and social context
 Collected self-reported data in a separate system to avoid potential for
racial bias
 Oversampled non-dominant groups to ensure the model was including them
 Fairness was a top priority every step of the way to collect data and create
conclusions that didn't negatively impact the communities studied

Data Analysts in Different Industries

 Data analysts are needed in many different industries, and it is important to consider if a job is a good fit for you and your career goals.
 Common factors to consider when searching for a job include industry,
tools, location, travel, and culture.
 Different industries use data differently, so they need analysts with
different skills.
 Think about your interests when searching for a job and consider how you
want to use your skills.
 Consider location and travel when searching for a job, such as cost of
living, commute, and remote work.
 Think about your values and what company culture is a good fit for you.
 This course will help you learn the core skills for data analytics in any
setting.
 Upcoming courses will look at the skills successful data analysts have and
how to practice them.

Data Analyst role and job description

As technology continues to advance, being able to collect and analyze the data
from that new technology has become a huge competitive advantage for a lot of
businesses. Everything from websites to social media feeds are filled with
fascinating data that, when analyzed and used correctly, can help inform
business decisions. A company’s ability to thrive now often depends on how well
it can leverage data, apply analytics, and implement new technologies.

Decoding job descriptions

 Business analyst — analyzes data to help businesses improve processes, products, or services
 Data analytics consultant — analyzes the systems and models for using data
 Data engineer — prepares and integrates data from different sources for
analytical use
 Data scientist — uses expert skills in technology and social science to find
trends through data analysis
 Data specialist — organizes or converts data for use in databases or
software systems
 Operations analyst — analyzes data to assess the performance of business
operations and workflows
You might have noticed a common theme across every example. They all have issues to
explore, questions to answer, or problems to solve. It's easy for these things to get mixed up.
Here's a way to keep them straight when we talk about them in data analytics. An issue is a
topic or subject to investigate. A question is designed to discover information, and a problem
is an obstacle or complication that needs to be worked out.
Course 2: Ask Questions to Make Data-Driven Decisions
Module One
Taking Action with Data

The Six Data Analysis Phases

1. Ask
It’s impossible to solve a problem if you don’t know what it is. These are some
things to consider:

 Define the problem you’re trying to solve


 Make sure you fully understand the stakeholder’s expectations
 Focus on the actual problem and avoid any distractions
 Collaborate with stakeholders and keep an open line of communication
 Take a step back and see the whole situation in context

Questions to ask yourself in this step:

 What are my stakeholders saying their problems are?


 Now that I’ve identified the issues, how can I help the stakeholders resolve their
questions?

2. Prepare

You will decide what data you need to collect in order to answer your questions
and how to organize it so that it is useful. You might use your business task to
decide:

 What metrics to measure


 Locate data in your database
 Create security measures to protect that data

Questions to ask yourself in this step:

 What do I need to figure out how to solve this problem?


 What research do I need to do?

3. Process

Clean data is the best data and you will need to clean up your data to get rid of
any possible errors, inaccuracies, or inconsistencies. This might mean:

 Using spreadsheet functions to find incorrectly entered data


 Using SQL functions to check for extra spaces (see the sketch after this list)
 Removing repeated entries
 Checking as much as possible for bias in the data
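
As a minimal sketch of that SQL check (the survey_responses table and its response_id and department columns are invented names for this example), you could flag values with stray spaces and then trim them:

    -- Flag rows where the department value has leading or trailing spaces.
    SELECT response_id, department
    FROM survey_responses
    WHERE department <> TRIM(department);

    -- Remove the extra spaces from those rows.
    UPDATE survey_responses
    SET department = TRIM(department)
    WHERE department <> TRIM(department);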

Questions to ask yourself in this step:

 What data errors or inaccuracies might get in my way of getting the best
possible answer to the problem I am trying to solve?
 How can I clean my data so the information I have is more consistent?

4. Analyze
You will want to think analytically about your data. At this stage, you might sort
and format your data to make it easier to:

 Perform calculations
 Combine data from multiple sources
 Create tables with your results

Questions to ask yourself in this step:

 What story is my data telling me?


 How will my data help me solve this problem?
 Who needs my company’s product or service? What type of person is most
likely to use it?

5. Share

Everyone shares their results differently, so be sure to summarize your results
with clear and enticing visuals of your analysis using tools like graphs or
dashboards. This is your chance to show the stakeholders you have solved their
problem and how you got there. Sharing will certainly help your team:

 Make better decisions

 Make more informed decisions

 Lead to stronger outcomes

 Successfully communicate your findings

Questions to ask yourself in this step:

 How can I make what I present to the stakeholders engaging and easy to
understand?
 What would help me understand this if I were the listener?

6. Act

Now it’s time to act on your data. You will take everything you have learned from
your data analysis and put it to use. This could mean providing your stakeholders
with recommendations based on your findings so they can make data-driven
decisions.

Questions to ask yourself in this step:

 How can I use the feedback I received during the share phase (step 5) to
actually meet the stakeholder’s needs and expectations?
These six steps can help you to break the data analysis process into smaller,
manageable parts, which is called structured thinking. This process involves
four basic activities:

 Recognizing the current problem or situation

 Organizing available information

 Revealing gaps and opportunities

 Identifying your options

When you are starting out in your career as a data analyst, it is normal to feel
pulled in a few different directions with your role and expectations. Following
processes like the ones outlined here and using structured thinking skills can
help get you back on track, fill in any gaps and let you know exactly what you
need.

This describes structured thinking. Structured thinking begins with recognizing


the current problem or situation. Next, information is organized to reveal gaps
and opportunities. Finally, the available options are identified. Structured thinking
is the process of recognizing the current problem or
situation, organizing available information, revealing gaps and opportunities, and
identifying the options. In this process, you address a vague, complex problem
by breaking it down into smaller steps, and then those steps lead you to a logical
solution.

Common Problem Types

1. Making predictions.
This problem type involves using data to make an informed decision about how
things may be in the future.

For example, a hospital system might use remote patient monitoring to predict
health events for chronically ill patients. The patients would take their health
vitals at home every day, and that information combined with data about their
age, risk factors, and other important details could enable the hospital's algorithm
to predict future health problems and even reduce future hospitalizations.

2. Categorizing things.
This means assigning information to different groups or clusters based on
common features.
An example of this problem type is a manufacturer that reviews data on shop
floor employee performance. An analyst may create a group for employees who
are most and least effective at engineering, a group for those most and least
effective at repair and maintenance, a group for those most and least effective at
assembly, and many more groups or clusters.

3. Spotting something unusual.


In this problem type, data analysts identify data that is different from the norm.

An instance of spotting something unusual in the real world is a school system
that has a sudden increase in the number of students registered, maybe as big
as a 30 percent jump in the number of students. A data analyst might look into
this upswing and discover that several new apartment complexes had been built
in the school district earlier that year. They could use this analysis to make sure
the school has enough resources to handle the additional students.

4. Identifying themes
Identifying themes takes categorization as a step further by grouping information
into broader concepts.

Going back to our manufacturer that has just reviewed data on the shop floor
employees. First, these people are grouped by types and tasks. But now a data
analyst could take those categories and group them into the broader concept of
low productivity and high productivity. This would make it possible for the
business to see who is most and least productive, in order to reward top
performers and provide additional support to those workers who need more
training.

5. Discovering connections
It enables data analysts to find similar challenges faced by different entities, and
then combine data and insights to address them.

Here's what I mean: say a scooter company is experiencing an issue with the
wheels it gets from its wheel supplier. That company would have to stop
production until it could get safe, quality wheels back in stock. But meanwhile,
the wheel company is encountering a problem with the rubber it uses to make
wheels; it turns out its rubber supplier could not find the right materials either. If all
of these entities could talk about the problems they're facing and share data
openly, they would find a lot of similar challenges and, better yet, be able to
collaborate to find a solution.
6. Finding patterns.
Data analysts use data to find patterns by using historical data to understand
what happened in the past and is therefore likely to happen again. E-commerce
companies use data to find patterns all the time. Data analysts look at transaction
data to understand customer buying habits at certain points in time throughout
the year. They may find that customers buy more canned goods right before a
hurricane, or they purchase fewer cold-weather accessories like hats and gloves
during warmer months. The e-commerce companies can use these insights to
make sure they stock the right amount of products at these key times.

Making predictions

A company that wants to know the best advertising method to bring in new customers is an
example of a problem requiring analysts to make predictions. Analysts with data on location, type
of media, and number of new customers acquired as a result of past ads can't guarantee future
results, but they can help predict the best placement of advertising to reach the target audience.

Categorizing things

An example of a problem requiring analysts to categorize things is a company's goal to improve
customer satisfaction. Analysts might classify customer service calls based on certain keywords
or scores. This could help identify top-performing customer service representatives or help
correlate certain actions taken with higher customer satisfaction scores.

Spotting something unusual

A company that sells smart watches that help people monitor their health would be interested in
designing their software to spot something unusual. Analysts who have analyzed aggregated
health data can help product developers determine the right algorithms to spot and set off alarms
when certain data doesn't trend normally.

Identifying themes

User experience (UX) designers might rely on analysts to analyze user interaction data. Similar to
problems that require analysts to categorize things, usability improvement projects might require
analysts to identify themes to help prioritize the right product features for improvement. Themes
are most often used to help researchers explore certain aspects of data. In a user study, user
beliefs, practices, and needs are examples of themes.

By now you might be wondering if there is a difference between categorizing things and
identifying themes. The best way to think about it is: categorizing things involves assigning items
to categories; identifying themes takes those categories a step further by grouping them into
broader themes.

Discovering connections

A third-party logistics company working with another company to get shipments delivered to
customers on time is a problem requiring analysts to discover connections. By analyzing the wait
times at shipping hubs, analysts can determine the appropriate schedule changes to increase the
number of on-time deliveries.

Finding patterns

Minimizing downtime caused by machine failure is an example of a problem requiring analysts to
find patterns in data. For example, by analyzing maintenance data, they might discover that most
failures happen if regular maintenance is delayed by more than a 15-day window.

Smart Questions
Effective questions follow the SMART methodology.
That means they're specific, measurable, action-oriented, relevant and time-
bound.
Let's break that down.
Specific questions are simple, significant and focused on a single topic or a few
closely related ideas. This helps us collect information that's relevant to what
we're investigating. If a question is too general, try to narrow it down by focusing
on just one element.
For example, instead of asking a closed-ended question like,
“Are kids getting enough physical activity these days?”
ask, “What percentage of kids achieve the recommended
60 minutes of physical activity at least five days a week?”
That question is much more specific and can give you more useful information.

Now, let's talk about measurable questions.


Measurable questions can be quantified and assessed.
An example of an unmeasurable question would be, why did a recent video go
viral? Instead, you could ask how many times was our video shared on social
channels the first week it was posted? That question is measurable because it
lets us count the shares and arrive at a concrete number.

Okay, now we've come to action-oriented questions.


Action-oriented questions encourage change. You might remember that problem
solving is about seeing the current state and figuring out how to transform it into
the ideal future state. Well, action-oriented questions help you get there. So
rather than asking, how can we get customers to recycle our product packaging?
You could ask, what design features will make our packaging easier to recycle?
This brings you answers you can act on.

All right, let's move on to relevant questions.


Relevant questions matter, are important and have significance to the problem
you're trying to solve. Let's say you're working on a problem related to a
threatened species of frog. And you asked, why does it matter that Pine Barrens
tree frogs started disappearing? This is an irrelevant question because the
answer won't help us find a way to prevent these frogs from going extinct. A more
relevant question would be, what environmental factors changed in
Durham, North Carolina between 1983 and 2004 that could cause Pine Barrens
tree frogs to disappear from the Sandhills Regions? This question would give us
answers we can use to help solve our problem. That's also a great example for
our final point, time-bound questions.
Time-bound questions specify the time to be studied. The time period we want
to study is 1983 to 2004. This limits the range of possibilities and enables the
data analyst to focus on relevant data. Okay, now that you have a general
understanding of SMART questions, there's something else that's very important
to keep in mind when crafting questions, fairness.

We've touched on fairness before, but as a quick reminder, fairness means
ensuring that your questions don't create or reinforce bias. Fairness also means
crafting questions that make sense to everyone. It's important for questions to be
clear and have a straightforward wording that anyone can easily
understand. Unfair questions also can make your job as a data analyst more
difficult. They lead to unreliable feedback and missed opportunities to gain some
truly valuable insights.

Examples of SMART questions

Here's an example that breaks down the thought process of turning a problem
question into one or more SMART questions using the SMART method:

What features do people look for when buying a new car?

 Specific: Does the question focus on a particular car feature?


 Measurable: Does the question include a feature rating system?
 Action-oriented: Does the question influence creation of different or new
feature packages?
 Relevant: Does the question identify which features make or break a
potential car purchase?
 Time-bound: Does the question validate data on the most popular features
from the last three years?
Questions should be open-ended. This is the best way to get responses that will
help you accurately qualify or disqualify potential solutions to your specific
problem. So, based on the thought process, possible SMART questions might
be:

 On a scale of 1-10 (with 10 being the most important) how important is your
car having four-wheel drive?
 What are the top five features you would like to see in a car package?
 What features, if included with four-wheel drive, would make you more
inclined to buy the car?
 How much more would you pay for a car with four-wheel drive?
 Has four-wheel drive become more or less popular in the last three years?

Things to avoid when asking questions

Leading questions: questions that steer respondents toward a particular response

Example: This product is too expensive, isn’t it?


This is a leading question because it suggests an answer as part of the
question. A better question might be, “What is your opinion of this product?”
There are tons of answers to that question, and they could include information
about usability, features, accessories, color, reliability, and popularity, on top of
price. Now, if your problem is actually focused on pricing, you could ask a
question like “What price (or price range) would make you consider purchasing
this product?” This question would provide a lot of different measurable
responses.

Closed-ended questions: questions that ask for a one-word or brief response only

Example: Were you satisfied with the customer trial?


This is a closed-ended question because it doesn’t encourage people to
expand on their answer. It is really easy for them to give one-word responses
that aren’t very informative. A better question might be, “What did you learn
about customer experience from the trial?” This encourages people to provide
more detail besides “It went well.”

Vague questions: questions that aren’t specific or don’t provide context

Example: Does the tool work for you?


This question is too vague because there is no context. Is it about
comparing the new tool to the one it replaces? You just don’t know. A better
inquiry might be, “When it comes to data entry, is the new tool faster, slower, or
about the same as the old tool? If faster, how much time is saved? If slower, how
much time is lost?” These questions give context (data entry) and help frame
responses that are measurable (time).
Module 2

Data-inspired decision-making explores different data sources to find out what
they have in common.

Data is straightforward: facts collected together, values that describe something.
Individual data points become more useful when they're collected and structured,
but they're still somewhat meaningless by themselves. We need to interpret data
to turn it into information.

Quantitative and Qualitative Data


Quantitative data is all about the specific and objective measures of
numerical facts. This can often be the what, how many, and how often about a
problem. In other words, things you can measure.
Qualitative data describes subjective or explanatory measures of qualities
and characteristics or things that can't be measured with numerical data, like
your hair color. Qualitative data is great for helping us answer why questions.
Qualitative data can then give us a more high-level understanding of why
the numbers are the way they are.
This is important because it helps us add context to a problem. As a data
analyst, you'll be using both quantitative and qualitative analysis, depending on
your business task. Reviews are a great example of this. Think about a time you
used reviews to decide whether you wanted to buy something or go somewhere.
These reviews might have told you how many people dislike that thing and why.
Businesses read these reviews too, but they use the data in different ways.

The Big Reveal: Showing Your Findings

Reports and dashboards are both useful for data visualization. But there
are pros and cons for each of them. A report is a static collection of data given to
stakeholders periodically. A dashboard on the other hand, monitors live,
incoming data. Let's talk about reports first. Reports are great for giving
snapshots of high level historical data for an organization. There are some
downsides to keep in mind too. Reports need regular maintenance and aren't
very visually appealing. Because they aren't automatic or dynamic, reports don't
show live, evolving data. For a live reflection of incoming data, you'll want to
design a dashboard. Dashboards are great for a lot of reasons, they give your
team more access to information being recorded, you can interact through data
by playing with filters, and because they're dynamic, they have long-term value.
But dashboards do have some cons too. For one thing, they take a lot of time to
design and can actually be less efficient than reports, if they're not used very
often. If the base table breaks at any point, they need a lot of maintenance to get
back up and running again. Dashboards can sometimes overwhelm people with
information too. If you aren't used to looking through data on a dashboard, you
might get lost in it.
A pivot table is a data summarization tool that is used in data processing.
Pivot tables are used to summarize, sort, re-organize, group, count, total, or
average data stored in a database. It allows its users to transform columns into
rows and rows into columns.
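
Spreadsheet programs build pivot tables through their own menus, but the same kind of summarization can be sketched in SQL with GROUP BY. The store_sales table and its region and sale_amount columns below are hypothetical:

    -- Summarize raw sales rows into a count, total, and average per region,
    -- similar to what a pivot table does with rows of spreadsheet data.
    SELECT
      region,
      COUNT(*)         AS number_of_sales,
      SUM(sale_amount) AS total_sales,
      AVG(sale_amount) AS average_sale
    FROM store_sales
    GROUP BY region
    ORDER BY total_sales DESC;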

Data vs. Metrics

A metric is a single, quantifiable type of data that can be used for
measurement. Think of it this way. Data starts as a collection of raw facts, until
we organize them into individual metrics that represent a single type of data.
Metrics can also be combined into formulas that you can plug your
numerical data into. Metrics usually involve simple math. Revenue, for example,
is the number of sales multiplied by the sales price.
Data contains a lot of raw details about the problem we're exploring. But
we need the right metrics to get the answers we're looking for. Different
industries will use all kinds of metrics to measure things in a data set.
Companies use this metric all the time. ROI, or Return on Investment is
essentially a formula designed using metrics that let a business know how well
an investment is doing. The ROI is made up of two metrics, the net profit over a
period of time and the cost of investment.
By comparing these two metrics, profit and cost of investment, the
company can analyze the data they have to see how well their investment is
doing. This can then help them decide how to invest in the future and which
investments to prioritize.
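
As a rough sketch of how those metrics plug into formulas (the product_sales and investments tables and their columns are invented for the example), the two calculations might look like this:

    -- Revenue: the number of sales multiplied by the sales price.
    SELECT
      product_id,
      units_sold * sale_price AS revenue
    FROM product_sales;

    -- ROI: net profit over a period of time divided by the cost of the investment.
    SELECT
      investment_name,
      net_profit / investment_cost AS roi
    FROM investments;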

The Beauty of Dashboards

Dashboards are powerful visual tools that help you tell your data story. A
dashboard organizes information from multiple datasets into one central location,
offering huge time-savings. Data analysts use dashboards to track, analyze, and
visualize data in order to answer questions and solve problems.
 It is important to remember that changed data is pulled into dashboards
automatically only if the data structure is the same. If the data structure changes,
you have to update the dashboard design before the data can update live.

Creating a dashboard
Here is a process you can follow to create a dashboard:

1. Identify the stakeholders who need to see the data and how they will use it

To get started with this, you need to ask effective questions.

2. Design the dashboard (what should be displayed)

Use these tips to help make your dashboard design clear, easy to follow, and
simple:

 Use a clear header to label the information


 Add short text descriptions to each visualization
 Show the most important information at the top

3. Create mock-ups if desired

This is optional, but a lot of data analysts like to sketch out their
dashboards before creating them.

4. Select the visualizations you will use on the dashboard

You have a lot of options here and it all depends on what data story you
are telling. If you need to show a change of values over time, line charts or bar
graphs might be the best choice. If your goal is to show how each part
contributes to the whole amount being reported, a pie or donut chart is probably
a better choice.

5. Create filters as needed

Filters show certain data while hiding the rest of the data in a dashboard.
This can be a big help to identify patterns while keeping the original data intact. It
is common for data analysts to use and share the same dashboard, but manage
their part of it with a filter.

Dashboards are part of a business journey


Just like how the dashboard on an airplane shows the pilot their flight path, your
dashboard does the same for your stakeholders. It helps them navigate the path
of the project inside the data. If you add clear markers and highlight important
points on your dashboard, users will understand where your data story is
headed. Then, you can work together to make sure the business gets where it
needs to go.

Types of Dashboards

Strategic dashboards

A wide range of businesses use strategic dashboards when evaluating and aligning
their strategic goals. These dashboards provide information over the longest time
frame—from a single financial quarter to years.

They typically contain information that is useful for enterprise-wide decision-making.
Below is an example of a strategic dashboard which focuses on key performance
indicators (KPIs) over a year.
Operational dashboards

Operational dashboards are, arguably, the most common type of dashboard.
Because these dashboards contain information on a time scale of days, weeks, or
months, they can provide performance insight almost in real-time.

This allows businesses to track and maintain their immediate operational processes
in light of their strategic goals. The operational dashboard below focuses on
customer service.

Analytical dashboards

Analytic dashboards contain a vast amount of data used by data analysts. These
dashboards contain the details involved in the usage, analysis, and predictions
made by data scientists.

Certainly the most technical category, analytic dashboards are usually created and
maintained by data science teams and rarely shared with upper management as
they can be very difficult to understand. The analytic dashboard below focuses on
metrics for a company’s financial performance.

A few commonalities in these examples include:

 Dashboards are visualizations: Visualizing data can be enormously useful for understanding and demonstrating what the data really means.
 Dashboards identify metrics: Relevant metrics may help analysts assess
company performance.
Some differences include the time frame described in each dashboard. The
operational dashboard has a time frame of days and weeks, while the strategic
dashboard displays the entire year. The analytic dashboard skips a specific time
frame. Instead, it identifies and tracks the various KPIs that may be used to
assess strategic and operational goals.
Dashboards can help companies perform many helpful tasks, such as:

 Track historical and current performance.


 Establish both long-term and/or short-term goals.
 Define key performance indicators or metrics.
 Identify potential issues or points of inefficiency.
While almost every company can benefit in some way from using a dashboard,
larger companies and companies with a wider range of products or services will
likely benefit more. Companies operating in volatile or swiftly changing markets,
like marketing, sales, and tech, also tend to gain insights and make data-informed
decisions more quickly.

Mathematical Thinking
Small data can be really small. These kinds of data tend to be made up of
datasets concerned with specific metrics over a short, well-defined period of
time.
Big data, on the other hand, has larger, less specific datasets covering a
longer period of time. These usually have to be broken down to be analyzed. Big
data is useful for looking at large-scale questions and problems, and it helps
companies make big decisions.

Big and Small Data


As a data analyst, you will work with data both big and small. Both kinds of
data are valuable, but they play very different roles.
Whether you work with big or small data, you can use it to help
stakeholders improve business processes, answer questions, create new
products, and much more. But there are certain challenges and benefits that
come with big data and the following table explores the differences between big
and small data.

Challenges and Benefits


Here are some challenges you might face when working with big data:

 A lot of organizations deal with data overload and way too much unimportant
or irrelevant information.
 Important data can be hidden deep down with all of the non-important data,
which makes it harder to find and use. This can lead to slower and more
inefficient decision-making time frames.
 The data you need isn’t always easily accessible.
 Current technology tools and solutions still struggle to provide measurable
and reportable data. This can lead to unfair algorithmic bias.
 There are gaps in many big data business solutions.

Now for the good news! Here are some benefits that come with big data:

 When large amounts of data can be stored and analyzed, it can help
companies identify more efficient ways of doing business and save a lot of
time and money.
 Big data helps organizations spot the trends of customer buying patterns and
satisfaction levels, which can help them create new products and solutions
that will make customers happy.
 By analyzing big data, businesses get a much better understanding of current
market conditions, which can help them stay ahead of the competition.
 As in our earlier social media example, big data helps companies keep track
of their online presence—especially feedback, both good and bad, from
customers. This gives them the information they need to improve and protect
their brand.

The three (or four) V words for big data

When thinking about the benefits and challenges of big data, it helps to think
about the three Vs: volume, variety, and velocity. Volume describes the
amount of data. Variety describes the different kinds of data. Velocity describes
how fast the data can be processed. Some data analysts also consider a fourth
V: veracity. Veracity refers to the quality and reliability of the data. These are all
important considerations related to processing huge, complex datasets.

Module 3

Spreadsheets and the Data Life Cycle

To better understand the benefits of using spreadsheets in data analytics, let’s
explore how they relate to each phase of the data life cycle: plan, capture,
manage, analyze, archive, and destroy.
 Plan for the users who will work within a spreadsheet by developing
organizational standards. This can mean formatting your cells, the headings
you choose to highlight, the color scheme, and the way you order your data
points. When you take the time to set these standards, you will improve
communication, ensure consistency, and help people be more efficient with
their time.
 Capture data from the source by connecting spreadsheets to other data
sources, such as an online survey application or a database. This data will
automatically be updated in the spreadsheet. That way, the information is
always as current and accurate as possible.
 Manage different kinds of data with a spreadsheet. This can involve storing,
organizing, filtering, and updating information. Spreadsheets also let you
decide who can access the data, how the information is shared, and how to
keep your data safe and secure.
 Analyze data in a spreadsheet to help make better decisions. Some of the
most common spreadsheet analysis tools include formulas to aggregate data
or create reports, and pivot tables for clear, easy-to-understand visuals.
 Archive any spreadsheet that you don’t use often, but might need to
reference later with built-in tools. This is especially useful if you want to store
historical data before it gets updated.
 Destroy your spreadsheet when you are certain that you will never need it
again, if you have better backup copies, or for legal or security reasons. Keep
in mind, lots of businesses are required to follow certain rules or have
measures in place to make sure data is destroyed properly.
Pro tip: Spotting errors in spreadsheets with
conditional formatting
Conditional formatting can be used to highlight cells a different color
based on their contents. This feature can be extremely helpful when you want to
locate all errors in a large spreadsheet. For example, using conditional
formatting, you can highlight in yellow all cells that contain an error, and then
work to fix them.

Conditional formatting in Microsoft Excel

To set up conditional formatting in Microsoft Excel to highlight all cells in a
spreadsheet that contain errors, do the following:

 Click the gray triangle above row number 1 and to the left of Column A to
select all cells in the spreadsheet.

 From the main menu, click Home, and then click Conditional Formatting to
select Highlight Cell Rules > More Rules.

 For Select a Rule Type, choose Use a formula to determine which cells to
format.

 For Format values where this formula is true, enter =ISERROR(A1).

 Click the Format button, select the Fill tab, select yellow (or any other color),
and then click OK.

 Click OK to close the format rule window.

To remove conditional formatting, click Home and select Conditional Formatting,
and then click Manage Rules. Locate the format rule in the list, click Delete Rule,
and then click OK.

Problem Domain

The specific area of analysis that encompasses every area of activity affecting or
affected by the problem

Structured thinking is the process of recognizing the current problem or
situation, organizing available information, revealing gaps and opportunities, and
identifying the options. In other words, it's a way of being super prepared.
A scope of work is an agreed-upon outline of the work you're going to
perform on a project. For many businesses, this includes things like work details,
schedules, and reports that the client can expect.
Now, as a data analyst, your scope of work will be a bit more technical
and include those basic items we just mentioned, but you'll also focus on things
like data preparation, validation, analysis of quantitative and qualitative datasets,
initial results, and maybe even some visuals to really get the point across.

Deliverables are items or tasks you will complete before you can finish
the project.

Timelines include due dates for deliverables, milestones, and reports.

Milestones are significant tasks you will confirm along your timeline to
help everyone know the project is on track

Reports notify everyone as you finalize deliverables and meet milestones.

SOW - Scope of Work

 Deliverables: What work is being done, and what things are being created
as a result of this project? When the project is complete, what are you
expected to deliver to the stakeholders? Be specific here. Will you collect
data for this project? How much, or for how long?
Avoid vague statements. For example, “fixing traffic problems” doesn’t specify
the scope. This could mean anything from filling in a few potholes to building a
new overpass. Be specific! Use numbers and aim for hard, measurable goals
and objectives. For example: “Identify top 10 issues with traffic patterns within the
city limits, and identify the top 3 solutions that are most cost-effective for reducing
traffic congestion.”

 Milestones: This is closely related to your timeline. What are the major
milestones for progress in your project? How do you know when a given part
of the project is considered complete?
Milestones can be identified by you, by stakeholders, or by other team members
such as the Project Manager. Smaller examples might include incremental steps
in a larger project like “Collect and process 50% of required data (100 survey
responses)”, but may also be larger examples like ”complete initial data analysis
report” or “deliver completed dashboard visualizations and analysis reports to
stakeholders”.

 Timeline: Your timeline will be closely tied to the milestones you create for
your project. The timeline is a way of mapping expectations for how long
each step of the process should take. The timeline should be specific enough
to help all involved decide if a project is on schedule. When will the
deliverables be completed? How long do you expect the project will take to
complete? If all goes as planned, how long do you expect each component of
the project will take? When can we expect to reach each milestone?

 Reports: Good SOWs also set boundaries for how and when you’ll give
status updates to stakeholders. How will you communicate progress with
stakeholders and sponsors, and how often? Will progress be reported
weekly? Monthly? When milestones are completed? What information will
status reports contain?
At a minimum, any SOW should answer all the relevant questions in the above
areas. Note that these areas may differ depending on the project. But at their
core, the SOW document should always serve the same purpose by containing
information that is specific, relevant, and accurate. If something changes in the
project, your SOW should reflect those changes.

What is in and out of scope?

SOWs should also contain information specific to what is and isn’t considered
part of the project. The scope of your project is everything that you are expected
to complete or accomplish, defined to a level of detail that doesn’t leave any
ambiguity or confusion about whether a given task or item is part of the project or
not.

Notice how the previous example about studying traffic congestion defined its
scope as the area within the city limits. This doesn’t leave any room for confusion
— stakeholders need only to refer to a map to tell if a stretch of road or
intersection is part of the project or not. Defining requirements can be trickier
than it sounds, so it’s important to be as specific as possible in these documents,
and to use quantitative statements whenever possible.

For example, assume that you’re assigned to a project that involves studying the
environmental effects of climate change on the coastline of a city: How do you
define what parts of the coastline you are responsible for studying, and which
parts you are not?

In this case, it would be important to define the area you’re expected to study
using GPS locations, or landmarks. Using specific, quantifiable statements will
help ensure that everyone has a clear understanding of what’s expected.

“The best thing you can do for the fairness and accuracy of your data is to make
sure you start with an accurate representation of the population, and collect the
data in the most appropriate and objective way. Then, you'll have the facts you
can pass on to your team.”
Context can turn raw data into meaningful information. It is very important for
data analysts to contextualize their data. This means giving the data perspective
by defining it. To do this, you need to identify:

 Who: The person or organization that created, collected, and/or funded the
data collection
 What: The things in the world that data could have an impact on
 Where: The origin of the data
 When: The time when the data was created or collected
 Why: The motivation behind the creation or collection
 How: The method used to create or collect it


Module 4

Project managers are in charge of planning and executing a project. Part
of the project manager's job is keeping the project on track and overseeing the
progress of the entire team.

Working With Stakeholders

Your data analysis project should answer the business task and create
opportunities for data-driven decision-making. That's why it is so important to
focus on project stakeholders. As a data analyst, it is your responsibility to
understand and manage your stakeholders’ expectations while keeping the
project goals front and center.

You might remember that stakeholders are people who have invested time,
interest, and resources into the projects that you are working on. This can be a
pretty broad group, and your project stakeholders may change from project to
project. But there are three common stakeholder groups that you might find
yourself working with: the executive team, the customer-facing team, and the
data science team.

Let’s get to know more about the different stakeholders and their goals. Then
we'll learn some tips for communicating with them effectively.

1. Executive team

The executive team provides strategic and operational leadership to the
company. They set goals, develop strategy, and make sure that strategy is
executed effectively. The executive team might include vice presidents, the chief
marketing officer, and senior-level professionals who help plan and direct the
company’s work. These stakeholders think about decisions at a very high level
and they are looking for the headline news about your project first. They are less
interested in the details. Time is very limited with them, so make the most of it by
leading your presentations with the answers to their questions. You can keep the
more detailed information handy in your presentation appendix or your project
documentation for them to dig into when they have more time.

For example, you might find yourself working with the vice president of human
resources on an analysis project to understand the rate of employee absences. A
marketing director might look to you for competitive analyses. Part of your job will
be balancing what information they will need to make informed decisions with
their busy schedule.

But you don’t have to tackle that by yourself. Your project manager will be
overseeing the progress of the entire team, and you will be giving them more
regular updates than someone like the vice president of HR. They are able to
give you what you need to move forward on a project, including getting approvals
from the busy executive team. Working closely with your project manager can
help you pinpoint the needs of the executive stakeholders for your project, so
don’t be afraid to ask them for guidance.

2. Customer-facing team

The customer-facing team includes anyone in an organization who has
some level of interaction with customers and potential customers. Typically they
compile information, set expectations, and communicate customer feedback to
other parts of the internal organization. These stakeholders have their own
objectives and may come to you with specific asks. It is important to let the data
tell the story and not be swayed by asks from your stakeholders to find certain
patterns that might not exist.

Let’s say a customer-facing team is working with you to build a new version of a
company’s most popular product. Part of your work might involve collecting and
sharing data about consumers’ buying behavior to help inform product features.
Here, you want to be sure that your analysis and presentation focuses on what is
actually in the data-- not on what your stakeholders hope to find.

3. Data science team

Organizing data within a company takes teamwork. There's a good
chance you'll find yourself working with other data analysts, data scientists, and
data engineers. For example, maybe you team up with a company's data science
team to work on boosting company engagement to lower rates of employee
turnover. In that case, you might look into the data on employee productivity,
while another analyst looks at hiring data. Then you share those findings with the
data scientist on your team, who uses them to predict how new processes could
boost employee productivity and engagement. When you share what you found
in your individual analyses, you uncover the bigger story. A big part of your job
will be collaborating with other data team members to find new angles of the data
to explore.

Working effectively with stakeholders

When you're working with each group of stakeholders- from the executive team,
to the customer-facing team, to the data science team, you'll often have to go
beyond the data. Use the following tips to communicate clearly, establish trust,
and deliver your findings across groups.
Discuss goals. Stakeholder requests are often tied to a bigger project or goal.
When they ask you for something, take the opportunity to learn more. Start a
discussion. Ask about the kind of results the stakeholder wants. Sometimes, a
quick chat about goals can help set expectations and plan the next steps.

Feel empowered to say “no.” Let’s say you are approached by a marketing
director who has a “high-priority” project and needs data to back up their
hypothesis. They ask you to produce the analysis and charts for a presentation
by tomorrow morning. Maybe you realize their hypothesis isn’t fully formed and
you have helpful ideas about a better way to approach the analysis. Or maybe
you realize it will take more time and effort to perform the analysis than
estimated. Whatever the case may be, don’t be afraid to push back when you
need to.

Stakeholders don’t always realize the time and effort that goes into collecting and
analyzing data. They also might not know what they actually need. You can help
stakeholders by asking about their goals and determining whether you can
deliver what they need. If you can’t, have the confidence to say “no,” and provide
a respectful explanation. If there’s an option that would be more helpful, point the
stakeholder toward those resources. If you find that you need to prioritize other
projects first, discuss what you can prioritize and when. When your stakeholders
understand what needs to be done and what can be accomplished in a given
timeline, they will usually be comfortable resetting their expectations. You should
feel empowered to say no-- just remember to give context so others understand
why.

Plan for the unexpected. Before you start a project, make a list of potential
roadblocks. Then, when you discuss project expectations and timeline with your
stakeholders, give yourself some extra time for problem-solving at each stage of
the process.

Know your project. Keep track of your discussions about the project over email
or reports, and be ready to answer questions about how certain aspects are
important for your organization. Get to know how your project connects to the
rest of the company and get involved in providing the most insight possible. If you
have a good understanding about why you are doing an analysis, it can help you
connect your work with other goals and be more effective at solving larger
problems.

Start with words and visuals. It is common for data analysts and stakeholders
to interpret things in different ways while assuming the other is on the same
page. This illusion of agreement has been historically identified as a cause of
projects going back-and-forth a number of times before a direction is finally
nailed down. To help avoid this, start with a description and a quick visual of what
you are trying to convey. Stakeholders have many points of view and may prefer
to absorb information in words or pictures. Work with them to make changes and
improvements from there. The faster everyone agrees, the faster you can
perform the first analysis to test the usefulness of the project, measure the
feedback, learn from the data, and implement changes.

Communicate often. Your stakeholders will want regular updates on your
projects. Share notes about project milestones, setbacks, and changes. Then
use your notes to create a shareable report. Another great resource to use is a
change-log, which is a tool that will be explored further throughout the program.
For now, just know that a change-log is a file containing a chronologically
ordered list of modifications made to a project. Depending on the way you set it
up, stakeholders can even pop in and view updates whenever they want.

Effective Communication

Project follow-up email sample

After the next report is completed, you can also send out a project update email
offering more information.
Limitations with Data

Telling a Story

 Compare the same types of data: Data can get mixed up when you chart it
for visualization. Be sure to compare the same types of data and double
check that any segments in your chart definitely display different metrics.
 Visualize with care: A 0.01% drop in a score can look huge if you zoom in
close enough. To make sure your audience sees the full story clearly, it is a
good idea to set your Y-axis to 0.
 Leave out needless graphs: If a table can show your story at a glance, stick
with the table instead of a pie chart or a graph. Your busy audience will
appreciate the clarity.
 Test for statistical significance: Sometimes two datasets will look
different, but you will need a way to test whether the difference is real and
important. So remember to run statistical tests to see how much confidence
you can place in that difference.
 Pay attention to sample size: Gather lots of data. If a sample size is small,
a few unusual responses can skew the results. If you find that you have too
little data, be careful about using it to form judgments. Look for opportunities
to collect more data, then chart those trends over longer periods.

*Focusing on stakeholder expectations enables data analysts to understand project
goals, improve communication, and build trust.

Action-oriented question: A question whose answers lead to change


Course 3
Prepare Data for Exploration

Module 1
Week 1

Data Collection in our World

Cookies are small files stored on computers that contain
information about users. Cookies can help inform advertisers about your
personal interests and habits based on your online surfing, without personally
identifying you.
First-party data. This is data collected by an individual or group using
their own resources. Collecting first-party data is typically the preferred method
because you know exactly where it came from.
You might also have second-party data, which is data collected by a
group directly from its audience and then sold
Third-party data or data collected from outside sources who did not
collect it directly. This data might have come from a number of different sources
before you investigated it. It might not be as reliable, but that doesn't mean it
can't be useful. You'll just want to make sure you check it for accuracy, bias, and
credibility.
Population refers to all possible data values in a certain dataset.
A sample is a part of a population that is representative of the population
* if you needed an answer immediately, you'd have to use historical data, which
is data that already exists.

Selecting the right data

Following are some data-collection considerations to keep in mind for your
analysis:

How the data will be collected

Decide if you will collect the data using your own resources or receive (and
possibly purchase) it from another party. Data that you collect yourself is
called first-party data.

Data sources

If you don’t collect the data using your own resources, you might get data from
second-party or third-party data providers. Second-party data is collected
directly by another group and then sold. Third-party data is sold by a provider
that didn’t collect the data themselves. Third-party data might come from a
number of different sources.
Solving your business problem

Datasets can show a lot of interesting information. But be sure to choose data
that can actually help solve your problem question. For example, if you are
analyzing trends over time, make sure you use time series data — in other
words, data that includes dates.

How much data to collect

If you are collecting your own data, make reasonable decisions about sample
size. A random sample from existing data might be fine for some projects. Other
projects might need more strategic data collection to focus on certain criteria.
Each project has its own needs.

Time frame

If you are collecting your own data, decide how long you will need to collect it,
especially if you are tracking trends over a long period of time. If you need an
immediate answer, you might not have time to collect new data. In this case, you
would need to use historical data that already exists.

Discover Data Formats


Discrete data isn't limited to dollar amounts. Examples of other discrete
data are stars and points. When partial measurements (half-stars or quarter-
points) aren't allowed, the data is discrete. This is data that's counted
and has a limited number of values.
Continuous data can be measured using a timer, and its value can be
shown as a decimal with several places.
Nominal data is a type of qualitative data that's categorized without a set
order. In other words, this data doesn't have a sequence.
Ordinal data, on the other hand, is a type of qualitative data with a set
order or scale. If you asked a group of people to rank a movie from 1 to 5, some
might rank it as a 2, others a 4, and so on.
Internal data, which is data that lives within a company's own systems.
For example, if a movie studio had compiled all of the data in the spreadsheet
using only their own collection methods, then it would be their internal data. The
great thing about internal data is that it's usually more reliable and easier to
collect.
External data is, you guessed it, data that lives and is generated outside of
an organization.
Structured data is data that's organized in a certain format, such as rows
and columns. Spreadsheets and relational databases are two examples of
software that can store data in a structured way.
Unstructured data. This is data that is not organized in any easily
identifiable manner. Audio and video files are examples of unstructured data
because there's no clear way to identify or organize their content. Unstructured
data might have internal structure, but the data doesn't fit neatly in rows and
columns like structured data.

Data formats in practice


When you think about the word "format," a lot of things might come to mind.
Think of an advertisement for your favorite store. You might find it in the form of a
print ad, a billboard, or even a commercial. The information is presented in the
format that works best for you to take it in. The format of a dataset is a lot like
that, and choosing the right format will help you manage and use your data in the
best way possible.
Data Elements are pieces of information, such as people's names, account
numbers, and addresses. Data models help to keep data consistent and provide
a map of how data is organized. This makes it easier for analysts and other
stakeholders to make sense of their data and use it for business purposes.
The Structure of Data
Data is everywhere and it can be stored in lots of ways. Two general categories
of data are:

 Structured data: Organized in a certain format, such as rows and columns.


 Unstructured data: Not organized in any easy-to-identify way.
For example, when you rate your favorite restaurant online, you're creating
structured data. But when you use Google Earth to check out a satellite image of
a restaurant location, you're using unstructured data.

Here's a refresher on the characteristics of structured and unstructured data:

Structured data

As we described earlier, structured data is organized in a certain format.
This makes it easier to store and query for business needs. If the data is
exported, the structure goes along with the data.

Unstructured data

Unstructured data can’t be organized in any easily identifiable manner.
And there is much more unstructured than structured data in the world. Video
and audio files, text files, social media content, satellite imagery, presentations,
PDF files, open-ended survey responses, and websites all qualify as types of
unstructured data.

The fairness issue


The lack of structure makes unstructured data difficult to search, manage,
and analyze. But recent advancements in artificial intelligence and machine
learning algorithms are beginning to change that. Now, the new challenge facing
data scientists is making sure these tools are inclusive and unbiased. Otherwise,
certain elements of a dataset will be more heavily weighted and/or represented
than others. And as you're learning, an unfair dataset does not accurately
represent the population, causing skewed outcomes, low accuracy levels, and
unreliable analysis.

What is data modeling?

Data modeling is the process of creating diagrams that visually represent
how data is organized and structured. These visual representations are called
data models. You can think of data modeling as a blueprint of a house. At any
point, there might be electricians, carpenters, and plumbers using that blueprint.
Each one of these builders has a different relationship to the blueprint, but they
all need it to understand the overall structure of the house. Data models are
similar; different users might have different data needs, but the data model gives
them an understanding of the structure as a whole.

1. Conceptual data modeling gives a high-level view of the data structure,
such as how data interacts across an organization. For example, a
conceptual data model may be used to define the business requirements for
a new database. A conceptual data model doesn't contain technical details.
2. Logical data modeling focuses on the technical details of a database such
as relationships, attributes, and entities. For example, a logical data model
defines how individual records are uniquely identified in a database. But it
doesn't spell out actual names of database tables. That's the job of a
physical data model.
3. Physical data modeling depicts how a database operates. A physical data
model defines all entities and attributes used; for example, it includes table
names, column names, and data types for the database.
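
For example, a minimal SQL sketch of a physical data model might look like the following. The table, column names, and data types here are hypothetical, chosen only to illustrate the level of detail a physical model specifies:

-- Hypothetical physical model for a single table in a movie database
CREATE TABLE customer (
  customer_id  INTEGER NOT NULL,   -- unique identifier for each customer
  first_name   VARCHAR(50),
  last_name    VARCHAR(50),
  email        VARCHAR(100),
  signup_date  DATE
);

A conceptual model would only note that customers exist and relate to other entities; the physical model spells out the exact table names, column names, and types the database will use.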

Data-modeling techniques

There are a lot of approaches when it comes to developing data models,
but two common methods are the Entity Relationship Diagram (ERD) and the
Unified Modeling Language (UML) diagram. ERDs are a visual way to
understand the relationship between entities in the data model. UML diagrams
are very detailed diagrams that describe the structure of a system by showing the
system's entities, attributes, operations, and their relationships. As a junior data
analyst, you will need to understand that there are different data modeling
techniques, but in practice, you will probably be using your organization’s existing
technique.

Data analysis and data modeling

Data modeling can help you explore the high-level details of your data and
how it is related across the organization’s information systems. Data modeling
sometimes requires data analysis to understand how the data is put together;
that way, you know how to map the data. And finally, data models make it easier
for everyone in your organization to understand and collaborate with you on your
data. This is important for you and everyone on your team!

*Text data type, or a string data type, which is a sequence of characters and
punctuation that contains textual information

*A Boolean data type is a data type with only two possible values: true or false.

Understanding BOOLEAN Logic

In this reading, you will explore the basics of Boolean logic and learn how to use
multiple conditions in a Boolean statement. These conditions are created with
Boolean operators, including AND, OR, and NOT. These operators are similar to
mathematical operators and can be used to create logical statements that filter
your results. Data analysts use Boolean statements to do a wide range of data
analysis tasks, such as creating queries for searches and checking for conditions
when writing programming code.

Imagine you are shopping for shoes, and are considering certain preferences:

 You will buy the shoes only if they are pink and grey
 You will buy the shoes if they are entirely pink or entirely grey, or if they are
pink and grey
 You will buy the shoes if they are grey, but not if they have any pink
Venn diagrams can illustrate these preferences. AND is the center of
the Venn diagram, where two conditions overlap. OR includes either condition.
NOT includes only the part of the Venn diagram that doesn't contain the
exception.

The AND operator

Your condition is “If the color of the shoe has any combination of grey and pink,
you will buy them.” The Boolean statement would break down the logic of that
statement to filter your results by both colors. It would say “IF (Color=”Grey”)
AND (Color=”Pink”) then buy them.” The AND operator lets you stack multiple
conditions.

A simple truth table outlines the Boolean logic at work in this
statement. In the Color is Grey column, there are two pairs of shoes that meet
the color condition. And in the Color is Pink column, there are two pairs that
meet that condition. But in the If Grey AND Pink column, there is only one pair of
shoes that meets both conditions. So, according to the Boolean logic of the
statement, there is only one pair marked true. In other words, there is one pair of
shoes that you can buy.

The OR operator

The OR operator lets you move forward if either one of your two conditions is
met. Your condition is “If the shoes are grey or pink, you will buy them.” The
Boolean statement would be “IF (Color=”Grey”) OR (Color=”Pink”) then buy
them.” Notice that any shoe that meets either the Color is Grey or the Color is
Pink condition is marked as true by the Boolean logic. According to the truth
table for this statement, there are three pairs of shoes that you can buy.

The NOT operator

Finally, the NOT operator lets you filter by subtracting specific conditions from the
results. Your condition is "You will buy any grey shoe except for those with any
traces of pink in them." Your Boolean statement would be “IF (Color="Grey")
AND (Color=NOT “Pink”) then buy them.” Now, all of the grey shoes that aren't
pink are marked true by the Boolean logic for the NOT Pink condition. The pink
shoes are marked false by the Boolean logic for the NOT Pink condition. Only
one pair of shoes is excluded.

The power of multiple conditions

For data analysts, the real power of Boolean logic comes from being able to
combine multiple conditions in a single statement. For example, if you wanted to
filter for shoes that were grey or pink, and waterproof, you could construct a
Boolean statement such as: “IF ((Color = “Grey”) OR (Color = “Pink”)) AND
(Waterproof=“True”).” Notice that you can use parentheses to group your
conditions together.
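
The same logic carries over directly to SQL queries. As a rough sketch (the shoes table and its column names are made up for illustration, not part of the course material), the combined condition above could be written as a WHERE clause:

SELECT shoe_name, color, waterproof
FROM shoes
WHERE (color = 'Grey' OR color = 'Pink')   -- either color is acceptable
  AND waterproof = TRUE;                   -- but the shoe must also be waterproof

The parentheses group the OR conditions so the AND applies to the pair, just as in the Boolean statement above.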

Whether you are doing a search for new shoes or applying this logic to your
database queries, Boolean logic lets you create multiple conditions to filter your
results. And now that you know a little more about how Boolean logic is used,
you can start using it!
In wide data, every data subject has a single row with multiple columns to
hold the values of various attributes of the subject.
In long data, each row is one time point per subject, so
each subject will have data in multiple rows.
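
A small made-up example shows the difference. The same yearly measurements for two subjects can be laid out both ways:

Wide format (one row per subject, one column per year):
subject     2019   2020   2021
Country A   1.8    2.0    2.1
Country B   3.4    3.6    3.9

Long format (one row per subject per time point):
subject     year   value
Country A   2019   1.8
Country A   2020   2.0
Country A   2021   2.1
Country B   2019   3.4
Country B   2020   3.6
Country B   2021   3.9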

Transforming Data

Data transformation is the process of changing the data’s format, structure, or
values. As a data analyst, there is a good chance you will need to transform data
at some point to make it easier for you to analyze it.

Data transformation usually involves:

 Adding, copying, or replicating data


 Deleting fields or records
 Standardizing the names of variables
 Renaming, moving, or combining columns in a database
 Joining one set of data with another
 Saving a file in a different format. For example, saving a spreadsheet as a
comma separated values (CSV) file.
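
A few of these transformations can be sketched in a single SQL query. The orders table and its columns below are hypothetical, used only to show the idea:

SELECT
  order_id AS id,                                   -- renaming a column
  CONCAT(first_name, ' ', last_name) AS customer,   -- combining two columns into one
  CAST(order_date AS DATE) AS order_date            -- standardizing a field's data type
FROM orders;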

Why transform data?

Goals for data transformation might be:

 Data organization: better organized data is easier to use


 Data compatibility: different applications or systems can then use the same
data
 Data migration: data with matching formats can be moved from one system
to another
 Data merging: data with the same organization can be merged together
 Data enhancement: data can be displayed with more detailed fields
 Data comparison: apples-to-apples comparisons of the data can then be
made

Module 2

Bias has evolved to become a preference in favor of or against a
person, group of people, or thing. It can be conscious or subconscious.
Data bias is a type of error that systematically skews results in a certain
direction. Maybe the questions on a survey had a particular slant to influence
answers, or maybe the sample group wasn't truly representative of the population
being studied.
Sampling bias is when a sample isn't representative of the population as
a whole.

Unbiased sampling results in a sample that's representative of the
population being measured.

There are three more types of data bias: observer bias, interpretation bias, and
confirmation bias, and we'll learn how to avoid them.

Let's start with observer bias, which is sometimes referred to as experimenter
bias or research bias.

Interpretation bias can lead to two people seeing or hearing the exact same
thing and interpreting it in a variety of different ways, because they have different
backgrounds and experiences.

Confirmation bias is the tendency to search for or interpret information in a
way that confirms preexisting beliefs.

The four types of data bias we covered, sampling bias, observer bias,
interpretation bias, and confirmation bias, are all unique, but they do have one
thing in common. They each affect the way we collect, and make sense of the
data. Unfortunately, they're also just a small sample, pun intended, of the types
of bias you may encounter in your career as a data analyst.

Identifying Good data Sources

The more high quality data we have, the more confidence we can have in our decisions.
Let's learn how we can go about finding and identifying good data sources.
First things first, we need to learn how to identify them, using a process I like to call ROCCC (R-O-C-C-C):
reliable, original, comprehensive, current, and cited.

Like a good friend, good data sources are reliable. With this data you can trust
that you're getting accurate, complete and unbiased information that's been
vetted and proven fit for use. Okay. Onto O. O is for original. There's a good
chance you'll discover data through a second or third party source. To make sure
you're dealing with good data, be sure to validate it with the original source. Time
for the first C. C is for comprehensive. The best data sources contain all critical
information needed to answer the question or find the solution. Think about it like
this. You wouldn't want to work for a company just because you found one great
online review about it. You'd research every aspect of the organization to make
sure it was the right fit. It's important to do the same for your data analysis. The
next C is for current. The usefulness of data decreases as time passes. If you
wanted to invite all current clients to a business event, you wouldn't use a 10-
year-old client list. The same goes for data. The best data sources are
current and relevant to the task at hand. The last C is for cited. If you've ever told
a friend where you heard that a new movie sequel was in the works, you've cited
a source. Citing makes the information you're providing more credible. When
you're choosing a data source, think about three things.

Introduction to Data Ethics

Data ethics refers to well- founded standards of right and wrong that
dictate how data is collected, shared, and used.

General Data Protection Regulation of the European Union

Ownership. This answers the question who owns data? It isn't the
organization that invested time and money collecting, storing, processing, and
analyzing it. It's individuals who own the raw data they provide, and they have
primary control over its usage, how it's processed and how it's shared.
Transaction transparency, which is the idea that all data processing
activities and algorithms should be completely explainable and understood by the
individual who provides their data. This is in response to concerns over data
bias, which we discussed earlier, is a type of error that systematically skews
results in a certain direction. Biased outcomes can lead to negative
consequences. To avoid them, it's helpful to provide transparent analysis
especially to the people who share their data.
Consent. This is an individual's right to know explicit details about how and
why their data will be used before agreeing to provide it. They should know
answers to questions like why is the data being collected? How will it be
used? How long will it be stored? The best way to give consent is probably a
conversation between the person providing the data and the person requesting
it.
Currency. Individuals should be aware of financial transactions
resulting from the use of their personal data and the scale of these
transactions. If your data is helping to fund a company's efforts, you should know
what those efforts are all about and be given the opportunity to opt out
When talking about data, privacy means preserving a data subject's
information and activity any time a data transaction occurs.
When referring to data, openness refers to free access, usage and sharing
of data. Sometimes we refer to this as open data, but it doesn't mean we
ignore the other aspects of data ethics we covered.
*Interoperability is the ability of data systems and services to openly connect and share
data

What is data anonymization?

You have been learning about the importance of privacy in data analytics. Now, it
is time to talk about data anonymization and what types of data should be
anonymized. Personally identifiable information, or PII, is information that can
be used by itself or with other data to track down a person's identity.

Data anonymization is the process of protecting people's private or sensitive data
by eliminating that kind of information. Typically, data anonymization involves
blanking, hashing, or masking personal information, often by using fixed-length
codes to represent data columns, or hiding data with altered values.
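
As a hedged illustration of hashing and masking, here is roughly what that might look like in BigQuery-style SQL. The customer_orders table and its columns are invented for the example:

SELECT
  TO_HEX(SHA256(email)) AS hashed_email,             -- hashing: replaces the email with a fixed-length code
  CONCAT('***-**-', SUBSTR(ssn, -4)) AS masked_ssn,  -- masking: hides all but the last four digits
  purchase_amount
FROM customer_orders;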

Your role in data anonymization

Organizations have a responsibility to protect their data and the personal
information that data might contain. As a data analyst, you might be expected to
understand what data needs to be anonymized, but you generally wouldn't be
responsible for the data anonymization itself. A rare exception might be if you
work with a copy of the data for testing or development purposes. In this case,
you could be required to anonymize the data before you work with it.

What types of data should be anonymized?

Healthcare and financial data are two of the most sensitive types of data. These
industries rely a lot on data anonymization techniques. After all, the stakes are
very high. That’s why data in these two industries usually goes through de-
identification, which is a process used to wipe data clean of all personally
identifying information.

Data anonymization is used in just about every industry. That is why it is so
important for data analysts to understand the basics. Here is a list of data that is
often anonymized:

 Telephone numbers
 Names
 License plates and license numbers
 Social security numbers
 IP addresses
 Medical records
 Email addresses
 Photographs
 Account numbers
For some people, it just makes sense that this type of data should be
anonymized. For others, we have to be very specific about what needs to be
anonymized. Imagine a world where we all had access to each other’s
addresses, account numbers, and other identifiable information. That would
invade a lot of people’s privacy and make the world less safe. Data
anonymization is one of the ways we can keep data private and secure!

What is open data?

In data analytics, open data is part of data ethics, which has to do with using
data ethically. Openness refers to free access, usage, and sharing of data. But
for data to be considered open, it has to:
 Be available and accessible to the public as a complete dataset
 Be provided under terms that allow it to be reused and redistributed
 Allow universal participation so that anyone can use, reuse, and redistribute
the data
Data can only be considered open when it meets all three of these standards.

The open data debate: What data should be publicly available?

One of the biggest benefits of open data is that credible databases can be used
more widely. Basically, this means that all of that good data can be leveraged,
shared, and combined with other data. This could have a huge impact on
scientific collaboration, research advances, analytical capacity, and decision-
making. But it is important to think about the individuals being represented in
public, open data, too.

Third-party data is collected by an entity that doesn’t have a direct relationship
with the data. You might remember learning about this type of data earlier. For
example, third parties might collect information about visitors to a certain website.
Doing this lets these third parties create audience profiles, which helps them
better understand user behavior and target them with more effective advertising.

Personally identifiable information (PII) is data that is reasonably likely to
identify a person and make information known about them. It is important to keep
this data safe. PII can include a person’s address, credit card information, social
security number, medical records, and more.

Everyone wants to keep personal information about themselves private. Because
third-party data is readily available, it is important to balance the openness of
data with the privacy of individuals.

The key aspects of universal participation are that everyone must be able
to use, reuse, and redistribute open data. Also, no one can place
restrictions on data to discriminate against a person or group.

The benefits of open data include making good data more widely available
and combining data from different fields of knowledge.

Module 3

Metadata about Data

Databases in data analytics


Databases enable analysts to manipulate, store, and process data. This helps them search through data
a lot more efficiently to get the best insights.

Relational databases
A relational database is a database that contains a series of tables that can be connected to show
relationships. Basically, they allow data analysts to organize and link data based on what the data has
in common.

In a non-relational table, you will find all of the possible variables you might be interested in
analyzing all grouped together. This can make it really hard to sort through. This is one reason why
relational databases are so common in data analysis: they simplify a lot of analysis processes and
make data easier to find and use across an entire database.

Database Normalization
Normalization is a process of organizing data in a relational database. For example, creating tables
and establishing relationships between those tables. It is applied to eliminate data redundancy,
increase data integrity, and reduce complexity in a database.

The key to relational databases


Tables in a relational database are connected by the fields they have in common. You might remember
learning about primary and foreign keys before. As a quick refresher, a primary key is an identifier
that references a column in which each value is unique. In other words, it's a column of a table that is
used to uniquely identify each record within that table. The value assigned to the primary key in a
particular row must be unique within the entire table. For example, if customer_id is the primary key
for the customer table, no two customers will ever have the same customer_id.

By contrast, a foreign key is a field within a table that is a primary key in another table. A table can
have only one primary key, but it can have multiple foreign keys. These keys are what create the
relationships between tables in a relational database, which helps organize and connect data across
multiple tables in the database.

Some tables don't require a primary key. For example, a revenue table can have multiple foreign keys
and not have a primary key. A primary key may also be constructed using multiple columns of a table.
This type of primary key is called a composite key. For example, if customer_id and location_id are
two columns of a composite key for a customer table, the values assigned to those fields in any given
row must be unique within the entire table.
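
Here is a small SQL sketch of the customer and revenue tables described above. The exact syntax for declaring keys varies by database, and the columns beyond customer_id and location_id are made up for illustration:

CREATE TABLE customer (
  customer_id    INTEGER NOT NULL,
  location_id    INTEGER NOT NULL,
  customer_name  VARCHAR(100),
  PRIMARY KEY (customer_id, location_id)        -- composite primary key
);

CREATE TABLE revenue (
  customer_id  INTEGER,                         -- foreign keys referencing the customer table
  location_id  INTEGER,
  amount       NUMERIC(10, 2),
  FOREIGN KEY (customer_id, location_id)
    REFERENCES customer (customer_id, location_id)
  -- note: this table has no primary key of its own
);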
SQL? You’re speaking my language
Databases use a special language to communicate called a query language. Structured Query
Language (SQL) is a type of query language that lets data analysts communicate with a database. So,
a data analyst will use SQL to create a query to view the specific data that they want from within the
larger set. In a relational database, data analysts can write queries to get data from the related tables.
SQL is a powerful tool for working with databases.

As a data analyst, there are three common types of metadata that you'll come across: descriptive,
structural, and administrative.

Descriptive metadata is metadata that describes a piece of data and can be used to identify it
at a later point in time. For instance, the descriptive metadata of a book in a library would include the
code you see on its spine, known as a unique International Standard Book Number, also called the
ISBN. (Description of the data)

Structural metadata is metadata that indicates how a piece of data is organized and
whether it's part of one or more than one data collection. It also indicates exactly how many
collections the data lives in. Let's head back to the library. An example of structural metadata would
be how the pages of a book are put together to create different chapters. It's important to note that
structural metadata also keeps track of the relationship between two things. For example, it can show
us that the digital document of a book manuscript was actually the original version of a now printed
book. (How and where data is collected)

Administrative metadata is metadata that indicates the technical source and details of a digital asset.
The date and time a photo was taken is an example of administrative metadata. When we looked at the
metadata inside a photo earlier, that was administrative metadata: it shows you the type of file it was,
the date and time it was taken, and much more.

Metadata is as important as the data itself


Data analytics, by design, is a field that thrives on collecting and organizing data. In this reading, you
are going to learn about how to analyze and thoroughly understand every aspect of your data.

Take a look at any data you find. What is it? Where did it come from? Is it useful? How do you know?
This is where metadata comes in to provide a deeper understanding of the data. To put it simply,
metadata is data about data. In database management, it provides information about other data and
helps data analysts interpret the contents of the data within a database.

Regardless of whether you are working with a large or small quantity of data, metadata is the mark of
a knowledgeable analytics team, helping to communicate about data across the business and making it
easier to reuse data. In essence, metadata tells the who, what, when, where, which, how, and why of
data.

Elements of metadata
Before looking at metadata examples, it is important to understand what type of information metadata
typically provides.

Title and description


What is the name of the file or website you are examining? What type of content does it contain?

Tags and categories


What is the general overview of the data that you have? Is the data indexed or described in a specific
way?

Who created it and when


Where did the data come from, and when was it created? Is it recent, or has it existed for a long time?

Who last modified it and when


Were any changes made to the data? If yes, were the modifications recent?

Who can access or update it


Is this dataset public? Are special permissions needed to customize or modify the dataset?

Examples of metadata
In today’s digital world, metadata is everywhere, and it is becoming a more common practice to
provide metadata on a lot of media and information you interact with. Here are some real-world
examples of where to find metadata:

Photos
Whenever a photo is captured with a camera, metadata such as camera filename, date, time, and
geolocation are gathered and saved with it.

Emails
When an email is sent or received, there is lots of visible metadata such as subject line, the sender, the
recipient and date and time sent. There is also hidden metadata that includes server names, IP
addresses, HTML format, and software details.

Spreadsheets and documents


Spreadsheets and documents are already filled with a considerable amount of data so it is no surprise
that metadata would also accompany them. Titles, author, creation date, number of pages, user
comments as well as names of tabs, tables, and columns are all metadata that one can find in
spreadsheets and documents.

Websites
Every web page has a number of standard metadata fields, such as tags and categories, site creator’s
name, web page title and description, time of creation and any iconography.

Digital files
Usually, if you right click on any computer file, you will see its metadata. This could consist of file
name, file size, date of creation and modification, and type of file.

Books
Metadata is not only digital. Every book has a number of standard metadata on the covers and inside
that will inform you of its title, author’s name, a table of contents, publisher information, copyright
description, index, and a brief description of the book’s contents.
Data as you know it
Knowing the content and context of your data, as well as how it is structured, is very valuable in your
career as a data analyst. When analyzing data, it is important to always understand the full picture. It is
not just about the data you are viewing, but how that data comes together. Metadata ensures that you
are able to find, use, preserve, and reuse data in the future. Remember, it will be your responsibility to
manage and make use of data in its entirety; metadata is as important as the data itself.

Metadata creates a single source of truth by keeping things consistent and uniform. Metadata also
makes data more reliable by making sure it's accurate, precise, relevant, and timely. This also makes it
easier for data analysts to identify the root causes of any problems that might pop up. A metadata
repository is a database specifically created to store metadata. Metadata repositories make it easier
and faster to bring together multiple sources for data analysis. They do this by describing the state and
location of the metadata, the structure of the tables inside, and how data flows through the repository.

On the other hand, metadata is stored in a single, central location and it gives the company
standardized information about all of its data. This is done in two ways. First, metadata includes
information about where each system is located and where the data sets are located within those
systems. Second, the metadata describes how all of the data is connected between the various
systems. Another important aspect of metadata is something called data governance. Data
governance is a process to ensure the formal management of a company’s data assets. This gives an
organization better control of their data and helps a company manage issues related to data security
and privacy, integrity, usability, and internal and external data flows. It's important to note that data
governance is about more than just standardizing terminology and procedures. It's about the roles and
responsibilities of the people who work with the metadata every day

There are two basic types of data used by data analysts: internal and external. Internal data is data that lives
within a company's own systems. It's typically also generated from within the company. You may also
hear internal data described as primary data. External data is data that lives and is generated outside an
organization. It can come from a variety of places, including other businesses, government sources,
the media, professional associations, schools, and more. External data is sometimes called secondary
data. Gathering internal data can be complicated. Depending on your data analytics project, you might
need data from lots of different sources and departments, including sales, marketing, customer
relationship management, finance, human resources, and even the data archives. But the effort is
worth it. Internal data has plenty of advantages for a business.
There are lots of reasons for open data initiatives. One is to make government activities more
transparent, like letting the public see
where money is spent. It also helps educate citizens about voting and local issues. Open data also
improves public service by giving people ways to be a part of public planning or provide feedback to
the government. Finally, open data leads to innovation and economic growth by helping people and
companies better understand their markets.

BigQuery

 SELECT is the section of a query that indicates what data you want SQL to return to you
 FROM is the section of a query that indicates which table the desired data comes from.
 WHERE is the section of a query that indicates any filters you’d like to apply to your dataset
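
Put together, a simple query uses all three. The project, dataset, table, and column names below are placeholders rather than a real BigQuery public dataset:

SELECT
  customer_id,
  total_spent
FROM
  `my_project.sales_data.customers`   -- placeholder project.dataset.table
WHERE
  total_spent > 100;                  -- filter: only customers who spent more than 100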

Module 4

There are plenty of best practices you can use when organizing data, including naming conventions,
foldering, and archiving older files.

We've talked about file naming before, which is also known as naming conventions. These are
consistent guidelines that describe the content, date, or version of a file in its name. Basically, this
means you want to use logical and descriptive names for your files to make them easier to find and
use.

Speaking of easily finding things, organizing your files into folders helps keep project-related files
together in one place. This is called foldering.

Relational databases can help you avoid data duplication and store your data more efficiently.

Best practices for file naming conventions

Review the following file naming recommendations:

 Work out and agree on file naming conventions early on in a project to avoid renaming files
again and again.

 Align your file naming with your team's or company's existing file-naming conventions.

 Ensure that your file names are meaningful; consider including information like project name and
anything else that will help you quickly identify (and use) the file for the right purpose.

 Include the date and version number in file names; common formats are YYYYMMDD for dates
and v## for versions (or revisions).

 Create a text file as a sample file with content that describes (breaks down) the file naming
convention and a file name that applies it.

 Avoid spaces and special characters in file names. Instead, use dashes, underscores, or capital
letters. Spaces and special characters can cause errors in some applications.

Best practices for keeping files organized

Remember these tips for staying organized as you work with files:

 Create folders and subfolders in a logical hierarchy so related files are stored together.

 Separate ongoing from completed work so your current project files are easier to find. Archive
older files in a separate folder, or in an external storage location.

 If your files aren't automatically backed up, manually back them up often to avoid losing
important work.

To separate current from past work and reduce clutter, data analysts create archives.

The process of structuring folders broadly at the top, then breaking down those folders into more
specific topics, is creating a hierarchy.

Data security means protecting data from unauthorized access or corruption by adopting safety
measures.

Balancing security and analytics


The battle between security and data analytics
Data security means protecting data from unauthorized access or corruption by putting safety
measures in place. Usually the purpose of data security is to keep unauthorized users from accessing
or viewing sensitive data. Data analysts have to find a way to balance data security with their actual
analysis needs. This can be tricky-- we want to keep our data safe and secure, but we also want to use
it as soon as possible so that we can make meaningful and timely observations.

In order to do this, companies need to find ways to balance their data security measures with their data
access needs.
Luckily, there are a few security measures that can help companies do just that. The two we will talk
about here are encryption and tokenization.

Encryption uses a unique algorithm to alter data and make it unusable by users and applications that
don’t know the algorithm. This algorithm is saved as a “key” which can be used to reverse the
encryption; so if you have the key, you can still use the data in its original form.

Tokenization replaces the data elements you want to protect with randomly generated data referred to
as a “token.” The original data is stored in a separate location and mapped to the tokens. To access the
complete original data, the user or application needs to have permission to use the tokenized data and
the token mapping. This means that even if the tokenized data is hacked, the original data is still safe
and secure in a separate location.
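As a rough sketch of these two ideas (not a production design), the snippet below uses the third-party
cryptography package for encryption and the standard library for tokenization; the card number and the
in-memory "vault" are made-up stand-ins for real protected data and a real token-mapping store.

# Contrast encryption (reversible with a key) and tokenization (random tokens
# mapped back to the original values stored in a separate location).
import secrets
from cryptography.fernet import Fernet

# Encryption: anyone holding the key can reverse it and recover the data.
key = Fernet.generate_key()                      # the secret "key"
cipher = Fernet(key)
encrypted = cipher.encrypt(b"4111-1111-1111-1111")
original = cipher.decrypt(encrypted)             # only possible with the key

# Tokenization: replace the value with a random token; keep the mapping elsewhere.
token_vault = {}                                 # stand-in for a separate, secured system

def tokenize(value):
    token = secrets.token_hex(8)                 # random token, reveals nothing about the value
    token_vault[token] = value                   # original data kept in a separate location
    return token

token = tokenize("4111-1111-1111-1111")
recovered = token_vault[token]                   # requires permission to read the vault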

Encryption and tokenization are just some of the data security options out there. There are a lot of
others, like using authentication devices for AI technology.

As a junior data analyst, you probably won’t be responsible for building out these systems. A lot of
companies have entire teams dedicated to data security or hire third party companies that specialize in
data security to create these systems. But it is important to know that all companies have a
responsibility to keep their data secure, and to understand some of the potential systems your future
employer might use.

*A mentor is a professional who shares their knowledge, skills, and experience to help you develop
and grow. They can be trusted advisors, sounding boards, critics, resources, or all of the above.
*A sponsor is a professional advocate who's committed to moving a sponsee's career forward with an
organization. To understand the difference between these two roles, think of it like this: a mentor
helps you skill up, a sponsor helps you move up. Having the support of a sponsor is like having a
safety net. They can give you the confidence to take risks at work,
like asking for a new assignment or promotion.
Course 4
Module 1
Data Integrity and Analytics Objectives

*A strong analysis depends on the integrity of the data.


If you're replicating data at different times in different places, there's a chance your data will be out of
sync.
Data replication is the process of storing data in multiple locations.
Data transfer is the process of copying data from a storage device to memory, or from one computer to
another.
Data manipulation is the process of changing data to make it more organized and easier to read.

Reference: Data constraints and examples


As you progress in your data journey, you'll come across many types of data constraints (criteria that
determine validity), such as data type, data range, mandatory fields, unique values, and regular
expression patterns.
Types Of Insufficient Data
 Data from only one source
 Data that keeps updating
 Outdated Data
 Geographically limited data

Ways to Address Insufficient Data

 Identify trends with the available data
 Wait for more data if time allows
 Talk with stakeholders and adjust your objective
 Look for a new dataset

What to do when you find an issue with your data


When you are getting ready for data analysis, you might realize you don’t have the data you need or
you don’t have enough of it. In some cases, you can use what is known as proxy data in place of the
real data. Think of it like substituting oil for butter in a recipe when you don’t have butter. In other
cases, there is no reasonable substitute and your only option is to collect more data.

Consider the following data issues and suggestions on how to work around them.

Data issue 1: no data


Possible solution: Gather the data on a small scale to perform a preliminary analysis and then request
additional time to complete the analysis after you have collected more data.
Real-life example: If you are surveying employees about what they think about a new performance and
bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data
from all employees.

Possible solution: If there isn't time to collect data, perform the analysis using proxy data from other
datasets. This is the most common workaround.
Real-life example: If you are analyzing peak travel times for commuters but don't have the data for a
particular city, use the data from another city with a similar size and demographic.

Data issue 2: too little data

Possible solution: Do the analysis using proxy data along with actual data.
Real-life example: If you are analyzing trends for owners of golden retrievers, make your dataset larger
by including the data from owners of labradors.

Possible solution: Adjust your analysis to align with the data you already have.
Real-life example: If you are missing data for 18- to 24-year-olds, do the analysis but note the following
limitation in your report: this conclusion applies to adults 25 years and older only.

Data issue 3: wrong data, including data with errors*

Possible solution: If you have the wrong data because requirements were misunderstood, communicate
the requirements again.
Real-life example: If you need the data for female voters and received the data for male voters, restate
your needs.

Possible solution: Identify errors in the data and, if possible, correct them at the source by looking for
a pattern in the errors.
Real-life example: If your data is in a spreadsheet and there is a conditional statement or boolean
causing calculations to be wrong, change the conditional statement instead of just fixing the calculated
values.

Possible solution: If you can't correct data errors yourself, you can ignore the wrong data and go ahead
with the analysis if your sample size is still large enough and ignoring the data won't cause systematic
bias.
Real-life example: If your dataset was translated from a different language and some of the translations
don't make sense, ignore the data with bad translation and go ahead with the analysis of the other data.

* Important note: sometimes data with errors can be a warning sign that the data isn’t reliable. Use
your best judgment.
The solutions above can be organized into a simple decision tree to remind you how to deal with data
errors or not enough data.

When you use sample size or a sample, you use a part of a population that's representative of the
population. The goal is to get enough information from a small group within a population to make
predictions or conclusions about the whole population. The sample size helps determine the degree to
which you can be confident that your conclusions accurately represent the population. Using a sample
for analysis is more cost-effective and takes less time. If done carefully and thoughtfully, you can get
the same results using a sample instead of trying to hunt down every single cat owner to find out
their favorite cat toys. There is a potential downside, though. When you only use a small sample of a
population, it can lead to uncertainty. You can't really be 100 percent sure that your statistics are a
complete and accurate representation of the population. This leads to sampling bias, which we covered
earlier in the program. Sampling bias is when a sample isn't representative of the population as a
whole. This means some members of the population are being over-represented or underrepresented.
Using random sampling can help address some of those issues with sampling bias. Random sampling
is a way of selecting a sample from a population so that every possible type of the sample has an equal
chance of being chosen.
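To make random sampling concrete, here is a tiny sketch using Python's standard library; the
population list is made-up example data.

# Simple random sampling: every member of the population has an equal chance
# of being selected, which helps reduce sampling bias.
import random

cat_owners = [f"owner_{i}" for i in range(10_000)]   # the whole population

random.seed(42)                                      # seeded only to keep the example reproducible
sample = random.sample(cat_owners, k=200)            # 200 owners drawn without replacement

print(len(sample), sample[:5])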

Calculating sample size


Before you dig deeper into sample size, familiarize yourself with these terms and definitions:

Population: The entire group that you are interested in for your study. For example, if you are
surveying people in your company, the population would be all the employees in your company.

Sample: A subset of your population. Just like a food sample, it is called a sample because it is only a
taste. So if your company is too large to survey every individual, you can survey a representative
sample of your population.

Margin of error: Since a sample is used to represent a population, the sample's results are expected to
differ from what the result would have been if you had surveyed the entire population. This difference
is called the margin of error. The smaller the margin of error, the closer the results of the sample are to
what the result would have been if you had surveyed the entire population.

Confidence level: How confident you are in the survey results. For example, a 95% confidence level
means that if you were to run the same survey 100 times, you would get similar results 95 of those
100 times. Confidence level is targeted before you start your study because it will affect how big your
margin of error is at the end of your study.

Confidence interval: The range of possible values that the population's result would be at the
confidence level of the study. This range is the sample result +/- the margin of error.

Statistical significance: The determination of whether your result could be due to random chance or
not. The greater the significance, the less due to chance.

Things to remember when determining the size of your sample

When figuring out a sample size, here are things to keep in mind:

 Don’t use a sample size less than 30. It has been statistically proven that 30 is the smallest sample size
where an average result of a sample starts to represent the average result of a population.
 The confidence level most commonly used is 95%, but 90% can work in some cases.
Increase the sample size to meet specific needs of your project:
 For a higher confidence level, use a larger sample size
 To decrease the margin of error, use a larger sample size
 For greater statistical significance, use a larger sample size

Note: Sample size calculators use statistical formulas to determine a sample size. More about these are
coming up in the course! Stay tuned.
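The note above says calculators rely on statistical formulas. As a rough illustration only (not the
course's calculator), the sketch below applies Cochran's formula with a finite population correction;
the z-scores, the 50% proportion, and the defaults are assumptions you would adjust for your own study.

# Approximate minimum sample size for estimating a proportion.
import math

Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}   # common confidence levels

def sample_size(population, confidence=0.95, margin_of_error=0.05, p=0.5):
    z = Z_SCORES[confidence]
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2   # infinite-population estimate
    n = n0 / (1 + (n0 - 1) / population)                  # finite population correction
    return math.ceil(n)

# A city of 200,000 residents at a 95% confidence level and 5% margin of error:
print(sample_size(200_000))   # roughly 384 respondents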

Why a minimum sample of 30?

This recommendation is based on the Central Limit Theorem (CLT) in the field of probability and
statistics. As sample size increases, the results more closely resemble the normal (bell-shaped) distribution
from a large number of samples. A sample of 30 is the smallest sample size for which the CLT is still valid.
Researchers who rely on regression analysis – statistical methods to determine the relationships between
controlled and dependent variables – also prefer a minimum sample of 30.

Sample sizes vary by business problem


Sample size will vary based on the type of business problem you are trying to solve.

For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a
survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller
sample size look like?

Would 200 be alright if the people surveyed represented every district in the city?

Answer: It depends on the stakes.

 A sample size of 200 might be large enough if your business problem is to find out how residents
felt about the new library
 A sample size of 200 might not be large enough if your business problem is to determine how
residents would vote to fund the library
You could probably accept a larger margin of error surveying how residents feel about the new library
versus surveying residents about how they would vote to fund it. For that reason, you would most
likely use a larger sample size for the voter survey.

Larger sample sizes have a higher cost


You also have to weigh the cost against the benefits of more accurate results with a larger sample size.
Someone who is trying to understand consumer preferences for a new line of products wouldn’t need
as large a sample size as someone who is trying to understand the effects of a new drug. For drug
safety, the benefits outweigh the cost of using a larger sample size. But for consumer preferences, a
smaller sample size at a lower cost could provide good enough results.

Testing your Data

 Statistical power is the probability of getting meaningful results from a test.


 Hypothesis testing is a way to see if a survey or experiment has meaningful results. Here's an
example. Let's say you work for a restaurant chain that's planning a marketing campaign for their
new milkshakes. You need to test the ad on a group of customers before turning it into a
nationwide ad campaign.
 You need a larger sample size so you can make sure you get a good number of all types of
people for your test. Usually, the larger the sample size, the greater the chance you'll have
statistically significant results with your test. And that's statistical power.
 Statistical power is usually shown as a value out of one
 If a test is statistically significant, it means the results of the test are real and not an error caused
by random chance.
 You need a statistical power of at least 0.8 or 80% to consider your results statistically
significant.

Proxy data examples


Sometimes the data to support a business objective isn’t readily available. This is when proxy data is
useful. Take a look at the following scenarios and where proxy data comes in for each example:

Business scenario: A new car model was just launched a few days ago and the auto dealership can't
wait until the end of the month for sales data to come in. They want sales projections now.
How proxy data can be used: The analyst proxies the number of clicks to the car specifications on the
dealership's website as an estimate of potential sales at the dealership.

Business scenario: A brand new plant-based meat product was only recently stocked in grocery stores
and the supplier needs to estimate the demand over the next four years.
How proxy data can be used: The analyst proxies the sales data for a turkey substitute made out of tofu
that has been on the market for several years.

Business scenario: The Chamber of Commerce wants to know how a tourism campaign is going to
impact travel to their city, but the results from the campaign aren't publicly available yet.
How proxy data can be used: The analyst proxies the historical data for airline bookings to the city one
to three months after a similar campaign was run six months earlier.

Determine the Sample Size


 The confidence level is the probability that your sample accurately reflects the greater
population. You can think of it the same way as confidence in anything else. It's how strongly
you feel that you can rely on something or someone. Having a 99 percent confidence level is
ideal. But most industries hope for at least a 90 or 95 percent confidence level.
 Margin of error tells you how close your sample's results are to what the results would be if you
used the entire population that your sample represents.
 In order to have a high confidence level in a customer survey, the sample size should accurately
reflect the entire population.

Sample size calculator


In this reading, you will learn the basics of sample size calculators, how to use them, and how to
understand the results. A sample size calculator tells you how many people you need to interview (or
things you need to test) to get results that represent the target population. Let’s review some terms you
will come across when using a sample size calculator:

 Confidence level: The probability that your sample size accurately reflects the greater
population.
 Margin of error: The maximum amount that the sample results are expected to differ from those
of the actual population.
 Population: This is the total number you hope to pull your sample from.
 Sample: A part of a population that is representative of the population.
 Estimated response rate: If you are running a survey of individuals, this is the percentage of
people you expect will complete your survey out of those who received the survey.

What to do with the results


After you have plugged your information into one of these calculators, it will give you a
recommended sample size. Keep in mind, the calculated sample size is the minimum number to
achieve what you input for confidence level and margin of error. If you are working with a survey,
you will also need to think about the estimated response rate to figure out how many surveys you will
need to send out. For example, if you need a sample size of 100 individuals and your estimated
response rate is 10%, you will need to send your survey to 1,000 individuals to get the 100 responses
you need for your analysis.
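As a quick sketch of that arithmetic (hypothetical numbers), the response-rate adjustment looks like this:

# How many surveys to send so the expected completions meet the minimum sample size.
import math

def surveys_to_send(required_sample, estimated_response_rate):
    return math.ceil(required_sample / estimated_response_rate)

print(surveys_to_send(100, 0.10))   # 1,000 surveys sent for 100 expected responses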

Evaluate the Reliability of your Data


 As a data analyst, it's important for you to figure out sample size and variables like confidence
level and margin of error before running any kind of test or survey. It's the best way to make sure
your results are objective, and it gives you a better chance of getting statistically significant
results. But if you already know the sample size, like when you're given survey results to
analyze, you can calculate the margin of error yourself. Then you'll have a better idea of how
much of a difference there is between your sample and your population.
 Margin of error is the maximum that the sample results are expected to differ from those of the
actual population.
 Based on the sample size, the resulting margin of error will tell us how different the results might
be compared to the results if we had surveyed the entire population.
 Margin of error helps you understand how reliable the data from your hypothesis testing is.
 The closer to zero the margin of error, the closer your results from your sample would match
results from the overall population.
 In general, the more people you include in your survey, the more likely your sample is
representative of the entire population.
 Decreasing the confidence level would also have the same effect, but that would also make it less
likely that your survey is accurate.

All about margin of error


Margin of error is the maximum amount that the sample results are expected to differ from those of
the actual population. More technically, the margin of error defines a range of values below and above
the average result for the sample. The average result for the entire population is expected to be within
that range. We can better understand margin of error by using some examples below.

Imagine you are playing baseball and that you are up at bat. The crowd is roaring, and you are getting
ready to try to hit the ball. The pitcher delivers a fastball traveling about 90-95mph, which takes about
400 milliseconds (ms) to reach the catcher’s glove. You swing and miss the first pitch because your
timing was a little off. You wonder if you should have swung slightly earlier or slightly later to hit a
home run. That time difference can be considered the margin of error, and it tells us how close or far
your timing was from the average home run swing.

Margin of error in marketing

The margin of error is also important in marketing. Let’s use A/B testing as an example. A/B testing
(or split testing) tests two variations of the same web page to determine which page is more successful
in attracting user traffic and generating revenue. User traffic that gets monetized is known as the
conversion rate. A/B testing allows marketers to test emails, ads, and landing pages to find the data
behind what is working and what isn’t working. Marketers use the confidence interval (determined
by the conversion rate and the margin of error) to understand the results.

For example, suppose you are conducting an A/B test to compare the effectiveness of two different
email subject lines to entice people to open the email. You find that subject line A: “Special offer just
for you” resulted in a 5% open rate compared to subject line B: “Don’t miss this opportunity” at 3%.

Does that mean subject line A is better than subject line B? It depends on your margin of error. If the
margin of error was 2%, then subject line A’s actual open rate or confidence interval is somewhere
between 3% and 7%. Since the lower end of the interval overlaps with subject line B’s results at 3%,
you can’t conclude that there is a statistically significant difference between subject line A and B.
Examining the margin of error is important when making conclusions based on your test results.

Want to calculate your margin of error?


All you need is population size, confidence level, and sample size. In order to better understand this
calculator, review these terms:

 Confidence level: A percentage indicating how likely your sample accurately reflects the greater
population
 Population: The total number you pull your sample from
 Sample: A part of a population that is representative of the population
 Margin of error: The maximum amount that the sample results are expected to differ from those
of the actual population
In most cases, a 90% or 95% confidence level is used. But, depending on your industry, you might
want to set a stricter confidence level. A 99% confidence level is reasonable in some industries, such
as the pharmaceutical industry.
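For illustration only, here is a rough sketch of how a margin of error for a proportion could be
computed with the standard normal-approximation formula and a finite population correction; a given
calculator may use a slightly different formula, and the numbers below are made up.

# Margin of error for a survey proportion (e.g., 0.5 means 50% answered "yes").
import math

Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def margin_of_error(sample_size, population, proportion=0.5, confidence=0.95):
    z = Z_SCORES[confidence]
    moe = z * math.sqrt(proportion * (1 - proportion) / sample_size)
    fpc = math.sqrt((population - sample_size) / (population - 1))   # finite population correction
    return moe * fpc

# 500 respondents from a population of 200,000 at a 95% confidence level:
print(round(margin_of_error(500, 200_000), 3))   # about 0.044, i.e. +/- 4.4 percentage points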

Key takeaway
Margin of error is used to determine how close your sample’s result is to what the result would likely
have been if you could have surveyed or tested the entire population. Margin of error helps you
understand and interpret survey or test results in real life. Calculating the margin of error is
particularly helpful when you are given the data to analyze. After using a calculator to calculate the
margin of error, you will know how much the sample results might differ from the results of the entire
population.

Module 2
 Can you guess what inaccurate or bad data costs businesses every year? Thousands of dollars?
Millions? Billions? Well, according to IBM, the yearly cost of poor-quality data is $3.1 trillion in
the US alone.
 The most common cause of dirty data is not a new system implementation or a computer
technical glitch; it's actually human error.
 Dirty data can be the result of someone typing in a piece of data incorrectly; inconsistent
formatting; blank fields; or the same piece of data being entered more than once, which creates
duplicates. Dirty data is data that's incomplete, incorrect, or irrelevant to the problem you're
trying to solve.
 When you work with dirty data, you can't be sure that your results are correct. In fact, you can
pretty much bet they won't be. Earlier, you learned that data integrity is critical to reliable
data analytics results, and clean data helps you achieve data integrity. Clean data is data that's
complete, correct, and relevant to the problem you're trying to solve. When you work with clean
data, you'll find that your projects go much more smoothly.
 Clean data is incredibly important for effective analysis. If a piece of data is entered into a
spreadsheet or database incorrectly, or if it's repeated, or if a field is left blank, or if data formats
are inconsistent, the result is dirty data. Small mistakes can lead to big consequences in the long
run.

Let's talk about some people you'll work with as a data analyst
 Data engineers transform data into a useful format for analysis and give it a reliable
infrastructure. This means they develop, maintain, and test databases, data processors and related
systems.
 Data warehousing specialists develop processes and procedures to effectively store and organize
data. They make sure that data is available, secure, and backed up to prevent loss.
*It's always a good idea to examine and clean data before beginning analysis
 Data cleaning becomes even more important when working with external data, especially if it
comes from multiple sources
 A null is an indication that a value does not exist in a data set. Note that it's not the same as a
zero. In the case of a survey, a null would mean the customers skipped that question. A zero
would mean they provided zero as their response. To do your analysis, you would first need to
clean this data. Step one would be to decide what to do with those nulls. You could either filter
them out and communicate that you now have a smaller sample size, or you can keep them in and
learn from the fact that the customers did not provide responses. There's lots of reasons why this
could have happened. Maybe your survey questions weren't written as well as they could be.
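As a rough illustration of that first cleaning step, here is a small pandas sketch (the column names and
values are hypothetical) that counts the nulls and shows the two options: filter them out or keep them
and label the non-responses.

# Nulls are missing values (skipped questions), not zeros.
import pandas as pd

survey = pd.DataFrame({
    "respondent": [1, 2, 3, 4],
    "rating": [5, None, 4, None],                  # None becomes a null (NaN), not a zero
})

print(survey["rating"].isna().sum())               # how many respondents skipped the question

filtered = survey.dropna(subset=["rating"])        # option 1: drop nulls (smaller sample size)
kept = survey.fillna({"rating": "no answer"})      # option 2: keep them and learn from the non-responses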

Types of Dirty Data

Duplicate data
Description: Any data record that shows up more than once.
Possible causes: Manual data entry, batch data imports, or data migration.
Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions,
or confusion during data retrieval.

Outdated data
Description: Any data that is old and should be replaced with newer and more accurate information.
Possible causes: People changing roles or companies, or software and systems becoming obsolete.
Potential harm to businesses: Inaccurate insights, decision-making, and analytics.

Incomplete data
Description: Any data that is missing important fields.
Possible causes: Improper data collection or incorrect data entry.
Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete
essential services.

Incorrect/inaccurate data
Description: Any data that is complete but inaccurate.
Possible causes: Human error inserted during data input, fake information, or mock data.
Potential harm to businesses: Inaccurate insights or decision-making based on bad information,
resulting in revenue loss.

Inconsistent data
Description: Any data that uses different formats to represent the same thing.
Possible causes: Data stored incorrectly or errors inserted during data transfer.
Potential harm to businesses: Contradictory data points leading to confusion or inability to classify or
segment customers.

 Clean data depends largely on the data integrity rules that an organization follows, such as
spelling and punctuation guidelines. Nulls, which are empty fields, are another kind of dirty data
that requires a little more work than just fixing a spelling error or changing a format.
 There are some other types of dirty data. The first has to do with labeling. To understand labeling,
imagine trying to get a computer to correctly identify panda bears among images of all different
kinds of animals. You need to show the computer thousands of images of panda bears, all labeled
as panda bears. Any incorrectly labeled picture, such as one labeled simply "bear," will cause a
problem.
 A field is a single piece of information from a row or column of a spreadsheet.
 Field length is a tool for determining how many characters can be keyed into a field. Assigning a
certain length to the fields in your spreadsheet is a great way to avoid errors.
 Data validation is a tool for checking the accuracy and quality of data before adding or importing
it.
 Check for consistency by ensuring measures are formatted or structured the same way across
systems.

 Check for completeness by ensuring that all important data is included.


 Validity is the concept of using data integrity principles to ensure measures conform to
defined business rules or constraints
 Accuracy is the degree of conformity of a measure to a standard or a true value

 Clean data is essential to data integrity and reliable solutions and decisions.
 However, before removing unwanted data, it's always a good practice to make a copy of the data
set. That way, if you remove something that you end up needing in the future, you can easily
access it and put it back in the data set.
 Irrelevant data is data that doesn't fit the specific problem that you're trying to solve.
 Extra spaces can cause unexpected results when you sort, filter, or search through your data
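Putting those practices together, here is a short, hypothetical pandas sketch: copy the dataset before
touching it, trim extra spaces, remove duplicates, and drop rows that are irrelevant to the question at hand.

import pandas as pd

raw = pd.DataFrame({
    "member": ["  Ada ", "Ada", "Grace", "Grace"],
    "state":  ["NY", "NY", "CA", "CA"],
})

df = raw.copy()                               # keep the original untouched in case you need it back
df["member"] = df["member"].str.strip()       # remove leading and trailing spaces
df = df.drop_duplicates()                     # remove duplicate records
df = df[df["state"] == "NY"]                  # drop data irrelevant to a New-York-only question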

Cleaning Data from Different Sources
 A merger is an agreement that unites two organizations into a single new one.
 Data merging is the process of combining two or more datasets into a single dataset. This
presents a unique challenge because when two totally different datasets are combined, the
information is almost guaranteed to be inconsistent and misaligned.
 In data analytics, compatibility describes how well two or more datasets are able to work
together.

Common mistakes to avoid

 Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most
of the time the wrong spelling or common grammatical errors can be detected, but it gets harder
with things like names or addresses. For example, if you are working with a spreadsheet table of
customer data, you might come across a customer named “John” whose name has been input
incorrectly as “Jon” in some places. The spreadsheet’s spellcheck probably won’t flag this, so if
you don’t double-check for spelling errors and catch this, your analysis will have mistakes in it.

 Forgetting to document errors: Documenting your errors can be a big time saver, as it helps
you avoid those errors in the future by showing you how you resolved them. For example, you
might find an error in a formula in your spreadsheet. You discover that some of the dates in one
of your columns haven’t been formatted correctly. If you make a note of this fix, you can
reference it the next time your formula is broken, and get a head start on troubleshooting.
Documenting your errors also helps you keep track of changes in your work, so that you can
backtrack if a fix didn’t work.

 Not checking for misfielded values: A misfielded value happens when the values are entered
into the wrong field. These values might still be formatted correctly, which makes them harder to
catch if you aren’t careful. For example, you might have a dataset with columns for cities and
countries. These are the same type of data, so they are easy to mix up. But if you were trying to
find all of the instances of Spain in the country column, and Spain had mistakenly been entered
into the city column, you would miss key data points. Making sure your data has been entered
correctly is key to accurate, complete analysis.

 Overlooking missing values: Missing values in your dataset can create errors and give you
inaccurate conclusions. For example, if you were trying to get the total number of sales from the
last three months, but a week of transactions were missing, your calculations would be
inaccurate. As a best practice, try to keep your data as clean as possible by maintaining
completeness and consistency.

 Only looking at a subset of the data: It is important to think about all of the relevant data when
you are cleaning. This helps make sure you understand the whole story the data is telling, and
that you are paying attention to all possible errors. For example, if you are working with data
about bird migration patterns from different sources, but you only clean one source, you might
not realize that some of the data is being repeated. This will cause problems in your analysis later
on. If you want to avoid common errors like duplicates, each field of your data requires equal
attention.

 Losing track of business objectives: When you are cleaning data, you might make new and
interesting discoveries about your dataset-- but you don’t want those discoveries to distract you
from the task at hand. For example, if you were working with weather data to find the average
number of rainy days in your city, you might notice some interesting patterns about snowfall, too.
That is really interesting, but it isn’t related to the question you are trying to answer right now.
Being curious is great! But try not to let it distract you from the task at hand.
 Not fixing the source of the error: Fixing the error itself is important. But if that error is
actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will
have to keep fixing that same error over and over again. For example, imagine you have a team
spreadsheet that tracks everyone’s progress. The table keeps breaking because different people
are entering different values. You can keep fixing all of these problems one by one, or you can
set up your table to streamline data entry so everyone is on the same page. Addressing the source
of the errors in your data will save you a lot of time in the long run.

 Not analyzing the system prior to data cleaning: If you want to clean your data and avoid future
errors, you need to understand the root cause of your dirty data. Imagine you are an auto
mechanic. You would find the cause of the problem before you started fixing the car, right? The
same goes for data. First, you figure out where the errors come from. Maybe it is from a data
entry error, not setting up a spell check, lack of formats, or from duplicates. Then, once you
understand where bad data comes from, you can control it and keep your data clean.

 Not backing up your data prior to data cleaning: It is always good to be proactive and create
your data backup before you start your data clean-up. If your program crashes, or if your changes
cause a problem in your dataset, you can always go back to the saved version and restore it. The
simple procedure of backing up your data can save you hours of work-- and most importantly, a
headache.

 Not accounting for data cleaning in your deadlines/process: All good things take time, and
that includes data cleaning. It is important to keep that in mind when going through your process
and looking at your deadlines. When you set aside time for data cleaning, it helps you get a more
accurate estimate for ETAs for stakeholders, and can help you know when to request an adjusted
ETA.

Data cleaning is essential for accurate analysis and decision-making. Common mistakes to avoid when
cleaning data include spelling errors, misfielded values, missing values, only looking at a subset of the
data, losing track of business objectives, not fixing the source of the error, not analyzing the system prior
to data cleaning, not backing up your data prior to data cleaning, and not accounting for data cleaning in
your deadlines/process. By avoiding these mistakes, you can ensure that your data is clean and accurate,
leading to better outcomes for your business.

Data Cleaning Features in Spreadsheets


 Conditional formatting is a spreadsheet tool that changes how cells appear when values meet
specific conditions. Likewise, it can let you know when a cell does not meet the conditions
you've set. Visual cues like this are very useful for data analysts, especially when we're working
in a large spreadsheet with lots of data. Making certain data points stand out makes the
information easier to understand and analyze. For cleaning data, knowing when the data doesn't
follow the condition is very helpful

 "Remove duplicates" is a tool that automatically searches for and eliminates duplicate entries
from a spreadsheet. Choose "Data has header row" because our spreadsheet has a row at the very
top that describes the contents of each column

 In data analytics, a text string is a group of characters within a cell, most often composed of
letters. An important characteristic of a text string is its length, which is the number of characters
in it.

 Substring is a smaller subset of a text string

 split is a tool that divides a text string around the specified character and puts each fragment into
a new and separate cell. Split is helpful when you have more than one piece of data in a cell and
you want to separate them out

 Split text to columns is also helpful for fixing instances of numbers stored as text. Sometimes
values in your spreadsheet will seem like numbers, but they're formatted as text. This can happen
when copying and pasting from one place to another or if the formatting's wrong

 Delimiter is a term for a character that indicates the beginning or end of a data item, such as a
comma
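For comparison only, here is a minimal Python sketch of the same idea: splitting one cell's contents on
a comma delimiter and converting a number stored as text into an actual number. The example row is
made up, and this is an analogy, not the spreadsheet feature itself.

# Mimics "Split text to columns" on a comma delimiter.
row = "Lopez, Maria, 100045"                       # one cell holding several pieces of data

last, first, member_id = [part.strip() for part in row.split(",")]

member_id = int(member_id)                         # "100045" was text; now it is a number
print(last, first, member_id + 1)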

Optimize the data cleaning process

 Functions can optimize your efforts to ensure data integrity. As a reminder, a function is a set of
instructions that performs a specific calculation using the data in a spreadsheet.

 COUNTIF is a function that returns the number of cells that match a specified value. Basically, it
counts the number of times a value appears in a range of cells.

 LEN is a function that tells you the length of a text string by counting the number of characters
it contains. This is useful when cleaning data if a certain piece of information in your
spreadsheet must be a certain length. For example, if an association uses six-digit member
identification codes and you've just imported this data, you could use LEN to be sure the codes
are all the correct number of digits. The syntax of LEN is equals LEN, open
parenthesis, the range, and the close parenthesis.
 RIGHT is a function that gives you a set number of characters from the right side of a text string.
The syntax is equals RIGHT, open parenthesis, the range, a comma and the number of characters
we want. Then, we finish with a closed parenthesis

 The syntax of LEFT is equals LEFT, open parenthesis, the range, a comma, and a number of
characters from the left side of the text string we want. Then, we finish it with a closed
parenthesis.

 MID is a function that gives you a segment from the middle of a text string. The syntax for MID
is equals MID, open parenthesis, the range, then a comma. When using MID, you always need to
supply a reference point; in other words, you need to set where the function should start. After
that, place another comma and how many middle characters you want.

 CONCATENATE is a function that joins together two or more text strings. The syntax is
equals CONCATENATE, then an open parenthesis; inside, indicate each text string you want to
join, separated by commas, and finish with a closed parenthesis.

 TRIM is a function that removes leading, trailing, and repeated spaces in data. Sometimes when
you import data, your cells have extra spaces, which can get in the way of your analysis

 The syntax for TRIM is equals TRIM, open parenthesis, your range, and closed parenthesis.
TRIM fixed the extra space
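The functions above are spreadsheet functions. As a rough point of comparison only, the Python
snippet below mimics what each one does on a small, made-up list of member codes; it is not the
spreadsheet syntax itself.

codes = [" 102401", "558391", "19305", "884772"]

count_if = sum(1 for c in codes if c.strip() == "558391")   # COUNTIF: how often a value appears
trimmed = [c.strip() for c in codes]                        # TRIM: remove leading/trailing spaces
lengths = [len(c) for c in trimmed]                         # LEN: characters per code (flags "19305")
left3 = trimmed[0][:3]                                      # LEFT: first 3 characters
right3 = trimmed[0][-3:]                                    # RIGHT: last 3 characters
mid2 = trimmed[0][2:4]                                      # MID: 2 characters starting at position 3
joined = trimmed[0] + "-" + trimmed[1]                      # CONCATENATE: join text strings

print(count_if, lengths, left3, right3, mid2, joined)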

Workflow automation
In this reading, you will learn about workflow automation and how it can help you work faster and
more efficiently. Basically, workflow automation is the process of automating parts of your work.
That could mean creating an event trigger that sends a notification when a system is updated. Or it
could mean automating parts of the data cleaning process. As you can probably imagine, automating
different parts of your work can save you tons of time, increase productivity, and give you more
bandwidth to focus on other important aspects of the job.

What can be automated?
Automation sounds amazing, doesn’t it? But as convenient as it is, there are still some parts of
the job that can’t be automated. Let's take a look at some things we can automate and some
things that we can’t.

Task: Communicating with your team and stakeholders
Can it be automated? No
Why? Communication is key to understanding the needs of your team and stakeholders as you
complete the tasks you are working on. There is no replacement for person-to-person communications.

Task: Presenting your findings
Can it be automated? No
Why? Presenting your data is a big part of your job as a data analyst. Making data accessible and
understandable to stakeholders and creating data visualizations can't be automated for the same
reasons that communications can't be automated.

Task: Preparing and cleaning data
Can it be automated? Partially
Why? Some tasks in data preparation and cleaning can be automated by setting up specific processes,
like using a programming script to automatically detect missing values.

Task: Data exploration
Can it be automated? Partially
Why? Sometimes the best way to understand data is to see it. Luckily, there are plenty of tools
available that can help automate the process of visualizing data. These tools can speed up the process
of visualizing and understanding the data, but the exploration itself still needs to be done by a data
analyst.

Task: Modeling the data
Can it be automated? Yes
Why? Data modeling is a difficult process that involves lots of different factors; luckily there are tools
that can completely automate the different stages.

Different Data Perspectives


 A pivot table is a data summarization tool that is used in data processing. Pivot tables sort,
reorganize, group, count, total, or average data stored in a database. In data cleaning, pivot
tables are used to give you a quick, clutter-free view of your data. You can choose to look at the
specific parts of the data set that you need to get a visual in the form of a pivot table.
 VLOOKUP stands for vertical lookup. It's a function that searches for a certain value in a column
to return a corresponding piece of information. When data analysts look up information for a
project, it's rare for all of the data they need to be in the same place. The syntax of
VLOOKUP is equals VLOOKUP, open parenthesis, then the data you want to look up. Next is a
comma and where you want to look for that data, followed by the number of the column that
holds the value to return, and finally whether you want an approximate or exact match.
 Plotting is very useful when trying to identify any skewed data or outliers
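If you work in Python rather than a spreadsheet, pandas offers rough analogues of both ideas. The
sketch below (with made-up data) builds a pivot table and performs a VLOOKUP-style lookup with a
merge; treat it as an illustration, not the spreadsheet features themselves.

import pandas as pd

sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2],
    "product":  ["A", "B", "A", "B"],
    "revenue":  [100, 250, 80, 300],
})
stores = pd.DataFrame({"store_id": [1, 2], "region": ["East", "West"]})

# Pivot table: total revenue per store and product in one clutter-free view.
summary = sales.pivot_table(index="store_id", columns="product",
                            values="revenue", aggfunc="sum")

# VLOOKUP-style lookup: pull each store's region into the sales table.
enriched = sales.merge(stores, on="store_id", how="left")
print(summary, enriched, sep="\n\n")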

Even more Data Cleaning Techniques

 Data mapping is the process of matching fields from one database to another. This is very
important to the success of data migration, data integration, and lots of other data management
activities.
 compatibility describes how well two or more data sets are able to work together.

Module 3
Spreadsheets and SQL both have their advantages and disadvantages:

Features of Spreadsheets:
 Smaller datasets
 Enter data manually
 Create graphs and visualizations in the same program
 Built-in spell check and other useful functions
 Best when working solo on a project

Features of SQL Databases:
 Larger datasets
 Access tables across a database
 Prepare data for further analysis in another software
 Fast and powerful functionality
 Great for collaborative work and tracking queries run by all users
When it comes down to it, where the data lives will decide which tool you use. If you are working
with data that is already in a spreadsheet, that is most likely where you will perform your analysis.
And if you are working with data stored in a database, SQL will be the best tool for you to use for
your analysis.

 Data is measured by the number of bits it takes to represent it

A byte is a collection of 8 bits. Take a moment to examine the table below to get a feel for the
difference between data measurements and their relative sizes to one another.

Unit Equivalent to Abbreviation Real-World Example

Byte 8 bits B 1 character in a string

Kilobyte 1024 bytes KB A page of text (~4 kilobytes)

Megabyte 1024 Kilobytes MB 1 song in MP3 format (~2-3 megabytes)

Gigabyte 1024 Megabytes GB ~300 songs in MP3 format

Terabyte 1024 Gigabytes TB ~500 hours of HD video

Petabyte 1024 Terabytes PB 10 billion Facebook photos

Exabyte 1024 Petabytes EB ~500 million hours of HD video

Zettabyte 1024 Exabytes ZB All the data on the internet in 2019 (~4.5 ZB)

Advanced data cleaning functions

Module 4
 Verification is a process to confirm that a data cleaning effort was well-executed and the
resulting data is accurate and reliable. It involves rechecking your clean dataset, doing some
manual clean ups if needed, and taking a moment to sit back and really think about the original
purpose of the project. That way, you can be confident that the data you collected is credible and
appropriate for your purposes.
 Making sure your data is properly verified is so important because it allows you to double-check
that the work you did to clean up your data was thorough and accurate.
 Verification lets you catch mistakes before you begin analysis.
 Another big part of the verification process is reporting on your efforts.
 Open communication is a lifeline for any data analytics project. Reports are a super effective
way to show your team that you're being 100 percent transparent about your data cleaning.
 Reporting is also a great opportunity to show stakeholders that you're accountable, build trust
with your team, and make sure you're all on the same page of important project details.
 A changelog is a file containing a chronologically ordered list of modifications made to a
project. It's usually organized by version and includes the date followed by a list of
added, improved, and removed features.
 Changelogs are very useful for keeping track of how a dataset evolved over the course of a
project. They're also another great way to communicate and report on data to others

Cleaning and your data cleaning expectations

 Verification is a critical part of any analysis project. Without it you have no way of knowing that
your insights can be relied on for data-driven decision-making. Think of verification as a stamp
of approval

 It also involves manually cleaning data to compare your expectations with what's actually
present. The first step in the verification process is going back to your original unclean data set
and comparing it to what you have now.

 Review the dirty data and try to identify any common problems. For example, maybe you had a
lot of nulls. In that case, you check your clean data to ensure no nulls are present. To do that, you
could search through the data manually or use tools like conditional formatting or filters.
 Another key part of verification involves taking a big-picture view of your project. This is an
opportunity to confirm you're actually focusing on the business problem that you need to solve
and the overall project goals and to make sure that your data is actually capable of solving that
problem and achieving those goals.
Documenting Results and the cleaning process
 Recalling the errors that were cleaned
 Informing others of the changes made, even when the data errors aren't fixable
 Documentation also helps you determine the quality of the data to be used in analysis.

Embrace Changelogs
What do engineers, writers, and data analysts have in common? Change.

Engineers use engineering change orders (ECOs) to keep track of new product design details and
proposed changes to existing products. Writers use document revision histories to keep track of
changes to document flow and edits. And data analysts use changelogs to keep track of data
transformation and cleaning.

Automated version control takes you most of the way


Most software applications have a kind of history tracking built in. For example, in Google sheets, you
can check the version history of an entire sheet or an individual cell and go back to an earlier version.
In Microsoft Excel, you can use a feature called Track Changes. And in BigQuery, you can view the
history to check what has changed.

Here’s how it works:

Google Sheets: 1. Right-click a cell and select Show edit history. 2. Click the left arrow < or right
arrow > to move backward and forward in the history as needed.

Microsoft Excel: 1. If Track Changes has been enabled for the spreadsheet, click Review. 2. Under
Track Changes, click the Accept/Reject Changes option to accept or reject any change made.

BigQuery: Bring up a previous version (without reverting to it) and figure out what changed by
comparing it to the current version.

Changelogs take you down the last mile

A changelog can build on your automated version history by giving you an even more detailed record
of your work. This is where data analysts record all the changes they make to the data. Here is another
way of looking at it. Version histories record what was done in a data change for a project, but don't
tell us why. Changelogs are super useful for helping us understand the reasons changes have been
made. Changelogs have no set format and you can even make your entries in a blank document. But if
you are using a shared changelog, it is best to agree with other data analysts on the format of all your
log entries.

Typically, a changelog records this type of information:

 Data, file, formula, query, or any other component that changed


 Description of what changed
 Date of the change
 Person who made the change
 Person who approved the change
 Version number
 Reason for the change
Let’s say you made a change to a formula in a spreadsheet because you observed it in another report
and you wanted your data to match and be consistent. If you found out later that the report was
actually using the wrong formula, an automated version history would help you undo the change. But
if you also recorded the reason for the change in a changelog, you could go back to the creators of the
report and let them know about the incorrect formula. If the change happened a while ago, you might
not remember who to follow up with. Fortunately, your changelog would have that information ready
for you! By following up, you would ensure data integrity outside your project. You would also be
showing personal integrity as someone who can be trusted with data. That is the power of a
changelog!

Finally, a changelog is important for when lots of changes to a spreadsheet or query have been made.
Imagine an analyst made four changes and the change they want to revert to is change #2. Instead of
clicking the undo feature three times to undo change #2 (and losing changes #3 and #4), the analyst
can undo just change #2 and keep all the other changes. Now, our example was for just 4 changes, but
try to think about how important that changelog would be if there were hundreds of changes to keep
track of.

What also happens IRL (in real life)


A junior analyst probably only needs to know the above with one exception. If an analyst is making
changes to an existing SQL query that is shared across the company, the company most likely uses
what is called a version control system. An example might be a query that pulls daily revenue to
build a dashboard for senior management.

Here is how a version control system affects a change to a query:

1. A company has official versions of important queries in their version control system.
2. An analyst makes sure the most up-to-date version of the query is the one they will change. This
is called syncing.
3. The analyst makes a change to the query.
4. The analyst might ask someone to review this change. This is called a code review and can be
informally or formally done. An informal review could be as simple as asking a senior analyst to
take a look at the change.
5. After a reviewer approves the change, the analyst submits the updated version of the query to a
repository in the company's version control system. This is called a code commit. A best practice
is to document exactly what the change was and why it was made in a comments area. Going
back to our example of a query that pulls daily revenue, a comment might be: Updated revenue to
include revenue coming from the new product, Calypso.
6. After the change is submitted, everyone else in the company will be able to access and use this
new query when they sync to the most up-to-date queries stored in the version control system.
7. If the query has a problem or business needs change, the analyst can undo the change to the
query using the version control system. The analyst can look at a chronological list of all changes
made to the query and who made each change. Then, after finding their own change, the analyst
can revert to the previous version.
8. The query is back to what it was before the analyst made the change. And everyone at the
company sees this reverted, original query, too.

 Some of the most common errors involve human mistakes like mistyping or misspelling, flawed
processes like poor design of a survey form, and system issues where older systems integrate data
incorrectly
