
Coursera Google Data Analytics

Course 3 Notes
Week 1 - Data Types and Structures

Data Types and Structures


Course 3 Learning Objectives:
● Explain how data is generated as a part of our daily activities with reference to
the types of data generated
● Explain factors that should be considered when making decisions about data
collection
● Explain the difference between structured and unstructured data
● Discuss the difference between data and data types
● Explain the relationship between data types, fields, and values
● Discuss wide and long data formats with references to organization and
purpose

Collecting Data
Data Collection in our world
Every day we generate a lot of data.

How data is collected:


- Interviews
- Observations
- Forms
- Questionnaires
- Surveys
- Cookies: Small files stored in computers that contain information about users.

Determining what data to collect


Data collection considerations

- How the data will be collected


Decide if you will collect the data using your own resources or receive (and possibly
purchase) it from another party. Data that you collect yourself is called first-party
data.

- Choose data sources: first-party, second-party, or third-party data


If you don’t collect the data using your own resources, you might get data from
second-party or third-party data providers. Second-party data is collected directly by
another group and then sold. Third-party data is sold by a provider that didn’t collect
the data themselves. Third-party data might come from a number of different
sources.
First-party data
Data collected by an individual or group using their own resources
Second-party data
Data collected by a group directly from its audience and then sold
Third-party data
Data collected from outside sources who did not collect it directly

- Decide what data to use


💡All the data we choose should apply to our needs, and it must be approved for use.

- How much data to collect


If you are collecting your own data, make reasonable decisions about sample size. A
random sample from existing data might be fine for some projects. Other projects
might need more strategic data collection to focus on certain criteria. Each project
has its own needs.

Population
All possible data values in a certain dataset

Sample
A part of a population that is representative of the population
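
A simple random sample from existing data can be sketched in Python as below. The population of customer IDs and the sample size of 50 are made-up values for illustration; each project would pick its own.

```python
import random

# Hypothetical population: 1,000 customer IDs (made up for illustration).
population = list(range(1, 1001))

# Fix the seed so the sketch is reproducible; drop this line for a fresh draw.
random.seed(7)

# Draw 50 distinct members, each with an equal chance of being selected.
sample = random.sample(population, k=50)

print(len(sample), "of", len(population), "records sampled")
```

random.sample draws without replacement, so no record appears twice in the sample.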

- Select the right data type


- Determine the time frame
If you are collecting your own data, decide how long you will need to collect it,
especially if you are tracking trends over a long period of time. If you need an
immediate answer, you might not have time to collect new data. In this case, you
would need to use historical data that already exists.
Differentiate Between Data Formats & Structures
Discover Data Formats

Quantitative and qualitative data


Qualitative data is usually listed as a name, category, or description.
Quantitative data can be measured or counted as a number. It can be broken down into
discrete and continuous data.

Discrete data
Data that is counted and has a limited number of values

💡Discrete data isn't limited to dollar amounts. Other examples of discrete data are stars
and points: when partial measurements (half-stars or quarter-points) aren't allowed and
only full stars or points are accepted, the data is discrete.

Continuous data
Data that is measured and can have almost any numeric value

Nominal Data
A type of qualitative data that is categorized without a set order.
Example: A yes/no question

Ordinal Data
A type of qualitative data with a set order or scale.
Example: Scale from 1 to 5
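
The practical difference between nominal and ordinal data is whether ranking makes sense. A minimal sketch, assuming invented survey answers and an assumed 4-point scale:

```python
# Hypothetical survey answers, invented for illustration.
nominal_answers = ["yes", "no", "yes", "yes"]            # categories with no order
ordinal_answers = ["good", "poor", "excellent", "fair"]  # categories with a set order

# Nominal data can be counted per category, but ranking it is meaningless.
yes_count = nominal_answers.count("yes")

# Ordinal data can be ranked once the scale is written down explicitly.
scale = ["poor", "fair", "good", "excellent"]  # assumed 4-point scale
ranked = sorted(ordinal_answers, key=scale.index)

print(yes_count)  # 3
print(ranked)     # ['poor', 'fair', 'good', 'excellent']
```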

Internal Data
Data that lives within a company’s own systems

External Data
Data that lives and is generated outside an organization.

Structured Data
Data that is organized in a certain format, such as rows and columns. Structured data that is
grouped together to form relations enables analysts to store, search, and analyze the data.
Example: Table, spreadsheet, and relational database

Unstructured Data
Data that is not organized in any easily identifiable manner
Example: Audio, video files, emails, photos, and social media

Understanding Structured Data


Structured data works nicely within a data model, which is a model that is used for
organizing data elements and how they relate to one another.

Data Model
A model that is used for organizing data elements and how they relate to each other.

Data Elements
Pieces of information, such as people’s names, account numbers, and addresses

Data modeling levels and techniques


What is data modeling?
Data modeling is the process of creating diagrams that visually represent how data is
organized and structured. These visual representations are called data models.

Levels of data modeling

1. Conceptual data modeling gives a high-level view of the data structure, such as
how data interacts across an organization. For example, a conceptual data model
may be used to define the business requirements for a new database. A conceptual
data model doesn't contain technical details.
2. Logical data modeling focuses on the technical details of a database such as
relationships, attributes, and entities. For example, a logical data model defines how
individual records are uniquely identified in a database. But it doesn't spell out the
actual names of database tables. That's the job of a physical data model.
3. Physical data modeling depicts how a database operates. A physical data model
defines all entities and attributes used; for example, it includes table names, column
names, and data types for the database.

Explore Data Types, Fields, and Values


Data Type
A specific kind of data attribute that tells what kind of value the data is

Data Types in Spreadsheets


- Number
- Text or string
A sequence of characters and punctuation that contains textual information
- Boolean
A data type that only has two possible values, like TRUE and FALSE
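
The same three kinds of values exist in most programming languages. As a rough illustration (Python rather than a spreadsheet, with an invented record), each field holds a value, and each value has a data type:

```python
# One hypothetical record (field -> value); the names are invented.
record = {"customer": "Ana Li", "purchases": 3, "total_spent": 59.97, "is_member": True}

# Look up the data type of the value stored in each field.
types = {field: type(value).__name__ for field, value in record.items()}

for field, type_name in types.items():
    print(f"{field}: {record[field]!r} is a {type_name}")
```

Here str plays the role of text, int and float are numbers, and bool is the Boolean type.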

Wide and Long Data

Wide Data
Data in which every data subject has a single row with multiple columns to hold the values of
various attributes of the subject

Long Data
Data in which each row is one time point per subject, so each subject will have data in
multiple rows
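
Converting wide data to long data can be sketched as follows. The countries, years, and values are invented for illustration.

```python
# Hypothetical wide data: one row per country, one column per year.
wide_rows = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]

# Long data: one row per (country, year) pair.
long_rows = [
    {"country": row["country"], "year": year, "value": row[year]}
    for row in wide_rows
    for year in ("2019", "2020")
]

for row in long_rows:
    print(row)
```

Two wide rows with two year columns each become four long rows.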
Data Transformation
Data transformation
The process of changing the data’s format, structure, or values.
Data transformation usually involves:
● Adding, copying, or replicating data
● Deleting fields or records
● Standardizing the names of variables
● Renaming, moving, or combining columns in a database
● Joining one set of data with another
● Saving a file in a different format. For example, saving a spreadsheet as a comma-
separated values (CSV) file.
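
A few of these transformations (standardizing names, joining two datasets, saving as CSV) can be sketched in Python. The datasets and column names below are assumptions made up for this example.

```python
import csv
import io

# Hypothetical datasets, invented for illustration.
sales = [
    {"Cust ID": 1, "Amount": 20.0},
    {"Cust ID": 2, "Amount": 35.5},
]
customers = {1: "Ana", 2: "Ben"}  # customer id -> name

transformed = []
for row in sales:
    # Standardize variable names (lowercase, underscores instead of spaces).
    clean = {key.lower().replace(" ", "_"): value for key, value in row.items()}
    # Join one set of data with another through the shared customer id.
    clean["name"] = customers[clean["cust_id"]]
    transformed.append(clean)

# Save in a different format: write the result as CSV text.
# (io.StringIO stands in for a real file so the sketch has no side effects.)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["cust_id", "amount", "name"])
writer.writeheader()
writer.writerows(transformed)
csv_text = buffer.getvalue()
print(csv_text)
```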

Why Transform Data?


Goals for data transformation might be:
● Data organization: better-organized data is easier to use
● Data compatibility: different applications or systems can then use the same data
● Data migration: data with matching formats can be moved from one system to
another
● Data merging: data with the same organization can be merged together
● Data enhancement: data can be displayed with more detailed fields
● Data comparison: apples-to-apples comparisons of the data can then be made
Course 3 Notes
Week 2 - Bias, Credibility, Privacy, Ethics and Access

Bias, Credibility, Privacy, Ethics, and Access


Unbiased and Objective Data
Ensuring Data Integrity
1. Analyze data for bias and credibility
2. Good data vs. bad data
3. Data ethics, privacy, and access

Bias: From questions to conclusions


We run into bias every day in our lives.

Bias
A preference in favor of or against a person, group of people, or thing

Data Bias
A type of error that systematically skews results in a certain direction

Biased and unbiased data


Sampling Bias
When a sample isn’t representative of the population as a whole. We can avoid this bias by
choosing a sample at random.

Unbiased sampling
When a sample is representative of the population being measured.
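
To see why random selection matters, the sketch below compares a sample taken from the top of a sorted list with a random sample. The population of ages is invented, and the fixed seed is only there to make the run repeatable.

```python
import random
from statistics import mean

# Hypothetical population: ages 18 through 77, one person per age, sorted by age.
population = list(range(18, 78))

# Biased sample: only the first 10 records, i.e. the youngest people.
biased_sample = population[:10]

# Random sample: every record has the same chance of selection.
random.seed(1)  # fixed seed for reproducibility only
random_sample = random.sample(population, k=10)

print("population mean:", mean(population))        # 47.5
print("biased sample mean:", mean(biased_sample))  # 22.5
print("random sample mean:", mean(random_sample))
```

The biased sample's mean is far below the population mean because it systematically leaves out older people. A random sample is not guaranteed to match the population, but it has no built-in skew in one direction.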

Understanding bias in data


More types of data bias:
● Observer bias (Experimenter bias/research bias)
The tendency for different people to observe things differently
Example: Two scientists looking through the same microscope might see different things.
● Interpretation bias
The tendency to always interpret ambiguous situations in a positive or negative way
Example: Interpretation bias can lead two people to see or hear the exact
same thing and interpret it in different ways because they have
different backgrounds and experiences.
● Confirmation bias
The tendency to search for or interpret information in a way that confirms preexisting
beliefs.
Example: We might get our news from a certain website because the writers share
our beliefs, or we socialize with people because we know that they hold similar
views.
Explore Data Credibility
How to Identify Good Data Sources:
- Reliable
Like a good friend, good data sources are reliable. With this data, you can trust that
you're getting accurate, complete, and unbiased information that's been vetted and
proven fit for use.
- Original
To make sure you're dealing with good data, be sure to validate it with the original
source.
- Comprehensive
The best data sources contain all critical information needed to answer the question
or find the solution
- Current
The usefulness of data decreases as time passes. If you wanted to invite all current
clients to a business event, you wouldn't use a 10-year-old client list. The same goes
for data. The best data sources are current and relevant to the task at hand.
- Cited
Citing makes the information you're providing more credible. When you're choosing a
data source, think about three things: Who created the data set? Is it part of a
credible organization? When was the data last refreshed?
Course 3 Notes
Week 3 - Databases: Where Data Lives

Databases: Where Data Lives


Working with Databases
All About Databases
Metadata
Data about data. Deep… Think of it as a reference guide: where the data comes from, when
and how it was created, and what it’s all about.

Database Features
Example of Simple Database Structure:

Example of Relational Databases

Relational Database
A database that contains a series of related tables that can be connected via their
relationships. For two tables to have a relationship, one or more of the same fields must
exist inside both tables.

Primary Key
An identifier that references a column in which each value is unique. No two rows can have
the same primary key value. Also, it cannot be null or blank.
- Used to ensure data in a specific column is unique
- Uniquely identifies a record in a relational database table
- Only one primary key is allowed in a table, and it cannot contain null or blank
values

Foreign Key
A field within a table that is a primary key in another table.
- A column or group of columns in a relational database table that provides a link
between the data in two tables.
- Refers to the field in a table that's the primary key of another table.
- More than one foreign key is allowed to exist in a table.
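
Primary and foreign keys can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the table and column names are invented, and SQLite is used only because it needs no setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: table and column names are invented for illustration.
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE orders (
           order_id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
           amount REAL
       )"""
)
conn.execute("INSERT INTO customers VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")

def try_insert(sql):
    """Return True if the insert succeeds, False if a key constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_key_ok = try_insert("INSERT INTO customers VALUES (1, 'Ben')")  # repeats primary key 1
orphan_order_ok = try_insert("INSERT INTO orders VALUES (101, 99, 5.0)")  # customer 99 does not exist

# The shared customer_id field connects the two tables.
joined = conn.execute(
    "SELECT o.order_id, c.name, o.amount FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id"
).fetchall()
print(duplicate_key_ok, orphan_order_ok, joined)
```

Both rejected inserts return False: the first repeats an existing primary key, and the second points to a customer that is not in the customers table.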

Databases in Data Analytics


Database Normalization
Normalization is a process of organizing data in a relational database. For example, creating
tables and establishing relationships between those tables. It is applied to eliminate data
redundancy, increase data integrity, and reduce complexity in a database.
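
The idea of removing redundancy can be sketched without a database. Below, an invented flat table that repeats each customer's city on every order is split into a customers table and an orders table linked by customer_id.

```python
# Hypothetical denormalized rows: the customer's city repeats on every order.
flat_orders = [
    {"order_id": 100, "customer": "Ana", "city": "Lima", "amount": 25.0},
    {"order_id": 101, "customer": "Ana", "city": "Lima", "amount": 40.0},
    {"order_id": 102, "customer": "Ben", "city": "Oslo", "amount": 15.0},
]

customers = {}  # customer name -> {"customer_id": ..., "city": ...}
orders = []     # one row per order, pointing at a customer by id

for row in flat_orders:
    if row["customer"] not in customers:
        # First time this customer appears: store their details once.
        customers[row["customer"]] = {"customer_id": len(customers) + 1, "city": row["city"]}
    orders.append({
        "order_id": row["order_id"],
        "customer_id": customers[row["customer"]]["customer_id"],
        "amount": row["amount"],
    })

print(customers)
print(orders)
```

After the split, each city is stored once, so a change of address only has to be made in one place.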

Exploring Metadata
Metadata
Data about data. It is used in database management to help data analysts interpret the
contents of the data within the database.

3 Common Types of Metadata


1. Descriptive Metadata
Metadata that describes a piece of data and can be used to identify it at a later point
in time.
2. Structural Metadata
Metadata that indicates how a piece of data is organized and whether it is part of
one, or more than one, data collection.
3. Administrative Metadata
Metadata that indicates the technical source of a digital asset.

Metadata is as important as the data itself


Elements of Metadata
1. Title and description
What is the name of the file or website you are examining? What type of content
does it contain?
2. Tags and categories
What is the general overview of the data that you have? Is the data indexed or
described in a specific way?
3. Who created it and when
Where did the data come from, and when was it created? Is it recent, or has it
existed for a long time?
4. Who last modified it and when
Were any changes made to the data? If yes, were the modifications recent?
5. Who can access or update it
Is this dataset public? Are special permissions needed to customize or modify the
dataset?
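
Some of these elements can be read straight from the file system. The sketch below creates a throwaway file (the file name and contents are invented) and reads its name, type, size, and last-modified time. Details such as who created the file or who may access it are not covered here and usually come from other systems.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Create a small throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "SalesReport20210722v02.csv"  # hypothetical file name
    path.write_bytes(b"region,sales\nnorth,120\n")

    info = path.stat()
    metadata = {
        "title": path.name,
        "type": path.suffix,
        "size_bytes": info.st_size,
        "last_modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }

for key, value in metadata.items():
    print(f"{key}: {value}")
```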

Using Metadata as a Data Analyst


Benefits of using metadata:
1. Metadata creates a single source of truth by keeping things consistent and uniform.
2. Metadata also makes data more reliable by making sure it's accurate, precise,
relevant, and timely.

Metadata Repository
A database specifically created to store metadata. It does this by:
- Describing the state and location of the metadata
- Describing the structure of the tables inside
- Describing how data flows through the repository
- Keeping track of who accesses the metadata and when

Metadata Management
Data Governance
A process to ensure the formal management of a company’s data assets.

Accessing Different Data Sources


Working with More Data Sources
CSV
Comma-separated values
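
A CSV file is plain text with one record per line and commas between the fields. A minimal sketch of reading one with Python's csv module, using invented content in place of a real file:

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file on disk.
csv_text = "name,city,purchases\nAna,Lima,3\nBen,Oslo,5\n"

# DictReader uses the first line as field names and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    # Every value arrives as text, so numbers must be converted explicitly.
    print(row["name"], row["city"], int(row["purchases"]))
```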

From external sources to a spreadsheet


Sorting and Filtering
Sorting and filtering the data in a spreadsheet help us customize the way data is presented.
They also organize data so analysts can zoom in on the pieces that matter.

Sorting
Involves arranging data into a meaningful order to make it easier to understand, analyze,
and visualize.

Filter
Shows only the data that meets specific criteria while hiding the rest.
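
The same two operations outside a spreadsheet, as a rough Python sketch with invented rows:

```python
# Hypothetical spreadsheet rows, invented for illustration.
rows = [
    {"product": "pen", "region": "north", "units": 40},
    {"product": "ink", "region": "south", "units": 15},
    {"product": "pad", "region": "north", "units": 25},
]

# Sorting: arrange every row by units sold, largest first.
sorted_rows = sorted(rows, key=lambda row: row["units"], reverse=True)

# Filtering: keep only the rows that meet a criterion; the rest are left out.
north_rows = [row for row in rows if row["region"] == "north"]

print([row["product"] for row in sorted_rows])  # ['pen', 'pad', 'ink']
print([row["product"] for row in north_rows])   # ['pen', 'pad']
```

Sorting keeps all rows and changes their order. Filtering keeps the order and changes which rows are shown.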

Working with Large Datasets in SQL


Setting up BigQuery
Just follow the instructions.
Course 3 Notes
Week 4 - Organizing and Protecting Your Data

4. Organizing and Protecting Your Data


4.1 Effectively Organize Data
4.1.1 Feel confident in your data
Benefits of organizing data:
- Makes it easier to find and use
- Helps you avoid making mistakes during your analysis
- Helps to protect your data

4.1.2 Let’s Get Organized


Best practices when organizing data
- Naming Conventions
Consistent guidelines that describe the content, date, or version of a file in its name.
- Foldering
Organize your files into folders
- Archiving Older Files
Move old projects to separate locations to create an archive and cut down on clutter.
- Align your naming and storage practices with your team
- Develop metadata practices
4.1.3 Organization guidelines
Best practices for file naming conventions
● Work out and agree on file naming conventions early on in a project to avoid
renaming files again and again.
● Align your file naming with your team's or company's existing file-naming
conventions.
● Ensure that your file names are meaningful; consider including information like
project name and anything else that will help you quickly identify (and use) the file for
the right purpose.
● Include the date and version number in file names; common formats are
YYYYMMDD for dates and v## for versions (or revisions).
● Create a text file as a sample file with content that describes (breaks down) the file
naming convention and a file name that applies it.
● Avoid spaces and special characters in file names. Instead, use dashes,
underscores, or capital letters. Spaces and special characters can cause errors in
some applications.

Best practices for keeping files organized

Remember these tips for staying organized as you work with files:

● Create folders and subfolders in a logical hierarchy so related files are stored
together.
● Separate ongoing from completed work so your current project files are easier to find.
Archive older files in a separate folder, or in an external storage location.
● If your files aren't automatically backed up, manually back them up often to avoid
losing important work.

4.1.4 All about File Naming

File Naming Do’s:


- Work out your conventions early
- Align file naming with your team
- Make sure file names are meaningful -> with reference to the project name, etc
- Keep file names short and sweet
- Format dates as yyyymmdd: SalesReport20210722
- Lead revision numbers with 0: SalesReport20210722v02
- Use hyphens, underscores, or capital letters instead of spaces
- Create a text file that lays out the naming convention
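
A naming convention is easier to follow when a small helper builds and checks the names. The sketch below assumes the pattern from the examples above (project name, yyyymmdd date, then v plus a two-digit revision). The function names and the regular expression are made up for illustration.

```python
import re
from datetime import date

def build_file_name(project: str, day: date, version: int) -> str:
    """Build a name like SalesReport20210722v02 (assumed convention)."""
    return f"{project}{day:%Y%m%d}v{version:02d}"

# Letters only, then an 8-digit date, then 'v' and a two-digit revision.
NAME_PATTERN = re.compile(r"[A-Za-z]+\d{8}v\d{2}")

def follows_convention(file_name: str) -> bool:
    """Check a name (without extension) against the assumed convention."""
    return NAME_PATTERN.fullmatch(file_name) is not None

name = build_file_name("SalesReport", date(2021, 7, 22), 2)
print(name)                                            # SalesReport20210722v02
print(follows_convention(name))                        # True
print(follows_convention("Sales Report 2021-07-22"))   # False
```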

4.2 Securing Data


4.2.1 Security Features in Spreadsheets
Data Security
Protecting data from unauthorized access or corruption by adopting safety measures.

Security features that Google Sheets and Excel share


- Protect spreadsheets from being edited
- Access control features like password protection and user permissions

4.2.2 Balancing security and analytics


Another security feature that can help companies:

Encryption
Uses a unique algorithm to alter data and make it unusable by users and applications that
don’t know the algorithm. This algorithm is saved as a “key” which can be used to reverse
the encryption; so if you have the key, you can still use the data in its original form.
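
The role of the key can be shown with a toy cipher. The sketch below XORs the data with a random key of the same length. It is only an illustration of "the key reverses the encryption" and not something to protect real data with; real systems should rely on vetted cryptography libraries. The message is an invented value.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte (the same call encrypts and decrypts)."""
    return bytes(d ^ k for d, k in zip(data, key))

message = b"account=12345"               # hypothetical sensitive value
key = secrets.token_bytes(len(message))  # random key, as long as the message

ciphertext = xor_bytes(message, key)     # unreadable without the key
recovered = xor_bytes(ciphertext, key)   # applying the key again restores the data

print(recovered == message)  # True
```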

Tokenization
Replaces the data elements you want to protect with randomly generated data referred to as
a “token.” The original data is stored in a separate location and mapped to the tokens. To
access the complete original data, the user or application needs to have permission to use
the tokenized data and the token mapping. This means that even if the tokenized data is
hacked, the original data is still safe and secure in a separate location.
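
Tokenization can be sketched with a dictionary standing in for the separate, secured store. The function names, the vault, and the card number are all invented for illustration.

```python
import secrets

token_vault = {}  # token -> original value; stands in for the separate, secured store

def tokenize(value: str) -> str:
    """Replace a value with a random token and record the mapping in the vault."""
    token = secrets.token_hex(16)  # 32 random hex characters
    token_vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Look up the original value; only callers with vault access can do this."""
    return token_vault[token]

card_number = "1234-5678-9012-3456"  # invented value, not real data
record = {"customer": "Ana", "card": tokenize(card_number)}

print(record["card"] != card_number)              # True
print(detokenize(record["card"]) == card_number)  # True
```

The record that analysts work with holds only the token. Without access to token_vault, the token reveals nothing about the original value.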
