
Module 1.

Introduction to Data Science and Analytics


Data vs. Information
Data
- A “given”, or fact; a number, a statement or an image
- Represents something (quantities, actions, and objects) in the real world
- The raw materials in the production of information
- Data is unorganized
- Data is not typically useful on its own
Information
- Is data that has meaning within a context
- Is data that has been processed into a form that is meaningful to the recipient
and is of real or perceived value in the recipient's current or prospective
actions or decisions.
- Information is structured or organized
- Information is useful on its own
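As a small illustration of the distinction above, the sketch below turns a list of raw readings (data) into a summary that has meaning in a context (information). The temperature values and the function name `summarize` are made up for this example.

```python
from statistics import mean

# Raw data: unorganized numbers with no context (hypothetical values).
readings = [31.5, 29.0, 33.5, 30.0]

def summarize(values, label, unit):
    """Process raw values into a statement that is meaningful to a reader."""
    return (f"{label}: average {mean(values):.1f} {unit}, "
            f"lowest {min(values):.1f} {unit}, highest {max(values):.1f} {unit}")

# Information: the same numbers, processed and placed in context.
info = summarize(readings, "Daily high temperature", "deg C")
print(info)
```

The list on its own says little; the labeled summary can support a decision, which is the sense in which information is useful on its own.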

What is data?
The Latin word data is the plural of datum, "(thing) given", and neuter past
participle of dare, "to give". The first English use of the word "data" is from the
1640s. The word "data" was first used to mean "transmissible and storable
computer information" in 1946. The expression "data processing" was first used in
1954.
Data is a collection of discrete or continuous values that convey information,
describing the quantity, quality, fact, statistics, other basic units of meaning, or
simply sequences of symbols that may be further interpreted formally. A datum is
an individual value in a collection of data. Data is usually organized into structures
such as tables that provide additional context and meaning, and which may
themselves be used as data in larger structures. Data may be used as variables in a
computational process. Data may represent abstract ideas or concrete
measurements. Data is commonly used in scientific research, economics, and in
virtually every other form of human organizational activity. Examples of data sets
include price indices (such as consumer price index), unemployment rates, literacy
rates, and census data. In this context, data represents the raw facts and figures
from which useful information can be extracted.
Data is collected using techniques such as measurement, observation, query,
or analysis, and is typically represented as numbers or characters which may be
further processed. Field data is data that is collected in an uncontrolled in-situ
environment. Experimental data is data that is generated in the course of a
controlled scientific experiment. Data is analyzed using techniques such as
calculation, reasoning, discussion, presentation, visualization, or other forms of
post-analysis. Prior to analysis, raw data (or unprocessed data) is typically cleaned:
Outliers are removed and obvious instrument or data entry errors are corrected.
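The cleaning step described above can be sketched as follows. This is only one possible approach: the sample values are hypothetical, and the 1.5 x interquartile range rule used to flag outliers is an assumed, commonly used convention rather than something this module prescribes.

```python
from statistics import quantiles

# Hypothetical raw measurements: "n/a" is an entry error, 250.0 a likely outlier.
raw = ["12.1", "11.8", "n/a", "12.4", "250.0", "12.0", "11.9", "12.3"]

def to_numbers(items):
    """Keep only entries that parse as numbers; drop obvious entry errors."""
    numbers = []
    for item in items:
        try:
            numbers.append(float(item))
        except ValueError:
            continue  # not a number, so treat it as a data entry error
    return numbers

def remove_outliers(values):
    """Drop values more than 1.5 * IQR beyond the quartiles (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if low <= v <= high]

cleaned = remove_outliers(to_numbers(raw))
print(cleaned)
```

In practice, whether a value is a true outlier or a valid extreme reading depends on the subject matter, so a rule like this is a starting point for review and not a final judgment.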
Computer data is information that is stored and processed digitally on a
computer. Data on a computer can take many forms, including text, images, audio,
or video. It may be loaded into memory and processed by the computer's CPU, then
stored as files in folders on a hard drive or solid-state drive.

Data on a computer is stored as binary data, where every file consists of a
series of 1s and 0s called bits. Eight bits make up one byte, which is the basic unit of
data storage. Data is encoded using different methods for different data types. For
example, text data is stored using a character encoding method
like ASCII or Unicode where each letter is represented by a single byte or series of
bytes. Image data, meanwhile, is stored by assigning each pixel in a grid a color
value using as few as 8 or as many as 32 bits per pixel.
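To make the encoding idea concrete, here is a minimal Python sketch. It shows the bits behind one ASCII letter, a character that needs more than one byte in UTF-8 (a Unicode encoding), and the uncompressed size of a pixel grid. The function name `image_size_bytes` and the 100 x 100 image are assumptions for illustration, and real image files also carry headers and are often compressed.

```python
# Text: each character is encoded as one or more bytes.
ascii_bytes = "A".encode("ascii")        # one byte for a basic Latin letter
utf8_bytes = "\u00e9".encode("utf-8")    # "e" with an acute accent: two bytes

# The eight bits that make up the single byte for "A" (code 65).
bits_for_a = format(ascii_bytes[0], "08b")

def image_size_bytes(width, height, bits_per_pixel):
    """Uncompressed size of a pixel grid, ignoring any file header."""
    return width * height * bits_per_pixel // 8

print(bits_for_a)                         # 01000001
print(len(ascii_bytes), len(utf8_bytes))  # 1 2
print(image_size_bytes(100, 100, 8), image_size_bytes(100, 100, 32))
```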
Computers can store data on many different kinds of storage devices. Hard
drives and solid-state drives are the most common ways to store data on a
computer. Flash drives can easily transfer data from one computer to another.
Computers with optical drives can burn data to (re)writable CDs and DVDs.
Network-connected computers can transfer data from one computer to another
over a local network or the Internet. Reading or transferring digital data does not
cause any deterioration or quality loss over time.
What is Science?
- Is the systematic study of the structure and behavior of the physical and
natural world through observation, experimentation, and the testing of
theories against the evidence obtained.
- Science is defined as the observation, identification, description,
experimental investigation, and theoretical explanation of natural
phenomena.
- A knowledge or a system of knowledge covering general truths or the
operation of general laws especially as obtained and tested
through scientific method.
Data Science
- Data science is the study of data to extract meaningful insights for business.
It is a multidisciplinary approach that combines principles and practices from
the fields of mathematics, statistics, artificial intelligence, and computer
engineering to analyze large amounts of data.
- Data science combines math and statistics, specialized programming,
advanced analytics, artificial intelligence (AI), and machine learning with
specific subject matter expertise to uncover actionable insights hidden in an
organization’s data. These insights can be used to guide decision making and
strategic planning.

The accelerating volume of data sources, and subsequently data, has made data
science one of the fastest-growing fields across every industry. As a result, it is no
surprise that the role of the data scientist was dubbed the “sexiest job of the 21st
century” by Harvard Business Review. Organizations are increasingly reliant on
data scientists to interpret data and provide actionable recommendations to
improve business outcomes.

The data science lifecycle involves various roles, tools, and processes, which
enable analysts to glean actionable insights. Typically, a data science project
undergoes the following stages:
• Data ingestion: The lifecycle begins with data collection, gathering both raw
structured and unstructured data from all relevant sources using a variety of
methods. These methods can include manual entry, web scraping, and real-
time streaming data from systems and devices. Data sources can include
structured data, such as customer data, along with unstructured data like log
files, video, audio, pictures, the Internet of Things (IoT), social media, and
more.
• Data storage and data processing: Since data can have different formats
and structures, companies need to consider different storage systems based
on the type of data that needs to be captured. Data management teams help
to set standards around data storage and structure, which facilitate
workflows around analytics, machine learning and deep learning models.
This stage includes cleaning data, deduplicating, transforming and combining
the data using ETL (extract, transform, load) jobs or other data integration
technologies. This data preparation is essential for promoting data quality
before loading into a data warehouse, data lake, or other repository.
• Data analysis: Here, data scientists conduct an exploratory data analysis to
examine biases, patterns, ranges, and distributions of values within the data.
This data analytics exploration drives hypothesis generation for A/B testing.
It also allows analysts to determine the data’s relevance for use within
modeling efforts for predictive analytics, machine learning, and/or deep
learning. Depending on a model’s accuracy, organizations can become reliant
on these insights for business decision making, allowing them to drive more
scalability.
• Communicate: Finally, insights are presented as reports and other data
visualizations that make the insights—and their impact on business—easier
for business analysts and other decision-makers to understand. A data
science programming language such as R or Python includes components for
generating visualizations; alternatively, data scientists can use dedicated
visualization tools.
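The four stages above can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions, not a production pipeline: the CSV text, the column names, and the function names `ingest`, `prepare`, `analyze`, and `communicate` are all hypothetical, and a real project would read from actual sources and load results into a repository.

```python
import csv
import io
from statistics import mean

# Stage 1, ingestion: hypothetical CSV text standing in for a real source.
RAW_CSV = """customer,region,amount
Ana,North,120
Ben,South,80
Ana,North,120
Cara,North,200
"""

def ingest(text):
    """Read raw rows from CSV text."""
    return list(csv.DictReader(io.StringIO(text)))

def prepare(rows):
    """Stage 2, processing: deduplicate rows and convert amounts to numbers."""
    seen, prepared = set(), []
    for row in rows:
        key = (row["customer"], row["region"], row["amount"])
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        prepared.append({**row, "amount": float(row["amount"])})
    return prepared

def analyze(rows):
    """Stage 3, analysis: average amount per region."""
    by_region = {}
    for row in rows:
        by_region.setdefault(row["region"], []).append(row["amount"])
    return {region: mean(vals) for region, vals in by_region.items()}

def communicate(summary):
    """Stage 4, communication: a plain-text report for decision-makers."""
    return "\n".join(f"{region}: average sale {avg:.2f}"
                     for region, avg in sorted(summary.items()))

report = communicate(analyze(prepare(ingest(RAW_CSV))))
print(report)
```

Keeping each stage in its own function mirrors how the work is often divided among roles, with data engineers owning ingestion and preparation and analysts owning the later stages.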

Data science vs. data scientist

Data science is considered a discipline, while data scientists are the
practitioners within that field. Data scientists are not necessarily directly
responsible for all the processes involved in the data science lifecycle. For example,
data pipelines are typically handled by data engineers—but the data scientist may
make recommendations about what sort of data is useful or required. While data
scientists can build machine learning models, scaling these efforts at a larger level
requires more software engineering skills to optimize a program to run more
quickly. As a result, it’s common for a data scientist to partner with machine
learning engineers to scale machine learning models.

Data scientist responsibilities commonly overlap with those of a data analyst,
particularly in exploratory data analysis and data visualization. However, a data
scientist’s skillset is typically broader than that of the average data analyst.
Comparatively speaking, data scientists leverage common programming languages,
such as R and Python, to conduct more statistical inference and data visualization.
A data analyst
makes sense out of existing data through routine analysis and writing reports. A
data scientist works on new ways to capture, store, manipulate and analyze that
data.

To perform these tasks, data scientists require computer science and pure
science skills beyond those of a typical business analyst or data analyst. The data
scientist must also understand the specifics of the business, such as automobile
manufacturing, eCommerce, or healthcare.

In short, a data scientist must be able to:

• Know enough about the business to ask pertinent questions and identify
business pain points.
• Apply statistics and computer science, along with business acumen, to data
analysis.
• Use a wide range of tools and techniques for preparing and extracting data—
everything from databases and SQL to data mining to data integration
methods.
• Extract insights from big data using predictive analytics and artificial
intelligence (AI), including machine learning models, natural language
processing, and deep learning.
• Write programs that automate data processing and calculations.
• Tell—and illustrate—stories that clearly convey the meaning of results to
decision-makers and stakeholders at every level of technical understanding.
• Explain how the results can be used to solve business problems.
• Collaborate with other data science team members, such as data and
business analysts, IT architects, data engineers, and application developers.

These skills are in high demand, and as a result, many individuals who are
breaking into a data science career explore a variety of data science programs,
such as certification programs, data science courses, and degree programs offered
by educational institutions.

Data science versus business intelligence


It may be easy to confuse the terms “data science” and “business
intelligence” (BI) because they both relate to an organization’s data and analysis of
that data, but they do differ in focus.
Business intelligence (BI) is typically an umbrella term for the technology that
enables data preparation, data mining, data management, and data visualization.
Business intelligence tools and processes allow end users to identify actionable
information from raw data, facilitating data-driven decision-making within
organizations across various industries. While data science tools overlap in much
of this regard, business intelligence focuses more on data from the past, and the
insights from BI tools are more descriptive in nature. It uses data to understand
what happened before to inform a course of action. BI is geared toward static
(unchanging) data that is usually structured. While data science uses descriptive
data, it typically utilizes it to determine predictive variables, which are then used
to categorize data or to make forecasts.
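The difference in focus can be illustrated with a short sketch. The descriptive part summarizes what already happened, as a BI report might; the predictive part fits a straight line to the same figures and extrapolates one step ahead. The monthly sales values are hypothetical, and a simple least-squares line is only an assumed stand-in for the richer models a data scientist would normally consider.

```python
from statistics import mean

# Hypothetical monthly sales for months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]

# Descriptive (BI-style): what happened?
average_sales = mean(sales)

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

# Predictive (data-science-style): what is likely to happen next?
slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept

print(f"Average so far: {average_sales:.1f}")
print(f"Forecast for month 7: {forecast_month_7:.1f}")
```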
Data science tools
Data scientists rely on popular programming languages to conduct exploratory data
analysis and statistical regression. These open source tools support pre-built
statistical modeling, machine learning, and graphics capabilities. These languages
include the following:
R: An open source programming language and environment for statistical
computing and graphics, often used through the RStudio development environment.
Python: A dynamic and flexible programming language. Python has numerous
libraries, such as NumPy, Pandas, and Matplotlib, for analyzing data quickly.
Some data scientists may prefer a user interface, and common tools for
statistical analysis include:
SAS: A comprehensive tool suite, including visualizations and interactive
dashboards, for analyzing, reporting, data mining, and predictive modeling.
IBM SPSS: Offers advanced statistical analysis, a large library of machine
learning algorithms, text analysis, open source extensibility, integration with big
data, and seamless deployment into applications.
Excel: A software program created by Microsoft that uses spreadsheets to
organize numbers and data with formulas and functions. Excel analysis is
ubiquitous around the world and used by businesses of all sizes to perform financial
analysis.

Data science and cloud computing

Cloud computing scales data science by providing access to additional processing
power, storage, and other tools required for data science projects.

Since data science frequently leverages large data sets, tools that can scale with
the size of the data are incredibly important, particularly for time-sensitive projects.
Cloud storage solutions, such as data lakes, provide access to storage
infrastructure that is capable of ingesting and processing large volumes of data
with ease. These storage systems provide flexibility to end users, allowing them to
spin up large clusters as needed. They can also add incremental compute nodes to
expedite data processing jobs, allowing the business to make short-term tradeoffs
for a larger long-term outcome. Cloud platforms typically have different pricing
models, such as per-use or subscriptions, to meet the needs of their end users,
whether they are a large enterprise or a small startup.

Open source technologies are widely used in data science tool sets. When they’re
hosted in the cloud, teams don’t need to install, configure, maintain, or update
them locally. Several cloud providers, including IBM Cloud®, also offer prepackaged
tool kits that enable data scientists to build models without coding, further
democratizing access to technology innovations and data insights.

Data Analytics
Data analytics is the science of analyzing raw data to draw conclusions
about that information. Data analytics helps a business optimize its
performance, perform more efficiently, maximize profit, or make more
strategically guided decisions.

Data analysis process


As the data available to companies continues to grow both in amount and
complexity, so too does the need for an effective and efficient process by which to
harness the value of that data. The data analysis process typically moves through
several iterative phases. Let’s take a closer look at each.
- Identify the business question you’d like to answer. What problem is the
company trying to solve? What do you need to measure, and how will you
measure it?
- Collect the raw data sets you’ll need to help you answer the identified
question. Data collection might come from internal sources, like a company’s
customer relationship management (CRM) software, or from secondary sources,
like government records or social media application programming interfaces
(APIs).
- Clean the data to prepare it for analysis. This often involves purging
duplicate and anomalous data, reconciling inconsistencies, standardizing
data structure and format, and dealing with white spaces and other syntax
errors.
- Analyze the data. By manipulating the data using various data analysis
techniques and tools, you can find trends, correlations, outliers, and
variations that tell a story. During this stage, you might use data mining to
discover patterns within databases or data visualization software to help
transform data into an easy-to-understand graphical format.
- Interpret the results of your analysis to see how well the data answered
your original question. What recommendations can you make based on the
data? What are the limitations of your conclusions?
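The clean and analyze phases above might look like the following in Python. The records, the region and spend columns, and the helper names `clean` and `pearson` are hypothetical. The Pearson correlation coefficient is used here as one assumed example of finding a correlation; other techniques would suit other questions.

```python
from math import sqrt
from statistics import mean

# Hypothetical raw records: region, ad spend, sales, with messy spacing and case.
raw_records = [
    (" north ", "10", "105"),
    ("South", " 20", "198 "),
    ("NORTH", "30", "310"),
    ("south ", "40", "395"),
]

def clean(record):
    """Trim white space, standardize region names, and parse the numbers."""
    region, spend, sales = (field.strip() for field in record)
    return region.title(), float(spend), float(sales)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - x_bar) ** 2 for x in xs)
                      * sum((y - y_bar) ** 2 for y in ys))

records = [clean(r) for r in raw_records]
spend = [r[1] for r in records]
sales = [r[2] for r in records]
r_value = pearson(spend, sales)
print(sorted({r[0] for r in records}), round(r_value, 3))
```

A coefficient close to 1 suggests that spend and sales rise together in this sample, but the interpretation phase still applies: four records are too few to generalize from, and correlation alone does not show that one variable causes the other.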


Prepared by:

OLIVER M. DIMALANTA, LPT


Instructor
