
NAME: RODJEAN A. SIMBALLA
YEAR/SECTION: BSIT-3D

MODULE 1- INTRODUCTION TO DATA SCIENCE

Assessment 1. Introduction to Data Science/ Evolution of Data Science

1. Identify at least five skill areas of a data scientist.

 The following are the five skill areas of a data scientist:

a) Teamwork

b) Advanced statistics

c) High-level math

d) Social media mining

e) Natural language processing/machine learning

2. Identify the seven main categories of data.

 The following are the seven main categories of data:

1) Structured data

2) Unstructured data

3) Natural language

4) Machine generated data

5) Graph-based or network data

6) Audio, image and video

7) Streaming data

3. Identify the year when the significant events in the evolution of data science took place.

1) 1936 and 1950 – Papers by Alan Turing on the topics of computable numbers and artificial intelligence were published (two different years).

2) 1997 – C.F. Jeff Wu's public lecture "Statistics = Data Science".

3) 1990 – The term "data science" came to prominence in discussions of the need for statisticians to join with computer scientists to bring mathematical rigor to the computational analysis of large data sets.

4) 2001 – William S. Cleveland published an action plan for creating a university department.

5) 2001 – Leo Breiman published the paper "Statistical Modeling: The Two Cultures", drawing a distinction between a statistical focus on models that explain the data and an algorithmic focus on models that can actually predict; since then, the role of the data scientist has become very broad.

Assessment 2. Introduction to Data Science (2)

1. List down major differences between Supervised and Unsupervised Machine

Learning.

Supervised Machine Learning:
 The system creates decisions (outputs) based on input data.
 This form of learning can be seen in spam filters and automatic credit card approval systems.
 Linear discriminant analysis (LDA) works the same way: the system is given a historical data sample of known inputs and outputs, and it uses machine learning techniques to "learn" the link between the two.
 To determine which technique is most appropriate for the task at hand, judgment is required.

Unsupervised Machine Learning:
 Unsupervised learning is the process of restructuring and improving inputs in order to give structure to unlabeled data.
 Cluster analysis, for example, takes a set of entities, each with a set of attributes, and divides the entity space into sets or groups depending on how similar the attributes of all entities are.
 This not only reorganizes the data, but also enriches it by tagging it with additional labels (in this case, a cluster number/name).
 The origin of this nomenclature is unknown, but it likely derives from the fact that in unsupervised learning there is no obvious objective function that is maximized or minimized, and therefore no "supervision" toward an optimum is required.
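To make the contrast concrete, here is a minimal sketch of both approaches in Python using scikit-learn (an assumption; the module does not prescribe a toolkit). The iris dataset and the parameter choices are illustrative only.

# Supervised vs. unsupervised learning, side by side.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: LDA "learns" the link between known inputs (X) and
# known outputs (y), then predicts outputs for new inputs.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print("Supervised predictions:", lda.predict(X[:3]))

# Unsupervised: cluster analysis sees only the inputs and enriches
# them with an additional tag (a cluster number) based on similarity.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("Cluster tags:", labels[:10])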

2. What are the drawbacks of having too much information?

 Information overload can lead to many disadvantages: it can cause our brains to become less productive, tire easily, and get distracted. There are several things a student or a researcher can do to manage information and make better use of internet resources in order to avoid information overload.

Module Assessment:

1. Identify and discuss the facets of data in Data Science.

 Identifying the structure of data
 Cleaning, filtering, reorganizing, augmenting, and aggregating data
 Visualizing data
 Data analysis, statistics, and modeling
 Machine learning
 Assembling data processing pipelines to link these steps
 Leveraging high-end computational resources for large-scale problems
Often, different tools address different parts of this process.
Therefore, interoperability among tools, based on common data structures and
interfaces, is an important element in enabling the construction of complex, multifaceted
data analysis pipelines. It is in this sense that we can talk about an ecosystem for data
science. For any particular application, you might only be interested in a subset of these
operations.
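As a small illustration of the cleaning, filtering, and aggregating facets, the following Python sketch uses pandas (an assumption; any tabular toolkit would do). The toy sales table and its column names are invented for the example.

# Three facets of data work on a toy table: clean, filter, aggregate.
import pandas as pd

raw = pd.DataFrame({
    "region": ["N", "N", "S", "S", None],
    "sales":  [100, 250, None, 300, 120],
})

clean = raw.dropna()                            # cleaning: drop incomplete rows
big = clean[clean["sales"] > 150]               # filtering: keep large sales only
summary = big.groupby("region")["sales"].sum()  # aggregating: total per region
print(summary)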

2. Among the data scientists, who do you think has the greatest contribution in the
existence of data science? Support your answer with a brief explanation.

 Geoffrey Hinton, because he is called the Godfather of Deep Learning in the field of data science. Mr. Hinton is best known for his work on neural networks and artificial intelligence. A Ph.D. in artificial intelligence, he is credited for his exemplary work on neural nets.

3. What is data science, and what are the skills needed for you to be a data scientist?

 Data science is a multidisciplinary approach to extracting actionable insights from


the large and ever-increasing volumes of data collected and created by today’s
organizations. Data science encompasses preparing data for analysis and
processing, performing advanced data analysis, and presenting the results to
reveal patterns and enable stakeholders to draw informed conclusions.

The skills needed to be a data scientist are:

 Apply mathematics, statistics, and the scientific method
 Use a wide range of tools and techniques for evaluating and preparing data—everything from SQL to data mining to data integration methods
 Extract insights from data using predictive analytics and artificial intelligence
(AI), including machine learning and deep learning models
 Write applications that automate data processing and calculations
 Tell—and illustrate—stories that clearly convey the meaning of results to
decision-makers and stakeholders at every level of technical knowledge and
understanding

 Explain how these results can be used to solve business problems

4. Enumerate the 4 V’s in big data, and expound on why data science is essential.

 The reason we need data science is the ability to process and interpret data. This enables companies to make informed decisions around growth, optimization, and performance.
 For example, machine learning is now being used to make sense of every kind of data – big or small.

Volume

The first V of big data is all about the amount of data—the volume. Today, every single minute we create the same amount of data that was created from the beginning of time until the year 2000. We now use the terms terabytes and petabytes to discuss the size of data that needs to be processed. The quantity of data is certainly an important aspect of classifying it as big data. As a result of the amount of data we deal with daily, new technologies and strategies, such as multitiered storage media, have been developed to collect, analyze and store it securely and properly.
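As a quick back-of-the-envelope check of the units mentioned above, this Python snippet computes the byte counts (using binary prefixes; powers of 1000 are also in common use):

# How big are a terabyte and a petabyte, in bytes?
TB = 2 ** 40   # one terabyte (binary prefix)
PB = 2 ** 50   # one petabyte (binary prefix)
print(f"1 TB = {TB:,} bytes")
print(f"1 PB = {PB:,} bytes = {PB // TB} TB")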

Velocity

Velocity, the second V of big data, is all about the speed at which new data is generated and moves around. When you send a text, check your social media feed and react to posts on Facebook, Instagram or Twitter, or make a credit card purchase, these acts create data that needs to be processed instantaneously. Compound these activities by all the people in the world doing the same and more, and you can start to see how velocity is a key attribute of big data.
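The following toy Python sketch illustrates the idea of velocity: events are handled one at a time as they arrive, rather than in batches. The event source is simulated; a real system would read from a message queue or an API.

# Process a stream of incoming events as they arrive.
import time

def event_stream(n):
    """Simulate n incoming events (e.g., posts or card purchases)."""
    for i in range(n):
        yield {"id": i, "ts": time.time()}

for event in event_stream(5):
    # Handle each event immediately, while the next is still arriving.
    print(f"processed event {event['id']} at {event['ts']:.2f}")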

Variety

Today, data is generally one of three types: unstructured, semi-structured and structured. The algorithms required to process the variety of data generated vary based on the type of data to be processed. In the past, data was nicely structured—think Excel spreadsheets or other relational databases. A key characteristic of big data is that it includes not only structured data but also text, images, videos, voice files and other unstructured data that doesn’t fit easily into the framework of a spreadsheet. Unstructured data isn’t bound by rules the way structured data is. Again, this variety has helped put the “big” in big data. We are able to use technology to make sense of unstructured data today in a way that wasn’t possible in the past. This ability has opened up a tremendous amount of data that was previously not accessible or useful.
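Here is a short Python sketch of the three types named above (the sample records are invented for illustration): structured data fits rows and columns, semi-structured data carries flexible nested fields, and unstructured text needs further processing.

# Structured, semi-structured, and unstructured data in miniature.
import csv, io, json

structured = "name,age\nAna,21\nBen,34\n"        # rows and columns, like a spreadsheet
rows = list(csv.DictReader(io.StringIO(structured)))

semi_structured = '{"name": "Ana", "tags": ["vip", "new"]}'  # nested, flexible fields
record = json.loads(semi_structured)

unstructured = "Ana wrote: loving the new phone, battery could be better."
word_count = len(unstructured.split())           # free text: no fixed schema at all

print(rows[0]["name"], record["tags"], word_count)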

Veracity

The veracity of big data denotes the trustworthiness of the data. Is the data accurate and high-quality? When talking about big data that comes from a variety of sources, it’s important to understand the chain of custody, the metadata and the context in which the data was collected in order to glean accurate insights. The higher the veracity of the data, the more valuable it is to analyze and the more it contributes to meaningful results for an organization.
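One way to probe veracity in practice is a simple data-quality check before analysis. This Python sketch uses pandas (an assumption), with invented column names and plausibility bounds:

# Basic veracity checks: completeness and plausibility of values.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, -3, 40, None],
    "source": ["crm", "web", "crm", "web"],
})

missing = df["age"].isna().sum()                            # completeness
out_of_range = ((df["age"] < 0) | (df["age"] > 120)).sum()  # plausibility

print(f"missing: {missing}, out of range: {out_of_range}")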

