
1

KDnuggets is an organization concerned with business analytics, data mining and
data science. It was founded by Gregory I. Piatetsky-Shapiro.
KDnuggets ‘Origins of Big Data’ forum post:
https://www.kdnuggets.com/2017/02/origins-big-data.html

USENIX is an organization that focuses on advanced computer systems. It evolved
from the earlier ‘Unix Users Group’.
Slides from John Mashey’s USENIX talk ‘Big Data And The Next Wave of
InfraStress’:
https://static.usenix.org/event/usenix99/invited_talks/mashey.pdf

2
Gartner is a research and advisory company that provides information, advice and tools
to those in a range of industries such as IT, finance, customer service etc.
It is a well-respected and admired company.

The Gartner Hype cycle is a graphical representation of the life cycle stages
technologies pass through from their first appearance to their widespread use.
Gartner Hype Cycle:
https://www.gartner.com/en/research/methodologies/gartner-hype-cycle

3
In this Gartner hype cycle diagram from 2012, ‘Big Data’ is shown climbing the ‘Peak of
Inflated Expectations’.

It’s worth noting that it is accompanied by technologies such as HTML5,
Crowdsourcing, Speech-to-Speech translation, Internet of Things etc.

Note that the key indicates that the graph also gives an estimate of when the
technology will reach the ‘Plateau of Productivity’.
In the case of Big Data the estimate is 2 to 5 years.

4
Moving on two years, we have the Gartner hype cycle diagram for 2014.

Here Big Data is moving down into the ‘Trough of Disillusionment’ and now has an
estimated figure of 5 to 10 years to reach the plateau.

Note that ‘Data Science’ is moving up the slope to the ‘Peak of Inflated Expectations’
with a 2 to 5 years estimate of reaching the plateau.

5
Moving forward again to 2015, we see that Big Data is missing from the diagram.

A blog post by Andrew White of Gartner gives an explanation for this.
The URL for the blog post is:
https://blogs.gartner.com/andrew_white/2015/08/20/the-end-of-big-data-its-all-over-now/

The main reason given is that Big Data is still present, but spread across a collection
of more context-specific technologies.
Technologies such as Advanced Analytics, Machine Learning, Internet of
Things, Augmented Reality etc. collectively account for what is happening in Big Data
at that point in time.

6
If we jump forward to 2020 we can identify many technologies considered to be
involved with Big Data that are now of importance.

Technologies present on this diagram are certainly some we should be considering as
an important part of the Big Data landscape.
A few of relevance to this module are Data Fabric, Explainable AI, Responsible AI,

7
The DIKW pyramid features in publications authored by Martin H Fricke and in the publications of ot

You can access a copy of the Data-Information-Knowledge-Wisdom (DIKW) Pyramid, Framework, Continuum article at the Spring
https://link.springer.com/referenceworkentry/10.1007/978-3-319-32001-4_331-1
(NB You will have to log in using your University credentials to access it.)
You should consider this paper as directed reading for the BDL module.

To summarise:
The diagram presents Wisdom, Knowledge, Information and Data as a pyramid.
Although this model is simplistic, many consider this pyramid to represent the real-world challenges e
It is useful in considering how value and meaning can be extracted from data and information.

The foundation of the pyramid is Data, on which the other layers are built.
Data can be considered as the symbolic representation of observable properties.

Data can be analyzed, structured and considered in a specific context to gain information which is the
We have created something from the data that informs us.
Information is relevant data or usable data or processed data.
A question is often asked of the data and information is the result.

8
Information is value obtained from the data.

The next layer in DIKW is Knowledge, which is often considered as know-how or skill.
In this case information is promoted to a controlling role through transformation into
instructions: instructions that can be followed when dealing with the specific system
and the value obtained from it as information.
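
The first three DIKW layers described so far can be illustrated with a small sketch. Everything in it is hypothetical and chosen only for illustration: the sensor readings, the question being asked, the 26.0 degree limit and the function names. It is one possible reading of the model, not a definition of it.

```python
from statistics import mean

# Data: raw symbolic records of something observed.
# These sensor readings are made up for illustration.
readings = [("room_a", 21.5), ("room_a", 23.0), ("room_b", 27.5), ("room_b", 28.5)]


def to_information(data):
    """Information: data processed to answer a question.

    The question here is 'what is the mean temperature in each room?'.
    """
    by_room = {}
    for room, temperature in data:
        by_room.setdefault(room, []).append(temperature)
    return {room: mean(values) for room, values in by_room.items()}


def to_instructions(information, limit=26.0):
    """Knowledge: information turned into instructions that can be followed.

    The limit of 26.0 is an assumed threshold, not a recommended value.
    """
    return {
        room: "turn cooling on" if average > limit else "no action"
        for room, average in information.items()
    }


if __name__ == "__main__":
    information = to_information(readings)
    print(information)
    print(to_instructions(information))
```

Wisdom is deliberately left out of the sketch: deciding whether and how the instructions should be used involves judgement that the code does not capture.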

The final DIKW layer is Wisdom.
Wisdom cannot be considered without also considering Understanding.
Wisdom allows people to consider ethics and morality in the use of the value they
have obtained.
This is something we will discuss later in this module.

There are, however, many criticisms of DIKW, some of which are covered in the Fricke
article.

9
It’s important to understand the prefixes used when measuring amounts of data.
The slide shows the SI prefixes, which use base 10.
In the world of Big Data, base 2 is also used and the prefixes in this case are different
(for example kibi, mebi and gibi).

The US Department of Commerce National Institute of Standards and Technology
(NIST) provides information on the base 10 and base 2 prefixes at the following URLs:
https://physics.nist.gov/cuu/Units/prefixes.html
https://physics.nist.gov/cuu/Units/binary.html
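
The difference between the base 10 and base 2 prefixes can be checked with a short calculation. The sketch below is only an illustration: the function names are made up for this example, and the lists pair each SI prefix with the binary prefix at the same position (kilo with kibi, mega with mebi, and so on).

```python
# Decimal (SI) prefixes step in powers of 1000; binary prefixes step in powers of 1024.
SI_PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
BINARY_PREFIXES = ["kibi", "mebi", "gibi", "tebi", "pebi", "exbi", "zebi", "yobi"]


def si_bytes(prefix: str) -> int:
    """Number of bytes in one <prefix>byte using base 10."""
    return 1000 ** (SI_PREFIXES.index(prefix) + 1)


def binary_bytes(prefix: str) -> int:
    """Number of bytes in one <prefix>byte using base 2."""
    return 1024 ** (BINARY_PREFIXES.index(prefix) + 1)


def gap_percent(position: int) -> float:
    """How much larger the binary unit is than the SI unit at the same
    position in the lists above (1 = kilo/kibi, 5 = peta/pebi)."""
    return (1024 ** position / 1000 ** position - 1) * 100


if __name__ == "__main__":
    pairs = zip(SI_PREFIXES, BINARY_PREFIXES)
    for position, (si, binary) in enumerate(pairs, start=1):
        print(
            f"{si}byte = 10^{3 * position} bytes, "
            f"{binary}byte = 2^{10 * position} bytes, "
            f"gap = {gap_percent(position):.1f}%"
        )
```

The gap grows with each step: a kibibyte is about 2.4% larger than a kilobyte, while a pebibyte is about 12.6% larger than a petabyte, so it matters which convention a storage figure uses.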

But what lies beyond yottabyte?
There is mention of a possible new prefix:
brontobyte (10^27 bytes)

Here is the URL for an article on this that is worth reading:
https://www.weforum.org/agenda/2015/02/big-data-what-is-a-brontobyte/

10
What can be stored in a petabyte when we consider typical storage requirements?
The examples given show that in some cases, e.g. tweets, a petabyte is a large amount
of storage and more than you might need.

However, some of today’s big data applications would consider a petabyte to be small.
If you want to store data representing DNA or videos uploaded to YouTube, a petabyte
may not be considered that much.
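
A rough way to get a feel for this is to divide a petabyte by the size of one item. The sketch below does that under loudly stated assumptions: the item sizes are illustrative placeholders, not measurements, and the names `ASSUMED_ITEM_SIZES` and `items_per_petabyte` are invented for this example.

```python
PETABYTE = 10 ** 15  # bytes, using the SI (base 10) definition

# Illustrative placeholder sizes only. Real sizes vary widely with
# encoding, metadata, resolution and compression.
ASSUMED_ITEM_SIZES = {
    "tweet text (280 one-byte characters, no metadata)": 280,
    "one hour of video (assumed to take 1 GB)": 10 ** 9,
}


def items_per_petabyte(item_size_bytes: int) -> int:
    """Whole number of items of the given size that fit in one petabyte."""
    if item_size_bytes <= 0:
        raise ValueError("item size must be a positive number of bytes")
    return PETABYTE // item_size_bytes


if __name__ == "__main__":
    for description, size in ASSUMED_ITEM_SIZES.items():
        print(f"{description}: {items_per_petabyte(size):,} per petabyte")
```

Under these assumptions a petabyte holds trillions of short text items but only about a million hours of video, which is why the same amount of storage can look enormous for one application and small for another.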
1212
13
14
Google Definition of Big Data URL:
https://cloud.google.com/learn/what-is-big-data

15
As a starting point, let’s look at the concept of the 3 V’s.

16
17
18
19
20
21
The business value of data is an important topic we will cover later.

Other V’s are also considered, such as variability.
Here the meaning of the data is constantly changing, and it must be considered in
terms of its context and current meaning.

22
A wide range of platforms, technologies and systems now generate big data.
It cannot be overstated that a detailed understanding of the specific data and its
domain of use is essential to gain knowledge and value.

23
Applying Big Data technology to the datasets provided by each of the examples
shown in the slide is not straightforward.
The different areas are often called Domains.
Mastery of each domain, including its specific terminology, the kind of data
that is generated and the specific datasets, is essential to get value from that data.

24
In the BDL module we will cover material that all of these roles are concerned with.

25
26
Basically Available: Reading and writing of data is possible, but consistency is not
guaranteed.
Soft state: Any state inferred from the data values is soft until it has converged to its
final value.
Eventual consistency: If we wait long enough, and no new updates arrive, the copies of
the data will converge and we will know for certain the state of the data held.
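
The three BASE properties can be seen in a toy simulation of two replicas. This is a minimal sketch under assumptions, not how any particular database works: the `Replica` class, the `sync` function and the use of a simple version number with "highest version wins" are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the data. Each key maps to a (version, value) pair."""
    name: str
    store: dict = field(default_factory=dict)

    def write(self, key, value, version):
        # Basically available: the write is accepted locally, without
        # waiting for other replicas. The highest version wins.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        # Soft state: the value returned may be stale if a newer write
        # has not reached this replica yet.
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def sync(source: Replica, target: Replica) -> None:
    """Copy any newer entries from source to target (one-way pass)."""
    for key, (version, value) in source.store.items():
        target.write(key, value, version)


def converged(replicas, key) -> bool:
    """True when every replica holds the same version and value for key."""
    return len({replica.store.get(key) for replica in replicas}) == 1


if __name__ == "__main__":
    a, b = Replica("a"), Replica("b")
    a.write("stock", 10, version=1)
    sync(a, b)
    a.write("stock", 7, version=2)      # accepted by replica a straight away
    print(b.read("stock"), converged([a, b], "stock"))  # stale read from b
    sync(a, b)                          # updates propagate
    print(b.read("stock"), converged([a, b], "stock"))  # replicas now agree
```

Between the second write and the second `sync`, a reader of replica b sees the old value. Only once updates stop and the replicas have exchanged their state do they agree, which is the sense in which consistency is eventual.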

27
The Amazon Web Services web site has a useful definition of Data Warehouse.
https://aws.amazon.com/data-warehouse/

The Google Cloud web site also has a useful definition.
https://cloud.google.com/learn/what-is-a-data-warehouse

Oracle also have a definition of Data Warehouse.
https://www.oracle.com/uk/database/what-is-a-data-warehouse/

The same is true of IBM.
https://www.ibm.com/cloud/learn/data-warehouse

28
The Amazon Web Services web site has a useful definition of Data Lake.
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake

The Google Cloud web site also has a useful definition.
https://cloud.google.com/learn/what-is-a-data-lake

Oracle also have a definition of Data Lake.
https://www.oracle.com/big-data/what-is-data-lake/

The same is true of IBM.
https://www.ibm.com/uk-en/analytics/data-lake

29
30
We will look at all of these challenges in the Big Data Landscape module.

31
