Confident Data Skills 2nd Edition

Eremenko
Confident Data Skills
How to work with data and futureproof
your career

Second Edition

Kirill Eremenko
To my parents, Alexander and Elena
Eremenko, who taught me the most
important thing in life – how to be a
good person
CONTENTS

List of figures
Bonus for readers
Acknowledgements

Introduction
PART ONE ‘What is it?’ Key principles
Striding out
The future is data
Arresting developments
01 Defining data
Data is everywhere
Big (data) is beautiful
Storing and processing data
Data can generate content
Using data
Why data matters now
Worrying achieves nothing
References
Notes
02 How data fulfils our needs
The permeation of data
Data science and physiology
Data science and safety
Data science and belonging
Data science and esteem
Data science and self-actualization
Some final thoughts
References
Notes
03 AI and our future
What is artificial intelligence?
Artificial general intelligence
Narrow AI
Robotic process automation
Computer vision
Natural language processing
Reinforcement learning and deep learning
The dark side of AI
Prepare for Part Two
References
Notes
PART TWO
‘When and where can I get it?’ Data gathering and
analysis
The Data Science Process
Getting started
04 Identify the question
Look, ma, no data!
How do you solve a problem like…
Timekeeping
The art of saying no
Onward!
References
Notes
05 Data preparation
Encouraging data to talk
With great power comes great responsibility
Preparing your data for a journey
Reference
Notes
06 Data analysis: Part 1
Don’t skip this step
Classification and clustering
Classification
Decision trees
Random forest
K-nearest neighbours (K-NN)
Naive Bayes
Classification with Naive Bayes
Logistic regression
Clustering
K-means clustering
Hierarchical clustering
References
Notes
07 Data analysis: Part 2
Reinforcement learning
The multi-armed bandit problem
Upper confidence bound
Thompson sampling
Upper confidence bound vs Thompson
sampling: Which is preferable?
Deep learning
Weight training: How neural networks learn
The future for data analysis
References
Notes
PART THREE
‘How can I present it?’ Communicating data
Looking good
You're not finished yet!
The career shaper
08 Data visualization
What is visual analytics?
What is data visualization?
Speaking a visual language
Steps to creating compelling visuals
Final thoughts
References
Notes
• Looking further: Types of chart
09 Data presentation
The importance of storytelling
Creating data advocacy
How to create a killer presentation
The end of the process
References
Notes
10 Your career in data science
Breaking in
Applying for jobs
Preparing for an interview
Interviewing them
Nurturing your career within a company
References
Notes
Index
LIST OF FIGURES

2.1 Maslow’s hierarchy of needs


II.1 The Data Science Process
5.1 Importing spreadsheets containing missing data
5.2 Missing data in a spreadsheet to show start-up growth
6.1 Scatter plot to show customer data organized by age and
time spent gaming
6.2 Scatter plot to show customer data, divided into leaves
6.3 Flowchart to show the decision tree process
6.4 Measuring Euclidean distance
6.5 The K-NN algorithm
6.6 Success of wine harvests according to hours of sunlight and
rainfall 1
6.7 Success of wine harvests according to hours of sunlight and
rainfall 2
6.8 Calculating marginal likelihood for Naive Bayes
6.9 Calculating likelihood for Naive Bayes
6.10 Scatter diagram to show respondents’ salaries based on
experience
6.11 Scatter diagram to show whether or not respondents
opened our email, based on their age
6.12 Straight linear regression to show the likelihood of
respondents opening our email, based on their age
6.13 Adapted linear regression to show the likelihood of
respondents opening our email, based on their age
6.14 Linear regression transposed to a logistic regression
function
6.15 Graph containing categorical variables
6.16 Logistic regression line
6.17 Logistic regression line (without observations)
6.18 Logistic regression including fitted values
6.19 Logistic regression line (segmented)
6.20 Dataset for e-commerce company customers
6.21 Scatter plot for two independent variables
6.22 K-means assigning random centroids
6.23 Assigning data points to closest centroid
6.24 Re-computing centroids
6.25 Reassigning data points to closest centroid
6.26 Centroids are displaced
6.27 Repeat re-computation of centroids
6.28 K-means, visualized
6.29 Three clusters, identified by K-means
6.30 Finding the distance between data points in one cluster
6.31 Finding the distance between data points in two clusters
6.32 WCSS values against number of clusters
6.33 Steps to agglomerative hierarchical clustering
6.34 The process of agglomerative clustering, as shown in a
dendrogram
6.35 Splitting into four clusters
6.36 Identifying the largest vertical segment
6.37 Splitting into two clusters
7.1 Multi-armed bandit distributions
7.2 Multi-armed bandit distributions with expected returns
7.3 Expected returns vertical axis
7.4 Starting line
7.5 Starting confidence bound
7.6 Real-life scenario
7.7 Machine D3 after trial rounds
7.8 All machines after trial rounds
7.9 Shifting and narrowing of the D4 confidence bound
7.10 Thompson sampling multi-armed bandit distributions
7.11 Thompson sampling expected returns
7.12 Distribution for machine M3 after trial rounds
7.13 Distribution for machine M3 with true expected returns
7.14 Distributions for all three machines after trial runs
7.15 Randomly generated bandit configuration
7.16 Machine M2 updated distribution
7.17 Thompson sampling refined distribution curves
7.18 Parts of a neuron
7.19 A neuron's input and output values
7.20 A threshold activation function
7.21 A rectified linear unit
7.22 A sigmoid activation function
7.23 A perceptron to estimate housing value
7.24 A complex neural network to estimate housing value
7.25 Impact of features on N1
7.26 Impact of features on N2
7.27 Impact of features on N3
7.28 Calculating output with a neural network
7.29 A neural network with three hidden layers
8.1 Napoleon’s retreat from Moscow (the Russian Campaign
1812–13)
BONUS FOR READERS

Thank you for picking up this book. You’ve taken a huge step on
your journey into data science.
All readers gain complimentary access to my Data Science A–Z course.
Just go to www.superdatascience.com/bookbonus and use the
password datarockstar.
You can download a guide to using colour in visualizations at
www.superdatascience.com/cds.
Happy analysing!
ACKNOWLEDGEMENTS

I would like to thank my father, Alexander Eremenko, whose
love and care have shaped me into the person I am today, and
who has shown me through his firm guidance how I can take
hold of life’s opportunities. Thank you to my warm-hearted
mother, Elena Eremenko, for always lending an ear to my crazy
ideas and for encouraging my brothers and me to take part in
the wider world – through music, language, dance and so much
more. Were it not for her wise counsel, I would never have
immigrated to Australia.
Thanks to my brother, Mark Eremenko, for always believing
in me, and for his unshakeable confidence. His fearlessness in
life continues to fuel so many of the important decisions I make.
Thank you to my brother, Ilya Eremenko, wise beyond his
years, for his impressive business ideas and well-considered
ventures. I am certain that fame and fortune will soon be
knocking on his door.
Thank you to my grandmother Valentina, aunt Natasha and
cousin Yura, for their endless love and care. And thank you to
the entire Tanakovic and Svoren families, including my
brothers Adam and David for all the dearest moments we share.
Thank you to my students and to the thousands of people
who listen to the SuperDataScience podcast. My audience
inspires me to keep going!
Several key people helped to make this book a reality. I would
like to thank my writing partner Zara Karschay for capturing
my voice. Thank you to my commissioning editor Rebecca Bush
and production editor Stefan Leszczuk, along with Anna Moss,
whose feedback and guidance were fundamental to the writing
process, and to Kogan Page’s editorial team for their rigour.
Thank you to my friend and business partner Hadelin de
Ponteves for inspiring me and for being a great source of
support in tackling some of the most challenging subject
matters in the field of data science, as well as for his help in
reviewing the technical aspects of this book.
Thank you to my friend and executive assistant Mitja Bosnič
for his tireless efforts in making this second edition possible.
My thanks also go to the talented team at SuperDataScience for
taking on additional responsibilities so that I could write this
book. Thank you to the hardworking team at Udemy, including
my supportive account managers Lana Martinez and Erin
Adams.
Thank you to my friend and mentor Artem Vladimirov,
whose admirable work ethic and knowledge lay the
foundations upon which I have built everything I know about
data science. Many thanks to Vitaly Dolgov, Ivor Lok, Richard
Hopkins, Tracy Crossley and Herb Kanis for being excellent role
models, for believing in me, for always being there when I
needed help, and for guiding me through times both good and
bad. Thank you to Katherina Andryskova – I am grateful that
she was the first person to read and offer invaluable feedback
on Confident Data Skills.
I give my express thanks to the people whose work
contributed to the case studies in this book: Alberto Cairo,
Samuel Hinton, Richard Hopkins, Kristen Kehrer, Raul Popa,
Caroline McColl, Ulf Morys, Daniel and Leigh Pullen, Dominic
Roe, Adrian Rosebrock, Matthew Rosenquist, Dan Shiebler, Ben
Taylor, Artem Vladimirov, and Stephen Welch.
The motivational teachers at Moscow School 54, the Moscow
Institute of Physics and Technology and the University of
Queensland have my thanks for giving me such beneficial
education. To all my past colleagues at Deloitte and Sunsuper,
thank you for giving me the professional development I needed
for creating my data science toolkit.
Most of all, I want to thank you, the reader, for giving me
your valuable time. It has been my foremost hope that this book
will encourage those who wish to understand and implement
data science in their careers.
Introduction
A great deal has changed in the two years since the first edition.
The field of data science has stormed ahead – everything is
bigger, faster and stronger. Increased processing capabilities
have brought reinforcement learning and deep learning to the
market. Neural networks are much easier to build. Artificial
intelligence has changed the game in many industries. But we
must also not forget about the dark side of this moon. Ruthless
hacks have held entire city systems hostage. Many fear the
prospect of machines gaining sentience, and of data-driven
companies abusing our privacy.
I might be a small cog in the data science mechanism, but I
have learned that anyone has the potential to bring about
change. I continue to encourage my students to be a part of the
movement, and I practise what I preach: BlueLife AI, the
corporate training start-up I co-founded in 2018, is built on the
idea that everyone can benefit from the power of artificial
intelligence. Nevertheless, I can’t take credit for this without
acknowledging my colleague and course co-developer, Hadelin
de Ponteves, and my students.
have become a vibrant community of over a million data
scientists from more than 200 countries, all of whom are eager
to share and critique ideas. The international DataScienceGO
conference, now entering its fourth iteration, doubles its
delegate numbers every year. As the conference welcomes all
abilities, we originally devised pathways to accommodate
different skill sets. Yet we learned that the most exciting panels were
those that bridged multiple routes. In the sessions where people
of diverse capabilities interacted with each other, ideas spread.
It became clear that data scientists are friendly, engaged people
who want to learn in an open and welcoming environment that
brings everyone together.
I have learned, above all, that human networks are powerful.
That data scientists love a sense of community. That our
students want to continue learning with us, even when they’ve
surpassed the courses we have available. We – and I mean that
generally – get so much value from helping each other. I once
thought that having a reliable network of peers was just one of
many components to success in the field. Now I see that it is the
most critical element of all.
We can make the same argument for data. A single data point
rarely makes a difference, but drop it into a network of a
million others, and it can contribute to real change. Here is my
advice: join a network of data scientists. Follow the movers and
shakers on LinkedIn and Twitter. Take a course and talk to your
fellow students. Get involved in the discussion. You’ll learn
from others, gain perspective on the latest developments and,
in time, develop the confidence to make valuable
contributions of your own.
PART ONE
‘What is it?’ Key principles
With all the attention given to the apparently limitless potential
of technology and the extensive opportunities for keen
entrepreneurs, some may ask why they should bother with data
science at all – why not simply learn the principles of
technology? After all, technology powers the world, and it
shows no signs of slowing down. Any reader with an eye to
their career might think that learning how to develop new
technologies would surely be the better way forward.
It is easy to regard technology as the force that changes the
world – it has given us the personal computer, the internet,
artificial organs, driverless cars, the Global Positioning System
(GPS) – but few people think of data science as the propeller
behind many of these inventions. That is why you should be
reading this book over a book about technology; you need to
understand the mechanics behind a system in order to make a
change.
We should not consider data only as the boring-but-helpful
parent, and technology as the stylish teenager. The importance
of data science does not begin and end with the explanation
that technology needs data as just one of many other functional
elements. That would be denying the beauty of data, and the
many exciting applications that it offers for work and play. In
short, it is not possible to have one without the other. What this
means is that if you have a grounding in data science, the door
will be open to a wide range of other fields that need a data
scientist, making it an unusual and propitious area of research
and practice.
Part One introduces you to the ubiquity of data, and the
developments and key principles of data science that are useful
for entering the subject. The concepts in the three chapters will
outline a clear picture of how data applies to you, and will get
you thinking not only about how data can directly benefit you
and your company but also how you can leverage data for the
long term in your career and beyond.

Striding out
Chapter 1 will mark the beginning of our journey into data
science. It will make clear the vast proliferation of data and
how in this Computer Age we all contribute to its production,
before moving on to show how people have collected it, worked
with it and, crucially, how data can be used to bolster a great
number of projects and methods within and outside the
discipline.
We have established that part of the problem with data
science is not its relative difficulty but rather that the discipline
is still something of a grey area for so many. Only when we
understand precisely how much data there is and how it is
collected can we start to consider the various ways in which we
can work with it. We have reached a point in our technological
development where information can be efficiently collected
and stored for making improvements across all manner of
industries and disciplines – as evidenced in the quantity of
publicly available databases and government-funded projects
to aggregate data across cultural and political institutions – but
there are comparatively few people who know how to access
and analyse it. Without workers knowing why data is useful,
these beautiful datasets will only gather dust. This chapter
makes the case for why data science matters right now, why it is
not just a trend that will soon go out of style, and why you
should consider implementing its practices as a key component
of your work tasks.
Lastly, this chapter details how the soaring trajectory of
technology gives us no room for pause in the field of data
science. Whatever fears we may have about the world towards
which we are headed, we cannot put a stop to data being
collected, prepared and used. Nevertheless, it is impossible to
ignore the fact that data itself is not concerned with questions
of morality, and this has left it open to exploitation and abuse.
Those of you who are concerned can take charge of these
developments and enter into discussion with global institutions
that are dealing with issues surrounding data ethics, an area
that I find so gripping I gave it its own section in Chapter 3, AI
and our future.

The future is data


Everything, every process, every sensor, will soon be driven by
data. This will dramatically change the way in which business is
carried out. Ten years from now, I predict, every employee
of every organization in the world will be expected to have a
level of data literacy and be able to work with data and derive
some insights to add value to the business. Not such a wild
thought if we consider how, at the time of this book’s
publication, many people are expected to know how to use the
digital wallet service Apple Pay, which was only brought onto
the market in 2014.
Chapter 2, How data fulfils our needs, makes clear that data is
endemic to every aspect of our lives. It governs us, and it
gathers power in numbers. While technology has only been
important in recent human history, data has always played a
seminal role in our existence. Our DNA provides the most
elementary forms of data about us. We are governed by it: it is
responsible for the way we look, for the shape of our limbs, for
the way our brains are structured and their processing
capabilities, and for the range of emotions we experience. We
are vessels of this data, walking flash drives of biochemical
information, passing it on to our children and ‘coding’ them
with a mix of data from us and our partner. To be uninterested
in data is to be uninterested in the most fundamental principles
of existence.
This chapter explains how data is used across so many fields,
and to illustrate this I use examples that directly respond to
Abraham Maslow’s hierarchy of needs, a theory that will be
familiar to many students and practitioners in the field of
business and management. If this hierarchy is news to you,
don’t worry – I will explain its structure and how it applies to us
in Chapter 2.

Arresting developments
The final chapter in Part One will explore the current state of
AI, its potential applications and its dangers. Many of the
developments made within the field have had knock-on effects
on other subjects. They have raised questions about the future
for data scientists as well as for scholars and practitioners
beyond its disciplinary boundaries. If you want to develop your
career in data science, this chapter could even fire up ideas for
subject niches that sorely need qualified personnel.
To add further weight to the examples in Chapter 2, which
make a compelling case for data’s supportive role across many
walks of life, in Chapter 3 I highlight the five most
promising AI developments in the world of business. AI’s broad
application can make it difficult to penetrate. This chapter gives
you a grounding in the subject’s key applications, and where
change is happening.
The positive effects of AI are clear, but it is also important not
to be blinded by them. Thus, Chapter 3 also addresses the security
threats that data, and AI’s use of it, can pose, and how data
practitioners can address current and future issues. Ethics is a
compelling area, as it has the power to alter and direct future
developments in data science. From what we understand of
information collection, to the extent to which it can be used
within machines and online, data ethics is setting the stage for
how humans and technology communicate.
01
Defining data
Think about the last film you saw at the cinema. How did you
first hear about it? You might have clicked on the trailer when
YouTube recommended it to you, or it may have appeared as an
advertisement before YouTube showed you the video you
actually wanted to see. You may have seen a friend sing its
praises on your social network, or had an engaging clip from
the film interrupt your newsfeed. If you’re a keen moviegoer, it
could have been picked out for you on a movie aggregator
website as a film you might enjoy. Even outside the comfort of
the internet, you may have found an advertisement for the film
in your favourite magazine, or you could have taken an idle
interest in the poster on your way to that coffeehouse with the
best Wi-Fi.
None of these touchpoints was coincidental. The stars didn’t
just happen to align for you and the film at the right moment.
Let’s leave the idealistic serendipity to the onscreen encounters.
What got you into the cinema was less a desire to see the film
and more of a potent concoction of data-driven evidence that
had marked you out as a likely audience member before you
even realized you wanted to see the film.
When you interacted with each of these touchpoints, you left
a little bit of data about yourself behind. We call this ‘data
exhaust’. It isn’t confined to your online presence, nor is it only
for the social media generation. Whether or not you use social
media platforms, whether you like it or not, you’re contributing
data.
It has always been this way; we’ve just become better at
recording and collecting it. Any number of your day-to-day
interactions stand to contribute to this exhaust. On your way to
the London Underground, CCTV cameras are recording you.
Hop onto the Tube, and you’re adding to Transport for London’s
statistical data about peak times and usage. When you
bookmark or highlight the pages of a novel on your Kindle, you
are helping distributors to understand what readers
particularly enjoyed about it, what they could put in future
marketing material and how far their readers tend to get into
the novel before they stop.
When you finally decide to forgo the trials and punishments
of public transport and instead drive your car to the
supermarket, the speed you’re going is helping GPS services to
show their users in real time how much traffic there is in an
area, and it also helps your car gauge how much more time you
have left before you’ll need to find a petrol station.
And today, when you emerge from these touchpoints, the data
you leave behind is swept up and added to a blueprint about
you that details your interests, actions and desires.
But this is only the beginning of the data story. This book will
teach you about how absolutely pervasive data really is. You
will learn the essential concepts you need to be on your way to
mastering data science, as well as the key definitions, tools and
techniques that will enable you to apply data skills to your own
work. This book will broaden your horizons by showing you
how data science can be applied to areas in ways that you may
previously have never thought possible. I’ll describe how data
skills can give a boost to your career and transform the way you
do business – whether that’s through impressing top executives
with your ideas or even starting up on your own.

Data is everywhere
Before we move any further, we should clarify what we mean
by data. When people think of data, they think of it being
actively collected, stashed away in databases on inscrutable
corporate servers and funnelled into research. But this is an
outdated view. Today, data is much more ubiquitous.
Quite simply, data is any unit of information. It is the by-
product of any and every action, pervading every part of our
lives, not just within the sphere of the internet, but also in
history, place and culture. A cave painting is data. A chord of
music is data. The speed of a car is data. A ticket to a football
match is data. A response to a survey question is data. A book is
data, as is a chapter within that book, as is a word within that
chapter, as is a letter within that word. It doesn’t have to be
collected for it to be considered data. It doesn’t have to be stored
in a vault of an organization for it to be considered data. Much
of the world’s data probably doesn’t (yet) belong to any
database at all.
Under this definition of data as a unit of information, we can
say that data is the tangible past. This is quite profound
when you think about it. Data is the past, and the past is data.
A record of the things that data describes is called a
database, and data scientists can use it to better understand our
present and future operations. They’re applying the very same
principle that historians have been telling us about for ages: we
can learn from history. We can learn from our successes – and
our mistakes – in order to improve the present and future.
The only aspect of data that has dramatically changed in
recent years is our ability to collect, organize, analyse and
visualize it in contexts that are only limited by our imagination.
Wherever we go, whatever we buy, whatever interests we have,
this data is all being collected and remodelled into trends that
help advertisers and marketers push their products to the right
people, that show the government their citizens’ political
leanings according to their borough or age, and that help
scientists create AI technologies that respond to complex
emotions, ethics and ideologies, rather than simple queries.
All things considered, you might start to ask what the limits to
the definition of data are. Does factual evidence about a plant’s
flowering cycle (quantitative data) count as data as much as the
scientist’s recording of the cultural stigma associated with
giving a bunch to a dying relative in the plant’s native country
(qualitative data)? The answer is yes. Data doesn’t discriminate.
It doesn’t matter whether the unit of information collected is
quantitative or qualitative. Qualitative data may have been less
usable in the past when the technology wasn’t sophisticated
enough to process it, but thanks to advancements in the
algorithms capable of dealing with such data, this is quickly
becoming history.
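To make the distinction concrete, here is a minimal Python sketch. The record, its field names and the keyword check are all invented for illustration; they simply show how a quantitative and a qualitative unit of information can sit side by side in the same record, and how differently we handle each.

```python
# A single observation about a plant, mixing quantitative and
# qualitative units of information. All values are invented
# for illustration.
observation = {
    "species": "Chrysanthemum",          # qualitative: a category
    "flowering_days": 42,                # quantitative: a measurement
    "petal_count": 21,                   # quantitative: a count
    "cultural_note": (                   # qualitative: free text
        "Associated with funerals in some countries; "
        "avoid as a gift to the unwell."
    ),
}

# Quantitative fields can be computed on directly...
average_weeks = observation["flowering_days"] / 7  # 6.0 weeks in flower

# ...while qualitative fields need processing first (here, a crude
# keyword check standing in for real text analysis).
is_sensitive_gift = "funeral" in observation["cultural_note"].lower()
```

The crude keyword check is exactly the kind of task that the improved algorithms mentioned above now handle far more capably at scale.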
To define data by its limits, consider again that data is the
past. You cannot get data from the future, unless you have
managed to build a time machine. But while data can never be
the future, it can yield insights and predictions about it. And it
is precisely data’s ability to fill in the gaps in our knowledge
that makes it so fascinating.
Big (data) is beautiful
Now that we have a handle on what data is, we need to shake
up our understanding of where and how it actually gets stored.
We have already seen our wide-reaching potential for
emitting data (that’s our data exhaust) and established that,
since any unit of information counts, what we understand as
data covers a very broad range. So once it is out there,
where does it all go?
By now, you’re likely to have heard the term ‘big data’. Put
very simply, big data is the name given to datasets with
columns and rows so considerable in number that they cannot
be captured and processed by conventional hardware and
software within a reasonable length of time. For that reason,
the term is dynamic – what might once have been considered
big data back in 2015 will no longer be thought of as such in
2022, because by that time technology will have been developed
to tackle its magnitude with ease.

The 3 Vs
To give a dataset the label of big data, at least one of three
requirements must be fulfilled:
its volume – which refers to the size of the dataset (eg the
number of its rows) – must be in the billions;
its velocity – which considers how quickly the data is
being gathered (such as online video streaming) – assumes
that the speed of data being generated is too rapid for
adequate processing with conventional methods; and
its variety – this refers to either the diversity of the type of
information contained in a dataset such as text, video,
audio or image files (known as unstructured data) or a
table that contains a significant number of columns which
represent different data attributes.
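As a rough sketch of how these criteria might combine in code, consider the toy check below. The thresholds are invented for illustration only; in practice, as noted above, “too big” always means “too big for your current tools”, so any real threshold moves as technology improves.

```python
# A toy check for whether a dataset might earn the "big data" label.
# A dataset qualifies if at least ONE of the three Vs applies.
# The thresholds are illustrative; real judgements depend on the
# hardware and software available at the time.

def is_big_data(rows, rows_per_second, unstructured, n_columns):
    volume = rows >= 1_000_000_000                 # rows in the billions
    velocity = rows_per_second >= 100_000          # arriving too fast to process
    variety = unstructured or n_columns >= 1_000   # mixed media, or very wide
    return volume or velocity or variety

# A modest customer table: none of the Vs apply.
print(is_big_data(rows=50_000, rows_per_second=10,
                  unstructured=False, n_columns=12))   # False

# A video-streaming event log: velocity alone qualifies it.
print(is_big_data(rows=2_000_000, rows_per_second=500_000,
                  unstructured=False, n_columns=8))    # True
```

Note that only one of the three Vs needs to apply, which is why the event log above counts as big data on velocity alone despite its unremarkable size.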
Big data has been around for much longer than we’d care to
imagine – it’s just that the term for it didn’t appear until the
1990s. We’ve been sitting on big data for years, for all manner
of disciplines, and for far longer than you might expect. So I
have to break it to you: big data is not big news. It’s certainly
not a new concept. Many if not all of the world’s largest
corporations have mammoth stores of data that have been
collected over a long period of time on their customers,
products and services. Governments store human data, from
censuses to surveillance and habitation. Museums store cultural
data, from artefacts and collector profiles to exhibition
archives. Even our own bodies store big data, in the form of the
genome.
In short, when a dataset simply can’t be handled with standard
tools, it earns the name big data. Data scientists don’t use the
term loosely: it draws attention to the fact that standard methods
of analysing the dataset in question are not sufficient.

Why all the fuss about big data?


You might think it strange that we have only just started to
realize how powerful data can be. But while we have been
collecting data for centuries, what stopped us in the past from
lassoing it all into something beneficial to us was the lack of
technology. After all, it’s not how big the data is that matters; it’s
what you do with it. Any data, ‘big’ or otherwise, is only useful
when you have the means to analyse it.