
To do for Today

Discuss “plans” for Project 1 that will result in your Product 1.


Discuss data-intensive computing and “data strategy”.



What will I learn in this course? (Slide #3 from last class)
You’ll learn about the basic data analytics process, from defining a problem to cleaning and processing data for downstream analytics (and application of algorithms) and knowledge extraction. (Figure adapted from the Doing Data Science book.)

“structured data”, “tables, .csv”

You’ll learn about big data infrastructures and algorithms. Managing large-scale data requires special structures and algorithms. (Figure from Lin and Dyer’s book.)

“structured and unstructured data” “key-value store”

You’ll learn about newer data challenges and methods to address them, for example streaming data and how to process and analyze it in a timely manner. (Examples of data streams are social media data and multi-modal enterprise data.)

“structured, unstructured and streaming data”, “messages”
Poster board for your Project 1: the data science roadmap.
Let’s analyze each component (numbered 1 through 10 in the figure).


Tasks of the DS Roadmap (Project 1)

1. Real-world application domain – decide what area you want to work in.
2. Raw data collected – in your case, search for, study, and get the data sets you want to use. Can be multiple data sets from multiple sources. See [1] for a report on how they collected data.
3. Data processed – in your case you already have a source that has collected and processed the data, but you may want to select the features (columns) you are interested in.
4. Data cleaned – this is an important step in your project. It means addressing missing values, replacing N/A with appropriate values, adding columns for certain categories (e.g., an age category), etc. (See the sketch after this list.)
5. Exploratory data analysis: understand the nature of the data you have collected (range, average, summary statistics); any anomalies, any concerns? You may have to redo data acquisition.
6. Modeling and analysis: apply algorithms to analyze the data. Two types: statistical modeling and ML algorithms. Use simple ones implemented in standard libraries. Evaluate the goodness of your modeling and analysis.
7. Visualization of the outcomes of modeling: an important step in presenting your data analysis. Use appropriate libraries and create engaging and “useful” visualizations.
8. Build data products: web-enabling, enterprise-enabling, mobile-enabling. If it worked for the “apples” data set, it should work for the “oranges” data set.
9. Open it up to the world: Let users test it!
10. Let decision makers use it to make data-driven decisions. What are your recommendations based on your analysis?
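As an illustration of steps 4 and 5, a minimal pandas sketch is shown below; the CSV file name and the age column are hypothetical stand-ins for whatever features your data set actually has.

```python
import pandas as pd

# Load the raw data (file name and column names below are hypothetical).
df = pd.read_csv("my_dataset.csv")

# Step 4 (cleaning): drop exact duplicates, fill missing numeric values,
# and add a derived "age category" column.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())
df["age_category"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, 120],
    labels=["minor", "young adult", "middle-aged", "senior"],
)

# Step 5 (EDA): ranges, averages, and other summary statistics.
print(df.describe())                       # count, mean, std, min, max per numeric column
print(df["age_category"].value_counts())   # how many rows fall in each category
```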



Problem statement (+ data identification) Due: 2/12

Here are some rules and policies:

1. Journal everything you do by date: start with today, using Google Sheets or any other journaling mechanism. This will help you with your storytelling with data.

2. Do not reuse projects from other courses or projects completed at your current or previous work. Start afresh with an open mind.

3. Approach this with personal goals: How can I use this to showcase my skills, talent, and knowledge to advance my career and get hired? What can I learn from this project? How can I improve my “intellect”?
(I have a means to enforce these rules. You’ll see how in the project description.)
4. Research and get some unique and interesting data to answer some useful questions with broader impact. Who will this analysis affect? How is it relevant to the world today?



Where is the project description?

We’ll discuss all the other tasks of the project next lecture.



Where can you get your data?

Search for “top 10 data sources”


You will get hundreds of data sources and topics of interest, for example education, election data, etc.
Don’t get confused: choose the one domain you are passionate about.
Develop and sketch a question or questions around that topic.
Form a problem statement.



To do (for each of you)

Sign up:
 All: Sign up on Piazza. Can you do it now? https://piazza.com/class/kk6xmjqekrl1e9
 TAs: Please set up a Google Sheet with slots (30, 30, 10). Each slot will take 2 students.
 Students: Please form a team of at most 2 and sign up on the Google Sheet ASAP.
Task 1 of the project was explained. Work on it.
 “Research” and identify an application domain.
 Form a data-driven question to answer.
 Form the problem statement around this question, the data, and the application domain.
 Record it for submission.



Data-Intensive Computing
What is your data strategy?

B.RAMAMURTHY



Data-intensive computing

The phrase was initially coined by the National Science Foundation (NSF).


What is it?
 Volume, velocity, variety, veracity (uncertainty) (Gartner, IBM)
How is it addressed?
Why now?
What do you expect to extract by processing this large data?
 Intelligence for decision making
What is different now?
 Storage models, processing models
 Big Data, analytics and cloud infrastructures
Summary



Motivation

Tremendous advances have taken place in statistical methods and tools, machine learning and data mining approaches, and internet-based dissemination tools for analysis and visualization.
Many tools are open source and freely available for anybody to use.
Is there an easy entry point into learning these technologies?
Can we make these tools as easily accessible to students, researchers, and decision makers as “office” productivity software is?



High Level Goals for the course

Understand foundations of data analytics so that you can interpret and communicate results and make informed decisions
Study and learn to apply common statistical methods and machine learning algorithms to solve business problems
Learn to work with popular tools to analyze and visualize data; more importantly, encourage consistency across departments in the analytics and tools used
Work with the cloud for data storage and for the deployment of applications
Learn methods for mastering and applying emerging concepts and technologies for continuous data-driven improvements to your research/work/business processes
Transform complex analytics into routine processes



Newer kinds of Data – Diversity of data

New kinds of data from different sources: tweets, geolocation, emails, blogs, IoT devices, blockchain (hmm..??)
Three major types: structured, unstructured, and now streaming data (illustrated in the sketch below)
Structured data: data collected and stored according to a well-defined schema; e.g., real-time stock quotes
Unstructured data: messages from social media, news, talks, books, letters, manuscripts, court documents, ...
“Regardless of their differences, they work in tandem in any effective big data operation. Companies wishing to make the most of their data should use tools that utilize the benefits of both.” [5]
We will discuss methods for analyzing structured, unstructured, and streaming data
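A toy Python illustration of the three shapes of data; all names, values, and the fake stream below are made up for the example, and a real stream would come from a broker such as Kafka.

```python
# Structured: a record that follows a well-defined schema (e.g., a CSV/table row).
stock_quote = {"symbol": "XYZ", "price": 101.25, "timestamp": "2021-10-27T10:00:00"}

# Unstructured: free text with no schema; meaning must be extracted.
social_post = "Just tried the new cafe downtown -- the espresso was amazing!"

# Streaming: an unbounded sequence of messages arriving over time.
def message_stream():
    # In practice this would read from a message broker; here we yield fake messages.
    for i in range(3):
        yield {"offset": i, "value": f"event-{i}"}

for msg in message_stream():
    print(msg)
```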



Data Deluge: smallest to largest

 Bioinformatics data: from about 3.3 billion base pairs in a human genome to a huge number of protein sequences and the analysis of their behaviors
 The internet: web logs, Facebook, Twitter, maps, blogs, etc.: analytics ...
 Financial applications that analyze volumes of data for trends and other deeper knowledge
 Health care: huge amounts of patient data, drug and treatment data
 The universe: the Hubble Ultra Deep Field shows hundreds of galaxies, each with billions of stars; the Sloan Digital Sky Survey: https://www.sdss.org/



Big-data Problem Solving Approaches

 Algorithmic: after all, we have been working towards this forever: scalable/tractable algorithms
 High-performance computing (HPC, multi-core): CCR has machines that are 16-CPU, 32-core with 128 GB RAM: OpenMP, MPI, etc.
 GPGPU programming: general-purpose graphics processors (NVIDIA)
 Statistical packages like R running on parallel threads on powerful machines
 Machine learning algorithms on supercomputers
 Hadoop MapReduce-style parallel processing
 Spark-like approaches providing in-memory computing models (see the sketch below)
 And now stream processing infrastructure (Kafka)
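As a hedged illustration of the Spark-style in-memory model named above, a minimal PySpark word count might look like the sketch below; the input path is hypothetical, and it assumes pyspark is installed with a local Spark runtime available.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (in-memory computing model).
spark = SparkSession.builder.appName("WordCountSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Classic word count expressed as map/reduce transformations.
counts = (sc.textFile("input.txt")                  # hypothetical input file
            .flatMap(lambda line: line.split())     # line -> words
            .map(lambda word: (word, 1))            # word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))       # sum counts per word

print(counts.take(10))
spark.stop()
```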



Processing granularity (data size: small at the top, large at the bottom)
• Pipelined, instruction level (single-core): single-core, single processor; single-core, multi-processor
• Concurrent, thread level (multi-core): multi-core, single processor; multi-core, multi-processor
• Service, object level (cluster): cluster of processors (single or multi-core) with shared memory; cluster of processors with distributed memory
• Indexed, file level: grid of clusters
• Mega, block level: embarrassingly parallel processing; MapReduce, distributed file system
• Virtual, system level: cloud computing


A Golden Era in Computing
• Powerful multi-core processors
• Explosion of domain applications
• Heavy societal involvement
• Superior software methodologies
• Proliferation of devices
• Wider bandwidth for communication
• Virtualization leveraging the powerful hardware
Intelligence and Scale of Data

Intelligence is a set of discoveries made by federating and processing information collected from diverse sources.
Information is a cleansed form of raw data.
For statistically significant information we need a reasonable amount of data.
For gathering good intelligence we need a large amount of information.
As pointed out by Jim Gray in The Fourth Paradigm, an enormous amount of data is generated by millions of experiments and applications.
Thus intelligence applications are invariably data-heavy, data-driven, and data-intensive.
Data is gathered from the web (public or private, covert or overt) and generated by a large number of domain applications.



Intelligence (or origins of Big-data computing?)

 Search for Extra-Terrestrial Intelligence (the seti@home project)
 The Wow! signal: http://www.bigear.org/wow.htm

Characteristics of intelligent applications

 Google search: how is it different from the regular search engines that existed before it? It took advantage of the fact that the hyperlinks within web pages form an underlying structure that can be mined to determine the importance of various pages (see the PageRank sketch below).
 Restaurant and menu suggestions: instead of “Where would you like to go?”, “Would you like to go to CityGrille?”
 Learning capacity from previous data: habits, profiles, and other information gathered over time.
 A collaborative and interconnected world capable of inference: Facebook friend suggestions.
 Large-scale data requiring indexing.
 ... Did you know Amazon plans to ship things before you order them?
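A minimal sketch of the idea behind mining link structure: power iteration over a tiny made-up link graph. The graph, damping factor, and iteration count are illustrative assumptions, not Google's actual implementation.

```python
# Tiny hypothetical web graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}   # start with uniform rank

# Power iteration: a page is important if important pages link to it.
for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += damping * share
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # most "important" pages first
```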



Data-intensive application characteristics
• Models
• Algorithms (thinking)
• Data structures (infrastructure)
• Aggregated content (raw data)
• Reference structures (knowledge)


Basic Elements

Aggregated content: a large amount of data pertinent to the specific application; each piece of information is typically connected to many other pieces. Ex: databases
Reference structures: structures that provide one or more structural and semantic interpretations of the content. Reference structures about a specific domain of knowledge come in three flavors: dictionaries, knowledge bases, and ontologies.
Algorithms: modules that allow the application to harness the information hidden in the data. Applied to the aggregated content, and sometimes requiring the reference structures. Ex: MapReduce
Data structures: newer data structures that leverage the scale and the WORM characteristics; ex: MS Azure, Apache Hadoop, Google BigTable



Examples of data-intensive applications

Search engines
Recommendation systems:
 CineMatch of Netflix Inc. movie recommendations
 Amazon.com: book/product recommendations
Biological systems: high-throughput sequencing (HTS)
 Analysis: disease-gene match
 Query/search for gene sequences
Space exploration
Financial analysis



More intelligent data-intensive applications

Social networking sites
Mashups: applications that draw upon content retrieved from external sources to create entirely new innovative services
Portals
Wikis: content aggregators; linked data; excellent data and fertile ground for applying concepts discussed in the text
Media-sharing sites
Online gaming
Biological analysis
Space exploration
Algorithms

Statistical inference
Machine learning: the capability of a software system to generalize based on past experience and to use these generalizations to provide answers to questions about old, new, and future data (see the sketch below)
Data mining
Soft computing
Deep learning
We also need algorithms that are specially designed for the emerging storage models and data characteristics.
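As a hedged illustration of “generalizing from past experience,” here is a minimal scikit-learn sketch; the toy training data is made up, and it assumes scikit-learn is installed.

```python
from sklearn.linear_model import LogisticRegression

# Past experience: labeled examples of (hours studied, hours slept) -> passed exam?
X_train = [[1, 4], [2, 5], [8, 7], [9, 6], [3, 8], [10, 5]]
y_train = [0, 0, 1, 1, 0, 1]

model = LogisticRegression()
model.fit(X_train, y_train)          # learn a generalization from past data

# Answer a question about new, unseen data.
print(model.predict([[7, 6]]))       # e.g., predicts class 1 ("pass")
```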



Different Types of Storage
• The internet introduced a new challenge in the form of web logs and web crawler data: large, “peta-scale” data.
• But observe that this type of data has a uniquely different characteristic than your transactional or “customer order” data, or “bank account” data:
• The data type is “write once, read many” (WORM);
• Privacy-protected healthcare and patient information;
• Historical financial data;
• Other historical data.
 Relational file systems and tables are insufficient.
• Large <key, value> stores (files) and storage management systems (see the toy WORM store below).
• Built-in features for fault tolerance, load balancing, data transfer and aggregation, ...
• Clusters of distributed nodes for storage and computing.
• Computing is inherently parallel.
• Streaming systems.
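A toy sketch of the write-once-read-many, <key, value> access pattern using a plain append-only log file. This only illustrates the idea; it is not how HDFS, BigTable, or any real store is implemented, and the file name is hypothetical.

```python
import json

class WormStore:
    """Append-only <key, value> store: keys are written once, read many times."""

    def __init__(self, path):
        self.path = path

    def put(self, key, value):
        with open(self.path, "a") as f:          # append only -- no updates in place
            f.write(json.dumps({"key": key, "value": value}) + "\n")

    def get(self, key):
        with open(self.path) as f:
            for line in f:                        # read many: scan the log
                record = json.loads(line)
                if record["key"] == key:
                    return record["value"]       # returns the first write for the key
        return None

store = WormStore("events.log")                   # hypothetical file name
store.put("user:42", {"clicks": 7})
print(store.get("user:42"))
```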



Big-data Concepts
The special <key, value> store originated from the Google File System (GFS).
The Hadoop Distributed File System (HDFS) is the open-source version of this (currently an Apache project).
The data is processed in parallel using the MapReduce (MR) programming model (see the sketch below).
Challenges:
 Formulation of MR algorithms
 Proper use of the features of the infrastructure (ex: sort)
 Best practices in using MR and HDFS
An extensive ecosystem consists of other components such as column-based stores (HBase, BigTable), big-data warehousing (Hive), workflow languages, etc.
And now Kafka, Confluent, Kotlin, ...
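A minimal sketch of the MR programming model, simulated in plain Python to show the map, shuffle (group by key), and reduce phases of word count. A real job would run on a cluster via Hadoop or Spark; the input lines here are made up.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Sum all counts emitted for the same key.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate (key, value) pairs by key.
groups = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        groups[word].append(one)

results = [reduce_phase(word, counts) for word, counts in groups.items()]
print(sorted(results, key=lambda kv: -kv[1]))   # e.g., [('the', 3), ('fox', 2), ...]
```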



Data & Analytics

We have witnessed an explosion in algorithmic solutions.
“In pioneer days they used oxen for heavy pulling, and when one couldn’t budge a log they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” (Grace Hopper)
What you cannot achieve with an algorithm can often be achieved with more data.
Big data, if analyzed right, gives you better answers: the Centers for Disease Control’s flu prediction vs. prediction of flu through “search” data, two full weeks before the onset of flu season! http://www.google.org/flutrends/



Cloud Computing
The cloud is a facilitator for big data computing and is indispensable in this context.
The cloud provides processors, software, operating systems, storage, monitoring, load balancing, clusters, and other requirements as a service (see the storage sketch below).
The cloud offers accessibility to big data computing.
Cloud computing models:
 Platform as a service (PaaS): Microsoft Azure
 Software as a service (SaaS): Google App Engine (GAE)
 Infrastructure as a service (IaaS): Amazon Web Services (AWS)
 Services-based application programming interfaces (APIs)
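As a hedged example of consuming cloud storage as a service, a minimal boto3 sketch for Amazon S3 is shown below; the bucket name and file paths are hypothetical, and it assumes boto3 is installed and AWS credentials are configured.

```python
import boto3

s3 = boto3.client("s3")

# Upload a local data file to a bucket (bucket and key names are made up).
s3.upload_file("cleaned_dataset.csv", "my-ds-course-bucket", "project1/cleaned_dataset.csv")

# Later, download it back for analysis.
s3.download_file("my-ds-course-bucket", "project1/cleaned_dataset.csv", "local_copy.csv")
```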



Top Ten Largest Databases

[Bar chart: top ten largest databases (2007), sizes in terabytes (scale 0 to 7000 TB): LOC, CIA, Amazon, YouTube, ChoicePt, Sprint, Google, AT&T, NERSC, Climate]

Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world/


Top Ten Largest Databases in 2007 vs. Facebook’s cluster in 2010

[Bar chart: the same top ten databases, in terabytes, with Facebook added; Facebook’s cluster held 21 petabytes in 2010: LOC, CIA, Amazon, YouTube, ChoicePt, Sprint, Google, AT&T, NERSC, Climate, Facebook]

Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world


Data Strategy

 In this era of big data, what is your data strategy?
 Strategy as in simply “planning for the data challenge”
 It is not only about big data: all sizes and forms of data
 Data collection from customers used to be an elaborate task: surveys and other such instruments
 Nowadays data is available in abundance, thanks to technological advances as well as social networks
 Data is also generated by many of your own business processes and applications
 Data strategy means many different things: we will discuss this next



Components of a Data Strategy [2]

Data integration
Metadata
Data modeling
Organizational roles and responsibilities
Performance and metrics
Security and privacy
Structured data management
Unstructured data management
Business intelligence
Data analysis and visualization
Tapping into social data
This course will provide training in emerging technologies, tools, environments and APIs available for
developing and implementing one or more of these components.



Data Strategy for newer kinds of data

How will you collect data? Aggregate data? What are your sources? (e.g., social media)
How will you store the data? And where?
How will you use the data? Analyze it? Analytics? Data mining? Pattern recognition?
How will you present or report the data to stakeholders and decision makers? Visualization?
Archive the data for provenance and accountability.



Tools for Analytics

Elaborate tools with nifty visualizations and expensive licensing fees: ex: Tableau, Tom Sawyer
Software that you can buy for data analytics: Brilig, small and affordable but short-lived
Open-source tools: Gephi, with sporadic support
Open source, freeware with excellent community involvement: the R system
Some desirable characteristics of the tools: simple, quick to apply, intuitive, useful, flat learning curve
A demo to prove this point: data  actions/decisions



Demo: Exam1 grades: traditional reporting (approach 1)

             Q1      Q2      Q3      Q4      Q5      Total
Mean         16.7    13.9    9.6     18.5    13.7    72.4
Median       20.0    16.0    9.0     19.0    17.0    76.0
Mode         20.0    20.0    15.0    25.0    20.0    90.0

Exam version 1:
Mean         16.0    14.2    9.6     19.4    14.0    73.2
% of max     80.1%   71.1%   64.0%   77.4%   70.2%   73.2%

Exam version 2:
Mean         17.3    13.6    9.7     17.6    13.3    71.5
% of max     86.7%   67.8%   64.6%   70.3%   66.7%   71.5%

(Rows per the original caption: questions 1..5 and total; mean, median, mode; then the mean for exam version 1 and version 2.)
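A hedged sketch of how such a report might be produced from raw scores with pandas and matplotlib; the scores file and its column names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical raw scores: one row per student, one column per question.
scores = pd.read_csv("exam1_scores.csv")           # columns: Q1..Q5, Version
scores["Total"] = scores[["Q1", "Q2", "Q3", "Q4", "Q5"]].sum(axis=1)

# Traditional reporting: mean/median per question and per exam version.
print(scores[["Q1", "Q2", "Q3", "Q4", "Q5", "Total"]].agg(["mean", "median"]))
print(scores.groupby("Version")["Total"].mean())

# Distribution of exam 1 points (the "points vs. # students" view).
scores["Total"].plot(kind="hist", bins=10, title="Distribution of exam1 points")
plt.xlabel("Points")
plt.ylabel("# students")
plt.show()
```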
Traditional approach 2: points vs #students

Distribution of exam1 points


Individual questions analyzed..


Interpretation and action/decisions


References

[1] Bloomberg global COVID tracker: https://www.bloomberg.com/graphics/covid-vaccine-tracker-global-distribution/, last viewed Feb 3, 2021.
[2] S. Adelman, L. Moss, M. Abai. Data Strategy. Addison-Wesley, 2005.
[3] T. Davenport. A Predictive Analytics Primer. Harvard Business Review, Sept 2, 2014. http://blogs.hbr.org/2014/09/a-predictive-analytics-primer/
[4] M. Nemschoff. A Quick Guide to Structured and Unstructured Data. Smart Data Collective, June 28, 2014.



Summary

We are entering a watershed moment in the internet era.
At its core and center, this involves big data analytics and tools that provide intelligence in a timely manner to support decision making.
Newer storage models, processing models, and approaches have emerged.
We will learn about these and develop software using these newer approaches to data (streaming systems).

