Professional Documents
Culture Documents
You’ll learn newer data challenges and methods to address these. For
example streaming data, how to process and analyze these in a timely
manner. (Examples for data stream are social media data and multi-modal
enterprise data)
Data stream
“structured, unstructured and streaming data”
“messages”
Poster board for your Project 1: data science roadmap. 5
Let’s analyze each component.
2 3 4
1
9 8
10
1. Real world application domain – decide what area area you want to work?
2. Raw data collected – in your case search, study and get the data sets you want to use. – Can be multiple data sets,
multiple sources. See [1] for a report how they collected data.
3. Data processed – in your case you already have a source that has collected and processed the data. But you may want
to select the features (columns) you are interested in.
4. Data cleaned – This is important step in your project. This means addressing missing values, replacing N/A with
appropriate value, adding columns for certain categories (age category) etc.
5. Exploratory data analysis: Understanding the nature of the data you have collected; range, average, summary stats,
any anomalies, any concerns? May have to redo data acquisition?
6. Modeling and analysis: Apply algorithms to analyze data: two types: statistical modeling and ML algorithms. Use
simple ones implemented in the libraries. Evaluate goodness of your modeling and analysis.
7. Visualization of the outcomes of modeling: is an important step to present you data analysis. Use appropriate
libraries and create engaging and “useful” visualization.
8. Build data products: web enabling, enterprise enabling, mobile-enabling. If it worked for “apples” data set it should
work for “oranges” data set.
9. Open it up to the world: Let users test it!
10. Let decision makers use it make data-driven decisions. What are your recommendations based on your analysis?
1. Journal everything you do by date: start with today, a google sheets or any
other journaling mechanism. Will help you with your story telling with data.
2. Do not reuse projects from other courses or completed at current your work
or previous work. Start afresh with an open mind.
3. Approach this with personal goals: how can I use this to showcase my skills,
talent and knowledge to advance my career, get fired? What can I learn from this
I have a means project? How can I improve my “intellect”?
to enforce these rules.
You’ll see how in the
Project description
4. Research and get some unique and interesting data to answer some useful questions
with broader impact? Who will this analysis affect? How is the relevant to the world today?
We’ll discuss all the other tasks of the project next lecture.
Sign up:
All: Sign up on Piazza: Can you do it now? https://piazza.com/class/kk6xmjqekrl1e9
TAs: Please set Google sheet with slots (30, 30, 10) Each slot will take 2 students.
Students: Please for a team of at most 2, sign up the google sheets asap.
Task 1 of project was explained. Work on it.
”Research” and identity a application domain
Form a data-driven questions to answer.
Form the problem statement around this question, data and the application domain.
Record it for submission.
B.RAMAMURTHY
New kinds of data from different sources: tweets, geo location, emails, blogs, IoTs,
blockchain(Hmm..??)
Three major types: structured, unstructured data and now streaming data
Structured data: data collected and stored according to well defined schema;
Realtime stock quotes
Unstructured data: messages from social media, news, talks, books, letters,
manuscripts, court documents..
“Regardless of their differences, they work in tandem in any effective big data
operation. Companies wishing to make the most of their data should use tools that
utilize the benefits of both.”5
We will discuss methods for analyzing both structured, unstructured data and
streaming data
Bioinformatics data: from about 3.3 billion base pairs in a human genome to
huge number of sequences of proteins and the analysis of their behaviors
The internet: web logs, facebook, twitter, maps, blogs, etc.: Analytics …
Financial applications: that analyze volumes of data for trends and other
deeper knowledge
Health Care: huge amount of patient data, drug and treatment data
The universe: The Hubble ultra deep telescope shows 100s of galaxies each
with billions of stars: Sloan Digital Sky Survey: https://www.sdss.org/
Algorithmic: after all we have working towards this for ever: scalable/tractable
High Performance computing (HPC: multi-core) CCR has machines that are: 16
CPU , 32 core machine with 128GB RAM: openMP, MPI, etc.
GPGPU programming: general purpose graphics processor (NVIDIA)
Statistical packages like R running on parallel threads on powerful machines
Machine learning algorithms on super computers
Hadoop MapReduce like parallel processing.
Spark like approaches providing in-memory computing models
And now stream processing infrastructure (Kafka).
Embarrassingly parallel
Mega Block level processing
Heavy
societal
Explosion involvement
Powerful
of domain
multi-core
application
processors
s
Superior
Proliferatio
software
n of devices
methodologies
Virtualization
Wider bandwidth
leveraging the
for
powerful
communication
hardware
B.Ramamurthy 2021 10/27/2021 17
Intelligence and Scale of Data
18
Google search: How is different from regular search in existence before it?
It took advantage of the fact the hyperlinks within web pages form an underlying structure
that can be mined to determine the importance of various pages.
Restaurant and Menu suggestions: instead of “Where would you like to go?” “Would you like to
go to CityGrille”?
Learning capacity from previous data of habits, profiles, and other information gathered over
time.
Collaborative and interconnected world inference capable: facebook friend suggestion
Large scale data requiring indexing
…Do you know amazon is going to ship things before you order? Here
Models
Algorithms
(thinking)
Data structures
(infrastructure)
AggregatedC Reference
ontent (Raw Structures
data) (knowledge)
Search engines
Recommendation systems:
CineMatch of Netflix Inc. movie recommendations
Amazon.com: book/product recommendations
Biological systems: high throughput sequences (HTS)
Analysis: disease-gene match
Query/search for gene sequences
Space exploration
Financial analysis
Statistical inference
Machine learning is the capability of the software system to generalize
based on past experience and the use of these generalization to provide
answers to questions related old, new and future data.
Data mining
Soft computing
Deep learning
We also need algorithms that are specially designed for the emerging
storage models and data characteristics.
7000
6000
5000
Terabytes
4000
Top ten largest databases (2007)
3000
2000
1000
0
LOC CIA Amazon YOUTube ChoicePt Sprint Google AT&T NERSC Climate
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world/
6000
5000
4000
Terabytes
3000
Top ten largest databases (2010)
2000
1000
0
LOC CIA Amazon YOUTube ChoicePt Sprint Google AT&T NERSC Climate Facebook
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world
Data integration
Meta data
Data modeling
Organizational roles and responsibilities
Performance and metrics
Security and privacy
Structured data management
Unstructured data management
Business intelligence
Data analysis and visualization
Tapping into social data
This course will provide training in emerging technologies, tools, environments and APIs available for
developing and implementing one or more of these components.
How will you collect data? Aggregate data? What are your sources?
(Eg. Social media)
How will you store them? And Where?
How will you use the data? Analyze them? Analytics? Data mining?
Pattern recognition?
How will you present or report the data to the stakeholders and
decision makers? visualization?
Archive the data for provenance and accountability.
Q1 Q2 Q3 Q4 Q5 Total
16.7 13.9 9.6 18.5 13.7 72.4
20.0 16.0 9.0 19.0 17.0 76.0
20.0 20.0 15.0 25.0 20.0 90.0
Q1 Q2 Q3 Q4 Q5 Total
16.0 14.2 9.6 19.4 14.0 73.2
80.1% 71.1% 64.0% 77.4% 70.2% 73.2%
Q1 Q2 Q3 Q4 Q5 Total
17.3 13.6 9.7 17.6 13.3 71.5
86.7% 67.8% 64.6% 70.3% 66.7% 71.5%
Question 1..5, total, mean, median, mode; mean ver1, mean ver2
B.Ramamurthy 2021 36 10/27/2021
Traditional approach 2: points vs #students
37