
© 2006 IBM Corporation

Enabling ad-hoc Analytic Text Apps with Hadoop

Rod Smith (rod.smith@us.ibm.com)

Friday, October 2, 2009


Hadoop World ’09

Emerging Technology - What do we work on?

Making Hadoop accessible to business professionals

October 2009 SWG Emerging Internet Technology IBM Software Group


New Intelligence - Big Data

Nearly 15 petabytes of data are created every day, eight times more than the information in all the libraries in the U.S.

Volume of data in enterprises is doubling approximately every 3 years (Forrester Research)
• Includes structured and unstructured data, excludes rich media

Costs to find, collect & analyze data are decreasing significantly as web innovation proceeds

Content is untapped value for business insights & intelligence


New Intelligence - New Class of Application on Horizon?

Internet Evolution: A web of data sources, services for exploring & manipulating data, and ways that users can connect them together (Tom Coates/Yahoo™)

(Diagram labels: Gather, Extract, Explore)

Enterprises recognizing potential of leveraging the broader web for business intelligence coverage - as well as for internal data

Next wave of content-centric webApps emerging
• Long(er) running data collection & analytic applications


New Intelligence - New Class of Application on Horizon?

Hear business users asking for the ability to directly manipulate, analyze & remix massive data sources & services
• LOB: “… Google whetted my appetite... I want more customizable analytics with me in the driver's seat…”

Leveraging easy-to-use, rich data manipulation metaphors like spreadsheets, etc.

Rich visualizations to quickly identify insights

(Callout: Rich Spectrum DIY Analytic Applications Emerging)

Let's Talk Customer Scenarios - BBC

BBC Digital Democracy Project
Achieving Increased Government Transparency

Business Questions
• Name names: Who is doing what, who isn't doing what
• Overlay voting record with demographic & voting records over time
• Buzz - what are people talking about?
• Visualize content relationships

Knowledge of Interest:
• Members of Parliament (MPs)
• Bills, Debates, Voting Districts

Web Content To Gather:
• UK Parliament Web Site
• Timeframe: 10+ years


Let's Talk Customer Scenarios - Thomson Reuters

Enrich Trader's Desktop Enhancement
Timely aggregation & analytics of content originating from public internet sites

Business Questions
• NewsBuzz: What are the headlines? What are not the headlines but still in focus?
• OpinionMonitor: Who is saying what? What are the debate topics?
• NewsTimeline: Chronology (pulse) of headline news?
• TopicCloud: Tag based topic metrics
• IssueAnalytics: Link backs to semantically related news

Scenario
• Gather unstructured data from anywhere between 200 and 2000 data sources - every 15 minutes
• Perform preprocessing (search, transform, index) over each source
• Publish harvested content for distributed content services and downstream Mashups

Knowledge of Interest:
• People, places, events

Web Content To Gather:
• ~118 3rd Party Financial News Services and Blogs, including: BBC, CNN, Yahoo News, Financial Times, NY Times, The Big Picture, Fox News, PR Newswire, Market Watch, World Press, Forbes, Google News, Wall Street Journal, MSNBC, The Sun, ZDNet


IBM Emerging Technology Project: M2

What is it?
An insight engine for enabling ad-hoc business insights for
business users - at web scale

How does it work?

Discovery Process
1. Point M2 to data sources of interest
• Unstructured web data, feeds, XML, etc.

2. Transform data into a form that can be analyzed
• Unstructured data becomes semi-structured data
• Example: name: Rod Smith, employer: IBM, state: GA
• Apply analytics - enriching the data

3. “What if tooling” - browser-based visual front end - spreadsheet metaphor to create worksheets for exploring/visualizing the data
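
As a rough illustration of step 2, the sketch below turns a line of free text into a semi-structured record shaped like the slide's example. The sentence pattern, the regular expression, and the `extract_record` name are assumptions made for illustration only; the slides do not describe how M2 performs this transformation.

```python
import re
from typing import Optional

# Hypothetical pattern: "<First Last> works for <Employer> in <STATE>".
# Real unstructured web text would need far more robust extraction.
PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) works for "
    r"(?P<employer>[A-Z][A-Za-z]*) in (?P<state>[A-Z]{2})"
)


def extract_record(text: str) -> Optional[dict]:
    """Return a semi-structured record if the text matches, else None."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None


record = extract_record("Speaker bio: Rod Smith works for IBM in GA.")
print(record)
```

Once text is in this keyed form, later steps (analytics, worksheets) can treat each field as a column.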

What's different?
• Unlocking insights embedded in unstructured data
• Analyzing data previously unavailable to analyze


M2 -> Demo

Project: Improve IP Portfolio Analysis for Mergers & Acquisitions
“...please collect all US Patent filings… then let’s do…”

Business Questions
• How much is a target company worth?
• What are the high-value areas of their portfolio?
• Explored cited patent topics, litigated patents

Knowledge of Interest:
• Patents ranked by citation - e.g. how often a patent was referenced determines value
• Corporate genealogies IP ownership roll-up
• Augment analysis with items affecting IP value, inventor affiliation, citation rank by time

Web Content To Gather:
• Gathered 1.4m patent docs from USPTO
• 1991-2007 case records from the United States Court of Appeals for the Federal Circuit (CAFC)
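
The "patents ranked by citation" idea can be sketched in a few lines. This is a minimal illustration, assuming citation data is already available as (citing, cited) pairs; the patent IDs and the `rank_by_citation` name are made up here and do not reflect the demo's actual data model.

```python
from collections import Counter

# Hypothetical (citing_patent, cited_patent) pairs; the IDs are invented.
citations = [
    ("P4", "P1"), ("P5", "P1"), ("P5", "P2"),
    ("P6", "P1"), ("P6", "P2"), ("P6", "P3"),
]


def rank_by_citation(pairs):
    """Count how often each patent is cited and list the most cited first."""
    counts = Counter(cited for _citing, cited in pairs)
    # Sort by descending citation count, then by ID so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_citation(citations)
print(ranking)
```

Adding a filing year to each pair would allow the same count to be grouped by time, in the spirit of "citation rank by time".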


What's Under the Covers: Hadoop

Emergence of map/reduce programming model for a new class of webApp

Hadoop: provides a framework for large scale parallel processing map/reduce apps (Apache project led by Yahoo)
• Offers simplicity of “programming” - looks like a simple single threaded app model for developers
• Handles big data - scalable storage across machine clusters (think read-only file system)
• Deployment: no application knowledge of runtime or OS or cloud necessary
• Today - setting up, coding Hadoop jobs in Java, etc. is the domain of skilled Java engineers
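
To make the map/reduce model concrete, here is a small single-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It uses plain Python rather than Hadoop's Java API, so it only illustrates the programming model; the function names are illustrative, and nothing here is distributed across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = run_job(["big data big insights", "data at web scale"])
print(counts)
```

The developer writes only `map_phase` and `reduce_phase` as simple single-threaded functions; in a real framework the grouping and parallel execution are handled for them.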


IBM Emerging Technology Project: M2 Architectural Components

Expanding upon the Hadoop stack
• Visual tooling builds extensively on Pig

M2 Architecture Characteristics:
• Extensible via UDFs
• REST API for customer choice of analytic service/engine
• REST API for choice of visualization packages
• Export content as feeds, XML, etc.
• ...more to come
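
As one hedged illustration of the "export content as feeds, XML" characteristic, the sketch below serializes semi-structured records to XML using Python's standard library. The element names (`records`, `record`) and the `records_to_xml` function are assumptions for this example; the slides do not specify M2's export format.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records):
    """Serialize a list of flat dict records into one XML string."""
    root = ET.Element("records")  # hypothetical root element name
    for record in records:
        item = ET.SubElement(root, "record")
        for field, value in record.items():
            # Each field becomes a child element holding the value as text.
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = records_to_xml(
    [{"name": "Rod Smith", "employer": "IBM", "state": "GA"}]
)
print(xml_text)
```

This simple mapping assumes field names are valid XML element names; a production exporter would need to validate or escape them.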


Conclusions

In God we trust


…all others bring data


Enterprises quickly evolving their thinking from a Database strategy to a Data Strategy encompassing unstructured & structured content

Repeatable business patterns in broad range of industries emerging

Hadoop has potential to be the platform for broad range of solutions from web-based analytics -> business event processing -> collaboration


Almost The End

Selecting customer proof of concept projects

INTERESTED?
www-01.ibm.com/software/ebusiness/jstart/about.html
