An IBM-TechAmerica Event

TECHAMERICA BIG DATA COMMISSION

DEMYSTIFYING BIG DATA Washington, DC
November 14, 2012

Big Data and Analytics at the IRS

Jeff Butler
Director, Research Databases
IRS, Research, Analysis & Statistics
November 14, 2012

Presentation Agenda

• IRS business environment
– Business processes, enterprise data, and systems
• Research and analysis at the IRS
– Examples of analytics
– Methods and techniques
– Skills and system requirements
• Big Data environment for IRS Research
– Volume, variety, velocity
– Systems, architecture, tools
– Information quality strategy
• Best practices and lessons learned
• Big Data challenges
• Five myths about Big Data and Analytics

IRS Business Environment


Business Processes Data Environment Systems & Operations

• Tax Returns
– 234 million tax returns filed
– 1.8 billion third-party information returns received
• Accounts Management
– $2.4 trillion in gross receipts
– 122 million refunds totaling $415 billion
• Customer Service
– 319 million visits to the IRS website
– 83 million toll-free telephone calls
• Enforcement
– 223 million letters or notices sent to taxpayers
– $116 billion in accounts receivable

IRS Business Environment


Types of Data
– Forms, Schedules, Worksheets, Attachments, Images, Correspondence, Transactions, Phone Calls, Notices, Transcripts
– Structured, Unstructured

Tax Returns, Information Returns, Customer Accounts, Case Management, Third Party

Sources of Data
– Taxpayers, Employers, Preparers, Banks, Brokers, Non-Profits, Interagency, Fed/State, Treaty Partners, Intermediaries
– Service, Enforcement

IRS Business Environment


Business Processes
• Tax Returns
• Accounts Management
• Customer Service
• Enforcement
• Other

Data Systems and Applications
• Tax Processing
– Return Submission, Refunds, Math Errors, Issue Resolution, Settlements, State Exchange
• Case Management
– Examination, Appeals, Collection, Underreporter, Criminal Investigation
• Customer Accounts
– Transactions, Notices, Correspondence, Telephone, Walk-in Centers, Web Service

IRS Business Environment

• Over 450 separate systems or applications in the IRS
• Data are stored in different formats (flat files, XML, databases, VSAM) and on multiple platforms
• Separate authorization policies for system access
• Most systems designed for operational processing, not research or analytics

Case for Analytic Data Environment
• Cost of compiling data from multiple enterprise systems is too high
• Enterprise tools are not suited for advanced analytics
• Operational data systems are not designed to isolate resource-intensive computation for research and analysis
• Different skill sets are needed for analytics
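
One way to picture the cost of compiling data stored in different formats is the sketch below. It assumes a hypothetical two-field record (`tin`, `amount`) that arrives once as a pipe-delimited flat file and once as XML; the field names and sample values are illustrative only and are not IRS schemas. Each format needs its own reader before the records can be analyzed together.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample inputs; field names are illustrative, not IRS schemas.
FLAT_FILE = "tin|amount\n001|250.00\n002|75.50\n"
XML_DOC = "<returns><return tin='003'><amount>10.25</amount></return></returns>"


def read_flat_file(text):
    """Parse a pipe-delimited flat file into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return [{"tin": row["tin"], "amount": float(row["amount"])} for row in reader]


def read_xml(text):
    """Parse <return> elements into the same dict layout as the flat file."""
    root = ET.fromstring(text)
    return [
        {"tin": el.get("tin"), "amount": float(el.findtext("amount"))}
        for el in root.iter("return")
    ]


def compile_records():
    """Combine both sources into one list with a common structure."""
    return read_flat_file(FLAT_FILE) + read_xml(XML_DOC)


records = compile_records()
total = sum(r["amount"] for r in records)
print(len(records), total)
```

With 450 systems, every additional format or platform adds another reader of this kind, which is the integration cost an analytic data environment tries to pay once.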

IRS Research Environment

Research & Analytics in the IRS

Taxpayer Behavior
• Failure to file or remit payment
• Abusive tax shelters
• ID theft
• Return preparer non-compliance
• Misreporting of income and deductions
• Refund fraud
• Off-shore transactions

Examples of Analytics
• Predict patterns of filing and payment compliance
• Estimate U.S. tax gap
• Measure taxpayer burden
• Identify fraud and ID theft
• Simulate impact of legislative changes on taxpayer behavior
• Optimize case management inventories
• Analyze taxpayer networks and their structural relationships
• Develop workload allocation models

Research & Analytics in the IRS

Methods and Techniques
• Regression-based methods (GLM, logistic, quantile, non-linear, proportional hazards)
• Social network analysis and graph theory
• Machine learning (neural networks, SVMs, genetic algorithms)
• Time series analysis
• Multivariate statistical methods (discriminant analysis, clustering, density estimation, factor analysis)
• Simulation (Monte Carlo, MCMC, agent-based modeling)
• Decision trees (CART, CHAID, C5, hybrids)
• Bayes rules and other classifiers
• Sampling and survey estimation

Education and Skills
• Economics
• Statistics
• Mathematics
• Computer science
• Operations research
• Physics
• Behavioral sciences

Research & Analytics in the IRS

Data and Systems Requirements (•) and Solutions Checklist (–)

• Fast integration of data from a variety of sources in a format conducive to analysis
– Does the data model allow for fast load times and high performance for massively large data?
• Comprehensive set of searchable metadata
– Are data elements easy to find and understand?
• Dynamic storage that supports user-defined data structures
– Do flexible storage management protocols support user-created data?
• High-performance database designed for analytics
– Is the right database available for fast analytics on massively large volumes?
• Tools for queries, advanced analytics, and visualization
– Are there tools in the right place for the right job?
• Systems to support massively large computing tasks
– Does the systems infrastructure support resource-intensive computation?
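
As an illustration of the "searchable metadata" requirement, the following is a minimal sketch of a keyword search over column definitions. The `ColumnMetadata` class, the `search` function, and the sample catalog entries are all hypothetical; the slides do not describe how the actual metadata repository is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    """One catalog entry; the fields here are illustrative assumptions."""
    table: str
    column: str
    definition: str
    attributes: dict = field(default_factory=dict)


def search(catalog, keyword):
    """Return entries whose column name or definition contains the keyword."""
    needle = keyword.lower()
    return [
        m for m in catalog
        if needle in m.column.lower() or needle in m.definition.lower()
    ]


# Hypothetical sample catalog.
CATALOG = [
    ColumnMetadata("returns", "agi_amt", "Adjusted gross income reported on the return"),
    ColumnMetadata("returns", "filing_status_cd", "Filing status code",
                   {"lookup": "filing_status"}),
    ColumnMetadata("notices", "notice_dt", "Date the notice was sent"),
]

hits = search(CATALOG, "status")
print([f"{h.table}.{h.column}" for h in hits])
```

A real repository would likely index definitions and attributes rather than scan them, but the user-facing question is the same: can an analyst find the right column from a plain-language term?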

Big Data Environment

IRS Research: Big Data Environment
Compliance Data Warehouse (CDW)

Overview and Capabilities
• Data from over 30 sources, including tax returns, customer accounts, case management systems, and third parties
• Computing environment for resource-intensive processing and user-defined data structures
• Metadata for over 32,500 columns that includes definitions, lookup tables, cross-references, and other artifacts
• Tools for a variety of analytics, including SAS, SQL, R, Stata, Hyperion, and ArcGIS
• Web services for data profiling, reports, SSN masking, and password management
• Training and support to ensure efficient use of systems
• Nearly 1,000 users from IRS, Treasury, Congress, and universities
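
The slides list SSN masking among the web services but do not say how it is done. As one possible approach, and purely as an assumption, the sketch below replaces an SSN with a keyed hash (HMAC-SHA256) so that the same SSN always maps to the same masked value and records can still be joined. The key, the function name `mask_ssn`, and the 16-character output length are hypothetical choices.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would be stored and rotated securely.
SECRET_KEY = b"demo-key-not-for-production"


def mask_ssn(ssn: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for an SSN using a keyed hash."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    if len(digits) != 9:
        raise ValueError("expected a 9-digit SSN")
    digest = hmac.new(key, digits.encode("ascii"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability in this sketch


masked_a = mask_ssn("123-45-6789")
masked_b = mask_ssn("123456789")
print(masked_a == masked_b)
```

Because formatting is stripped before hashing, the dashed and undashed forms of one SSN produce the same masked value, which keeps joins across tables consistent.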

IRS Research: Big Data Environment
Compliance Data Warehouse (CDW)

Key Features

• Number of key data sources: 32
• Number of database tables: 1,985
• Number of columns: 46,150
• Number of columns with searchable metadata: 32,510
• Number of metadata-column attributes: 715,220
• Total database storage: 460 TB
• Total disk storage: 1.2 PB
• Number of user accounts: 920
• Average daily concurrent connections: 840
• Average daily database queries: 6,500
• Average daily database queries from the website: 1,200
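
A few ratios follow directly from the Key Features figures. The short calculation below uses only numbers stated on the slide; the variable names are illustrative.

```python
# Figures taken from the Key Features slide.
total_columns = 46_150
columns_with_metadata = 32_510
metadata_attributes = 715_220
daily_queries = 6_500
daily_web_queries = 1_200

# Share of columns that have searchable metadata.
metadata_coverage = columns_with_metadata / total_columns
# Average number of metadata attributes per documented column.
attributes_per_column = metadata_attributes / columns_with_metadata
# Share of daily queries that originate from the website.
web_query_share = daily_web_queries / daily_queries

print(f"Metadata coverage:     {metadata_coverage:.1%}")
print(f"Attributes per column: {attributes_per_column:.1f}")
print(f"Web query share:       {web_query_share:.1%}")
```

Roughly 70% of columns carry searchable metadata, each documented column averages 22 attributes, and a little under a fifth of daily queries come through the website.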

IRS Compliance Data Warehouse
Data Availability (Volume)
[Chart: Storage Volume, in Terabytes (Available vs. Used), 2007 to 2013]
[Chart: Number of User Database Tables, 2007 to 2013]

• CDW is the largest database in the IRS
• Over 1,000% increase in data and storage in the past 8 years
• Challenge: Is data growing faster than the resources needed to support it?
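
To put the growth figure in annual terms, the sketch below converts "over 1,000% in 8 years" into a growth multiple and an implied yearly rate. It assumes steady compound growth, which the slide does not claim, and treats 1,000% as a lower bound.

```python
# Figures from the slide; 1,000% is stated as a lower bound ("over").
PERCENT_INCREASE = 1000
YEARS = 8

# A 1,000% increase means the final size is 11 times the starting size.
growth_multiple = 1 + PERCENT_INCREASE / 100

# Implied annual rate if growth compounded evenly each year (an assumption).
annual_growth = growth_multiple ** (1 / YEARS) - 1

print(f"Growth multiple: {growth_multiple:.0f}x")
print(f"Implied annual growth: {annual_growth:.1%}")
```

Under that assumption, data and storage grew by about 35% per year, which is the rate that budgets and staffing would need to match to keep pace.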

IRS Compliance Data Warehouse
Data Timeliness (Velocity)
[Chart: Frequency of Data Release, 2004 to 2012]
[Chart: Extract-to-Load Latency, in Days, 2005 to 2011]

• Continued shift to higher frequency data releases for research and analytics
• In 2005, it took over 4 months to load a full year of tax return data vs. 10 hours today
• Challenge: What are the limits of real-time replication in heterogeneous environments?
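
The load-time improvement can be expressed as a single speedup factor. The calculation below assumes a 30-day month to convert "over 4 months" into hours, so the result is an approximate lower bound rather than a measured value.

```python
HOURS_PER_DAY = 24
ASSUMED_DAYS_PER_MONTH = 30  # assumption used to convert months to hours

# "Over 4 months" in 2005, so this is a lower bound on the earlier load time.
load_2005_hours = 4 * ASSUMED_DAYS_PER_MONTH * HOURS_PER_DAY
# "10 hours today" (2012).
load_2012_hours = 10

speedup = load_2005_hours / load_2012_hours

print(f"2005 load time: at least {load_2005_hours} hours")
print(f"Speedup: at least {speedup:.0f}x")
```

On these assumptions, a full-year load became at least 288 times faster between 2005 and 2012.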

IRS Compliance Data Warehouse
System and Database Usage
[Chart: Average Daily Connections (Accounts, Daily Connections), 2004 to 2012]
[Chart: Average Daily Database Queries (Database Queries, Web Queries), 2004 to 2012]

• Usage driven by new data, increased literacy in SQL, new tools, and web services
• New users in Treasury, Joint Committee on Taxation (Congress), universities
• Challenge: What is the right mix of analytic and operational use?

IRS Compliance Data Warehouse
Metadata Management
[Chart: Number of Columns with Metadata (Columns vs. Metadata), 2005 to 2013]

Metadata Matters
• One of the most important ingredients for data warehousing success
• Touches every part of the data supply chain
• Informs and guides decision making
• User satisfaction is highly correlated with robust, accessible metadata
• New frontier: Real-time data profiling at the metadata layer

• CDW has the largest structured metadata repository in the IRS
• In 2012, more than 32,500 columns, each with over 20 separate metadata attributes
• Challenge: How to minimize the lag time between data and metadata releases?
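
Data profiling at the metadata layer can be pictured as computing a few summary statistics per column and storing them alongside the column's definition. The sketch below is a minimal illustration; the function name `profile_column` and the chosen statistics are assumptions, not a description of the CDW profiling service.

```python
def profile_column(values):
    """Compute basic profile statistics for one column of values."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),                    # total rows seen
        "nulls": len(values) - len(non_null),    # missing values
        "distinct": len(set(non_null)),          # distinct non-null values
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


# Hypothetical column values for illustration.
profile = profile_column([10, 20, None, 20, 5])
print(profile)
```

Publishing a profile like this with each data release is one way to narrow the gap between when data arrives and when users can understand it.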

IRS Compliance Data Warehouse
Data Supply Chain

Source Systems (Database, Flat File, XML, VSAM) -> Extract -> Staging (Transform, Validate, Roll-Ups) -> Load -> DW -> Query, Analysis, Reporting

Metadata: Source Metadata, ETL/T Metadata, Data Model Metadata, Report Metadata

Central Metadata Repository – Web Accessible
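
The stages named in the supply chain (extract, transform, validate, load, roll-up) can be sketched end to end in a few lines. This is a toy illustration using an in-memory SQLite database; the table name, field names, validation rule, and sample rows are all hypothetical, and the actual CDW pipeline and its stage ordering are not described in the slides.

```python
import sqlite3

# Hypothetical source rows; the last one is intentionally invalid (empty tin).
SOURCE_ROWS = [
    {"tin": "001", "tax_year": "2011", "amount": " 250.00 "},
    {"tin": "002", "tax_year": "2011", "amount": "75.50"},
    {"tin": "", "tax_year": "2011", "amount": "10.00"},
]


def extract():
    """Pull raw rows from the source system (here, a static list)."""
    return list(SOURCE_ROWS)


def transform(row):
    """Trim text and convert strings to typed values."""
    return {
        "tin": row["tin"].strip(),
        "tax_year": int(row["tax_year"]),
        "amount": float(row["amount"]),
    }


def validate(row):
    """Keep rows with a non-empty identifier and a non-negative amount."""
    return bool(row["tin"]) and row["amount"] >= 0


def load(conn, rows):
    """Insert validated rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS returns (tin TEXT, tax_year INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO returns VALUES (:tin, :tax_year, :amount)", rows)
    conn.commit()


def roll_up(conn):
    """Summarize loaded rows by tax year."""
    return conn.execute(
        "SELECT tax_year, COUNT(*), SUM(amount) FROM returns GROUP BY tax_year"
    ).fetchall()


def run_pipeline():
    conn = sqlite3.connect(":memory:")
    staged = [transform(r) for r in extract()]
    valid = [r for r in staged if validate(r)]
    load(conn, valid)
    return roll_up(conn)


summary = run_pipeline()
print(summary)
```

Each function is a natural point to emit its own metadata (source layout, transformation rules, table definitions, report definitions), which is what a central repository would collect.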

IRS Compliance Data Warehouse
System and Data Accessibility

• Database Servers (Sybase IQ, Oracle, SQL Server)
• Shared Storage (>1PB): DB, Backup, Staging, User
• Application/Web Servers (SAS, R, Hyperion)
• IRS Network: SAS, R, SQL, ODBC/JDBC, Hyperion, ArcGIS

IRS Compliance Data Warehouse
Systems Infrastructure
• Servers
– 2005-2009: 2x CPU, 192GB RAM
– 2009-2012: 4x CPU, 512GB RAM
– 2012-2014: 8x CPU, 1024GB RAM
• Storage
– 2005-2009: 256MB drives, 1-2Gb/s I/O
– 2009-2012: 1GB drives, 2-4Gb/s I/O
– 2012-2014: 3GB drives, 8Gb/s I/O
• Networking
– SAN, bus adapter, and backplane speeds are critical for I/O-bound tasks
– Most high-volume queries are I/O bound

Strategy for High Performance
• Continue to leverage Moore’s and Kryder’s laws
• Find opportunities to improve network throughput for I/O-bound tasks
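
A rough calculation shows why link speed dominates I/O-bound queries. The sketch below estimates the time to move a hypothetical 1 TB scan across links at the I/O rates listed on the slide. It ignores protocol overhead, caching, compression, and parallel paths, so the numbers are idealized lower bounds, and the 1 TB scan size is an assumption.

```python
TB = 10**12  # bytes, decimal terabyte


def scan_seconds(data_bytes, link_gbps):
    """Idealized transfer time: total bits divided by link rate in bits/second."""
    bits = data_bytes * 8
    return bits / (link_gbps * 1e9)


# Link rates (Gb/s) drawn from the storage generations on the slide.
for gbps in (2, 4, 8):
    seconds = scan_seconds(1 * TB, gbps)
    print(f"{gbps} Gb/s: {seconds:,.0f} s (~{seconds / 60:.0f} min)")
```

Even at 8 Gb/s, an unfiltered 1 TB scan takes on the order of 17 minutes under these idealized assumptions, which is why throughput improvements matter more than CPU for such queries.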

IRS Compliance Data Warehouse
Web Services: Metadata and Data Profiling

Best Practices and Lessons Learned

IRS Research: Big Data Environment
Best Practices and Lessons Learned

Strategies for Big Data and Analytics
• Build multi-disciplinary teams that combine analytic skills (statistics, machine learning) with IT skills (system and database administration)
• Maintain a focus on data quality that includes easily accessible metadata, web-based data profiling, and online feedback and collaboration capabilities
• Create simple data models that are conducive to the widest possible variety of analytics
• Implement right-size governance that allows for rapid change management
• Avoid investments in solutions that are in search of a problem
• Develop a culture that tolerates non-linear processes and controlled disruption

IRS Research: Big Data Environment
Best Practices and Lessons Learned

Challenges
• I/O bottlenecks from off-loading data from the database to the application server
– Software vendors must push more analytic APIs into the database
– Hardware vendors must provide faster disk speeds and network throughput rates
• Continued shift in costs to software, labor, and security
• Legal or administrative policies that inhibit interagency data exchange
• Growth of data outpacing the ability to manage data quality
• Managing dual goals of safeguarding privacy of data while expanding access to new information
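
The off-loading bottleneck can be illustrated by comparing how many rows cross the database boundary when aggregation happens in application code versus inside the database. The sketch below uses an in-memory SQLite database with a hypothetical `payments` table and synthetic rows; it is a toy comparison, not a benchmark of any system named in the slides.

```python
import sqlite3


def build_demo_db(n_rows=1000):
    """Create an in-memory table of synthetic payments split across two regions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (region TEXT, amount REAL)")
    rows = [("east" if i % 2 == 0 else "west", float(i)) for i in range(n_rows)]
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    return conn


def offloaded_totals(conn):
    """Pull every row to the client, then aggregate in application code."""
    totals = {}
    rows_moved = 0
    for region, amount in conn.execute("SELECT region, amount FROM payments"):
        rows_moved += 1
        totals[region] = totals.get(region, 0.0) + amount
    return totals, rows_moved


def in_database_totals(conn):
    """Aggregate inside the database and return only the summary rows."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM payments GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


conn = build_demo_db()
client_totals, rows_moved_client = offloaded_totals(conn)
db_totals, rows_moved_db = in_database_totals(conn)
print(rows_moved_client, rows_moved_db, client_totals == db_totals)
```

Both approaches give the same totals, but the in-database version returns 2 rows instead of 1,000. At warehouse scale, that difference is the I/O that pushing analytics into the database is meant to avoid.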

Contact Information
• If you have further questions or comments:

Jeff Butler
Director, Research Databases
Internal Revenue Service
Research, Analysis, and Statistics
jeff.butler@irs.gov
