
White Paper

Managing Big Data for the Intelligence Community

http://wso2.com Version 1 (May 7 2013)



Table of Contents
Chattering Predators
Moving at the Speed of War: Defeating the Data Volume Problem
Analytics for the IC: An Architectural Concept
Functional Review: How it All Fits Together
The Question of Integration
Conclusion

Military and civilian intelligence organizations pioneered methodologies for storing, retrieving, analyzing and
exploiting large and disparate data sets. These data sets were often huge (for the times), geographically dispersed,
subject to extraordinary security measures and composed of a widely diverse set of data types. Today, these
conditions are the norm in the world of commercial big data. Driven by promises of a high return on investment
(ROI), widespread adoption, other indirect benefits or some combination of the three, the global information
technology industry has evolved powerful tools for the storage, management and analysis of what has come to be
known as “Big Data.”

Unfortunately, much of the rapid development in Big Data analytics has left the intelligence community (IC) behind.
Far from reflecting an acquisitions failure, the IC’s technological conservatism rests on entirely rational and
responsible grounds. Government is not in the business of making a profit, and is therefore less
concerned with ROI than with total cost of ownership (TCO) over time. Additionally, the IC already possesses
effective tools, developed over many years and at great expense, to handle data ingestion, transport, storage,
analysis and exploitation. Recent events, however, have forced IC executives and CIOs to re-examine roadmaps for
existing tools. The remainder of this paper explores the big data challenges facing the IC and offers a strawman
solution architecture based on modern, open source data management and analytics tools.

Chattering Predators
The world prior to September 11, 2001 was well defined from an intelligence perspective. The United States was the
dominant global military power, oil prices were stable and potentially hostile actors were largely confined to known
nation states including North Korea, Libya and Iraq (with the odd narcotics trafficker thrown in for good measure).
China was still essentially landlocked from a military perspective and appeared to be more interested in growing its
economic power than challenging the United States and the world. Post-Cold War Russia was in the throes of an
economic slump and, despite occasional bluster and saber rattling, was not in a position to exert strong influence
on the global stage. Given American ascendancy at the time, acquisitions organizations logically assumed that the
existing collection and analysis tools were, and had been, sufficient.

All of that changed on September 11, 2001. Intelligence collection mechanisms premised on targeting (at least
somewhat) rational national entities and their conventional military organizations were too few and far between to
address an exponentially larger number of non-state actors. In addition to the sheer volume of target organizations,
there were stark differences between the way conventional forces were organized and the non-state actors’
operational paradigm. As a result, many existing systems and methodologies were found to be inefficient, taxed
beyond their planned capacities or simply inappropriate for the task at hand.

Solutions, however, were not far away. Beginning in the mid-1980s, the United States military began fielding
unmanned aerial vehicles (UAV). First fielded on a relatively modest scale, UAVs were deployed by the US Navy,
Marine Corps and Army. Initially deployed on battleships to provide gunnery spotting for 16-inch naval gunfire, their missions
rapidly evolved to include reconnaissance and surveillance. By the end of the first decade of the 21st century, the US
military would field tens of thousands of UAVs, ranging from six inch “micro air vehicles” (MAV) to the three foot long
RQ-11B Raven, the ubiquitous MQ-1 Predator and MQ-9 Reaper aircraft, each larger than a private corporate jet, and
the RQ-4 Global Hawk, which is comparable in size to a medium-range airliner.

While Predator and Reaper aircraft have garnered the lion’s share of media attention due to their high profile use
as weapons platforms, the UAVs’ primary purpose remains reconnaissance and surveillance. Even the most modest
UAV platforms are capable of relaying still imagery and real-time full motion video of the battlespace using imaging
technologies such as electro-optical (EO) and forward looking infrared (FLIR). Larger platforms, like Predator, may
add synthetic aperture radar (SAR) to their sensor packages. As a result of mating these sensor technologies with
UAV flight platforms, battlespace situational awareness has never been greater. The non-state actors’
lack of a targetable infrastructure had previously rendered them immune to traditional warfighting methodologies.
The proliferation of UAVs gave battlefield commanders the ability to conduct long duration surveillance of specific
locations, structures or individuals. There is literally no place left to hide.

As this paper is being written, there are literally thousands of remotely piloted and autonomous aircraft aloft over
Afghanistan. Each aircraft’s sensors collect immense amounts of data. And the amount of data collected is increasing
at a prodigious rate. In 2009, American UAVs generated 24 years’ worth of video surveillance. By 2011, the amount
of video data collected per year had jumped to 720 years. This geometric progression, which eclipses Moore’s Law as
applied to the world’s data, is in no small part due to the rapid evolution of surveillance and reconnaissance sensors.
The US Air Force’s Gorgon Stare, for example, is a persistent wide area aerial surveillance system (WAAS) fielded in
early 2011, which mounts nine high resolution video cameras in a spherical array. The cameras provide continuous
coverage of areas the size of a city and can transmit up to 65 different images to different users on the ground.

Video and imagery surveillance data is not the only data type that has significantly increased in volume. Recognizing
that hostile non-state actors relied heavily on email, cell phones and social media to coordinate and conduct
their operations, the United States began to field new systems designed to collect vast amounts of signals
intelligence (SIGINT). SIGINT comprises two disciplines: the interception of signals between people, which is called
communications intelligence (COMINT) and the interception of electronic signals not directly used in communication
(e.g., radar transmissions), which is called electronic intelligence (ELINT).

These new SIGINT systems operate against hostile non-state actors from all four domains (land, air, sea and space)
with capabilities ranging from traffic analysis (i.e., examining the patterns and volume of intercepted communications) to detailed analysis
of decrypted voice and data communications. In that capacity, they collect and analyze signals across a broad
spectrum. The latest version of the US Army’s AN/MLQ-40(V)4 “Prophet” system, for example, has the capability to
detect, monitor, identify and selectively exploit radio frequency (RF) signals (including cell phone origin signals) for
situational awareness information twenty-four hours a day. It can also send the exploited information to units across
the battlefield using organic communications capabilities. Prophet has been credited with being the decisive factor in
battles across Iraq and Afghanistan.

The Prophet system has an airborne collection variant which greatly increases the amount of data collected.
However, it is far from the only airborne SIGINT asset. Typical of the Army’s airborne collection assets is the RC-12X
Guardrail aircraft. Guardrail is a highly modernized Beechcraft C-12 Huron aircraft developed by Northrop Grumman.
It mounts the Advanced Signals Intelligence Payload (ASIP), which also serves aboard Air Force manned U-2 and
unmanned Global Hawk aircraft. ASIP systems ingest and process huge amounts of signals data. Some of this data
is instantly processed to provide precision threat identification and location information, which is immediately
transmitted to ground combat organizations. The data is also downloaded upon landing for further exploitation by
analysts.

Moving at the Speed of War: Defeating the Data Volume Problem


As can be seen, the IC has largely solved issues associated with intelligence collection. Unfortunately,
greatly improved collection capabilities created a downstream problem for the analyst community tasked
with deriving actionable meaning from the amassed data. Commenting on the issue, former vice chairman
of the Joint Chiefs of Staff, General James Cartwright noted “Today an analyst sits there and stares at Death
TV for hours on end, trying to find the single target or see something move. It’s just a waste of manpower.”
Cartwright’s larger point is extremely well taken. The current toolsets used by analysts are time and labor
intensive and are often limited with respect to the source materials to which they have access. As a result,
the generation of useful intelligence from the collected data is often both arduous and delayed. The
challenge, as Cartwright put it, is to get the intelligence process to move at the “speed of war.”

There’s more to the challenge, however. Battlefield intelligence is about empowering edge organizations
through rapidly shared situational awareness. The most perfect set of collection and analysis tools fails
if the information derived from the collected data is not transformed into actionable knowledge in the
warfighter’s hands. Put another way: it doesn’t matter if the analyst at a Prophet or Guardrail ground
station is aware that a highly sought Very Bad Guy (VBG) is just around the corner from a patrolling infantry

company. It only matters if that knowledge is conveyed to the infantry company commander in a timely
manner and in a form in which it can be readily applied.

Deriving and sharing knowledge at the speed of war demands proliferation of both intelligence information
and systems that process and manage the information to the lowest operational echelons. However, this
goal has proven elusive for a number of reasons:

• Difficulties in sharing intelligence data across system, organizational or service boundaries;
• The acquisition costs associated with specialized information processing tools;
• The need for specialized hardware to accommodate the information processing; and
• The perceived burden to line combat units associated with managing specialized equipment or
information.

In other words, while there is broad agreement that both intelligence activities and information should
be democratized, with siloed intelligence support activities being more broadly distributed across a flatter
organizational landscape, there are concerns as to how to achieve those goals. Significant efforts, many of
which have borne fruit, have already been made to address these concerns. One example is the Defense
Intelligence Information Enterprise (DI2E). DI2E is a common, cross-agency and cross-services environment
that’s designed to allow users to access intelligence resources from across the IC spectrum. DI2E’s core
requirement is to integrate systems, information, teams and tools that were formerly disconnected in a
manner that is secure and offers round the clock, global availability.

DI2E is intended to fill three basic needs: a common technology base to unite disparate applications
and systems, a compliance testing framework for different intelligence applications and tools, and a
virtual storefront that will make it easier for users to discover and download useful intelligence-oriented
applications.

While it addresses the data sharing issue, DI2E does not, unfortunately, directly respond to General
Cartwright’s observation about Death TV. Just because the analyst has assured access to more data doesn’t
necessarily mean that the efficiency with which the analyst produces actionable knowledge has been
improved. What the analyst requires is an automated means to apply business rules describing events,
persons and concepts of interest to the mass of available data. Fortunately for the IC, industry has been
grappling with many of the same problems. These have been addressed through the employment of event
processing and analytics frameworks that leverage commodity hardware and open source tools.


Analytics for the IC: An Architectural Concept


On the face of it, the problem is relatively easy to describe. There is a finite number of intelligence
data sources. Data from these sources must be extracted, transformed and then loaded into an analytics
engine for both storage and analysis. The relevant results of the analysis must then be provided to the
analyst’s tools in the most efficient manner possible.

Figure 1, Intelligence Analytics Architectural Concept

As can be seen in Figure 1, above, there are a number of components required to produce the data
movement and processing necessary to feed and execute the automated analytics cycle. These
components fall into three categories:

• Collection and Ingestion;
• Enterprise SOA Platform; and
• Operational Data Storage and Management.

A brief explanation of each category and component is helpful in navigating the architecture.

a. Collection and Ingestion

The United States fields the most impressive array of technical intelligence collection assets that the
world has ever seen. Most, if not all of these assets have dedicated ground stations where the data
is initially analyzed, sorted, collated and either stored or sent to an organizational data center. These
storage sites house a wealth of intelligence information. However, in order for this information to be
of use to the warfighter, it must be extracted from the asset-specific (or organization-specific)
storage and either transformed for submission to an analysis engine or streamed by an agent program
to an event processing engine.

i. Data Service or API

Data services operate on known data structures with the goal of exposing stored data as
readily manipulated XML. By doing so, they ensure loose coupling and deliver data to
consuming components in an easily ingestible format.

Application Programming Interfaces (API) are used to provide controlled access to
organizational assets such as data or capabilities. They are essentially specialized web services
developed and promulgated by asset owners and used by potential consumers. Typical use
cases include mobile apps that access APIs to provide content to a smartphone.
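To make the ingestion path concrete, the following minimal sketch shows a consumer pulling records from a hypothetical data service that returns intercept data as XML. The endpoint URL, element names and record identifiers are illustrative assumptions, not part of any fielded system; a real deployment would publish its own service contract and enforce authenticated access through the platform’s identity and access management component.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class InterceptFeedClient {

    public static void main(String[] args) throws Exception {
        // Hypothetical data service endpoint exposing stored intercepts as XML.
        URL endpoint = new URL("https://groundstation.example.mil/services/intercepts?since=2013-05-01");
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestProperty("Accept", "application/xml");

        try (InputStream in = conn.getInputStream()) {
            // Parse the XML payload returned by the data service.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);

            // Element and attribute names are illustrative only.
            NodeList records = doc.getElementsByTagName("intercept");
            for (int i = 0; i < records.getLength(); i++) {
                System.out.println("Ingested intercept record: "
                        + records.item(i).getAttributes().getNamedItem("id").getNodeValue());
            }
        } finally {
            conn.disconnect();
        }
    }
}
```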

ii. Complex Event Processing Engine

Event processing engines are analytical tools that enable the derivation of relevant data
from extremely large volumes of real-time data. These event processors listen to event
streams and detect patterns based on predefined rules in near-real time. They do not store
all the events. (Event storage for later exploitation is the province of business
activity monitors.) There are three basic models:

• Simple event processors, which implement simple filters (“Does this communications
intercept contain the word attack?”)

• Event stream processors (ESP), which aggregate and join multiple event streams;
and

• Complex event processors, which process multiple event streams to identify meaningful
patterns using complex conditions and temporal windows (“There has been a cell phone
call that used the word attack AND there has been a cell phone call that mentions a
specific road junction AND there is an image that shows digging near the road junction
AND there is tagged video that shows hostile forces caching rocket propelled grenades
(RPG) near the road junction”). Complex event processors (CEP) are designed to process
tens of thousands (or more) events per second with a latency of milliseconds or less.

The data streams that feed complex event processors are often generated by means of
dedicated agent programs that operate in conjunction with the asset specific storage
mechanisms.
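The sketch below illustrates, in plain Java, the kind of filtering and windowed correlation described above: a stateless keyword filter (a simple event processor) and a correlation of COMINT and IMINT hits inside a temporal window (a complex event pattern). It is a conceptual illustration only; a production deployment would express these rules in a CEP engine’s own query language, and the event fields, source names and window length shown here are assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal illustration of simple filtering and windowed correlation over an event stream. */
public class KeywordCorrelator {

    /** A generic intercept event; real systems carry far richer metadata. */
    static final class Event {
        final Instant time;
        final String source;   // e.g., "COMINT", "IMINT"
        final String payload;  // transcript text, image tag, etc.
        Event(Instant time, String source, String payload) {
            this.time = time; this.source = source; this.payload = payload;
        }
    }

    private final Deque<Event> window = new ArrayDeque<>();
    private final Duration windowLength = Duration.ofMinutes(30);

    /** Simple event processing: a stateless filter ("does the intercept mention 'attack'?"). */
    static boolean isOfInterest(Event e) {
        return e.payload != null && e.payload.toLowerCase().contains("attack");
    }

    /** Complex event processing: correlate a COMINT hit with an IMINT hit inside one time window. */
    void onEvent(Event e) {
        // Expire events that have fallen outside the temporal window.
        while (!window.isEmpty()
                && Duration.between(window.peekFirst().time, e.time).compareTo(windowLength) > 0) {
            window.pollFirst();
        }
        if (!isOfInterest(e)) {
            return;  // most events are discarded; a CEP engine does not store them all
        }
        boolean comint = e.source.equals("COMINT") || window.stream().anyMatch(w -> w.source.equals("COMINT"));
        boolean imint  = e.source.equals("IMINT")  || window.stream().anyMatch(w -> w.source.equals("IMINT"));
        if (comint && imint) {
            System.out.println("ALERT: correlated COMINT and IMINT activity near " + e.time);
        }
        window.addLast(e);
    }
}
```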

b. Enterprise SOA Platform

Core enterprise capabilities are provided by the enterprise SOA platform. Typical components include:

• An enterprise service bus (ESB) that provides data transport, transformation and mediation
capabilities;

• An elastic load balancer (ELB) that ensures that the overall system maintains specified reliability
and availability service level agreements (SLA);

• A message broker that provides support for loosely coupled, asynchronous communications
between system components (a brief messaging sketch follows at the end of this subsection);

7
http://wso2.com
White Paper
• An identity and access management (IdAM) component that manages user accounts and
authentication, and provides the ability to exercise fine-grained access control at the data
element level; and

• A registry where artifacts including access control policies can be stored, validated and
retrieved.

Additional components may be found, such as tools to manage business processes and business rule
sets.
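As an illustration of the loose coupling a message broker provides, the sketch below publishes an ingested event to a topic over JMS. It assumes a JMS-compatible broker and uses the ActiveMQ client purely for illustration; the broker URL, topic name and message payload are hypothetical.

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

public class IntelEventPublisher {

    public static void main(String[] args) throws Exception {
        // Broker URL and topic name are illustrative.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker.example.mil:61616");
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("intel.ingest.imagery");
            MessageProducer producer = session.createProducer(topic);

            // Publish-and-forget: the producer does not know which consumers exist,
            // which is what keeps ingestion and analytics components loosely coupled.
            TextMessage message = session.createTextMessage("<intercept id=\"1234\" source=\"COMINT\"/>");
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}
```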

c. Operational Data Storage and Management

The components in this category perform analytics on the large data sets delivered by the collection
and ingestion components through the enterprise SOA platform, yielding actionable intelligence that is
consumed by both the analyst and warfighting communities. The category’s core is an analytics engine
that harnesses state of the art, open source big data tools that provide storage, enable distributed
processing and allow for normalization, summarization, query and analysis. The resulting processed
data is stored and can either be queried by existing analyst tool sets or delivered to user-definable
dashboards.

i. Replicated, Distributed Intelligence Database

Collected and ingested intelligence data is stored natively, within the analytics engine.
Reflecting the often disconnected and always unpredictable nature of tactical operations, the
storage mechanisms must be fault tolerant and independent. Under ideal circumstances,
each software/hardware deployment would include multiple, computationally independent
nodes to store and process data. Alternatively, lightweight edge nodes may require a more
traditional storage system.

The notional architecture supports both command and edge nodes. A command node is
expected to have space and power sufficient to support multi-device computing clusters.
For distributed storage across the command node’s computing cluster(s), the notional
architecture relies on the Hadoop Distributed File System (HDFS). Hadoop, a product
managed by the Apache Software Foundation, is a framework designed to support data
intensive distributed applications.

HDFS is designed to reap maximum utility from clusters of inexpensive commodity hardware.
It does so by implementing a master/slave architecture that stores data on the processing
nodes (following the assumption that “moving computation is easier than moving
data”). It is designed to detect and quickly and automatically recover from faults in any of
the storage processing nodes, thus maintaining a very high level of availability for both data
and the system as a whole. HDFS provides for very high levels of throughput between nodes
in a single cluster. Additionally, HDFS is designed to reliably store very large files across
machines in a large cluster by storing files as a series of blocks across multiple machines. All
nodes within a cluster need not be local to participate in the replication scheme.

This is due to features inherent in the HDFS design. To minimize global bandwidth
consumption and read latency, HDFS tries to satisfy a read request from a data replica that is
closest to the reader. If a replica exists on the same rack as the reading computer, then that
replica is preferred to satisfy the read request. For an HDFS cluster spanning multiple data
centers, a replica that is resident in the local data center is preferred over any remote replica.

This functionality supports the implementation of a centrally planned replication scheme
that supports multiple echelons (e.g., from unified combatant command to division).
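The sketch below shows, in outline, how a command node might write ingested data into HDFS and raise its replication factor so that subsequent reads can be served from a nearby replica. The name node address, file path and replication factor are illustrative assumptions.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommandNodeIngest {

    public static void main(String[] args) throws Exception {
        // The name node address is illustrative; each command node cluster would have its own.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://command-node-nn.example.mil:8020");

        FileSystem fs = FileSystem.get(conf);
        Path ingestPath = new Path("/intel/raw/guardrail/2013-05-07/mission-042.dat");

        // Write an ingested record; HDFS transparently splits it into blocks
        // and replicates each block across data nodes in the cluster.
        try (OutputStream out = fs.create(ingestPath)) {
            out.write("...collected SIGINT payload...".getBytes(StandardCharsets.UTF_8));
        }

        // Raise the replication factor for high-value data so that reads can be
        // satisfied by whichever replica is closest to the requesting node.
        fs.setReplication(ingestPath, (short) 3);
        fs.close();
    }
}
```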

Edge nodes are found closer to the forward edge of the battle area (FEBA), in direct support
of maneuver (i.e., ground combat) units. They are usually mounted and feature a much
lighter weight hardware configuration that will not, typically, have access to clustered
processors. As a result, they employ a more traditional database storage mechanism. One
database tool well suited to the needs of an intelligence edge node is Accumulo.

Accumulo is an open source NoSQL database originally developed at the National Security
Agency (NSA). Unlike many of the NoSQL databases currently available, Accumulo supports
a wide array of enterprise level features. Like many popular NoSQL databases, Accumulo
handles high volumes and large varieties of data at a rapid processing velocity and offers
the ability to process both analytical and transactional workloads while scaling efficiently to
thousands of nodes and petabytes of data. However, what sets Accumulo apart is that it was
engineered with cell-level security.

Cell level security is the ability to assign access permissions down to individual table cells
within the database, thus allowing administrators to extend the use of a given database
across the enterprise while remaining in compliance with applicable privacy and security
laws, regulations, policies and guidance (LRPG).

The following scenario illustrates the value of cell-level security in the intelligence context:
An intelligence support organization receives imagery data classified at the Top Secret/
Sensitive Compartmented Information (TS/SCI) level. The TS/SCI classification is due to
“sources and methods” data contained in the overall image record; that is, details about the
nature of the platform that obtained the imagery are found in the imagery metadata. The
imagery is important to the maneuver brigade being supported by
the intelligence organization. However, the brigade’s lower echelon units operate only at the
Secret classification.

Without cell-level security capabilities, getting the imagery to the maneuver units would
require manual sanitization and transfer prior to transmission. This is, at best, a time
consuming task that risks the imagery becoming operationally stale before it is transmitted.
As a result of the conflation of the imagery and sensitive metadata, the imagery might not be
transmittable at all.

With cell-level security, it is feasible (current information assurance (IA) and security rules
notwithstanding) for maneuver users to access the database directly, and all the tables
within, but with certain specified cells encrypted. Soldiers in the maneuver brigade would
be allowed to mine the imagery data stored in the NoSQL database with the exception of
the restricted metadata, which would essentially not exist for them. Accumulo’s cell-level
security would allow more users access to the database, theoretically resulting in faster
analysis and more rapid decision cycle times. In this manner, cell-level security provides part
of the answer to the cross-domain security puzzle.
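A minimal sketch of the scenario above, using Accumulo’s Java client: the imagery cell is written with a SECRET visibility label while the sources-and-methods metadata carries a TS/SCI label, and a scan performed with SECRET-only authorizations never returns the restricted cells. The instance, table, column and label names are illustrative assumptions, not an actual accreditation scheme.

```java
import java.util.Map.Entry;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;

public class CellLevelSecurityDemo {

    public static void main(String[] args) throws Exception {
        // Instance name, ZooKeeper hosts, credentials and table name are illustrative.
        Connector conn = new ZooKeeperInstance("intel", "zk1.example.mil:2181")
                .getConnector("ingest_svc", new PasswordToken("********"));

        // Write one imagery record: the image itself is releasable at SECRET,
        // while the sources-and-methods metadata is restricted to TS/SCI.
        BatchWriter writer = conn.createBatchWriter("imagery", new BatchWriterConfig());
        Mutation m = new Mutation("img-20130507-0042");
        m.put("image", "jpeg", new ColumnVisibility("S"), new Value(new byte[]{ /* image bytes */ }));
        m.put("meta", "collection_platform", new ColumnVisibility("TS&SCI"),
                new Value("...platform details...".getBytes()));
        writer.addMutation(m);
        writer.close();

        // A brigade user scanning with SECRET-only authorizations sees the image,
        // but the TS/SCI metadata cells are filtered out as if they did not exist.
        Scanner scanner = conn.createScanner("imagery", new Authorizations("S"));
        for (Entry<Key, Value> cell : scanner) {
            System.out.println(cell.getKey() + " -> visible at SECRET");
        }
    }
}
```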
ii. Intelligence and Operational Analytics

Intelligence data that is collected but not analyzed isn’t of much use. As noted above, the
efficiency of modern collection mechanisms and the sheer volume of data collected often
frustrate timely analysis. It is necessary, therefore, to automate the analysis process to the
maximum extent possible. Fortunately there are a number of enterprise grade open-source
components that help with the automation process.

Complementing its distributed file system, Hadoop offers a capability called MapReduce.
MapReduce is a framework for the parallel processing of large datasets using a cluster of
computers. It works by creating a “master” node and a number of “worker” nodes. During
the “map” step, the master takes the input (such as a data set), and divides it into smaller
subsets and distributes these subsets to worker nodes. The workers process the smaller
problems and pass the results back to the master. During the “reduce” step, the master
assembles all the answers to the sub-problems in a manner that provides the answer to the
original larger problem.

MapReduce’s advantages, like those of HDFS, really come into play at an intelligence
command node, where clusters of commodity computers can be assembled to maximize
the parallel processing benefits. That being said, Hadoop’s MapReduce architecture does
not restrict it to a clustered deployment, and it can readily be used at Edge nodes. It is
often used to normalize data from disparate sources so that the data can later be analyzed
efficiently.
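For illustration, the sketch below is a conventional Hadoop MapReduce job that counts how many normalized intercept transcripts mention each keyword on a watch list. The input and output paths, keywords and record format are assumptions; the structure (a mapper emitting key/value pairs and a reducer aggregating them) is the point.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts, per keyword of interest, how many intercept transcripts mention it. */
public class KeywordCount {

    public static class KeywordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final String[] KEYWORDS = {"attack", "rpg", "junction"}; // illustrative watch list
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text transcript, Context context)
                throws IOException, InterruptedException {
            String text = transcript.toString().toLowerCase();
            for (String keyword : KEYWORDS) {
                if (text.contains(keyword)) {
                    context.write(new Text(keyword), ONE); // the "map" step emits partial results
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text keyword, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();
            }
            context.write(keyword, new IntWritable(total)); // the "reduce" step assembles the answer
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "keyword-count");
        job.setJarByClass(KeywordCount.class);
        job.setMapperClass(KeywordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/intel/normalized/transcripts"));     // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/intel/analytics/keyword-counts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```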

Working hand in hand with MapReduce during the analysis process is Apache Hive. Hive is a
data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries,
and the analysis of large datasets stored in Hadoop compatible file systems and databases. It
can also be used to schedule and manage data manipulation tasks in Hadoop. Hive provides
a mechanism to project structure onto and query the data using a SQL-like language called
HiveQL. Through HiveQL, Hive also supports traditional map/reduce programming.
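A HiveQL query can then summarize the normalized data without any hand-written MapReduce code. The sketch below submits such a query over JDBC to a HiveServer2 endpoint; the connection string, table and column names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DailySigintSummary {

    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint, database, table and column names are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://command-node.example.mil:10000/intel", "analyst", "");

        // HiveQL reads like SQL; Hive compiles it into MapReduce jobs over the
        // normalized data stored in HDFS.
        String hql = "SELECT region, COUNT(*) AS intercepts "
                   + "FROM normalized_sigint "
                   + "WHERE collection_date = '2013-05-07' AND transcript LIKE '%attack%' "
                   + "GROUP BY region "
                   + "ORDER BY intercepts DESC";

        try (Statement stmt = conn.createStatement(); ResultSet rs = stmt.executeQuery(hql)) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getLong("intercepts"));
            }
        } finally {
            conn.close();
        }
    }
}
```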

Both MapReduce and Hive are flexible and adaptable, and support custom data analysis and
querying. As a result, normalization and storage are just the beginning. Organizationally
specialized processing can be implemented. For example, missile defense organizations may
desire to store only information about mobile surface to surface missile launchers out of
the flood of data. MapReduce and Hive can be readily programmed to support this sort of
capability. For example, the data might be tagged during normalization so that only records describing mobile launchers are retained and indexed for later query.

iii. Operational Data Retrieval

Once the data has been retrieved and normalized, it is written to a single, normalized
data set stored in a secure database (i.e., Accumulo). This data set can then be accessed
by analysts through the standard set of analysis tools as well as specialized statistical
engines (e.g., R) and spatial data engines (SDE) or geospatial information systems (GIS).
Architecturally, it is important to ensure that the analysis tools are not tightly coupled to
the database structure. Tight coupling significantly increases total cost of ownership (TCO),
as any modifications to either the analysis toolset or the database will
require corresponding changes to the other component.

The necessary architectural separation or “loose coupling” is achieved through the use of
a service oriented data access mechanism. This mechanism provides and manages data
services that expose information stored in the database as XML that is readily consumed
by the analysis tools. As a result, changes to either the database or any of the analysis
tools require modification only to specific data services, eliminating the need for extensive
updates and changes to the rest of the system. It’s worth noting that the savings extend
beyond software development labor costs: By limiting the scope of necessary modifications,
a significantly smaller certification and accreditation (C&A) effort is required, which in turn
lowers TCO.

Functional Review: How it All Fits Together


With an understanding of the discrete functionalities involved, a brief review of the sequence of data flows
and manipulations is useful:

1. Data is collected and stored locally by collection systems that include UAVs and other platforms. This
data may include both imagery (full motion and still) and text files.

2. Text-based information is ingested by means of dedicated APIs or data services.

3. Imagery is fed by means of agent programs into a complex event processing engine that has been
programmed to look for certain patterns of interest. The patterns for which it is programmed reflect
the organization’s operational concerns and can be updated as necessary. Only imagery conforming to
the search patterns is retained.

4. The ingested data (text and imagery) is transported, in a manner compliant with organizational access
control policies, via the enterprise SOA platform to a first stage instance of the replicated intelligence
database (HDFS or Accumulo).

5. The data is normalized through a series of MapReduce iterations (managed by Hive).

6. Hive queries the normalized data and stores it in a second-stage instance of the replicated intelligence
database.

7. The data is exposed to intelligence analysis tools as XML by data services.


The Question of Integration


As can be inferred from the above discussion, the issue facing acquisition professionals charged with
developing automated “Big Data” handling systems for the IC and military intelligence organizations is less
about understanding overall requirements than about acquiring the systems in a timely and cost effective
manner. Fortunately, most, if not all of the functionality required can be obtained through commercial
open source software products. By eliminating license acquisition costs, these open source products
substantially reduce TCO. Additionally, because the products are generally pre-existing, the effort
needed to develop capabilities is reduced and overall acquisition timelines are shortened.

It is useful at this point to dispel misconceptions about the commercial nature of open source software.
Both US federal law (41 USC 403) and enabling regulations (the Federal Acquisition Regulations (FAR)
and Defense Federal Acquisition Regulations (DFAR)) specify a definition of commercial items that clearly
encompasses open source software:

The term “commercial item” means any of the following (as per 41 USC 403 (12)):

(A) Any item, other than real property, that is of a type customarily used by the general public or by
nongovernmental entities for purposes other than governmental purposes, and that -
• has been sold, leased, or licensed to the general public; or
• has been offered for sale, lease, or license to the general public.

(B) Any item that evolved from an item described in subparagraph (A) through advances in technology
or performance and that is not yet available in the commercial marketplace, but will be available
in the commercial marketplace in time to satisfy the delivery requirements under a Federal
Government solicitation.

(C) Any item that, but for –

• modifications of a type customarily available in the commercial marketplace, or
• minor modifications made to meet Federal Government requirements,

would satisfy the criteria in subparagraph (A) or (B).

This definition and its applicability to the defense community (and by extension, the IC) were confirmed by
the “Clarifying Guidance Regarding Open Source Software” issued by the US Department of Defense (DoD)
on October 16, 2009.

A brief analysis of the capabilities required, and whether they are supported by open source products is
shown in the table below:


Required Capability | Supported by OSS? | Example Product | Remarks
--------------------|-------------------|-----------------|--------
Ingestion data service | Yes | WSO2 Data Services Server |
Ingestion API | Yes | WSO2 API Manager | API Manager does not generate APIs; it governs API lifecycle management and deployment.
Imagery Ingestion | Yes | WSO2 Complex Event Processor |
Transport, Transform, Mediation | Yes | WSO2 Enterprise Service Bus |
Runtime Governance | Yes | WSO2 Elastic Load Balancer |
Asynchronous Messaging and Component Loose Coupling | Yes | WSO2 Message Broker |
Security and Access Control | Yes | WSO2 Identity Server |
Access Control Policy Management | Yes | WSO2 Governance Registry |
Data Storage and Analytics (data storage, data normalization, query management) | Yes | WSO2 Business Activity Monitor (HDFS, MapReduce, Cassandra, Accumulo, Hive) | The current version of the Business Activity Monitor uses Cassandra, a NoSQL database. Upcoming versions will support Accumulo and HDFS.
Data Retrieval Management | Yes | WSO2 Data Services Server |

Table 1, Open Source Software Support by Capability

As can be seen, the entire Big Data analytics architecture can be supported by open source products.


Conclusion
Advances in intelligence collection tools and techniques have resulted in an embarrassment of operational
data riches. Analysts must now derive meaning from huge bodies of data within an operationally relevant
timeline. Fortunately, advances made by industry with respect to Big Data can be leveraged by government
acquisition managers, providing ready-made and well vetted solutions that can be customized for IC use.
Most, if not all of the software products required to support such solutions are available as open source
implementations, reducing both TCO and time to initial operational capability. Additionally, US law and
enabling regulations support the use of open source products within a US government context.

Check out more WSO2 Whitepapers and WSO2 Case Studies.

For more information about WSO2 products and services, please visit http://wso2.com or email bizdev@wso2.com