You are on page 1of 14

InsideBIGDATA Guide to

Data Analytics in Government


Written by Daniel D. Gutierrez, Managing Editor, insideBIGDATA

BROUGHT TO YOU BY

Copyright © 2017, insideBIGDATA, LLC All rights reserved. Various trademarks are held by their respective owners.
Guide to Data Analytics in Government

Introduction
This guide provides an in-depth overview of the use of data analytics technology in the public
sector. Focus is given to how data analytics is being used in the government setting with
a number of high-profile use case examples, how the Internet-of-Things is taking a firm hold
in helping government agencies collect and find insights in a broadening number of data
sources, how government sponsored healthcare and life sciences are expanding, as well as
how cybersecurity and data analytics are helping to secure government applications.

Data Analytics for Government: An Overview


Data is a critical asset for government agencies. All Government also must look at the type and source
levels of government collect an increasing amount of data being collected, stored, analyzed, and
of data every day. As these agencies strive for a consumed, i.e. structured versus unstructured
meaningful digital transformation, it’s important data. Structured data is the data that has been
not only to collect and store large data sets, but modeled and normalized to fit into a relational
also to use that data when making mission-critical model, such as traditional row/column databases.
decisions. This growing use of so-called “big data” Unstructured data is information that either
builds mounting pressure on government to use doesn’t have a predefined data model or doesn’t
data analytics to turn all data into actionable fit well into relational tables — examples are
information. Efficient use of data is the missing social media, plain text, log files, video, audio,
link between good governance and capacity and network-type data sets. Big data for data
building, where data insights can be gleaned to analytics is a compilation of both structured and
improve service delivery. This technology guide unstructured data.
will help government thought leaders in how best
Recognizing the importance of data analytics,
to use data analytics to manage and derive value
the U.S. Government now has a new official
from an increased dependence on data.
role of Chief Data Scientist. Pledging to put all
A number of prevailing concerns include: government records into the public domain, it is
• How are these data is being converted into clear that U.S. Government officials are far from
beneficial insights? ignorant of the importance of the data revolution.
• What happens when you have too much data?
• How do you make sense of it when the data
volume is on a continued upward trajectory?
• How can you keep up with the volume of data?

Contents
Introduction.......................................................................2 Neuroscience................................................................9
Data Analytics for Government: An Overview...................2 Infectious Diseases . ...................................................10
Emerging Technologies for the Public Sector................3 Government Use of Big Data for Cybersecurity...............11
Preparing for Big Data...................................................4 Cybersecurity Solutions..............................................12
Innovative Use Case Examples......................................4 Case Studies: Dell EMC Focused Customer Use Cases....13
Government Applications of IoT . .....................................5 Dubai’s Smart City Initiative........................................13
Smart Cities...................................................................6 San Diego Supercomputer Center...............................13
Motivations and Challenges for National Center for Supercomputing Applications.....14
Government Use of IoT.................................................7 Tulane University........................................................14
Government Sponsored Healthcare and Life Sciences......8 Translational Genomics Research Institute ................14
Genomics .....................................................................9 Summary.........................................................................14

2
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

• Government agencies involved in healthcare,


KEY TAKEAWAYS such as the Centers for Disease Control
} Data analytics for government is a (CDC), track the spread of illness using social
rapidly evolving field media, while the National Institutes of Health
(NIH) launched an initiative called Big Data
} Government agencies have an appetite
to Knowledge (BD2K) in 2012 to encourage
for embracing big data technologies
innovation based on data-derived insights.
} Government leaders need to focus on One project funded by the government even
delivering value aims to spot the early signs of suicidal thinking
among war veterans, based on their social
Certain uses are already being found for data in media behavior.
government-related fields, such as healthcare,
Privacy concerns are often seen as the biggest
cybersecurity, and education, which can potentially
challenge presented by the rise of data and
have huge positive impact. These include:
analytics, and the U.S. Government has identified
• The CIA helped to fund Palantir Technologies, several potential areas that could be infringed on
which produces analytical software designed through inappropriate use of data, including rights
to stamp out terrorism and crack down on to secrecy of personal information and rights to
cyber fraud by identifying transactions that remain anonymous while exercising free speech.
follow patterns commonly displayed during Indeed, big data use in government certainly
fraudulent activity. presents big challenges — officials and politicians
• American law enforcement agencies (at fed- have a fine line to tread if they do not want to
eral, state and county levels) have access to come across as attempting to implement a real-life
sophisticated ALPR (automated license plate version of Orwell’s Big Brother.
recognition) software that alerts them to any-
one in the vicinity with an outstanding warrant. Emerging Technologies for the
And predictive technologies are used by several Public Sector
police departments to predict flashpoints
where crimes may occur, as well as link A new report by Accenture, Emerging Technologies
particular crimes to particular repeat offenders. in Public Service, examines the adoption of emerging
technologies across government agencies, with
• The U.S. Department of Transportation also
the most interaction with citizens or the greatest
uses license plate recognition, as well as
responsibility for citizen-facing services: health
cameras, to monitor the flow of people as
and social services, policing and justice, revenue,
they travel by plane, train and automobile,
border services, pension and social security, and
generating insights as to where infrastructure
administration. A few of the findings of this nine-
investment is necessary, as well as predictions
country survey of nearly 800 IT leaders include:
about how we are likely to change the way we
travel in the future. • More than two-thirds of government agencies
are evaluating the potential of emerging
• In education settings around the globe, thanks
technologies including big data, but only one-
to the huge increase in the amount of learning
fourth have moved beyond the pilot stage
activity carried out online (both in traditional
• Those that have embraced technologies
school environments and through distance
like IoT and machine learning all report that
learning), massive amounts of data about the
the most common new IT used is advanced
way we study and learn is becoming available.
analytics with predictive modeling
• The U.S. Department of Agriculture supports
• Nearly half of those agencies say their
the agricultural economy through research and
primary objective in harnessing the power
development of new big data technologies.
of advanced analytics is to improve and
One recent breakthrough is increasing the
support employee work
yield of dairy herds by identifying bulls most
• More than three-fourths indicate that
likely to breed high-yielding cows based on
implementing machine learning methods are
genetic records.
either underway or complete
3
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

Preparing for Big Data


The Japanese government plans to
What can government agencies do to prepare develop a system to help disaster
for and embrace big data and data analytics?
First of all, government agencies should try to get
victims by utilizing big data gathered
ahead of their data deluge. Strategy, planning and from sources such as internet
governance are critical to this process. Second, they postings, and global positioning
must develop and review the life cycle for data for system (GPS) data from smartphones
their agencies. The life cycle can be categorized into
the following phases:
and car navigation devices.

Data Life Cycle Phases In setting the stage for successful big data
deployments, government agencies also need
Retail generates a flood of complex a good sense for data lineage — source of data,
structured and unstructured data. There is a change log of data, and trustworthiness of data
vast number of sources of this data, but for a to limit any sort of garbage-in-garbage out
short-list we can consider the following: (GIGO) possibilities.
• Ingest: the collection of data from a
diverse set of sources Innovative Use Case Examples
• Store: the repository for the collected There are a number of high-profile big data use
data — the right kind of data needs to case examples in a government setting:
be stored in the correct repository
• Analyze: the analytics of the data in the } Pakistan’s National Database & Registration
Authority (NADRA), one of the world’s largest
repositories
multi-biometric citizen databases, serves
• Consume: the reporting and business
as an example of harnessing the power of
intelligence for decision-making
government data. NADRA is an independent
and autonomous agency under Ministry of
When the big data life cycle is well understood, Interior and Narcotics Control, Government of
government thought leaders need to plan and Pakistan that regulates government databases
identify the following: and statistically manages the sensitive
• Find technology enablers – seek out and registration database of all the national
evaluate new infrastructure and new software citizens of Pakistan.
applications, and institute pilot programs/
proof-of-concept projects
} The Japanese government plans to develop a
system to help disaster victims by utilizing big
• Adopt an ecosystem approach – big data data gathered from sources such as internet
analytics is an evolving space and there will be postings, and global positioning system (GPS)
new technology options to review and select data from smartphones and car navigation
new solutions devices. The system will enable administrative
• Adopt a use case-based approach – data’s authorities to immediately ascertain the
value depends on the insight of the domain; movements of victims just after a disaster
look for use case-specific projects, e.g. occurs. The government will gather and
network-centric data analytics or cybersecurity analyze information, including that on isolated
insights local communities and overcrowded shelters,
• Invest in data-centric skill sets – insights to make the initial response after a disaster,
in large data sets tend to be as good as the such as search and rescue operations and the
domain knowledge of the data, so skills for delivery of goods, more efficient.
data scientists and data analysts need to be
developed and nurtured
4
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

} From Washington, DC to cities, states,


and countries all around the world, “open
} A new UKAuthority report “Digitising Policing,”
indicates that advances of IT solutions like data
government” is revolutionizing the way analytics are changing the world and policing
citizens interact with government leaders. is changing with it. It is clear that police forces
It connects like-minded citizens with each are rapidly adopting big data technology.
other, with government agencies, and with It promises significant improvements in
many other types of organizations. To support efficiency— the management of evidence
open government initiatives and uphold the can be improved, and the provision of more
values of transparency, participation, and information — can support better decision
collaboration in the US, federal agencies now making at strategic and operational level, and
make their data open. This means making among officers on the beat. There are also
large data stores publicly accessible in a opportunities to use data analytics to better
format that can be shared. Open data from understand factors influencing the demands
the government gives citizens the information on the police.
they need to hold government leaders
accountable. Open data fosters collaboration } Traffic and congestion control in London
has been using automatic plate number
between government leaders and citizens,
recognition along with multiple video cameras
and encourages cooperation internally
across the city for years meaning that during
among government entities. The results can
the planning and delivery of Olympics 2012,
be tremendously better decisions that have
Transport for London (TfL) employed a number
the potential to drastically change lives. One
of big data tools to make sure that during the
fertile source of open data from the Federal
Olympic games public (and private transport)
Government, as well as APIs from a variety
kept people moving quickly and safely to and
of federal agencies and other resources, is
from the games and generally around London
data.gov.
for their normal business.

Government Applications of IoT


The Internet of Things
(IoT) is a term used
to describe the set
of physical objects
embedded with
sensors and connected
to the Internet.
IoT offers numerous
opportunities for the
Federal Government
to cut costs and
improve citizen
services. The data
visualization “Growth
in the Internet of
Things” (below) shows
the tremendous
growth in the number
of connected devices
across all sectors. Source: NCTA – The Internet & Television Association

5
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

Smart Cities Building Blocks of a Smart City


IDC defines a Smart City (also known as digital
city) as being a “finite entity (district, town, city,
county, municipal, and/or metropolitan area) with
its own governing authority that is more local than
national. This entity is built on an information and
communication technology (ICT) foundation layer
that allows for efficient city management, economic
development, sustainability, innovation, and citizen
engagement.”
Data is one of the most critical elements that will
underpin the success of a city’s transformation into
a smart city. To be deemed successful, a city should
be able to harness data from existing government
systems, online and mobile applications, third-party
applications, and, most importantly, from citizens
—the ultimate beneficiaries of smart cities. The Source: IDC Report – Building Agile Data Driven Smart Cities, 2015
data that is gathered can be analyzed and used
to make informed, automated decisions that can A new report by The Wharton School, Smart
improve the life of citizens and better manage local Cities: The Economic and Social Value of Building
resources. Intelligent Urban Spaces, suggests that the
Smart cities are communities that harness worldwide trend toward building smart cities
technology to transform physical systems and is getting the creative juices of urban planners,
services in a way that enhances the life of its entrepreneurs, and citizens flowing. Some projects
residents and business while making government are impressive in their innovation, while others are
more efficient. It is more than a mere automation laudable for their scale and impact. Typically, these
of processes; it also links disparate systems and projects begin in discrete pockets of existing cities,
networks to gather and analyze data that is then where they can generate a “proof of concept,”
used to transform whole systems. While the idea improve upon their design while gaining insights,
behind smart cities has been around for years, it and go on to replicate them in other parts of the
has acquired a new urgency as more people move city or in other communities. A less common
into urban centers. Sustainability is becoming example is the building of brand new cities, such
an important imperative and technologies have as Songdo in South Korea or one planned in China’s
advanced to a point where there can be real- Guangdong province. Innovative work from around
time and meaningful interaction between cities, the world in smart cities touches on a wide range of
residents, and businesses. areas including energy, healthcare, transportation,
parking, crime management, adaptive reuse
According to the U.S. Smart Cities Council, another of existing and unused infrastructure, citizen
perspective of a smart city is a “system of systems participation, and enhanced overall quality of life.
—water, power, transportation, emergency
response, and others— with each one affecting all Other examples of smart cities include Barcelona,
the others.” In recent years, the ability to merge where the drive to revolutionize began more than
multiple data streams and analyze them for critical a dozen years ago, Copenhagen with a focus on
insights has been refined. Those insights are what boosting sustainability, Charlotte, North Carolina,
enhance livability, workability, and sustainability in with a successful energy efficiency program, and
a city. The overarching goal is to make cities more China’s capital city, Beijing. There also is a strong
efficient, and ensure resources are found where push in the Middle East toward smart cities and
they’re needed. adopting IoT and the technologies that benefit from

6
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

machine learning, originating from the perspective public safety. When you look at those areas with
of public safety and security. The U.S., which came traffic issues, patterns emerge and you can allocate
later to the game, is now getting its act together at resources to those identified parts of the city.
the Federal Government level, although several If you can predict that you’re going to have five or
U.S. cities, including New York, Philadelphia, and six accidents in downtown Austin, you can schedule
San Francisco, launched such initiatives years before resources around the downtown area in case that
the term “smart city” or “digital city” became a happens. As a specific example, a major event like
buzzword. Smart cities are driven by the amount the Austin City Limits (ACL) music festival attracts
of data and ability to cross many verticals in the upwards of 70,000 people in the central part of
government space. the city — where do you put emergency services,
where do you put police, where do you put traffic
cops, where do you put EMTs? Smart cities means
Smart cities will represent a market allocating resources at the right time.
value of USD 1.565 Trillion by 2020,
driven by investments in sectors such Motivations and Challenges for
as smart districts, utilities, healthcare, Government Use of IoT
security, and governance. – Frost & Sullivan Where federal agencies have begun deploying
IoT solutions, it is typical to pursue several primary
goals: increase efficiency, reduce costs, and offer
In September 2015, the Obama administration
new services (e.g. anti-crime surveillance). Major
announced a smart cities initiative to invest more
projects include a smart buildings initiative to reduce
than $160 million in at least two dozen research
energy costs, a telematics program to increase
and technology collaborations to help communities
the efficiency of government vehicles, an effort to
across the country tackle challenges ranging from
improve asset management, and an automated
fighting crime and reducing traffic congestion to
process to replace manual data-collection.
fostering economic growth. Digital cities, based
on IoT technologies, use traffic lights to analyze Several federal agencies have used the IoT as an
traffic patterns. opportunity to create services in support of their
missions. Major new projects that use the IoT include
According to a report by Frost & Sullivan, Strategic
improving national defense, monitoring the natural
Opportunity Analysis of the Global Smart City
world, and enhancing safety and public health.
Market, smart cities will represent a market value of
USD 1.565 Trillion by 2020, driven by investments in Adoption of IoT technology for government
sectors such as smart districts, utilities, healthcare, applications, however, doesn’t always follow a
security, and governance. straight line. The Federal Government faces a
number of challenges that have hampered the
One caveat is that it’s harder for the public sector in
adoption of the IoT in the public sector. First, there
developed Western economies to cultivate a smart
is a lack of strategic leadership at the federal level
city, with civil liberties concerns centered on if and
about how to make use of the IoT. Second, federal
how data is leaving the city. Security and privacy
agencies do not always have workers with the
concerns work to restrict data leaving the city, and
necessary technical skills to effectively use data
more broadly, leaving the host country.
generated by the IoT. Third, federal agencies do
Smart cities also involve putting resources not have sufficient funding to modernize their
in the right place. As an example, you have IT infrastructure and begin implementing IoT pilot
limited resources with police, limited staffing for projects. Fourth, even when funding exists, federal
EMTs, limited numbers of fire stations, trucks, procurement policies often make it difficult for
ambulances, etc. The goal is to develop predictive agencies to quickly and easily adopt the technology.
models —for instance Austin, TX has intersections Finally, risks and uncertainty about privacy, security,
that have the most traffic between the hours of interoperability, and return on investment delay
4-6pm, and predicting the most likely intersections federal adoption as potential federal users wait for
to have an accident is a tremendous benefit for the technology to mature and others to adopt first.
7
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

Government Sponsored Healthcare and Life Sciences


Government sponsored data initiatives within is restricted. The belief is that the availability of
healthcare and life sciences are encouraging — patient information and data analytics could have
they not only increase transparency but also have a substantial beneficial impact. Using big data
the potential to help patients. Not surprisingly, technology, there is focus on three main areas:
recent years have seen a flurry of activity in this • Interoperability of patient records – the
sector in many countries. For example, the Italian ability to access and update records at any
Medicines Agency collects and analyzes clinical data point in the healthcare system by integrating
on expensive new drugs as part of a national cost- NHS institutions.
effectiveness program. Based on the results of this • Data analytics – using large quantities of
effort, the government may reevaluate prices and information to better predict and personalize
market access conditions. medicine. Data analytics can identify the
Within the U.S., the Federal Government has been combination of factors that put the patient
encouraging the use of its healthcare data through at high risk of developing a chronic condition,
various policies and initiatives with the hope to allowing the intervention to prevent them
directly improve cost, quality, and the overall health- from getting ill. Personalized medicine can
care ecosystem. With more data being released, improve early diagnosis and improve quality
the Federal Government is trying to ensure that all of care treatments and outcomes can be
appropriate stakeholders, including those in private analyzed in conjunction with patient details
industry, can access the information in standard in order to maximize the benefit of any
formats. The HealthData.gov portal, for example, treatment.
includes federal databases with information on • Mobile technology – apps can be used by
the quality of clinical providers, the latest medical medical practitioners to provide up-to-date
and scientific knowledge, consumer product data, practical advice and by individuals to manage
community health performance, government their health. Tracking devices are becoming
spending data, and many other topics. In addition more popular and could be used to maintain
to publishing information, the aim is to make data personalized healthy lifestyles.
easier for developers to use by ensuring that they
Scotland has used informatics technology to
are machine-readable, downloadable, and acces-
provide an integrated care model for the treatment
sible via application programming interface (API).
of diabetes. GPs, patients and secondary care
In a report by Voterra Partners for Dell EMC, professionals have collaborated to treat diabetes
Sustaining Universal Healthcare in the UK: Making over a period of 20 years. The informatics
Better Use of Information, we learn about the technology is used to track patients’ treatment and
effort to address the unprecedented financial treatment outcomes that are carefully monitored
pressure faced by the UK’s National Health and managed so as to reduce the severity of the
System—affecting patient’s quality of treatment, condition. This has achieved impressive results as
as waiting times increase, and research funding shown in the figure provided below.

Source: Voterra Partners

8
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

Genomics
Apache Spark’s compatibility with
Scientists are realizing fascinating new perspectives
on the human genome, and it’s all thanks to the
the Hadoop platform makes it easy
advancements made in data analytics. For years, to deploy and support within existing
genes have been studied and mapped, with bioinformatics IT infrastructures,
perhaps the crowning achievement being the and its support for languages such as
completion of the Human Genome Project in the
R, Python, and SQL ease the learning
early 2000s, but true understanding of how human
genetics work has required more intensive study curve for practicing bioinformatics
and more resources. Only recently have scientists practitioners.
been able to look more closely at human genes,
and much of this progress comes as they apply by Intel, upon which NGCC is based, can handle
data analytics to the effort. data sets of more than 300 terabytes of genomic
data. In addition, ASU is using the NGCC to
The 100K Project is an ambitious understand certain types of cancer by analyzing
program to sequence 100,000 patients’ genetic sequences and mutations.

whole genomes from NHS patients Apache Spark is an ideal platform for organizing
large genomics analysis pipelines and workflows.
across England.
Its compatibility with the Hadoop platform makes
it easy to deploy and support within existing
There is much interest in genomics and bioinformatics IT infrastructures, and its support
personalized healthcare across the European for languages such as R, Python, and SQL ease
Union. In the UK there is the “100K project,” the learning curve for practicing bioinformatics
where 100,000 genomes are being sequenced. practitioners. Widespread use of Spark for
Formally known as “The 100,000 Genomes genomics, however, will require adapting and
Project,” it is an ambitious program to sequence rewriting many of the common methods, tools,
100,000 whole genomes from NHS patients across and algorithms that are in regular use today.
England. Genomics promises significant benefits
in healthcare through scientific discovery, and this
study will help to deliver on this goal. Dell EMC
Neuroscience
provides the platform for large-scale analytics There are many challenges to analyzing neural data.
in a hybrid cloud model for Genomics England. The measurements are indirect, and useful signals
The project has been using Dell EMC storage must be extracted and transformed in a manner
for its genomic sequence library, and now it tailored to each experimental technique. Analyses
will be leveraging a data lake to securely store must find patterns of biological interest from the
data during the sequencing process. Backup sea of data. An analysis is only as good as the
services are provided by Dell EMC’s Data Domain experiment it motivates; the faster we can explore
and Networker. data, the sooner we can generate a hypothesis and
Arizona State University (ASU) worked with Dell move research forward.
EMC to create a powerful high-performance One of the reasons neural data analysis is so
computing (HPC) cluster that supports data challenging is that there is no standardization. There
analytics. As a result, ASU built a holistic Next are families of workflows, analyses, and algorithms
Generation Cyber Capability (NGCC) using Dell that we use regularly, but it’s just a toolbox, and a
EMC and Intel technologies that is able to process constantly evolving one. To understand data, we
structured and unstructured data, as well as must try many analyses, look at the results, modify
support diverse biomedical genomics tools and at many levels—whether adjusting preprocessing
platforms. HPC technology and the Dell EMC parameters, or developing an entirely new
Cloudera Apache Hadoop solution, accelerated algorithm—and inspect the results again.

9
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

Infectious Diseases
With Spark, especially once data
The National Institutes of Health (NIH) is trying
is cached, we can get answers to to prevent the spread of infectious diseases, e.g.
new queries in seconds or minutes, super-viruses. NIH looks at all the drug data and
instead of hours or days. clinical trial results submitted to the FDA, and
correlates data from drug manufacturers, doctors,
and patients to build a model. As an example, if a
Spark allows the ability to cache a large data set super-virus were to take place in a given population,
in RAM and repeatedly query it with multiple data analytics can help answer questions like: how
analyses. This is critical for the exploratory process, affected would that population be, how fast would
and is a key advantage of Spark compared to it spread, what actions would the NIH have to take
conventional MapReduce systems. With Spark, to quarantine that area, and what steps would be
especially once data is cached, we can get answers needed in order to control the virus before
to new queries in seconds or minutes, instead of it spreads across the country?
hours or days. For exploratory data analysis, this
is a game changer where the ability to visualize A real-life example is how data analytics is helping
intermediate results is critical. the fight against the Zika virus. The World Health
Organization has declared the Zika virus a public
health emergency that could affect four million
BRAIN Initiative people in the next year as it spreads across the
The U.S. based BRAIN Initiative uses big Americas. Big data and analytics have played a role
data technologies to map the human in containing previous viral outbreaks such as Ebola,
brain. By mapping the activity of neurons Dengue fever, and seasonal flu, and lessons learned
in the brain, researchers hope to discover are undoubtedly being put to use in the fight
against Zika. However, while statistical modeling
fundamental insights into how the
of vast, real-time data sets is becoming ingrained
mind develops and functions, as well
across healthcare and emergency response,
as new ways to address brain trauma
the support infrastructure needed to put these
and diseases. Researchers plan to build
initiatives to work at ground level is lagging behind.
instruments that will monitor the activity
of hundreds of thousands and perhaps Data research has dramatically sped up the
1 million neurons, taking 1,000 or more development of new flu vaccines. By analyzing
measurements each second. This goal the results of thousands of tests at institutions
will unleash a torrent of data. A brain around the world, compounds can be developed
observatory that monitors 1 million to target the specific proteins that are found to
neurons 1,000 times per second would enable the virus to grow. Big data is also used by
generate 1 gigabyte of data every second, epidemiologists to track the spread of outbreaks.
4 terabytes each hour, and 100 terabytes Dell EMC and the University of Cambridge maintain
per day. Even after compressing the data the European HPC Solution Centre. In collaboration
by a factor of 10, a single advanced brain with Intel, Dell EMC and Cambridge HPC Solution
laboratory would produce 3 petabytes of Centre aims to provide answers to challenges
data annually. facing the HPC community and feed the results
back into the wider research community. Thanks to
the Centre, researchers are exploring the genetic
analysis of tens of thousands of disease patients.

10
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

Government Use of Big Data for Cybersecurity

Source: Navigating the Cybersecurity Equation by MeriTalk

The cybersecurity waters are teeming with threats they deal with a cybersecurity compromise at least
by criminals, nation states, and hacktivists, and once a month due to their inability to fully analyze
government agencies do not have the personnel, data. In addition, just 45 percent found their efforts
tools, or time to properly handle the massive to be “highly effective.” What is holding agencies
amounts of data involved especially with the attack back from reaching their cybersecurity goals?
surface constantly expanding. However, with the Where do agencies go from here?
ability to discover insights in the very data they are
The top challenges for feds surrounding
sinking in, big data may be the requisite lifeline.
cybersecurity as reported by participants were:
While 90 percent of government data analytics users • The sheer volume of cybersecurity data,
report they have seen a decline in security breaches, which is overwhelming (49 percent)
49 percent of federal agencies say cybersecurity • The absence of appropriate systems to
compromises occur at least once a month as a result gather necessary cybersecurity information
of an inability to fully analyze data. These are some (33 percent)
of the findings in a new report from MeriTalk’s
• The inability to provide timely information
(a public-private partnership focused on improving
to cybersecurity managers (30 percent)
the outcomes of government IT), Navigating the
Cybersecurity Equation, which examines how Because of these challenges, participants
agencies are using big data and advanced analytics stated more than 40 percent of their data goes
to better understand cybersecurity trends and unanalyzed. Other obstacles include the lack of
mitigate threats. skilled personnel (40 percent), potential privacy
concerns (27 percent), and lack of management
The survey asked 150 federal cybersecurity
support/awareness (26 percent).
professionals to examine how agencies are using big
data and analytics to better
understand cybersecurity
trends and mitigate threats.
The study found that 81
percent of federal agencies
say they’re using data
analytics for cybersecurity in
some capacity—53 percent
are using it as a part of their
overall cybersecurity strategy,
and 28 percent are using it in
a limited capacity. However, Source: Navigating the Cybersecurity Equation by MeriTalk
breaches continue to afflict
agencies with 59 percent of
federal agencies reporting

11
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

Cybersecurity Solutions
Dell EMC Cyber Solutions Group has
Cybersecurity difficulties can be addressed through
been developing applications that
the use of hardened technology solutions such as
enable penetration at a database the one from SecureWorks, Inc., a subsidiary of
level with zero impact on the Dell Technologies. SecureWorks is a secure, cost-
database and provides real-time effective, and highly scalable solution for collecting,
assessments across the network in processing, and analyzing massive amounts of data
collected from government agency environments.
seconds. As soon as these threats The solution is deployed on the Dell EMC Cloudera
are identified, assessments can be Apache Hadoop Solution, accelerated by Intel
announced to determine how fit (using the Dell EMC Reference Architecture).
the security solution is.
SecureWorks Features
From a perspective outside the U.S., cybersecurity SecureWorks offers a number of features
has been identified as the number one threat to the attractive to the needs of government
UK government. The UK is seeing a big emphasis agencies:
on using National Cyber Security Center (NCSC) that • Explosive data growth leading to a search
was announced early 2016. The UK faces a growing for new technology
threat of cyberattacks from states, serious crime • Secure solutions built for big data and
gangs, hacking groups as well as terrorists. The analytics
NCSC will help ensure that the people, public and
• Innovative security capabilities through
private sector organizations and the critical national
an evolving technology stack
infrastructure of the UK are safer online.
• Better protection for government clients
Data analytics is playing a significant role in looking • Faster threat response for government
at the threat landscape — to determine where agencies
weaknesses are, and whether the right strategies
• Reduced costs and increased scalability
and tools are in place. Additionally, there is a focus
between the military and the intelligence services,
which are centered on pursuing penetration SecureWorks provides proven threat detection.
testing. It is difficult to do penetration testing Offered as a service, SecureWorks monitors the
on live systems, and the challenge is that you’ll environment and looks for any kind of threats
never really be able to test the vulnerabilities whether from the network, or internally, or
across the network for fear of bringing down externally. The reason SecureWorks is based on the
critical applications. Dell EMC Cyber Solutions Cloudera ApacheTM Hadoop® software platform
Group has been developing applications that is because the amount of attacks happening on a
enable penetration at a database level with zero given environment is so high. It’s necessary to be
impact on the database and provides real-time able to monitor logs from network devices, logs
assessments across the network in seconds. As from computers, notebooks, servers, etc. Typically,
soon as these threats are identified, assessments you don’t have anywhere to put all that log data,
can be announced to determine how fit the security and you don’t have anything fast enough to process
solution is. The Cyber Solutions Group is part of and analyze the data. Hadoop is an open source
Dell EMC, and features expertise in the realms of analytics platform, built from the ground up, to
advanced storage technologies, high availability, address today’s big data challenges. It enables
cyber security, big data, cloud computing, and government agencies to load and consolidate
data science. data from various sources into the highly scalable
Hadoop Distributed File System (HDFS). This data
can then be processed using highly distributed
compute jobs through the MapReduce framework.
12
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

By integrating Hadoop with an IT environment, Another cybersecurity solution is from RSA, also
you’re able to achieve two important aspects. First, a part of Dell EMC Technologies. RSA security
you’re able to solve the scalability problem — not analytics was used by Los Angeles World Airports
in terms of infrastructure, but rather scalability (LAWA) in order to track everything that happens
as far as being able to detect a threat — a threat within its environment. Working frequently with the
might be a log, a Trojan, something that came in FBI and the Secret Service, it has to be accountable
through the firewall. So now you’re able to process for its cybersecurity. Its goal was to have real-time
and analyze that data in any given format and put detection of security events in order to ensure
it in Hadoop. Second, by putting your data in a public safety. RSA security analytics has enabled
single repository, like a data lake, you’re able to LAWA to greatly improve the speed of its response
layer on statistical models and algorithms to detect to immediate threats. The solution enables deep-
anomalies and detect the threats on top of Hadoop. dive into payloads before and after a security event
This enables you to build a single repository to do and delivers more information about each device
all your analysis —one way to ingest data, one way than was previously possible. The RSA cybersecurity
to process data, and one way to analyze data — solution has also helped shorten incident response
to get an operational model of how you’re going time as analysts can see all the information in one
to consume all the data and identify a threat. place, rather than spending time searching for it.

Case Studies: Dell EMC Focused Customer Use Cases


In order to illustrate how government agencies are software-defined storage automation. The platform
rapidly moving forward with the adoption of the big was built to integrate scalability, speed, and agility
data technology stack, in this section we’ll consider of public cloud platforms with the control and
a number of use case examples that have benefited security of private systems.
from these tools. In addition, these project
profiles show how big data is steadily merging } San Diego Supercomputer Center
with traditional HPC architectures. In each case,
significant amounts of data are being collected The San Diego Supercomputer Center (SDSC)
and analyzed in the pursuit of more streamlined was established as one of the nation’s first
government. supercomputer centers under a cooperative
agreement by the National Science Foundation
(NSF) in collaboration with UC San Diego. The
} Dubai’s Smart City Initiative
center opened its doors on November 14, 1985.
Dell EMC has provided an enterprise hybrid cloud Today, SDSC provides big data expertise and
platform to the technology arm of Dubai’s smart services to the national research community.
city initiative. Smart Dubai Government deployed
In 2013, SDSC was awarded a $12 million grant
the cloud platform to establish an Information
by the NSF to deploy the Comet supercomputer
Technology-As-A-Service infrastructure that
with the help of Dell and Intel as a platform for
will work to aid the delivery of new services for
performing big data analytics. Using the Intel®
individuals and organizations. The cloud platform
Xeon® Processor E5-2680 v3, SDSC uses parallel
is designed to provide self-service and automation
software architectures like Hadoop and Spark to
functions, deeper visibility into performance and
perform important research in the field of precision
capacity usage of infrastructure resources, and
health and medicine. It also performs wildfire
a unified backup and recovery system.
analysis and clustering of weather patterns as
The hybrid cloud platform is based on VMware’s well as integration of satellite and sensor data
vCloud Suite and Dell EMC’s VPLEX and Vblock to produce fire models in real-time.
converged infrastructure and ViPR Controller

13
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com
Guide to Data Analytics in Government

} National Center for Using Cypress enables Tulane to conduct new


Supercomputing Applications scientific research in fields such as epigenetics
(the study of the mechanisms that regulate gene
Established in 1986 as one of the original sites of activity), cytometry (the measurements of the
the National Science Foundation’s Supercomputer existence of certain subsets of cells within a kind
Centers Program, the National Center for Super- of tissue in the human body), primate research,
computing Applications (NCSA) is supported by sports-related concussion research, and the
the state of Illinois, the University of Illinois, the mapping of the human brain.
National Science Foundation, and grants from other
federal agencies. The NCSA provides computing,
data, networking, and visualization resources and
} Translational Genomics Research
Institute
services that help scientists, engineers, and scholars
at the University of Illinois at Urbana-Champaign To advance health through genomic sequencing and
and across the country. The organization manages personalized medicine, the Translational Genomics
several supercomputing resources, including Research Institute (TGen) required a robust,
the iForge HPC cluster based on Dell EMC and scalable high-performance computing environment
Intel technologies. One particularly compelling complimented with powerful data analytics tools
scientific research project that’s housed in the NCSA for its Dell EMC Cloudera Apache Hadoop platform,
building is the Dark Energy Survey (DES), a survey accelerated by Intel. For example, a genome map
of the Southern sky aimed at understanding the is required before a personalized treatment for
accelerating expansion rate of the universe. The neuroblastoma, a cancer of the nerve cells, can be
project is based on the iForge cluster and ingests designed for the patient, but conventional genome
about 1.5 terabytes daily. mapping takes up to six months. Dell EMC helped
TGen reduce the time from six months to less than
} Tulane University four hours and enables biopsy to treatment in less
than 21 days. Now TGen is widely known for their
As part of its rebuilding efforts after Hurricane comprehensive genome analysis and collaboration
Katrina, Tulane University partnered with Dell EMC through efficient and innovative use of their
and Intel to build a new HPC cluster to enable technology to support researchers and clinicians to
the analysis of large sets of scientific data. The deliver on the promises of personalized medicine
cluster is essential to power data analytics in for better patient outcomes.
support of scientific research in the life sciences
and other fields. For example, the school has
numerous oncology research projects that involve Summary
statistical analysis of large data sets. Tulane also
has researchers studying nanotechnology, the Data analytics for government is a rapidly evolving
manipulation of matter at the molecular level, field, offering exciting opportunities that, when
involving large amounts of data. explored and applied, can help fragile states
uncover powerful and effective methods for
Tulane worked with Dell EMC to design a new HPC optimizing governance. Furthermore, it is clear
cluster dubbed Cypress, consisting of 124 Xeon- that government agencies have an appetite for
based PowerEdge C8220X server nodes, connected embracing big data technologies to help transform
through the high-density, low-latency Z9500 the public service experience for citizens and
switch, providing a total computational theoretical employees. Government leaders need to focus on
peak performance of more than 350 teraflops of delivering value and on adopting these emerging
computational power. Dell EMC also leveraged technologies while creating the kind of internal
their relationship with Intel, who in turn leveraged conditions that will inspire employees to embrace
their relationship with leading Hadoop distribution change.
Cloudera allowing Tulane to do data analytics using
Hadoop in an HPC environment.

14
Copyright © 2017, insideBIGDATA, LLC www.insidebigdata.com | 508-259-8570 | Kevin@insideHPC.com