
Data Mining with Big Data


Xindong Wu, Fellow, IEEE, Xingquan Zhu, Senior Member, IEEE,
Gong-Qing Wu, and Wei Ding, Senior Member, IEEE

Abstract—Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution and proposes a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.

Index Terms—Big Data, data mining, heterogeneity, autonomous sources, complex and evolving associations

1 INTRODUCTION

Dr. Yan Mo won the 2012 Nobel Prize in Literature. This is probably the most controversial Nobel prize of this category. Searching on Google with "Yan Mo Nobel Prize" resulted in 1,050,000 web pointers on the Internet (as of 3 January 2013). "For all praises as well as criticisms," said Mo recently, "I am grateful." What types of praises and criticisms has Mo actually received over his 31-year writing career? As comments keep coming on the Internet and in various news media, can we summarize all types of opinions in different media in a real-time fashion, including updated, cross-referenced discussions by critics? This type of summarization program is an excellent example for Big Data processing, as the information comes from multiple, heterogeneous, autonomous sources with complex and evolving relationships, and keeps growing.

Along with the above example, the era of Big Data has arrived [37], [34], [29]. Every day, 2.5 quintillion bytes of data are created, and 90 percent of the data in the world today were produced within the past two years [26]. Our capability for data generation has never been so powerful and enormous since the invention of information technology in the early 19th century. As another example, on 4 October 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within 2 hours [46]. Among all these tweets, the specific moments that generated the most discussions actually revealed the public interests, such as the discussions about medicare and vouchers. Such online discussions provide a new means to sense public interests and generate feedback in real time, and are more appealing than generic media such as radio or TV broadcasting. Another example is Flickr, a public picture-sharing site, which received 1.8 million photos per day, on average, from February to March 2012 [35]. Assuming the size of each photo is 2 megabytes (MB), this requires 3.6 terabytes (TB) of storage every single day. Indeed, as an old saying states, "a picture is worth a thousand words," so the billions of pictures on Flickr are a treasure tank for us to explore human society, social events, public affairs, disasters, and so on, if only we have the power to harness the enormous amount of data.

The above examples demonstrate the rise of Big Data applications, where data collection has grown tremendously and is beyond the ability of commonly used software tools to capture, manage, and process within a "tolerable elapsed time." The most fundamental challenge for Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions [40]. In many situations, the knowledge extraction process has to be very efficient and close to real time because storing all observed data is nearly infeasible. For example, the square kilometer array (SKA) [17] in radio astronomy consists of 1,000 to 1,500 15-meter dishes in a central 5-km area. It provides 100 times more sensitive vision than any existing radio telescope, answering fundamental questions about the Universe. However, with a 40 gigabytes (GB)/second data volume, the data generated from the SKA are exceptionally large. Although researchers have confirmed that interesting patterns, such as transient radio anomalies [41], can be discovered from the SKA data, existing methods can only work in an offline fashion and are incapable of handling this Big Data scenario in real time. As a result, the unprecedented data volumes require an effective data analysis and prediction platform to achieve fast response and real-time classification for such Big Data.

The remainder of the paper is structured as follows: In Section 2, we propose a HACE theorem to model Big Data characteristics. Section 3 summarizes the key challenges for Big Data mining. Some key research initiatives and the authors' national research projects in this field are outlined in Section 4. Related work is discussed in Section 5, and we conclude the paper in Section 6.

. X. Wu is with the School of Computer Science and Information Engineering, Hefei University of Technology, China, and the Department of Computer Science, University of Vermont.
. X. Zhu is with the Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431.
. G.-Q. Wu is with the School of Computer Science and Information Engineering, Hefei University of Technology, China.
. W. Ding is with the Computer Science Department, University of Massachusetts Boston, 100 Morrissey Blvd., Boston, MA 02125.

Manuscript received 25 Jan. 2013; revised 13 Apr. 2013; accepted 5 June 2013; published online 24 June 2013. Recommended for acceptance by X. He. Digital Object Identifier no. 10.1109/TKDE.2013.109.

2 BIG DATA CHARACTERISTICS: HACE THEOREM

HACE Theorem. Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationships among data.

These characteristics make it an extreme challenge to discover useful knowledge from Big Data. In a naïve sense, we can imagine that a number of blind men are trying to size up a giant elephant (see Fig. 1), which is the Big Data in this context. The goal of each blind man is to draw a picture (or conclusion) of the elephant according to the part of information he collects during the process. Because each person's view is limited to his local region, it is not surprising that the blind men will each conclude independently that the elephant "feels" like a rope, a hose, or a wall, depending on the region each of them is limited to. To make the problem even more complicated, let us assume that 1) the elephant is growing rapidly and its pose changes constantly, and 2) each blind man may have his own (possibly unreliable and inaccurate) information sources that tell him biased knowledge about the elephant (e.g., one blind man may exchange his feeling about the elephant with another blind man, where the exchanged knowledge is inherently biased). Exploring the Big Data in this scenario is equivalent to aggregating heterogeneous information from different sources (blind men) to help draw the best possible picture revealing the genuine gesture of the elephant in a real-time fashion. Indeed, this task is not as simple as asking each blind man to describe his feelings about the elephant and then getting an expert to draw one single picture with a combined view, given that each individual may speak a different language (heterogeneous and diverse information sources) and may even have privacy concerns about the messages he delivers in the information exchange process.

Fig. 1. The blind men and the giant elephant: the localized (limited) view of each blind man leads to a biased conclusion.

2.1 Huge Data with Heterogeneous and Diverse Dimensionality

One of the fundamental characteristics of Big Data is the huge volume of data represented by heterogeneous and diverse dimensionalities. This is because different information collectors prefer their own schemata or protocols for data recording, and the nature of different applications also results in diverse data representations. For example, each single human being in a biomedical world can be represented by simple demographic information such as gender, age, and family disease history. For X-ray examinations and CT scans, images or videos are used to represent the results because they provide visual information for doctors to carry out detailed examinations. For a DNA or genomic-related test, microarray expression images and sequences are used to represent the genetic code information, because this is the way that our current techniques acquire the data. Under such circumstances, heterogeneous features refer to the different types of representations for the same individuals, and diverse features refer to the variety of features involved to represent each single observation. Considering that different organizations (or health practitioners) may have their own schemata to represent each patient, data heterogeneity and diverse dimensionality become major challenges if we are trying to enable data aggregation by combining data from all sources.

2.2 Autonomous Sources with Distributed and Decentralized Control

Autonomous data sources with distributed and decentralized controls are a main characteristic of Big Data applications. Being autonomous, each data source is able to generate and collect information without involving (or relying on) any centralized control. This is similar to the World Wide Web (WWW) setting, where each web server provides a certain amount of information and each server is able to function fully without necessarily relying on other servers. On the other hand, the enormous volumes of data also make an application vulnerable to attacks or malfunctions if the whole system has to rely on any centralized control unit. For major Big Data-related applications, such as Google, Flickr, Facebook, and Walmart, a large number of server farms are deployed all over the world to ensure nonstop services and quick responses for local markets. Such autonomous sources are not only the outcome of technical designs, but also the result of legislation and regulation rules in different countries/regions. For example, Walmart's Asian markets are inherently different from its North American markets in terms of seasonal promotions, top-selling items, and customer behaviors. More specifically, local government regulations also impact the wholesale management process and result in restructured data representations and data warehouses for local markets.

2.3 Complex and Evolving Relationships

While the volume of Big Data increases, so do the complexity and the relationships underneath the data. In an early stage of centralized information systems, the focus was on finding the best feature values to represent each observation. This is similar to using a number of data fields, such as age, gender, income, and education background, to characterize each individual. This type of sample-feature representation inherently treats each individual as an independent entity without considering their social connections, which are one of the most important factors of human society.
Our friend circles may be formed based on common hobbies, or people may be connected by biological relationships. Such social connections commonly exist not only in our daily activities, but are also very popular in cyberworlds. For example, major social network sites, such as Facebook or Twitter, are mainly characterized by social functions such as friend-connections and followers (in Twitter). The correlations between individuals inherently complicate the whole data representation and any reasoning process on the data. In the sample-feature representation, individuals are regarded as similar if they share similar feature values, whereas in the sample-feature-relationship representation, two individuals can be linked together (through their social connections) even though they might share nothing in common in the feature domains at all. In a dynamic world, the features used to represent the individuals and the social ties used to represent our connections may also evolve with respect to temporal, spatial, and other factors. Such a complication is becoming part of the reality for Big Data applications, where the key is to take the complex (nonlinear, many-to-many) data relationships, along with the evolving changes, into consideration, to discover useful patterns from Big Data collections.

3 DATA MINING CHALLENGES WITH BIG DATA

For an intelligent learning database system [52] to handle Big Data, the essential key is to scale up to the exceptionally large volume of data and provide treatments for the characteristics featured by the aforementioned HACE theorem. Fig. 2 shows a conceptual view of the Big Data processing framework, which includes three tiers, from inside out, with considerations on data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III).

Fig. 2. A Big Data processing framework: The research challenges form a three-tier structure and center around the "Big Data mining platform" (Tier I), which focuses on low-level data accessing and computing. Challenges on information sharing and privacy, and Big Data application domains and knowledge, form Tier II, which concentrates on high-level semantics, application domain knowledge, and user privacy issues. The outermost circle shows Tier III challenges on actual mining algorithms.

The challenges at Tier I focus on data accessing and arithmetic computing procedures. Because Big Data are often stored at different locations, and data volumes may continuously grow, an effective computing platform has to take distributed, large-scale data storage into consideration for computing. For example, typical data mining algorithms require all data to be loaded into main memory; this, however, is becoming a clear technical barrier for Big Data, because moving data across different locations is expensive (e.g., subject to intensive network communication and other I/O costs), even if we do have a super large main memory to hold all data for computing.

The challenges at Tier II center around semantics and domain knowledge for different Big Data applications. Such information can provide additional benefits to the mining process, as well as add technical barriers to the Big Data access (Tier I) and mining algorithms (Tier III). For example, depending on the domain application, the data privacy and information sharing mechanisms between data producers and data consumers can be significantly different. Sharing sensor network data for applications like water quality monitoring may not be discouraged, whereas releasing and sharing mobile users' location information is clearly not acceptable for the majority of, if not all, applications. In addition to the above privacy issues, the application domains can also provide additional information to benefit or guide Big Data mining algorithm designs. For example, in market basket transaction data, each transaction is considered independent, and the discovered knowledge is typically represented by finding highly correlated items, possibly with respect to different temporal and/or spatial restrictions. In a social network, on the other hand, users are linked and share dependency structures. The knowledge is then represented by user communities, leaders in each group, social influence modeling, and so on. Therefore, understanding semantics and application knowledge is important both for low-level data access and for high-level mining algorithm designs.

At Tier III, the data mining challenges concentrate on algorithm designs for tackling the difficulties raised by the Big Data volumes, distributed data distributions, and complex and dynamic data characteristics. The circle at Tier III contains three stages. First, sparse, heterogeneous, uncertain, incomplete, and multisource data are preprocessed by data fusion techniques. Second, complex and dynamic data are mined after preprocessing. Third, the global knowledge obtained by local learning and model fusion is tested, and relevant information is fed back to the preprocessing stage. Then, the model and parameters are adjusted according to the feedback. In the whole process, information sharing is not only a promise of smooth development of each stage, but also a purpose of Big Data processing.

In the following, we elaborate on these challenges with respect to the three-tier framework in Fig. 2.

3.1 Tier I: Big Data Mining Platform

In typical data mining systems, the mining procedures require computationally intensive computing units for data analysis and comparisons. A computing platform therefore needs efficient access to, at least, two types of resources: data and computing processors. For small-scale data mining tasks, a single desktop computer, which contains a hard disk and CPU processors, is sufficient to fulfill the data mining goals.
Indeed, many data mining algorithms are designed for this type of problem setting. For medium-scale data mining tasks, data are typically large (and possibly distributed) and cannot fit into main memory. Common solutions rely on parallel computing [43], [33] or collective mining [12] to sample and aggregate data from different sources, and then use parallel programming (such as the Message Passing Interface, MPI) to carry out the mining process.

For Big Data mining, because the data scale is far beyond the capacity that a single personal computer (PC) can handle, a typical Big Data processing framework relies on cluster computers with a high-performance computing platform, with a data mining task being deployed by running parallel programming tools, such as MapReduce or Enterprise Control Language (ECL), on a large number of computing nodes (i.e., clusters). The role of the software component is to make sure that a single data mining task, such as finding the best match for a query from a database with billions of records, is split into many small tasks, each of which runs on one or multiple computing nodes. For example, as of this writing, the world's most powerful supercomputer, Titan, deployed at Oak Ridge National Laboratory in Tennessee, contains 18,688 nodes, each with a 16-core CPU.
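As a minimal, self-contained sketch of this split-and-merge pattern (illustrative only; it is not the MapReduce or ECL machinery itself), the following Python fragment partitions a record collection into shards, lets each worker find its local best match for a query, and reduces the local winners to a global one. The `similarity` function and the in-memory `records` list are hypothetical stand-ins for a real scoring function and a distributed store.

```python
from concurrent.futures import ProcessPoolExecutor

def similarity(query, record):
    # Toy word-overlap score; a real deployment would use a
    # domain-specific matcher over billions of records.
    q, r = set(query.split()), set(record.split())
    return len(q & r) / max(len(q | r), 1)

def local_best(args):
    # "Map" step: one worker scans only its own shard.
    query, shard = args
    return max(((similarity(query, rec), rec) for rec in shard),
               default=(0.0, None))

def global_best(query, shards, workers=2):
    # "Reduce" step: merge the per-shard winners into a global answer.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        winners = list(pool.map(local_best, [(query, s) for s in shards]))
    return max(winners, key=lambda pair: pair[0])

if __name__ == "__main__":
    records = ["big data mining", "data stream classification",
               "privacy preserving mining", "deep learning models"]
    shards = [records[0::2], records[1::2]]  # stand-in for distributed storage
    print(global_best("mining big data", shards))
```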
Such a Big Data system, which blends both hardware and software components, is hardly available without key industrial stakeholders' support. In fact, for decades, companies have been making business decisions based on transactional data stored in relational databases. Big Data mining offers opportunities to go beyond traditional relational databases and to rely on less structured data: weblogs, social media, e-mail, sensors, and photographs that can be mined for useful information. Major business intelligence companies, such as IBM, Oracle, and Teradata, have all featured their own products to help customers acquire and organize these diverse data sources and coordinate with customers' existing data to find new insights and capitalize on hidden relationships.

3.2 Tier II: Big Data Semantics and Application Knowledge

Semantics and application knowledge in Big Data refer to numerous aspects related to regulations, policies, user knowledge, and domain information. The two most important issues at this tier are 1) data sharing and privacy, and 2) domain and application knowledge. The former provides answers to resolve concerns on how data are maintained, accessed, and shared, whereas the latter focuses on answering questions like "what are the underlying applications?" and "what are the knowledge or patterns users intend to discover from the data?"

3.2.1 Information Sharing and Data Privacy

Information sharing is an ultimate goal for all systems involving multiple parties [24]. While the motivation for sharing is clear, a real-world concern is that Big Data applications are related to sensitive information, such as banking transactions and medical records. Simple data exchanges or transmissions do not resolve privacy concerns [19], [25], [42]. For example, knowing people's locations and their preferences, one can enable a variety of useful location-based services, but public disclosure of an individual's locations/movements over time can have serious consequences for privacy. To protect privacy, two common approaches are to 1) restrict access to the data, such as adding certification or access control to the data entries, so sensitive information is accessible to a limited group of users only, and 2) anonymize data fields such that sensitive information cannot be pinpointed to an individual record [15]. For the first approach, common challenges are to design secured certification or access control mechanisms such that no sensitive information can be misused by unauthorized individuals. For data anonymization, the main objective is to inject randomness into the data to ensure a number of privacy goals. For example, the most common k-anonymity privacy measure is to ensure that each individual in the database is indistinguishable from k - 1 others. Common anonymization approaches use suppression, generalization, perturbation, and permutation to generate an altered version of the data, which is, in fact, some uncertain data. One of the major benefits of anonymization-based information sharing approaches is that, once anonymized, data can be freely shared across different parties without involving restrictive access controls. This naturally leads to another research area, namely privacy-preserving data mining [30], where multiple parties, each holding some sensitive data, try to achieve a common data mining goal without sharing any sensitive information inside the data. This privacy-preserving mining goal, in practice, can be achieved through two types of approaches: 1) using special communication protocols, such as Yao's protocol [54], to request the distributions of the whole data set, rather than requesting the actual values of each record, or 2) designing special data mining methods to derive knowledge from anonymized data (this is inherently similar to uncertain data mining methods).
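To make the k-anonymity idea concrete, the following sketch generalizes two hypothetical quasi-identifiers (age into 10-year bands, ZIP code into a 3-digit prefix) and checks whether every equivalence class contains at least k records. The field names and generalization rules are illustrative assumptions, not a prescribed anonymization scheme.

```python
from collections import Counter

def generalize(record):
    # Illustrative generalization: coarsen the quasi-identifiers
    # (age -> 10-year band, ZIP -> 3-digit prefix).
    age, zipcode = record["age"], record["zip"]
    band = (age // 10) * 10
    return (f"{band}-{band + 9}", zipcode[:3] + "**")

def is_k_anonymous(records, k):
    # Every combination of generalized quasi-identifiers must occur
    # at least k times, so each individual hides among k-1 others.
    classes = Counter(generalize(r) for r in records)
    return all(count >= k for count in classes.values())

records = [
    {"age": 27, "zip": "05401", "diagnosis": "flu"},
    {"age": 24, "zip": "05405", "diagnosis": "cold"},
    {"age": 31, "zip": "33431", "diagnosis": "asthma"},
    {"age": 38, "zip": "33433", "diagnosis": "flu"},
]
print(is_k_anonymous(records, k=2))  # True under this generalization
```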
3.2.2 Domain and Application Knowledge

Domain and application knowledge [28] provides essential information for designing Big Data mining algorithms and systems. In a simple case, domain knowledge can help identify the right features for modeling the underlying data (e.g., blood glucose level is clearly a better feature than body mass in diagnosing Type II diabetes). Domain and application knowledge can also help design achievable business objectives using Big Data analytical techniques. For example, stock market data are a typical domain that constantly generates a large quantity of information, such as bids, buys, and puts, every single second. The market continuously evolves and is impacted by different factors, such as domestic and international news, government reports, and natural disasters. An appealing Big Data mining task is to design a Big Data mining system to predict the movement of the market in the next one or two minutes. Such systems, even if the prediction accuracy is just slightly better than random guessing, will bring significant business value to the developers [9]. Without correct domain knowledge, it is a clear challenge to find effective metrics/measures to characterize the market movement, and such knowledge is often beyond the mind of the data miners, although some recent research has shown that, using social networks such as Twitter, it is possible to predict the stock market's upward/downward trends [7] with good accuracy.
3.3 Tier III: Big Data Mining Algorithms

3.3.1 Local Learning and Model Fusion for Multiple Information Sources

As Big Data applications are featured with autonomous sources and decentralized controls, aggregating distributed data sources to a centralized site for mining is systematically prohibitive due to the potential transmission costs and privacy concerns. On the other hand, although we can always carry out mining activities at each distributed site, the biased view of the data collected at each site often leads to biased decisions or models, just like in the elephant and blind men case. Under such a circumstance, a Big Data mining system has to enable an information exchange and fusion mechanism to ensure that all distributed sites (or information sources) can work together to achieve a global optimization goal. Model mining and correlation are the key steps to ensure that models or patterns discovered from multiple information sources can be consolidated to meet the global mining objective. More specifically, the global mining can be featured with a two-step (local mining and global correlation) process, at the data, model, and knowledge levels. At the data level, each local site can calculate data statistics based on the local data sources and exchange the statistics between sites to achieve a global data distribution view. At the model or pattern level, each site can carry out local mining activities, with respect to the localized data, to discover local patterns. By exchanging patterns between multiple sources, new global patterns can be synthesized by aggregating patterns across all sites [50]. At the knowledge level, model correlation analysis investigates the relevance between models generated from different data sources to determine how the data sources are correlated with each other, and how to form accurate decisions based on models built from autonomous sources.
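A minimal sketch of the data-level exchange just described, assuming each site is willing to share aggregate statistics but not raw records: every site computes its local count, sum, and sum of squares, and a coordinator fuses them into a global mean and variance. The site data below are invented for illustration.

```python
def local_stats(values):
    # Each site summarizes its raw data as (count, sum, sum of squares)
    # and shares only these aggregates, never the records themselves.
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss

def fuse(stats):
    # The coordinator merges the per-site aggregates into a global view.
    n = sum(st[0] for st in stats)
    s = sum(st[1] for st in stats)
    ss = sum(st[2] for st in stats)
    mean = s / n
    variance = ss / n - mean ** 2
    return mean, variance

site_a = [4.1, 5.0, 4.7]         # raw data stays at site A
site_b = [5.2, 4.9, 5.5, 5.1]    # raw data stays at site B
print(fuse([local_stats(site_a), local_stats(site_b)]))
```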
3.3.2 Mining from Sparse, Uncertain, and Incomplete Data

Sparse, uncertain, and incomplete data are defining features of Big Data applications. Being sparse, the number of data points is too small for drawing reliable conclusions. This is normally a complication of the data dimensionality issue, where data in a high-dimensional space (such as more than 1,000 dimensions) do not show clear trends or distributions. For most machine learning and data mining algorithms, high-dimensional sparse data significantly deteriorate the reliability of the models derived from the data. Common approaches are to employ dimension reduction or feature selection [48] to reduce the data dimensions, or to carefully include additional samples to alleviate the data scarcity, such as generic unsupervised learning methods in data mining.

Uncertain data are a special type of data reality where each data field is no longer deterministic but is subject to some random/error distributions. This is mainly linked to domain-specific applications with inaccurate data readings and collections. For example, data produced from GPS equipment are inherently uncertain, mainly because the technology barrier of the device limits the precision of the data to certain levels (such as 1 meter). As a result, each recording location is represented by a mean value plus a variance to indicate expected errors. For data privacy-related applications [36], users may intentionally inject randomness/errors into the data to remain anonymous. This is similar to the situation where an individual may not feel comfortable letting you know his/her exact income, but will be fine providing a rough range like [120k, 160k]. For uncertain data, the major challenge is that each data item is represented as a sample distribution rather than a single value, so most existing data mining algorithms cannot be directly applied. Common solutions take the data distributions into consideration to estimate model parameters. For example, error-aware data mining [49] utilizes the mean and the variance values of each single data item to build a Naïve Bayes model for classification. Similar approaches have also been applied to decision trees and database queries.

Incomplete data refer to missing data field values for some samples. The missing values can be caused by different realities, such as the malfunction of a sensor node, or systematic policies that intentionally skip some values (e.g., dropping some sensor node readings to save power for transmission). While most modern data mining algorithms have built-in solutions to handle missing values (such as ignoring data fields with missing values), data imputation is an established research field that seeks to impute missing values to produce improved models (compared to the ones built from the original data). Many imputation methods [20] exist for this purpose, and the major approaches are to fill in the most frequently observed values or to build learning models to predict possible values for each data field, based on the observed values of a given instance.
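The following sketch is a simplified, hypothetical reading of the error-aware idea described above, not the algorithm of [49]: each training item carries a reported mean and measurement variance, the class-conditional Gaussians are estimated from the means, and the average measurement variance inflates each class variance so that noisier readings yield flatter likelihoods.

```python
import math
from collections import defaultdict

def fit(items):
    # items: list of (mean, measurement_variance, label).
    # Class-conditional Gaussians are estimated from the reported means;
    # the average measurement variance is added to the class variance.
    groups = defaultdict(list)
    for mean, var, label in items:
        groups[label].append((mean, var))
    params = {}
    for label, obs in groups.items():
        mu = sum(m for m, _ in obs) / len(obs)
        class_var = sum((m - mu) ** 2 for m, _ in obs) / len(obs)
        noise_var = sum(v for _, v in obs) / len(obs)
        params[label] = (mu, class_var + noise_var + 1e-9,
                         len(obs) / len(items))  # (mean, variance, prior)
    return params

def predict(params, x):
    def log_gauss(x, mu, var):
        return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=lambda c: math.log(params[c][2])
               + log_gauss(x, params[c][0], params[c][1]))

train = [(1.0, 0.1, "low"), (1.2, 0.2, "low"),
         (3.0, 0.1, "high"), (3.3, 0.3, "high")]
print(predict(fit(train), 2.9))  # -> "high"
```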
3.3.3 Mining Complex and Dynamic Data

The rise of Big Data is driven by the rapid increase of complex data and their changes in volume and in nature [6]. Documents posted on WWW servers, Internet backbones, social networks, communication networks, and transportation networks are all featured with complex data. While the complex dependency structures underneath the data raise the difficulty for our learning systems, they also offer exciting opportunities that simple data representations are incapable of achieving. For example, researchers have successfully used Twitter, a well-known social networking site, to detect events such as earthquakes and major social activities, with nearly real-time speed and very high accuracy. In addition, by summarizing the queries users submit to search engines, which are all over the world, it is now possible to build an early warning system for detecting fast-spreading flu outbreaks [23]. Making use of complex data is a major challenge for Big Data applications, because any two parties in a complex network are potentially linked to each other by a social connection, and the number of such connections grows quadratically with the number of nodes in the network, so a million-node network may be subject to on the order of one trillion connections. For a large social network site like Facebook, the number of active users has already reached 1 billion, and analyzing such an enormous network is a big challenge for Big Data mining. If we take daily user actions/interactions into consideration, the scale of difficulty will be even more astonishing.
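To make the quadratic growth concrete: a network with n nodes admits n(n-1)/2 distinct unordered links, so for n = 10^6 there are roughly 5 × 10^11 possible connections, while counting ordered pairs gives n^2 = 10^12, the "one trillion" figure cited above. Either way, the candidate relationship space grows far faster than the node count itself.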
Inspired by the above challenges, many data mining methods have been developed to find interesting knowledge from Big Data with complex relationships and dynamically changing volumes. For example, finding communities and tracing their dynamically evolving relationships are essential for understanding and managing complex systems [3], [10]. Discovering outliers in a social network [8] is the first step to identify spammers and provide safe networking environments to our society.

If we were only facing huge amounts of structured data, users could solve the problem simply by purchasing more storage or improving storage efficiency. However, Big Data complexity is represented in many aspects, including complex heterogeneous data types, complex intrinsic semantic associations in data, and complex relationship networks among data. That is to say, the value of Big Data is in its complexity.

Complex heterogeneous data types. In Big Data, data types include structured data, unstructured data, and semistructured data. Specifically, there are tabular data (relational databases), text, hypertext, image, audio, and video data. The existing data models include key-value stores, bigtable clones, document databases, and graph databases, listed in ascending order of complexity. Traditional data models are incapable of handling complex data in the context of Big Data, and currently there is no acknowledged effective and efficient data model to handle Big Data.

Complex intrinsic semantic associations in data. News on the web, comments on Twitter, pictures on Flickr, and clips of video on YouTube may all discuss an academic award-winning event at the same time. There is no doubt that there are strong semantic associations in these data. Mining complex semantic associations from "text-image-video" data will significantly help improve application system performance, such as search engines or recommendation systems. However, in the context of Big Data, it is a great challenge to efficiently describe semantic features and to build semantic association models to bridge the semantic gap of various heterogeneous data sources.

Complex relationship networks in data. In the context of Big Data, there exist relationships between individuals. On the Internet, individuals are webpages, and the pages linking to each other via hyperlinks form a complex network. There also exist social relationships between individuals, forming complex social networks, such as big relationship data from Facebook, Twitter, LinkedIn, and other social media [5], [13], [56], including call detail records (CDR), device and sensor information [1], [44], GPS and geocoded map data, massive image files transferred by the Managed File Transfer protocol, web text and click-stream data [2], scientific information, e-mail [31], and so on. To deal with complex relationship networks, emerging research efforts have begun to address the issues of structure-and-evolution, crowds-and-interaction, and information-and-communication.

The emergence of Big Data has also spawned new computer architectures for real-time data-intensive processing, such as the open source Apache Hadoop project that runs on high-performance clusters. The size or complexity of Big Data, including transaction and interaction data sets, exceeds the regular technical capability in capturing, managing, and processing these data within reasonable cost and time limits. In the context of Big Data, real-time processing for complex data is a very challenging task.

4 RESEARCH INITIATIVES AND PROJECTS

To tackle the Big Data challenges and "seize the opportunities afforded by the new, data-driven revolution," the US National Science Foundation (NSF), under the Obama Administration's Big Data initiative, announced the BIGDATA solicitation in 2012. This federal initiative has resulted in a number of winning projects to investigate the foundations for Big Data management (led by the University of Washington), analytical approaches for genomics-based massive data computation (led by Brown University), large-scale machine learning techniques for high-dimensional data sets that may be as large as 500,000 dimensions (led by Carnegie Mellon University), social analytics for large-scale scientific literature (led by Rutgers University), and several others. These projects seek to develop methods, algorithms, frameworks, and research infrastructures that allow us to bring the massive amounts of data down to a human-manageable and interpretable scale. Funding agencies in other countries, such as the National Natural Science Foundation of China (NSFC), are also catching up with national grants on Big Data research.

Meanwhile, since 2009, the authors have taken the lead in the following national projects that all involve Big Data components:

. Integrating and mining biodata from multiple sources in biological networks, sponsored by the US National Science Foundation, Medium Grant No. CCF-0905337, 1 October 2009 - 30 September 2013.
Issues and significance. We have integrated and mined biodata from multiple sources to decipher and utilize the structure of biological networks to shed new insights on the functions of biological systems. We address the theoretical underpinnings and current and future enabling technologies for integrating and mining biological networks. We have expanded and integrated the techniques and methods in information acquisition, transmission, and processing for information networks. We have developed methods for semantic-based data integration, automated hypothesis generation from mined data, and automated scalable analytical tools to evaluate simulation results and refine models.

. Big Data Fast Response: real-time classification of Big Data streams, sponsored by the Australian Research Council (ARC), Grant No. DP130102748, 1 January 2013 - 31 December 2015.
Issues and significance. We propose to build a stream-based Big Data analytic framework for fast response and real-time decision making. The key challenges and research issues include:
- designing Big Data sampling mechanisms to reduce Big Data volumes to a manageable size for processing (see the sampling sketch after this project list);
- building prediction models from Big Data streams. Such models can adaptively adjust to the dynamic changes of the data, as well as accurately predict the trend of the data in the future; and
- a knowledge indexing framework to ensure real-time data monitoring and classification for Big Data applications.

. Pattern matching and mining with wildcards and length constraints, sponsored by the National Natural Science Foundation of China, Grant Nos. 60828005 (Phase 1, 1 January 2009 - 31 December 2010) and 61229301 (Phase 2, 1 January 2013 - 31 December 2016).
Issues and significance. We perform a systematic investigation on pattern matching, pattern mining with wildcards, and application problems as follows:
- exploration of the NP-hard complexity of the matching and mining problems,
- multiple pattern matching with wildcards,
- approximate pattern matching and mining, and
- application of our research to ubiquitous personalized information processing and bioinformatics.

. Key technologies for integration and mining of multiple, heterogeneous data sources, sponsored by the National High Technology Research and Development Program (863 Program) of China, Grant No. 2012AA011005, 1 January 2012 - 31 December 2014.
Issues and significance. We have performed an investigation on the availability and statistical regularities of multisource, massive, and dynamic information, including cross-media search based on information extraction, sampling, uncertain information querying, and cross-domain and cross-platform information polymerization. To break through the limitations of traditional data mining methods, we have studied heterogeneous information discovery and mining in complex inline data, mining in data streams, multigranularity knowledge discovery from massive multisource data, distribution regularities of massive knowledge, and quality fusion of massive knowledge.

. Group influence and interactions in social networks, sponsored by the National Basic Research 973 Program of China, Grant No. 2013CB329604, 1 January 2013 - 31 December 2017.
Issues and significance. We have studied group influence and interactions in social networks, including
- employing group influence and information diffusion models, and deliberating group interaction rules in social networks using dynamic game theory,
- studying interactive individual selection and effect evaluations under social networks affected by group emotion, and analyzing emotional interactions and influence among individuals and groups, and
- establishing an interactive influence model and its computing methods for social network groups, to reveal the interactive influence effects and evolution of social networks.
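One standard way to realize the sampling mechanism called for in the Big Data Fast Response project above is reservoir sampling, which maintains a fixed-size uniform random sample of an unbounded stream in a single pass; the sketch below is a generic textbook version, not the project's implementation.

```python
import random

def reservoir_sample(stream, k, rng=random):
    # Keep a uniform random sample of size k from a stream of unknown
    # (possibly unbounded) length, using O(k) memory and one pass.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # item survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```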
5 RELATED WORK

5.1 Big Data Mining Platforms (Tier I)

Due to the multisource, massive, heterogeneous, and dynamic characteristics of application data involved in a distributed environment, one of the most important characteristics of Big Data is to carry out computing on petabyte (PB)-, or even exabyte (EB)-level data with a complex computing process. Therefore, utilizing a parallel computing infrastructure, its corresponding programming language support, and software models to efficiently analyze and mine the distributed data are the critical goals for Big Data processing to change from "quantity" to "quality."

Currently, Big Data processing mainly depends on parallel programming models like MapReduce, as well as on providing cloud computing platforms of Big Data services for the public. MapReduce is a batch-oriented parallel computing model, and there is still a certain performance gap relative to relational databases. Improving the performance of MapReduce and enhancing the real-time nature of large-scale data processing have received a significant amount of attention, with MapReduce parallel programming being applied to many machine learning and data mining algorithms. Data mining algorithms usually need to scan through the training data to obtain the statistics needed to solve or optimize model parameters, which calls for intensive computing to access the large-scale data frequently. To improve the efficiency of algorithms, Chu et al. proposed a general-purpose parallel programming method, applicable to a large number of machine learning algorithms, based on the simple MapReduce programming model on multicore processors. Ten classical data mining algorithms are realized in the framework, including locally weighted linear regression, k-Means, logistic regression, naive Bayes, linear support vector machines, independent component analysis, principal component analysis, Gaussian discriminant analysis, expectation maximization, and back-propagation neural networks [14]. With the analysis of these classical machine learning algorithms, they argue that the computational operations in the algorithm learning process can be transformed into summation operations over a number of training data subsets. Summation operations can be performed on different subsets independently, and parallelization can be executed easily on the MapReduce programming platform. Therefore, a large-scale data set can be divided into several subsets and assigned to multiple Mapper nodes. Then, various summation operations can be performed on the Mapper nodes to collect intermediate results. Finally, learning algorithms are executed in parallel through merging the summations on Reduce nodes.
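The summation form can be illustrated with simple least-squares regression, whose sufficient statistics are plain sums that each Mapper can compute on its own subset and a Reducer only has to add up; this generic sketch follows the spirit of the scheme in [14] rather than reproducing its framework.

```python
def map_stats(subset):
    # Mapper: compute the sufficient statistics (all plain summations)
    # for least-squares fitting of y = a*x + b on one data subset.
    n = len(subset)
    sx = sum(x for x, _ in subset)
    sy = sum(y for _, y in subset)
    sxx = sum(x * x for x, _ in subset)
    sxy = sum(x * y for x, y in subset)
    return n, sx, sy, sxx, sxy

def reduce_fit(stats):
    # Reducer: add up the per-subset sums, then solve the normal equations.
    n, sx, sy, sxx, sxy = (sum(t) for t in zip(*stats))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

data = [(x, 2.0 * x + 1.0) for x in range(100)]
subsets = [data[i::4] for i in range(4)]            # four "Mapper" shards
print(reduce_fit([map_stats(s) for s in subsets]))  # ~ (2.0, 1.0)
```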
Ranger et al. [39] proposed a MapReduce-based application programming interface, Phoenix, which supports parallel programming in the environment of multicore and multiprocessor systems, and realized three data mining algorithms: k-Means, principal component analysis, and linear regression. Gillick et al. [22] improved MapReduce's implementation mechanism in Hadoop, evaluated algorithm performance for single-pass learning, iterative learning, and query-based learning in the MapReduce framework, studied data sharing between computing nodes involved in parallel learning algorithms and distributed data storage, and showed that the MapReduce mechanisms are suitable for large-scale data mining by testing a series of standard data mining tasks on medium-size clusters.
Papadimitriou and Sun [38] proposed a distributed collaborative aggregation (DisCo) framework using practical distributed data preprocessing and collaborative aggregation techniques. The implementation on Hadoop, in an open source MapReduce project, showed that DisCo has perfect scalability and can process and analyze massive data sets (with hundreds of GB).

To improve the weak scalability of traditional analysis software and the poor analysis capabilities of Hadoop systems, Das et al. [16] conducted a study of the integration of R (open source statistical analysis software) and Hadoop. The in-depth integration pushes data computation to parallel processing, which enables powerful deep analysis capabilities for Hadoop. Wegener et al. [47] achieved the integration of Weka (an open-source machine learning and data mining software tool) and MapReduce. Standard Weka tools can only run on a single machine, with a 1-GB memory limitation. After algorithm parallelization, Weka breaks through this limitation and improves performance by taking advantage of parallel computing to handle more than 100 GB of data on MapReduce clusters. Ghoting et al. [21] proposed Hadoop-ML, on which developers can easily build task-parallel or data-parallel machine learning and data mining algorithms on program blocks under the language runtime environment.

5.2 Big Data Semantics and Application Knowledge (Tier II)

In privacy protection of massive data, Ye et al. [55] proposed a multilayer rough set model, which can accurately describe the granularity change produced by different levels of generalization and provide a theoretical foundation for measuring data effectiveness criteria in the anonymization process, and designed a dynamic mechanism for balancing privacy and data utility, to solve the optimal generalization/refinement order for classification. A recent paper on confidentiality protection in Big Data [4] summarizes a number of methods for protecting public release data, including aggregation (such as k-anonymity, l-diversity, etc.), suppression (i.e., deleting sensitive values), data swapping (i.e., switching values of sensitive data records to prevent users from matching), adding random noise, or simply replacing the whole original data values at a high risk of disclosure with values synthetically generated from simulated distributions.
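As a toy illustration of the "adding random noise" method listed above (and only that; properly calibrated mechanisms such as differential privacy require far more care), the sketch below perturbs a sensitive numeric field with zero-mean Gaussian noise before release, making individual records unreliable while roughly preserving aggregates.

```python
import random
import statistics

def perturb(values, sigma=5.0, rng=random):
    # Add zero-mean Gaussian noise to each sensitive numeric value.
    # Individual records become unreliable, but aggregates over large
    # groups stay approximately correct.
    return [v + rng.gauss(0.0, sigma) for v in values]

salaries = [52_000, 61_500, 47_300, 80_200, 55_800]
released = perturb(salaries, sigma=2_000)
print(statistics.mean(salaries), round(statistics.mean(released)))
```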
For applications involving Big Data and tremendous data volumes, it is often the case that data are physically distributed at different locations, which means that users no longer physically possess the storage of their data. To carry out Big Data mining, having an efficient and effective data access mechanism is vital, especially for users who intend to hire a third party (such as data miners or data auditors) to process their data. Under such a circumstance, users' privacy restrictions may include 1) no local data copies or downloading, and 2) all analysis must be deployed based on the existing data storage systems without violating existing privacy settings, among many others. In Wang et al. [48], a privacy-preserving public auditing mechanism for large-scale data storage (such as cloud computing systems) has been proposed. The public key-based mechanism is used to enable third-party auditing (TPA), so users can safely allow a third party to analyze their data without breaching the security settings or compromising the data privacy.

For most Big Data applications, privacy concerns focus on excluding the third party (such as data miners) from directly accessing the original data. Common solutions are to rely on privacy-preserving approaches or encryption mechanisms to protect the data. A recent effort by Lorch et al. [32] indicates that users' "data access patterns" can also raise severe data privacy issues and lead to disclosures of geographically co-located users or users with common interests (e.g., two users searching for the same map locations are likely to be geographically co-located). In their system, namely Shroud, users' data access patterns at the servers are hidden by using virtual disks. As a result, it can support a variety of Big Data applications, such as microblog search and social network queries, without compromising user privacy.

5.3 Big Data Mining Algorithms (Tier III)

To adapt to multisource, massive, dynamic Big Data, researchers have expanded existing data mining methods in many ways, including the efficiency improvement of single-source knowledge discovery methods [11], designing a data mining mechanism from a multisource perspective [50], [51], as well as the study of dynamic data mining methods and the analysis of stream data [18], [12]. The main motivation for discovering knowledge from massive data is improving the efficiency of single-source mining methods. On the basis of gradual improvement of computer hardware functions, researchers continue to explore ways to improve the efficiency of knowledge discovery algorithms to make them better suited for massive data. Because massive data are typically collected from different data sources, the knowledge discovery of massive data must be performed using a multisource mining mechanism. As real-world data often come as a data stream or a characteristic flow, a well-established mechanism is needed to discover knowledge and master the evolution of knowledge in the dynamic data source. Therefore, the massive, heterogeneous, and real-time characteristics of multisource data provide essential differences between single-source knowledge discovery and multisource data mining.

Wu et al. [50], [51], [45] proposed and established the theory of local pattern analysis, which has laid a foundation for global knowledge discovery in multisource data mining. This theory provides a solution not only for the problem of full search, but also for finding global models that traditional mining methods cannot find. Local pattern analysis of data processing can avoid putting different data sources together to carry out centralized computing.

Data streams are widely used in financial analysis, online trading, medical testing, and so on. Static knowledge discovery methods cannot adapt to the characteristics of dynamic data streams, such as continuity, variability, rapidity, and infinity, and can easily lead to the loss of useful information. Therefore, effective theoretical and technical frameworks are needed to support data stream mining [18], [57].

Knowledge evolution is a common phenomenon in real-world systems. For example, a clinician's treatment programs will constantly adjust with the conditions of the patient, such as family economic status, health insurance, the course of treatment, treatment effects, and the distribution of cardiovascular and other chronic epidemiological changes with the passage of time.
In the knowledge discovery process, concept drifting aims to analyze the phenomenon of implicit target concept changes, or even fundamental changes, triggered by dynamics and context in data streams. According to different types of concept drift, knowledge evolution can take the forms of mutation drift, progressive drift, and data distribution drift, based on single features, multiple features, and streaming features [53].
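A common, generic way to surface such drift in practice (illustrative only; the drift taxonomy above is from [53]) is to compare a model's error rate over a recent window against its long-run error and to flag a possible drift when the gap exceeds a margin:

```python
from collections import deque

class DriftMonitor:
    # Flags possible concept drift when the error rate over a recent
    # window exceeds the long-run error rate by a fixed margin.
    def __init__(self, window=100, margin=0.10):
        self.recent = deque(maxlen=window)
        self.errors = 0
        self.total = 0
        self.margin = margin

    def update(self, predicted, actual):
        err = int(predicted != actual)
        self.recent.append(err)
        self.errors += err
        self.total += 1
        long_run = self.errors / self.total
        windowed = sum(self.recent) / len(self.recent)
        return windowed > long_run + self.margin  # True => consider retraining

monitor = DriftMonitor()
if monitor.update(predicted=1, actual=0):
    print("drift suspected: retrain or adapt the model")
```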
REFERENCES
6 CONCLUSIONS [1] R. Ahmed and G. Karypis, “Algorithms for Mining the Evolution
of Conserved Relational States in Dynamic Networks,” Knowledge
Driven by real-world applications and key industrial and Information Systems, vol. 33, no. 3, pp. 603-630, Dec. 2012.
stakeholders and initialized by national funding agencies, [2] M.H. Alam, J.W. Ha, and S.K. Lee, “Novel Approaches to
managing and mining Big Data have shown to be a Crawling Important Pages Early,” Knowledge and Information
challenging yet very compelling task. While the term Big Systems, vol. 33, no. 3, pp 707-734, Dec. 2012.
[3] S. Aral and D. Walker, “Identifying Influential and Susceptible
Data literally concerns about data volumes, our HACE
Members of Social Networks,” Science, vol. 337, pp. 337-341, 2012.
theorem suggests that the key characteristics of the Big Data [4] A. Machanavajjhala and J.P. Reiter, “Big Privacy: Protecting
are 1) huge with heterogeneous and diverse data sources, Confidentiality in Big Data,” ACM Crossroads, vol. 19, no. 1,
2) autonomous with distributed and decentralized control, pp. 20-23, 2012.
and 3) complex and evolving in data and knowledge [5] S. Banerjee and N. Agarwal, “Analyzing Collective Behavior from
Blogs Using Swarm Intelligence,” Knowledge and Information
associations. Such combined characteristics suggest that Big Systems, vol. 33, no. 3, pp. 523-547, Dec. 2012.
Data require a “big mind” to consolidate data for maximum [6] E. Birney, “The Making of ENCODE: Lessons for Big-Data
values [27]. Projects,” Nature, vol. 489, pp. 49-51, 2012.
To explore Big Data, we have analyzed several chal- [7] J. Bollen, H. Mao, and X. Zeng, “Twitter Mood Predicts the Stock
lenges at the data, model, and system levels. To support Big Data mining, high-performance computing platforms are required, and these demand systematic designs to unleash the full power of Big Data. At the data level, the autonomous information sources and the variety of data collection environments often result in data with complicated conditions, such as missing or uncertain values. In other situations, privacy concerns, noise, and errors can be introduced into the data, producing altered data copies; developing a safe and sound information-sharing protocol is a major challenge. At the model level, the key challenge is to generate global models by combining locally discovered patterns to form a unifying view. This requires carefully designed algorithms to analyze model correlations between distributed sites and to fuse decisions from multiple sources into the best possible global model, as sketched in the example below. At the system level, the essential challenge is that a Big Data mining framework must consider the complex relationships between samples, models, and data sources, along with how they evolve over time and other factors. A system needs to be carefully designed so that unstructured data can be linked through their complex relationships to form useful patterns, and so that growth in data volumes and item relationships helps reveal legitimate patterns for predicting trends and future developments.

We regard Big Data as an emerging trend, and the need for Big Data mining is arising in all science and engineering domains. With Big Data technologies, we will hopefully be able to provide the most relevant and accurate social-sensing feedback to better understand our society in real time. We can further stimulate the participation of the general public in the data production circle for societal and economic events. The era of Big Data has arrived.
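To make the model-level challenge concrete, the following minimal Python sketch (our illustration, not an implementation from any work cited in this paper) trains an independent classifier on each of three simulated autonomous sources and fuses their decisions by majority vote. The simulated data, the site partitioning, the choice of decision trees, and the unweighted voting rule are all assumptions made for the example.

```python
# A minimal sketch of model-level fusion for distributed Big Data
# mining: each autonomous site mines a local model on its own data
# partition, and a global decision is formed by majority voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Simulate three autonomous sources sharing one feature space.
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
site_indices = np.array_split(np.arange(len(X)), 3)

# Mine a locally discovered pattern (here, a decision tree) per site.
local_models = [
    DecisionTreeClassifier(max_depth=5, random_state=0).fit(X[idx], y[idx])
    for idx in site_indices
]

def fuse(models, X_new):
    """Combine local models into a unifying global decision by
    unweighted majority vote across sites (binary labels 0/1)."""
    votes = np.stack([m.predict(X_new) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

print(fuse(local_models, X[:5]))  # fused global predictions
```

In practice, the unweighted vote would be replaced by weights reflecting each site's estimated reliability, and the fused model would still need to be audited against the evolving sample, model, and source relationships discussed above.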
ACKNOWLEDGMENTS

This work was supported by the National 863 Program of China (2012AA011005), the National 973 Program of China

[7] J. Bollen, H. Mao, and X. Zeng, “Twitter Mood Predicts the Stock Market,” J. Computational Science, vol. 2, no. 1, pp. 1-8, 2011.
[8] S. Borgatti, A. Mehra, D. Brass, and G. Labianca, “Network Analysis in the Social Sciences,” Science, vol. 323, pp. 892-895, 2009.
[9] J. Bughin, M. Chui, and J. Manyika, Clouds, Big Data, and Smart Assets: Ten Tech-Enabled Business Trends to Watch. McKinsey Quarterly, 2010.
[10] D. Centola, “The Spread of Behavior in an Online Social Network Experiment,” Science, vol. 329, pp. 1194-1197, 2010.
[11] E.Y. Chang, H. Bai, and K. Zhu, “Parallel Algorithms for Mining Large-Scale Rich-Media Data,” Proc. 17th ACM Int’l Conf. Multimedia (MM ’09), pp. 917-918, 2009.
[12] R. Chen, K. Sivakumar, and H. Kargupta, “Collective Mining of Bayesian Networks from Distributed Heterogeneous Data,” Knowledge and Information Systems, vol. 6, no. 2, pp. 164-187, 2004.
[13] Y.-C. Chen, W.-C. Peng, and S.-Y. Lee, “Efficient Algorithms for Influence Maximization in Social Networks,” Knowledge and Information Systems, vol. 33, no. 3, pp. 577-601, Dec. 2012.
[14] C.T. Chu, S.K. Kim, Y.A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K. Olukotun, “Map-Reduce for Machine Learning on Multicore,” Proc. 20th Ann. Conf. Neural Information Processing Systems (NIPS ’06), pp. 281-288, 2006.
[15] G. Cormode and D. Srivastava, “Anonymized Data: Generation, Models, Usage,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 1015-1018, 2009.
[16] S. Das, Y. Sismanis, K.S. Beyer, R. Gemulla, P.J. Haas, and J. McPherson, “Ricardo: Integrating R and Hadoop,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’10), pp. 987-998, 2010.
[17] P. Dewdney, P. Hall, R. Schilizzi, and J. Lazio, “The Square Kilometre Array,” Proc. IEEE, vol. 97, no. 8, pp. 1482-1496, Aug. 2009.
[18] P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” Proc. Sixth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’00), pp. 71-80, 2000.
[19] G. Duncan, “Privacy by Design,” Science, vol. 317, pp. 1178-1179, 2007.
[20] B. Efron, “Missing Data, Imputation, and the Bootstrap,” J. Am. Statistical Assoc., vol. 89, no. 426, pp. 463-475, 1994.
[21] A. Ghoting and E. Pednault, “Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics,” Proc. Large-Scale Machine Learning: Parallelism and Massive Data Sets Workshop (NIPS ’09), 2009.
[22] D. Gillick, A. Faria, and J. DeNero, MapReduce: Distributed Computing for Machine Learning, Berkeley, Dec. 2006.
[23] M. Helft, “Google Uses Searches to Track Flu’s Spread,” The New York Times, http://www.nytimes.com/2008/11/12/technology/internet/12flu.html, 2008.
[24] D. Howe et al., “Big Data: The Future of Biocuration,” Nature, vol. 455, pp. 47-50, Sept. 2008.
[25] B. Huberman, “Sociology of Science: Big Data Deserve a Bigger Audience,” Nature, vol. 482, p. 308, 2012.
[26] IBM, “What Is Big Data: Bring Big Data to the Enterprise,” http://www-01.ibm.com/software/data/bigdata/, 2012.
[27] A. Jacobs, “The Pathologies of Big Data,” Comm. ACM, vol. 52, no. 8, pp. 36-44, 2009.
[28] I. Kopanas, N. Avouris, and S. Daskalaki, “The Role of Domain Knowledge in a Large Scale Data Mining Project,” Proc. Second Hellenic Conf. AI: Methods and Applications of Artificial Intelligence, I.P. Vlahavas and C.D. Spyropoulos, eds., pp. 288-299, 2002.
[29] A. Labrinidis and H. Jagadish, “Challenges and Opportunities with Big Data,” Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032-2033, 2012.
[30] Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining,” J. Cryptology, vol. 15, no. 3, pp. 177-206, 2002.
[31] W. Liu and T. Wang, “Online Active Multi-Field Learning for Efficient Email Spam Filtering,” Knowledge and Information Systems, vol. 33, no. 1, pp. 117-136, Oct. 2012.
[32] J. Lorch, B. Parno, J. Mickens, M. Raykova, and J. Schiffman, “Shroud: Ensuring Private Access to Large-Scale Data in the Data Center,” Proc. 11th USENIX Conf. File and Storage Technologies (FAST ’13), 2013.
[33] D. Luo, C. Ding, and H. Huang, “Parallelization with Multiplicative Algorithms for Big Data Mining,” Proc. IEEE 12th Int’l Conf. Data Mining, pp. 489-498, 2012.
[34] J. Mervis, “U.S. Science Policy: Agencies Rally to Tackle Big Data,” Science, vol. 336, no. 6077, p. 22, 2012.
[35] F. Michel, “How Many Photos Are Uploaded to Flickr Every Day and Month?” http://www.flickr.com/photos/franckmichel/6855169886/, 2012.
[36] T. Mitchell, “Mining Our Reality,” Science, vol. 326, pp. 1644-1645, 2009.
[37] Nature Editorial, “Community Cleverness Required,” Nature, vol. 455, no. 7209, p. 1, Sept. 2008.
[38] S. Papadimitriou and J. Sun, “Disco: Distributed Co-Clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining,” Proc. IEEE Eighth Int’l Conf. Data Mining (ICDM ’08), pp. 512-521, 2008.
[39] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, “Evaluating MapReduce for Multi-Core and Multiprocessor Systems,” Proc. IEEE 13th Int’l Symp. High Performance Computer Architecture (HPCA ’07), pp. 13-24, 2007.
[40] A. Rajaraman and J. Ullman, Mining of Massive Data Sets. Cambridge Univ. Press, 2011.
[41] C. Reed, D. Thompson, W. Majid, and K. Wagstaff, “Real Time Machine Learning to Find Fast Transient Radio Anomalies: A Semi-Supervised Approach Combining Detection and RFI Excision,” Proc. Int’l Astronomical Union Symp. Time Domain Astronomy, Sept. 2011.
[42] E. Schadt, “The Changing Privacy Landscape in the Era of Big Data,” Molecular Systems Biology, vol. 8, article 612, 2012.
[43] J. Shafer, R. Agrawal, and M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining,” Proc. 22nd VLDB Conf., 1996.
[44] A. da Silva, R. Chiky, and G. Hébrail, “A Clustering Approach for Sampling Data Streams in Sensor Networks,” Knowledge and Information Systems, vol. 32, no. 1, pp. 1-23, July 2012.
[45] K. Su, H. Huang, X. Wu, and S. Zhang, “A Logical Framework for Identifying Quality Knowledge from Different Data Sources,” Decision Support Systems, vol. 42, no. 3, pp. 1673-1683, 2006.
[46] Twitter Blog, “Dispatch from the Denver Debate,” http://blog.twitter.com/2012/10/dispatch-from-denver-debate.html, Oct. 2012.
[47] D. Wegener, M. Mock, D. Adranale, and S. Wrobel, “Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters,” Proc. Int’l Conf. Data Mining Workshops (ICDMW ’09), pp. 296-301, 2009.
[48] C. Wang, S.S.M. Chow, Q. Wang, K. Ren, and W. Lou, “Privacy-Preserving Public Auditing for Secure Cloud Storage,” IEEE Trans. Computers, vol. 62, no. 2, pp. 362-375, Feb. 2013.
[49] X. Wu and X. Zhu, “Mining with Noise Knowledge: Error-Aware Data Mining,” IEEE Trans. Systems, Man and Cybernetics, Part A, vol. 38, no. 4, pp. 917-932, July 2008.
[50] X. Wu and S. Zhang, “Synthesizing High-Frequency Rules from Different Data Sources,” IEEE Trans. Knowledge and Data Eng., vol. 15, no. 2, pp. 353-367, Mar./Apr. 2003.
[51] X. Wu, C. Zhang, and S. Zhang, “Database Classification for Multi-Database Mining,” Information Systems, vol. 30, no. 1, pp. 71-88, 2005.
[52] X. Wu, “Building Intelligent Learning Database Systems,” AI Magazine, vol. 21, no. 3, pp. 61-67, 2000.
[53] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, “Online Feature Selection with Streaming Features,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1178-1192, May 2013.
[54] A. Yao, “How to Generate and Exchange Secrets,” Proc. 27th Ann. Symp. Foundations of Computer Science (FOCS), pp. 162-167, 1986.
[55] M. Ye, X. Wu, X. Hu, and D. Hu, “Anonymizing Classification Data Using Rough Set Theory,” Knowledge-Based Systems, vol. 43, pp. 82-94, 2013.
[56] J. Zhao, J. Wu, X. Feng, H. Xiong, and K. Xu, “Information Propagation in Online Social Networks: A Tie-Strength Perspective,” Knowledge and Information Systems, vol. 32, no. 3, pp. 589-608, Sept. 2012.
[57] X. Zhu, P. Zhang, X. Lin, and Y. Shi, “Active Learning from Stream Data Using Optimal Weight Classifier Ensemble,” IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 40, no. 6, pp. 1607-1621, Dec. 2010.

Xindong Wu received the bachelor’s and master’s degrees in computer science from the Hefei University of Technology, China, and the PhD degree in artificial intelligence from the University of Edinburgh, Britain. He is a Yangtze River scholar in the School of Computer Science and Information Engineering at the Hefei University of Technology, China, and a professor of computer science at the University of Vermont. His research interests include data mining, knowledge-based systems, and web information exploration. He is the steering committee chair of the IEEE International Conference on Data Mining (ICDM), the editor-in-chief of Knowledge and Information Systems (KAIS, by Springer), and a series editor of the Springer Book Series on Advanced Information and Knowledge Processing (AI&KP). He was the editor-in-chief of the IEEE Transactions on Knowledge and Data Engineering (TKDE, by the IEEE Computer Society) between 2005 and 2008. He served as a program committee chair/cochair for ICDM ’03 (the 2003 IEEE International Conference on Data Mining), KDD-07 (the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining), and CIKM 2010 (the 19th ACM Conference on Information and Knowledge Management). He is a fellow of the IEEE and AAAS.

Xingquan Zhu received the PhD degree in computer science from Fudan University, Shanghai, China. He is an associate professor in the Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University. Prior to that, he was with the Centre for Quantum Computation & Intelligent Systems, University of Technology, Sydney, Australia. His research mainly focuses on data mining, machine learning, and multimedia systems. Since 2000, he has published more than 160 refereed journal and conference papers in these areas, including two Best Paper Awards (ICTAI-2005 and PAKDD-2013) and one Best Student Paper Award (ICPR-2012). Dr. Zhu was an associate editor of the IEEE Transactions on Knowledge and Data Engineering (2008-2012), and a program committee cochair for the 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2011) and the Ninth International Conference on Machine Learning and Applications (ICMLA 2010). He also served as a conference cochair for ICMLA 2012. He is a senior member of the IEEE.
Gong-Qing Wu received the bachelor’s degree from Anhui Normal University, China, the master’s degree from the University of Science and Technology of China (USTC), and the PhD degree from the Hefei University of Technology, China, all in computer science. He is an associate professor of computer science at the Hefei University of Technology, China. His research interests include data mining and web intelligence. He has published more than 20 refereed research papers. He received the Best Paper Award at the 2011 IEEE International Conference on Tools with Artificial Intelligence (ICTAI) and the Best Paper Award at the 2012 IEEE/WIC/ACM International Conference on Web Intelligence (WI). His research is currently sponsored by the National 863 Program of China and the National Natural Science Foundation of China (NSFC).

Wei Ding received the PhD degree in computer science from the University of Houston in 2008. She has been an assistant professor of computer science at the University of Massachusetts-Boston since 2008. Her primary research interests include data mining, machine learning, artificial intelligence, and computational semantics, with applications to astronomy, geosciences, and environmental sciences. She has published more than 70 refereed research papers and one book, and holds one patent. She is an associate editor of Knowledge and Information Systems (KAIS) and an editorial board member of the Journal of Information Systems Education (JISE). She received a Best Paper Award at the IEEE International Conference on Tools with Artificial Intelligence (ICTAI) 2011, a Best Paper Award at the IEEE International Conference on Cognitive Informatics (ICCI) 2010, a Best Poster Presentation Award at the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL GIS) 2008, and a Best PhD Work Award between 2007 and 2010 from the University of Houston. Her research projects are currently sponsored by NASA and the US Department of Energy (DOE). She is a senior member of the IEEE.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.