
Mining Big Data: Current Status, and Forecast to the Future

Wei Fan
Huawei Noah's Ark Lab
Hong Kong Science Park
Shatin, Hong Kong
david.fanwei@huawei.com

Albert Bifet
Yahoo! Research Barcelona
Av. Diagonal 177
Barcelona, Catalonia, Spain
abifet@yahoo-inc.com

ABSTRACT

Big Data is a new term used to identify datasets that, due to their large size and complexity, we cannot manage with our current methodologies or data mining software tools. Big Data mining is the capability of extracting useful information from these large datasets or streams of data, which was not possible before because of their volume, variability, and velocity. The Big Data challenge is becoming one of the most exciting opportunities for the coming years. In this issue we present a broad overview of the topic, its current status, controversy, and forecast to the future. We introduce four articles, written by influential scientists in the field, covering the most interesting and state-of-the-art topics on Big Data mining.

1. INTRODUCTION

Recent years have witnessed a dramatic increase in our ability to collect data from various sensors and devices, in different formats, from independent or connected applications. This data flood has outpaced our capability to process, analyze, store and understand these datasets. Consider the Internet data. The web pages indexed by Google were around one million in 1998, but quickly reached 1 billion in 2000 and had already exceeded 1 trillion in 2008. This rapid expansion is accelerated by the dramatic increase in acceptance of social networking applications, such as Facebook, Twitter, Weibo, etc., that allow users to create content freely and amplify the already huge Web volume. Furthermore, with mobile phones becoming the sensory gateway to get real-time data on people from different aspects, the vast amount of data that mobile carriers can potentially process to improve our daily life has significantly outpaced our past CDR (call data record)-based processing for billing purposes only. It can be foreseen that Internet of things (IoT) applications will raise the scale of data to an unprecedented level. People and devices (from home coffee machines to cars, to buses, railway stations and airports) are all loosely connected. Trillions of such connected components will generate a huge data ocean, and valuable information must be discovered from the data to help improve quality of life and make our world a better place. For example, after we get up every morning, in order to optimize our commute time to work and complete the optimization before we arrive at the office, the system needs to process information from traffic, weather, construction, and police activities to our calendar schedules, and perform deep optimization under tight time constraints. In all these applications, we are facing significant challenges in leveraging the vast amount of data, including challenges in (1) system capabilities, (2) algorithmic design, and (3) business models.

As an example of the interest that Big Data is attracting in the data mining community, the grand theme of this year's KDD conference was 'Mining the Big Data'. There was also a specific workshop on that topic, BigMine'12: 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications¹. Both events successfully brought together people from both academia and industry to present their most recent work related to these Big Data issues, and to exchange ideas and thoughts. These events are important in order to advance this Big Data challenge, which is being considered as one of the most exciting opportunities in the years to come.

We introduce Big Data mining and its applications in Section 2. We summarize the papers presented in this issue in Section 3, and discuss the Big Data controversy in Section 4. We point out the importance of open-source software tools in Section 5 and give some challenges and a forecast to the future in Section 6. Finally, we give some conclusions in Section 7.

2. BIG DATA MINING

The term 'Big Data' appeared for the first time in 1998 in a Silicon Graphics (SGI) slide deck by John Mashey with the title "Big Data and the Next Wave of InfraStress" [9]. Big Data mining was very relevant from the beginning, as the first book mentioning 'Big Data' is a data mining book that also appeared in 1998, by Weiss and Indurkhya [34]. However, the first academic paper with the words 'Big Data' in the title appeared a bit later, in 2000, in a paper by Diebold [8]. The origin of the term 'Big Data' is due to the fact that we are creating a huge amount of data every day. Usama Fayyad [11], in his invited talk at the KDD BigMine'12 Workshop, presented amazing numbers about internet usage, among them the following: Google has more than 1 billion queries per day, Twitter has more than 250 million tweets per day, Facebook has more than 800 million updates per day, and YouTube has more than 4 billion views per day. The data produced nowadays is estimated to be in the order of zettabytes, and it is growing around 40% every year. A
¹ http://big-data-mining.org/

SIGKDD Explorations Volume 14, Issue 2 Page 1

new large source of data is going to be generated from mobile devices, and big companies such as Google, Apple, Facebook, Yahoo, and Twitter are starting to look carefully at this data to find useful patterns to improve user experience. Alex 'Sandy' Pentland, in his 'Human Dynamics Laboratory' at MIT, is doing research on finding patterns in mobile data about what users do, and not on what people say they do [28].

We need new algorithms and new tools to deal with all of this data. Doug Laney [19] was the first to talk about the 3 V's in Big Data management:

- Volume: there is more data than ever before, and its size continues increasing, but not the percentage of data that our tools can process
- Variety: there are many different types of data, such as text, sensor data, audio, video, graph, and more
- Velocity: data is arriving continuously as streams of data, and we are interested in obtaining useful information from it in real time

Nowadays, there are two more V's:

- Variability: there are changes in the structure of the data and in how users want to interpret that data
- Value: business value that gives an organization a compelling advantage, due to the ability of making decisions based on answering questions that were previously considered beyond reach

Gartner [15] summarizes this in their 2012 definition of Big Data as high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

There are many applications of Big Data, for example the following [17, 2]:

- Business: customer personalization, churn detection
- Technology: reducing process time from hours to seconds
- Health: mining the DNA of each person, to discover, monitor and improve health aspects of everyone
- Smart cities: cities focused on sustainable economic development and high quality of life, with wise management of natural resources

These applications will allow people to have better services and better customer experiences, and also to be healthier, as personal data will make it possible to prevent and detect illness much earlier than before [17].

2.1 Global Pulse: "Big Data for development"

To show the usefulness of Big Data mining, we would like to mention the work that Global Pulse is doing [33] using Big Data to improve life in developing countries. Global Pulse is a United Nations initiative, launched in 2009, that functions as an innovation lab and is based on mining Big Data for developing countries. They pursue a strategy that consists of 1) researching innovative methods and techniques for analyzing real-time digital data to detect emerging vulnerabilities early; 2) assembling a free and open source technology toolkit for analyzing real-time data and sharing hypotheses; and 3) establishing an integrated, global network of Pulse Labs, to pilot the approach at country level.

Global Pulse describes the main opportunities Big Data offers to developing countries in their White paper "Big Data for Development: Challenges & Opportunities" [22]:

- Early warning: develop fast response in time of crisis, detecting anomalies in the usage of digital media
- Real-time awareness: design programs and policies with a more fine-grained representation of reality
- Real-time feedback: check which policies and programs fail, monitoring them in real time, and use this feedback to make the needed changes

The Big Data mining revolution is not restricted to the industrialized world, as mobiles are spreading in developing countries as well. It is estimated that there are over five billion mobile phones, and that 80% are located in developing countries.

3. CONTRIBUTED ARTICLES

We selected four contributions that together show very significant state-of-the-art research in Big Data Mining, and that provide a broad overview of the field and its forecast to the future. Other significant work in Big Data Mining can be found in the main conferences such as KDD, ICDM, and ECML-PKDD, or in journals such as "Data Mining and Knowledge Discovery" or "Machine Learning".

- Scaling Big Data Mining Infrastructure: The Twitter Experience by Jimmy Lin and Dmitriy Ryaboy (Twitter, Inc.). This paper presents insights about Big Data mining infrastructures, and the experience of doing analytics at Twitter. It shows that, due to the current state of the data mining tools, it is not straightforward to perform analytics. Most of the time is consumed in preparatory work for the application of data mining methods, and in turning preliminary models into robust solutions.

- Mining Heterogeneous Information Networks: A Structural Analysis Approach by Yizhou Sun (Northeastern University) and Jiawei Han (University of Illinois at Urbana-Champaign). This paper shows that mining heterogeneous information networks is a new and promising research frontier in Big Data mining research. It considers interconnected, multi-typed data, including the typical relational database data, as heterogeneous information networks. These semi-structured heterogeneous information network models leverage the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data.

- Big Graph Mining: Algorithms and discoveries by U Kang and Christos Faloutsos (Carnegie Mellon University). This paper presents an overview of mining big graphs, focusing on the use of the Pegasus tool and showing some findings in the Web Graph and the Twitter social network. The paper gives inspirational future research directions for big graph mining.

- Mining Large Streams of User Data for Personalized Recommendations by Xavier Amatriain (Netflix).

This paper presents some lessons learned with the Netflix Prize, and discusses the recommender and personalization techniques used at Netflix. It discusses recent important problems and future research directions. Section 4 contains an interesting discussion about whether we need more data or better models to improve our learning tasks.

4. CONTROVERSY ABOUT BIG DATA

As Big Data is a new hot topic, there has been a lot of controversy about it; for example, see [7]. We try to summarize it as follows:

- There is no need to distinguish Big Data analytics from data analytics, as data will continue growing, and it will never be small again.
- Big Data may be hype to sell Hadoop-based computing systems. Hadoop is not always the best tool [23]. It seems that data management system sellers try to sell systems based on Hadoop, and MapReduce may not always be the best programming platform, for example for medium-size companies.
- In real time analytics, data may be changing. In that case, what is important is not the size of the data but its recency.
- Claims to accuracy are misleading. As Taleb explains in his new book [32], when the number of variables grows, the number of fake correlations also grows. For example, Leinweber [21] showed that the S&P 500 stock index was correlated with butter production in Bangladesh, among other funny correlations.
- Bigger data are not always better data. It depends on whether the data is noisy or not, and on whether it is representative of what we are looking for. For example, Twitter users are sometimes assumed to be representative of the global population, when this is not always the case.
- Ethical concerns about accessibility. The main issue is whether it is ethical that people can be analyzed without knowing it.
- Limited access to Big Data creates new digital divides. There may be a digital divide between people or organizations that are able to analyze Big Data and those that are not. Also, organizations with access to Big Data will be able to extract knowledge that is not possible to get without it. We may create a division between Big Data rich and poor organizations.

5. TOOLS: OPEN SOURCE REVOLUTION

The Big Data phenomenon is intrinsically related to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter, and LinkedIn benefit from and contribute to open source projects. Big Data infrastructure deals with Hadoop and other related software such as:

- Apache Hadoop [3]: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called Hadoop Distributed Filesystem (HDFS). Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel. This mapping step is then followed by a step of reduce tasks. These reduce tasks use the output of the maps to obtain the final result of the job.
- Apache Hadoop related projects [35]: Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper, Apache Cassandra, Cascading, Scribe and many others.
- Apache S4 [26]: platform for processing continuous data streams. S4 is designed specifically for managing data streams. S4 apps are designed by combining streams and processing elements in real time.
- Storm [31]: software for streaming data-intensive distributed applications, similar to S4, and developed by Nathan Marz at Twitter.

In Big Data Mining, there are many open source initiatives. The most popular are the following:

- Apache Mahout [4]: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.
- R [29]: open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.
- MOA [5]: stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software. The streams framework [6] provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm. SAMOA [1] is a new upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.
- Vowpal Wabbit [20]: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets. It can exceed the throughput of any single machine network interface when doing linear learning, via parallel learning.

More specific to Big Graph mining, we found the following open source tools:

- Pegasus [18]: big graph mining system built on top of MapReduce. It allows finding patterns and anomalies in massive real-world graphs. See the paper by U. Kang and Christos Faloutsos in this issue.

- GraphLab [24]: high-level graph-parallel system built without using MapReduce. GraphLab computes over dependent records which are stored as vertices in a large distributed data-graph. Algorithms in GraphLab are expressed as vertex-programs which are executed in parallel on each vertex and can interact with neighboring vertices.

6. FORECAST TO THE FUTURE

There are many important future challenges in Big Data management and analytics that arise from the nature of data: large, diverse, and evolving [27, 16]. These are some of the challenges that researchers and practitioners will have to deal with during the next years:

- Analytics Architecture. It is not yet clear how an optimal architecture of an analytics system should deal with historic data and with real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz [25]. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general, extensible, allows ad hoc queries, minimal maintenance, and debuggable.
- Statistical significance. It is important to achieve significant statistical results, and not be fooled by randomness. As Efron explains in his book about Large Scale Inference [10], it is easy to go wrong with huge data sets and thousands of questions to answer at once.
- Distributed mining. Many data mining techniques are not trivial to parallelize. To have distributed versions of some methods, a lot of research is needed, with practical and theoretical analysis, to provide new methods.
- Time evolving data. Data may be evolving over time, so it is important that Big Data mining techniques are able to adapt and, in some cases, to detect change first. For example, the data stream mining field has very powerful techniques for this task [13].
- Compression. When dealing with Big Data, the quantity of space needed to store it is very relevant. There are two main approaches: compression, where we don't lose anything, or sampling, where we choose the data that is most representative. Using compression, we may take more time and less space, so we can consider it as a transformation from time to space. Using sampling, we are losing information, but the gains in space may be in orders of magnitude. For example, Feldman et al. [12] use coresets to reduce the complexity of Big Data problems. Coresets are small sets that provably approximate the original data for a given problem. Using merge-reduce, the small sets can then be used for solving hard machine learning problems in parallel.
- Visualization. A main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks to tell and show stories will be needed, as for example the photographs, infographics and essays in the beautiful book "The Human Face of Big Data" [30].
- Hidden Big Data. Large quantities of useful data are getting lost since new data is largely untagged, file-based and unstructured. The 2012 IDC study on Big Data [14] explains that in 2012, 23% (643 exabytes) of the digital universe would be useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.

7. CONCLUSIONS

Big Data is going to continue growing during the next years, and each data scientist will have to manage a much larger amount of data every year. This data is going to be more diverse, larger, and faster. We discussed in this paper some insights about the topic, and what we consider to be the main concerns and the main challenges for the future. Big Data is becoming the new Final Frontier for scientific data research and for business applications. We are at the beginning of a new era where Big Data mining will help us to discover knowledge that no one has discovered before. Everybody is warmly invited to participate in this intrepid journey.

8. ACKNOWLEDGEMENTS

We would like to thank Jimmy Lin, Dmitriy Ryaboy, U Kang, Christos Faloutsos, Jiawei Han, Yizhou Sun, and Xavier Amatriain for contributing to this special section.

9. REFERENCES

[1] SAMOA, http://samoa-project.net, 2013.
[2] C. C. Aggarwal, editor. Managing and Mining Sensor Data. Advances in Database Systems. Springer, 2013.
[3] Apache Hadoop, http://hadoop.apache.org.
[4] Apache Mahout, http://mahout.apache.org.
[5] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis, http://moa.cms.waikato.ac.nz/. Journal of Machine Learning Research (JMLR), 2010.
[6] C. Bockermann and H. Blom. The streams Framework. Technical Report 5, TU Dortmund University, 12 2012.
[7] d. boyd and K. Crawford. Critical Questions for Big Data. Information, Communication and Society, 15(5):662-679, 2012.
[8] F. Diebold. "Big Data" Dynamic Factor Models for Macroeconomic Measurement and Forecasting. Discussion Read to the Eighth World Congress of the Econometric Society, 2000.
[9] F. Diebold. On the Origin(s) and Development of the Term "Big Data". Pier working paper archive, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, 2012.

[10] B. Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Institute of Mathematical Statistics Monographs. Cambridge University Press, 2010.
[11] U. Fayyad. Big Data Analytics: Applications and Opportunities in On-line Predictive Modeling. http://big-data-mining.org/keynotes/#fayyad, 2012.
[12] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In SODA, 2013.
[13] J. Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC Data Mining and Knowledge Discovery. Taylor & Francis Group, 2010.
[14] J. Gantz and D. Reinsel. IDC: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. December 2012.
[15] Gartner, http://www.gartner.com/it-glossary/big-data.
[16] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza. Big data, big business: bridging the gap. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine '12, pages 7-11, New York, NY, USA, 2012. ACM.
[17] Intel. Big Thinkers on Big Data, http://www.intel.com/content/www/us/en/big-data/big-thinkers-on-big-data.html, 2012.
[18] U. Kang, D. H. Chau, and C. Faloutsos. PEGASUS: Mining Billion-Scale Graphs in the Cloud. 2012.
[19] D. Laney. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, February 6, 2001.
[20] J. Langford. Vowpal Wabbit, http://hunch.net/~vw/, 2011.
[21] D. J. Leinweber. Stupid Data Miner Tricks: Overfitting the S&P 500. The Journal of Investing, 16:15-22, 2007.
[22] E. Letouzé. Big Data for Development: Opportunities & Challenges. May 2011.
[23] J. Lin. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail! CoRR, abs/1209.2191, 2012.
[24] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.
[25] N. Marz and J. Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications, 2013.
[26] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform. In ICDM Workshops, pages 170-177, 2010.
[27] C. Parker. Unexpected challenges in large scale machine learning. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine '12, pages 1-6, New York, NY, USA, 2012. ACM.
[28] A. Pentland. Reinventing society in the wake of big data. Edge.org, http://www.edge.org/conversation/reinventing-society-in-the-wake-of-big-data, 2012.
[29] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.
[30] R. Smolan and J. Erwitt. The Human Face of Big Data. Sterling Publishing Company Incorporated, 2012.
[31] Storm, http://storm-project.net.
[32] N. Taleb. Antifragile: How to Live in a World We Don't Understand. Penguin Books, Limited, 2012.
[33] UN Global Pulse, http://www.unglobalpulse.org.
[34] S. M. Weiss and N. Indurkhya. Predictive data mining: a practical guide. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998.
[35] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis. IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Companies, Incorporated, 2011.
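
ILLUSTRATIVE SKETCH: A MAPREDUCE JOB

Section 5 describes a MapReduce job as one that divides the input dataset into independent subsets, processes them with map tasks in parallel, and then uses reduce tasks to combine the map outputs into the final result. The following is a minimal single-process sketch of that flow in Python, using word counting as the task. It is not Hadoop code: the function names (split_input, map_task, shuffle, reduce_task, word_count), the word-count task, and the intermediate shuffle step are assumptions chosen for illustration, and the map tasks run sequentially here rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain


def split_input(records, n_splits):
    """Divide the input dataset into independent subsets, one per map task."""
    return [records[i::n_splits] for i in range(n_splits)]


def map_task(subset):
    """Map step: emit a (word, 1) pair for every word in the subset."""
    return [(word.lower(), 1) for line in subset for word in line.split()]


def shuffle(mapped_outputs):
    """Group the intermediate values by key before reducing (assumed step)."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)


def word_count(records, n_splits=3):
    subsets = split_input(records, n_splits)
    # On a real cluster each subset would go to a different map task in parallel.
    mapped = [map_task(subset) for subset in subsets]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big graph", "data stream data"]))
```

Because each subset is mapped without reference to the others, the map calls could be distributed across machines without changing the result; only the grouping and reduce steps need to see outputs from several map tasks.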
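
ILLUSTRATIVE SKETCH: FAKE CORRELATIONS AMONG MANY VARIABLES

Section 4 cites the claim that when the number of variables grows, the number of fake correlations also grows. A small simulation can make this concrete under stated assumptions: every variable below is independent Gaussian noise, so any pair whose sample correlation exceeds the threshold is spurious by construction. The sample size (20 observations), the threshold (0.5), and all function names are hypothetical choices for illustration and do not come from the cited works.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def make_noise_variables(n_variables, n_observations, seed=0):
    """Generate independent Gaussian noise variables (no real relationships)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_observations)]
            for _ in range(n_variables)]


def count_spurious_pairs(variables, threshold=0.5):
    """Count variable pairs whose absolute correlation exceeds the threshold."""
    count = 0
    for i in range(len(variables)):
        for j in range(i + 1, len(variables)):
            if abs(pearson(variables[i], variables[j])) > threshold:
                count += 1
    return count


if __name__ == "__main__":
    noise = make_noise_variables(n_variables=40, n_observations=20)
    for m in (5, 10, 20, 40):
        n_pairs = m * (m - 1) // 2
        print(m, "variables,", n_pairs, "pairs,",
              count_spurious_pairs(noise[:m]), "above threshold")
```

The number of pairs grows quadratically with the number of variables, so even a small per-pair chance of a large sample correlation produces many apparent relationships once enough variables are examined at once.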
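
ILLUSTRATIVE SKETCH: SAMPLING A DATA STREAM

Section 6 contrasts compression, which loses nothing, with sampling, which loses information but can reduce space by orders of magnitude. The paper does not prescribe a sampling method; the sketch below swaps in reservoir sampling (Algorithm R), a standard technique that keeps a uniform random sample of fixed size k from a stream of unknown length while storing only k items. It is not the coreset construction of [12], and the function name and parameters are assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of at most k items from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint includes both endpoints
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), k=10)
    print(len(sample), "items kept out of 1,000,000")
```

Memory use depends only on k, not on the length of the stream, which is the space gain that sampling offers; the cost is that anything not in the reservoir is gone.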