Data quality in big data processing: Issues, solutions and open problems
All content following this page was uploaded by Jerry Gao on 27 July 2020.
Abstract- With the rapid development of social networks and the Internet of Things, the age of Big Data has arrived. The growing volume of data has brought great value to the public and to enterprises, and how to manage and use Big Data better has become a focus in all walks of life. However, the 4V characteristics of Big Data have brought many problems to Big Data processing. The key to Big Data processing is to solve the problem of data quality: ensuring data quality is a prerequisite for Big Data to deliver its value. Recommendation systems and prediction systems are successful applications of Big Data technology. In this paper, we study recommendation and prediction systems in the Big Data environment and identify the data quality problems that arise in data collection, data preprocessing, data storage and data analysis. Through elaboration and analysis of these problems, corresponding solutions are put forward. At the end of the paper we raise some open questions.

Keywords- Big data, Big data processing, Data Quality, Recommendation system, Prediction system.

I. INTRODUCTION
With the development of mobile applications, the amount of data is exploding, and the concept of Big Data has drawn widespread attention from industry and the academic community. There are different definitions of Big Data. Wikipedia [2] describes Big Data as data whose scale is so large that it cannot be captured, managed, processed and organized into human-readable information manually within a reasonable time. Baidu Encyclopedia defines Big Data as data whose scale is so large that current mainstream software tools cannot capture, manage, process and organize it within a reasonable time into information that helps businesses make more proactive decisions.
Whatever the definition, all of them reflect the characteristics of Big Data, namely Volume, Velocity, Variety and Veracity. As the amount of data in each area increases rapidly, processing data efficiently has become a major challenge. How to understand Big Data correctly and how to use it to provide quality services for people's production and life are new challenges facing researchers. Big Data processing has several stages: collection, preprocessing, storage and analysis. Data quality is the most important aspect of Big Data processing, and Big Data can be supported by a large number of embedded data analysis and statistical analysis tools [1]. In order to provide reliable data analysis, it is necessary to use a variety of tools to manage Big Data algorithms. The key issue is to ensure that the data quality is high and the data source is reliable.
Big Data research is in its infancy, and experts and scholars in many countries are actively exploring every aspect of it. The recommendation system is a successful application of Big Data processing technology and has gradually become a research hotspot in the information field. User feedback, purchase records and even social data are collected, analyzed and mined to find correlations between customers and commodities.
With the deepening of the information revolution, prediction has become easier in the Big Data age, and human life is being greatly changed by Big Data. The most common application cases are "forecasting the stock market", "forecasting the flu" and "predicting consumer behavior". Predictive analysis is the core function of Big Data. The scale of Big Data is expanding all the time, which is a great challenge for data storage and analysis. This paper takes the recommendation system and the prediction system as examples to elaborate and analyze the data quality problems of data acquisition, data preprocessing, data storage and data analysis in Big Data research, and puts forward corresponding solutions.
The rest of this article is arranged as follows. Section II reviews the relevant work. Section III introduces the related concepts involved in this paper, including the characteristics of Big Data, the Big Data processing workflow, and Big Data application systems. Section IV analyzes and describes the problems encountered in Big Data processing from the perspective of data quality. Section V presents solutions to these problems. Section VI discusses the relevance of the problems to Big Data characteristics and data quality, and analyzes some current open issues. Finally, we summarize the full text.

II. RELATED SURVEY
As Big Data presents new features, its data quality also faces many challenges. How to deal with large amounts of real-time data has become a key issue in research and application. The data quality of Big Data is a hot topic in academic research. As shown in Table I, data quality standards mainly involve four dimensions: Availability, Usability, Reliability, and Relevance. Availability is defined as the degree to which users can access data and related information; it is divided into accessibility and timeliness. Usability refers to whether the data is useful and satisfies the user's needs. Reliability is whether we can trust the data, including accuracy,
consistency, and integrity. Relevance describes the degree of correlation between data content and user expectations or requirements [10].

... describes a new generation of technologies and architectures designed to economically extract value from a wide variety of data by achieving high-speed capture, discovery and analysis. Gartner updates it as "Big Data is high-volume, high-velocity and high-variety information assets that require new forms of processing to achieve enhanced decision making, insight discovery and process optimization." [3][5]. IBM defines Big Data with four features by adding Veracity to the three features.

TABLE I. BIG DATA QUALITY STANDARD

Dimensions      Elements
Availability    Accessibility, Timeliness
Usability       Credibility
Reliability     Accuracy, Consistency, Integrity, Completeness
Relevance       Fitness
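
The elements in Table I are qualitative, but some of them can be turned into simple scores. The following is only a minimal sketch of how two of them, completeness and timeliness, might be measured on a small batch of records. The record layout, the field names (`user`, `item`, `rating`, `updated_at`) and the one-day freshness threshold are assumptions made for illustration; they do not come from this paper or from [10].

```python
from datetime import datetime, timedelta


def completeness(records, required_fields):
    """Share of required fields that are present and non-empty across all records."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


def timeliness(records, now, max_age, timestamp_field="updated_at"):
    """Share of records whose timestamp is no older than max_age."""
    if not records:
        return 1.0
    fresh = sum(
        1
        for record in records
        if record.get(timestamp_field) is not None
        and now - record[timestamp_field] <= max_age
    )
    return fresh / len(records)


# Hypothetical sample batch: one missing rating, one empty item, one stale record.
now = datetime(2017, 3, 31, 12, 0)
records = [
    {"user": "u1", "item": "i1", "rating": 4, "updated_at": now - timedelta(hours=1)},
    {"user": "u2", "item": "i2", "rating": None, "updated_at": now - timedelta(days=3)},
    {"user": "u3", "item": "", "rating": 5, "updated_at": now - timedelta(hours=5)},
]

completeness_score = completeness(records, ["user", "item", "rating"])
timeliness_score = timeliness(records, now, timedelta(days=1))
```

Here 7 of the 9 required fields are filled and 2 of the 3 records are younger than one day, so the two scores are 7/9 and 2/3. Other elements of Table I, such as credibility or fitness, depend on the application and are not covered by this sketch.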
Stage | Problem | Cause | Data quality dimension | Big Data characteristic
Collection | Less data collected and low recall rates | Network connectivity and dynamics | Availability | Volume
Collection | Data sparseness | The interaction between the user and the item is less | Relevance | Variety
Preprocessing | Noise data | Abnormal data and cheating data at the time of acquisition | Usability | Variety
Preprocessing | Incomplete data | Data distribution is not balanced, the network transmission is unstable | Reliability | Veracity, Velocity
Storage | Limitations of storage technology | The total amount of data is large and the type is complex | Usability | Volume
Storage | Data timeliness | Response time is long | Availability | Variety
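
Two of the problems listed above, data sparseness and noise data, are easy to illustrate on a toy user-item rating log. The sketch below is an assumption-laden example rather than the method of this paper: it treats any rating outside a hypothetical valid range of 1 to 5 as noise, and it measures sparseness as the share of user-item pairs that have no recorded interaction. The function names and the sample data are invented for illustration.

```python
def split_noise(interactions, low=1, high=5):
    """Separate (user, item, rating) rows into in-range rows and out-of-range rows."""
    clean = [row for row in interactions if low <= row[2] <= high]
    noisy = [row for row in interactions if not (low <= row[2] <= high)]
    return clean, noisy


def sparsity(interactions):
    """Return 1 - (observed user-item pairs) / (all possible user-item pairs)."""
    pairs = {(user, item) for user, item, _ in interactions}
    users = {user for user, _ in pairs}
    items = {item for _, item in pairs}
    possible = len(users) * len(items)
    if possible == 0:
        return 0.0
    return 1.0 - len(pairs) / possible


# Hypothetical rating log; the rating 99 stands in for abnormal or cheating data.
interactions = [
    ("u1", "i1", 5),
    ("u1", "i2", 3),
    ("u2", "i1", 4),
    ("u3", "i3", 99),
    ("u3", "i4", 2),
]

clean, noisy = split_noise(interactions)
clean_sparsity = sparsity(clean)
```

After the out-of-range row is set aside, 4 of the 9 possible user-item pairs are observed, so the sparsity of the cleaned log is 5/9. A real recommendation system would need a domain-specific definition of noise; a fixed rating range is only the simplest possible rule.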
Although some results have been achieved in the study of data quality, such as the validity, reliability and availability of data, research on Big Data is still in its infancy. Big Data research needs to solve many problems, such as multi-source data of varying quality, how to obtain high-quality Big Data, how to integrate existing multi-source data, and how to detect and repair data errors. Some of the problems we have mentioned can be easily overcome, since these practical problems are also common in traditional data processing. However, other issues are closely related to the characteristics of Big Data and remain unresolved so far. Several of these open issues serve as the main content of this section, to explain the dilemmas that Big Data may face. Here are some open problems [12]:

Open problem 1. Inconsistent data
Big Data clouds will soon be applied to various areas, and the resulting data will inevitably be inconsistent. There are four kinds of inconsistency: temporal, textual, spatial, and functional-dependency inconsistencies. They must be addressed at the data level, information level, and knowledge level in order to better analyze the data. Although Big Data analysis marks a new era of data analysis, data inconsistencies remain, and traditional solutions to inconsistency do not necessarily apply to Big Data. Data is more prone to inconsistencies because it is captured or generated by different sensors and systems. The impact of data inconsistency is an open question of Big Data analysis.

Open problem 2. The bottleneck of data mining algorithm
Most data mining algorithms in Big Data analysis will be designed for parallel computing. However, once a data mining algorithm is designed or modified for parallel computing, the exchange of information between different data mining processes can cause bottlenecks. One of them is the synchronization problem: even if the same algorithm is used to handle the same amount of data, different processes will finish their work at different times. The bottleneck of data mining algorithms will be an open issue in Big Data analysis, which suggests that we need to take it into account when developing and designing new data mining algorithms for Big Data analysis.

Open problem 3. Security issues
Since environmental data and human behavior data will be gathered on a large scale for Big Data analysis, how to protect them is also an open problem, because without a safe way to handle the collected data, Big Data analysis cannot be a reliable system. Although we have to tighten the security of Big Data analysis before we can collect more data from all over the world, the reality is that so far little research has focused on the security issues of Big Data analysis. According to our observation, the security of Big Data analysis can be divided into four aspects: input, data analysis, output and communication with other systems. Input can be regarded as data gathering, which involves sensors, handheld devices and even networked devices. One of the important security issues at the input stage is to ensure that the sensors are not compromised by attacks. Analysis and output can be regarded as a security issue
for such a system. For communication with other systems, the security problem lies in the communication between Big Data analysis and other external systems. Because of these potential problems, security has become one of the open issues of Big Data analysis.

CONCLUSION AND FUTURE WORK
We have entered the era of Big Data, in which data is collected, processed and stored so that it can be better analyzed. With the substantial increase in the volume of data in various fields, processing data efficiently has become a major challenge. We need to address these issues and thereby improve the quality of the data. This paper takes the recommendation system and the forecasting system as examples to analyze and describe the possible problems in Big Data processing. We reviewed the concepts of Big Data, Big Data processing, forecasting systems and recommendation systems. The data quality problems in data collection, data preprocessing, data storage and data analysis in Big Data research are expounded and analyzed. For each stage, we identify possible problems and propose solutions, and finally analyze some open questions.
Big Data analysis is unavoidable. Existing Big Data technologies and tools still have limitations and cannot completely solve Big Data problems. In follow-up work, we need to study Big Data problems further and try to solve the data quality problem with better solutions.

REFERENCES
[1] Louridas, P., Ebert, C.: Embedded analytics and statistics for big data. IEEE Software 30, 33–39 (2013)
[2] Wikipedia contributors. "Big data." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 31 Mar. 2017. Web. 31 Mar. 2017.
[3] J. Gantz and D. Reinsel, "Extracting Value from Chaos," IDC's Digital Universe Study, sponsored by EMC, 2011.
[4] A. A. Tole, "Big Data Challenges," Database Systems Journal, vol. IV, no. 3, pp. 31-40, 2013.
[5] S. Kaisler, F. Armour and J. A. Espinosa, "Big Data: Issues and Challenges Moving Forward," 46th Hawaii International Conference on System Sciences, 2013.
[6] Wang Z, Yu X, Feng N, et al. An improved collaborative movie recommendation system using computational intelligence[J]. Journal of Visual Languages & Computing, 2014, 25(6):667-675.
[7] Li J, Xu Z, Jiang Y, et al. The overview of big data storage and management[C]// IEEE, International Conference on Cognitive Informatics & Cognitive Computing. IEEE, 2014:510-513.
[8] Leng Y L. Incomplete Big Data Distributed Clustering[J]. Applied Mechanics & Materials, 2014, 687-691:1496-1499.
[9] Yang T, Qian K, Dan C T L, et al. Improve the Prediction Accuracy of Naïve Bayes Classifier with Association Rule Mining[C]// IEEE, International Conference on Big Data Security on Cloud. IEEE, 2016:129-133.
[10] Cai L, Zhu Y. The Challenges of Data Quality and Data Quality Assessment in the Big Data Era[J]. Data Science Journal, 2015, 14(1):21-3.
[11] Maślankowski J. Data Quality Issues Concerning Statistical Data Gathering Supported by Big Data Technology[J]. Communications in Computer & Information Science, 2014, 424(1):92-101.
[12] Tsai C W, Lai C F, Chao H C, et al. Big data analytics: a survey[J]. Journal of Big Data, 2015, 2(1):21.
[13] Chen T, Honda K. Solving data preprocessing problems in existing location-aware systems[J]. Journal of Ambient Intelligence & Humanized Computing, 2015:1-7.
[14] Birtolo C, Ronca D. Advances in clustering collaborative filtering by means of fuzzy C-means and trust[J]. Expert Systems with Applications, 2013, 40(17):6997-7009.
[15] Duggal P S, Paul S. Big Data Analysis: Challenges and Solutions[C]// International Conference on Cloud, Big Data and Trust 2013, Nov 13-15, RGPV. 2013.
[16] Yang F, Xiao-Yan A I, Zhang Y H, et al. New mining architecture and prediction model for big data[J]. Electronic Design Engineering, 2016.
[17] Martin Hilbert, Priscila López, The world's technological capacity to store, communicate, and compute information, Science 332 (6025) (2011) 60–65.
[18] Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Burlington.
[19] Zeng W, Zhao Y, Ou K, et al. Research on cloud storage architecture and key technologies[C]// International Conference on Interaction Sciences: Information Technology, Culture and Human 2009, Seoul, Korea, 24-26 November. DBLP, 2009:1044-1048.
[20] Xie L, Zhou W, Li Y. Application of Improved Recommendation System Based on Spark Platform in Big Data Analysis[J]. Cybernetics & Information Technologies, 2017, 16.
[21] Ghazanfar M A, Prügel-Bennett A. Leveraging clustering approaches to solve the graysheep users problem in recommender systems[J]. Expert Systems with Applications, 2014, 41(7):3261-3275.
[22] J. Hathaway, C. Bezdek, "Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm," Pattern Recognition Letters, vol. 23, no. 1, pp. 151–160, 2002.
[23] D. Li, H. Gu, L. Zhang, "A hybrid genetic algorithm-fuzzy c-means approach for incomplete data clustering based on nearest-neighbor intervals," Soft Computing, vol. 17, no. 10, pp. 1787-1796, 2013.
[24] Li C, Lan M, Zou B, et al. Big Data and Recommendation System[J]. Big Data Research, 2015, 14(1):39-43.
[25] Madden S (2012) From databases to big data. IEEE Internet Comput 16(3):4–6
[26] Hamm S (2013) How big data can boost weather forecasting. http://readwrite.com/2013/02/28/how-big-data-can-boost-weather-forecasting#awesm=ou64ZEaKe2HtUu. Accessed 20 Nov 2014
[27] Han J, Kamber M, Pei J (2012) Data mining: concepts and techniques. Elsevier, Inc., San Francisco