
Data Quality in Big Data Processing: Issues, Solutions and Open Problems

Pengcheng Zhang1, Fang Xiong1, Jerry Gao2,3
1 College of Computer and Information, Hohai University, Nanjing, China
2 San Jose State University, San Jose, CA
3 Taiyuan University of Technology, China
Email: {pchzhang@hhu.edu.cn; jerry.gao@sjsu.edu}

Abstract—With the rapid development of social networks and the Internet of Things, the age of Big Data has arrived. The growing volume of data has brought great value to the public and to enterprises, and how to manage and use Big Data well has become a focus in almost every field. However, the 4V characteristics of Big Data also bring many problems to Big Data processing. The key to Big Data processing is to solve the problem of data quality; ensuring data quality is a prerequisite for Big Data to deliver its value. The recommendation system and the prediction system are successful applications of Big Data technology. In this paper, we study recommendation and prediction systems in the Big Data environment and examine the data quality issues of data collection, data preprocessing, data storage and data analysis in Big Data processing. Through the elaboration and analysis of each problem, a corresponding solution is put forward. At the end of the paper we raise some open questions.

Keywords—Big Data, Big Data processing, data quality, recommendation system, prediction system.

I. INTRODUCTION

With the development of mobile applications, the amount of information data is exploding, and the concept of Big Data has attracted broad attention from industry and academia. There are different definitions of Big Data. Wikipedia [2] interprets "Big Data" as data sets so large that they cannot be captured, managed, processed, and organized into human-readable information within a reasonable time by conventional means. Baidu Encyclopedia defines "Big Data" as data sets whose scale is so huge that they cannot be captured, managed, processed, and organized with current mainstream software tools within a reasonable time to support more proactive business decision-making.

Whatever the definition, it is permeated with the characteristics of Big Data, namely Volume, Velocity, Variety and Veracity. As the amount of data in every area increases rapidly, processing data efficiently has become a major challenge. How to correctly understand Big Data, and how to use it to provide high-quality services for people's production and life, are new challenges faced by researchers. Big Data processing has several stages, including collection, preprocessing, storage and analysis. Data quality is the most important aspect of Big Data processing, and Big Data can be supported by a large number of embedded data analysis and statistical analysis tools [1]. In order to provide reliable data analysis, it is necessary to use a variety of tools and algorithms to manage Big Data. The key issue is to ensure that the data quality is high and the data sources are reliable.

Research on Big Data is still in its infancy, and experts and scholars in many countries are actively exploring its every aspect. The recommendation system is a successful application of Big Data processing technology and has gradually become a research hotspot in the information field. User feedback, purchase records and even social data collected as Big Data are analyzed and mined to find correlations between customers and commodities. With the deepening of the information revolution, prediction in the Big Data age has become easier, and human life is being greatly changed by Big Data. The most common application cases are "forecasting the stock market", "forecasting the flu" and "predicting consumer behavior". Predictive analysis is a core function of Big Data. The scale of Big Data is expanding all the time, which is a great challenge for data storage and analysis. This paper takes the recommendation system and the prediction system as examples to elaborate and analyze the data quality of data acquisition, data preprocessing, data storage and data analysis in Big Data research, and puts forward corresponding solutions.

The rest of this article is arranged as follows. Section II reviews the related work. Section III introduces the related concepts involved in this paper, including the characteristics of Big Data, the Big Data processing process, and Big Data application systems. Section IV analyzes and describes the problems encountered in the process of Big Data processing from the perspective of data quality. Section V presents solutions to the problems encountered. Section VI discusses the relevance of these problems to Big Data characteristics and data quality, and analyzes some current open issues. Finally, we summarize the full text.

II. RELATED SURVEY

As Big Data presents new features, its data quality also faces many challenges. How to deal with large amounts of real-time data has become a key issue in research and application, and the data quality of Big Data is a hot topic in academic research. As shown in Table I, data quality standards mainly involve four dimensions: Availability, Usability, Reliability, and Relevance. Availability is defined as the degree to which users can access the data and related information; it is divided into accessibility and timeliness. Usability means whether the data is useful and satisfies the user's needs. Reliability is whether we can trust the data, including accuracy, consistency, and integrity. Relevance is used to describe the degree of correlation between data content and user expectations or requirements [10].

TABLE I. BIG DATA QUALITY STANDARD

Dimensions    | Elements
Availability  | Accessibility; Timeliness
Usability     | Credibility
Reliability   | Accuracy; Consistency; Integrity; Completeness
Relevance     | Fitness
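To make these dimensions more concrete, the sketch below computes simplified indicators for a few of the elements in Table I over a tabular data set. It is an illustrative assumption only: the column names (for example event_time), the seven-day freshness window, and the mapping of each indicator to a dimension are our own choices, not part of any standard.

```python
# Illustrative sketch only: crude indicators for a few Table I elements.
# The column names, the 7-day freshness window and the chosen formulas are
# assumptions for demonstration, not a standard metric set.
import pandas as pd

def quality_report(df: pd.DataFrame, time_col: str = "event_time",
                   freshness: str = "7D") -> dict:
    now = pd.Timestamp.now(tz="UTC")
    age = now - pd.to_datetime(df[time_col], utc=True)
    return {
        # Reliability / Completeness: share of non-null cells.
        "completeness": float(df.notna().mean().mean()),
        # Availability / Timeliness: share of records inside the freshness window.
        "timeliness": float((age <= pd.Timedelta(freshness)).mean()),
        # Reliability / Consistency (very rough): share of exact duplicate rows.
        "duplicate_ratio": float(df.duplicated().mean()),
    }

if __name__ == "__main__":
    demo = pd.DataFrame({
        "user_id": [1, 2, 2, None],
        "rating": [5, 3, 3, 4],
        "event_time": ["2017-05-01", "2017-05-02", "2017-05-02", "2017-04-01"],
    })
    print(quality_report(demo))
```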

Many research scholars have commented on the problems and challenges of big data. Jacek Maślankowski [11] focused on the data quality problems in the data collection process; his article analyzes problems in the data acquisition process by comparing data collection with traditional statistical survey techniques and with Big Data technology. The case study shows that there are many hurdles in data quality when using Big Data technology, and these obstacles are identified and described in that article. Toly Chen et al. [13] reviewed data differences caused by improper data preprocessing (including huge data, incomplete data normalization, subjective data linearization or non-linearization, biased weighting, and loss of information through discretization) and corrected these differences. Jie Li et al. illustrate the status of Big Data storage and management, analyze the problems, and finally present a solution [7]. Analyzing Big Data is a challenging task because it involves large distributed file systems that should be fault-tolerant, flexible, and extensible [15].

From the literature, some scholars have studied the challenges and problems encountered in Big Data and have put forward opinions on the problems existing in data collection, data preprocessing, data storage and data analysis, together with some solutions to these problems. However, very few works give a clear analysis and description of the problems at each stage. Based on previous research, this paper takes Big Data application systems as examples, including the recommendation system and the prediction system, to illustrate the data quality issues of the four stages of data acquisition, data preprocessing, data storage and data analysis in Big Data processing, followed by a corresponding solution for each problem.

III. RELATED CONCEPTS

A. Characteristics of Big Data

In 2001, Gartner's Doug Laney put forward the view that Big Data has three dimensions: Volume, Variety and Velocity. Subsequently, IDC defined it as follows: Big Data technology describes a new generation of technologies and architectures designed to economically extract value from a wide variety of data by achieving high-speed capture, discovery and analysis. Gartner updated the definition as "Big Data is large, high speed and high quality information assets that require new forms of processing to achieve enhanced decision making, insight discovery and process optimization" [3][5]. IBM defines Big Data with four features by adding Veracity to the original three.

Figure 1. 4Vs of Big Data

Volume: With the rapid development of information technology, data has begun to grow explosively, and storage units have moved from GB to TB, and on to PB and EB. Taobao's nearly 400 million members generate about 20 TB of commodity trading data every day; Facebook's roughly 1 billion users generate more than 300 TB of log data every day. There is an urgent need for intelligent algorithms, powerful data processing platforms and new data processing technologies to analyze, predict and process such large-scale data in real time [4].

Variety: Big Data comes from a variety of data sources. Data types and formats are increasingly rich, including structured, semi-structured, unstructured and other forms of data. Incompatible data formats, misaligned data structures, inconsistent semantic representations and inconsistent data may spread and create major challenges [5].

Velocity: Unlike traditional data carriers such as archives, broadcasts and newspapers, Big Data is exchanged and disseminated through the Internet, cloud computing and similar channels, much faster than through traditional media. An important difference between Big Data and ordinary massive data, apart from the larger volume, is that Big Data places more stringent requirements on the response speed of data processing.

Veracity: With the interest in new data sources such as social data, corporate content, transaction and application data, the limitations of traditional data sources are broken, the accuracy of Big Data tends to vary, and finding out which data is truly valid becomes very important. Enterprises increasingly need effective means of control to ensure authenticity and security.

B. Big Data Processing

Through big data analysis, a lot of valuable information can be extracted from massive data, and excellent technical support is essential for this. With the development of the computer industry, different big data analysis technologies have been continuously developed, and data integration, conversion and other technologies rely heavily on tools. As shown in Figure 2, Big Data processing is divided into the following four processes:

Figure 2. The processing of Big Data

Data collection: Big Data collection is the foundation of the whole process. With the development of Internet technology and applications and the popularity of various terminal devices, the range of data producers keeps growing, the output of data keeps increasing, and the correlations between data become more and more complex; this is the "big" in Big Data, so the requirements on data acquisition speed and accuracy must be raised. Big Data acquisition mainly includes system log collection, network data acquisition and other data acquisition methods.

Data preprocessing: Big Data preprocessing is mainly used to identify, extract and clean the data that has already been received. The varied data gathered in the acquisition step is not conducive to subsequent data analysis, and some of it is invalid data that needs to be removed, otherwise it will affect the accuracy and reliability of the analysis. Therefore, the data format needs to be unified and invalid data removed.

Data storage: Big Data storage and processing is not only large in scale but also demanding in its transmission and processing response speed. Because the data come from different sources and are highly diverse, traditional database storage technology cannot adapt to Big Data storage. A Big Data storage or processing system must have good compatibility, so that it can load various kinds of data and run on various hardware and software platforms.

Data analysis: After the data has been collected and processed, the system needs to analyze it. It is a well-known fact that Big Data is not just about having big data; the most important part is the analysis of Big Data, since only through analysis can we obtain intelligent, in-depth, valuable information. As more and more applications involve Big Data, and the attributes of these Big Data grow in complexity, Big Data analysis methods become particularly important in this field.

C. Big Data Application Systems

Recommendation system: The recommendation system was proposed by Robert Armstrong at AAAI in 1995, while Marko Balabanovic et al. introduced the personalized recommendation system LIRA. Based on the user's interest characteristics and purchase behavior, a recommendation system recommends information and goods that the user is interested in. So far, recommendation systems have been widely used in many fields. With the booming development of e-commerce, the recommendation system plays an irreplaceable role on the Internet. It is reported that 35% of Amazon's turnover comes from its own recommendation system, and the recommendation systems of Taobao, Jingdong and other e-commerce platforms are relatively successful. With the rise of social networking, the friend recommendation systems of Twitter and Sina Weibo and the news recommendation systems of Google and Netease are also favored by users. On the other hand, location and other information can be obtained very accurately while users are on the mobile Internet; users can search for restaurants, hotels, cinemas, tourist attractions and other information services according to their current location on location-based service sites such as Meetup. The recommendation system in the big data environment is an extension of the traditional recommendation system. Because the big data environment is more complex than the traditional one in both the environment it provides and the characteristics of its data, only by extracting and predicting the users' preferences contained in the big data environment can valid and more accurate recommendations be produced [24].

Prediction system: Big Data prediction is based on the analysis of existing data, assessing upcoming trends and conditions from the development and dissemination of data. Prediction is a core application of Big Data, and traditional predictive tools cannot handle the size, speed, and complexity inherent in Big Data [25]. The weather forecast is one of the major beneficiaries of Big Data, which will be of great benefit to weather forecasting [26][27]. The Internet has made prediction convenient and popularized Big Data. During the World Cup, Baidu predicted all 64 games, with an accuracy rate of 67%, rising to 94% after the knockout stage began. Google successfully predicted the number of flu patients based on aggregated user search logs. Some big data companies and big data service companies (such as EDITD) use mobile social media data to predict future fashion.

IV. ISSUES OF BIG DATA PROCESSING

A. Issues of Data Collection

At present, the data analyzed by recommendation and prediction systems comes mainly from the Internet. Facebook, Baidu and other companies need to deal with more than 10 PB of data per month, and Google deals with hundreds of PB; these data are created at an exponential growth rate [17]. Many social networks, such as Twitter, provide an accessible API to encourage developers to take advantage of the resources on their social networks for data mining and to extract valuable information. In the process of data collection, the increasingly complex social network and its massive content data bring great challenges to the acquisition of social network data. The following are some of the issues that must be considered in the data collection process:

Issue A-1. Less data collected and low recall rates

In the process of data collection, collecting data from the whole network is difficult to achieve because network connectivity is difficult to guarantee, among other reasons, and the dynamic nature of the social network means that we usually only obtain network snapshots at particular moments. Many social network event monitoring systems depend heavily on the social network's built-in search engine to collect thematic messages, so the quality and quantity of the data cannot be guaranteed. In the process of collecting data, data is easily lost, the amount of data collected is low and the recall rate is low, and in the end data analysis becomes difficult.

Issue A-2. Data sparseness

Data sparseness is one of the most common problems in recommendation systems. It mainly refers to the sparseness of the user-item rating matrix, that is, the interactions between users and items are too few. As of the end of 2014, Taobao had registered nearly 500 million members, with over 120 million active users and about 1 billion items online. But the average number of goods browsed by a user is relatively small: the average number of item pages visited per user does not exceed 800, and in fact the average number of items actually selected per user may not exceed 20. At this scale, the intersection of the goods browsed by any two users is relatively small. The ratio of the items selected by a user to the entire item space is very low, which greatly reduces the predictive performance of the various recommendation strategies. If the data set is very sparse and contains only a very small amount of user behavior data, the accuracy of the recommendation algorithm is greatly reduced, it easily leads to over-fitting, and the performance of the algorithm is affected.
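As a rough illustration of this problem, the sketch below builds a random sparse user-item interaction matrix and measures its density and the average number of interactions per user. The matrix size and density are synthetic placeholders, not the Taobao figures quoted above.

```python
# Sketch: quantify the sparseness of a user-item interaction matrix.
# Sizes and density below are synthetic placeholders for illustration.
import numpy as np
from scipy.sparse import random as sparse_random

n_users, n_items = 10_000, 50_000
interactions = sparse_random(n_users, n_items, density=0.0004,
                             format="csr", random_state=0)

density = interactions.nnz / (n_users * n_items)
print(f"density = {density:.6f}, sparsity = {1 - density:.6f}")
print("average interactions per user =",
      float(interactions.getnnz(axis=1).mean()))
```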
B. Issues of Data Preprocessing

In [18], Witten et al. mention that data preprocessing describes the processing of raw data to prepare it for any other type of processing. Due to the diversity of data sources, the collected data sets have unpredictable quality levels in terms of redundancy, noise, consistency, and so on [13]. In data preprocessing, the relevant preprocessing computations are performed on the collected data, and the processing results serve as the mathematical input of the recommendation system. The study of big data preprocessing is very complex and contains a wealth of strategies and techniques. Research on the authenticity, accuracy, completeness and timeliness of big data sources is a very critical first step in the big data processing pipeline. The problems encountered during preprocessing are as follows:

Issue B-1. Noise data

Whether in a recommendation system or a prediction system, there is often a variety of noise, such as data loss or data anomalies, in the collected data. The original data source is mixed with various kinds of noise data: on the one hand, abnormal data is produced in the process of data collection and reporting; on the other hand, cheating data is generated during the online operation of the system. Hardware errors, programming errors, or garbled characters in the program may also cause noise data. Noise data will lead to abnormal recommendation results, so processing noise data is a key issue in the data preprocessing phase.

Issue B-2. Incomplete data

If we want to analyze a year's earthquake data and make some predictions, we find that during such periods data on Twitter, microblogs and other social media surges, and these sites accumulate a lot of data in a short time, but these data can hardly reflect all the problems. Because large cities have large populations and smart phones are more widespread there, data about earthquakes is concentrated in large cities, while relatively remote areas, often the most severely affected ones, are poorly covered and have almost no related statistics. This causes some data to be missed. Incomplete data is also generated by node failures and unstable network transmission.

C. Issues of Data Storage

According to IDC, global data generation in 2011 was only 1.8 ZB (1.8 trillion GB), and global data is expected to grow 50 times by 2020. The data in the era of big data are diversified: not only structured data, but also semi-structured and unstructured data, and with the development of social networks and mobile networks, unstructured data is increasing. The focus of big data storage is to collect data with different structures, improve data quality through preprocessing, store the data, and establish corresponding databases for management. Big data storage problems mainly focus on the following aspects:

Issue C-1. Limitations of storage technology

For Big Data applications (such as recommendation and prediction systems), the biggest problem is that large amounts of data are needed to produce results, and this causes data storage problems. Large-scale storage usually reaches the PB level, and sometimes even the EB level; by 2020, the total amount of data will grow 44 times over 2009. Big Data management, query and analysis place higher requirements on storage technology. Because of the large amount of data and its complex structure, storage standards will also undergo revolutionary change. Big Data types are complex, with structured, semi-structured and unstructured data often coexisting; Big Data storage needs to achieve unified storage of all these types of data, and current storage technology still has some limitations [11].

Issue C-2. Data timeliness

Timeliness refers to content that is strongly related to time, such as news, current affairs, and so on. The recommendation system requires real-time analysis of the user's browsing history and accurate recommendation of the corresponding content. This requires that the storage system maintain a high response speed, because the result of a response delay is that the system pushes "expired" content to the user, resulting in invalid recommendations. Weather forecasts change from day to day and have strict time requirements: if we only learn the prediction the next day, the forecast is worthless. Other Big Data prediction applications, such as the stock market and real-time pricing, also have high "timeliness" requirements.

D. Issues of Data Analysis

The goal of data analysis is to use effective methods to accurately identify and predict the relationships between data values. Over the past few decades, researchers have responded to the ever-increasing amount of data by accelerating analysis algorithms. What are the problems of Big Data analysis for large-scale data applications such as social networks?

Issue D-1. Accuracy

Accuracy is the ultimate goal of Big Data prediction. Variables are dynamic at different time points, and any variable may trigger a change in the whole system or even a butterfly effect. If a variable has a decisive effect on the results and is difficult to capture, prediction becomes difficult; human factors are one example. Big Data forecasting applications mostly concern very unstable areas that nonetheless follow certain laws, such as weather, the stock market and disease. This requires the system to accurately capture the data for each variable and adjust the forecast in real time. Moreover, the laws behind some data are elusive, which causes big problems for forecast accuracy; earthquake prediction is an example.

Issue D-2. Scalability

Recommendation systems need to recommend thousands of products or even more than a million. The massive user history log poses a huge challenge to traditional recommendation algorithms. With the rapid growth of both the number of users and the number of goods, the massive user log has to wait for the recommendation system to calculate, analyze and mine it. Although a recommendation algorithm itself may be simple, its computational complexity is high: it usually requires very high-dimensional user data and forms virtual communities through statistical or likelihood calculations over high-dimensional matches. As a result, even the most basic recommendation algorithms are often very time-consuming and scale poorly in practice; as the numbers of users and goods increase, the complexity of the algorithm grows polynomially and its performance gets worse.

V. SOLUTIONS OF BIG DATA PROCESSING

In the previous section, we classified and analyzed the types of data quality problems that exist in the processing pipeline. The main content of this section is to propose solutions.

A. Solutions of Data Collection

In big data, the quality of the data source determines the quality of the data. In order to deal with the barriers in the data collection phase of big data, the following points need to be noticed.

Solution A-1. Increase acquisition coverage

We can combine built-in-search and meta-search-based approaches to increase the coverage of the acquisition and obtain more information. This method is not limited to a single social network; it is applicable to most social networks. Using a PageRank-based friend recommendation algorithm is also feasible: it integrates the user's interaction information and social information into the recommendation process. The FP-Growth association mining algorithm under Mahout improves the accuracy, recall and forecast coverage of the recommendation system by mining correlations in the actual business data of the social network.
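A minimal sketch of the PageRank-style friend recommendation idea is given below, using networkx. The toy graph, the damping factor and the candidate filtering are assumptions for illustration and are not the exact algorithm referenced above.

```python
# Sketch: personalized-PageRank friend recommendation on an interaction graph.
# The graph, damping factor and top-n filtering are illustrative assumptions.
import networkx as nx

def recommend_friends(graph: nx.Graph, user, top_n: int = 3):
    # Restart the random walk at the target user so scores reflect
    # proximity in the social/interaction graph.
    personalization = {node: 0.0 for node in graph}
    personalization[user] = 1.0
    scores = nx.pagerank(graph, alpha=0.85, personalization=personalization)
    candidates = [(node, score) for node, score in scores.items()
                  if node != user and not graph.has_edge(user, node)]
    return sorted(candidates, key=lambda kv: kv[1], reverse=True)[:top_n]

if __name__ == "__main__":
    g = nx.Graph([("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("c", "e")])
    print(recommend_friends(g, "a"))   # non-friends ranked by walk proximity
```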
Solution A-2. Using dimension reduction techniques and processing algorithms

Dimensionality reduction technology can effectively alleviate the sparseness of data. Among the many dimensionality reduction methods, the most relevant are principal component analysis and singular value decomposition. Diffusion-style algorithms can also be used, going from the original first-order associations (how many similar scores or common purchases two users have) to second-order or even higher-order associations (assuming that relevance or similarity itself can propagate). Some default scores can also be added, thereby improving the resolution of the similarity measure. In general, the bigger the Big Data, the sparser it will be, so it is now thought that algorithms that can deal with sparse data (such as diffusion, iterative optimization, and transfer similarity) are more valuable.
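The sketch below illustrates the singular-value-decomposition route on a synthetic sparse rating matrix: users are projected into a low-dimensional latent space and similarities are computed there, where sparseness hurts far less. The matrix size and the number of latent components are assumptions for illustration.

```python
# Sketch: alleviate rating-matrix sparseness with truncated SVD.
# The synthetic matrix and the choice of 50 latent factors are placeholders.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

ratings = sparse_random(1_000, 5_000, density=0.002, format="csr", random_state=0)

svd = TruncatedSVD(n_components=50, random_state=0)
user_factors = svd.fit_transform(ratings)        # (n_users, 50) latent factors

# User-user similarity in the latent space instead of the raw sparse vectors.
sim_to_user0 = cosine_similarity(user_factors[:1], user_factors)[0]
print("users most similar to user 0:", np.argsort(-sim_to_user0)[1:6])
```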
B. Solutions of Data Preprocessing

Data preprocessing has a critical impact on data analysis, including the recommendation results and prediction results. In order to improve the quality of the data, emphasis on data preprocessing is necessary. This article provides two solutions.

Solution B-1. Remove noise

For abnormal data produced during data collection and reporting, filtering needs to combine the database table structure and the actual scenario, for example null checks, numerical value checks, type anomaly checks and de-duplication. In addition, "artificial" noise data, such as click fraud and ranking manipulation, will seriously affect the effectiveness of subsequent algorithms. Some anti-cheating strategy is needed to remove or down-weight such data, for example session analysis combined with filtering rules over cookies, IP addresses, behavior and visit counts. Removing noise in the data preprocessing phase often works better than applying a complex optimization algorithm afterwards.
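A compact sketch of this kind of rule-based filtering is shown below with pandas. The column names, the valid rating range and the per-IP click threshold are assumptions standing in for the table-structure and anti-cheating rules described above.

```python
# Sketch: rule-based noise removal (null check, value range, de-duplication,
# and a crude per-user/per-IP click-fraud filter). Columns and thresholds
# are assumptions for illustration.
import pandas as pd

def clean_events(df: pd.DataFrame, max_clicks_per_ip: int = 500) -> pd.DataFrame:
    df = df.dropna(subset=["user_id", "item_id"])                   # null check
    rating = pd.to_numeric(df["rating"], errors="coerce")           # type anomaly check
    df = df[rating.between(1, 5)]                                   # value range check
    df = df.drop_duplicates(subset=["user_id", "item_id", "ts"])    # de-duplication
    # Anti-cheating heuristic: drop (user, ip) pairs with implausibly many clicks.
    clicks = df.groupby(["user_id", "ip"])["item_id"].transform("count")
    return df[clicks <= max_clicks_per_ip]

if __name__ == "__main__":
    raw = pd.DataFrame({
        "user_id": [1, 1, None, 2],
        "item_id": [10, 10, 11, 12],
        "rating": [5, 5, 3, 99],
        "ip": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "2.2.2.2"],
        "ts": [1, 1, 2, 3],
    })
    print(clean_events(raw))
```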
Solution B-2. Clustering algorithms

Hathaway and Bezdek developed a method for clustering incomplete relational data based on incomplete dissimilarities [22]. Li et al. proposed a hybrid method based on nearest-neighbor intervals of the attributes, in which incomplete data clustering is carried out by a genetic algorithm combined with fuzzy c-means [23]. In [8], the data set is divided into a complete data set and an incomplete data set, and an affinity-based clustering algorithm is then used to cluster the complete data set; according to a similarity measure, the incomplete data are assigned to the corresponding clusters. In order to improve the efficiency of the algorithm, a distributed clustering algorithm based on cloud computing technology is designed.
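The simplified sketch below follows the same split-then-assign idea: cluster the complete records with K-means, then assign each incomplete record to the nearest centroid using only its observed attributes. It is a stand-in for, and is much simpler than, the fuzzy c-means and genetic-algorithm hybrids cited above; the data is synthetic.

```python
# Sketch: cluster the complete subset, then place incomplete rows into the
# nearest cluster using only their observed attributes. A stand-in for the
# cited fuzzy c-means / genetic-algorithm methods, not a reimplementation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))
data[rng.random(data.shape) < 0.1] = np.nan        # inject missing values

complete_mask = ~np.isnan(data).any(axis=1)
complete, incomplete = data[complete_mask], data[~complete_mask]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(complete)

def nearest_cluster(row: np.ndarray) -> int:
    observed = ~np.isnan(row)
    dists = np.linalg.norm(
        kmeans.cluster_centers_[:, observed] - row[observed], axis=1)
    return int(np.argmin(dists))

labels = np.array([nearest_cluster(r) for r in incomplete])
print("incomplete rows per cluster:", np.bincount(labels, minlength=3))
```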
C. Solutions of Data Storage

In order to solve the problem of data storage, we can use the following two approaches.

Solution C-1. Using cloud storage technology and tiered storage

Cloud storage [19] makes a large number of storage devices in the network work together through application software, by means of cluster applications, grid technology or distributed file systems. Cloud storage focuses on providing Internet-based online storage services; users do not need to consider storage capacity, storage device type, data storage location, or data availability. Hierarchical (tiered) storage selects different storage media according to the capacity, usage and performance requirements of the application in order to reduce the total cost. Hierarchical storage in cloud storage classifies data and migrates it to different storage tiers, and hierarchical storage management strategies greatly simplify storage resource management.
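The sketch below illustrates one possible hierarchical placement policy: each object is routed to a hot, warm or cold tier based on access recency and frequency. The tier names and thresholds are invented for illustration; real cloud storage services usually express such policies as lifecycle rules rather than application code.

```python
# Sketch: a toy tiered-storage placement policy driven by access statistics.
# Tier names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class StoredObject:
    key: str
    days_since_last_access: int
    accesses_last_30d: int

def choose_tier(obj: StoredObject) -> str:
    if obj.days_since_last_access <= 7 or obj.accesses_last_30d >= 100:
        return "hot-ssd"        # low latency for timeliness-sensitive content
    if obj.days_since_last_access <= 90:
        return "warm-hdd"
    return "cold-archive"       # cheapest media for rarely used data

for obj in [StoredObject("click_log_today", 0, 5000),
            StoredObject("ratings_2016_q1", 40, 12),
            StoredObject("raw_crawl_2014", 900, 0)]:
    print(obj.key, "->", choose_tier(obj))
```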
Solution C-2. Spark platform

Spark [20] introduces the RDD data model and a memory-based computing model, so it adapts well to data mining over large data and is superior to Hadoop in iterative computing. In order to address the timeliness of recommendation and prediction, a real-time context-aware algorithm based on Spark has been proposed. The algorithm combines context filtering with collaborative filtering, uses Kafka as the real-time message transceiver, and handles the real-time data flow with Spark Streaming, which enhances the accuracy and timeliness of the algorithm.
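A minimal sketch of such a Kafka-to-Spark pipeline is shown below using Structured Streaming. The broker address, topic name, window length and console sink are placeholders, and the job assumes that the spark-sql-kafka connector package is available on the classpath; it is an illustration of the plumbing, not the cited recommendation algorithm itself.

```python
# Sketch: consume a Kafka stream with Spark Structured Streaming and count item
# views in short windows so a recommender can react to fresh behaviour.
# Broker, topic, window length and the console sink are placeholder assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("realtime-recommendation-input").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
          .option("subscribe", "user-events")                    # placeholder topic
          .load()
          .selectExpr("CAST(value AS STRING) AS item_id", "timestamp"))

windowed_counts = (events
                   .groupBy(window(col("timestamp"), "1 minute"), col("item_id"))
                   .count())

query = (windowed_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```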
D. Solutions of Data Analysis

Data analysis is the step that finally reflects the important value of big data. Traditional data analysis cannot handle massive amounts of data, and further improvements in accuracy and scalability are needed.

Solution D-1. Useful algorithms

In order to improve the accuracy of Big Data mining and prediction, and because traditional data mining techniques cannot adapt to the Big Data processing environment, many companies have begun to use Naïve Bayes classifiers, association rule mining, decision trees and other well-known algorithms to analyze their data and predict their potential customers and business decisions. Accurate classification results may help a company obtain reliable, intelligent predictions for viable business decisions. In [9], association rule mining was proposed to improve the Naïve Bayes classifier, mainly by combining relevant attributes so as to relax the independence assumption of the Naïve Bayes classifier and reduce the number of attributes. In [16], a large-scale data mining architecture and a prediction model based on a BP neural network are established using cloud services and Big Data processing technology, and finally the prediction results can be obtained.
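As a small illustration of this classifier-based route, the sketch below trains a Naïve Bayes model on synthetic customer features to predict a purchase label. The features, labels and resulting accuracy are fabricated placeholders, and the association-rule refinement of [9] is not reproduced.

```python
# Sketch: a plain Naive Bayes purchase-prediction example on synthetic data.
# Features and labels are fabricated; the association-rule enhancement of [9]
# is intentionally omitted.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5))          # e.g. recency, frequency, spend, ...
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = GaussianNB().fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```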

Solution D-2. Clustering algorithms

In order to solve the scalability problem, according to the literature, clustering algorithms can effectively solve the scalability problem of the recommendation system and improve recommendation accuracy. In [21], a K-means clustering algorithm with an improved centroid selection method and distance measurement is proposed. In [14], the application of model-based collaborative filtering technology is studied; in particular, a clustering CF framework and two clustering CF algorithms are proposed: item-based fuzzy clustering collaborative filtering (IFCCF) and trust-aware clustering collaborative filtering (TRACCF). In [6], the PCA-GAKM algorithm is proposed: first, the data is preprocessed by principal component analysis; then, the K-means clustering algorithm is improved by fusing it with a genetic algorithm; finally, a TOP-N recommendation mechanism is used to generate the recommendation list.
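The sketch below shows the basic clustering-based collaborative filtering pattern behind these methods: users are clustered on their rating vectors, and the top-N unseen items that score best within the target user's cluster are recommended. The PCA preprocessing and genetic refinement of PCA-GAKM are omitted, and all data is synthetic.

```python
# Sketch: K-means clustering-based collaborative filtering with a TOP-N step.
# Synthetic ratings; the PCA / genetic-algorithm refinements cited above are omitted.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(300, 40)).astype(float)   # 0 means "unrated"

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(ratings)

def recommend(user: int, top_n: int = 5) -> np.ndarray:
    members = np.where(kmeans.labels_ == kmeans.labels_[user])[0]
    cluster = ratings[members]
    observed = cluster > 0
    # Mean rating per item inside the cluster, over observed ratings only.
    item_scores = cluster.sum(axis=0) / np.maximum(observed.sum(axis=0), 1)
    unseen = ratings[user] == 0
    ranked = np.argsort(-np.where(unseen, item_scores, -np.inf))
    return ranked[:top_n]

print("top-5 items for user 0:", recommend(0))
```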
VI. DISCUSSION

Big data provides more opportunities for the Internet, but it also brings more problems and even traps. Mainly from the data quality point of view, and taking the recommendation system and the forecasting system as examples, this paper has presented the problems existing in the Big Data processing process in four aspects. In Big Data processing, the data is large in scale, generated quickly, and complex in type, and these features bring many data quality problems to Big Data processing. Table II shows the relationship between the data processing stages, data quality and the 4V characteristics.

TABLE II. ISSUES, DATA QUALITY AND 4V CHARACTERISTICS

Stage         | Existing issue                           | Primary cause                                                | Data quality | 4V characteristic
Collection    | Less data collected and low recall rates | Network connectivity and dynamics                            | Availability | Volume
Collection    | Data sparseness                          | The interactions between users and items are few             | Relevance    | Variety
Preprocessing | Noise data                               | Abnormal data and cheating data at the time of acquisition   | Usability    | Variety
Preprocessing | Incomplete data                          | Unbalanced data distribution, unstable network transmission  | Reliability  | Veracity, Velocity
Storage       | Limitations of storage technology        | The total amount of data is large and the types are complex  | Usability    | Volume
Storage       | Data timeliness                          | Response time is long                                        | Availability | Variety
Analysis      | Accuracy                                 | The laws behind the data are elusive                         | Reliability  | Velocity
Analysis      | Scalability                              | High data dimensionality, high computational complexity      | Usability    | Volume

Although some results have been achieved in the study of data quality, such as the validity, reliability and availability of data, research on Big Data is still in its infancy. Big Data still needs to solve many problems, such as multi-source data, differentiated quality, how to obtain high-quality large data, how to integrate existing multi-source data, and how to detect and repair data. Some of the problems we have mentioned can be overcome easily, and these practical problems are also common in traditional data processing. However, some issues are closely related to the characteristics of big data and are, so far, still unresolved. Several open issues caused by big data will serve as the main content of this section, to explain the dilemmas that big data may face. Here are some open problems [12]:
Open problem 1. Inconsistent data

Big data clouds will soon be applied to various areas, and the resulting data will inevitably be inconsistent. There are four kinds of inconsistency, in time, text, space, and functional dependencies, and data inconsistencies must be addressed at the data level, the information level, and the knowledge level in order to analyze the data better. Big Data analysis is a new era of data analysis, and data inconsistencies still exist, but traditional inconsistency solutions do not necessarily apply to big data. Data is more prone to inconsistencies because it is captured or generated by different sensors and systems. How to handle the effects of data inconsistency is an open question of Big Data analysis.

Open problem 2. The bottleneck of data mining algorithms

Most data mining algorithms in big data analysis will be designed for parallel computing. However, once a data mining algorithm is designed or modified for parallel computing, the exchange of information between different data mining processes can cause bottlenecks. One of them is the synchronization problem: even if the same algorithm is used to handle the same amount of data, different processes will complete their work at different times. The bottleneck of data mining algorithms will be an open issue in big data analysis, which suggests that we need to take it into account when developing and designing new data mining algorithms for big data analysis.

Open problem 3. Security issues

Since environmental data and human behavior will be collected by a large number of data analyses, how to protect them will also be an open problem, because there is no completely safe way to handle the collected data, and big data analysis cannot yet be a reliable system. Although the security of big data analysis has to be tightened before more data can be collected from all over the world, the reality is that, so far, little research has focused on the security issues of big data analysis. According to our observation, the security of big data analysis can be divided into four aspects: input, data analysis, output, and communication with other systems. The input can be considered as the collection of data from sensors, handheld devices and even networked devices; one of the important security issues in the input part of big data analysis is to ensure that the sensors are not compromised by attacks. For analysis and output, security can be considered an issue of the analysis system itself. For communication with other systems, the security problem lies in the communication between the big data analysis and external systems. Because of these potential problems, security has become one of the open issues of big data analysis.
CONCLUSION AND FUTURE WORK

We have entered the era of big data, in which data is collected, processed and stored so that it can be analyzed better. As a result of the substantial increase in the volume of data in various fields, processing data efficiently has become a major challenge. We need to address these issues and thereby improve the quality of the data. This paper takes the recommendation system and the forecasting system as examples to analyze and describe the possible problems in the process of Big Data processing. We reviewed the concepts of big data, Big Data processing, forecasting systems and recommendation systems. Data quality problems in data preprocessing, data storage, data analysis and the other stages of Big Data research are expounded and analyzed. For each stage, we raise possible problems and propose solutions, and finally we analyze some open questions.

Big Data analysis is unavoidable. Existing Big Data technologies and tools still have limitations and cannot completely solve Big Data problems. In follow-up studies, we need to keep studying these big data problems and try to solve the problem of data quality through better solutions.
REFERENCES
[1] Louridas P, Ebert C. Embedded analytics and statistics for big data. IEEE Software, 2013, 30: 33-39.
[2] Wikipedia contributors. "Big data." Wikipedia, The Free Encyclopedia, 31 Mar. 2017. Web. 31 Mar. 2017.
[3] Gantz J, Reinsel D. "Extracting Value from Chaos," IDC's Digital Universe Study, sponsored by EMC, 2011.
[4] Tole A A. "Big Data Challenges," Database Systems Journal, 2013, IV(3): 31-40.
[5] Kaisler S, Armour F, Espinosa J A. "Big Data: Issues and Challenges Moving Forward," 46th Hawaii International Conference on System Sciences, 2013.
[6] Wang Z, Yu X, Feng N, et al. An improved collaborative movie recommendation system using computational intelligence[J]. Journal of Visual Languages & Computing, 2014, 25(6): 667-675.
[7] Li J, Xu Z, Jiang Y, et al. The overview of big data storage and management[C]// IEEE International Conference on Cognitive Informatics & Cognitive Computing. IEEE, 2014: 510-513.
[8] Leng Y L. Incomplete Big Data Distributed Clustering[J]. Applied Mechanics & Materials, 2014, 687-691: 1496-1499.
[9] Yang T, Qian K, Dan C T L, et al. Improve the Prediction Accuracy of Naïve Bayes Classifier with Association Rule Mining[C]// IEEE International Conference on Big Data Security on Cloud. IEEE, 2016: 129-133.
[10] Cai L, Zhu Y. The Challenges of Data Quality and Data Quality Assessment in the Big Data Era[J]. Data Science Journal, 2015, 14(1): 21-3.
[11] Maślankowski J. Data Quality Issues Concerning Statistical Data Gathering Supported by Big Data Technology[J]. Communications in Computer & Information Science, 2014, 424(1): 92-101.
[12] Tsai C W, Lai C F, Chao H C, et al. Big data analytics: a survey[J]. Journal of Big Data, 2015, 2(1): 21.
[13] Chen T, Honda K. Solving data preprocessing problems in existing location-aware systems[J]. Journal of Ambient Intelligence & Humanized Computing, 2015: 1-7.
[14] Birtolo C, Ronca D. Advances in clustering collaborative filtering by means of fuzzy C-means and trust[J]. Expert Systems with Applications, 2013, 40(17): 6997-7009.
[15] Duggal P S, Paul S. Big Data Analysis: Challenges and Solutions[C]// International Conference on Cloud, Big Data and Trust, Nov 13-15, RGPV, 2013.
[16] Yang F, Xiao-Yan A I, Zhang Y H, et al. New mining architecture and prediction model for big data[J]. Electronic Design Engineering, 2016.
[17] Hilbert M, López P. The world's technological capacity to store, communicate, and compute information. Science, 2011, 332(6025): 60-65.
[18] Witten I H, Frank E, Hall M A. Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Burlington, 2011.
[19] Zeng W, Zhao Y, Ou K, et al. Research on cloud storage architecture and key technologies[C]// International Conference on Interaction Sciences: Information Technology, Culture and Human, Seoul, Korea, 24-26 November 2009. DBLP, 2009: 1044-1048.
[20] Xie L, Zhou W, Li Y. Application of Improved Recommendation System Based on Spark Platform in Big Data Analysis[J]. Cybernetics & Information Technologies, 2017, 16.
[21] Ghazanfar M A, Prügel-Bennett A. Leveraging clustering approaches to solve the gray-sheep users problem in recommender systems[J]. Expert Systems with Applications, 2014, 41(7): 3261-3275.
[22] Hathaway R J, Bezdek J C. "Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm," Pattern Recognition Letters, 2002, 23(1): 151-160.
[23] Li D, Gu H, Zhang L. "A hybrid genetic algorithm-fuzzy c-means approach for incomplete data clustering based on nearest-neighbor intervals," Soft Computing, 2013, 17(10): 1787-1796.
[24] Li C, Lan M, Zou B, et al. Big Data and Recommendation System[J]. Big Data Research, 2015, 14(1): 39-43.
[25] Madden S. From databases to big data. IEEE Internet Computing, 2012, 16(3): 4-6.
[26] Hamm S. How big data can boost weather forecasting, 2013. http://readwrite.com/2013/02/28/how-big-data-can-boost-weather-forecasting#awesm=ou64ZEaKe2HtUu. Accessed 20 Nov 2014.
[27] Han J, Kamber M, Pei J. Data mining: concepts and techniques. Elsevier, San Francisco, 2012.
