Abstract:
The categorization of data and information requires re-evaluation in the age of Big Data in order to ensure that appropriate protections are given to different types of data. The aggregation of large amounts of data requires an assessment of the harms and benefits that pertain to large datasets linked together, rather than simply assessing each datum or dataset in isolation. Big Data produces new data via inferences, and this must be recognized in ethical assessments. We propose a Classification of Data Quality Assessment and Improvement Methods. The use of schemata such as this will assist decision-making by providing research ethics committees and information governance bodies with guidance about the relative sensitivities of data. This will ensure that appropriate and proportionate safeguards are provided for data research subjects and will reduce inconsistency in decision making.

Keywords

Introduction
Big Data technology involves collecting data from different sources and merging it so that it becomes available to deliver a data product useful for the organization's business: it is the process of converting large amounts of data, i.e. unstructured raw data received from different sources, into a data product useful for the organization and its users.

Most big data problems can be categorized in the following ways, each illustrated in the sketch that follows this list:

Supervised classification
Supervised regression
Unsupervised learning
Learning to rank
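As a purely illustrative sketch (the paper prescribes no library; Python with scikit-learn and synthetic data are assumptions made here), the snippet below pairs each category with a representative technique:

    import numpy as np
    from sklearn.linear_model import LogisticRegression, LinearRegression
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.random((100, 4))            # 100 examples, 4 features (synthetic)
    y_class = rng.integers(0, 2, 100)   # discrete labels
    y_reg = rng.random(100)             # continuous targets

    # Supervised classification: predict a discrete label.
    clf = LogisticRegression().fit(X, y_class)
    # Supervised regression: predict a continuous value.
    reg = LinearRegression().fit(X, y_reg)
    # Unsupervised learning: find structure (e.g. segments) without labels.
    segments = KMeans(n_clusters=3, n_init=10).fit_predict(X)
    # Learning to rank: order items by predicted relevance; reduced here
    # to a pointwise regression, a common simplification.
    ranking = np.argsort(-reg.predict(X))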
The best way to find sponsors for a project is to understand the problem and what the resulting data product would be once it has been implemented. This understanding will give an edge in convincing management of the importance of the big data project. Management is often thirsty for new insights, and segmentation models can provide such insight for marketing.

Supervised learning means that in order to train a model that will, for example, recognize digits from an image, we need to label a significant number of examples by hand. There are web services that can speed up this process and are commonly used for this task, such as Amazon Mechanical Turk. It is proven that learning algorithms improve their performance when provided with more data, so labeling a decent number of examples is practically mandatory in supervised learning.
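To make the role of hand-labeled examples concrete, the following sketch assumes Python and scikit-learn's bundled digits dataset, which stands in for a corpus labeled by hand (e.g. via a crowdsourcing service):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    digits = load_digits()     # every image carries a hand-assigned label
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=0)

    # Supervised training is only possible because y_train exists.
    model = SVC().fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))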
Who would benefit from your project? Who would be your client? The data being generated might be too large for the methods currently utilized for handling it: will any data mining method have the capacity to deal with it, and how will the project safeguard this sensitive data?
3. Administration Capabilities
We reviewed the actual tool (for those that were freely available) and any documentation of the tool, including information on the methods it provides, and from each tool we extracted the DQ methods. We complemented this review with a general review of DQ literature that describes DQ methods, and have cited the relevant works in our resulting list of DQ methods.

Column analysis computes, for numerical columns, median and average value scores, etc. In addition, column analysis also computes the inferred type of a column. For example, a column could be declared as a ‘string’ column in the physical data model, but the values found would lead to the inferred data type ‘date’. The frequency of values is another key metric which can influence the assessment.
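A minimal sketch of such a column analysis, assuming pandas as the implementation vehicle (the review does not prescribe any particular tool, and the sample table is invented for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "amount": [10.0, 12.5, 11.0],                           # numeric
        "created": ["2021-01-05", "2021-02-17", "2021-03-02"],  # declared 'string'
    })

    for name, col in df.items():
        if pd.api.types.is_numeric_dtype(col):
            # Median and average value scores for numerical columns.
            print(name, "median:", col.median(), "mean:", col.mean())
        else:
            # Inferred type: a declared 'string' column whose values all
            # parse as dates receives the inferred data type 'date'.
            if pd.to_datetime(col, errors="coerce").notna().all():
                print(name, "declared: 'string', inferred: 'date'")
        # Frequency of values, another key column metric.
        print(name, "most frequent value count:", col.value_counts().iloc[0])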
Column set analysis refers to how values from multiple columns can be compared against one another. For this research, we include this functionality within the term “column analysis”.

Cross-domain analysis (also known as functional dependency analysis in some tools) can be applied to data integration scenarios with dozens of source systems. It enables the identification of redundant data across tables from different, and in some cases even the same, sources. Cross-domain analysis is done across columns from different tables to identify the percentage of values that are the same, and hence indicates whether the columns are redundant.
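The overlap measurement at the core of cross-domain analysis can be sketched in a few lines; the two tables and the redundancy threshold below are invented for illustration:

    customers = {"country": ["AT", "DE", "DE", "CH"]}   # column from table 1
    suppliers = {"nation":  ["DE", "AT", "CH", "CH"]}   # column from table 2

    def value_overlap_pct(a, b):
        # Percentage of distinct values shared by the two columns.
        sa, sb = set(a), set(b)
        return 100.0 * len(sa & sb) / len(sa | sb)

    pct = value_overlap_pct(customers["country"], suppliers["nation"])
    print(f"value overlap: {pct:.0f}%")
    if pct >= 80:   # assumed threshold, not taken from any specific tool
        print("columns are candidates for redundant data")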
Data verification algorithms verify whether a value or a set of values is found in a reference data set; these are sometimes referred to as data validation algorithms in some DQ tools. A typical example of automated data verification is checking whether an address is a real address by using a postal dictionary. It is not possible to check whether it is the correct address, but these algorithms verify that the address refers to a real, occupied address. The results depend on high quality input data.
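Such a check reduces to a membership test against reference data; in the sketch below, a tiny in-memory set stands in for a real postal dictionary:

    # Toy stand-in for a real postal reference data set.
    POSTAL_DICTIONARY = {
        ("1010", "Vienna"),
        ("80331", "Munich"),
    }

    records = [("1010", "Vienna"), ("9999", "Nowhere")]
    for postal_code, city in records:
        found = (postal_code, city) in POSTAL_DICTIONARY
        # Confirms the address exists, not that it is the correct
        # address for this particular record.
        print(postal_code, city, "->", "verified" if found else "not found")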
Despite the fact that some DQ problems can be automatically detected, and that the correction methods can also be automated, the whole process cannot in most cases be carried out without human intervention. In finding a common spelling error in many different instances of a word, for example, a human is often needed to develop the correct regular expression that automatically finds and replaces all the incorrect instances. Between the application of the automated assessment and improvement methods, therefore, there often exists a manual analysis and configuration step.

The aim of this research was to provide a review of methods for DQ assessment and improvement and to identify gaps where there are no existing methods to address particular DQ problems.
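The division of labor described above, manual analysis followed by automated correction, can be seen in a regex-based improvement step; the misspelling and pattern below are invented for illustration:

    import re

    # Manual step: after inspecting the data, a human writes the pattern
    # capturing the common misspelling ("ie" instead of "ei").
    pattern = re.compile(r"\brecie(ve[sd]?|pt)\b")

    def correct(text):
        # Automated step: find and replace every incorrect instance.
        return pattern.sub(r"recei\1", text)

    print(correct("We recieve goods and file the reciept."))
    # -> We receive goods and file the receipt.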