April 5, 2015
1 Introduction
The purpose of this essay is to give an overview of two scientific areas, Big Data
Management and Semantic Web Technologies, covering the objectives and the technologies
of each. A short example follows the overview of each area to provide context for it.
After this introduction, the essay argues that the two areas are related by presenting
recent applications that combine techniques from both fields, and that this relationship
is exploited in the same way whenever the two are combined in practice.
An additional section at the end gives a more detailed overview of one particular
application of both fields.
Data Science Series (2012) gives an extended list of possible benefits of turning to
Big Data resources, for both businesses and customers. The list suggests that Big Data
can be advantageous to a company regardless of the sector or niche it occupies, since
new opportunities for using data can be discovered and exploited.
2.1 Objectives
As noted above, Big Data Management aims to use appropriate tools and techniques to
capture, process, and analyse data that is fast, large, uncertain, and heterogeneous.
Chen and Zhang (2014) present an exhaustive list of challenges that Big Data poses for
computing. The list includes storage, I/O speed, network throughput, data curation, and
processing power as umbrella categories over more detailed challenges. Overcoming the
listed challenges is, in effect, the set of objectives that Big Data Management must
meet to reach its overall aim: to store, process, and analyse large amounts of varied,
uncertain data.
2.2 Technologies
Given the objectives and current challenges of Big Data Management, Chen and Zhang
(2014) discuss possible approaches to better handling of Big Data. For instance,
inconsistent, incomplete, or noisy data can be improved through cleaning, integration,
and transformation. The challenge is to perform all these tasks live, as the data
becomes available. One solution for fast processing is, of course, to handle the data
in parallel.
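
The idea of cleaning and transforming data as it arrives can be sketched as a small streaming pipeline. This is a minimal illustration, not a method from Chen and Zhang (2014): the record fields (`sensor`, `value`, `unit`) and the temperature conversion are assumptions chosen only to make the example concrete.

```python
from typing import Iterable, Iterator, Optional


def clean(record: dict) -> Optional[dict]:
    """Drop incomplete records and coerce a noisy reading to a float."""
    if record.get("sensor") is None or record.get("value") is None:
        return None  # incomplete: a required field is missing
    try:
        value = float(record["value"])
    except (TypeError, ValueError):
        return None  # noisy: the reading cannot be parsed as a number
    return {"sensor": record["sensor"], "value": value, "unit": record.get("unit", "C")}


def transform(record: dict) -> dict:
    """Normalise every reading to degrees Celsius."""
    if record["unit"] == "F":
        celsius = (record["value"] - 32.0) * 5.0 / 9.0
        return {"sensor": record["sensor"], "value": celsius, "unit": "C"}
    return record


def process_stream(stream: Iterable[dict]) -> Iterator[dict]:
    """Clean and transform records one at a time, as they become available."""
    for raw in stream:
        cleaned = clean(raw)
        if cleaned is not None:
            yield transform(cleaned)


# Hypothetical incoming records (assumed data, for illustration only).
incoming = [
    {"sensor": "a", "value": "21.5", "unit": "C"},
    {"sensor": "b", "value": "n/a", "unit": "C"},  # noisy reading
    {"sensor": None, "value": "19.0"},             # incomplete record
    {"sensor": "c", "value": 212, "unit": "F"},
]
results = list(process_stream(incoming))
```

Because `process_stream` is a generator, each record is handled as soon as it is read, so the full data set never has to be held in memory at once.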
A current solution for managing Big Data that may come from distributed sources is the
NoSQL database. NoSQL is more of a philosophy than a single technique or tool: it
describes a set of approaches by which Big Data Management can be accomplished. For
instance, some NoSQL databases use relations and others do not, some do not use SQL as
their query language, and some employ schema-free (schema-less) or flexible-schema
policies. In addition, different approaches to storing data are in use. Some systems
use key-value storage, a variation of which is key-document storage, while others turn
to column families or even graphs. What NoSQL databases generally have in common is
their commitment to a dynamic schema as an underlying feature, which is an advantage
when dealing with heterogeneous data.
Another common factor is the separation between the storage and the management of the
data. While storage happens in one of the fashions mentioned above, management is
implemented in the application layer. When dirty data is extracted from the system, it
is handed to the application layer, which decides what should be processed further and
what is not needed for that particular extraction.
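
The storage styles and the application-layer management described above can be made concrete with plain Python structures. This is only a sketch: the patient record, the key names, and the `usable_pulses` helper are hypothetical and do not correspond to any particular NoSQL product.

```python
# The same (hypothetical) patient record in four NoSQL storage styles.

# Key-value: an opaque value looked up by a single key.
key_value = {"patient:42": '{"name": "Ada", "pulse": 61}'}

# Key-document: the value is a structured document the store can inspect.
# Documents in one collection need not share a schema.
documents = {
    "patient:42": {"name": "Ada", "pulse": 61},
    "patient:43": {"name": "Bob"},                     # no pulse field
    "patient:44": {"name": "Cy", "pulse": "unknown"},  # dirty value
}

# Column family: each row key maps to named groups of columns.
column_family = {
    "patient:42": {"identity": {"name": "Ada"}, "vitals": {"pulse": 61}},
}

# Graph: nodes plus labelled edges between them.
graph_edges = [("patient:42", "treated_by", "doctor:7")]


def usable_pulses(store: dict) -> dict:
    """Application-layer management: keep only what this extraction needs."""
    kept = {}
    for key, doc in store.items():
        pulse = doc.get("pulse")
        if isinstance(pulse, int):  # skip missing or dirty values
            kept[key] = pulse
    return kept


pulses = usable_pulses(documents)
```

The store accepts all three documents as they are; the decision about which of them are usable is made by the application code in `usable_pulses`, not by the storage layer.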
Some of the state-of-the-art approaches to Big Data Management that Chen and Zhang
(2014) discuss include statistical analysis of the data at hand, data mining, and the
use of neural networks together with machine learning algorithms to discover patterns
in the data and to cluster the discovered items into classes.
2.3 Big Data Example
As the description above suggests, Big Data can provide additional revenue to any
company that deals with data. Apart from monetary interest, Big Data can provide new
knowledge to science, as there is potential value hidden in any data.
A simple but powerful example is the notion of ‘smart cities’. Data Science Series
(2012) also gives this as an example of Big Data, but smart cities can additionally be
viewed as an encapsulation of services, such as health services, public services,
transport services, and more. In the case of health services, patients can wear a
personalised doctor on their wrist that sends data to an actual doctor, or even to an
AI that records data at every moment of the patient's life and gives the person direct
hints on how to improve it. Public services, for example, can monitor traffic
developments, gatherings of people, forums, etc. and act upon this data for the good of
the citizens. As for transport services, public transport operators can cooperate and
provide services only where they are needed.
3.1 Objectives
Shadbolt et al. (2006) write that e-science, the source of the need for the technology,
is a major driver for the Semantic Web because it requires data integration between
heterogeneous data sets that come from different scientific communities. Such
integration can be achieved through the use of ontologies: standards for the formal
names, definitions, properties, and relations of entities within one particular domain.
The rationale behind integrating data from a wide range of fields is the movement
towards interdisciplinary science, that is, the fusion of different disciplines in
pursuit of new shared knowledge.
Therefore, certain standards should be enforced to allow distributed and heterogeneous
data to merge into meaningful, unambiguous knowledge in any domain.
3.2 Technologies
The key technologies (or rather techniques) of the Semantic Web begin with URIs, which
identify resources. Given the URI of a resource, anyone can refer to it. URIs are the
building blocks of RDF: they identify each part of a subject-predicate-object triple,
which in turn relates a subject to an object. When building an application, an RDF
vocabulary can be used to specify the domain of the predicates used within that
application. An RDF vocabulary serves as an abstraction over individual RDF statements
and provides a single entry point through which vocabularies can be linked. RDF Schema
(RDFS) is a further abstraction on top of RDF that describes groups of related
resources. An application-specific RDF vocabulary is optional, whereas RDF Schema is
the common basis on which such vocabularies are usually defined. Triple stores, in
turn, hold RDF data and provide facilities for richer RDF content. To provide
standardised access to triple stores, the SPARQL language was developed to query the
underlying RDF data. OWL languages provide means of adding extra information to RDFS to
make the knowledge more expressive. In addition, OWL languages support ontology
consistency checking (Shadbolt et al., 2006).
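
The subject-predicate-object model and the idea of querying it by pattern can be sketched in a few lines of plain Python. This is an illustration only: it uses neither a real RDF library nor real SPARQL syntax, and the `http://example.org/` URIs and the `match` helper are assumptions made for the example (only the `rdf:type` URI is a real W3C term).

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical URIs under example.org; only rdf:type is a real W3C term.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples: List[Triple] = [
    (EX + "alice", RDF_TYPE, EX + "Researcher"),
    (EX + "alice", EX + "worksOn", EX + "genomics"),
    (EX + "bob", RDF_TYPE, EX + "Researcher"),
    (EX + "bob", EX + "worksOn", EX + "proteomics"),
]


def match(
    data: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit a pattern; None plays the role of a variable."""
    return [
        (s, p, o)
        for (s, p, o) in data
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Roughly what a query such as
#   SELECT ?who WHERE { ?who rdf:type ex:Researcher }
# asks for: every subject that is typed as a Researcher.
researchers = [s for (s, _, _) in match(triples, predicate=RDF_TYPE, obj=EX + "Researcher")]
```

Because every part of a triple is a URI, two data sets that use the same URIs for the same things can be merged by simply concatenating their triple lists.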
4.1 Semantic Link Network for Big Data in Multimedia
The paper by Liu et al. (2014) presents an approach to organising multimedia resources
using their texts and surrounding texts. The aim of the project is to give meaning to
different multimedia resources, to let users search for related resources, and to let
them gain a more comprehensive understanding of a particular resource from its
relationships.
The authors' main assumption is that manual annotations can be considered a reliable
source of semantics. They also note that ontologies can describe multimedia semantics.
The aim of the paper is thus to bridge the gap between ontologies and manually given
annotations. The motivation is to provide reasoning that can derive implicit knowledge
from information. Deriving implicit knowledge is useful in many areas, such as
surveillance, sports, or the Internet of Things.
During the presentation of the results, certain heuristics were applied to filter the
underlying assumptions of the model even further. As a result, with the use of
ontologies and tags along with textual descriptions, semantic relatedness between
multimedia items was established accurately and robustly.
The notion of smart data is introduced in the context of health care as a fusion of Big
Data and the Semantic Web. The Big Data part of smart data deals with accessing and
processing large volumes of homogeneous and heterogeneous data about every single
patient. Since the data is unstructured most of the time, Semantic Web technologies
come into play and are used to annotate various concepts.
For such a system it is important to use a controlled vocabulary, so that all parties
belonging to the system use the same terms when describing aspects of the research.
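
Annotating unstructured text against a controlled vocabulary can be sketched as follows. The vocabulary entries, the clinical notes, and the `annotate` helper are all hypothetical; a real system would rely on an established medical terminology and more robust text matching than this whole-phrase lookup.

```python
import re

# Hypothetical controlled vocabulary: free-text variants -> one agreed term.
CONTROLLED_VOCABULARY = {
    "heart attack": "myocardial_infarction",
    "myocardial infarction": "myocardial_infarction",
    "mi": "myocardial_infarction",
    "high blood pressure": "hypertension",
    "hypertension": "hypertension",
}


def annotate(note: str) -> set:
    """Return the controlled terms whose variants occur as whole phrases in a note."""
    # Lower-case the note and replace punctuation with spaces so that variants
    # can be matched as space-delimited phrases.
    normalised = re.sub(r"[^a-z0-9]+", " ", note.lower()).strip()
    padded = " " + normalised + " "
    return {
        term
        for variant, term in CONTROLLED_VOCABULARY.items()
        if " " + variant + " " in padded
    }


# Hypothetical notes written by different parties in their own words.
notes = [
    "Patient reports high blood pressure.",
    "History of MI, now stable.",
    "Admitted after heart attack; hypertension noted.",
]
annotations = [annotate(n) for n in notes]
```

Although the three notes use different wording, they end up annotated with the same two agreed terms, which is what makes the records comparable across parties.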
5 Conclusion
Although Big Data and Semantic Web Technologies can be seen as two different areas of
research, both are applied to certain real-life problems, as described in Section 4.
The applications converge on a similar aim, namely to process and give meaning to the
Big Data generated by embedded technology.
In addition, the main focus of applications of both technologies is knowledge, whether
for profit or for the discovery of more knowledge.
Therefore, it is worth saying that both areas should progress further by giving meaning
to the unstructured, fast, and uncertain data around us.
6 Semantic Web technologies for the big data in life sciences
Wu and Yamaguchi (2014) present a survey of big data in life sciences with semantic
extension.
The problem emerges when data sources contain different or previously unseen data types
and different formats of underlying data. To use such data sources, they must be
integrated, thereby eliminating inconsistencies. Data integration requires considerable
knowledge about the data in order to find what can be integrated and what cannot or
should not be. The authors point out that the main problems in this context, as also
noted in Section 2, are the volume and the rate of the generated data. The paper thus
discusses how Semantic Web Technologies can solve the general problems of Big Data
Management that were outlined in Section 2.
The paper then describes the Semantic Web technologies that were listed in Section 3.2,
with examples that illustrate each technology. In addition, it presents some further
ones, for instance linked data, triple stores, and triple stores in the cloud.
Linked data tries to bring all data from the World Wide Web into a single database and
to make all the data semantically related in some way. Linked data uses the same basic
technologies that were described previously for the Semantic Web. The basic idea is to
connect related pieces of data to one another. A simple example is giving relevant
recommendations to users who are viewing a certain part of the web or searching for
particular information.
A triple store is simply a database for triples. To be highly operational, a triple
store must allow fast query execution, be scalable, and have a low load cost.
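
One way to picture how a triple store can keep both queries and loading cheap is an in-memory store with one index per triple position. This is a toy sketch under that assumption, not the design of any store surveyed by Wu and Yamaguchi (2014); the class name, the `ex:` identifiers, and the data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class TinyTripleStore:
    """In-memory triple store with one hash index per triple position."""

    def __init__(self) -> None:
        self._all: Set[Triple] = set()
        self._by_subject: Dict[str, Set[Triple]] = defaultdict(set)
        self._by_predicate: Dict[str, Set[Triple]] = defaultdict(set)
        self._by_object: Dict[str, Set[Triple]] = defaultdict(set)

    def add(self, s: str, p: str, o: str) -> None:
        """Load one triple; the cost is a fixed number of set insertions."""
        triple = (s, p, o)
        self._all.add(triple)
        self._by_subject[s].add(triple)
        self._by_predicate[p].add(triple)
        self._by_object[o].add(triple)

    def query(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Set[Triple]:
        """Intersect the indexes of the bound positions; None is a wildcard."""
        candidates = [
            index.get(value, set())
            for index, value in (
                (self._by_subject, s),
                (self._by_predicate, p),
                (self._by_object, o),
            )
            if value is not None
        ]
        if not candidates:
            return set(self._all)
        return set.intersection(*candidates)


# Hypothetical life-science triples, for illustration only.
store = TinyTripleStore()
store.add("ex:aspirin", "ex:treats", "ex:headache")
store.add("ex:aspirin", "ex:interactsWith", "ex:warfarin")
store.add("ex:ibuprofen", "ex:treats", "ex:headache")
treatments = store.query(p="ex:treats", o="ex:headache")
```

The indexes trade memory for query speed: each triple is referenced four times, but a query with a bound position never scans triples unrelated to it.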
A triple store in the cloud is yet another paradigm: it lets users connect to a cloud
and, from there, use the data or applications that are available on that cloud. Cloud
computing can provide services such as Data as a Service, which gives access to current
data; Software as a Service, which lets users run a software instance from the cloud;
Platform as a Service, which lets users use an area of the cloud dedicated to them to
execute and test their software; and Infrastructure as a Service, which lets users
utilise the computing power of the servers that host the cloud. The authors then
present some examples of technologies that offer triple stores in the cloud for
scientific purposes.
One of the major challenges for the field of the Semantic Web is that it was not
designed to serve Big Data requirements. To address this, additional concepts were
introduced into the Semantic Web, such as RDF, and RDF Schema and/or OWL on top of RDF.
An issue that still persists in the field of the Semantic Web is that its techniques
cannot deal with fast and large data. Therefore, external solutions are sometimes
employed to accommodate the unpredictability of Big Data.
The authors conclude that effective data processing platforms are needed to process and
share data, especially in research, as pointed out in Section 4.3. In addition, shared
data must remain secure at all times. To increase the performance of Big Data
processing, parallel computing seems to provide some solutions, with tools such as
Hadoop.
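
The map-and-reduce style of parallel processing that tools such as Hadoop are known for can be sketched on a single machine. The sketch below does not use Hadoop or its API: a thread pool stands in for cluster nodes, the word-count task and the corpus are assumptions, and in CPython the threads illustrate the structure of the computation rather than deliver a real speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List


def map_chunk(lines: List[str]) -> Counter:
    """Map step: count words in one chunk, independently of the other chunks."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(lines: List[str], workers: int = 2) -> Counter:
    """Split the input into chunks, map them in parallel, then reduce."""
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge, partials, Counter())


# Hypothetical input, for illustration only.
corpus = ["big data", "semantic web", "big data meets semantic web", "data"]
totals = word_count(corpus)
```

The key property is that `map_chunk` needs no information from other chunks, so the chunks could just as well be processed on separate machines and only the small partial counts sent back for merging.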
In conclusion, the paper extends the earlier discussion of the relationship between the
two fields and is therefore further evidence that the two fields cooperate when dealing
with real-life problems of Big Data processing and analysis.
References
Colin L Bird and Jeremy G Frey. Chemical information matters: an e-research perspective
on information and data sharing in the chemical sciences. Chemical Society Reviews,
42(16):6754–76, Aug 2013. ISSN 0306-0012. doi: 10.1039/C3CS60050E.
Y Liu, L Chen, X Luo, L Mei, C Hu, and Z Xu. Semantic link network based model for
organizing multimedia big data. IEEE Transactions on . . . , 2014. URL
http://www.computer.org/csdl/trans/ec/preprint/06786371.pdf.
N Shadbolt, W Hall, and T Berners-Lee. The semantic web revisited. Intelligent Systems,
2006. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1637364.
Hongyan Wu and Atsuko Yamaguchi. Semantic web technologies for the big data in life
sciences. BioScience Trends, 8(4):192–201, 2014. doi: 10.5582/bst.2014.01048.