Big Data Management

Assessed Coursework Two


Big Data vs Semantic Web
F21BD

Boris Mocialov (H00180016)


MSc Software Engineering

Heriot-Watt University, Edinburgh

April 5, 2015

1 Introduction
The purpose of this essay is to give an overview of the objectives and technologies of
two scientific areas, namely Big Data Management and Semantic Web Technologies. The
overview of each area is followed by a short example that places the area in context.

After these introductions, the essay argues that the two areas are related by presenting
recent applications that combine techniques from both fields, and that this relationship
is exploited in the same way each time the two are combined in practice.

An additional section at the end gives a more detailed overview of one particular
application of both fields.

2 Introduction to Big Data Management


Big Data is a term that describes possibly inconsistent and uncertain data that exists in
large volumes, takes different forms, and is produced at high speed. Tools that operate
upon and manage Big Data must therefore capture, process, and analyse the data in a way
that copes with these difficulties. Big Data Management comprises such tools and
techniques.

Data Science Series (2012) gives an extended list of possible benefits, for both
businesses and customers, of turning to Big Data resources. The list suggests that Big
Data can be advantageous to a company regardless of the sector or niche it occupies, as
new opportunities for data utilisation can be discovered and exploited.

2.1 Objectives
As noted above, Big Data Management is meant to use appropriate tools and techniques to
make it possible to capture, process, and analyse data that is fast, large, uncertain,
and heterogeneous.

Chen and Zhang (2014) present an exhaustive list of challenges that Big Data poses for
computing. The list includes storage problems, I/O speed, network throughput, data
curation, and processing power, each serving as an umbrella for more detailed challenges.
Addressing the listed challenges is, in effect, the set of objectives that Big Data
Management must meet to reach its overall aim: to store, process, and analyse large
amounts of varied, uncertain data.

2.2 Technologies
Given these objectives and challenges, Chen and Zhang (2014) discuss possible approaches
to handling Big Data better. For instance, inconsistent, incomplete, or noisy data can be
improved through cleaning, integration, and transformation. The challenge is then to
perform all these tasks live, as the data becomes available. One solution for fast
processing is to handle the data in parallel.
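
The idea of cleaning data live and in parallel can be sketched as follows. This is a
minimal illustration rather than a technique taken from Chen and Zhang (2014): the record
fields `sensor` and `value`, the function names, and the batch size are all assumptions
made for the example, and a thread pool merely stands in for a real parallel platform.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, Optional


def clean_record(raw: dict) -> Optional[dict]:
    """Clean one record; return None if it is incomplete or too noisy to repair."""
    # Transformation: normalise the (hypothetical) sensor name.
    name = str(raw.get("sensor", "")).strip().lower()
    if not name:
        return None  # incomplete record: no sensor name
    try:
        value = float(raw.get("value"))
    except (TypeError, ValueError):
        return None  # noisy record: value is missing or not a number
    return {"sensor": name, "value": value}


def clean_stream(records: Iterable[dict],
                 batch_size: int = 100,
                 workers: int = 4) -> Iterator[dict]:
    """Clean records batch by batch as they arrive, using a pool of workers."""
    iterator = iter(records)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            # Take only a small batch so that results are emitted
            # before the whole stream has been read.
            batch = list(islice(iterator, batch_size))
            if not batch:
                return
            for cleaned in pool.map(clean_record, batch):
                if cleaned is not None:
                    yield cleaned


raw = [
    {"sensor": " Temp-1 ", "value": "21.5"},
    {"sensor": "", "value": "3"},
    {"sensor": "hum-2", "value": "n/a"},
    {"sensor": "HUM-2", "value": 40},
]
result = list(clean_stream(raw, batch_size=2))
```

Records without a sensor name or with a non-numeric value are dropped, and the remaining
ones are returned in arrival order with normalised names and numeric values.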

A current solution for managing Big Data, which may come from distributed sources, is the
NoSQL database. NoSQL is more of a philosophy than a single technique or tool: it
describes a set of approaches by which Big Data Management can be accomplished. For
instance, some NoSQL databases use relations and others do not, some do not use SQL as
their query language, and some employ schema-free, schema-less, or flexible-schema
policies. Different approaches to storing data are also used: some systems are key-value
stores, some are key-document stores, and others use column families or graphs. What
NoSQL databases largely have in common is support for a dynamic schema as an underlying
feature, which is an advantage when dealing with heterogeneous data. Another common
factor is the separation between the storage and the management of the data. While
storage happens in one of the fashions mentioned above, management is implemented in the
application layer. When dirty data is extracted from the system, it is handed to the
application layer, which decides what should be processed further and what is not needed
for that particular extraction.
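
The separation between schema-free storage and application-layer management can be
illustrated with a short sketch. The `DocumentStore` class, the `extract` function, and
the `has_numeric_age` rule are hypothetical names invented for this example; they do not
correspond to the API of any particular NoSQL product.

```python
from typing import Any, Callable, Dict, Iterator, Tuple

Document = Dict[str, Any]


class DocumentStore:
    """Toy key-document store (hypothetical): no schema is enforced on write."""

    def __init__(self) -> None:
        self._documents: Dict[str, Document] = {}

    def put(self, key: str, document: Document) -> None:
        # Storage layer: accept any shape of document under any key.
        self._documents[key] = dict(document)

    def scan(self) -> Iterator[Tuple[str, Document]]:
        # Storage layer: hand back everything, clean or dirty.
        yield from self._documents.items()


def extract(store: DocumentStore,
            is_usable: Callable[[Document], bool]) -> Dict[str, Document]:
    """Application layer: decide which documents this extraction needs."""
    return {key: doc for key, doc in store.scan() if is_usable(doc)}


def has_numeric_age(document: Document) -> bool:
    """Example application rule: keep documents with a usable 'age' field."""
    age = document.get("age")
    return isinstance(age, (int, float)) and not isinstance(age, bool)


store = DocumentStore()
store.put("p1", {"name": "Ada", "age": 36})
store.put("p2", {"name": "Bob", "age": "unknown"})               # dirty value
store.put("p3", {"title": "Sensor log", "readings": [1, 2, 3]})  # different shape
usable = extract(store, has_numeric_age)
```

The store accepts all three documents without complaint; only the application-level rule
decides that the first one is usable for this particular extraction.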

Some of the state-of-the-art approaches to Big Data Management that Chen and Zhang
(2014) discuss include statistical analysis of the data at hand, data mining, and the use
of neural networks together with machine learning algorithms to discover patterns in
heterogeneous data and to cluster the discovered items into classes.

2.3 Big Data Example
As the description above suggests, Big Data can provide additional revenue to a company
that deals with data. Apart from the monetary interest, Big Data can provide new
knowledge to science, as there is potential value hidden in any data.

A simple but powerful example is the notion of ‘smart cities’. Data Science Series (2012)
also gives this as an example of Big Data, but smart cities can additionally be viewed as
an encapsulation of services, such as health, public, and transport services. In the case
of health services, patients can wear a personalised doctor on their wrist: a device that
records data at every moment of the patient’s life and sends it to an actual doctor, or
even to an AI that gives the person direct hints on how to improve his or her life.
Public services, for example, can monitor traffic developments, gatherings of people,
forums, etc. and act upon this data for the good of the citizens. As for transport
services, public transport operators can cooperate and provide services only where they
are needed.

3 Introduction to Semantic Web Technologies


The Semantic Web is the idea of adding meaning to the things found on the World Wide
Web. The purpose of the added meaning is to allow machines to reason about these things.

3.1 Objectives
Shadbolt et al. (2006) write that e-science, the source of the need for the technology,
is a major driver for the Semantic Web because it requires integration of heterogeneous
data sets that come from different scientific communities. Such integration can be
achieved through ontologies: formal specifications of the names, definitions, properties,
and relations of the entities within a particular domain.

The rationale behind integrating data from a wide range of fields is the movement
towards interdisciplinary science: the fusion of different disciplines in pursuit of new
shared knowledge.

Therefore, certain standards should be enforced to allow distributed and heterogeneous
data to be merged into meaningful, unambiguous knowledge in any domain.

3.2 Technologies
The key technologies (or rather techniques) of the Semantic Web start with URIs, which
identify resources. Given the URI of a resource, anyone can refer to it. URIs are the
building blocks of RDF, in which every statement is a subject-predicate-object triple
that relates a subject to an object, and the parts of a triple are identified by URIs.
When building an application, an RDF vocabulary can be used to specify the set of
predicates used within that application. A vocabulary serves as an abstraction over
individual RDF statements and provides a single entry point through which vocabularies
can be linked. RDF Schema (RDFS) is a further abstraction on top of RDF that provides
means to describe groups of related resources. Both vocabularies and RDFS are layered on
top of plain RDF rather than being required by it. Triple stores hold RDF data and
provide facilities for managing larger bodies of RDF content. To provide standardised
access to triple stores, the SPARQL language was developed to query the underlying RDF.
OWL languages provide means for adding extra information on top of RDFS to make the
knowledge more expressive. In addition, OWL languages support ontology consistency
checking (Shadbolt et al., 2006).
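
The triple model and the pattern-matching idea behind SPARQL queries can be sketched in a
few lines. This is a toy in-memory store written for illustration, not a real RDF library
or SPARQL engine; the class `TripleStore` and the `http://example.org/` namespace with
its resource names are assumptions made for the example.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]
EX = "http://example.org/"  # hypothetical namespace, used only for illustration


class TripleStore:
    """Toy in-memory triple store (not a real RDF library)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # Each statement is one subject-predicate-object triple of URIs.
        self._triples.add((subject, predicate, obj))

    def match(self,
              subject: Optional[str] = None,
              predicate: Optional[str] = None,
              obj: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        for triple in sorted(self._triples):
            if all(want is None or want == have
                   for want, have in zip(pattern, triple)):
                yield triple


store = TripleStore()
store.add(EX + "alice", EX + "worksAt", EX + "HeriotWatt")
store.add(EX + "bob", EX + "worksAt", EX + "HeriotWatt")
store.add(EX + "alice", EX + "knows", EX + "bob")

# Roughly what a SPARQL query such as
#   SELECT ?s WHERE { ?s ex:worksAt ex:HeriotWatt }
# asks for: every subject that has this predicate and object.
employees = [s for s, _, _ in
             store.match(predicate=EX + "worksAt", obj=EX + "HeriotWatt")]
```

A production triple store adds indexing, persistence, and a full query language on top of
this basic matching step.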

Turning to tools, it is worth mentioning Protégé, an ontology editor and validator.

3.3 Semantic Web Application Example


A commonly cited example of a Semantic Web application is, perhaps, e-science. Because
ontologies can be distributed and combined using technologies such as the OWL languages,
e-science groups can work in a distributed fashion, synchronising their findings and
building common knowledge while maintaining a common ontology that defines the domain
and range of the research both parties are engaged in. As long as the common ontology is
defined and obeyed during synchronisation, both parties can change their underlying
models, terms, and definitions as they wish (as local requirements or laws may enforce
such differences).

4 Relationship between Big Data and Semantic Web


The areas identified by Data Science Series (2012) were considered in order to identify
the relationship between the two fields.

4.1 Semantic Link Network for Big Data in Multimedia
The paper by Liu et al. (2014) takes a particular approach to organising multimedia
resources using their texts and surrounding texts. The aim of the project is to give
meaning to different multimedia resources, to allow users to search for related
resources, and to let them gain a more comprehensive understanding of a particular
resource from its relationships.

The authors’ main assumption is that manual annotations can be considered a reliable
source of semantics. They also mention that ontologies can describe multimedia semantics.
The aim of the paper thus becomes to bridge the gap between ontologies and manually given
annotations. The motivation is to provide reasoning that can derive implicit knowledge
from information. Deriving implicit knowledge is useful in many areas, such as
surveillance, sports, or the Internet of Things.

The Semantic Link Network method is employed to establish relationships between
resources.

Since every statement in the Semantic Web is a triple, as pointed out earlier, the
mapping can be accomplished without any considerable modifications.

When presenting the results, the authors applied certain heuristics to refine the
underlying assumptions of the model even further. As a result, using ontologies and tags
along with textual descriptions, semantic relatedness between multimedia items was
established accurately and robustly.

4.2 Personalised Medicine with Big Data and Semantic Web Technologies
Panahiazar et al. (2014) consider a patient who requires a personalised health care
plan. To meet this requirement, a health care system has to implement a new
infrastructure that allows live delivery of patient data directly into the hands of a
professional. Conversely, such an infrastructure would allow health care systems to make
better decisions about individual patients based on the data from all patients. The paper
discusses an approach towards personalised health care using Big Data and Semantic Web
technologies.

The notion of smart data is introduced into the context of health care as a fusion of
Big Data and the Semantic Web. The Big Data part of smart data deals with accessing and
processing large volumes of homogeneous and heterogeneous data about every single
patient. Since the data is unstructured most of the time, Semantic Web technologies come
into play and are used to annotate the various concepts.

4.3 Information and Data Sharing in Chemical Sciences


Bird and Frey (2013) provide a detailed rationale for the importance of data and
knowledge sharing in the chemical sciences. e-Research is a direct consequence of the
expansion of the data available to researchers. As more effort is required to process the
available data, the need for distributed collaboration grows, so that collaborating
bodies can tackle the problem of Big Data in the sciences. In addition to sharing the
workload, scientists depend upon each other’s work more than ever. Single-entry database
solutions cannot feasibly accommodate all the research centres and universities;
therefore, a distributed approach must be taken. Although a distributed approach is
feasible as a common foundation for all the research happening in one field, additional
infrastructure should be in place to allow discovery, browsing, documentation, etc. This
would in turn make it possible to track the provenance of the data, so that the initial
baseline can be frozen and no longer changed after it has been shared.

For such a system it would be important to use a controlled vocabulary, so that all the
parties belonging to the system describe the same aspects of the research in the same
terms.

4.4 Linking Smart Cities Data


Yet another example comes from Celino et al. (2012), who report on the implementation of
an application that engages users in providing information about a city in order to fix
inconsistencies in the automatic inferences made by reasoning software with respect to a
specific ontology. Similar applications have shown that users are willing to provide
information if the application follows the Games With A Purpose (GWAP) paradigm. In other
words, the crowd can foster the connection between Big Data and Semantic Web
Technologies, given appropriate infrastructure. The authors also note that similar work
has been done covering the whole Semantic Web life-cycle rather than only the fine-tuning
part.

5 Conclusion
Although Big Data and Semantic Web Technologies can be seen as two different areas of
research, both are applied to certain real-life problems, as described in Section 4.

The applications converge on a similar aim, namely to process and give meaning to the Big
Data generated by means of embedded technology.

In addition, the main focus of the applications of both technologies is knowledge, be it
for profit or for the discovery of further knowledge.

Therefore, it is worth saying that both areas should progress further by giving meaning
to the unstructured, fast, and uncertain data around us.

6 Semantic Web technologies for the big data in life sciences
Wu and Yamaguchi (2014) present a survey of Big Data in the life sciences with a
semantic extension.

The paper’s aim is to enable investigation of the effects of chemicals on biological
systems. Additional data sets are required to accomplish that.

The problem emerges when data sources contain different or previously unseen data types
and different formats of underlying data. To be usable, such data sources must be
integrated, thus eliminating inconsistencies. Data integration requires considerable
knowledge about the data, to determine what can be integrated and what cannot or should
not be. The authors point out that the main problems in this context, as also noted in
Section 2, are the volume of the generated data and the rate at which it is generated.
The paper thus discusses how Semantic Web Technologies can solve the general problems of
Big Data Management outlined in Section 2.

The paper then describes the Semantic Web technologies that were listed in Section 3.2,
along with examples that illustrate each technology. In addition to the previously
described technologies, the paper presents some further ones, for instance linked data,
triple stores, and triple stores in the cloud.

Linked data tries to incorporate all data from the World Wide Web into a single database
and to make all the data semantically related in some way. Linked data uses the same
basic technologies that were described previously for the Semantic Web. The basic idea is
to treat the world as a web of connected things. A simple example would be giving
relevant, related recommendations to users who are viewing a certain part of the web or
searching for some particular information.

A triple store is simply a database for triples. To be highly operational, a triple store
must allow fast query execution, be scalable, and have a low load cost.

A triple store in the cloud is yet another paradigm, which allows users to connect to a
cloud and, from there, use the data or applications available on that cloud. Cloud
computing can provide such services as Data as a Service, which gives access to current
data; Software as a Service, which allows users to use a software instance from the
cloud; Platform as a Service, which allows users to exploit an area of the cloud
dedicated to them to execute and test their software; and Infrastructure as a Service,
which allows users to utilise the execution power of the servers that host the cloud. The
authors then present some examples of technologies that offer triple stores in the cloud
for scientific purposes.

One of the major challenges for the Semantic Web is that it was not designed to serve
Big Data requirements. To address this, additional concepts were introduced into the
Semantic Web, such as RDF, and RDF Schema and/or OWL on top of RDF.

An issue that still persists in the field of the Semantic Web is that its techniques
cannot deal well with fast and large data. Therefore, external solutions are sometimes
employed to accommodate the unpredictability of Big Data.

The authors conclude that effective data processing platforms are needed to process and
share data, especially in research, as pointed out in Section 4.3. In addition, shared
data must remain secure at all times. To increase the performance of Big Data processing,
parallel computing seems to provide some solutions, with tools such as Hadoop.
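
The map-and-reduce style of parallel processing associated with tools such as Hadoop can
be sketched on a single machine. The code below does not use Hadoop or its API; it is an
assumed, simplified illustration in which a thread pool stands in for cluster nodes, and
the function names and sample triples are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def map_partition(partition: Sequence[Triple]) -> Counter:
    """Map step: count how often each predicate occurs in one partition."""
    return Counter(predicate for _, predicate, _ in partition)


def count_predicates(partitions: List[Sequence[Triple]],
                     workers: int = 2) -> Counter:
    """Run the map step on every partition concurrently, then reduce by summing."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_partition, partitions))
    # Reduce step: merge the per-partition counts into one total.
    return reduce(lambda left, right: left + right, partial_counts, Counter())


# Hypothetical data, already split into two partitions.
partitions = [
    [("a", "knows", "b"), ("a", "type", "Person")],
    [("b", "knows", "c")],
]
totals = count_predicates(partitions)
```

Because each partition is counted independently, the map step can be spread over as many
workers as there are partitions, which is the property that makes this style attractive
for large data sets.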

In conclusion, the paper extends the earlier discussion of the relationship between the
two fields and is further evidence that the two fields do cooperate when dealing with
real-life problems of Big Data processing and analysis.

References
Colin L Bird and Jeremy G Frey. Chemical information matters: an e-research perspective
on information and data sharing in the chemical sciences. Chemical Society Reviews,
42(16):6754–76, Aug 2013. ISSN 0306-0012. doi: 10.1039/C3CS60050E.

I Celino, S Contessa, M Corubolo, and D Dell’Aglio. Urbanmatch-linking and improving
smart cities data. LDOW, 2012. URL
http://planet-data.eu/sites/default/files/publications/ldow2012-paper-10.pdf.

CLP Chen and CY Zhang. Data-intensive applications, challenges, techniques and
technologies: A survey on big data. Information Sciences, 2014.
doi: 10.1016/j.ins.2014.01.015. URL
http://www.sciencedirect.com/science/article/pii/S0020025514000346.

Data Science Series. Ten practical big data benefits. 2012. URL
http://datascienceseries.com/stories/ten-practical-big-data-benefits.

Y Liu, L Chen, X Luo, L Mei, C Hu, and Z Xu. Semantic link network based model for
organizing multimedia big data. IEEE Transactions on . . . , 2014. URL
http://www.computer.org/csdl/trans/ec/preprint/06786371.pdf.

Maryam Panahiazar, Vahid Taslimitehrani, Ashutosh Jadhav, and Jyotishman Pathak.
Empowering personalized medicine with big data and semantic web technology:
Promises, challenges, and use cases. Proceedings: ... IEEE International Conference
on Big Data. IEEE International Conference on Big Data, 2014:790–795, Oct 2014.
doi: 10.1109/BigData.2014.7004307.

N Shadbolt, W Hall, and T Berners-Lee. The semantic web revisited. Intelligent Systems,
2006. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1637364.

Hongyan Wu and Atsuko Yamaguchi. Semantic web technologies for the big data in life
sciences. BioScience Trends, 8(4):192–201, 2014. doi: 10.5582/bst.2014.01048.
