Data Warehouses for Uncertain Data
Hoda M. O. Mokhtar
Abstract: Data warehousing is one of the most powerful business intelligence (BI) tools today. A data warehouse stores historical data that is integrated from many sources and processes it in a multidimensional approach that makes it easy to use for efficient decision making. However, most data warehouse designs so far are based on the assumption that data in the data warehouse is either true or assumed true until a new snapshot occurs. Today, many real-world applications require handling uncertain data. Sensor networks, a wide range of location-based services (LBS), and many other applications deal with data that is not guaranteed to be 100% accurate. Inspired by the importance of those newly emerging applications, in this paper we propose a novel framework for data warehouses that efficiently handles both exact and uncertain data. We present the application of our model in the context of sensor networks and show that analyzing uncertain data can also be achieved.
Index Terms: Data warehouses, analyzing fuzzy data, uncertain data warehouses, sensor data.
1 INTRODUCTION
Uncertainty is an inherent property of much real-world data. Even with the current advances in sensor data acquisition and positioning systems (GPS), acquiring 100% accurate (exact) data is not feasible. In most real-world applications the acquired data is approximate, uncertain, fuzzy, or only near-accurate. Acquiring an exact reading of a sensor at every time instant, or obtaining the exact location of a moving object at every time instant, is not possible.
Handling data uncertainty thus requires special treatment compared to regular traditional data that is assumed to be always true. Handling uncertain and fuzzy data has been discussed in the database community in several research works, including [1-6]. Probabilistic databases and fuzzy databases are among the techniques proposed to deal with data uncertainty in the database environment. In addition, several approaches were presented for querying, managing, storing, and mining uncertain data [7-9]. However, elevating this to consider uncertain and fuzzy historical data in data warehouses has not been thoroughly investigated [10, 11]. In general, data warehouses were introduced to aid managers and decision makers in making the most efficient decisions. A data warehouse is simply defined as a subject-oriented, consistent, time-variant, and nonvolatile store of data that basically gains its power through storing and handling measurable historical data [12]. Today, data warehousing is widely accepted by many organizations as an effective and efficient business intelligence and decision-making tool. A key characteristic of data warehouses is the usage of multidimensional models (i.e., the star schema and the snowflake schema). These models enable the data warehouse to provide OLAP (On-Line Analytical Processing) capabilities that enrich the querying process.
However, current data warehouse models are built on the assumption that data is true until a new instance (snapshot) occurs. This assumption, although valid in some applications where obtaining exact values is possible, seems unrealistic in many newly emerging applications where data errors can occur. Sensor failure, calibration error, measurement inaccuracy, sampling discrepancy, and even outdated data are all normal sources of data inaccuracy. These factors affect the nature of the data stored in the database and consequently transferred to the data warehouse for further processing. Suppose, for example, we have a data warehouse to aid meteorologists in making decisions based on weather monitoring readings. If we determine that one of the basic sensors is likely to give incorrect readings, how do we know which reading is wrong? Do we input all the readings into the warehouse and treat them as if they were 100% accurate? What if we have more than one faulty sensor? How do we combine those erroneous readings and aggregate them? Inspired by the role of data warehouses in many applications and the effect of data uncertainty on query results and manipulation, in this paper we investigate the design of a data warehouse schema that is capable of handling fuzzy, uncertain data in an efficient way. The main contributions of the paper are:
1. Proposing a model for representing uncertainty in data warehouses.
2. Extending the traditional star schema model to capture and handle uncertain data.
- Hoda M. O. Mokhtar is with the Faculty of Computers and Information, Information Systems Dept., Cairo University, Postal Code: 12613, Egypt.
2011 Journal of Computing Press, NY, USA, ISSN 2151-9617
http://sites.google.com/site/journalofcomputing/
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 6, JUNE 2011, ISSN 2151-9617
3. Presenting the application of uncertain data warehouses in a real application (i.e., weather data acquired from sensors).
The rest of the paper is organized as follows: Section 2 presents a brief overview of previous related work. Section 3 discusses our design approach for uncertain data warehouses. Section 4 discusses the application of our design to handle sensor data (specifically weather data). Finally, Section 5 concludes and proposes directions for future work.
2 RELATED WORK
Data uncertainty is an inherent property of various real-world applications. Managing uncertainty in database systems has been the focus of much research work, especially recently with the advent of various applications that are based on measured data [13-16]. Handling uncertain data has been the focus of several research works in the database community. In general, data uncertainty is a natural consequence of either measurement errors or incomplete data.
For example, acquiring the exact location of a moving object at every time instant is not feasible. Thus, approximation and prediction techniques are used to obtain missing locations. This, in turn, affects the degree of accuracy of the data stored in the database. Handling uncertain data can be divided into two main directions: proposing efficient techniques to model the uncertain data, and incorporating those models to serve different applications [17, 1, 18].
Modeling uncertain data in the database context was discussed in several works [1, 2, 4]. Generally, uncertainty in the database environment is classified as either value uncertainty (attribute uncertainty) or tuple uncertainty (existential uncertainty). In the first case, one or more fields might contain uncertain data, while in the second case, the uncertainty concerns whether the whole record exists or not. Dealing with both cases was the focus of several works. In [19], the authors consider the use of probability density functions (PDFs) to represent the probabilistic nature of data. Although the solution is neat, a strong probabilistic background is needed to maintain such a solution. In [20], the authors considered incomplete data, and thus fuzzy sets were used as an alternative approach. Using fuzzy sets [21], the authors are able to consider a range of values rather than a single exact value as used before. Fuzzy set theory is the theory behind the usage of fuzzy values. It stipulates that not only are there values for a given object classification, but these objects also have degrees of membership to their categories (i.e., fuzzy sets). Thus, fuzzy sets are sets that include values (just like normal sets) as well as a membership value (also known as a degree of truth) that indicates how strongly a certain value belongs to the set. Hence, an element is associated with a membership value that reflects the degree of confidence of its belonging to a certain set of values. Another solution was introduced in [22]; in that paper the authors used NULL values for the missing entries. Other work focused on querying uncertain data. In the moving object context, probabilistic range queries were introduced in [15]; the authors present a model and a query answering approach that employs stochastic processes to answer queries over uncertain data. In [23] the authors consider sensor networks and present a solution for indexing uncertain data in a sensor network environment. In [24] the authors consider the problem of outlier detection in uncertain sensor data. Lately, research was directed to consider mining and aggregating uncertain data. Proposing different clustering techniques for uncertain data was the focus of some works, including [18, 10, 25].
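To make the membership idea above concrete, a degree-of-truth computation can be sketched in Python. The "warm" temperature category and its breakpoints below are purely illustrative assumptions, not taken from any of the cited works.

```python
# A minimal sketch of a trapezoidal membership function for a
# hypothetical "warm" temperature fuzzy set; the breakpoints
# (15, 20, 30, 35 degrees C) are invented for illustration.
def warm_membership(temp_c: float) -> float:
    """Degree of truth, in [0, 1], that a temperature is 'warm'."""
    if temp_c <= 15 or temp_c >= 35:
        return 0.0                      # clearly outside the set
    if 20 <= temp_c <= 30:
        return 1.0                      # fully inside the set
    if temp_c < 20:
        return (temp_c - 15) / 5        # rising edge, 15..20
    return (35 - temp_c) / 5            # falling edge, 30..35
```

A reading of 25 degrees C belongs fully to the set (membership 1.0), while 17.5 degrees C belongs only partially (membership 0.5), which is exactly the degree-of-truth interpretation described above.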
In this paper we continue to study uncertain data. However, we follow a different perspective: we consider uncertainty in the data warehouse environment, and more specifically, how to model and represent uncertainty in historical data stored in a data warehouse.
3 UNCERTAIN DATA WAREHOUSES
In this section we present our approach to designing a data warehouse (DW) that is capable of storing, managing, and analyzing both exact and uncertain data. Following the traditional data warehouse definition, we continue to have a consistent, time-variant, subject-oriented, and historical store of data, with the additional feature of handling uncertain data.
Although data in the data warehouse is historical in nature, the degree of confidence in each value stored in the data warehouse need not be the same. Consider, for example, a weather sensor that monitors temperature readings. Once the sensor acquires a reading, a snapshot is automatically generated in the DW that basically records the sensor identifier, the reading, and the reading time. If the sensor had a measurement error, the history of that sensor is thus affected, and consequently future analysis and reports can be affected as well. Motivated by this kind of commonly occurring real-world uncertainty situation, in this section we present our proposed DW design to handle those situations both efficiently and effectively.
Our solution offers a way to handle uncertain values in a data warehouse, be they probabilistic or fuzzy. The main key of our solution is based on an important conclusion presented in [26]. This conclusion states that both a probability density function (PDF) and a membership function produce values that imply the same thing. The idea behind this conclusion is the fact that a membership function produces values in the range [0, 1] which indicate how close an element is to a certain set, and consequently measure its chance of belonging to this set, whereas a PDF also returns a value in the range [0, 1] that indicates the probability of occurrence of a certain random variable in an observation space. This mapping between the two measures allows them to have the same interpretation. This in turn allows us to treat the fuzzy set to which an element belongs in the same way we treat its probability of occurrence. Hence, the closer the values are to 1, the more likely the element belongs to a fuzzy set and the higher its probability of occurrence. Using this conclusion we are able to consider uncertain data regardless of its nature (probabilistic or fuzzy). In either case we consider the possible range of values for a certain value (for example, a sensor reading) and maintain a confidence value in the range [0, 1] that expresses our belief level in this acquired value. Working in a data warehouse environment, we use a multidimensional model to model the data in the data warehouse. In this paper we only consider the denormalized star schema as it is widely used in many systems. Following the traditional star schema, our proposed schema has two main components, namely the fact table, where numeric facts (measures) are kept for future analysis and decision making, and the dimension tables, where verbose attributes are stored.
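As an illustration of these two components, a fact table carrying a per-measure confidence value alongside its dimension tables can be sketched as SQLite DDL driven from Python. All table and column names below are hypothetical stand-ins, not the concrete schema of Fig. 1.

```python
import sqlite3

# Sketch of a u-star schema: dimension tables plus a fact table whose
# extra "confidence" column stores the belief level in [0, 1] for the
# recorded measure. Names are illustrative, not the paper's Fig. 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_sensor (
    sensor_id   INTEGER PRIMARY KEY,
    sensor_type TEXT,        -- verbose descriptive attributes
    location    TEXT
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    day     INTEGER,
    month   INTEGER,
    year    INTEGER
);
CREATE TABLE fact_reading (
    sensor_id  INTEGER REFERENCES dim_sensor(sensor_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    reading    REAL,         -- the numeric measure itself
    confidence REAL          -- Confidence(M) in [0, 1]
);
""")
```

Queries can then filter or weight facts by the confidence column, which is what distinguishes this u-star sketch from a traditional star schema.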
Fig. 1. An Uncertain Star Schema (u-star schema)
Definition 2: An uncertain data warehouse (UDW) is a subject-oriented, integrated, nonvolatile, and time-variant store for both exact and uncertain data. A UDW is modeled using a u-star schema with additional facts (measures) that reflect the degree of data uncertainty.
In both cases, the closer the Confidence(M) is to 1, the stronger our belief in the truth of the measure recorded in the fact table; the closer the confidence is to 0, the weaker our belief.
Definition 4: Given a data warehouse D with a fact table containing n records, let each record contain at least one fuzzy (probabilistic) attribute A with value $t_i[A]$, $1 \le i \le n$, such that:

$$t_i[A] = \langle u_i, o_i \rangle, \quad 1 \le i \le n$$

where $u_i$ is the value for attribute A in tuple i, and $o_i$ is the confidence measure for that value.
Then the average value of the fuzzy/probabilistic measures of attribute A presented in the fact table over the n records, denoted by uaverage(A), is defined as:

$$uaverage(A) = \frac{\sum (\text{measure} \times \text{confidence})}{\text{number of records}} = \frac{\sum_{i=1}^{n} u_i \cdot o_i}{n} \quad (1)$$
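A minimal Python sketch of Equation (1), assuming each fact-table record carries a (value, confidence) pair as in Definition 4; the function name is ours.

```python
# Confidence-weighted average of a fuzzy/probabilistic measure over
# the n fact-table records (Equation (1)): each record is a pair
# (u_i, o_i) of a value and its confidence in [0, 1].
def uaverage(records):
    """records: list of (value, confidence) pairs; returns the uaverage."""
    n = len(records)
    return sum(u * o for u, o in records) / n
```

For instance, two readings (10, 1.0) and (20, 0.5) yield (10 + 10) / 2 = 10.0, so low-confidence readings contribute proportionally less to the aggregate.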
The above definition, Definition 5, defines the error range for a special case of the Gaussian distribution (i.e., mean = 0). This definition is generalized in probability theory, yielding the following general formula for the confidence range.
Definition 6: Given a Gaussian random variable x that represents a measure (fact) in the fact table of an uncertain data warehouse, such that $x \sim N(\mu, \sigma)$, the error function erf(x) is defined as [27]:

$$erf(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt$$

Then the probability of a measurement falling in a range of $n\sigma$ around the mean $\mu$, for any n, is given by:

$$P(\mu - n\sigma < x < \mu + n\sigma) = erf\left(\frac{n}{\sqrt{2}}\right)$$
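Definition 6 maps directly onto the error function in Python's standard library; the helper name below is ours, not the paper's.

```python
import math

# Probability that a Gaussian measurement falls within n standard
# deviations of its mean: P(mu - n*sigma < x < mu + n*sigma)
# = erf(n / sqrt(2)), per Definition 6.
def prob_within_n_sigma(n: float) -> float:
    return math.erf(n / math.sqrt(2))
```

This reproduces the familiar 68-95-99.7 rule: about 0.6827 for n = 1 and 0.9545 for n = 2, which is the confidence attached to each range around the mean.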
Definition 7: A linear combination of independent Gaussian random variables is itself Gaussian:

$$Y = a_1 X_1 + a_2 X_2 + \ldots + a_n X_n + b \sim N(\mu, \sigma^2),$$

where

$$\mu = a_1 \mu_1 + a_2 \mu_2 + \ldots + a_n \mu_n + b,$$
$$\sigma^2 = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \ldots + a_n^2 \sigma_n^2.$$
Definition 8: If $X_i \sim N(\mu_i, \sigma_i^2)$ are independent random variables, then the average of those independent random variables, $\bar{X}$, is distributed as:

$$\bar{X} \sim N\left(\frac{\sum_i \mu_i}{n}, \frac{\sum_i \sigma_i^2}{n^2}\right).$$
Thus, the average aggregate for probabilistic measures can also be computed, yielding a normal distribution with $\mu$ and $\sigma$ as defined in Definition 7.
Finally, once the mean and standard deviation are calculated, the confidence interval can be computed as shown above, and the u-star schema can be modified as shown in Fig. 2 and used for further analysis.
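This aggregation step can be sketched in Python: the average of independent Gaussian measures is itself Gaussian (per Definition 8, a special case of Definition 7 with $a_i = 1/n$ and $b = 0$), and a symmetric confidence interval follows from its mean and standard deviation. The function names are illustrative.

```python
import math

# Distribution of the average of independent Gaussian measures
# X_i ~ N(mu_i, sigma_i^2): mean = (sum mu_i)/n, var = (sum sigma_i^2)/n^2.
def average_distribution(params):
    """params: list of (mu_i, sigma_i) pairs; returns (mean, std) of the average."""
    n = len(params)
    mean = sum(mu for mu, _ in params) / n
    var = sum(s * s for _, s in params) / (n * n)
    return mean, math.sqrt(var)

# Symmetric confidence interval (CI Min, CI Max) for a range of
# n_sigma standard deviations around the mean.
def confidence_interval(mean, std, n_sigma=1.0):
    return mean - n_sigma * std, mean + n_sigma * std
```

For two measures N(10, 4) and N(20, 4), the average is distributed as N(15, 2), and the interval bounds are what the fact table would store as measure CI Min and measure CI Max.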
4 UNCERTAIN DATA WAREHOUSES FOR WEATHER DATA
In this section we explore the usage of our proposed uncertain data warehouse design in building a UDW for synthetic weather data acquired from sensor readings. We assume we have sensors that measure temperature, pressure, rainfall, and humidity, and we treat those measurements as independent measures. We also assume that the sensors follow a normal distribution in their readings.
Employing our proposed design schema, we design the u-star schema shown in Fig. 3. In this schema we have a central fact table and 5 dimension tables (a date dimension and 4 weather sensor dimensions). We use a transaction fact table, so each row in the fact table represents the readings of the different sensors on a specific date at a specific hour. We assume that readings are observed every hour, so the lowest time granularity we have is the hour. This assumption can easily be relaxed in real applications, in which case it is preferable to split the date-time dimension into 2 separate dimensions (date and time) to avoid the explosion in the dimension table size. In the fact table we record the average values for each of the sensor readings using the uaverage measure defined earlier. In addition, we record the confidence range of each measure (CI) given the measure's attributes $(\mu, \sigma)$. This confidence interval is expressed in the fact table with 2 values, measure CI Min and measure CI Max, to express the minimum and maximum bounds of the interval. However, the fact table can keep more than 1 interval (i.e., more ranges around the mean); this will in turn affect the possible number of queries that we can issue, as each range translates to a different confidence in the occurrence of a certain reading.
Fig. 3. A u-star Schema for Weather Data
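A hypothetical fact-table row for this weather u-star schema might be assembled as sketched below, storing per measure its uaverage plus CI Min and CI Max bounds. The sensor values, confidences, and sigmas are invented for illustration, as is the helper name.

```python
# Sketch of building one transaction fact-table row: per measure we
# store the confidence-weighted average (uaverage) plus a symmetric
# CI Min / CI Max confidence range of n_sigma standard deviations.
def fact_row(date_id, hour, readings, n_sigma=1.0):
    """readings: {measure_name: (list of (value, confidence), sigma)}."""
    row = {"date_id": date_id, "hour": hour}
    for name, (pairs, sigma) in readings.items():
        avg = sum(u * o for u, o in pairs) / len(pairs)
        row[f"{name}_uaverage"] = avg
        row[f"{name}_ci_min"] = avg - n_sigma * sigma
        row[f"{name}_ci_max"] = avg + n_sigma * sigma
    return row

# Invented hourly readings for two of the four weather measures.
row = fact_row(20110601, 14, {
    "temperature": ([(24.8, 0.9), (25.2, 0.8)], 0.5),
    "humidity":    ([(61.0, 1.0), (59.0, 0.7)], 2.0),
})
```

Keeping additional intervals (e.g. a 2-sigma range) would simply add further CI Min / CI Max column pairs, matching the remark above about storing more than one range around the mean.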
5 CONCLUSIONS AND FUTURE WORK
In this paper we present a pioneering step towards designing uncertain data warehouses. Our approach is capable of handling both fuzzy and probabilistic measures. For the fuzzy values we use the membership function, and for the probabilistic values we use the confidence interval for Gaussian random processes. We propose an uncertain star schema framework that is able to capture the uncertainty of the measures stored in the fact table. For future work we target considering a wider range of distributions and investigating the effect of our proposed model on the mining process.
REFERENCES
[1] L. Antova, T. Jansen, C. Koch, and D. Olteanu, "Fast and simple relational processing of uncertain data," in IEEE 24th International Conference on Data Engineering (ICDE'08), Washington, DC, USA: IEEE Computer Society, 2008.
[2] N. Dalvi and D. Suciu, "Efficient query evaluation on probabilistic databases," The VLDB Journal, vol. 16, no. 4, pp. 523-544, 2007.
Hoda M. O. Mokhtar is currently an assistant professor in the Information Systems Dept., Faculty of Computers and Information, Cairo University. Dr. Hoda Mokhtar received her PhD in Computer Science in 2005 from the University of California, Santa Barbara. She received her MSc and BSc in 2000 and 1997, respectively, from the Computer Engineering Dept., Faculty of Engineering, Cairo University. Her research interests are database systems, moving object databases, data warehousing, and data mining.