
2019 International Conference on Electrical Engineering and Informatics (ICEEI)

July 2019, 9 - 10, Bandung, Indonesia

The Development of Data Publishing Tool for Indonesian Open Government Data

Wikan Danar Sunindyo
School of Electrical Engineering and Informatics
Institut Teknologi Bandung
Bandung, Indonesia
wikan@informatika.org

I Dewa Putu Deny Krisna Amrita
School of Electrical Engineering and Informatics
Institut Teknologi Bandung
Bandung, Indonesia
13514096@std.stei.itb.ac.id

Abstract—Open Government Data (OGD) is data produced or commissioned by the government, which can be published publicly. These data can be accessed freely by anyone, in order to increase public participation and to enable government agencies to report their performance transparently. Indonesia is one of the many countries that have been applying the open government data concept, by establishing Open Government Indonesia (OGI). With the establishment of OGI, many Indonesian government agencies have developed open government data. However, much of it has low data quality. One standard that can assess open government data quality is Five Star Open Data. This standard uses a five-step concept, where a dataset must possess particular qualities to achieve each step. This paper proposes a solution to enhance the data quality of Indonesian government data by developing a data publishing tool. The data publishing tool accepts data with 2-star and 3-star quality and enhances the quality of the input data to 5-star quality. The tool also publishes the data and generates several types of data visualization according to the data. The tool uses data from Hasan Sadikin General Public Hospital (RSHS) as test data. Based on the evaluation conducted, the tool can enhance the data quality of five datasets from 2-star to 5-star quality. In addition, the tool publishes the datasets and generates data visualizations based on the datasets' contents.

Keywords—open government data; government data; Five Star Open Data; data visualization; hospital data

I. INTRODUCTION

Open Government Data (OGD) is data produced or commissioned by the government or government-controlled entities that can be used, reused, and redistributed freely by anyone [2]. Government data that can be used as OGD is limited to data which contains no personal content and does not put national security and stability in danger. In OGD development, there are many standards to assess the quality of the data published. One of the standards is Five Star Open Data, a standard which assesses OGD in five levels. In order to achieve each level, an OGD must possess the qualities stated by the standard.

OGD usage and publication in Indonesia is based on the Open Government Indonesia (OGI) movement. This movement's main goal is to make the Indonesian government transparent, participative, and innovative^1. One of the concrete moves by OGI is an OGD portal named data.go.id. With the movements made by OGI, hopefully everyone can access government data without facing inconvenient administrative procedures.

There are several challenges in terms of OGD development in Indonesia. According to research conducted by Aryan et al. in 2014, most Indonesian open government data published have low data quality [3]. Most of the data provide 1-star data quality, and the rest provide up to 3-star data quality. Moreover, according to a survey conducted by the World Wide Web Foundation in 2017^2, Indonesian open government data achieved only 37 out of 100 in terms of overall data quality.

In this paper, the authors propose a solution to overcome these challenges. The solution is to develop a data publishing tool with two core processes. The first process enhances open government data quality from 2-star to 5-star data quality according to the Five Star Open Data standard. The second process publishes the data and generates data visualizations in the process. The solution has been evaluated with datasets from Hasan Sadikin General Public Hospital. Based on the evaluation, the tool has successfully enhanced the data quality of each dataset from 2-star to 5-star quality. The tool also published each dataset successfully.

II. RELATED WORKS

A. Five Star Open Data

Five Star Open Data is one of the standards that can be used to assess the quality of open data. The standard was proposed by Tim Berners-Lee and consists of five levels to achieve. In order to reach each level, open data must provide a particular quality: it must be available on the web under an open license (level 1), must have a structured format (level 2), must be in a non-proprietary format (level 3), must have URIs to identify its entities (level 4), and must be linked to other sources of information, e.g. other datasets, articles, documents, etc. (level 5)^3.

B. Existing Data Publishing Frameworks in Indonesia

Aryan et al. proposed a framework to transform existing government data into data with a LOD-ready format, i.e. RDF. The processes included in the framework can be found in Figure 1.

^1 https://infokomputer.grid.id/2015/01/fitur/perkembangan-open-data-di-indonesia/
^2 https://opendatabarometer.org/4thedition/
^3 http://5stardata.info/en/

978-1-7281-2418-6/19/$31.00 ©2019 IEEE 30


Authorized licensed use limited to: University of Wollongong. Downloaded on May 30,2020 at 08:29:14 UTC from IEEE Xplore. Restrictions apply.

Figure 1 Framework consisting of data transformation to a LOD-ready format (Aryan et al.)

The process of the transformation begins with inserting existing Indonesian open government data, ranging from 1-star to 3-star quality, into the system. The data is then transformed from its initial format into RDF, achieving 4-star data quality. The RDF data is then cleaned in the data cleaning process, to ensure that it contains no typos or unnecessary content. The data is also linked with other resources in this step, achieving 5-star data quality. These two processes are recorded in the provenance step, so that the processes conducted can be tracked later. The provenance step also stores all of the processed data in the RDF store.

Nusapati and Sunindyo proposed another framework, developed from the framework previously proposed by Aryan et al. [1]. Their research used open government data published by Badan Pusat Statistik (BPS) Indonesia as a case study, and extended the framework into more detailed processes. The core processes proposed in the framework can be found in Figure 2.

Figure 2 Framework consisting of data transformation to a LOD-ready format (Nusapati and Sunindyo)

The process of this transformation begins with inserting data of 2-star or 3-star quality, i.e. CSV and Excel files. Then the data quality enhancement process begins, consisting of six processes: conversion to JSON array, data cleaning, adding metadata, conversion to JSON object, conversion to RDF, and finally data linking. These processes enhance the data quality up to 5-star quality. The final process of the framework is the data store, which saves all of the data processed in the previous steps. The processed data is then embedded with web APIs to give the public access to the data.

III. SOLUTION DESIGN

A. Core Processes

The tool was developed based on the frameworks proposed by Aryan et al. and by Nusapati and Sunindyo. The tool has three core processes, which can be seen in Figure 3.

Figure 3 Tool Core Processes

The first core process enhances the input data quality from 2-star to 5-star. The data is then published to the public through the web APIs provided by the system and through a client webpage containing the published data. Finally, data visualizations can be generated from the published data.

B. Data Quality Enhancement

Data quality enhancement is the first process conducted by the system. The main goal of this process is to enhance the input data to 5-star data quality. Data quality enhancement involves eight processes, which can be seen in Figure 4.

Figure 4 Data Quality Enhancement Processes

The first process is uploading the input data into the system. This process is conducted automatically by the system. Since the input data is 2-star data, i.e. data in Excel format, the system uses the "pandas" library to convert the Excel data into JSON format. The original data is then saved automatically in the file system.

The second process is conversion of the input data to JSON object format. This object-oriented JSON format consists of the input data content and its metadata. The JSON object serves as temporary data storage, whose information is used in the data cleaning process.

The third process is data cleaning. In this process, the JSON object data is converted into a pandas DataFrame to make the cleaning process easier. The DataFrame is then cleaned through the following steps: typo removal, missing data removal, and content editing. This process is done manually by the user, i.e. the administrator of the government agency. After the data cleaning process, the data is saved in JSON object format once more.

The fourth process is defining column data types and contexts. The main purpose of this process is to define the data type of each column in the structured data, as well as the context of each column, i.e. its RDF classes and properties. Besides the data types and contexts, a custom column title and description are also given to each column. This process is done manually.

The fifth process is conversion to JSON-LD. This process converts the data with defined contexts into a JSON format with the RDF contexts embedded in the data. The main purpose of this process is to make the RDF-ization process easy, thanks to the contexts defined. These contexts serve as URIs which are used as "links" from other data to the mentioned data. This process is done automatically by the system.

The sixth process is RDF-ization. RDF-ization is the process of translating the JSON-LD generated beforehand into data in RDF format, which contains several URIs as the links mentioned before. This process is done automatically.
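The upload, JSON-object, and cleaning steps described above can be sketched with pandas as follows. This is a minimal illustration, not the tool's actual code: the column names and values are invented, and an in-memory DataFrame stands in for the uploaded Excel file.

```python
# Sketch of the first three enhancement steps (upload, JSON object, cleaning).
import json

import pandas as pd

# Process 1: the tool would read the 2-star Excel input, e.g.
#   df = pd.read_excel("patient_visits.xlsx")
# Here a small in-memory frame stands in for the uploaded file.
df = pd.DataFrame({
    "department": ["Emergency ", "Pediatric", None],
    "visits": [1250, 980, 300],
})
records = json.loads(df.to_json(orient="records"))

# Process 2: wrap content and metadata into one JSON object, which serves
# as the temporary storage used by the later steps.
data_object = {
    "metadata": {"title": "Patients' Visits", "columns": list(df.columns)},
    "content": records,
}

# Process 3: convert back to a DataFrame and clean it; here the cleaning is
# dropping rows with missing values and stripping stray whitespace.
clean = pd.DataFrame(data_object["content"]).dropna()
clean = clean.apply(lambda c: c.str.strip() if c.dtype == object else c)
data_object["content"] = json.loads(clean.to_json(orient="records"))
print(data_object["content"])
```

In the tool itself the cleaning decisions are made manually by the administrator; the sketch only shows the round trip between the JSON object and the DataFrame.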
The seventh process is data linking. The main goal of this process is to link the data with other resources in order to create Linked Open Data. The resources accepted as external resources in this paper are external data, external attachments, and external links. This process is done manually, by defining the title, description, and content of the resources.

The final process is the data store. The data that has been processed beforehand is saved as files in Excel, CSV, JSON, JSON-LD, and RDF formats respectively. These files are stored in the file system and are later embedded with web APIs to enable access to each file. The process of storing each file is done automatically by the system.

C. Data Publishing

After enhancing the data quality with the processes mentioned before, the data publishing process starts. The data publishing process consists of two subprocesses, namely web API evocation and publication status shift. Web API evocation is a process in which web APIs are generated by the system. The web APIs serve the data in the various formats stored in the data store. The web APIs are represented in the form of a RESTful API, the renowned API form for the web. The web APIs can then be accessed read-only by the public as downloadable files. The list of web APIs can be seen in Table I below.

TABLE I. WEB API

Name       Route                              Request  Response
Data List  /api/data/:admin_id/               GET      [Data]
Data       /api/data/details/:data_id/        GET      Data
Excel      /api/data/:data_id/:data_version   GET      File.xlsx
CSV        /api/data/:data_id/:data_version   GET      File.csv
JSON       /api/data/:data_id/:data_version   GET      File.json
JSON-LD    /api/data/:data_id/:data_version   GET      File.json
RDF        /api/data/:data_id/:data_version   GET      File.ttl

Publication status shift is the process of changing the publication status of the data. Data whose publication status is "published" can be accessed by everyone via the client webpage and the web APIs.

D. Data Visualization Evocation

The last core process is data visualization evocation. Data visualization evocation is the process of generating data visualizations. These data visualizations help make the data more understandable to everyone. The processes of data visualization evocation can be seen in Figure 5.

Figure 5 Processes of Data Visualization Evocation

The first process is chart type selection. The objective of this process is to choose the best chart to present to the audience. In this paper, the charts are limited to four types: bar chart, line chart, scatter plot, and pie chart. These charts are chosen to fulfill four purposes of data visualization, namely composition, comparison, distribution, and correlation^4.

The second process is axes selection. The main goal of this process is to choose the preferred axes for the chart selected beforehand. In this process, a user must be extremely careful in choosing the axes, due to the different data types needed to generate each chart. For example, a scatter plot must have numeric data types on both axes. In this paper, a chart can accept aggregation on one axis, in order to visualize charts with an aggregation purpose, such as the sum, average, maximum, or minimum of an axis.

The third process is data processing. In this process, the data from the selected axes is processed according to the chosen numeric transformation. For example, a chart with a sum function over the x axis processes the y axis as the sum grouped by the mentioned x axis, and vice versa.

The last process is drawing the data visualization. In this process, the data processed in the previous step is drawn according to its axes, creating a complete chart. If the chart is good enough to be published, it is stored in the file storage as a PNG image file.

^4 https://www.labnol.org/software/find-right-chart-type-for-your-data/6523/

E. System Architecture

The data publishing tool developed in this paper is a web-based application whose functionalities are derived from the three core processes mentioned before. The web application uses a client-server architecture, which can be seen in Figure 6.

Figure 6 System Architecture

The main users of this web application are divided into two groups, namely the administrator of the system, i.e. staff of the general hospital, and the public, i.e. humans or bots. The administrator of the system has the role of inserting data into the system and managing the contents inside the system. Thus, the administrator has full access privileges in the system, i.e. reading and writing. In order to gain these privileges, the administrator has to provide login information to the system. The public, in contrast, are the clients of the system. They have the role of accessing the data available in the system, with limited access privileges. The public can only use the reading privilege in the system; thus, the public can only access the system via the web APIs and the client webpage in order to reach the data provided. Humans can access both the web APIs and the client webpage, whereas bots can only access the data provided via the web APIs.

All of the users' requests are directed straight to a server. This server processes all of the operations needed by the users, e.g. inserting data into the system, reading files in the data store, etc. The operations available in the server are data extraction, transformation, data store, data publication, and data visualization evocation. In order to have full access to all operations on the server, a user needs the login information mentioned before, i.e. the user needs administrator privileges. If the user is a public user, the system limits the user's privileges to read-only, which accepts only GET requests.

The last layer of the system architecture is the data store. The data store is divided into two parts, i.e. a database and a file system. The database keeps the data records generated by server operations, whereas the file system keeps both the binary and non-binary files generated from data insertion into, and evocation by, the system.

F. Data Structure

All of the data processed by the data publishing tool in this paper is stored in a database. The database used in this paper is MongoDB, a database using NoSQL technology. NoSQL technology enables the system to create records without predefining their schema. This is possible due to its flexible nature, which allows each record in one table to have a schema different from the others. The database is organized into collections and documents. Collections are equivalent to tables in a relational database, whereas documents are equivalent to records in the same way. The reason for using MongoDB as the main database is that MongoDB represents each document as JSON. JSON is very light and easy for humans to read. Furthermore, JSON is very versatile, i.e. it can be easily parsed by the various programming languages available today.

In this paper, although MongoDB documents can have various schemas depending on the needs of the system, the schema used within one collection is kept uniform to make data processing easier. There are six collections defined in the system: Administrator, Data, Data Version, File, Link, and Temporary Data.

The first collection is Administrator. This collection stores administrator ids and credentials, which are needed to gain administrator access in the system. The schema of the collection can be seen in Figure 7.

The second collection is Data. This collection stores the general information of each dataset. This collection also serves as a container for the Data Version collection, which will be explained later. The schema of the collection can be seen in Figure 8.

The third collection is Data Version. This collection stores the data versions created for one Data document. Due to this, one Data document can have several versions of the data, which supports data updates depending on the situation. The schema of the collection can be seen in Figure 9.

The fourth collection is File. This collection stores information regarding the files stored in the file system. The files can be in either binary or non-binary format. The schema of the collection can be seen in Figure 10.

The fifth collection is Link. This collection stores information about the external links saved in the system. These external links are also embedded in the RDF files. The schema of the collection can be seen in Figure 11.

The last collection is Temporary Data. This collection stores information about temporary data used in the system. In this paper, temporary data is used for the data cleaning process and the data visualization evocation process. The schema of the collection can be seen in Figure 12.

Admin: {
    _id: Schema.Types.ObjectId,
    department_id: Number,
    department_name: String,
    password: String
}

Figure 7 Schema for Administrator Collection

Data: {
    _id: Schema.Types.ObjectId,
    admin_id: Schema.Types.ObjectId,
    created_at: Datetime,
    updated_at: Datetime,
    first_published_at: Datetime,
    last_published_at: Datetime,
    last_hidden_at: Datetime,
    published_data_updated_at: Datetime,
    draft_data_updated_at: Datetime,
    is_published: Boolean,
    metadata: {
        title: String,
        description: String
    }
}

Figure 8 Schema for Data Collection

DataVer: {
    _id: Schema.Types.ObjectId,
    data_id: Schema.Types.ObjectId,
    version_number: Number,
    created_at: Datetime,
    is_main_version: Boolean,
    metadata: {
        columns: Schema.Types.Mixed,
        context: Schema.Types.Mixed,
        rating: Number
    },
    content: Schema.Types.Mixed,
    tasks: [{
        name: String,
        description: String,
        status: Boolean,
        mandatory: Boolean,
        done_at: Datetime,
        url: String
    }]
}

Figure 9 Schema for Data Version Collection
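As a concrete illustration of the relationship between the Data and Data Version collections in Figures 8 and 9, the following sketch builds one dataset document and its first version as plain Python dictionaries. The field values are hypothetical, and the commented pymongo calls are only one possible way to store them; they are not the tool's actual code.

```python
# Sketch of a Data document and its first Data Version document, following
# the schemas in Figures 8 and 9. All field values are invented.
# With pymongo the documents could be stored with, e.g.:
#   from pymongo import MongoClient
#   db = MongoClient()["publishing_tool"]
#   data_id = db.data.insert_one(data_doc).inserted_id
from datetime import datetime, timezone

now = datetime.now(timezone.utc)

data_doc = {
    "admin_id": None,            # reference to an Administrator document's _id
    "created_at": now,
    "updated_at": now,
    "is_published": False,       # flipped later by the publication status shift
    "metadata": {
        "title": "Patients' Visit to Emergency Department",
        "description": "Monthly visit counts",
    },
}

version_doc = {
    "data_id": None,             # filled with the Data document's _id
    "version_number": 1,
    "created_at": now,
    "is_main_version": True,
    "metadata": {"columns": {}, "context": {}, "rating": 2},  # 2-star input
    "content": [],               # cleaned rows go here
    "tasks": [],                 # enhancement steps still to be done
}
```

Because the version documents live in their own collection and point back via `data_id`, one dataset can accumulate versions without rewriting its container document.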


File: {
    _id: Schema.Types.ObjectId,
    data_id: Schema.Types.ObjectId,
    version_number: Number,
    created_at: Datetime,
    filename: String,
    format: String,
    category: String,
    alias: String,
    metadata: {
        source_name: String,
        title: String,
        description: String
    }
}

Figure 10 Schema for File Collection

Link: {
    _id: Schema.Types.ObjectId,
    data_id: Schema.Types.ObjectId,
    version_number: Number,
    created_at: Datetime,
    url: String,
    title: String,
    description: String
}

Figure 11 Schema for Link Collection

TempData: {
    _id: Schema.Types.ObjectId,
    data_id: Schema.Types.ObjectId,
    version_number: Number,
    content: Schema.Types.Mixed
}

Figure 12 Schema for Temporary Data Collection

G. File Storage

In this paper, various files are stored in the file system. In order to manage the files available in the system, the authors use Django's default file storage. Django is the server-side programming framework used, written in the Python language. The file system consists of five directories, namely results, source, charts, attachments, and temporary files. The results directory stores the result files from previously processed data. The source directory stores the source files containing the input data. The charts directory stores the charts saved after the data visualization evocation process. The attachments directory stores the attachments embedded in the data. Lastly, the temporary files directory stores temporary data used for data processing, e.g. charts.

IV. EVALUATION

A. Evaluation Scheme

The data publishing tool was tested as an evaluation of the development conducted to build the tool itself. Besides the development, the evaluation also targets data quality assessment for the data stored in the system. The data quality assessment used the Five Star Open Data standard as the main benchmark. The tests for the system include functionality testing of each functionality using the black-box testing method. The tests were conducted using five datasets from Hasan Sadikin General Public Hospital as test datasets. The list of datasets used for the tests can be seen in Table II below.

TABLE II. DATASET LIST

Dataset ID  Name                                                        Origin
SIRS-01     Patients' Visit to Emergency Department                     Information System
SIRS-02     Patients' Visit to Internal Medicine Department Data in 2017  Information System Department
SIRS-03     Patients' Visit to Pediatric Department                     Information System
PE-01       Customer Satisfaction Index Measurement Data                Planning and Evaluation
PE-01       Customer Satisfaction Index Measurement Data                Planning and Evaluation

B. Evaluation Result

The data publishing tool has been tested on both the functionality and the data quality aspects. On the functionality aspect, the tool executed all of the functionalities completely without any exceptions. On the data quality aspect, the tool processed all of the datasets successfully. All of the datasets were enhanced from 2-star to 5-star data quality. However, there is one exception to this result: the 5-star data, i.e. the linked data specifying connections from one dataset to another, is limited to table-to-table connections. This limitation can be addressed in future development by linking datasets at the row-to-row level. In general, the tool passed all of the tests and assessments completely without any severe problems.

V. CONCLUSION

In this paper, the authors have developed a data publishing tool that can enhance the data quality of five datasets of Hasan Sadikin General Public Hospital. The enhancement goes from 2-star to 5-star data quality. The data quality enhancement process consists of a data cleaning process, a data conversion process, and data linking to other resources, in this case datasets, links, and attachments. Besides enhancing the data quality of the datasets, the tool also published all of the datasets and generated several data visualizations for the datasets.

ACKNOWLEDGMENT

This research is fully supported by our school, the School of Electrical Engineering and Informatics, Institut Teknologi Bandung. The authors would like to thank our colleagues, both from our department and from other disciplines, for the insights and inspirations that allowed us to conduct this research according to plan. The authors would also like to thank the staff of Hasan Sadikin General Public Hospital for their support and assistance in making this research run smoothly. Finally, we would like to thank everyone who contributed to this research.

REFERENCES

[1] C. A. Nusapati and W. D. Sunindyo, "Semi-automated data publishing tool for advancing the Indonesian open government data maturity level, case study: Badan Pusat Statistik Indonesia," 2017 International Conference on Data and Software Engineering (ICoDSE), Palembang, 2017, pp. 1-6.
[2] F. Bauer and M. Kaltenböck, Linked Open Data: The Essentials. Vienna: Semantic Web Company, 2012, pp. 9-13.
[3] P. Aryan, F. Ekaputra, W. Sunindyo, and S. Akbar, "Fostering government transparency and public participation through linked open government data (Case study: Indonesian public information service)," 1st International Conference on Data and Software Engineering (ICODSE 2014), pp. 1-6, 2014.

