The Development of a Data Publishing Tool for Indonesian Open Government Data

2019 International Conference on Electrical Engineering and Informatics (ICEEI), 9-10 July 2019, Bandung, Indonesia

Abstract—Open Government Data (OGD) is data produced or commissioned by the government that can be published openly. Such data can be accessed freely by anyone, in order to increase public participation and enable government agencies to report their performance transparently. Indonesia is one of many countries applying the open government data concept, through the establishment of Open Government Indonesia (OGI). With the establishment of OGI, many Indonesian government agencies have published open government data; however, much of it has low data quality. One standard that can assess open government data quality is Five Star Open Data. This standard uses a five-step concept, where each step requires the data to meet a particular level of quality. This paper proposes a solution to enhance the quality of Indonesian government data by developing a data publishing tool. The tool accepts data of 2-star and 3-star quality and enhances it to 5-star quality. The tool also publishes the data and generates several types of data visualization according to the data. Data from Hasan Sadikin General Public Hospital (RSHS) is used as test data. Based on the evaluation conducted, the tool can enhance the data quality of five datasets from 2-star to 5-star quality. In addition, the tool publishes the datasets and generates data visualizations based on their contents.

Keywords—open government data; government data; Five Star Open Data; data visualization; hospital data

I. INTRODUCTION

Open Government Data (OGD) is data produced or commissioned by the government or government-controlled entities that can be used, reused, and redistributed freely by anyone [2]. Government data that can be used as OGD is limited to data that contains no personal content and does not endanger national security and stability. In OGD development, there are many standards to assess the quality of the published data. One of these standards is Five Star Open Data, which assesses OGD on five levels. To achieve each level, the OGD must possess the qualities stated by the standard.

OGD usage and publication in Indonesia is based on the Open Government Indonesia (OGI) movement. The movement's main goal is to make the Indonesian government transparent, participative, and innovative1. One of the concrete moves by OGI is an OGD portal named data.go.id. With the movements made by OGI, everyone should be able to access government data without facing inconvenient administrative procedures.

There are several challenges in OGD development in Indonesia. According to research conducted by Aryan et al. in 2014, most published Indonesian open government data has low data quality [3]. Most of the data provides only 1-star quality, and the rest provides at most 3-star quality. Moreover, according to a survey conducted by the World Wide Web Foundation in 20172, Indonesian open government data scored only 37 out of 100 in terms of overall data quality.

In this paper, the authors propose a solution to overcome these challenges: a data publishing tool with two core processes. The first process enhances open government data quality from 2-star to 5-star quality according to the Five Star Open Data standard. The second process publishes the data and generates data visualizations along the way. The solution has been evaluated with datasets from Hasan Sadikin General Public Hospital. Based on the evaluation, the tool successfully enhanced the data quality of each dataset from 2-star to 5-star quality. The tool also published each dataset successfully.

II. RELATED WORKS

A. Five Star Open Data

Five Star Open Data is one of the standards that can be used to assess the quality of open data. The standard was proposed by Tim Berners-Lee and consists of five levels. To reach each level, open data must provide a particular quality: it must be available on the web under an open license (level 1), be structured (level 2), be in a non-proprietary format (level 3), have URIs to identify itself (level 4), and be connected with other sources of information, e.g. other datasets, articles, and documents (level 5)3.

B. Existing Data Publishing Frameworks in Indonesia

Aryan et al. proposed a framework to transform existing government data into an LOD-ready format, i.e. RDF. The processes included in the framework can be found in Figure 1.

1 https://infokomputer.grid.id/2015/01/fitur/perkembangan-open-data-di-indonesia/
2 https://opendatabarometer.org/4thedition/
3 http://5stardata.info/en/
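The five levels of the Five Star Open Data standard described above can be sketched as a simple rating function. This is a minimal illustration only: the mapping of file formats to levels 2 and 3 below is an assumption chosen for the sketch, not part of the standard's text.

```python
# Sketch of rating a dataset against the Five Star Open Data levels.
# The format sets below are illustrative assumptions.
def star_level(fmt: str, has_uris: bool = False, has_links: bool = False) -> int:
    structured = {"xls", "xlsx", "csv", "json", "rdf"}
    non_proprietary = {"csv", "json", "rdf"}
    level = 1  # available on the web under an open license
    if fmt in structured:
        level = 2  # structured data
    if fmt in non_proprietary:
        level = 3  # non-proprietary format
    if has_uris:
        level = 4  # URIs identify the data
    if has_links:
        level = 5  # linked with other sources of information
    return level

print(star_level("xlsx"))  # 2-star: structured but proprietary
print(star_level("csv"))   # 3-star: non-proprietary
print(star_level("rdf", has_uris=True, has_links=True))  # 5-star
```

Under this reading, the Excel datasets the tool accepts start at level 2, which is the entry point of the enhancement process described later.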
Figure 1 Framework for data transformation to an LOD-ready format (Aryan et al.)

The transformation begins with inserting existing Indonesian open government data, ranging from 1-star to 3-star quality, into the system. The data is then transformed from its initial format into RDF, achieving 4-star data quality. The RDF data is then cleaned in a data cleaning process, to ensure that it contains no typos or unnecessary content. The data is also linked with other resources in this step, achieving 5-star data quality. These two processes are recorded in a provenance step to ensure that they can be tracked later. The provenance step also stores all the processed data in the RDF store.

Nusapati and Sunindyo proposed another framework, developed from the earlier framework by Aryan et al. [1]. Their research used open government data published by Badan Pusat Statistik (BPS) Indonesia as a case study, and extended the framework into more detailed processes. The core processes proposed in the framework can be found in Figure 2.

Figure 2 Framework for data transformation to an LOD-ready format (Nusapati and Sunindyo)

The transformation begins with inserting data of 2-star or 3-star quality, i.e. CSV and Excel files. Then the data quality enhancement process begins. It consists of six processes: conversion to a JSON array, data cleaning, adding metadata, conversion to a JSON object, conversion to RDF, and finally data linking. These processes enhance the data quality up to 5-star quality. The final process of the framework is the data store, which saves all of the data processed in the previous steps. The processed data is then embedded with web APIs to give the public access to it.

A. Core Processes

The tool was developed based on the frameworks proposed by Aryan et al. and by Nusapati and Sunindyo. The tool has three core processes, which can be seen in Figure 3.

Figure 3 Tool Core Processes

The first core process enhances the input data quality from 2-star to 5-star quality. The data is then published to the public through web APIs provided by the system, and through a client webpage containing the published data. Finally, data visualizations can be generated from the data published beforehand.

B. Data Quality Enhancement

Data quality enhancement is the first process conducted by the system. Its main goal is to enhance the input data to 5-star data quality. Data quality enhancement involves eight processes, which can be seen in Figure 4.

Figure 4 Data Quality Enhancement Processes

The first process is uploading the input data into the system. This process is conducted automatically by the system. Since the input data is 2-star data, i.e. data in Excel format, the system uses a library called "pandas" to convert it into JSON format. The original data is then saved automatically into the file system.

The second process is converting the input data into a JSON object format. This JSON object consists of the input data's content and metadata. It serves as temporary data storage, whose information will be used in the data cleaning process.

The third process is data cleaning. In this process, the JSON object is converted into a pandas DataFrame to make cleaning easier. The DataFrame is then cleaned through the following steps: typo removal, missing data removal, and content editing. This process is done manually by the user, i.e. the administrator of the government agency. After cleaning, the data is saved in JSON object format once more.

The fourth process is defining column data types and contexts. Its main purpose is to define the data type of each column in the structured data, and to define the context of each column, i.e. RDF classes and properties. In addition to data types and contexts, a custom title and description is given to each column. This process is done manually.

The fifth process is conversion to JSON-LD. This process converts the data with its defined contexts into a JSON format with the RDF contexts embedded in the data. Its main purpose is to make the subsequent RDF-ization process easy, thanks to the defined contexts. These contexts serve as URIs which other data can use as "links" to this data. This process is done automatically by the system.
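The upload and cleaning steps described above (processes one to three) can be sketched with pandas, the library the tool uses. In the tool itself, the file would be read with pandas' Excel reader; here a hypothetical in-memory frame with example hospital figures stands in for the uploaded file, and the metadata fields are illustrative.

```python
import json

import pandas as pd

# In the tool, the uploaded 2-star file would be read with
# pd.read_excel(...); an in-memory frame stands in for it here.
df = pd.DataFrame({
    "month": ["January", "February ", None, "April"],
    "patients": [120, 95, 110, None],
})

# Data cleaning: remove rows with missing data, fix stray whitespace.
df = df.dropna()
df["month"] = df["month"].str.strip()

# Conversion to a JSON object holding both content and metadata,
# as the second enhancement process describes.
record = {
    "metadata": {"title": "Monthly Patients", "rating": 2},
    "content": df.to_dict(orient="records"),
}
print(json.dumps(record))
```

The typo-removal and content-editing steps are manual in the tool, so only the mechanical parts (missing-data removal, whitespace cleanup) are automated in this sketch.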
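The JSON-LD conversion step can be sketched as attaching an @context that maps each column to an RDF property and giving each row an identifying URI. The property URIs and base URI below are hypothetical placeholders, not the contexts the tool actually defines.

```python
import json

# Hypothetical column contexts, as defined in the fourth process:
# each column name is mapped to an RDF property URI.
context = {
    "month": "http://schema.org/name",
    "patients": "http://schema.org/value",
}

rows = [
    {"month": "January", "patients": 120},
    {"month": "February", "patients": 95},
]

# Conversion to JSON-LD: attach the @context and assign each row a URI
# under a hypothetical base, so other data can link to individual rows.
base = "http://example.org/dataset/monthly-patients"
json_ld = {
    "@context": context,
    "@graph": [
        {"@id": f"{base}/row/{i}", **row} for i, row in enumerate(rows)
    ],
}
print(json.dumps(json_ld, indent=2))
```

With the contexts in place, an RDF serializer can translate this document into triples directly, which is why the paper notes that RDF-ization becomes easy after this step.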
The sixth process is RDF-ization, which translates the JSON-LD generated beforehand into data in RDF format, containing the URIs that serve as the links mentioned before. This process is done automatically.

The seventh process is data linking. Its main goal is to link the data with other resources in order to create Linked Open Data. The resources accepted as external resources in this paper are external data, external attachments, and external links. This process is done manually, by defining the title, description, and content of the resources.

The final process is the data store. Data that has been processed beforehand is saved as files in Excel, CSV, JSON, JSON-LD, and RDF formats respectively. These files are stored in the file system, and later embedded with web APIs to enable access to each file. Storing each file is done automatically by the system.

C. Data Publishing

After enhancing the data quality with the processes mentioned before, the data publishing process starts. It consists of two subprocesses, namely web API evocation and publication status shift. Web API evocation is the process in which web APIs are generated by the system. The web APIs serve the data in the various formats stored in the data store. They are represented as a RESTful API, the renowned API form for the web. The web APIs can then be accessed read-only by the public as downloadable files. The list of web APIs can be seen in Table I below.

TABLE I. WEB API

D. Data Visualization Evocation

The processes of data visualization evocation can be seen in Figure 5.

Figure 5 Processes of Data Visualization Evocation

The first process is chart type selection. The objective of this process is to choose the best chart to present to the audience. In this paper, the charts are limited to four types: bar chart, line chart, scatter plot, and pie chart. These charts were chosen to fulfill four purposes of data visualization, namely composition, comparison, distribution, and correlation4.

The second process is axes selection. Its main goal is to choose the preferred axes for the chart selected beforehand. In this process, the user must choose the axes carefully, due to the different data types needed to generate each chart; for example, a scatter plot must have numeric data types on both axes. In this paper, a chart can accept an aggregation on one axis, in order to visualize aggregates such as the sum, average, maximum, or minimum of an axis.

The third process is data processing. In this process, the data from the selected axes is processed according to the chosen numeric transformation. For example, in a chart with a sum function, the y axis is computed as the sum of the values for each x axis entry, and vice versa.

The last process is drawing the data visualization. In this process, the data processed in the previous step is drawn according to its axes, creating a complete chart. If the chart is ready to be published, it is stored in the file storage as a PNG image file.

4 https://www.labnol.org/software/find-right-chart-type-for-your-data/6523/
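The axes-selection and data-processing steps above can be sketched with a pandas aggregation. The ward/patient dataset below is hypothetical, and the actual drawing step (e.g. rendering with a plotting library and saving the PNG) is deliberately omitted so the sketch stays focused on the numeric transformation.

```python
import pandas as pd

# Hypothetical published dataset: one record per ward observation.
df = pd.DataFrame({
    "ward": ["A", "B", "A", "C", "B"],
    "patients": [10, 4, 6, 8, 2],
})

# Axes selection with a sum aggregation: the x axis is the ward and the
# y axis is the total number of patients per ward, suitable for a bar
# chart. Other aggregations (mean, max, min) would replace .sum().
chart_data = df.groupby("ward")["patients"].sum()
print(chart_data.to_dict())
```

The resulting series is exactly the x/y pairing a bar chart needs, which is why the tool computes the aggregation before the drawing step rather than inside it.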
the client of the system. They have a role to access the data available in the system, with limited access privileges: the public only has reading privileges in the system. Thus, the public can only access the data provided via the web APIs and the client webpage. Humans can access both the web APIs and the client webpage, whereas bots can only access the data via the web APIs.

All of the users' requests are directed straight to a server. This server processes all of the operations needed by the users, e.g. inserting data into the system, reading files in the data store, etc. The operations available in the server are data extraction, transformation, data store, data publication, and data visualization evocation. To have full access to all operations on the server, a user needs the login information mentioned before, i.e. administrator privileges. If the user is a public user, the system limits the user's privileges to read-only, accepting only GET requests.

The last layer of the system architecture is the data store. The data store is divided into two stores, i.e. a database and a file system. The database keeps the data records generated by server operations, whereas the file system keeps both the binary and non-binary files generated by data insertion and evocation.

F. Data Structure

All of the data processed by the data publishing tool in this paper is stored in a database. The database used is MongoDB, a NoSQL database. NoSQL technology enables the system to create records without predefining their schema. This is possible due to its flexible nature, so each record in one table can have a different schema from another. The database is organized into collections and documents: collections are equivalent to tables in a relational database, whereas documents are equivalent to records. The reason for using MongoDB as the main database is that MongoDB represents each document as JSON. JSON is lightweight and easy for humans to read. Furthermore, JSON is very versatile, i.e. it can be easily parsed by the various programming languages available today.

In this paper, although MongoDB documents can have various schemas depending on the needs of the system, the schema used within one collection is kept uniform to make data processing easier. There are six collections defined in the system: Administrator, Data, Data Version, File, Link, and Temporary Data.

The first collection is Administrator. This collection stores administrator ids and credentials, which are needed to gain administrator access to the system. The schema of the collection can be seen in Figure 7.

The second collection is Data. This collection stores the general information of each dataset. It also serves as a container for the Data Version collection, which will be explained later. The schema of the collection can be seen in Figure 8.

The third collection is Data Version. This collection stores the data versions created for one Data document. Because of this, one Data document can have several versions of the data, which supports data updates depending on the situation. The schema of the collection can be seen in Figure 9.

The fourth collection is File. This collection stores information regarding the files stored in the file system. The files can be in either binary or non-binary format. The schema of the collection can be seen in Figure 10.

The fifth collection is Link. This collection stores information about the external links saved in the system. These external links are also embedded in the RDF file. The schema of the collection can be seen in Figure 11.

The last collection is Temporary Data. This collection stores information about the temporary data used in the system. In this paper, temporary data is used in the data cleaning process and the data visualization evocation process. The schema of the collection can be seen in Figure 12.

Admin: {
  _id: Schema.Types.ObjectId,
  department_id: Number,
  department_name: String,
  password: String
}

Figure 7 Schema for Administrator Collection

Data: {
  _id: Schema.Types.ObjectId,
  admin_id: Schema.Types.ObjectId,
  created_at: Datetime,
  updated_at: Datetime,
  first_published_at: Datetime,
  last_published_at: Datetime,
  last_hidden_at: Datetime,
  published_data_updated_at: Datetime,
  draft_data_updated_at: Datetime,
  is_published: Boolean,
  metadata: {
    title: String,
    description: String
  }
}

Figure 8 Schema for Data Collection

DataVer: {
  _id: Schema.Types.ObjectId,
  data_id: Schema.Types.ObjectId,
  version_number: Number,
  created_at: Datetime,
  is_main_version: Boolean,
  metadata: {
    columns: Schema.Types.Mixed,
    context: Schema.Types.Mixed,
    rating: Number
  },
  content: Schema.Types.Mixed,
  tasks: [{
    name: String,
    description: String,
    status: Boolean,
    mandatory: Boolean,
    done_at: Datetime,
    url: String
  }]
}

Figure 9 Schema for Data Version Collection
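Since MongoDB does not require a predefined schema, keeping documents uniform within a collection, as the paper intends, is an application-level concern. A minimal sketch of such a uniformity check for the Administrator collection is shown below; the field names follow Figure 7, while the helper function, the simplification of ObjectId to str, and the example document are assumptions for illustration.

```python
# Expected field types for the Administrator collection (Figure 7);
# ObjectId is simplified to a plain string for this sketch.
ADMIN_SCHEMA = {
    "_id": str,
    "department_id": int,
    "department_name": str,
    "password": str,
}

def conforms(document: dict, schema: dict) -> bool:
    """Check that a document has exactly the schema's fields and types."""
    if set(document) != set(schema):
        return False
    return all(isinstance(document[key], typ) for key, typ in schema.items())

doc = {
    "_id": "5d9f1c2e",
    "department_id": 12,
    "department_name": "Radiology",
    "password": "hashed-secret",
}
print(conforms(doc, ADMIN_SCHEMA))  # True
```

A check like this, run before inserting a document, gives the uniform-collection behavior of a relational table while keeping MongoDB's flexible document model.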