
Research Document

Privacy-preserving data mining for open government data
from heterogeneous sources
Contents
Introduction
Theory Presentation
    Project Risks
    Early warning signs
    Controlling the system development
Case Summary
Case Analysis
Methodologies
Conclusion
Introduction:
This research document is about privacy-preserving data mining for open
government data from heterogeneous sources. "Open data" is a global movement
with the potential for positive social and economic impact. Open government
data (OGD) policies may allow government agencies to offer new or innovative
services. The significance of data mining is well addressed by the
International Open Data Charter.
Governments that have signed this charter should give priority to the
following areas:
• data mining
• linking
• in-depth analysis
Using OGD data mining for in-depth analysis faces a number of practical
challenges. First, most OGD are published without identifiers in order to
prevent privacy disclosure, which makes records from different datasets hard to
link. Second, because the data are siloed, various administrative or
institutional hurdles must be overcome to enable data sharing and collection
from heterogeneous OGD. A novel technical solution that uses micro-aggregation
and distance-based data linkage to overcome these problems has therefore become
necessary. This study suggests a method that can combine two or more
de-identified OGDs into a single dataset to enable OGD data mining.
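To make the idea of micro-aggregation concrete, the following is a minimal
sketch, not the method proposed in the study. It assumes a single numeric
variable (hypothetical sales figures), sorts the values, forms groups of at
least k records, and replaces each value with its group mean. The function name
`microaggregate` and the sample data are illustrative assumptions.

```python
from statistics import mean


def microaggregate(values, k):
    """Replace each value with the mean of its rank-ordered group.

    Groups contain at least k records; the last group absorbs any remainder.
    This is a simple fixed-size univariate variant, used here for illustration.
    """
    if k < 1 or len(values) < k:
        raise ValueError("need k >= 1 and at least k values")
    # Indices of the records, ordered by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    result = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        end = len(values) if g == n_groups - 1 else start + k
        members = order[start:end]
        group_mean = mean(values[i] for i in members)
        for i in members:
            result[i] = group_mean
    return result


# Hypothetical sales figures for six companies.
sales = [120, 95, 400, 110, 380, 390]
masked = microaggregate(sales, k=3)
```

After masking, every published value is shared by at least k records, so a
single company's exact figure is no longer visible, while group-level
statistics such as the overall mean are preserved.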
The purpose of this document is to explain three perspectives on our project:
• Project risks associated with our project.
• Early warning signs, i.e., the warnings, issues, and problems that we faced
at the start or in the early stages of our project.
• Controlling the project development.

Theory Presentation:
The analysis of our project applies three theoretical perspectives, which are
explained below:
Project Risks:
Data mining requires datasets that may contain personal information, which can
jeopardize confidentiality and privacy. Data aggregation, the process of
gathering data from various sources and combining it so that it can be
examined, is one way this can happen.
In this report we explain the risks that we faced while completing this
project. Projects are often affected by two kinds of problem:
• Strategy risk
• Inter-organization issues

We faced four major risks while completing this project. Two of them affected
our project badly because they are linked to our basic requirements.
The risks and challenges we faced in this project are:
• Government law and policies
• Over budget
• Data management
• Data privacy and accuracy
The risk that affected our project most was government law and policies. We
had to sign the government policy papers with government employees, which cost
us time, and as a result our project took longer to complete.
At the start of the project we planned a budget, but completing the project
took 38% more than planned. Because this is a government-level project, our
company contacted the government about the overrun, and the government delayed
approving the budget. At one point we considered closing the project.
Early warning signs:
Early warning signs are not difficult to recognize. A project that is going to
fail can often be identified early on, whereas success can only be confirmed
once the job is finished.
Much depends on the ability to plan and manage projects. With a firm
understanding of the essential conditions for a project's success, project
managers can assess the variables that could prevent the team from achieving
its objectives.
A project that is well planned, has a distinct baseline, and has a suitable
framework in place is already halfway there. According to the Pulse report,
high-performing companies that use tried-and-true project management techniques
achieve their initial objectives 2.5 times more frequently.

Early warning signs are one of the concepts used to identify a project's
probable causes of failure. An early warning is defined by signals, which can
be interpreted in a variety of ways: as an expression, an indicator, a proof,
or an indication that certain unfavorable issues will arise in the future.
Project failure means that:
• We are unable to deliver the project.
• We are unable to achieve our goals.
• We are unable to satisfy our clients.
As mentioned earlier, we faced budget risk in completing this project, and at
one point we considered closing it because we saw the overrun as an early
warning sign of project failure.
The other early warning signs that we faced are:
• Unclear project requirements
• Poor communication with clients
• Constant changes in scope
Controlling the system development:
The process of creating and maintaining computer-based information systems is
known as information systems analysis and design (ISAD). Information systems
analysis and design is motivated by organizational considerations. An
organization could be made up of the entire company, certain departments, or
individual work groups. Information systems analysis and design is therefore a
technique for improving an organization. Systems are created and improved
(rebuilt) for organizational purposes. Benefits come about as a result of
adding value when developing, producing, and maintaining the organization's
services and goods.
Controlling the system includes control over the different phases of
development, which are:
1. Requirement gathering.
2. Design.
3. Implementation.
4. Testing.
To achieve control over the development of our project, we tried to control all
four of these phases, as they are the basics of system development.
Case Summary:
In this study, we propose a micro-type privacy-preserving data mining (PPDM)
algorithm to protect the privacy of open government data (OGD) without any
identifiers during mining. In 2013, several nations signed the G8 Charter.
However, the possibility of refining its principles to support wider utilization of
OGD has since been discussed (The International Open Data Charter, 2020). The
preamble to the International Open Data Charter clearly highlights the primary
purpose of OGD to be the pursuit of technology-fueled innovation to create more
accountable, efficient, responsive, and effective governments and businesses,
while simultaneously promoting economic growth (G8, 2013). The public use of
open data is necessary to achieve the aforementioned goals (T. M. Yang & Wu,
2016). However, this grants stakeholders across the world unprecedented access
to OGD, enabling outsiders to use their external experience and information
gathered from data mining as leverage for their own interests (Altayar, 2018; G8,
2013; Janssen, Charalabidis, & Zuiderwijk, 2012; Z. Yang & Kankanhalli, 2013;
Zuiderwijk, Helbig, Gil-García, & Janssen, 2014). Data mining is a process that uses
sophisticated data analysis tools to investigate relationships present in datasets,
revealing hidden patterns that are otherwise indiscernible (Tan, Steinbach, &
Kumar, 2016). It includes statistical modeling techniques, mathematical
algorithms, and various machine learning-based methods (Seifert, 2004).
Expanding the scope of OGD implementations requires the implicit use of
heterogeneous data sources (Conradie & Choenni, 2014). However, the utilization
of data from various sources is difficult owing to the de-identification of personal
information present in OGD data. Such privacy barriers present a major hindrance
to OGD mining by restricting access to sensitive data (Axelsson & Schroeder,
2009; Ruijer, Grimmelikhuijsen, & Meijer, 2017). Additionally, implementations
can be significantly delayed owing to the time-consuming and expensive nature of
legal
proceedings associated with the participating services (Conradie & Choenni,
2014), which are often geographically separated. Some practitioners have
suggested that a technical solution might be a more appropriate solution to the
privacy issue, to enable special measures that can be implemented in the
workplace immediately (Seifert, 2004). In summary, OGD mining is difficult. This is
because of the complexity of establishing good linkages for data analysis owing to
the lack of identifiers caused by de-identification. Silos further complicate data-
sharing between government agencies. Therefore, in this study, we formulated a
research question on the possible modalities of application of the PPDM method
to OGD without using identifiers. To answer this question, we propose a new
technical solution to link heterogeneous data via a privacy-preserving technique.
PPDM of big data based on artificial intelligence (AI) is a state-of-the-art technical
solution that preserves sensitive attributes. Consequently, it has become an
increasingly popular field of research (Agrawal & Srikant, 2000). PPDM-related
studies primarily aim to develop an efficient algorithm capable of preventing
disclosure or inference of sensitive data during the extraction of relevant
information from databases (Wu, Chu, Wang, Liu, & Yue, 2007). PPDM finds
applications in several practical scenarios, as demonstrated by its role in medical
research based on personal medical records of patients (Lindell & Pinkas, 2002).
For example, the Center for Disease Control researches patterns of disease
occurrence by analyzing the data owned by various insurance companies.
However, if an insurance company were to disclose the entirety of personal
medical data to the Center for Disease Control, it would create serious
commercial and legal problems (Estivill-Castro & Clifton, 2002). In such a case, a
distributed data mining algorithm is required to safeguard the private information
of the associated users (Xiong, Chitti, & Liu, 2007).
The proposed method is a hybrid model, combining distance-based record linkage
with a micro-aggregation method. It mines de-identified OGD via record
linkage. Thus, the method is capable of addressing the problem of mining of
anonymized and already distributed OGD. Further, it supports heterogeneous
data mining for in-depth analysis, which has been emphasized in the Open Data
Charter.
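The distance-based record linkage step can be sketched as follows. This is a
minimal illustration under stated assumptions, not the algorithm of the study:
records are compared only within the same industry category (a blocking
assumption), numeric variables are scaled by their pooled range, and each
record of the first dataset is linked to its nearest neighbour in the second by
Euclidean distance. The function name `link_records`, the field names, and the
sample records are all hypothetical.

```python
import math


def link_records(basic, extended, numeric_keys, block_key):
    """Link each record in `basic` to its nearest record in `extended`.

    Returns a list of (index_in_extended, distance) pairs; the index is None
    when no record in `extended` shares the same block value.
    """
    # Scale each numeric variable by its pooled range so that no single
    # variable dominates the distance.
    scale = {}
    for key in numeric_keys:
        pooled = [r[key] for r in basic + extended]
        scale[key] = (max(pooled) - min(pooled)) or 1.0

    links = []
    for a in basic:
        best_j, best_d = None, math.inf
        for j, b in enumerate(extended):
            if a[block_key] != b[block_key]:
                continue  # only compare records within the same block
            d = math.sqrt(
                sum(((a[k] - b[k]) / scale[k]) ** 2 for k in numeric_keys)
            )
            if d < best_d:
                best_j, best_d = j, d
        links.append((best_j, best_d))
    return links


# Hypothetical de-identified records sharing three common variables.
basic = [
    {"industry": "C26", "year": 1998, "sales": 108.3},
    {"industry": "C10", "year": 2005, "sales": 390.0},
]
extended = [
    {"industry": "C10", "year": 2004, "sales": 395.0},
    {"industry": "C26", "year": 1998, "sales": 110.0},
    {"industry": "C26", "year": 2010, "sales": 50.0},
]
links = link_records(basic, extended, ["year", "sales"], "industry")
```

No identifier is used at any point; the link is inferred purely from how close
the shared variables are. In practice a distance threshold would usually be
added so that distant nearest neighbours are rejected rather than linked.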

Case Analysis:
After the case summary, we analyze our case according to the three perspectives
we have on our project. We then discuss the methodologies we used to find the
solution to our project problem.
Project Risk:
We identified the risks of our project by interviewing our clients and the
employees working on this project.
The risks we faced while completing this project, “Privacy-preserving data
mining for open government data from heterogeneous sources”, are given in the
table below:

We group the risks into categories and create a framework to resolve them so
that the project can be completed successfully.
First, we plan risk management and identify the risks we are facing. We then
perform qualitative and quantitative risk analysis with our clients in order to
implement responses to our project risks.
Early Warning Signs:
We worked together with our team and clients to identify the early warning
signs and their causes in order to make our project successful.
We identified the following early warning signs:
People Related EWS:
Lack of backing from upper management:
Given that workers frequently concentrate on tasks that their boss deems
important, it is not unexpected that this EWS is the best-rated one. Projects that
are initiated "from the bottom up" and departmental initiatives that lack the
necessary enterprise-wide support are examples of potential problem projects.
Ineffective project manager:
Project managers that are unable to lead or communicate well present a
significant danger. Successful analysts or programmers are promoted to project
managers in our project, but the roles are fundamentally distinct, much like sales
and sales management.
Rather than doing the work themselves, project managers must plan and organize
many activities, and our managers consistently performed poorly in this role.
No participation or involvement of stakeholders:
Every significant project has a lot of stakeholders. If the project is to succeed,
these stakeholders must contribute resources, which frequently means taking
funds away from lesser priority tasks. Resources are constantly in greater demand
than they are in supply. It is almost certain that the project will not receive the
resources and attention necessary to deliver the stated project scope on time and
within budget if all important stakeholders are not involved and committed to the
project's success. Our project stakeholders are government officials; they
showed no interest in this project, and we had to do all the work alone.
Process Related EWS:
Absence of requirements or success criteria documentation:
If functional, performance, and reliability requirements are not defined, then each
project team member and stakeholder will inevitably have a different expectation
and assumption about the project because each participant is operating from a
distinct mental model. Asking for sign-offs on requirements documents brings
disparities in expectations and assumptions to the fore, where they can be
resolved.
Response to our project EWS:

Controlling the system development:


Controlling system development is very important in project management and is
one of the main factors in project success or failure. To control our project
development, we divided our project into four modules:
1. Requirement:
In this module we interviewed our clients to gather requirements. We used
two techniques:
• Interviewing
• Storyboards and user stories.
2. Design:
Once we had gathered all the requirements, we discussed them with our design
team, who converted the requirements, stated as instructions, into technical
requirements.
3. Implementation:
After design, we sent the technical requirements to the implementation
team, who turned them into the system.
4. Testing:
Once the system was developed, the testing team tested it to verify that it
works correctly.

Methodologies:
k-Anonymity is a technique that simply examines whether a specific individual
can be re-identified from the available data (J.-S. Lee and S.-P. Jun,
Government Information Quarterly 38 (2021) 101544). A dataset is k-anonymous
when every record is indistinguishable from at least k-1 other records with
respect to its quasi-identifiers. This method is one of the most basic
assessment techniques and is easy to understand and execute. However, it has
the drawback of being vulnerable to several types of attacks, including
homogeneity attacks and background knowledge attacks.
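As a minimal sketch of the k-anonymity check, the code below groups records by
their quasi-identifier values and reports the size of the smallest group, which
is the k the dataset achieves. The function name `k_anonymity`, the field
names, and the records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(groups.values(), default=0)


# Hypothetical records: industry and year are quasi-identifiers,
# sales_band is the sensitive attribute.
records = [
    {"industry": "C26", "year": 1998, "sales_band": "high"},
    {"industry": "C26", "year": 1998, "sales_band": "high"},
    {"industry": "C10", "year": 2005, "sales_band": "low"},
    {"industry": "C10", "year": 2005, "sales_band": "mid"},
    {"industry": "C10", "year": 2005, "sales_band": "low"},
]
k = k_anonymity(records, ["industry", "year"])
```

Here the dataset is 2-anonymous, yet both records in the (C26, 1998) group have
the same sales band. Anyone who knows a company belongs to that group learns
its sales band, which is the homogeneity attack mentioned above.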
l-Diversity requires each equivalence class to contain at least l
well-represented values of the sensitive attribute. It extends k-anonymity by
additionally requiring well-represented sensitive values within each anonymity
group, which addresses some of the disadvantages of k-anonymity. However,
l-diversity fails to prevent probabilistic inference attacks and attribute
disclosure.
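A minimal sketch of an l-diversity check follows. It assumes the simplest
reading of "well-represented", namely the number of distinct sensitive values
per equivalence class (often called distinct l-diversity); other
formalizations exist and are not shown. The function name, field names, and
records are hypothetical.

```python
from collections import defaultdict


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values found in any
    equivalence class (0 for an empty dataset)."""
    classes = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        classes[key].add(r[sensitive])
    return min((len(values) for values in classes.values()), default=0)


# Hypothetical records: industry and year are quasi-identifiers,
# sales_band is the sensitive attribute.
records = [
    {"industry": "C26", "year": 1998, "sales_band": "high"},
    {"industry": "C26", "year": 1998, "sales_band": "high"},
    {"industry": "C10", "year": 2005, "sales_band": "low"},
    {"industry": "C10", "year": 2005, "sales_band": "mid"},
    {"industry": "C10", "year": 2005, "sales_band": "low"},
]
l = distinct_l_diversity(records, ["industry", "year"], "sales_band")
```

The result is 1 because the (C26, 1998) class contains a single sales band, so
this sample is 2-anonymous but not 2-diverse.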
t-Closeness is a further extension of l-diversity. Instead of merely
guaranteeing that sensitive values are well represented, this approach requires
the distribution of every sensitive attribute inside each anonymity group to
match the distribution of that attribute over the entire dataset, up to a
threshold t. It protects against attribute disclosure, but not against identity
disclosure.
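The t-closeness condition can be sketched as below. As a simplifying
assumption, the distance between distributions is measured with total variation
distance, which is a reasonable choice for a categorical attribute whose values
are all treated as equally far apart; other distance measures are possible and
the text above does not prescribe one. The function names, field names, and
records are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def t_closeness(records, quasi_identifiers, sensitive):
    """Return the largest total variation distance between the sensitive-value
    distribution of any equivalence class and that of the whole dataset."""
    overall = distribution([r[sensitive] for r in records])
    classes = defaultdict(list)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        classes[key].append(r[sensitive])
    worst = 0.0
    for values in classes.values():
        local = distribution(values)
        distance = 0.5 * sum(
            abs(local.get(value, 0.0) - p) for value, p in overall.items()
        )
        worst = max(worst, distance)
    return worst


# Hypothetical records: industry and year are quasi-identifiers,
# sales_band is the sensitive attribute.
records = [
    {"industry": "C26", "year": 1998, "sales_band": "high"},
    {"industry": "C26", "year": 1998, "sales_band": "high"},
    {"industry": "C10", "year": 2005, "sales_band": "low"},
    {"industry": "C10", "year": 2005, "sales_band": "mid"},
    {"industry": "C10", "year": 2005, "sales_band": "low"},
]
t = t_closeness(records, ["industry", "year"], "sales_band")
```

The dataset satisfies t-closeness for any threshold at or above the returned
value. A smaller value means each group looks more like the dataset as a whole,
so less is learned from knowing which group a record falls in.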
The databases used in the experiments presented in this study were primarily
obtained from two sources—the Korean Innovation Survey (KIS) and the Korea
Enterprise Data (KED). Each database is a multi-year micro-dataset obtained from
the relational database management system (RDBMS) for separate purposes. The
database acquired from KIS comprises cross-sectional survey data obtained via an
annual survey on the research and development (R&D) activities of Korean
companies. This survey captures contemporary international trends by following
the Oslo Manual (OECD, 2005b) of the Organization for Economic Cooperation
and Development (OECD), which facilitates international comparison. It acquires
and provides basic data required for the establishment of national innovation
policies and innovation research by identifying the current status and
characteristics of all innovation-related activities of manufacturing and service
industries. The KIS database also comprises various general details
regarding companies, including company history, employee information, sales
figures, expenditure, patents, anchor product ratios, and technological levels. In
addition, information on R&D investments, equipment purchases, and
government-supported programs sorted by types of innovation activities
(product, process, organization, and marketing) is also included. The KED
database comprises longitudinal data containing the financial information of
Korean companies, including the data disclosed in financial position statements
and cash flow statements obtained from the Korea Corporate Disclosure System
(Jun, Lee, & Lee, 2020). The linking of the two aforementioned databases would
enable the investigation of a host of topics, like variations in the time series of a
company’s financial performance with respect to its R&D activities. 4000
companies from the manufacturing sector that responded to the survey between
January 1, 2013 and December 31, 2015 were used as the analysis targets in this
study from the KIS database (certain items, however, corresponded to the period
between January 1, 2015 and December 31, 2015). The analysis targets from the
KED dataset comprised the financial information of 119,890 companies surveyed
in 2015. Throughout this study, the data of 4000 companies surveyed in 2015
obtained from the KIS database is referred to as the basic database, while the
data of 119,890 companies collected in 2015 obtained from the KED database is
referred to as the extended database. We acquired the business registration
numbers of 1055 companies that appear in both databases. The linkage proposed
in the method developed in this study is executed without the knowledge of any
identifiers, including the business registration numbers that we obtained. Instead,
the business registration numbers for 1055 companies are used only during the
performance assessment of the proposed linkage technique. The variables used in
the experiments are industry category, establishment year, and sales figures,
which are common to both databases. In analogy with personal data, industry
category can be compared to the residential zip code of an individual, while the
establishment year and sales figures can be regarded as corresponding to their
age and income, respectively. Industry category and establishment year are
considered to
be quasi-identifiers (QI). These two variables do not uniquely identify particular
companies by themselves, but they can be used in conjunction with other
information to identify them (OECD, 2005a). Additionally, the data used in the
experiments in this study are secondary data that were distributed in accordance
with the strict guidelines for privacy protection established by the OECD.
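The role of the known business registration numbers in the performance
assessment can be illustrated with a short sketch. The metrics shown here,
precision and recall over the records with a known true match, are an assumed
choice for illustration; the function name `linkage_scores` and the record
identifiers are hypothetical.

```python
def linkage_scores(predicted, truth):
    """Score predicted links against known true links.

    Both arguments map a basic-database record id to an extended-database
    record id. Predictions for records without a known true match are ignored,
    because their correctness cannot be judged.
    """
    evaluated = {a: b for a, b in predicted.items() if a in truth}
    correct = sum(1 for a, b in evaluated.items() if truth[a] == b)
    precision = correct / len(evaluated) if evaluated else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth (derived from shared registration numbers) and
# links produced by a linkage method that never saw those numbers.
truth = {"k1": "e7", "k2": "e3", "k3": "e9", "k4": "e1"}
predicted = {"k1": "e7", "k2": "e5", "k3": "e9", "k9": "e2"}
precision, recall = linkage_scores(predicted, truth)
```

The identifiers appear only in `truth`, which mirrors the setup described
above: the linkage itself runs without identifiers, and they are consulted
afterwards solely to measure how many links were right.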

Conclusion:
The findings of this study demonstrate a strong relationship between the
identified risk variables, the early warning signs, and the absence of adequate
control mechanisms. We tried to identify and interlink the risks, early warning
signs, and control mechanisms while completing this project.
Because the three theoretical viewpoints applied to this case are closely
related, the advice based on this report is to employ all three when developing
a preliminary project plan. What risk factors do we face? How can we tell if
the path we are on is hazardous? How can we influence what we do and how we
develop? Combining these viewpoints results in a comprehensive understanding of
the project and the dangers it faces.
This study proposed a method to preserve the privacy of already published OGD
while still making use of it. It emphasized the importance of research on
already published OGD, which has thus far been relatively neglected. Further,
it has broadened the research area for PPDM by proposing a method to create an
integrated dataset for the mining of heterogeneous data without any
identifiers. In future work, we intend to investigate methods to train the
proposed model in order to improve its practical utility by making it faster
and easier to reuse. In addition, a case study using the proposed method will
be conducted to provide a guide for policy makers.

Recommendations:
After completing our project, “Privacy-preserving data mining for open
government data from heterogeneous sources”, we realized that we made many
mistakes along the way. We can do better with some changes that we will
implement in future projects in the same domain.

Our recommendations for improving project work and quality next time are:
• If the government is involved in our project, sign the TOS contract before
starting the project.
• Stay focused on the main goal and aim of the project.
• Improve project planning and execution by hiring suitable project managers.
• Increase communication with clients.
• Discuss lessons learned in every module or phase of the project.
• Have the project technical documentation ready before the implementation
phase.
• Finalize the project schedule and budget before starting the project;
otherwise it will affect the project during implementation.

References:
• M. Holsheimer and A. Siebes. Data mining: The search for knowledge in
databases. CWI Report CSR9406, Amsterdam, The Netherlands, 1994.
• G. H. John. Enhancements to the data mining process. PhD thesis,
Computer Science Dept., Stanford University, 1997.
• R. Agrawal, T. Imielinski, A. Swami. Database mining: A performance
perspective. IEEE Trans. Knowledge and Data Engineering, 5(6), 914–925,
December 1993.
• P. Bradley, U. M. Fayyad, O. L. Mangasarian. Mathematical Programming for
Data Mining: Formulations and Challenges. Technical Report 98-01,
Computer Sciences Department, University of Wisconsin, Madison, WI,
January 1998.
• Rubin, D. B. (1993). Statistical disclosure limitation. Journal of Official
Statistics, 461–468. https://doi.org/10.1007/978-0-387-39940-9_3686.
• Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the
American Statistical Association.
https://doi.org/10.1080/01621459.1996.10476908.
• Ruijer, E., Grimmelikhuijsen, S., & Meijer, A. (2017). Open data for
democracy: Developing a theoretical framework for open data use. Government
Information Quarterly, 34(1), 45–52.
https://doi.org/10.1016/j.giq.2017.01.001.
• Schweitzer, R. (2009). Examining the barriers to e-government adoption.
Journal of EGovernment, 7(1), 113–122.
• Seifert, J. W. (2004). Data mining and the search for security: Challenges
for connecting the dots and databases. Government Information Quarterly,
21(4), 461–480. https://doi.org/10.1016/j.giq.2004.08.006.
• Shadbolt, N., & O’Hara, K. (2013). Linked data in government. IEEE Internet
Computing, 17(4), 72–77. https://doi.org/10.1109/MIC.2013.72.
