Open Source Tools for Creating Mashups with Government Datasets

Mohammed Firdaus, Muhd Sharuzzamal Bakri

June 29, 2010

Introduction

About the Speakers

Mohammed Firdaus bin Mohammed Ab Halim (@firdaus_halim) and Muhd Sharuzzamal Bakri (@amai)
Founders of Persada Terbilang Sdn Bhd. We have no relationship whatsoever to any fertilizer supplier.

What are Mashups?

Mashups

A mashup is a web page or application that uses and combines data, presentation or functionality from two or more sources to create new services. (Source: Wikipedia)
Data mashups combine similar types of media and information from multiple sources into a single representation. (Source: Wikipedia)

Challenges

Data Sets are Not Available in Machine Readable Form

Nothing useful here:
filetype:csv site:.gov.my
filetype:xml site:.gov.my
filetype:rdf site:.gov.my

We have to resort to web scraping.
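As a rough illustration of what scraping a results table can involve, here is a minimal sketch using only Python's standard library. The HTML fragment and the TableRowParser class are hypothetical; the real site's markup will differ, and this is not the scraper published with the talk.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical page fragment; the real site's markup will differ.
SAMPLE_HTML = """
<table>
  <tr><th>Tajuk Tender</th><th>Nombor Tender</th></tr>
  <tr><td>Example tender title</td><td> 8/2009 </td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
parser.close()
header, *records = parser.rows
rows_as_dicts = [dict(zip(header, r)) for r in records]
print(rows_as_dicts)
```

Turning each row into a dictionary keyed by the header cells keeps the later steps independent of column order, which helps if the page layout changes.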

No Data Dictionaries

Since the data sets that are available were meant for humans to consume rather than machines, they are usually published without any type of data dictionary. This means that an application developer will have to make assumptions about the structure of each field, e.g. whether it's unique, whether it's a multi-value field, and which fields are mandatory or optional. These assumptions may or may not turn out to be correct as you see more and more data in the data set.
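One way to test such assumptions is to profile each field before relying on it. The sketch below is an illustration only: the CSV sample, its column names and the profile helper are made up, not taken from any government data set.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the shape of a scraped CSV dump; values are made up.
SAMPLE_CSV = """tender_no,ministry,winner
T-1,MINISTRY A,COMPANY X
T-1,MINISTRY A,COMPANY Y
,MINISTRY B,"1. COMPANY X 2. COMPANY Z"
"""


def profile(rows, field):
    """Report how often `field` is empty and which values repeat."""
    values = [(row.get(field) or "").strip() for row in rows]
    filled = [v for v in values if v]
    counts = Counter(filled)
    return {
        "missing": len(values) - len(filled),
        "repeated": sorted(v for v, n in counts.items() if n > 1),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for field in ("tender_no", "ministry", "winner"):
    print(field, profile(rows, field))
```

A field with repeated values cannot be treated as a unique key, and a field with missing values cannot be treated as mandatory. Re-running the profile on every new dump shows when an earlier assumption stops holding.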

New Data Sets Constantly Become Available

This is not a bad thing. However, our code, database and schema must be flexible enough to deal with future data sets that we might want to use in our applications.

Lack of Standards Across Agencies

Different agencies use different identifiers to refer to the same entity. The lack of common identifiers makes it tedious to combine data sets that may be describing the same entity. MyCoID and MyID are steps in the right direction.
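Without a shared identifier, one fallback is to match entities on a normalized form of their names. The sketch below is a heuristic under stated assumptions: the name_key function and its rewrite rules are hypothetical, and name matching can produce both false matches and missed matches, so a real identifier is preferable whenever one exists.

```python
import re


def name_key(name):
    """Build a rough matching key from a company name (heuristic only)."""
    key = name.upper()
    key = re.sub(r"[^A-Z0-9 ]", " ", key)       # drop punctuation such as '.' and ','
    key = re.sub(r"\bSENDIRIAN\b", "SDN", key)   # spelled-out suffix variants
    key = re.sub(r"\bBERHAD\b", "BHD", key)
    return " ".join(key.split())                 # collapse runs of whitespace


a = name_key("Kesuma Technology Sdn. Bhd")
b = name_key("KESUMA TECHNOLOGY SDN BHD")
print(a == b)
```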

Summary

Because of these challenges, we need an agile method for modeling, storing and processing these government datasets in our application. The purpose of this presentation is to show how representing your data as a graph both helps you deal with these challenges and helps you make compelling data mashups.

Graphs

Introduction to Graphs

What is a Graph?

A data structure that consists of a collection of vertices and the connections between those vertices, called edges. Vertices are sometimes called nodes or dots. Edges are sometimes called relationships, links or arcs. The terminology differs between software packages.
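To make the definition concrete, here is a minimal sketch of a simple undirected graph held in two Python sets. The vertex names and the neighbours helper are made up for illustration.

```python
# Vertices are plain labels; each edge is an unordered pair of vertices.
vertices = {"A", "B", "C", "D"}
edges = {frozenset(("A", "B")), frozenset(("B", "C")), frozenset(("C", "A"))}


def neighbours(vertex):
    """Return the vertices that share an edge with `vertex`."""
    return {other
            for edge in edges if vertex in edge
            for other in edge if other != vertex}


print(sorted(neighbours("A")))   # D has no edges, so it is nobody's neighbour
```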

Types of Graphs

A directed graph (or digraph) is one where the edges have a direction (i.e. there's an outgoing and an incoming vertex).
A multigraph is one where multiple edges can exist between two vertices.
An edge-labeled graph is a graph where edges have labels. Similarly, a vertex-labeled graph is one in which the vertices have labels.
An attributed graph is one in which the vertices and edges can have attributes (key-value pairs).
A graph can have more than one of these properties, e.g. a multi-digraph is one in which multiple directed edges can exist between two vertices.
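These properties can be combined in one small data structure. The sketch below is an in-memory illustration, not the API of any graph library: the class and method names are hypothetical, edges carry a label and a direction, and both vertices and edges carry key-value attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    tail: Vertex          # vertex the edge leaves
    head: Vertex          # vertex the edge points to
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.vertices = []
        self.edges = []    # a list, so parallel edges are allowed (multigraph)

    def add_vertex(self, **properties):
        vertex = Vertex(len(self.vertices), properties)
        self.vertices.append(vertex)
        return vertex

    def add_edge(self, tail, label, head, **properties):
        edge = Edge(tail, head, label, properties)
        self.edges.append(edge)
        return edge

    def out_edges(self, vertex, label=None):
        return [e for e in self.edges
                if e.tail is vertex and (label is None or e.label == label)]


g = PropertyGraph()
a = g.add_vertex(name="a")
b = g.add_vertex(name="b")
g.add_edge(a, "knows", b, since=2009)
g.add_edge(a, "knows", b, since=2010)   # second edge between the same pair
g.add_edge(b, "knows", a)               # direction matters
print(len(g.out_edges(a, "knows")), len(g.out_edges(b)))
```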

Types of Graphs - Simple/Undirected Graphs

Types of Graphs - Directed Graph

Types of Graphs - Edge and Node Labeled Graph

Types of Graphs - Multigraph

Types of Graphs - Attributed Multigraph

Examples - Social Graphs

Source: http://www.flickr.com/photos/greenem/11696663/

Undirected Graph - Vertices represent people and edges represent friendships.

Examples - Web Graph

http://en.wikipedia.org/wiki/File:WorldWideWebAroundWikipedia.png

Multi-digraph - Vertices represent web pages and directed edges represent links between pages.

Property Graphs

'Property graph' is another term for an attributed, labeled multi-digraph. Property graphs are flexible enough to support most types of graph data. Other types of graphs (with the exception of hypergraphs) can be built on top of property graphs by removing features or by using features of the property graph in certain ways. The tools that we are covering in this presentation deal primarily with property graphs.

Source: http://wiki.github.com/tinkerpop/gremlin/defining-a-property-graph

Data Sets

Treasury Procurement Data

Treasury - Tenders Awarded

Source: http://myprocurement.treasury.gov.my/index.php/en/list-keputusan-tender

Fields

Tajuk Tender (Title of Tender)
Nombor Tender (Tender Number)
Kategori Perolehan (Procurement Category)
Kementerian (Ministry)
Petender Berjaya (Winner of Tender)
No Pendaftaran Dengan ROB/ROS/ROC (Registration Number with ROB/ROS/ROC)
No Pendaftaran Dengan MOF/PKK (Registration Number with MOF/PKK)
Harga Setuju Terima (Agreed Upon Value)

Code and Data in Machine Readable Form

For this presentation we are using data that we scraped from this site on 2010-04-26.
The source code for our scraper and the CSV dump from 2010-04-26 are available at http://mfirdaus.com/mosc-paper/
The dump contains 2615 records.

The Dump

Issues with this Data Set

Missing Fields

Out of the 2615 records in the dump:
510 records were missing a tender number
472 records were missing a category
1836 records were missing a ROB/ROS/ROC number
510 records were missing a MOF number

Tender Numbers are Not Unique
32 records have the same tender number and title as another record.
23 records have the same tender number as another record.
In some cases these appear to be duplicate records, since all the fields match. In other cases, one or two fields are slightly different, indicating that there was probably a typo (and the erroneous record was not deleted). In some cases, the other fields are completely different, which leads us to think that a tender can have multiple winners (we need government officials to verify this for us).
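The three cases above can be told apart mechanically before anyone inspects them by hand. This is a minimal sketch with made-up records; the field names and the classify_shared_numbers helper are illustrative and are not the dump's real headers or the authors' code.

```python
from collections import defaultdict

# Made-up records; field names are illustrative, not the dump's real headers.
records = [
    {"no": "T-1", "title": "Supply of goods", "winner": "COMPANY X"},
    {"no": "T-1", "title": "Supply of goods", "winner": "COMPANY X"},
    {"no": "T-2", "title": "Cleaning services", "winner": "COMPANY Y"},
    {"no": "T-2", "title": "Cleaning services", "winner": "COMPANY Z"},
    {"no": "T-3", "title": "Road works", "winner": "COMPANY Y"},
]


def classify_shared_numbers(records):
    """Group records by tender number and label each group that has >1 record."""
    groups = defaultdict(list)
    for rec in records:
        if rec["no"]:                      # skip records with no tender number
            groups[rec["no"]].append(rec)

    report = {}
    for no, recs in groups.items():
        if len(recs) < 2:
            continue
        distinct = {tuple(sorted(r.items())) for r in recs}
        report[no] = "exact duplicates" if len(distinct) == 1 else "fields differ"
    return report


print(classify_shared_numbers(records))
```

Groups labeled "fields differ" still need a human decision: the code cannot tell a typo from a tender with several winners.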

Format of Tender Numbers

Examples of tender numbers:
8/2009
PL.(T).08.2009(JKP)
X0141110101090021
128/2009
KBS.S.4-14/69 (T.26/2009)
It is probably not a good idea to write code that attempts to parse the tender number.

Format of the "Petender Berjaya" Field

SYARIKAT PROSPECTRUM SDN BHD
TELEKOM SMART SCHOOL SDN BHD NO.45-8, LEVEL 3, BLOCK C, PLAZA DAMANSARA, JALAN MEDAN SETIA 1, BUKIT DAMANSARA 50490 KUALA LUMPUR
1. GLOBAL AEROSPACE SDN BHD (A002) 2. SYSTEM ALLIANCE TECHNOLOGY SDN. BHD.(A003) 3. KARISMA WIRA SDN. BHD. (A004) 4. KESUMA TECHNOLOGY SDN. BHD (A005)
A QUALITY REPUTATION SDN BHD B PRIMABUMI SDN BHD

Modeling

Modeling this Data Set as a Property Graph
One way to model this data as a graph is to use:
Vertices to represent tenders, ministries and companies/businesses.
An "awarded by" labeled edge to associate a tender with a ministry.
An "awarded to" labeled edge to associate a tender with the winner of the tender (the company/business).
Attributes on tender vertices for the tender title, number, value and category.
Attributes on company/business vertices for the company/business name, ROB/ROC/ROS registration number and MOF registration number.
Attributes on ministry vertices for the name of the ministry.
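The model above can be sketched with plain Python dictionaries before any database is involved. Everything here is an assumption made for illustration: the record keys, the get_or_create helper, and the choice to key tenders by title (made because tender numbers may be missing; titles could still collide in real data).

```python
# A made-up record; keys are illustrative names for the scraped fields.
record = {
    "title": "Example tender", "no": "T-1", "category": "Supplies",
    "value": "1000.00", "ministry": "MINISTRY A",
    "winner": "COMPANY X", "reg_no": "", "mof_no": "",
}

vertices = {}   # (type, name) -> attribute dict; reused if seen before
edges = []      # (tail key, label, head key) triples


def get_or_create(vtype, name, **attrs):
    key = (vtype, name)
    vertices.setdefault(key, {"type": vtype, "name": name}).update(attrs)
    return key


def add_record(rec):
    ministry = get_or_create("ministry", rec["ministry"])
    company = get_or_create("business entity", rec["winner"],
                            no=rec["reg_no"], mof_no=rec["mof_no"])
    # Assumption: tenders are keyed by title because numbers may be missing.
    tender = get_or_create("tender", rec["title"], no=rec["no"],
                           category=rec["category"], value=rec["value"])
    edges.append((tender, "awarded_by", ministry))
    edges.append((tender, "awarded_to", company))


add_record(record)
add_record(dict(record, title="Another tender", no="T-2"))
print(len(vertices), len(edges))
```

The second record reuses the existing ministry and company vertices, so shared entities end up connected to every tender that mentions them.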

Example

Graph Databases and Neo4j

Neo4j - Introduction

Neo4j

Neo4j is a graph database: it persists data in graph form.
It uses the property graph data model, with the exception of vertex labels.
In Neo4j terms, vertices are nodes, edges are relationships and attributes are properties.
Property values can be a String or any Java primitive (arrays of these types are supported as well).
It is licensed under the AGPLv3, which basically means that you don't need a commercial license if your application is released under a compatible free software license. For other uses, you need a commercial license from the vendor.

Written in Java.
Bindings are available for Python, Ruby, Clojure, Erlang, Groovy, Scala and PHP. We will be using the Python bindings in this talk.
It is an embedded database, meaning that it runs in the same process space as the application.
There's a standalone REST server for those who prefer it.

Inserting into Neo4j

Initializing the Database

import neo4j
db = neo4j.GraphDatabase("db")

Creating the Nodes

ministry_node = db.node(name=ministry, type="ministry")
entity_node = db.node(name=entity_name, no=entity_no,
                      mof_no=entity_mof_no, type="business entity")
tender_node = db.node(no=tender_no, title=tender_title,
                      category=tender_category, value=tender_value,
                      type="tender")

Creating the Relationships

tender_node.awarded_by(ministry_node)
tender_node.awarded_to(entity_node)

Indexing Nodes

ministries = db.index("ministries", create=True)
business_entities = db.index("business entities", create=True)
tenders_by_no = db.index("tenders by no", create=True)
tenders_by_title = db.index("tenders by title", create=True)

tenders_by_no[tender_no] = tender_node
tenders_by_title[tender_title] = tender_node

The Result

Graph Traversals

Traversing the Graph

Traversing is the process of walking around the graph.
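As a database-independent illustration of "walking around the graph", here is a minimal depth-first traversal over an adjacency list. The graph and the depth_first function are made up for this sketch and do not use Neo4j.

```python
# Adjacency list for a small made-up directed graph.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


def depth_first(graph, start):
    """Yield each vertex reachable from `start` once, in depth-first order."""
    seen = set()
    stack = [start]
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue
        seen.add(vertex)
        yield vertex
        # Push neighbours in reverse so the first neighbour is visited first.
        stack.extend(reversed(graph[vertex]))


print(list(depth_first(graph, "a")))
```

The `seen` set is what keeps the walk from visiting "d" twice even though two paths lead to it.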

Graph Traversal Options

Graph Traversal Framework
Gremlin
SPARQL
Manual traversal

Problem

Let's use graph traversal to find all the companies that have been awarded contracts by Kementerian Kesihatan (the Ministry of Health).

Graph Around Kementerian Kesihatan

Traversal Framework

Defining the Traversal
# Companies who have gotten contracts from a particular ministry
# The start node is a ministry
class Contractors(neo4j.Traversal):
    types = [neo4j.Incoming.awarded_by, neo4j.Outgoing.awarded_to]
    order = neo4j.DEPTH_FIRST
    stop = neo4j.STOP_AT_END_OF_GRAPH

    def isReturnable(self, position):
        # Only business entity nodes are returned by the traversal
        return position["type"] == "business entity"

Using the Traversal

with db.transaction:
    moh = ministries["KEMENTERIAN KESIHATAN"]
    contractors = Contractors(moh)
    for c in contractors:
        print(c["name"])

Output

RAF SYNERGY SDN BHD
PRIMABUMI SDN BHD
AVERROES PHARMACEUTICALS SDN BHD
QUALITY REPUTATION SDN BHD
UNISENDO SDN BHD
PRESTIGE PHARMA SDN BHD
PHARMANIAGA LOGISTICS SDN BHD
IDAMAN PHARMA SDN BHD
PHARMASERV ALLIANCES SDN BHD

Traversing Graphs with Gremlin

Gremlin

Gremlin is a graph-based programming language.
It can express complex graph traversals concisely.
Available at http://wiki.github.com/tinkerpop/gremlin/

Traversing the Graph with Gremlin
$ ./gremlin.sh
         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> $_ := g:key("ministries", "KEMENTERIAN KESIHATAN")
==>v[66]
gremlin> ./inE[@label="awarded_by"]/outV/outE[@label="awarded_to"]/inV/@name
==>PHARMASERV ALLIANCES SDN BHD
==>IDAMAN PHARMA SDN BHD
==>PHARMANIAGA LOGISTICS SDN BHD
==>PRIMABUMI SDN BHD
==>PRESTIGE PHARMA SDN BHD
==>UNISENDO SDN BHD
==>PRIMABUMI SDN BHD
==>QUALITY REPUTATION SDN BHD
==>AVERROES PHARMACEUTICALS SDN BHD
==>PRIMABUMI SDN BHD
.....

Explanation

./inE[@label="awarded_by"]/outV/outE[@label="awarded_to"]/inV/@name

inE - incoming edges
outV - outgoing vertices
outE - outgoing edges
inV - incoming vertices

Get the current object (.) (the 'KEMENTERIAN KESIHATAN' node).
Get the incoming edges labeled "awarded_by" (inE[@label="awarded_by"]).
Get the outgoing vertices of those edges (outV) (the tender nodes).
Get the outgoing "awarded_to" edges of the tender nodes (outE[@label="awarded_to"]).
Get the incoming vertices of those edges (inV) (the business entity vertices).
Get the name attributes of those vertices (@name).
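The same sequence of steps can be mirrored in plain Python to show what each step selects. This is a sketch over a made-up edge list, not Gremlin itself: the helper names are hypothetical, and because the vertices here are just names, the final @name step is implicit.

```python
# Made-up edges as (tail, label, head); tail is the vertex the edge leaves.
edges = [
    ("tender-1", "awarded_by", "MINISTRY A"),
    ("tender-1", "awarded_to", "COMPANY X"),
    ("tender-2", "awarded_by", "MINISTRY A"),
    ("tender-2", "awarded_to", "COMPANY Y"),
    ("tender-3", "awarded_by", "MINISTRY B"),
    ("tender-3", "awarded_to", "COMPANY X"),
]


def in_edges(vertices, label):
    return [e for v in vertices for e in edges if e[2] == v and e[1] == label]


def out_edges(vertices, label):
    return [e for v in vertices for e in edges if e[0] == v and e[1] == label]


def tails(edge_list):      # plays the role of outV
    return [e[0] for e in edge_list]


def heads(edge_list):      # plays the role of inV
    return [e[2] for e in edge_list]


start = ["MINISTRY A"]                                # .
tenders = tails(in_edges(start, "awarded_by"))        # inE[...]/outV
winners = heads(out_edges(tenders, "awarded_to"))     # outE[...]/inV
print(winners)
```

Like the Gremlin session, this returns one entry per awarded tender, so a company that won several tenders would appear several times.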

Graph Visualizations

Gephi

Photoshop for graphs.
Support for various graph layout algorithms.
Graph metrics supported: clustering coefficient, PageRank, diameter, betweenness centrality, closeness centrality.
File formats supported: CSV, GraphML, GEXF, etc.
http://www.gephi.org
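Since GraphML is one of the listed formats, a graph can be handed to a visualization tool by writing it out as GraphML. The sketch below builds a small GraphML document with Python's standard library; the node ids, the attribute names ("name", "label") and the to_graphml function are assumptions for illustration, and how a given tool displays these attributes is not covered here.

```python
import xml.etree.ElementTree as ET

NS = "http://graphml.graphdrawing.org/xmlns"

# Made-up graph: vertex id -> name, plus (source, label, target) edges.
nodes = {"n0": "MINISTRY A", "n1": "tender-1", "n2": "COMPANY X"}
edges = [("n1", "awarded_by", "n0"), ("n1", "awarded_to", "n2")]


def to_graphml(nodes, edges):
    root = ET.Element("graphml", xmlns=NS)
    # Declare one string attribute for nodes and one for edges.
    ET.SubElement(root, "key", {"id": "name", "for": "node",
                                "attr.name": "name", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "label", "for": "edge",
                                "attr.name": "label", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for node_id, name in nodes.items():
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "data", key="name").text = name
    for source, label, target in edges:
        edge = ET.SubElement(graph, "edge", source=source, target=target)
        ET.SubElement(edge, "data", key="label").text = label
    return ET.tostring(root, encoding="unicode")


xml_text = to_graphml(nodes, edges)
print(xml_text)
```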

Mashing Up

Adding External Data Sources

Let's add shareholding data from Suruhanjaya Syarikat Malaysia (SSM) to the graph so that we can show the tenders that have been awarded to Telekom Malaysia Berhad and any of its subsidiaries/associate companies.

Connecting Telekom Malaysia Berhad and Telekom Smart School Sdn Bhd
telekom = business_entities["TELEKOM MALAYSIA BERHAD"]
telekom_smart_school = business_entities["TELEKOM SMART SCHOOL SDN BHD"]
telekom_multi_media = db.node(
    name="TELEKOM MULTI-MEDIA SDN BHD",
    no="345420-H",
    text="TELEKOM MULTI-MEDIA SDN BHD",
    type="business entity")
telekom.shareholder_in(telekom_multi_media, units=1650000)
telekom_multi_media.shareholder_in(telekom_smart_school, units=7650000)

Graph Centered at Telekom Malaysia Berhad

Graph Centered at Telekom Smart School Sdn Bhd

Traversing to Find Direct/Indirect Awards

The Traverser

class AllTendersDirectIndirect(neo4j.Traversal):
    types = [neo4j.Incoming.awarded_to, neo4j.Outgoing.shareholder_in]
    order = neo4j.DEPTH_FIRST
    stop = neo4j.STOP_AT_END_OF_GRAPH

    def isReturnable(self, position):
        # Only tender nodes are returned by the traversal
        return position["type"] == "tender"

Executing the Traverser and the Output
Executing the Traversal Definition

telekom = business_entities["TELEKOM MALAYSIA BERHAD"]
tenders = AllTendersDirectIndirect(telekom)
for tender in tenders:
    print(tender["no"])

Output

30/2009
35/2009
8/2009
162/2009
JASA/OP/1/2009
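The logic of this traversal can be checked without a database. The sketch below follows outgoing shareholder_in edges and collects the tails of incoming awarded_to edges over a made-up edge list; the company names, tender ids and function name are hypothetical.

```python
# Made-up edges as (tail, label, head).
edges = [
    ("PARENT", "shareholder_in", "SUBSIDIARY"),
    ("SUBSIDIARY", "shareholder_in", "GRANDCHILD"),
    ("tender-1", "awarded_to", "PARENT"),
    ("tender-2", "awarded_to", "GRANDCHILD"),
    ("tender-3", "awarded_to", "UNRELATED"),
]


def tenders_direct_and_indirect(company):
    """Tenders awarded to `company` or to anything it holds shares in."""
    seen, stack, tenders = set(), [company], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for tail, label, head in edges:
            if label == "awarded_to" and head == current:
                tenders.append(tail)          # incoming awarded_to
            elif label == "shareholder_in" and tail == current:
                stack.append(head)            # outgoing shareholder_in
    return sorted(set(tenders))


print(tenders_direct_and_indirect("PARENT"))
```

Note that this treats any shareholding, however small, as a link worth following; filtering on the units attribute would be a separate design decision.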

Wrapup

Making this Easier
