DC Big Data Exploration

Final Report
March 15-17, 2013

World Bank Group Finances
finances.worldbank.org
@WBOpenFinances

TABLE OF CONTENTS
ACKNOWLEDGEMENTS
EXECUTIVE SUMMARY
DETAILED PROJECT REPORTS
PREDICTING SMALL-SCALE POVERTY MEASURES FROM NIGHT ILLUMINATION
BACKGROUND AND PROBLEM STATEMENT
DATASETS AVAILABLE
KEY FINDINGS
METHODS AND ANALYSIS
RECOMMENDATIONS AND NEXT STEPS
ADDITIONAL RESOURCES
SCRAPING WEBSITES TO COLLECT CONSUMPTION AND PRICE DATA
BACKGROUND AND PROBLEM STATEMENT
DATASETS AVAILABLE
KEY FINDINGS
METHODS AND ANALYSIS
RECOMMENDATIONS AND NEXT STEPS
ADDITIONAL RESOURCES
LATIN AMERICA POVERTY ANALYSIS FROM MOBILE SURVEYS
PROBLEM STATEMENT
DATASETS AVAILABLE
KEY FINDINGS
METHODS AND ANALYSIS
RECOMMENDATIONS AND NEXT STEPS
ADDITIONAL RESOURCES
MEASURING SOCIOECONOMIC INDICATORS IN ARABIC TWEETS
BACKGROUND AND PROBLEM STATEMENT
DATASETS AVAILABLE
KEY FINDINGS
METHODS AND ANALYSIS
RECOMMENDATIONS AND NEXT STEPS
ADDITIONAL RESOURCES
ANALYZING WORLD BANK DATA FOR SIGNS OF FRAUD AND CORRUPTION
DETAILED PROBLEM STATEMENT
DATASETS AVAILABLE
KEY FINDINGS
METHODS AND ANALYSIS
RECOMMENDATIONS AND NEXT STEPS
ADDITIONAL RESOURCES
UNDP RESOURCE ALLOCATION
BACKGROUND AND PROBLEM STATEMENT
DATASETS AVAILABLE
KEY FINDINGS
METHODS AND ANALYSIS
RECOMMENDATIONS AND NEXT STEPS
ADDITIONAL RESOURCES
OTHER PROJECTS: A HEURISTIC TOOL FOR AUDITING AND SOCIAL NETWORK ANALYSIS FOR RISK MEASUREMENT
NEXT STEPS


ACKNOWLEDGEMENTS
The World Bank Group Finances team is grateful to all the partners that supported the DC Data
Dive on March 15-17:



• UN Global Pulse
• Qatar Computing Research Institute
• UNDB
• UNDP

The following groups from inside the Bank also contributed:
• AFR
• CTR
• DEC
• EXT
• IEG
• INT
• LAC
• OPCS
• PREM
• TWICT
• WBI

We are also grateful to DataKind, the data ambassadors that it assembled, and the volunteers
who participated. The DC Big Data Exploration would not have occurred without them.
The DC Big Data Exploration was preceded by a similar event that the UNDP organized in Vienna
in February.


EXECUTIVE SUMMARY
Dr. Jim Kim in a recent speech asked ‘what will it take for the World Bank Group to be at its best
on every project, for every client, every day?’ His own prescription was that ‘We must…support
our clients in applying evidence-based, non-ideological solutions to development challenges…
This is the next frontier for the World Bank Group…we need to continue investing in data and
analytic tools, building on the success of the Open Data initiative. Data are crucial to setting
priorities, making sound policy, and tracking results.’
The age of big data holds the tantalizing promise of reshaping international development.
There is already compelling evidence from the private sector that big data can be
transformative. UPS used sensor data to shave 30 million miles off its drivers' routes. Reports claim
that predictive analytics has been worth about $23 billion to Target over 8 years. Services like
Farecast use vast amounts of seemingly unconnected data to create new information services
that would not have been possible a few years ago.
The question, then, is whether and how big data has a role in international development. The UN-led
Global Pulse initiative seeks to harness 'today's new world of digital data and real-time analytics
to gain a better understanding of changes in human well-being’. The World Bank, too, sees big
data as a promising area, but one that needs further exploration.
On March 15-17 DataKind, in partnership with the World Bank and its partners from UNDP,
UNDB, UN Global Pulse, and the Qatar Computing Research Institute, held the DC Big Data
Exploration to explore new ways of using big data to fight poverty and corruption. The event
drew more than 120 pro bono data scientists from Washington DC and across the nation to the
World Bank’s Preston Auditorium. Working alongside Bank experts on the Poverty and Fraud &
Corruption teams, the data scientists uncovered new ways of collecting, exploring, and
visualizing data to maximize their impact. The collaboration between the two communities
yielded new insights from World Bank data, devised new ways of using existing big data sources
for monitoring poverty and corruption, and created entirely new streams of data that the Bank
and its partners can use in future research.
Prior to the event, DataKind and the World Bank’s Poverty and Fraud & Corruption teams
identified six key projects to tackle over the weekend. The projects were designed to address
the Bank’s needs and generate tangible insights within a 24-48 hour period:
Predicting Small-Scale Poverty Measures from Night Illumination: The team explored whether
nighttime light imagery could be used to estimate sub-national poverty levels. Over the
weekend, the team created software to overlay lighting information with other geospatial
indicators (e.g. population, change in poverty) and performed a statistical analysis showing that
lighting levels in satellite images were predictive of poverty levels in 2001 in Bangladesh. The
Bank can use these findings to carry out more sophisticated experiments relating nighttime
lighting to poverty and to build software to monitor poverty in real-time from remote sensing.


Scraping Websites to Collect Consumption and Price Data: To combat the lack of price data,
this team wrote software to scrape food prices from supermarket and cost of living websites,
and other sources of food data. The results yielded real-time food price monitoring data for
early alerting to food crises, better information for battling inflation, and a richer perspective on
food data.
Measuring Socioeconomic Indicators in Arabic Tweets: This project analyzed more than 25GB
of Arabic tweets to see if they could estimate socioeconomic conditions from what people were
saying on social media. The team wrote code to track key socioeconomic terms over time (e.g.,
“bankrupt” or “food”) and estimated the time zones, locations, and gender of the authors from
their messages alone. These findings could be used in future work to design proper experiments
to test for socioeconomic differences in regions or demographics based on passively collected
social media data.
Latin America Poverty Analysis from Mobile Surveys: This project analyzed Listening to Latin
America (L2L) mobile survey data to understand socioeconomic conditions in Peru. The team
analyzed basic survey results and discovered patterns in the survey responses, e.g.
mobile follow-ups yield very different answers to socially sensitive questions than face-to-face
interviews, and economic incentives do not seem to affect response rates. The Bank can use these
findings to plan future surveys so that they collect more accurate information.
Analyzing World Bank Data for Signs of Fraud and Corruption: This project combined the World
Bank’s internal supplier, contractor, and bidder data with external data to gain a richer
perspective on how firms that had been bidding on Bank-financed contracts behaved. The team
created new unified databases that make analysis easier for the Bank, and identified interesting
patterns in debarred-to-non-debarred organization relationships, co-bidder patterns, and Bank
lending patterns over time.
UNDP Resource Allocation: The UNDP provided capacity and project data in order to
understand what skills and budgets created the best program results. The team explored
expenditures by project and identified the types of projects and regions that hit their budget
goals compared to those that did not.
Heuristic Auditing Tool and Supplier Social Network Analysis: Two other projects outside of
DataKind were developed over the weekend as well: The first was a tool for automated auditing
of bids; the second was a social network analysis tool for understanding the relationships
between suppliers.
The weekend in mid-March was just the start. The World Bank Group Open Finances team is
currently working on the following steps for the program:

• Holding an online competition to address an operational question designed to improve
the delivery of Bank projects;


• Considering a partnership with DataKind to:
o Work with DataKind’s vetted DataCorps consultants on larger Bank efforts with tangible deliverables;
o Engage the data science community to add innovation capacity and data science expertise to the Bank’s ongoing efforts. The DC community was invigorated by the DC Big Data Exploration; the Bank and DataKind will continue to engage the groups that participated to collaborate on these issues.

• Collaborating with partners inside and outside the Bank to create an analytics program
based on big data techniques and tools like those that were used so successfully during
the DataDive weekend.

Finally, it is worth noting that the promise of big data comes with numerous challenges,
especially those related to privacy, data quality, attribution, and legal frameworks. The findings
from the DataDive are provisional; a number of methodological issues still need to be
addressed, e.g. sample size, selection bias, and validity of sources. The promise may, however,
outweigh the perils, and the Bank and its partners need to build quickly on the momentum
achieved over the DataDive weekend.


DETAILED PROJECT RESULTS
PREDICTING SMALL-SCALE POVERTY MEASURES FROM NIGHT ILLUMINATION
Background and Problem Statement
Poverty data collection is expensive and slow. To complement this effort, it would be helpful to
find cheaper, more frequent, if potentially less accurate, ways to measure poverty. The
team looked for patterns that could help build leading indicators of poverty by comparing
existing national poverty maps to other geospatial indicators to see if they were correlated or,
even better, predictive. Nighttime illumination recorded from satellite imagery was used as the
indicator for this project. The team sought to identify whether light levels correlated with
poverty levels and, more importantly, if changes in light intensities could predict changes in
poverty level. If so, then light maps could potentially be used as proxies for poverty data.
Datasets Available

• The 2001 and 2005 poverty levels of Bangladesh at every upazila (county administration level), of which there are about 500.
• Average nighttime illumination levels for every year from 2001 to 2005 from NOAA.

Both datasets were available as raw GIS data. The team also built accessible CSV files containing
histograms of light per region that could be used for statistical analysis. See the Additional
Resources section for other datasets the team collected.
Key Findings
• Using regression models, the team found that lighting and census data from 2001 were
predictive of poverty levels that year. This finding holds promise for being able to
predict poverty levels using satellite imagery of nighttime lighting.
• The team did not find similar results for 2005 data nor for predicting the change in
poverty between 2001 and 2005. No statistically significant relationships were found
over the weekend.
• Further exploration should be done of the 2005 data before ruling out any predictive
ability from lighting maps.
• Geospatial poverty data can be combined with other geospatial data and image-based
data (such as remote imagery) to explore the relationships between different variables.
The team wrote code to aid in extracting shapefile (.shp) regions from TIFF files (.tiff) as
well as interactive maps for researchers to use.
• Census data combined with light intensity data does a slightly better job at predicting
poverty levels than light data alone.
Methods and Analysis
The team took two approaches to exploring the relationship of light intensity to poverty levels.


The first involved creating interactive maps that would allow researchers to overlay geographic
data with existing poverty maps. With these maps, researchers could visually understand which
indicators correlate with poverty. The second approach entailed building statistical models of
the relationships between light intensity and poverty levels.
Interactive Poverty Maps
The team collected poverty maps from Bangladesh from 2001 to 2005 as well as maps of the
change in poverty, light intensity, literacy, and total population. After converting all of the data
into a common format (not a simple task as some of the data was graphical and other data was
in geo format), the team created three major interactive visualizations, available at this site:
1. Descriptive information: This interactive displays descriptive information about each
region of Bangladesh for each of the datasets described above. For example, one can
view the change in poverty from 2001 to 2005 by region on the map. Figure 1 shows an
example of these descriptive maps.
2. Overlaid information: This interactive allows the user to overlay various indicators on
top of illumination data from satellite imagery. For example, one could compare light
intensity to a map of access to electricity or to the change in poverty levels between
2001 and 2005. Figure 2 shows an example of an overlaid map.
3. Timeline maps: The final interactive shows illumination changes for every year from
2001 to 2005. By navigating across this timeline, one can see where regions have
increased or decreased in illumination levels over time. Figure 3 shows a screen shot of
the Timeline maps tool.
With these tools, researchers can visually inspect the relationships between poverty and other
geospatial datasets. The code used here could be adapted to take in other geospatial or raster
image data to add to the tool.
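As an illustration of this workflow, the sketch below extracts a per-region histogram of light intensities and writes it to CSV, similar in spirit to the CSV files the team prepared. It is not the team's actual code; the file names are placeholders, and it assumes a nighttime-lights GeoTIFF, a shapefile of upazila boundaries, and the third-party rasterstats package.

    from rasterstats import zonal_stats
    import csv

    # categorical=True returns, for each polygon, a dict mapping each raster value
    # (here the 0-255 light intensity) to the number of pixels with that value.
    histograms = zonal_stats("upazilas.shp", "nightlights_2001.tif", categorical=True)

    with open("light_histograms_2001.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region_index", "intensity", "pixel_count"])
        for region_index, hist in enumerate(histograms):
            for intensity, count in sorted(hist.items()):
                writer.writerow([region_index, int(intensity), count])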

Figure 1: A descriptive map highlighting the change in poverty in Bangladesh, with the upazila “Dharmapasha” shown.


Statistical Analysis of Light Intensity Levels and Poverty Data

Figure 2: An overlaid map of poverty in 2001 and light intensity.

Using the standardized data from both poverty maps of Bangladesh and satellite imagery of
nighttime illumination, the team set out to answer the following questions:

• Is poverty correlated with light intensity?
• Are changes in poverty correlated with changes in light intensity?


Figure 3: An interactive timeline map of light intensity from 2001 to 2005.

To explore the relationship between poverty levels and light intensity, the team first needed to
get a measure of light intensity at the same scale as the poverty data. Because the poverty data
is on a regional level, the team created an “average light intensity” measure for each region.
This measure was computed for each region by first getting an “average” intensity score using
the equation:
score = Σ_{i=1}^{255} (% of pixels at intensity i) · i
Once this score was computed for each region, a value was assigned to the region based on the
percentage of other regions that had a lower score during the same year. For example, if a
region's average lighting was more than 80% of the other regions in 2001, it would be assigned
an "average light intensity" of 80 for 2001.


Figure 4: Poverty vs. light intensity in 2001 and poverty vs. average light intensity in 2005.

Looking just at the plot of light intensity vs. poverty in 2001 for each upazila (county administrative unit) in
Figure 4, we can see a strong correlation between light intensity and poverty levels. The trend is
less pronounced in the average light intensity plot for 2005, but still apparent.
The team next built a number of models in an attempt to predict actual poverty levels from light
intensity and census data.


Poverty vs. Light Intensity in 2001
The first model the team built was a linear regression predicting poverty level from light
intensity alone using 2001 data. Figure 5 shows a plot of the predicted poverty levels in 2001
and the actual poverty levels using light data alone. All hyper-parameters of the model were
selected by cross-validation on 80% of the data. The model was then fit to that same 80%
and used to predict the remaining 20%. The team set the regularization parameter (alpha) to 10.0 for these
computations, a value selected during the cross-validation on the training split.
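For concreteness, the sketch below shows this kind of modeling setup; it is not the team's actual code. It assumes a table with one row per upazila holding an average light intensity column and a poverty rate column (column names and the randomly generated placeholder data are illustrative only), and uses ridge regression from scikit-learn with alpha chosen by cross-validation.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import RidgeCV
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Placeholder data for illustration; the team used the ~500 upazila-level
    # records of 2001 average light intensity and 2001 poverty rates.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "avg_light_intensity": rng.uniform(0, 100, 500),
        "poverty_rate": rng.uniform(0.1, 0.7, 500),
    })

    X = df[["avg_light_intensity"]].values
    y = df["poverty_rate"].values

    # Hold out 20% of regions; pick alpha by cross-validation on the other 80%,
    # then fit on that same 80% and evaluate on the held-out 20%.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0], cv=5).fit(X_train, y_train)

    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"chosen alpha = {model.alpha_}, test RMSE = {rmse:.6f}")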

Figures 5 (top) and 6 (bottom): Predicted poverty vs. true poverty in 2001 for the models fit using light
intensity and census data from 2001. (Figure 5: RMSE = 0.067650; Figure 6: RMSE = 0.076982).

The Root Mean Squared Error (RMSE) for the model using only an intercept term and the 2001
average light intensity is 0.076982. The RMSE is a measure of the model’s accuracy where lower


numbers indicate higher-performing models. This low RMSE indicates a good fit, suggesting
that 2001 light intensities were predictive of 2001 poverty levels.
Figure 5 shows actual poverty levels vs. the poverty levels predicted by the model.
The team also included features from the 2001 census to see if the fit improved. In this case, the
RMSE on the test data was 0.067650. This slightly lower score means that the census data
combined with light intensity data does a slightly better job at predicting poverty levels than just
using light data alone. Figure 6 shows the predicted poverty levels vs. true poverty levels using
this model.

Figure 6: Predicted poverty vs. true poverty in 2005 for the model fit using light intensity
alone. (RMSE = 0.129964)

The team also looked at how well lights in 2001, lights in 2005, and poverty in 2001 (with and
without the 2001 census features) could predict the change in poverty between 2001 and 2005. The RMSE was
0.129603 for the model without 2001 census data (Figure 9) and 0.143962 for the model with
2001 census data (Figure 8).
Based on the results from these models, the team concluded that the combination of lighting
and census data from 2001 was predictive of poverty that same year.


Figure 7: Predicted poverty vs. true poverty in 2005 for the model fit using light intensity data
from 2005 and census data from 2001. (RMSE = 0.144068)

Figure 8: Change in poverty from 2001 to 2005 predicted against the actual change in poverty
using lights in 2001, lights in 2005, poverty in 2001, and census data in 2001. (RMSE =
0.143962)


Figure 9: Change in poverty from 2001 to 2005 predicted against the actual change in poverty
using lights in 2001, lights in 2005, and poverty in 2001. (RMSE = 0.129603)

Recommendations and Next Steps
The World Bank team should re-run these analyses using the most recent and detailed poverty
maps available. If there are additional geospatial indicators, they should be included in the
analysis.
Based on the preliminary results and findings of the weekend, there are two promising areas for
further exploration:

• The World Bank can reproduce the analysis prototyped at the event using the most recent and detailed poverty data available in different countries. Testing the approach on richer poverty maps, as well as using supplemental geographic and census data, could help uncover a deeper connection between light and poverty.
• The second area of research could be to identify additional geospatial data sources to incorporate into future versions. Initial sources identified, such as road availability and night illumination, could contribute to a leading proxy for poverty.

Further projects could add more granular data and modeling techniques to this prototype.
Additional geospatial data, combined with traditional indicators, could strengthen the observed
correlations.


By identifying new sources, the World Bank has an opportunity to build on this and other
geospatial-data research efforts to provide timely and granular poverty measurements.
ADDITIONAL RESOURCES
Available on the Team’s Hackpad
o A 2006 NOAA paper on using night-lights and satellites to measure national poverty levels.
They cite and address the measurement challenges that the World Bank faces in particular.
http://www.ngdc.noaa.gov/dmsp/pubs/Poverty_index_20061227_a.pdf
o Additional geospatial data and external resources (a detailed list).
o UN studies on parking lot density and cellphone coverage predicting poverty levels:
o Parking Lot Study
o Cellphone Coverage Study
o Code to convert GeoTIFFs to shapefiles
o Two academic papers created by the team:
o Luminosity
o Lighting Up Poverty
o Interactive Visualizations


SCRAPING WEBSITES TO COLLECT CONSUMPTION AND PRICE DATA
Background and Problem Statement
Collecting detailed food price data is not just important for poverty monitoring, but also critical
for the economic management of a country. In 2009, Kenya had an official inflation rate of 25%
but a lending rate of below 20%. If these numbers were true, banks would have gone
bankrupt. Banks thus had to “guess” inflation levels to set their interest rates.
What’s more, existing food price data often takes a huge amount of time to collect and exists
only at the national level. Differences at the regional level are difficult to come by. In most
developing countries, food makes up the largest share of inflation (often up to half).
This project team, dubbed “Team Ndizi”, decided to supplement the World Bank’s current food
price data by scraping new, more real-time food price data from other sources. Their goal was to
identify if:

• Other data sources could be scraped to create more real-time estimates of food prices and therefore estimate inflation rates and poverty.
• There existed data sources that captured price data at regional and sub-national levels so that prices could be compared across the country.

Datasets Available
The team did not begin with any data but instead scraped a multitude of websites to create
detailed datasets of price information.
Key Findings
Team Ndizi demonstrated how useful scraping can be for the World Bank, collecting food prices
from a number of helpful sources.
As with all other projects from the DC Big Data Exploration, the findings should be considered
provisional, as there are a number of methodological issues that still need to be addressed, e.g.
sample size, selection bias and validity of sources.



• Global food prices could be scraped from humuch.com.
• Daily crop prices in Kenya could be scraped from mobile price providers like mFarm going back 1,000 days. This could be extended to other countries.
• Prices for multiple common goods could be scraped from grocery store sites to create a healthy “food basket.” The team found prices at the national and sub-national level and verified them against cost-of-living sites.
• The team used the Wayback Machine to scrape historical data, and demonstrated that pulling historical rice prices showed evidence of the Indonesian rice crisis before global food prices did.


Methods and Analysis
Team Ndizi identified dozens of websites to be scraped and dozens of sources for official price
data. They proved that it was feasible to scrape global and regional data from existing sources
that could then be used by the World Bank to track price changes, and monitor inflation.
Scraping Global Banana Prices
Team Ndizi first turned their sights on http://www.humuch.com, a global price repository. While
the site allows users to look up specific commodity prices and map them, the data is not in a raw
machine readable format. Team Ndizi trained four of its members to scrape banana prices from
the site and converted the data to a machine-readable format. Figure 11 shows a comparison of
banana prices by continent as a proof of concept.
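The sketch below illustrates the general scraping pattern; it is not Team Ndizi's actual scraper. The URL path and the assumption that prices sit in an HTML table are hypothetical placeholders, and a real scraper would need to match the target site's actual markup and terms of use.

    import csv
    import requests
    from bs4 import BeautifulSoup

    url = "http://www.humuch.com/prices/bananas"  # placeholder path, not the real page structure
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    rows = []
    for tr in soup.select("table tr"):            # assumes prices are listed in an HTML table
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3:                        # e.g. country, price, currency
            rows.append(cells[:3])

    with open("banana_prices.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["country", "price", "currency"])
        writer.writerows(rows)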

Figure 11: A bar graph of banana prices by continent, created using data scraped from
humuch.com.

Figure 10: A line plot of 2012 monthly average dry maize prices in five different regions in
Kenya. This plot was created from data the team scraped from mFarm.


Scraping Farm Data in Kenya
The team next looked to the site mFarm, which forwards crop price information to farmers in
Kenya via cellphone so they can make more informed decisions. By harvesting this data
“exhaust” from the site, the team gleaned monthly price data for a range of crops across regions
of Kenya. This data could be used to follow sudden changes in food prices or as a comparison to
prices in other parts of the world. Figure 12 shows a graph of dry maize prices for five regions in
Kenya across all of 2012. The team also began working on interactive line graphs and maps of
this data so that the World Bank could easily access the data in a readable format.
Scraping Grocery Prices from South Africa
The team also scraped prices of a wide range of goods from Pick ‘n Pay’s grocery store websites
in South Africa. These sites provide food prices for all of their products but, again, they are
locked in the websites and not available for download and analysis. Team Ndizi freed this
data using web scraping and focused on 11 essential food types that contribute to a balanced daily diet.

Figure 12: The daily cost per person for a balanced 2,000 calorie diet across five countries (four
African, one USA). The prices of 11 staple foods are included and visualized proportionally.

With this data, the team could price a typical food basket for someone surviving on a 2,000
calorie diet. They compared these prices across different countries and looked at
breakdowns of how much each good contributed to the total cost of the basket. Figure 13
shows the daily cost per person for a balanced 2,000 calorie diet across four African countries
and the US. Figure 14 shows the cost of a balanced 2,000 calorie diet in South Africa and what
proportion of that cost is attributable to each good. The team also replicated this data for a few
countries in Africa as well as sub-regions within each country, as shown in Figure 15. These
results are significant because they show that, with regular scraping, the World Bank can create real-time measures of food basket prices around the world, even at the sub-national level.
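A minimal sketch of this food-basket costing follows; the items, quantities, and prices are illustrative placeholders rather than the team's scraped values.

    # Grams of each staple needed for a 2,000-calorie day, and scraped prices (US$ per kg).
    daily_grams = {"maize meal": 300, "rice": 150, "beans": 100, "milk": 250,
                   "oil": 30, "sugar": 50, "bread": 200}
    price_per_kg = {"maize meal": 0.80, "rice": 1.10, "beans": 1.60, "milk": 1.00,
                    "oil": 2.50, "sugar": 1.20, "bread": 1.50}

    basket_cost = sum(price_per_kg[item] * grams / 1000.0
                      for item, grams in daily_grams.items())
    shares = {item: price_per_kg[item] * grams / 1000.0 / basket_cost
              for item, grams in daily_grams.items()}

    print(f"Daily cost per person: ${basket_cost:.2f}")
    for item, share in sorted(shares.items(), key=lambda pair: -pair[1]):
        print(f"  {item}: {share:.0%} of basket cost")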


Figure 13: Proportional breakdown of costs for a healthy balanced diet in South Africa.

Perhaps most interestingly, Team Ndizi also validated their price scraping by comparing their
price estimates against known cost of living calculators across the Web. Figure 16 shows Team
Ndizi’s scraped prices for 11 basic food items alongside the prices reported by three well-known
cost of living websites: Numbeo, Xpatulator, and Expatistan. From this plot we can see two
compelling findings: first, the price that Team Ndizi found from scraping Pick ‘n Pay grocery
store websites in South Africa is never more than a few cents away from the average estimate
from the other cost of living sites. Secondly, only Numbeo actually measures all of these items,
while the other sites were lacking at least one product that Ndizi was able to measure.
This data could be used in a number of different ways. Prices could be tracked over time or the
cost of living could be computed using different food products. A full view of food prices across
the country (depending on the coverage of the grocery websites) could be provided to the
World Bank.


Figure 15: Food prices for 11 items that Team Ndizi scraped (orange) compared to prices from
known cost of living websites.

Indonesian Rice Crisis Tracking
Team Ndizi’s final project looked at rice prices in Indonesia over time. The data was scraped
from Carrefour Indonesia, a popular supermarket chain. The team also used the Wayback
Machine to go back to historical versions of the website to collect data. Some experimentation
was done with pulling prices from Twitter data as well, but there was not enough time to create
a full-fledge “universal” scraper from all sources.

Figure 16: World food prices, as reported by major monitoring agencies (green and yellow) vs.
prices of two brands of rice in Indonesia (blue and red).


What was most interesting about the Indonesian price-scraping project was that it proved the
importance of having more real-time food data in anticipating and managing crises. The team
scraped the price of rice per kg for two different brands of rice (Si Pulen Crystal and Topi Koki
Setra Ramos) in Indonesia from January 2010 to March 2013. Figure 16 shows these prices
plotted over time against World Food Prices and the Food and Agriculture Organization’s (FAO)
data. What is striking from this plot is that rice prices in Indonesia rose a good
US$1 per kg above world food prices. This suggests that early signs of the rice crisis could
have been observed in this data but would not have been visible in the global data.
Moreover, the team scraped data that extends beyond estimates that the FAO was able to
provide. FAO data stopped in October 2012.
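The sketch below shows one way to locate historical snapshots using the Internet Archive's public Wayback Machine availability API, in the spirit of the team's approach; the retailer URL is a placeholder, and each snapshot found would still need to be scraped like a live page.

    import requests

    def closest_snapshot(url, timestamp):
        """Return the URL of the archived snapshot closest to YYYYMMDD, if any."""
        resp = requests.get("http://archive.org/wayback/available",
                            params={"url": url, "timestamp": timestamp}, timeout=30)
        snap = resp.json().get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap and snap.get("available") else None

    # Example: monthly snapshots of a (hypothetical) price page during 2012.
    for month in range(1, 13):
        snapshot = closest_snapshot("http://www.carrefour.co.id/", f"2012{month:02d}01")
        print(month, snapshot)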
Recommendations and Next Steps
Team Ndizi showed how easily food price data can be scraped and collected and has provided
code and data to replicate their work. Some areas for future work and more careful analysis
include:
• Using web-scraping techniques, as prototyped in the event, to create granular, and near
real-time measures of food prices at sub-national levels;
• Further examining the difference between scraped and official data as a tool for filling in
gaps where the current measures of price data are aggregated or infrequent. The
Indonesian rice price exploration from the DataDive may serve as an example of the
useful perspective that could be gained from granular, local, and near real-time data;
• More robust comparisons and correlations of the scraped price data against known
economic metrics and historical data. During the event, the team conducted some basic
validation of the scraped price data by comparing their price estimates against known
cost-of-living calculators across the Web. A deeper dive would help the Bank determine
how useful this granular view of price data can be;
• If the methods outlined prove useful, a ‘universal’ scraper could
be built to automate the sampling of price data from sources around the Web.
Additional Resources
The team’s HackPad contains an in-depth list of examples of related projects, the datasets they
collected, and the code needed to recreate their work.


LATIN AMERICA POVERTY ANALYSIS FROM MOBILE SURVEYS
Problem Statement
The World Bank’s poverty team is interested in finding new ways to measure welfare in Latin
American countries. Two pilot surveys of household well-being, called Listening to Latin America
(L2L), used cell phones. In this project, teams used the L2L data from the Peru survey to address
the following question:

Is it possible to draw inferences about changes in welfare at the national level using data
collected with cellphones, or a combination of this data and the national household
survey (ENAHO) data?

Datasets Available
• L2L Survey Data (SPSS Format, also CSV, currently)
• Peru: Initial F2F Questionnaire
• Honduras F2F Questionnaire*
• Peru: Mobile Questionnaire
• Honduras Mobile Questionnaire*
• L2L Final Report
• ENAHO data
*The data was not analyzed at the event.
Key Findings
• More “Yes” answers were given to personal and negative questions when follow-up
surveys were delivered by SMS or phone call than in person. This finding may indicate
that people’s impersonal answers are more honest than their face-to-face answers.
Mobile data collection may therefore be more accurate, in some contexts, than face-to-face surveys.
• Monetary incentives did not appear to influence response rates, regardless of the
technologies involved.
• The data from these surveys provides very rich detail about the Peruvian people and
could be used at a broad level to learn more about socioeconomic conditions in this
country.
As with all other projects from the DC Big Data Exploration, the findings should be considered
provisional, as there are a number of methodological issues that still need to be addressed, e.g.
sample size, selection bias and validity of sources.
Methods and Analysis
Initial Exploration of the Survey Data
The team first dove into understanding the questions asked in each survey, before looking at the
basic statistics of Peruvian respondents in aggregate. The data on the Peruvian people was so


rich that it could (and perhaps should) be the subject of a research project in its own right. With
just a quick look, the team found the following basic statistics:









• Family sizes of approximately three to six members;
• In about half of the cases, neither of the respondent's parents was educated;
• The number of hours of work each respondent reported was normally distributed around a mean of 40 hours with a standard deviation of about 15 hours;
• Most respondents worked in farming and related fields, with small business and housework being next most popular. The remaining occupations comprised the long tail of the rest of the data;
• 40% of respondents have soil floors;
• 8% of respondents had someone in the household lose their job in the last month;
• 8% of respondents had someone in their household find a new job in the last month;
• 7% of respondents had someone in their household miss school for lack of money in the last month;
• 13% of respondents had someone miss school due to sickness in the last month;
• 11% of respondents had someone robbed in their household in the last month;
• 9% of respondents had moved in the last three years.

These findings begin to paint the picture of socioeconomic conditions in Peru. They may already
be known to the World Bank Poverty team, but the DataKind team felt it was worth bringing
these figures up in case any were new or surprising. In either instance, it is quite easy to repeat
this analysis for future survey data.
Response Rate Analysis
The team’s next analysis was designed to confirm the results in the published study, which
focused primarily on response rates in the follow-up studies, and analyze which factors of the
survey correlated with higher and lower response rates.
The team first looked at response rates regardless of follow-up technology. Of the 1,600 people
surveyed, about 1,000 didn't respond to any follow-ups, while the other 600 responded to about
half of the follow-ups. Given that households had agreed to participate in the six-month
survey beforehand, the response rate seemed particularly low to the team. The team wondered
if phone network connectivity was playing a role here and recommends the World Bank explore
rural versus urban response rates.
Effects of Survey Technology
The team next looked at the effects of follow-up technology on response rates. The overall goal
of this study was to evaluate the effectiveness of mobile-phone based surveying. Answering this
question could have far-reaching implications for the Bank’s ability to collect accurate
information with the global reach and ease of using mobile technology.
The data the team was working with contained responses to surveys for an initial survey and
then six follow-up surveys. The first survey was conducted in the traditional way: face-to-face


and in person. The subsequent surveys were performed using one of three methods: human
telephone conversation (CATI), pre-recorded voice mobile phone interview (IVR), and text-based
(SMS).
The breakdown of response rates by technology is shown in the table below:
Method    Response Rate
CATI      50%
SMS       30%
IVR       25%

Here again we see fairly low response rates. The aforementioned question of cellphone
reliability would be important to address. Interestingly, person-to-person interviews yielded the
highest response rates, which may be due to the respondent’s feeling of responsibility to
another person that they do not feel when ignoring a pre-recorded phone call or an SMS.
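A minimal sketch of how these response rates can be computed is shown below, assuming a long-format table with one row per household per follow-up wave; the column names and sample rows are placeholders, not the L2L file's actual fields.

    import pandas as pd

    followups = pd.DataFrame({
        "household": [1, 1, 2, 2, 3, 3],
        "method":    ["CATI", "SMS", "IVR", "CATI", "SMS", "IVR"],
        "responded": [1, 0, 0, 1, 1, 0],
    })

    # Mean of the 0/1 response flag per mode gives the response rate.
    rates = followups.groupby("method")["responded"].mean().sort_values(ascending=False)
    print((rates * 100).round(1).astype(str) + "%")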
Incentives for Responses

Figure 17: Response rates over six months broken out by technology type and monetary
incentive offered (none, $1, $5).

The team also explored whether monetary incentives affected response rates (see Figure 17).
The incentives, in this case, seemed to have little or no impact on response rate. The difference
between no incentive and a small incentive was negligible, and increasing the
incentive from $1 to $5 also appeared to have no impact on the response rate.


The lack of incentive motivation may be particular to this survey, so it is recommended that the
same analysis be performed on the Honduras data to see if similar findings to those in Peru are
observed. Moreover, a more rigorous analysis should be performed to see if there are truly no
statistically significant increases in responses based on incentives. A larger dataset may be
needed to answer this question confidently. While the financial incentives did not appear to
increase response rate on average, the team also thought it would be interesting to explore
whether responses to the survey’s questions about financial hardship affect the impact of the
monetary incentive.
Effects of CATI vs. Technology-Intermediated Follow-ups
The team next looked to see if the presence of humans in the follow-up process (CATI) affected
the responses in ways that purely technological follow-ups (SMS) didn’t. To address this
question, the team looked at the ratio of “No” to “Yes” answers to very personal questions
about negative outcomes, i.e. “did something bad happen to you in the last X months?”
In CATI follow-up surveys, the Yes:No ratio varies between 1:7 and 1:16, indicating that a fairly
low proportion of people report bad incidents happening to them (5.9% to 12.5% ‘yes’
responses). When looking at responses to the identical questions on text-mobile surveys,
however, the ratio of Yes:No answers shifts to the range 1:3 to 1:6 (14%-25% ‘yes’
responses). In other words, about twice as many people report something bad happening to
them via an impersonal SMS follow-up survey than report it in a CATI follow-up.
There are many theories as to why this could be happening, not least of which is that the group
surveyed via mobile may in fact have had more bad things happen to them. The experiment
should be repeated with other respondents to see if the trend is observed again. Other theories
include:

• Selection bias: people who have had a bad event happen to them are more likely to respond to the surveys in the first place.
• Shame/Privacy concern: people under-report bad events happening to them in a face-to-face interview. There is precedent for this behavior: a body of literature on physician illness shows serious under-reporting of depression in in-person interviews vs. anonymous interviews (~1% vs. 15% prevalence).

Researchers should begin by exploring not just the ratio of Yes:No answers, but also the
response rates for different types of questions to determine whether selection bias is at work.
The team was interested in knowing whether respondents who had something bad happen to
them were more likely to respond because they had something to report.
It would also be interesting to repeat this study to see if the same effect is observed for
questions that are considered “neutral” or “good”, e.g. did you find a job recently? A related
question would be whether certain topics of questions (e.g. education, finance, household) were
affected by the type of technology used in the surveys.


Figure 18: Percentage of “Yes” answers at each stage of the process for seven different
questions.

Figure 19: Percentage of “Yes” responses given at each stage of the interview process,
broken out by technology.

The team explored these ideas graphically as well. Figure 18 shows the percentage of “Yes”
answers to each of seven questions in the Peru Mobile Questionnaire, where each line
represents the responses for the face-to-face, first follow-up, and second follow-up phases of
the interviews. From this figure, we see a general upward trend on almost all questions over
time. Most strikingly, the percentage of “Yes” answers increases during follow-up 1, after the
face-to-face interview. The team wondered what could account for such a pronounced increase in “Yes”
answers during the follow-up sessions.
The team dug in to see if talking to a human correlates with an under-reporting of answers. The
team compared the percentage of “Yes” answers given to a very personal and negative question
– whether the respondent had been robbed in the last month – across respondents that
received follow-ups via SMS, IVR, and CATI. The results in Figure 19, which graphs the
percentage of “Yes” responses given during the face-to-face, first follow-up, and second-follow-

stages, broken down by technology, could indicate that the more impersonal the mechanism, the
higher the reported incidence of robbery. We can see that the IVR recipients reported the
highest rate of robbery during their first follow-up, while they seemed about equal to the SMS
group during the face-to-face and second follow-up. What is striking is that the CATI (human-voice interview) responses are consistently lower than those for the other two technologies.
Again, these results could be due to legitimate differences in the groups (e.g. the SMS group, by
chance, really did get robbed more often than the CATI group).
Recommendations and Next Steps
As the data is so rich, the team felt it would be useful to analyze differences in responses to
questions, broken out by demographic groups, e.g. wealth, geographic location, educational
attainment, age, etc. Presuming that the target population is the marginalized and rural poor,
analyses could shed light on ways to modify future mobile surveys to target this population.
Some ideas for deeper analysis of the L2L survey:
• On monetary incentives: the Bank could do a deeper dive into the data to determine
whether or not there is a relationship between the financial questions of the survey and
the impact of the monetary incentive as well as retention in the survey.
• On the use of mobile for follow-up surveys: the team questioned whether wireless
connectivity could play a role in the response rate of the follow-up surveys, especially
where respondents live in rural areas. The Bank should also consider exploring more
factors that may impact survey response rates.
• The analyses in this report should be repeated on the Honduras data and compared to
Peru to see if similar trends emerge.
• Lastly, the World Bank should conduct a similar survey in Peru to see if the results can
be duplicated. There could be good evidence that mobile survey responses are more
reliable and real-time than in-person surveys. The implications of this finding could
mean that the World Bank would not only have access to more people around the world
through mobile surveys than they could reasonably support in-person, but also that the
results would be more accurate and complete than are provided by in-person surveys.
Additional Resources
• Charles F. Turner's work on mode effect on the collection of data regarding sensitive or
risky behavior. In particular, T-ACASI reduces bias in STD Measurements: The National
STD and Behavior Measurement Experiment
• Mick Cooper’s work on mode comparison
• Dr. Edith de Leeuw's research regarding mode comparison
• Eleanor Singer has a comprehensive article on the impact of incentives on response
rates in household surveys.
• CPS is a longitudinal survey that uses mixed methods and might be useful for survey
methodology.
• R code to generate response rates by incentive and technology


MEASURING SOCIOECONOMIC INDICATORS IN ARABIC TWEETS
Background and Problem Statement
Conventional poverty measures are time-consuming and expensive to collect. The World Bank is
interested in exploring alternative data sources for measuring poverty that are easier to collect
and less expensive to update. It was hypothesized that monitoring conversations on Twitter may
shed light on socioeconomic conditions based on what people talk about and attributes of
their conversations. The team’s goal was to research what information could be drawn from these
tweets to inform future work. They explored questions such as:



• Do the frequencies of key socioeconomic keywords (e.g. “broke” or “need money”) change over time and, if so, do those changes reveal anything interesting?
• Can we learn about the social network structure of people tweeting to each other, and does that teach us anything about their socioeconomic conditions?
• What can we learn about someone tweeting from just their text or other aspects of their tweets?
• Can we correlate any of the activity on Twitter with standard poverty indicators?

Datasets Available
• 25GB of Arabic tweets spanning a six-week period from November 2011 to January 2012. The Qatar Computing Research Institute (QCRI) delivered the tweets, which it described as a nationally representative sample. The dataset was so large that the team stored the data in a database on Amazon Web Services and used samples of the data to study their questions.
• An English-to-Arabic translation of key socioeconomic terms.

Key Findings
The World Bank could monitor tweets and other social media channels to potentially learn more
about a range of socioeconomic indicators:



• The team found clear periodic cycles in features of the Twitter data. These could be correlated against existing poverty indicators;
• The team was able to identify a user’s location using only their message text and the times of day they tweeted;
• Gender can likely be detected from language patterns in text and could thus be used as input to socioeconomic modeling;
• It is possible to infer a measure of social connectedness from the network of tweets. This measure could be correlated with socioeconomic conditions.

As with all other projects from the DC Big Data Exploration, the findings should be considered
provisional, as there are a number of methodological issues that still need to be addressed, e.g.
sample size, selection bias and validity of sources.


Methods and Analysis
Keyword Usage
The team began their exploration by looking at the frequency of keywords over time. Using the
English to Arabic translation, the team came up with three categories of terms they would track:
Everyday item terms, economic terms, and positive sentiment terms.
“Everyday” item mentions:
Gasoline
Bread
Rice
Meat
Milk
Butter
Beans
Cigarettes
Car/Auto
“Economic” mentions:
Price
Money
Fees/Bills
Purchasing/Buying
Credit/Loan
Salary/Pension
Work/Job
Rain
Day(s), Week(s), Month(s)

World Bank Group Finances
finances.worldbank.org

‫نيزنبلورتب‬
‫جاصةنوبطةزبخزبخ شيع‬
‫زرأ زر‬
‫موحل ةمحل محل‬
‫نبلبيلح‬
‫ةدبز دبز‬
‫ايبولةيبوللوفايلوصاف‬
‫ورراج رئاجس‬
‫ةيبرع ةرايس‬

‫ةنمثأنمث راعسا رعس‬
‫دوقنلامسولف‬
‫ةرتكافريتاوف تاقفنفيراصم‬
‫ءارتشاءارشعفد‬
‫ةراعإفلسفيلستدصقيديركضرق قوست‬
‫بتارتابترمبترمشاعم‬
‫لغشلمع‬
‫اتشةيرطمتاطقاستراطمأرطم‬
‫روهشنيرهشرهشعيباسانيعوبساعوبسأمايأنيمويموي‬

@WBOpenFinances

30

“Positive sentiment” words
‫ليمج ةفحت ةولح ةعئار ♡ ♥ زاتمم بيط شوخ سيوك ديعس ةطوسبم طوسبم‬

Figure 20: Mentions of everyday terms (red), economic terms (green), and positive terms
(blue) over time.

The team wrote code to count the occurrences of phrases in each category to see if they
changed significantly over time. Figure 20 shows these frequencies. While there may have been
a slight increase in the total number of mentions over time, no significant trend was seen, nor
was it possible for the team to identify sudden increases or decreases of phrases in this graph.
With another poverty indicator, it could be possible to find correlations between the two
datasets.
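A minimal sketch of this keyword tracking is shown below; the category term lists are placeholders that would be populated from the English-to-Arabic translation table above, and the sample tweets are illustrative only.

    from collections import Counter

    # Placeholder terms; in practice these lists would hold the Arabic strings
    # from the translation table above.
    CATEGORIES = {
        "everyday": ["bread_ar", "rice_ar", "milk_ar"],
        "economic": ["price_ar", "money_ar", "salary_ar"],
    }

    def category_counts_by_day(tweets):
        """tweets: iterable of (date_string, text) pairs."""
        counts = {name: Counter() for name in CATEGORIES}
        for date, text in tweets:
            for name, terms in CATEGORIES.items():
                if any(term in text for term in terms):
                    counts[name][date] += 1
        return counts

    # Illustrative usage with placeholder tweets.
    sample = [("2011-12-01", "... bread_ar ..."), ("2011-12-02", "... price_ar money_ar ...")]
    print(category_counts_by_day(sample))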
Next, the team turned to ways of controlling for different variables in the data once an indicator
was determined for future projects. If they could write code to extract features of the tweets,
such as periodic trends in tweet frequency or the gender of a user, those features could help World
Bank researchers correlate the Twitter data with other poverty measures, or serve as blocking
variables in an experiment.


Timing of Tweets

Figure 21: Number of tweets by the day of the week. 28% of tweets occur on the weekends.
The most tweets occur on Monday. All times are in Greenwich Mean Time.

Figure 22: Percentage of tweets by time of day. All times are in Greenwich Mean Time.

Figure 21 shows the frequency of tweets by the day of the week while Figure 22 shows the
percentage of tweets by time of day. The team observed clear cyclic trends in when people were
tweeting, and would therefore be able to account for time of day when performing a real
experiment.


While these timing plots cover the entire dataset, the same analysis could be applied to an individual user to estimate their
likely time zone. Knowing the user’s time zone would allow future researchers to account
for time when running experiments. This data may also be available directly from Twitter.
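A minimal sketch of profiling an individual user's tweet times follows, assuming each tweet carries a UTC timestamp (the timestamps below are placeholders); comparing a user's most active hours against the aggregate daily cycle could suggest an offset and hence a likely time zone.

    from collections import Counter
    from datetime import datetime

    def hourly_profile(timestamps):
        """timestamps: ISO-format UTC strings for a single user's tweets."""
        hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
        total = sum(hours.values())
        return {h: hours.get(h, 0) / total for h in range(24)}

    # Placeholder timestamps for one user.
    profile = hourly_profile(["2011-12-01T18:30:00", "2011-12-01T20:05:00",
                              "2011-12-02T19:45:00", "2011-12-03T19:10:00"])
    peak_hour = max(profile, key=profile.get)
    print(f"most active UTC hour: {peak_hour}")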

Figure 23: Number of mentions of each country, computed by counting country hashtags.

Country References
The team next looked to see if they could determine the origin location of each tweet based on
mentions of countries.
Figure 24 shows the number of times that each major Arab country was mentioned with a
hashtag in a tweet. Bahrain was the most frequently mentioned country, followed by Syria.
Many of the Bahrain tweets referred to the upcoming anniversary of the February 14 protests.

Figure 24: Number of tweets by hour of day. All times are in Greenwich Mean Time.

Identifying the location of a tweet is important because it could help researchers infer the
socioeconomic conditions in that region. In Figure 24, the team also established methods for
determining the origin country of a tweet using only the message text. While Twitter may
provide this information to us automatically, other forms of social media may not. This code can
be adapted to estimate locations of messages so that researchers can account for regional
effects when running experiments.


Gender Identification
Knowing the gender of the author could be very beneficial to understanding one’s
socioeconomic conditions. It could also be another variable to account for when running
experiments. The team identified resources that could be used to infer the gender of the author
of each tweet from the message text alone. Using the suffixes of words in Arabic may be a
straightforward way of determining gender.
Social Connectedness
Research suggests that people who are more socially connected are more affluent than those
who are not. To pursue this idea, the team explored frequency of tweeting and social
connectedness of tweeters in the dataset.

Figure 25: Tweeters grouped by the number of times they tweeted in this three-month
period.

The team first counted the number of times each person tweeted during the three-month
window (Figure 25). Most people tweeted once or twice, with very few people tweeting more.
Two accounts tweeted five and six times, one of which appears to be a news source. With these
results, Bank experts could try to determine the socioeconomic status of the Twitter accounts
involved to see if there is a correlation between tweet frequency and affluence.


Figure 26: A connectedness graph showing Twitter users who mentioned one another. Each
node represents a Twitter user and each line indicates that one of the two users it connects
mentioned the other in a tweet. Larger nodes have more connections.

The team searched for every tweet in which a user “@” mentioned another user, and used
these connections to build the social graph of all of the users. Each node represents a Twitter
user and each line indicates that one of the two users it connects mentioned the other in a
tweet. Larger nodes have more connections. Colors indicate social groups, e.g. nodes connected
with green lines have more friends in common with other nodes connected with green lines
than with nodes connected with red lines.
From this graph, we can see that there are a few large nodes, namely the large red one toward
the upper left corner, which are highly socially connected. It would be interesting to look at
these individual accounts and see if their socioeconomic status can be determined and find a
correlation between connectedness of nodes and socioeconomic conditions.
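A minimal sketch of constructing such a mention graph with the networkx library follows; the tweet records are placeholders, and node degree is used as a simple stand-in for connectedness.

    import re
    import networkx as nx

    # Placeholder (author, text) tweet records.
    tweets = [
        ("user_a", "@user_b how are prices today?"),
        ("user_b", "@user_a @user_c rising again"),
        ("user_c", "@user_b same here"),
    ]

    G = nx.Graph()
    for author, text in tweets:
        for mentioned in re.findall(r"@(\w+)", text):
            G.add_edge(author, mentioned)

    # Node degree serves as a simple proxy for social connectedness.
    for node, degree in sorted(G.degree, key=lambda pair: -pair[1]):
        print(node, degree)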


Recommendations and Next Steps
None of the assumptions about frequency of keywords or locations mentioned can be verified
without a good indicator to measure against. The World Bank experts should identify key
indicators that can be matched with the trends in the data or use their expertise to verify that
patterns in the data track with some known qualitative measure of poverty:

• Subject matter experts in poverty should team with data scientists to help advise on the problem formulation beforehand as well as during the project;
• More detailed information about the tweets should be secured. We suspect the data we received was trimmed of GPS data, free-text locations, and more.

Additional Resources
• The team’s HackPad page
• The team’s project page on GitHub
• Final presentation
• UNGP projects on mining tweets for unemployment and crisis-related issues:
o Study on monitoring crisis and stress (video)
o Study on unemployment statistics
• Additional information about Twitter dataset from Vienna Open Data Day
• QCRI Permission to use data
• Male/Female language differences, from Deborah Tannen
• Stopwords in social signaling
• Kate Niederhoffer
• Jamie Pennebaker's group at UT-Austin


ANALYZING WORLD BANK DATA FOR SIGNS OF FRAUD AND CORRUPTION
Detailed Problem Statement
The World Bank’s Fraud and Corruption team is faced with the weighty task of detecting
individuals and companies that misuse or misappropriate funds on Bank financed projects.
Corruption can occur at almost any stage of the project pipeline, from design to bidding and
final execution. It can be difficult for the Fraud and Corruption team to gain a full view of
potential grievances because data about bidders, contractors, and contracts often live in
different datasets around the Bank and are not consolidated. Moreover, a large amount of data
about bidders and projects exists outside of the Bank, in areas such as project implementation
units, that could be used to gain more insight about the bidders and contractors
involved. The Fraud and Corruption team is often faced with the daunting task of tracking
suspicious companies by hand.
The team explored how they could help strengthen and scale the World Bank’s methods using
new data and analytical tools, focusing on the following main tasks:

• Creating contractor profiles containing external corporate data such as location, chief personnel, date incorporated, etc. Consolidating this information would help identify undisclosed relationships between firms and hopefully lead to a method to discover shell corporations.
• Consolidating existing World Bank datasets and producing datasets from unstructured sources within the Bank. Using this data, the team explored and built tools to highlight contractor behavior and activity, such as bidder relationships.

Datasets Available
• World Bank Project API
• OpenCorporates
• Major Contracts Awarded
• World Bank Project Pages
• Debarment Documents
For detailed lists of datasets used, see the Datasets sections of the team’s two HackPads:
HackPad 1 and HackPad 2.
Key Findings
• Debarment data can be scraped to create a full list of all debarred companies, which can
then be analyzed. The team created ranked lists of countries by number of debarments,
based on debarment type, and changes in debarments over time;
• Using external corporation data, the team was able to measure relationships between
“similar” suppliers. The team built network graphs that showed relationships between
debarred and non-debarred firms that shared similar addresses, phone numbers,
officers, or names;
• The team was able to scrape co-bidder information from the Web and used that data to
build social networks of co-bidders. This code could be used by the Bank to identify
suspicious activities between co-bidders;


• The team proved it was feasible to combine disparate Bank datasets into more unified
supplier profiles. They wrapped this unified data into an API so that the Bank could have
consolidated supplier information;
• The team analyzed project approval trends over time and found increases in the number
of projects approved toward the end of each month and, specifically, in the spring and
early summer. The team did not draw conclusions from this, but it could prove interesting
to explore further.

As with all other projects from the DC Big Data Exploration, the findings should be considered
provisional, as there are a number of methodological issues that still need to be addressed, e.g.
sample size, selection bias and validity of sources.
Methods and Analysis
The major goal of the weekend project was to provide new datasets and algorithms that could
automatically identify organizations, either bidders or contractors, as potential risks to the Bank.
To this end, the team first created a number of datasets that they then analyzed for suspicious
patterns.
Historical Debarment Data
The first task the team tackled was compiling a list of historically debarred firms. With this
list, one could compare incoming bidders and contractors against debarred companies to see if
they share suspicious similarities, e.g. the same address or phone number.
The dataset was compiled with the help of the Wayback Machine, which allowed the team to see
the Bank’s list of debarred firms over time. This approach proved that scraping the Web could be
used to create a constantly updated list of debarred firms. The Bank, however, likely has this
information internally; digitizing it could sidestep the need for this approach.
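As a rough illustration of this approach (not the team’s actual code), the sketch below uses the Wayback Machine’s CDX API to enumerate historical captures of a debarment listing page so that each snapshot can be scraped in turn. The target URL and the parsing step are assumptions for illustration.

```python
# Minimal sketch: list Wayback Machine captures of the Bank's debarred-firms page so each
# snapshot can be scraped for the firm list. The target URL is a hypothetical placeholder.
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"
TARGET = "worldbank.org/debarred-firms"  # hypothetical address of the debarment listing

params = {
    "url": TARGET,
    "output": "json",
    "fl": "timestamp,original",   # just enough to rebuild snapshot URLs
    "filter": "statuscode:200",
    "collapse": "digest",         # skip captures whose content did not change
}
rows = requests.get(CDX_API, params=params, timeout=60).json()

# The first row is a header; the rest are captures, ordered by time.
for timestamp, original in rows[1:]:
    snapshot_url = f"http://web.archive.org/web/{timestamp}/{original}"
    # Each snapshot_url can now be fetched and its firm table parsed (e.g. with BeautifulSoup).
    print(snapshot_url)
```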
Figure 27 shows the average time of debarment by country for firms that are not banned
permanently, color-coded by whether countries are borrowing or non-borrowing. Greece tops
the list for longest debarments and is a non-borrowing country.


Figure 27: Ranked list of countries by number of firms, along with proportions of firms
permanently debarred.

Figure 28 shows countries as they are ranked by percentage of permanently debarred firms.
Here Ireland and the United Arab Emirates top the list, with 100% of debarments in these
countries permanent. However, they each have only a few firms debarred, so this is not
completely surprising. The UK, in contrast, has a higher number of debarred firms than either;
however, only two-thirds of the debarments against UK firms are permanent. These patterns
may be interesting to investigate further.


Figure 28: Ranked list of countries by proportion of permanently debarred firms
vs. temporarily debarred.


The team also looked at the change in the average number of non-permanent debarments in
countries before and after 2006. Figure 29 shows the changes in debarment rates as sloping
lines, colored by whether they’re increasing or decreasing. From this graph, we can see
Bangladesh, China, and the US increased the most between the two time periods, while the UK,
Indonesia, and Sweden decreased the most. Figures like these might be interesting to the Bank
team in understanding where concentrations of debarments are changing over time. These may
be indicative of areas that are becoming more corrupt or that are improving over time.

Figure 29: Changes in number of debarments by country before and after 2006.
Each line connects a country’s debarment number between the two time periods.
Red lines indicate decreases, green lines indicate increases.


Lastly, the team performed a visual analysis of the grounds for debarment for each country. A
Tableau report on grounds for debarment can be viewed here (Windows required), while a full
Tableau report on the debarred data can be downloaded here. Two highlights from the reports
are shown below:

Figure 30: World map of debarments by type in each country.

In Figure 30, the size of each pie chart is proportional to the number of debarred firms. In
Figure 31, there appear to be few consistent trends across countries; each is unique in its
composition of reasons for debarment.

Figure 31: Ranked list of countries by most debarred firms, broken out by reason for
debarment.


Debarred and Non-Debarred Firm Relationships
The team next looked at relationships between debarred and non-debarred firms. To analyze
the similarities between these firms (e.g. similar addresses, phone numbers, names), the team
first had to supplement the contractor and bidder profiles with identifying information.
OpenCorporates, a freely available database of company registrations, was merged with the
Bank’s list of non-debarred firms to add addresses, phone numbers, and officers.
The team now had two lists of firms with identifying information such as address and officers
included: one for debarred firms and one for non-debarred firms. They built a network
visualization to understand the relationships between debarred and non-debarred firms using a
simple matching measure. Firms are represented as nodes: debarred nodes are red and
non-debarred nodes are green. Nodes share an edge if they are considered “similar,” in the sense
that they share an address, a phone number, officers, or a similar spelling of their names. Figure
32 shows one example of a network of connections between a major debarred company (the
large red node on the left) and all other firms. Note the high number of non-debarred firms
connected to it.

Figure 32: Network diagram of connections between debarred (red) and non-debarred (green)
firms. Edges exist between firms if they share common attributes like addresses, phone
numbers, officers, or similar names.

This figure indicates that suspicious relationships may exist between the companies. Further
study should be done on the debarred/non-debarred groups that share edges to understand
why they are linked and what that means.
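A minimal sketch of this kind of “simple matching” between firms follows; the records, field names, and name-similarity threshold are illustrative assumptions rather than the team’s actual schema or code.

```python
# Minimal sketch of the similarity network: firms become nodes, and an edge is added when
# two firms share an address, phone number, or officer, or have very similar names.
from difflib import SequenceMatcher
import networkx as nx

firms = [
    {"id": "F1", "name": "Acme Engineering Ltd", "address": "1 High St", "phone": "555-0100",
     "officers": {"J. Doe"}, "debarred": True},
    {"id": "F2", "name": "ACME Engineering Limited", "address": "1 High St", "phone": "555-0199",
     "officers": {"A. Smith"}, "debarred": False},
    {"id": "F3", "name": "Globex Construction", "address": "9 Low Rd", "phone": "555-0100",
     "officers": {"J. Doe", "B. Lee"}, "debarred": False},
]

def similar(a, b, name_threshold=0.85):
    """True if two firm records share an identifying attribute or have near-identical names."""
    if a["address"] == b["address"] or a["phone"] == b["phone"]:
        return True
    if a["officers"] & b["officers"]:
        return True
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio() >= name_threshold

G = nx.Graph()
for f in firms:
    G.add_node(f["id"], debarred=f["debarred"])
for i, a in enumerate(firms):
    for b in firms[i + 1:]:
        if similar(a, b):
            G.add_edge(a["id"], b["id"])

print(G.edges())  # with this toy data: [('F1', 'F2'), ('F1', 'F3')]
```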
For future work, the team suggested developing an automated system that flags contractors when: a
firm’s geodesic distance to a debarred firm falls below a certain threshold; j of its k nearest
neighbors have debarred histories; or it is classified as a debarred firm by a supervised
clustering algorithm trained on a carefully vetted sample of the data.
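A minimal sketch of the first two proposed rules on a toy similarity graph follows; the thresholds and the graph itself are illustrative assumptions.

```python
# Minimal sketch of the proposed flagging rules on a firm-similarity network.
# Thresholds (max_distance, j, k) and the toy graph are illustrative assumptions.
import networkx as nx

G = nx.Graph()
G.add_nodes_from([("F1", {"debarred": True}), ("F2", {"debarred": False}),
                  ("F3", {"debarred": False}), ("F4", {"debarred": False})])
G.add_edges_from([("F1", "F2"), ("F2", "F3"), ("F3", "F4")])

def flag(node, graph, max_distance=2, j=2, k=3):
    """Flag a firm if it sits close to a debarred firm, or if many of its nearest neighbors are debarred."""
    debarred = [n for n, d in graph.nodes(data=True) if d["debarred"] and n != node]
    lengths = nx.single_source_shortest_path_length(graph, node)
    # Rule 1: geodesic distance to any debarred firm falls below the threshold.
    if any(lengths.get(d, float("inf")) <= max_distance for d in debarred):
        return True
    # Rule 2: at least j of the k nearest firms (by graph distance) have debarred histories.
    nearest = sorted((n for n in lengths if n != node), key=lengths.get)[:k]
    return sum(graph.nodes[n]["debarred"] for n in nearest) >= j
```

With these toy values, F2 and F3 would be flagged by the distance rule, while F4 would not be flagged.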


Supplier Profile Database
Having looked at the debarred companies specifically, the team next turned to building a full
Supplier Profile Database. Before the DC Big Data Exploration, data about suppliers was spread
across different datasets. The team created a unified database enabling users to drill down on
supplier information. This database was formed by combining results from search APIs to obtain
supplier and Bank data. Code to link suppliers with the Bank’s projects can be found here, and
the full database code can be found on Cam Cook’s GitHub page.
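As an illustration of combining results from search APIs, a minimal sketch follows. The endpoint, query parameters, and response fields are assumptions based on the Bank’s public Projects API rather than a record of the team’s code (which is linked above).

```python
# Minimal sketch of pulling project records from the World Bank's public Projects search
# API to seed unified supplier/project profiles. Endpoint, parameters, and field names are
# assumptions for illustration only.
import requests

PROJECTS_API = "https://search.worldbank.org/api/v2/projects"
params = {"format": "json", "countrycode_exact": "NG", "rows": 50}  # hypothetical filter

resp = requests.get(PROJECTS_API, params=params, timeout=60)
resp.raise_for_status()
projects = resp.json().get("projects", {})  # assumed to be a map of project id -> record

# Index the records by id so supplier data scraped elsewhere can be joined against them.
profiles = {pid: {"name": rec.get("project_name"), "country": rec.get("countryname")}
            for pid, rec in projects.items()}
print(len(profiles), "projects indexed")
```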
The Bank can use this project as a framework to develop a tool that can be used by both fraud
and corruption examiners and implementing agencies to analyze existing or potential
contractors. Users can manually examine relationships between firms; if supplemented by data
on debarred firms, examiners could identify relationships to known debarred firms or individuals.
Mapping Bidder Relationships
The team next looked at interesting patterns in the relationships between bidders. To begin, the
team gathered the URLs of all award notices from the Bank. They then scraped these sites to
produce data about the award and about all the bidders involved. They then generated a
network of relationships between bidders. Figure 33 shows networks of co-bidders, where each
node is a firm and groups of nodes all bid on contracts together. The visualization only includes
firms that had bid on three or more contracts. Each edge indicates that the two connected firms
bid on a contract together, with darker, thicker edges indicating more co-bids. The node size is
based on number of bids and the bluer a node is, the more centrally connected it is.

Figure 33: Clusters of common co-bidders. Only companies that bid on three or
more awards are included. Node sizes are proportional to number of bids and
nodes are more blue the more central they are. Edge widths and colors are
proportional to number of co-bids.


From the figure we can see some interesting patterns. G3 appears to be a set of bidders who
have all bid on a contract with one another the same number of times. G1 shows a more
irregular pattern in which one central node co-bids with a few other partners, each of whom
has its own network of co-bidding. Again, a more in-depth analysis of the data needs to be
undertaken to ascertain the relevance and accuracy of these findings. A minimal sketch of how
such a network can be assembled appears below; code and data to build the actual networks can
be found in the Additional Resources section.
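The sketch below assumes a scraped mapping of contracts to bidders; the records, the three-bid cutoff, and the weighting mirror the description above, but the data and variable names are illustrative, not the team’s actual code.

```python
# Minimal sketch of building the co-bidder network from scraped award notices.
from collections import Counter
from itertools import combinations
import networkx as nx

# Each entry: contract id -> firms that bid on it (toy data).
bids = {
    "C1": ["Firm A", "Firm B", "Firm C"],
    "C2": ["Firm A", "Firm B"],
    "C3": ["Firm A", "Firm B", "Firm D"],
    "C4": ["Firm C", "Firm D"],
}

bid_counts = Counter(f for bidders in bids.values() for f in bidders)
keep = {f for f, n in bid_counts.items() if n >= 3}  # only firms with three or more bids

G = nx.Graph()
for bidders in bids.values():
    for a, b in combinations(sorted(set(bidders) & keep), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"] + 1
        G.add_edge(a, b, weight=w)  # edge weight = number of co-bids

for node in G.nodes:
    G.nodes[node]["bids"] = bid_counts[node]   # drives node size in the visualization
centrality = nx.degree_centrality(G)           # drives node color ("bluer" = more central)
print(G.edges(data=True), centrality)
```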
Having shown that mapping bidder relationships is possible, the Bank can extend this method to
analyze the impact of project factors on the bidding process. Adding the debarment data and/or
contractor profiles would greatly help identify whether collusion is likely occurring. Fraud and
Corruption staff could examine the degree of separation between firms; bidding firms with
common addresses, officers, etc. might be involved in collusion.
Project Approval Trends
Lastly, the team examined trends in the project approval process, specifically the number of
approvals made by the World Bank per year. The team acquired data on all projects approved by
the World Bank between 1947 and 2012 and analyzed the trends over time (Figures 34-35).

Figure 34: Total number of project approvals made by the Board per year.


Figure 35: The left chart shows approvals per month, where we see an increase in the number of
approvals toward the end of each month. The right shows the aggregate of all approvals.


Figure 36: Project approvals by month over time. June approvals seem to increase most.

Figure 36 shows the proportion of approvals by month of year over time, along with trend lines
fit by linear regression. Green lines indicate an increase in the number of approvals while red
lines indicate a decrease. We can see that the spring and summer months have been rising in
their proportion of approvals over time, meaning the Bank is approving proportionally more
projects in late spring and early summer.
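A minimal sketch of the trend-line computation, assuming a table of approval dates (the column name and toy data are illustrative):

```python
# Minimal sketch of the trend-line step: compute each month's share of yearly approvals and
# fit a per-month linear trend over the years. Column names and data are illustrative.
import pandas as pd
import numpy as np

approvals = pd.DataFrame({
    "approval_date": pd.to_datetime(["1990-06-15", "1990-07-01", "1991-06-20",
                                     "1991-06-25", "1991-01-10", "1992-06-30"]),
})
approvals["year"] = approvals["approval_date"].dt.year
approvals["month"] = approvals["approval_date"].dt.month

counts = approvals.groupby(["year", "month"]).size().unstack(fill_value=0)
shares = counts.div(counts.sum(axis=1), axis=0)  # proportion of that year's approvals per month

# Slope of a least-squares line per month: positive means that month's share is rising.
slopes = {month: np.polyfit(shares.index, shares[month], 1)[0] for month in shares.columns}
print(slopes)
```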
Recommendations and Next Steps
For future projects on debarment, it would be useful to have the following:
• A chronology of company/individual actions that resulted in debarment;
• For the companies/individuals that have been debarred:
o Did the projects request extensions?
o Did the projects request additional funding?
o Were they “problem” projects?

The Bank needs to create and maintain a formal "data warehouse" of its data that is cleansed,
organized, and well cataloged. The Bank should consider creating unified profiles for:
• Countries
• Suppliers
• Project types
• Projects
• Evaluation types
• Project activities
• Project activity outcome types
• Time periods


To enable a proper, useful data warehouse, rigorous data cleansing and ETL (Extract, Transform
and Load) processes will need to be implemented.
When studying contractor relationships, the team found that there are companies that may not
have the capacity to do the projects and therefore most likely hire subcontractors. Finding
better data for subcontractors and individual consultants could greatly improve the quality of
the results.
The World Bank may want to supply governments with standardized forms or questionnaires to
accompany RFPs, asking about the experience of the individuals who will potentially work on the
projects as well as the backgrounds of management teams. This form should be uniform across
projects and ask specific questions about the capability of the contractors and subcontractors
and their track record on similar projects. In the case of subcontractors, the same form should
be filled out. The Bank may also want to pick a few subcontractors at random and interview them
confidentially to find out more about the work being subcontracted.
Note: Even though the governments are the ones who grant/award projects, the Bank can
provide this form to governments and make it a requirement in the RFPs. The team was not
sure how much the Bank is involved in the RFPs or whether RFPs are standardized.
This weekend’s results suggest a rich set of future projects. High-potential topics include:

• Creating automated algorithms to flag suspicious firms and companies as they come into
the Bank’s pipeline. For example, building on the analysis at the DC Big Data Exploration,
the Bank could test methods of flagging contractors when: a firm’s geodesic distance to a
debarred firm falls below a certain threshold; j of its k nearest neighbors have debarred
histories; or it is classified as an “at risk” firm by a supervised clustering algorithm
trained on a carefully vetted sample of the data.
• Further analyzing the distribution of debarred firms over time using factors such as
country of origin, firm size, and firm industry. The Bank may find it useful to investigate
trends such as locally high concentrations of debarred firms, or concentrations of
certain types of misconduct over time.
• Building a unified set of profiles for major Bank entities (e.g. suppliers, countries)
beyond what the current APIs allow, and developing a tool that can be used by both fraud
and corruption examiners and implementing agencies to analyze existing or potential
contractors. For example: automate the process of cross-checking suppliers against
debarred firms, and alert users to known relationships to debarred firms or individuals.
• A deeper analysis of co-bidder relationships to automatically flag suspicious behavior.
The World Bank can extend this method to analyze the impact of project factors on the
bidding process. Adding the debarment data and/or contractor profiles would greatly help
identify whether collusion is likely occurring. Fraud and Corruption staff could examine
the degree of separation between firms; bidding firms with common addresses, officers,
etc. might be involved in collusion.


• Factoring in new data streams, in the vein of “civil witness” reporting, for understanding
corruption during project execution.

Additional Resources
• Team HackPads
o HackPad 1
o HackPad 2
• API for Supplier Profiles GitHub
• Data visualization of debarred firms and individuals (Excel file)
• Code to generate network graphs of similarities between debarred and non-debarred
firms.
• Code to scrape bidder information
• The cleansed co-bidder data with co-bid groups included
• Python code used to scrape and parse award notices
• Excel file used to create network diagram of co-bidders


UNDP RESOURCE ALLOCATION
Background and Problem Statement
The UNDP Capacity and Performance team was trying to improve UNDP’s ability to fund
development by examining the relationship between its staffing and its expenditure across its
programs.
Over the last several years, UNDP has increasingly focused on measuring and improving its
performance. Its objective is to make sure that all resources that UNDP brings to developing
countries are used as effectively as possible, produce maximum value, and lead to tangible and
sustained improvements in people’s lives. To reach this objective, it is critical for UNDP to be
able to monitor how well its offices are performing, especially in implementing concrete
programs and projects. UNDP must be able to identify and, ideally, predict weaknesses and
potential setbacks, and take timely action to correct course.
The DataKind team joined with UNDP experts to use their data to understand how well their
projects have been performing. The team chose expenditure as the measure of performance for
this analysis. They addressed the following questions:
• Are women or men more likely to work in specific program areas?
• What mix of workforce characteristics is associated with the greatest performance?
• Can workforce characteristics accurately predict a downturn in performance?

Datasets Available
The team compiled a dataset on the UNDP workforce, budget and expenditure from programs
and projects that took place between 2008 and 2012. Each observation in the data represents
an employee, a description of that employee, and the project he or she worked on between
2008 and 2012. Only employees that worked on projects listed in the Budget and Expenditure
data were retained in this dataset. The dataset is available here.
Key Findings
• Looking purely at budget and expenditure data, the team was able to classify UNDP
projects into four broad categories of efficiency and analyze the breakdown of each type
of project by country, region, time, and type.

• Key drivers of efficiency were mostly related to characteristics of the project rather
than of the staff. Of the staff characteristics in the data, the team found that average
years of service, total number of staff, and project recency were indicators of whether a
project was likely to spend more than budgeted.

As with all other projects from the DC Big Data Exploration, the findings should be
considered provisional, as there are a number of methodological issues that still need to be
addressed, e.g. sample size, selection bias and validity of sources.


Methods and Analysis
The team’s first challenge was to define a clear metric for the "success" of a project. UNDP did
not appear to have an internal measure of success, so the team came up with measures that
could be used to define "efficiency." In this analysis, an efficient project spends its budgeted
money without excessive overhead.
To get at these measures, the team combined several datasets. First, they calculated the
amount of overhead spent by country and year by adding up the estimated salaries of personnel
not associated with any program. They then allocated this overhead back to the programs,
proportional to the amount expended on each program.
Second, the team calculated the ratio of expended to budgeted money; ideally a project meets or
exceeds projected figures, while in the worst case it is unable to put its money to use.
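A minimal sketch of these two measures, assuming toy staff and program tables (the column names and records are illustrative, not the team’s actual schema):

```python
# Minimal sketch: allocate non-program ("overhead") staff costs back to programs in
# proportion to expenditure, then compute the expenditure/budget ratio per program.
import pandas as pd

staff = pd.DataFrame({
    "country": ["A", "A", "A"],
    "year": [2012, 2012, 2012],
    "program": ["P1", "P2", None],        # None = not associated with any program
    "salary": [50_000, 30_000, 20_000],
})
programs = pd.DataFrame({
    "country": ["A", "A"], "year": [2012, 2012], "program": ["P1", "P2"],
    "expended": [400_000, 100_000], "budget": [500_000, 80_000],
})

# Overhead = salaries of staff not tied to any program, per country-year.
overhead = (staff[staff["program"].isna()]
            .groupby(["country", "year"])["salary"].sum()
            .rename("overhead").reset_index())
programs = programs.merge(overhead, on=["country", "year"], how="left").fillna({"overhead": 0})

# Allocate overhead proportionally to each program's share of expenditure.
share = programs["expended"] / programs.groupby(["country", "year"])["expended"].transform("sum")
programs["allocated_overhead"] = programs["overhead"] * share

programs["overhead_ratio"] = programs["allocated_overhead"] / programs["expended"]
programs["expenditure_ratio"] = programs["expended"] / programs["budget"]
print(programs[["program", "overhead_ratio", "expenditure_ratio"]])
```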

Figure 37: Plot of (overhead/expended) vs. (expended/budget). Each point is a project, and
large (million dollar or greater) projects are indicated by red dots.

Figure 38: Each point is a project, and large (million dollar or greater) projects are indicated by
red dots. The four regions of project types are shown as well.





The resulting ratios divide projects into four categories:
• Couldn’t spend all their money (yellow)
• Had high overhead (purple)
• Spent more than allocated (red)
• Were near target (green)

These categories were then annotated back into the data files containing the program properties
described below. The team also created an interactive visualization of these ratios by year,
available here.
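A minimal sketch of how such a labeling might be applied follows; the report does not state the exact cutoffs used, so the thresholds below are assumptions.

```python
# Minimal sketch of labeling each project with one of the four categories above, based on
# its (expended/budget) and (overhead/expended) ratios. Cutoff values are assumptions.
def categorize(expenditure_ratio, overhead_ratio,
               low_spend=0.8, over_spend=1.05, high_overhead=0.3):
    if overhead_ratio > high_overhead:
        return "high overhead (purple)"
    if expenditure_ratio > over_spend:
        return "spent more than allocated (red)"
    if expenditure_ratio < low_spend:
        return "couldn't spend all their money (yellow)"
    return "near target (green)"

print(categorize(0.97, 0.10))  # -> near target (green)
print(categorize(0.55, 0.05))  # -> couldn't spend all their money (yellow)
```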
Properties of Programs
The team then sought to understand basic properties of program spending over time. The team
analyzed 324 projects from 2012 and explored their ratio of expenditure to budget. The table
below shows the number of projects that fell into each type of expenditure ratio:
Expenditure Ratio    # of Projects with that Ratio
NA                   45
>1                   25
=1                   11
<1                   206
=0                   24
<0                   13

The largest category was projects with an expenditure ratio <1, meaning that roughly two-thirds
of the projects underspent their budgets. Breaking that category down further, the team
explored what proportions of those projects fell into more specific expenditure bands:
Expenditure Ratio    # of Projects with that Ratio
0.95 – 0.99          47
0.80 – 0.94          62
0.50 – 0.79          53
0.00 – 0.50          44

Here we see that 44 / 312, or about 14% of projects, spent less than half their budget. Dr. Harris,
a consultant to the team, had noted that almost a third of projects in 2012 spent less than 80%
of their budgets. From this table, we can see that the number is 97 / 312, or 31%. This is an
issue that UNDP should explore more thoroughly, as it seems troubling that about one-third of
projects cannot spend their budgets.
Programs by Year
Using updated versions of the staff and program files, the team then studied program
performance by year. Figure 39 shows program expenditure in each of the six regions from 2008
to 2012. This figure shows a large dip in expenditure in the Oceania region from 2009 to 2011;
Oceania spent less than the other regions during these years. It turns out that the majority


of the almost one-third of projects mentioned in Dr. Harris’s summary above are concentrated in
this region. The team conjectured that reshuffling of funds could be one contributing factor.
UNDP should consider comparing other similar programs in the region to help determine
whether UNDP projects overspent or underspent significantly.
Performance Measures of Programs
The team took some initial steps toward identifying indicators for performance in the program
data. Trying to correlate staff expertise with performance proved difficult, as there were so
many projects with 10 or fewer assigned staff.
Turning back to the rate of expenditure, the team then explored the ratio of expenditure to
budget for each individual bureau. For some bureaus, the histogram of expenditure/budget for
their programs peaks sharply near 1, indicating good performance for the majority of projects
coming out of that bureau. In other bureaus, the distribution is more spread out between 0 and
1, indicating large variation in how efficiently funds are disbursed. Deeper analysis of the
bureaus with more varied distributions should be performed. The team also felt it would be
interesting to look at this observation alongside country development data.

Figure 39: Program expenditure by year in each of the six major sites.


Geographic Breakdowns of Program Efficiency
The team also looked at geographic breakdowns of programs. Figure 40 shows the ratio of
expenditure / budget by country. From this figure alone it is not clear if there are surprising
trends, but the team would like UNDP to weigh in on what they see in this plot.

Figure 40: Expenditure of projects by country.

Recommendations and Next Steps
First of all, a better measure of program efficiency and impact would help focus future analysis.
Looking purely at expenditure and budget does not speak to the actual performance of the
project. Much can be learned by looking at the variables that affect program efficiency as
distinct from program performance.
Some of the factors that UNDP could investigate:
• Correlations between staff expertise and project performance using additional personal
information about the team members as variables. Some variables that they might
consider include: education level, ambition, and income level.
• Using more granular data about the projects such as what was accomplished in a given
year, the scope of the project, progress towards outcome metrics.
• The team observed that in some regions the expenditure-to-budget ratio varied more
widely than in others. UNDP could further analyze the variance in expenditure-to-budget
ratios by region.
• The team observed that 1/3 of projects spent less than 80% of their budget. UNDP
should consider comparing similar programs across the regions to identify correlations
across programs (size, sector and project).
Additional Resources
• Team HackPad


ADDITIONAL PROJECTS
There were two additional projects at the DC Big Data Exploration that teams worked on:
• Social networking analysis for risk measurement: Can you forecast project risk using
social networking analysis tools?

• Heuristic auditing of expenditure data: Can you use simple heuristic auditing to sniff out
discrepancies in expenditure data when you have the information but do not know whether it
contains signals about potential fraud- and corruption-related risk?

Because these projects were not set up through DataKind, we unfortunately do not have
detailed information about them. However, the link above leads to the HackPad contributed by
the authors. Their involvement in the event shows that innovation can come from a wide
community of innovators and technologists.
Next Steps
Thanks to the Bank’s willingness to team with DataKind prior to this event, the volunteer teams
were able to deliver a huge amount of work to the Bank and its partners in just a short amount
of time. The major takeaways across all projects were:
• Huge amounts of data exist outside the Bank in the form of mobile, social media, and
open data that must be brought to bear on the Bank’s problems. Data scientists could
be introduced to fill the capacity gap in using and understanding this type of data.
• Greater effort needs to be made within the Bank to reconcile its data across
departments for reusability and advanced analytics. One application could be creating
unified profiles for entities like suppliers.
Events like DataDives, competitions, and startup weekends raise visibility for the Bank, unite the
community and generate ideas; however, this will not lead to sustainable change unless the
Bank commits serious resources to continuing the work and supporting it.
Additional Reading
Blogs that may be relevant:
1. Short recap blog - with links to raw project hackpads
2. Chris Kreutz's recap of the DataDive in Vienna
3. Max Richman on scraping pricing data to measure poverty
4. Francis Gagnon on better data and the power of data visualization
5. Ben Ranoust on using visual analytics to probe risk factors influencing project outcomes
6. Marc Maxson on auditing the world - the sequel
7. Dennis McDonald on learning from data explorations
8. Giulio Quaggiotto and Prasanna Lal Das on personal data philanthropy
9. Milica Begovic, Giulio Quaggiotto, and Ben Ranoust on social networking analysis for
development
10. Giulio Quaggiotto, Anoush Tatevossian and Prasanna Lal Das set the stage.

