
DEVELOPING A MACHINE LEARNING PORTAL

FOR PREDICTING SUKUMA WIKI PRICES

BY

DOMINIC ODHIAMBO OWINO

649285

UNITED STATES INTERNATIONAL UNIVERSITY - AFRICA

SUMMER 2019
DEVELOPING A MACHINE LEARNING PORTAL
FOR PREDICTING SUKUMA WIKI PRICES

BY

DOMINIC ODHIAMBO OWINO

A Project Report Submitted to the School of Science and Technology in Partial Fulfillment of the Requirement for the Degree of Master of Science in Information Systems and Technology (MSc IST)

UNITED STATES INTERNATIONAL UNIVERSITY - AFRICA

SUMMER 2019

STUDENT’S DECLARATION

I, the undersigned, declare that this is my original work and has not been submitted to any
other college, institution or university other than the United States International University
- Africa for academic credit.

Signed:________________________ Date: __________________


Dominic Odhiambo Owino (ID No 649285)

This project report has been presented for examination with my approval as the appointed
supervisor.

Signed: _______________________ Date:___________________


Dr. Sylvester Namuye

Signed: _______________________ Date:___________________


Dean, School of Science and Technology

COPYRIGHT

All rights reserved. No part of this dissertation report may be photocopied, recorded or otherwise reproduced, stored in a retrieval system or transmitted by any electronic or mechanical means without the prior permission of USIU-A or the author.

Dominic Odhiambo Owino © 2019

ABSTRACT

Agriculture forms the backbone of Kenya's economy contributing 30% of the country's
GDP. However, little is known about the prices, supply and demand of fruits and
vegetables. Market price information is a very important agricultural concern which can be
enhanced by big data and machine learning techniques like predictive analytics. This
research work was based on developing an agricultural commodity price prediction
application that uses a machine learning model on appropriate data to facilitate commodity
price market information. The research investigated the applicability of advanced machine
learning techniques through the study of different predictive analytics models for
forecasting agricultural commodity prices with a specific focus on sukuma wiki in Nairobi,
Mombasa and Kisumu Counties.

The forecasting approaches used were regression predictive modelling using linear regression, ensemble techniques using the random forests algorithm, the popular gradient boosting algorithm, and its variant, the XGBoost algorithm. These algorithms were then compared to identify the best-performing algorithm, that is, the one with the lowest Root Mean Squared Error (RMSE) and therefore the most accurate sukuma wiki price prediction.

The research work utilized data-sets from multiple sources, including the National Farmer Information Service (NAFIS) weekly crop information on sukuma wiki, survey results from a CEDIA-Utawala value-chain study conducted in five Kenyan counties, weather conditions such as prevailing temperature and precipitation, and macro-economic data such as the consumer price index and prevailing inflation rates. The methodology involved the collection of relevant data sets, cleaning and preparing the data, training the models, and testing and improving the models with a view to selecting the best-performing model.

The price prediction application was able to provide real-time insights and a visualization of past price trends for sukuma wiki along with forecasts of future trends. This study will assist in building knowledge of commodity prices, demand and supply. It will lead to smarter decision-making and promote markets for farmers, consumers, processors, traders and policy makers.

Keywords: agriculture, machine learning, predictive analytics, sukuma wiki, prices, counties.
ACKNOWLEDGEMENT

I wish to express my greatest gratitude to Almighty God for granting me life, strength,
ability and the opportunity to excel and go through my academic life.

Special acknowledgement goes to my late parents Mr. Alfred Owino Olawo and Mrs.
Milcah Achieng Owino. I could not have reached this far without your support and
encouragement. Thank you very much.

I am eternally grateful to my family for providing me with moral and emotional support throughout my academic life hitherto. I am equally grateful to my friends, colleagues, and United States International University - Africa staff and faculty who have supported me along my academic journey.

Special gratitude goes to my research supervisor, Dr. Sylvester Namuye, for his thorough
advice, guidance, wisdom and patience demonstrated during this research project. I am
equally grateful to Dr. Leah Mutanu for her advice and in-depth knowledge towards the
completion of this research project.

God bless you all.

DEDICATION

I dedicate this dissertation to my family; specifically, my wife Mrs. Joy Nzilani Kilili Odhiambo for her incessant support, patience and encouragement, and my parents Mr. Alfred Owino Olawo, Mrs. Milcah Achieng Owino, Mr. Philemon Ndolo and Mrs. Sarah Ndolo for their belief and investment in my education. I also dedicate it to my daughters Jewel Achieng Odhiambo and Ruby Nkirote Odhiambo.

TABLE OF CONTENTS
STUDENT’S DECLARATION ................................................................................... ii
COPYRIGHT ....................................................................................................................iii
ABSTRACT ....................................................................................................................... iv
ACKNOWLEDGEMENT ................................................................................................. v
DEDICATION................................................................................................................... vi
TABLE OF FIGURES ....................................................................................... ix
LIST OF TABLES ........................................................................................................... xii
LIST OF ABBREVIATIONS ........................................................................................ xiii
Chapter 1: Introduction .................................................................................................... 1

1.1 Background of the study ............................................................................. 1


1.2 Statement of the problem ........................................................................................... 2
1.3 General Objective ...................................................................................................... 2
1.4 Specific Objectives .................................................................................................... 2
1.5 Significance of the study ............................................................................................ 2
1.6 Scope of the study ...................................................................................................... 3
1.7 Definition of terms ..................................................................................................... 4
1.8 Chapter Summary ...................................................................................................... 6
Chapter 2: Literature Review ........................................................................................... 8

2.1 Introduction ................................................................................................................ 8


2.2 Theoretical Foundations ............................................................................................. 9
2.3 Factors and data sources that influence sukuma wiki prices ................................... 11
2.4 Evaluation and selection of the best-performing prediction algorithm .................... 13
2.5 Development and testing of a sukuma wiki price prediction application ................ 22
2.6 Research Approach .................................................................................................. 23
2.7 Chapter Summary .................................................................................................... 25
Chapter 3: Methodology.................................................................................................. 26

3.1 Introduction .............................................................................................................. 26


3.2 Research Design ....................................................................................................... 27
3.3 Data Collection ........................................................................................................ 29
3.3.1 NAFIS Prices ................................................................................................. 30
3.3.2 ASDSP Research Survey Results .................................................................. 30

3.3.3 Weather Data ................................................................................................ 31
3.3.4 Inflation Rates............................................................................................... 32
3.4 Research Procedures ................................................................................................ 32
3.4.1 Data Collection .............................................................................................. 32
3.4.2 Feature engineering ....................................................................................... 34
3.4.3 Model fitting and prediction .......................................................................... 34
3.4.4 Prediction tool development .......................................................................... 37
3.5 Data Analysis ........................................................................................................... 38
3.6 Chapter Summary .................................................................................................... 39
Chapter 4: System Implementation ............................................................................... 40

4.1 Introduction .............................................................................................................. 40


4.2 Analysis .................................................................................................................... 41
4.3 Modeling and Design ............................................................................................... 41
4.3.1 User Scenarios ............................................................................................... 42
4.3.2 Database Schema (ERD) ............................................................................... 42
4.4 Proof of Concept ...................................................................................................... 45
4.4.1 System Architecture ...................................................................................... 46
4.4.2 Process Flow .................................................................................................. 47
4.4.3 System Components ...................................................................................... 52
4.5 Chapter Summary .................................................................................................... 59
Chapter 5: System Performance, Results and Findings ............................................... 60

5.1 Introduction .............................................................................................................. 60


5.2 Factors and data sources that influence sukuma wiki prices ................................... 60
5.3 Evaluation and selection of the best-performing prediction algorithm .................... 72
5.4 Development and testing of sukuma wiki price prediction application ................... 91
5.5 Chapter Summary .................................................................................... 97
Chapter 6: Discussion, Conclusions and Recommendations ....................................... 98

6.1 Introduction .............................................................................................................. 98


6.2 Summary .................................................................................................................. 98
6.2.1 Purpose of the study ...................................................................................... 98
6.2.2 Specific Objectives ........................................................................................ 98
6.2.3 Research Methodology .................................................................................. 98

6.2.4 Major Findings ............................................................................................. 99
6.3 Discussion .............................................................................................................. 100
6.3.1 Factors and data sources that influence sukuma wiki prices ...................... 100
6.3.2 Evaluation and selection of the best-performing prediction algorithm ...... 102
6.3.3 Development and testing of a sukuma wiki price prediction application ... 102
6.4 Conclusions ............................................................................................................ 103
6.4.1 Factors and data sources that influence sukuma wiki prices. ...................... 104
6.4.2 Best-performing algorithm for sukuma wiki price prediction. .................... 105
6.4.3 Development and testing of a sukuma wiki price prediction application .... 105
6.5 Recommendations and Future Work...................................................................... 105
6.5.1 Recommendations for the research .............................................................. 105
6.5.2 Future work.................................................................................................. 107
References ....................................................................................................................... 108

APPENDIX I: Sample Datasets .................................................................................... 112

i. Final Research Data ............................................................................................... 112


ii. Inflation Rates ...................................................................................................... 114
iii. NAFIS Prices - Sample Data ............................................................................... 116
iv. Accu-Weather ...................................................................................................... 117
v. ASDSP Report on Population and Trends ............................................................ 118
vi. ASDSP Price Trends ........................................................................................... 119
APPENDIX II: Sample Machine Learning Model Code ........................................... 120

i. Statistics ................................................................................................................. 120


ii. Model Service....................................................................................................... 120
iii. Data Service ........................................................................................................ 126
iv. Upload Service .................................................................................................... 132
v. Dashboard Service ................................................................................................ 140

TABLE OF FIGURES

Figure 2-1: Types of Machine Learning Algorithms… .................................................... 10


Figure 2-2: Top Prediction Algorithms............................................................................. 15
Figure 2-3: L1 Regularization ........................................................................................... 17
Figure 2-4: L2 Regularization ........................................................................................... 17
Figure 2-5: Comparison of traditional price information vs predictions .......................... 24
Figure 3-1: Research Approach ........................................................................................ 28
Figure 3-2: Regression metrics mathematical formulas ................................................... 36
Figure 3-3: Research data sample selected features ......................................................... 37
Figure 3-4: Research data sample selected features ......................................................... 38
Figure 3-5: Final data sample selected data after encoding the seasons feature ............... 39
Figure 3-6: Research Data Summary Statistics ................................................................ 39
Figure 4-1: Database Schema ........................................................................................... 43
Figure 4-2: Database Schema ........................................................................................... 44
Figure 4-4: System Architecture ....................................................................................... 46
Figure 4-5: System Sequence Diagram............................................................................. 47
Figure 4-6: Training and Test Process Flow ..................................................................... 49
Figure 4-7: Price Prediction Process Flow........................................................................ 51
Figure 4-8: Dashboard Service Class Diagram ................................................................. 54
Figure 4-9: Upload Service Class Diagram ...................................................................... 55
Figure 4-10: Model Service Class Diagram...................................................................... 56
Figure 4-11: Data Service Class Diagram ........................................................................ 58
Figure 5-1: NAFIS Price Trends in Nairobi for the period 2015 to 2018......................... 62
Figure 5-2: NAFIS Price Trends in Mombasa for the period 2017 to 2018 ..................... 62
Figure 5-3: NAFIS Price Trends in Kisumu for the period 2017 to 2018 ........................ 63
Figure 5-4: Average Temperature Trends for the period 2015 to 2018 in Nairobi... ....... 63
Figure 5-5: Average Temperature Trends for the period 2017 to 2018 in Mombasa ....... 64
Figure 5-6: Average Temperature Trends for the period 2017 to 2018 in Kisumu .......... 64
Figure 5-7: Precipitation Trends for the period 2015 to 2018 in Nairobi County ............ 65
Figure 5-8: Precipitation Trends for the period 2017 to 2018 in Mombasa County......... 65
Figure 5-9: Precipitation Trends for the period 2017 to 2018 in Kisumu County ............ 66
Figure 5-10: 12-Month average inflation rates for Kenya from 2015 to 2018 ................. 66

Figure 5-11: Annual average inflation rates for Kenya from 2015 to 2018 ..................... 67
Figure 5-12: Data Histograms ........................................................................................... 68
Figure 5-13: Train_test_split function .............................................................................. 70
Figure 5-14: Feature Importance ...................................................................................... 71
Figure 5-15: Linear Regression Model Predictions .......................................................... 73
Figure 5-16: Random Forests Model Predictions ............................................................. 75
Figure 5-17: Gradient Boosting Model Predictions .......................................................... 76
Figure 5-18: XGBoost Model Predictions ........................................................................ 77
Figure 5-19: Comparison of NAFIS Prices vs Gradient Boosting ................................... 80
Figure 5-20: Comparison of NAFIS Prices vs Linear Regression .................................... 80
Figure 5-21: Comparison of NAFIS Prices vs Random Forests ....................................... 81
Figure 5-22: Comparison of NAFIS Prices vs XGBoost .................................................. 81
Figure 5-23: Hyper-parameter tuning ............................................................................... 83
Figure 5-24: Gradient boosting performance vs number of trees ..................................... 88
Figure 5-25: Gradient boosting performance vs number of trees ..................................... 89
Figure 5-26: Distribution of residuals ............................................................................... 90
Figure 5-27: Portal CSV Upload Screenshot .................................................................... 91
Figure 5-28: Portal Training Data CSV File Upload Screenshot ..................................... 92
Figure 5-29: Portal Prediction Data CSV File Upload Screenshot ................................... 92
Figure 5-30: Portal August to October 2019 Average Predictions Screenshot ................ 93
Figure 5-31: Portal May to October 2019 Predictions Line Graph Screenshot ................ 94
Figure 5-32: Portal NAFIS Prices vs Predicted Prices Trends 2015 to 2019 ................... 94
Figure 5-33: Portal Nairobi NAFIS Prices vs Predicted Price Trends .............................. 95
Figure 5-34: Portal Mombasa NAFIS Prices vs Predicted Price Trends .......................... 95
Figure 5-35: Portal Kisumu NAFIS Prices vs Predicted Price Trends ............................. 96

LIST OF TABLES

Table 4-1: Database Tables ............................................................................................... 45


Table 5-1: Features Averages for Period 2015 to 2018 .................................................... 67
Table 5-2: Correlation Values ........................................................................................... 69
Table 5-3: Feature F-Scores .............................................................................................. 71
Table 5-4: Coefficients for Linear Regression.................................................................. 73
Table 5-6: Performance Metrics for Random Forests....................................................... 76
Table 5-7: Performance Metrics for Gradient Boosting ................................................... 77
Table 5-8: Performance Metrics for XGBoost .................................................................. 78
Table 5-9: Performance Metrics for 500 records .............................................................. 79
Table 5-10: Performance Metrics for 714 Records........................................................... 79
Table 5-11: Results of gradient boosting hyper-parameter tuning ................................... 84
Table 5-11: Results of gradient boosting hyper-parameter tuning (continued) ................ 85
Table 5-11: Results of gradient boosting hyper-parameter tuning (continued) ................ 85
Table 5-12: Results of gradient boosting hyper-parameter tuning on trees ..................... 86
Table 5-12: Results of gradient boosting hyper-parameter tuning on trees (continued) . 87
Table 5-12: Results of gradient boosting hyper-parameter tuning on trees (continued) . 87

LIST OF ABBREVIATIONS

ANN - Artificial Neural Networks

ARIMA - Auto-Regressive Integrated Moving Average algorithm

ASDSP - Agriculture Sector Development Support Program

CART - Classification and Regression Trees

CBK - Central Bank of Kenya

KNBS - Kenya National Bureau of Statistics

MSE - Mean Squared Error

MAE - Mean Absolute Error

NAFIS - National Farmer Information Service

RMSE - Root Mean Squared Error

R-Squared - Explained variation / Total variation. A statistical measure of how close the
data are to the fitted regression line

XGBoost - Extreme Gradient Boosting machine learning model
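As an illustrative sketch (the prices below are made up, not research data), the error metrics listed above can be computed with scikit-learn and NumPy:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual vs predicted sukuma wiki prices (illustrative only)
y_true = np.array([30.0, 35.0, 40.0, 45.0])
y_pred = np.array([32.0, 34.0, 41.0, 44.0])

mse = mean_squared_error(y_true, y_pred)    # average of squared errors
mae = mean_absolute_error(y_true, y_pred)   # average of absolute errors
rmse = np.sqrt(mse)                         # RMSE, in the target's own units
r2 = r2_score(y_true, y_pred)               # 1 - residual variation / total variation
```

Because RMSE is expressed in the same units as the target (here, price), it is a natural metric for comparing price prediction models.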

Chapter 1: Introduction

1.1 Background of the study

Achieving food security through sustainable agriculture is a global priority for the UN over the next 15 years, as enshrined in the "Sustainable Development Goals" (SDG 2). This
means sustainably increasing agricultural productivity, while creating more resilient food
production systems, and shaping more accessible and equitable markets. There are gaps
regarding market access and price knowledge that need to be bridged. To meet these
challenges, growing volumes of data generated by governments, organizations and
individuals can be harnessed and utilized (Wolfert, Ge, Verdouw, & Bogaardt, 2017).
Improved access to and the use of open data at the sub-county, county, national and global
levels holds the potential to transform both long-standing and emerging problems.

This entails finding solutions that benefit stakeholders including farmers, consumers,
traders, processors and government. Some solutions that are already utilized in developed
countries include the use of big data to analyze agricultural information collected over a
long period of time. Artificial intelligence and machine learning are technologies that are
used to analyze these huge datasets in order to make decisions and improve productivity.
Models are trained on agricultural datasets such as weather information, past pricing, crop
seasons, and supply and demand, to learn and assist to predict future prices of agricultural
commodities.

This research work covered the analysis and selection of datasets that influence the supply
and demand of sukuma wiki. It covered the evaluation of four regression algorithms that
perform well in price prediction tasks. These algorithms consisted of linear regression,
random forests, gradient boosting and XGBoost. Finally, it covered the implementation of
a price prediction application for agricultural commodities with a focus on sukuma wiki in
the three major counties of Nairobi, Mombasa and Kisumu. The research work can be
extended to other agricultural commodities such as fruits and vegetables at various levels
including at the sub-county, county, national and even international levels.

1.2 Statement of the problem

There is a lack of organized market information on agricultural commodity prices, both for farmers seeking the best prices for their produce and for consumers, traders and processors seeking real-time market information on agricultural commodities. Predicting vegetable prices is very complex because of their highly perishable and seasonal nature. Empirical work focusing particularly on the supply and demand of agricultural commodities in Kenyan households at county level has not been comprehensive. For example, the price and expenditure elasticities of particular vegetables, like sukuma wiki, maize, and tomatoes, to mention a few, have not been directly examined in Kenya using a complete demand systems approach. Having precise demand elasticity estimates at all levels of the value chain is essential for a more meaningful analysis of prices, consumption needs, supply and demand.

1.3 General Objective

The general objective of this research was to develop a price prediction application for
agricultural commodities at the county level, specifically for sukuma wiki.

1.4 Specific Objectives

i. To identify the features and data sources that influence sukuma wiki prices.
ii. To evaluate different algorithms and select the best-performing algorithm for
sukuma wiki price prediction.
iii. To develop and test a prediction application that utilizes the selected algorithm to
predict and visually display sukuma wiki prices for Nairobi, Mombasa and Kisumu
counties.

1.5 Significance of the study

In a world where data is growing, the analysis of data moves from traditional methods to
"big data" methods such as artificial intelligence, data mining, predictive analytics and
machine learning (Kempenaar et al., 2016). Despite the rising potential of machine
learning, there has not been much effort to utilize these technologies to analyze and
organize data on agriculture in a way that can be useful and easily accessible to the

stakeholders like producers, consumers, traders and processors locally. Most efforts have
been geared towards using traditional statistical methods, infographics and reports to
disseminate information which is not easily accessible to those it is meant to benefit the
most.

The study identified different features from various data sources that influence pricing and
evaluated four machine learning algorithms to identify the best-performing price prediction
model and data features. The algorithms were linear regression, random forests, gradient
boosting and XGBoost algorithms. A web application was developed to display price
information on sukuma wiki in the major Kenyan counties of Nairobi, Mombasa and
Kisumu. This will provide agricultural value-chain stakeholders like farmers, traders, processors, consumers and government with the right information to make the best decisions through access to real-time, predicted price information.

1.6 Scope of the study

The research study was conducted in order to develop a market price information portal for
the sukuma wiki agricultural commodity in the urban counties i.e. Nairobi, Mombasa and
Kisumu. The research work covered the use of machine learning, predictive analytics and
big data to predict sukuma wiki prices in the future based on current and historical data. It
covered the use of various big data features and regression machine learning algorithms
culminating in the development of a price prediction application. The application utilizes
these data features and the machine learning algorithm with the best accuracy.

The data features consisted of price data from the National Farmer Information Service, household data, weather information such as temperature and precipitation, inflation data, and seasonality defining peak, normal and low seasons. The four machine learning algorithms
evaluated were linear regression, random forests, gradient boosting and XGBoost. This
culminated in the development of an application that used the data and algorithms for
sukuma wiki price prediction in the three counties. The secondary data was collected,
cleaned, organized and feature-engineered into 714 data records. The areas covered were the urban counties of Nairobi, Mombasa and Kisumu, which host the largest cities in Kenya. The study covered the period between 2015 and August 2018, when data was collected and used to predict prices for the succeeding three months. The portal implemented

can be expanded to include price prediction for other agricultural commodities like
vegetables and fruits.

1.7 Definition of terms

Algorithm - A machine learning algorithm is the hypothesis set taken at the beginning, before training starts with real-world data. The Linear Regression algorithm, for example, denotes a set of functions sharing the characteristics defined by Linear Regression; from that set, training selects the one function that best fits the training data.

Annual Average Inflation - The percentage change in the annual average consumer price index (CPI) between corresponding months, e.g. July 2018 and July 2019.

Artificial Intelligence - The theory and development of computer applications that are able
to perform tasks that normally require human intelligence. These are tasks such as speech
recognition, image recognition, decision-making, and natural language translation.

Classification - In classification, data is categorized into discrete values or predefined classes. For example, cancer can be classified as either benign or malignant.

Feature - Features are independent variables that act as the input in your system and can
be existing or new features obtained from old features using a method known as ‘feature
engineering’. Prediction models use features to make predictions.

Gradient Boosting – This is a machine learning model suited to regression and classification problems. It produces a prediction model in the form of an ensemble of weak prediction models, namely decision trees. It builds the model iteratively, like other boosting methods, and generalizes by allowing optimization of an arbitrary differentiable loss function.

Label - Labels are the final output. Labeled data means groups of samples that have been tagged with one or more labels.

Linear Regression - A machine learning algorithm based on supervised learning which performs a regression task. Regression models a target prediction value based on independent variables and is mostly used for finding out the relationship between variables and for forecasting. Regression models differ in the kind of relationship assumed between the dependent and independent variables and in the number of independent variables used.

Machine Learning - ML is the scientific study of algorithms and statistical models that
computer systems use to effectively perform a specific task without human intervention or
using explicit instructions, but instead relying on patterns and inference.

Model - A machine learning model is a mathematical representation of a real-world process, generated by providing training data to a machine learning algorithm to learn from.

Overfitting – A model overfits or has high variance if it fits the training data too well and
it’s too specific to generalize to new data. In machine learning, it is important how well the
target function that has been trained using training data generalizes to new data.
Generalization works best if the signal or the sample that is used as the training data has a
high signal to noise ratio. If that is not the case, generalization would be poor leading to
inaccurate predictions.
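To make this definition concrete, the sketch below (using scikit-learn on a synthetic sine-plus-noise dataset chosen purely for illustration) fits an unconstrained decision tree and compares its training and test performance:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine-wave signal plus noise (a moderate signal-to-noise ratio).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training data, noise included.
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("train R^2:", deep.score(X_train, y_train))  # perfect fit on training data
print("test R^2:", deep.score(X_test, y_test))     # noticeably lower on unseen data
```

The gap between the two scores is the high variance that overfitting produces; constraining the tree (for example with a depth limit) narrows it.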

Parameter and Hyper-Parameter - Parameters are variables that can be configured to improve model performance. They are internal to the model as they can be estimated from
the training data. Algorithms possess mechanisms to optimize parameters. Hyper-
parameters, however, cannot be estimated from the training data. Hyper-parameters of a
model are set and tuned depending on a combination of some heuristics and the experience
and domain knowledge of data scientists.

Predictive Analytics - Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modelling, and machine learning that analyze current and
historical facts to make predictions about future or otherwise unknown events.

Random Forests - Random forests or random decision forests are an ensemble learning
method used for regression, classification and other tasks. They operate by building a
multitude of decision trees at training time and giving as output the class that is the mode
of the classes (classification) or mean prediction (regression) of the individual trees. To get
stable and accurate predictions, random forests build multiple decision trees and merge
them together.

Regression - Regression techniques are used when the output is a continuous value predicted from input variables, such as in time series data. This technique involves fitting a line (or curve) to the data.

Regularization - Regularization is a method to estimate a preferred complexity of the
machine learning model so that the model generalizes and the over-fit or under-fit problem
is avoided. This is done by adding a penalty on the different parameters of the model
thereby reducing the freedom of the model.

Target - The target is the output corresponding to the input variables. It could be the individual classes that the input variables may be mapped to in a classification problem, or the output value range in a regression problem. For the training set, the target is the set of training output values.

Training - In machine learning, training means feeding training data to an algorithm. The learning algorithm finds patterns in the training data such that the input parameters correspond to the target. The output of the training process is a machine learning model, which can then be used to make predictions. This process is also called “learning”.

Twelve-Month (12-month) Inflation - Also known as the inflation rate, it is defined as the percentage change in the monthly consumer price index (CPI). For example,
the 12-month inflation rate for July 2019 is the percentage change in the CPI of July 2019
and July 2018.
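The definition amounts to a short calculation; the CPI figures below are hypothetical and used only for illustration:

```python
# 12-month inflation: percentage change in the CPI between the same month of
# consecutive years. These CPI values are made up for illustration.
cpi_jul_2018 = 193.4
cpi_jul_2019 = 204.1

inflation_12m = (cpi_jul_2019 - cpi_jul_2018) / cpi_jul_2018 * 100
print(round(inflation_12m, 2))  # year-on-year percentage change
```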

XGBoost – This is a form of regularized gradient boosting algorithm with high predictive power, supporting both linear models and tree learning. It is roughly ten times faster than standard gradient boosting implementations, reduces overfitting, and has built-in cross-validation at each iteration of the boosting process.

1.8 Chapter Summary

Chapter one introduced the research project that sought to develop and implement a sukuma
wiki price prediction application using machine learning and predictive analytics. The
background of the study was informed by the need to improve market price knowledge on
key agricultural commodities such as fruits and vegetables with a specific focus on sukuma
wiki. The problem identified was a lack of organized market information on agricultural
commodity prices for stakeholders such as farmers, traders, processors, consumers and
government.

The specified objectives were identifying the data features that influence sukuma wiki
prices; evaluating four regression algorithms and selecting the one that gave the best accuracy; and, finally, developing a sukuma wiki price prediction application that utilizes the selected data features and algorithm, with the counties of Nairobi, Mombasa and Kisumu as the case study. The chapter also presented the rationale and significance of the research and, finally, the scope and coverage of the research, which utilizes machine learning and predictive analytics. The next chapter discusses the literature review.

Chapter 2: Literature Review

2.1 Introduction

This chapter discusses related literature on the use of machine learning in agricultural price prediction. It describes how big data technologies such as machine learning and predictive analytics are used in agriculture, introduces the concept of machine learning and its categories of supervised, unsupervised and reinforcement learning, and discusses the importance of machine learning price prediction for the various stakeholders.

The chapter then delves into each of the research objectives for the development of a price
prediction application. Objective one discusses the features and data sources that influence
sukuma wiki prices. Various studies have highlighted features revolving mainly around weather conditions, macro-economic factors, seasonality, events such as politics, supply-demand relationships, and production factors. The section discusses the factors affecting prices globally, with a specific focus on the Kenyan context, and finally highlights the data features utilized in this research work.

Objective two covers studies done on the evaluation and selection of the best-performing prediction algorithm and identifies the algorithms most used for prediction, such as regression and time-series algorithms. It discusses various algorithms in detail, explaining their approach and the formulas they use, how to tune them, and the metrics used to evaluate their performance. The main algorithms covered are linear regression, random forests, gradient boosting and XGBoost. Other algorithms considered were the Auto-Regressive Integrated Moving Average (ARIMA) algorithm and Artificial Neural Networks (ANN), which, however, were not used in the research.

Objective three covers the development of the price prediction application. It highlights the
tools that are used by data scientists and machine learning practitioners for their predictions.
Most of the tools provide generic machine learning capabilities that need expert usage
which have inspired the development of the agricultural price prediction application that
forms the subject of this research.

The research highlights the process of analyzing and evaluating raw data, the training of
various models and the selection of the final model. It also describes the process of
implementing the prediction application.

2.2 Theoretical Foundations

Big data technologies include machine learning, artificial intelligence and predictive
analytics. Machine learning algorithms learn from huge amounts of structured and
unstructured data, e.g. text, images, video, voice, body language, and facial expressions.
Predictive analytics encompasses a variety of statistical techniques from predictive
modeling, machine learning, and data mining that analyze current and historical facts to
make predictions about future or otherwise unknown events. Artificial Intelligence is the
development of computer systems able to perform tasks normally requiring human
intelligence, such as visual perception, speech recognition, decision making, and translation
between languages.

Machine learning is a branch of artificial intelligence that uses statistical techniques to give
computers the ability to learn and progressively improve performance on a specific task
with data without being explicitly programmed (Samuel, 1959). Machine learning explores
the study and construction of algorithms which can learn and make predictions or decisions
based on data. The data may be structured such as in relational databases, semi-structured
such as in JSON or XML format or unstructured such as social media, videos, among
others. Machine learning tasks are typically classified into three broad categories as shown
in Figure 2-1.

Figure 2-1: Types of Machine Learning Algorithms (Source: Cognub.com, 2017)

A lot of literature has been written about the use of big data technologies like predictive
analytics, machine learning and data mining in agriculture. Kaur (2016), Okori, Obua, &
Quinn (2011), Lukyamuzi Andrew, John, & George Washington (2015) and Radhika &
Shashi (2013) described some agricultural applications like crop selection and crop yield
prediction, weather forecasting, smart farming, crop disease prediction, and deciding
minimum support price by the government.

There have been numerous studies across the world to predict food prices in an effort to
improve crop yields, household incomes, food availability, consumption trends, market
access, and supply and demand. The primary function of any market is efficient price discovery for all stakeholders, including consumers on the buy side and traders, processors and farmers on the sell side. Agricultural commodity prices vary across markets, and systems that avail current and future commodity prices, offering the benefit of better prices in both local and other markets, should leverage distributed big data ecosystems (Techwave, 2016). Historical data can be combined with near real-time data to establish the current and future prices of all varieties of crops.

Price prediction given well in advance for agriculture commodities is helpful in many ways.

i. Farmers' planting and harvesting decisions - Price forecasting helps farmers know
prices in advance and take appropriate decisions on what they plant, when they plant
and what markets to target.
ii. Government policy - price forecasting information aids governments in making
decisions and appropriate interventions that facilitate agriculture and farmers as a
whole, with regard to the import and export of foods, price regulation and farmer
extension services.
iii. Market prices - crop yields vary widely based on a multitude of factors such as
weather conditions and crop diseases; forecasting that accounts for these factors
promotes price knowledge of agricultural commodities for better returns.

Thus, it is necessary to provide forecasted price information at every level, including the
county, sub-county and national levels.

2.3 Factors and data sources that influence sukuma wiki prices

Past studies have identified the factors that influence the prices of agricultural commodities as largely related to supply-demand relationships, macro-economics, seasonality, consumer preferences and government agricultural and trade policies. The general price level of an agricultural commodity is influenced by market forces that can alter the current or expected balance in supply and demand. Kretschmer, Bowyer, & Buckwell (2012) studied factors that have led to increases in agricultural commodity prices. These include consumer preferences, end-users' changing needs, factors affecting production processes such as weather, input costs, pests and diseases, alternative substitutes, crop area, government policies, storage and transportation.

Reinhart & Borensztein (2009) and Nelson (2008) highlighted macro-economic determinants of agricultural commodity prices consisting largely of supply-side factors,
demand-side factors and government policies and interventionism. Byrne, Fazio, & Fiess
(2010) and Borychowski & Czyżewski (2016) identified both the supply-side and demand-
side factors that influence agricultural commodity prices. Supply-side factors include
available arable land, degree of technical and biological progress, climate changes and
weather conditions, production costs such as oil prices. Demand-side factors include
population, household sizes, level of economic development, scale of demand,
consumption changes, competition for and alternative uses of land, speculation, trade and
world prices.

Baffes & Dennis (2013) and Borychowski & Czyżewski (2016) also highlighted supply- and demand-side factors that are very significant in price prediction, including government policies, interventionism, business cycles, the country's economic activity and
changes in exchange rates. According to Kretschmer et al. (2012) and Hanson, Robinson,
& Schluter (1993) these also include issues such as globalization, changes in the macro-
economic environment, economic policy, fiscal policy and budgets. Others are monetary
policy affecting interest rates, trade policy in terms of imports and exports, exchange rate
policy and sectoral policies.

A research report in agricultural and applied economics (Tongai Hu, 2012) identified factors leading to the rise of agricultural product prices, mainly comprising tension in the supply-demand relationship, growth in production and circulation costs, and speculation of refugee capital. Tension in the supply-demand relationship covers population growth and a rise in the consumer price index (CPI), leading to higher basic living expenses and inflation, while the number of people engaged in agriculture and labor costs also contribute to rising agricultural prices. Growth in the cost of production factors such as land, labor, seed, fertilizer, pesticide and agricultural machinery raises production costs. Circulation costs covering the processes from farm to plate, such as purchase, transport, packing, loading and unloading, weighing, freight and rentals, and wholesale and retail margins, also lead to price rises.

Food (2017) identified other factors that affect food prices, including pests and diseases that destroy crops and affect livestock production, political and economic situations that push food prices up or down, and seasonality: surplus seasons drive food prices lower, while episodes of extreme temperature such as drought, and higher oil and gas prices, drive food prices up (Ramesh & Vardhan, 2013; Alif, Shukanya, & Afee, 2018).

Cai et al. (2017) in their price prediction model for Gro-Intelligence use multiple data points
such as the crop yields and production datasets, crop monitoring and environmental
conditions datasets from the National Agricultural Statistics Service (NASS) and USDA WASDE reports. Others include favorable temperatures, soil conditions, crop management through the use of fertilizer and irrigation, growing seasons, field and remote sensing data, and USDA forecasts (Cai et al., 2017).

The ASDSP - UNES (2016) survey identified further factors such as market size, demand and supply patterns, buying habits and motives of customers, and trends in production, trading and consumption. Other issues were factors affecting market competitiveness such as quality, distance to market, education levels, and value addition.

Because local data sources are not as adequate and structured as those in the US and other developed countries, this research work focused on only a few of the factors that influence agricultural commodity prices at the local level. These were weather information, i.e. rainfall and temperature, crop seasons, inflation rates, household sizes and total sukuma wiki demand.

2.4 Evaluation and selection of the best-performing prediction algorithm

In an attempt to analyze and predict commodity prices, several machine learning models were considered, including time series models, regression models, neural networks, support vector machines, tree-based models and gradient-boosting models.

Agrawal, Adhikari and Agrawal (2013) defined a time series as a sequence of data sampled over continuous intervals of time with equal spacing between the data points. Time series forecasting models can be created to predict future values using characteristics extracted from observed values. The point in time from which the prediction is made is often referred to as the forecast origin, and the length of the interval between the forecast origin and the point in time being predicted is called the forecast horizon. A prediction is evaluated by calculating its error, in order to make accurate predictions.
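These ideas can be sketched with a naive forecast in NumPy; the price series below is hypothetical and serves only to illustrate the forecast origin, horizon, and error calculation:

```python
import numpy as np

# Hypothetical monthly price series (KES/kg); the values are illustrative only.
prices = np.array([35.0, 36.5, 38.0, 40.0, 37.5, 36.0, 34.5, 36.0, 39.0])

origin = 6            # forecast origin: first index treated as unobserved
horizon = 3           # forecast horizon: number of future steps to predict
actual = prices[origin:origin + horizon]

# Naive forecast: repeat the last observed value over the whole horizon.
forecast = np.repeat(prices[origin - 1], horizon)

# Evaluate the prediction by its error, here root mean squared error (RMSE).
rmse = np.sqrt(np.mean((actual - forecast) ** 2))
print(round(rmse, 2))
```

A real model would replace the naive forecast, but the origin/horizon/error bookkeeping stays the same.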

Predictions for vegetable prices can also be enhanced by combining predictions from different models, which would lead to better price information about commodities in the wholesale markets. For time series forecasting, combinations of different models can give better Root Mean Squared Error (RMSE) accuracy compared to single algorithms. Support Vector Machines (SVMs) and the Auto-Regressive Integrated Moving Average (ARIMA) are some of the concepts which have also been applied to gain better agricultural forecasts using machine learning (Sujjaviriyasup & Pitiruek, 2013).

Ramesh and Vardhan (2013) studied data mining approaches for agricultural intelligence
and showed that crop model and decision tools are increasingly used in agricultural fields
to improve production efficiency. Data mining techniques like classification, neural networks and regression are run against realistic data sets to analyze and make predictions of agricultural crop yield. The study discussed various data mining
methodologies such as association, classification, clustering, regression, neural networks,
fuzzy set and decision tree and Bayesian classification. Currently, there are advanced
machine learning models which are in extensive usage such as linear regression, random
forests, gradient boosting, XGBoost and Artificial Neural Networks.

Artificial Neural Networks such as Long Short-Term Memory (LSTM) networks and Recurrent Neural Networks, typically built with frameworks like TensorFlow, are suited to prediction modelling but require much greater resources in terms of computing power and learning duration. They are mostly used in complex deep learning tasks such as image recognition and pattern recognition. They are very slow to train and require high computing power, often a high-performance computing machine.

Decision trees and a linear model were chosen for the research work as they are easy to
understand and implement as depicted in Figure 2-2. Decision trees use a branching method
to match all possible outcomes of a decision.

Linear regression is an entry-level algorithm which is easy to understand and provides a best line of fit through all data points. It is used extensively for regression tasks involving continuous numerical variables compared against a target variable. Random Forests take the average of many decision trees, each trained on random subsets of the data and features; the method is fast to train and gives high-quality models.

Figure 2-2: Top Prediction Algorithms (Source: Dataiku, Inc 2017)

Gradient Boosting uses the boosting technique whereby it combines a number of weak
learners to form a strong learner. It is a high-performing algorithm as it uses regression
trees as base learners and subsequent trees are built on the errors calculated by the previous
tree. XGBoost is an advanced implementation of the gradient boosting algorithm with high
predictive power. It is 10 times faster than other gradient boosting models with a variety of
regularization techniques to reduce overfitting and improve performance. The four algorithms used, namely linear regression, random forests, gradient boosting and XGBoost, are described as follows.

i. Linear Regression

Linear regression models relationships between a scalar dependent variable y and one or more independent variables denoted X (Parag, 2017; Pavlyshenko, 2016). Simple linear regression has one independent variable, while multiple linear regression uses more than one independent or explanatory variable. For prediction, forecasting, or error-reduction goals, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.

Given a variable y and a number of variables x1, ..., xp that may be related to y, linear
regression analysis can be applied to quantify the strength of the relationship between y and
the xj, to assess which xj may have no relationship with y at all, and to identify which
subsets of the xj contain redundant information about y. Thus, the model takes the form:

yi = xiTβ + εi, for i = 1, ..., n,

where T denotes the transpose, so that xiTβ is the inner product between the vectors xi and β.
The ε is a random error component which measures how far above or below the True
Regression Line (i.e. the line of means) the actual observation of y lies. The mean of ε is
zero.
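An ordinary least squares fit of this model can be sketched with NumPy; the data and coefficients below are synthetic and purely illustrative:

```python
import numpy as np

# Synthetic data generated from y = x'beta + epsilon, with E[epsilon] = 0.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, 2))])  # intercept + 2 features
beta_true = np.array([5.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(0, 0.5, n)

# Ordinary least squares: the beta minimizing the mean squared error.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to the true coefficients [5, 2, -1]

# Given a new x without its accompanying y, the fitted model predicts y.
x_new = np.array([1.0, 4.0, 3.0])
print(x_new @ beta_hat)
```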

Linear regression methods are derived by using different types of regularization: ordinary least squares or linear least squares uses no regularization; ridge regression uses L2 regularization; and Lasso uses L1 regularization. The average loss, or training error, is known as the mean squared error.

Regularization is a machine learning technique used to avoid overfitting by adding a regularization term. It works by biasing model coefficients towards particular values, such as small values near zero. A key property of L1 regularization, also known as Lasso regression, is that it shrinks the less important features' coefficients to exactly zero, removing some features altogether; this works well for feature selection when there is a huge number of features.

The difference between L1 regularization and L2 regularization is that the L1 penalty is the sum of the absolute values of the weights, while the L2 penalty is the sum of the squares of the weights (Chioka, 2013).

Figure 2-3: L1 Regularization (Source: Chioka, 2013)

Figure 2-4: L2 Regularization (Source: Chioka, 2013)

For L1 regularization in Figure 2-3, if λ is zero we get back ordinary least squares, whereas a very large value will make the coefficients zero and the model will under-fit. For L2 regularization, also known as ridge regression and shown in Figure 2-4, a very large λ adds too much weight to the penalty and leads to under-fitting. This technique works very well to avoid the over-fitting issue.
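The contrasting behaviour of the two penalties can be seen with scikit-learn's Lasso and Ridge estimators (λ is called alpha there); the data below is synthetic, with only three of ten features truly relevant, and the alpha value is illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only the first 3 of 10 features matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
beta = np.zeros(10)
beta[:3] = [4.0, -3.0, 2.0]
y = X @ beta + rng.normal(0, 0.5, 300)

lasso = Lasso(alpha=0.5).fit(X, y)  # L1: drives weak coefficients to exactly zero
ridge = Ridge(alpha=0.5).fit(X, y)  # L2: shrinks coefficients, but rarely to zero

print("zeroed by Lasso:", int(np.sum(lasso.coef_ == 0)))
print("zeroed by Ridge:", int(np.sum(ridge.coef_ == 0)))
```

Lasso removes the irrelevant features outright, which is why it is favoured for feature selection; Ridge keeps all features but shrinks them.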

ii. Random Forests

Breiman (2001), Louppe (2014) and Cutler (2010) defined the random forests algorithm as an ensemble of decision trees. It trains a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each
decision tree is a bit different. Combining the predictions from each tree reduces the
variance of the predictions, improving the performance on test data (Yu et al., 2010), (Noi,
Degener, & Kappas, 2017). The randomness injected into the training process includes
bootstrapping by sub-sampling the original dataset on each iteration to get a different
training set and splitting different random subsets of features at each tree node.

Random forests aggregate predictions from their set of decision trees for a new instance. In regression, each tree predicts a real value and the label is predicted to be the average of the tree predictions. Random forests utilize specific parameters, and tuning them can often improve performance. These parameters are:

i. Number of trees in the forest. An increase in the number of trees results in a decrease
in the variance in predictions that leads to an improvement in the test-time accuracy
of the model. Training time also increases linearly with the number of trees.
ii. Maximum depth of each tree in the forest. An increase in depth makes the model
more expressive and powerful although deep trees take longer to train and are also
more prone to overfitting. It is generally acceptable to train deeper trees when using
random forests than when using a single decision tree. One tree is more likely to
over-fit than a random forest due to the variance reduction from averaging multiple
trees in the forest.
iii. The sub-sampling rate. This parameter specifies the size of the dataset used for
training each tree in the forest, as a fraction of the size of the original dataset.
iv. The feature subset strategy: the number of features to use for splitting at each tree
node, specified as a fraction or function of the total number of candidate features.
Decreasing this number speeds up training, but can occasionally impact performance
if set too low.

The last two parameters generally do not require tuning but can be tuned to speed up
training of the model.
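The four parameters above map roughly onto scikit-learn's RandomForestRegressor arguments as sketched below; the dataset and parameter values are illustrative, not tuned:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (400, 5))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 400)  # synthetic regression target

model = RandomForestRegressor(
    n_estimators=200,   # i.   number of trees in the forest
    max_depth=8,        # ii.  maximum depth of each tree
    max_samples=0.8,    # iii. sub-sampling rate (bootstrap fraction per tree)
    max_features=0.6,   # iv.  feature-subset size considered at each split
    random_state=0,
).fit(X, y)

# The regression prediction is the average of the individual trees' predictions.
print(model.predict(X[:3]))
```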

iii. Gradient Boosting and XGBoost

Developed by Chen & Guestrin (2016), the eXtreme Gradient Boosting (XGBoost) model is an implementation of the gradient boosting framework (Santhanam, 2016). The gradient boosting algorithm is a machine learning technique used for building predictive tree-based models. Boosting is an ensemble technique in which new models are added to correct the errors made by existing models; models are added sequentially until no further improvements can be made.

The ensemble technique uses the tree ensemble model which is a set of classification and
regression trees (CART). The ensemble approach is used because a single CART, usually,
does not have a strong predictive power. By using a set of CART (i.e. a tree ensemble
model) a sum of the predictions of multiple trees is considered.

Gradient boosting (Paradkar, 2017) is an approach where new models are created that predict the residuals or errors of prior models, and these are then added together to make the final prediction.

The objective of the XGBoost model is given as:

Obj = L + Ω

Where,

L is the loss function which controls the predictive power, and

Ω is regularization component which controls simplicity and overfitting

The loss function (L) which needs to be optimized can be Root Mean Squared Error for
regression, Logloss for binary classification, or mlogloss for multi-class classification. The
regularization component (Ω) is dependent on the number of leaves and the prediction score
assigned to the leaves in the tree ensemble model.

It is called gradient boosting because it uses a gradient descent algorithm to minimize the
loss when adding new models. The Gradient boosting algorithm supports both regression
and classification predictive modeling problems. eXtreme Gradient Boosting, also called XGBoost, has received rave reviews from machine learning practitioners and researchers alike as having a higher success rate compared to other machine learning models (Paradkar M., 2017).

Other algorithms that have been used in price prediction include Auto-Regressive
Integrated Moving Average and Artificial Neural Networks. Shukla & Jharkharia (2011)
investigated the applicability of ARIMA (Auto-Regressive Integrated Moving Average) models in a wholesale vegetable market to forecast the demand for a vegetable on a daily basis, with the objective of facilitating farmers and wholesalers in effective decision making. ARIMA is a forecasting technique that projects the future values of a series based entirely on its own inertia; it works best for short-term forecasting of stable data that has a consistent pattern over time with a minimal number of outliers. Data on onion sales was collected in the Ahmedabad market in India (Shukla & Jharkharia, 2011) and ARIMA models were applied to forecast demand, validated using data for potato sales in the same market. The forecasted values obtained show that the model is highly efficient in forecasting the demand for vegetables on a day-to-day basis with good accuracy.

ARIMA models are used widely in time-series forecasting and are preferred in the literature for short-term forecasting over artificial intelligence models. It is found that these models can forecast demand with a Mean Absolute Percentage Error (MAPE) in the range of thirty percent. This error is acceptable in the fresh produce market, where demand and prices are highly unstable. The study postulated that there exists a need to forecast the demand for vegetables in the wholesale market to avoid wastage of vegetables, have demand visibility at the customer's end, base harvesting decisions on market demand rather than speculation, and increase the profits of all the stakeholders.
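The MAPE figure quoted above is computed as the mean of the absolute percentage errors; the demand values below are hypothetical and only illustrate the calculation:

```python
import numpy as np

# Hypothetical daily demand figures and forecasts, for illustration only.
actual = np.array([120.0, 95.0, 130.0, 110.0])
forecast = np.array([100.0, 110.0, 118.0, 125.0])

# Mean Absolute Percentage Error, the accuracy measure quoted above.
mape = np.mean(np.abs((actual - forecast) / actual)) * 100
print(round(mape, 1))
```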

ARIMA stands for Auto-Regressive Integrated Moving Average (Nau, 2018). Lags of the stationarized series are called autoregressive terms, lags of the forecast errors are moving average terms, and a time series which needs to be differenced to be made stationary is an integrated version of a stationary series. A non-seasonal ARIMA model is classified as an “ARIMA (p,d,q)” model, where:

p is the number of autoregressive terms,

d is the number of non-seasonal differences needed for stationarity,

q is the number of lagged forecast errors in the prediction equation.
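These components can be illustrated numerically with NumPy. The sketch below simulates an ARIMA(1,1,0)-type series (d = 1, p = 1, q = 0) and recovers the autoregressive coefficient by least squares; a full implementation, such as statsmodels' ARIMA, would estimate all terms jointly, and the simulated data is purely illustrative:

```python
import numpy as np

# Simulate a non-stationary series whose first difference is AR(1): an
# ARIMA(1,1,0)-type process with autoregressive coefficient phi = 0.6.
rng = np.random.default_rng(6)
n = 500
dx = np.zeros(n)
eps = rng.normal(0, 1.0, n)
for t in range(1, n):
    dx[t] = 0.6 * dx[t - 1] + eps[t]
series = np.cumsum(dx)              # integrated once, so d = 1

diff = np.diff(series)              # d = 1: difference once to regain stationarity

# p = 1, q = 0: regress the differenced series on its own first lag.
lagged = diff[:-1].reshape(-1, 1)
phi_hat, *_ = np.linalg.lstsq(lagged, diff[1:], rcond=None)
print(phi_hat[0])                   # estimate of phi, close to 0.6
```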

Gathondu (2014) studied autoregressive models to model and predict the wholesale prices of selected vegetables in Kenya shillings per kilogram. The models used were the Autoregressive Moving Average (ARMA), Vector Autoregressive (VAR) and Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models, and the mixed model of ARMA and
GARCH. The study was conducted using vegetables such as tomato, potato, cabbage, kale and onion for the Nairobi, Mombasa, Kisumu, Eldoret and Nakuru wholesale markets, whose time series data is considered representative of the national average.

Based on the model selection criterion, the best-forecasting ARIMA models found were: potato ARIMA (1,1,0), cabbage ARIMA (2,1,2), tomato ARIMA (3,0,1), onion ARIMA (1,0,0) and kale ARIMA (1,1,0). Furthermore, the mixed ARMA (1,1) and GARCH (1,1) model was also identified as the best model for forecasting, and GARCH (1,1) was selected as the best-fitting model when conditioned on the t-distribution among the four GARCH (p,q) models used and compared.

Frelat et al. (2015) postulated that improving market access is essential in addition to
bridging yield gaps. They calculated that the food availability status for 72% of the
households could be correctly predicted with an Artificial Neural Network (ANN) mini-model using only three explanatory variables: household size, number of livestock, and land area.

Artificial Neural Networks are inspired by the structure and function of biological neural networks, their basis being an interconnected group of artificial neurons. Each connection has a weight, and each neuron has a so-called "transfer function". With the weights and transfer functions, inputs of the model can be mapped, through a complex and highly configurable collection of mathematical functions, to the corresponding output. Finding the right network topology (the number of neurons and their connections), weights and transfer functions is a difficult task that is solved by using additional algorithms that learn these parameters.

Learning is usually done iteratively, starting with an initial guess and converging to an
optimum (local or global). At each iteration a test has to be performed to prevent the model
from overfitting. Overfitting is a phenomenon that occurs when the model's performance is
high on the training data but low on test and/or unseen data. This balance is important to
monitor, since a model that fits the training data perfectly but cannot generalize is of no
use for anything else.
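The train/test check described above can be sketched with a small scikit-learn multi-layer perceptron; the data, network topology and activation ("transfer") function here are illustrative assumptions:

```python
# Sketch: detecting overfitting by comparing performance on training data
# versus held-out test data. The data and topology are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# A small feed-forward network: learned weights plus a "relu" transfer function
net = MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu",
                   max_iter=2000, random_state=0).fit(X_train, y_train)

train_r2 = net.score(X_train, y_train)
test_r2 = net.score(X_test, y_test)
# A large gap between the two scores signals overfitting
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

If the training score is high while the test score is much lower, the network has memorized the training data rather than learned a generalizable mapping.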

Otunaiya & Shittu (2014) studied the household demand system of vegetables in Ogun
State, Nigeria, at finer levels of disaggregation to estimate the price and income
(expenditure) elasticities of demand for the commonly consumed vegetables among
households. The aim was to determine the expenditure, own-price, and cross-price
elasticities of demand of commonly consumed vegetables in the study area using data on
household characteristics, such as age, level of education, marital status, sources of income,
household sizes, household expenditure on various vegetable commodities and the prices
and quantities of the commonly consumed vegetable.

The Nonlinear Quadratic Almost Ideal Demand System (NQAIDS) was used as it had
better forecasting performance due to its flexibility that includes nonlinearities and
interactions with the household-specific characteristics in the utility function.

Time series forecasting methods like the ARIMA and GARCH models are based only on
the historical prices of agricultural commodities, ignoring other factors (Sujjaviriyasup &
Pitiruek, 2013). Therefore, these models no longer work when prices are affected by
non-seasonal factors. Artificial neural network approaches, comprising algorithms like
Long Short-Term Memory (LSTM) and recurrent neural networks, often implemented in
frameworks such as TensorFlow, while suited to prediction modelling, require much
greater resources in terms of computing power and learning duration. As such, the research
work focused mainly on the regression algorithms of linear regression, random forests,
gradient boosting and XGBoost, which perform as well, if not better, in prediction
modelling on the datasets available.

2.5 Development and testing of a sukuma wiki price prediction application

There exists a myriad of browser-based, cloud-based and desktop machine learning tools
for this kind of research work. These include browser-based tools such as Apache Zeppelin,
Jupyter Notebooks and Spark Notebooks, among many others. Desktop tools used by data
scientists over the years include R-Studio, WEKA and SPSS, among others, while major
cloud-based services providing these tools include Google Cloud ML, IBM Spark, Amazon
ML, Intel ML, MapR and Databricks. However, these are mainly general-purpose tools
that serve a wide range of machine learning disciplines and are thus not customized for
agricultural commodity price predictions.

Cai et al. (2017) developed a suite of machine learning-based models that forecast end-of-
season or final corn yields weekly, using data from the US from 2001 to 2016. The data
used in the forecasts ranges from economic data and crop growing cycles to satellite
imagery, weather data and plant health measurements. Three independent tree-ensemble
machine learning algorithms were tested for training and prediction: random forests
(Breiman, 2001), extreme gradient boosting, XGBoost (Pedregosa, Weiss, & Brucher,
2011), and Cubist (Noi et al., 2017).

The yield model is capable of achieving a high level of accuracy months before the
traditional window, with enormous implications for food security, trade and the global
economy in terms of quickly alerting the global community to potential supply shortages.
The multi-level model combines several algorithms and uses an understanding of
physiological processes in temporal feature selection to achieve high precision in
intra-season forecasts. It is able to better predict not only normal years but also anomalous
years that lead to bumper harvests or drastic declines in yields. Gro-Intelligence's full yield
model updates at least every eight days; forecasts are made at the county level, aggregated
to the national level, and compare favorably with the USDA's and others' monthly
releases.

Thus, numerous models have been tested and proven to be able to give accurate predictions
based on several features which will be tested and evaluated in this research work.

2.6 Research Approach

This research work establishes a conceptual framework that uses machine learning to
predict sukuma wiki prices in the three counties of Nairobi, Mombasa and Kisumu based
on historical agricultural-related data. It first identifies the data factors that influence
agricultural commodity prices including weather-related factors such as temperature and
rainfall, population factors such as household sizes and vegetable demand, inflation and
crop seasonality in relation to the periodic government published prices. Then, it evaluates
and tests four regression machine learning algorithms that are used for price prediction.

The relationship between previous agricultural crop prices and the process that forecasts
the future prices of agricultural crops can be described at a fairly general level as depicted
in Figure 2-5.
Figure 2-5: Comparison of traditional price information vs predictions

Agricultural commodity market price information today relies on historical prices,
estimations, and traditional infographics and statistical analysis. The current techniques,
though informative, still lead to price inelasticity where farmers, traders and consumers
cannot get accurate prices and are therefore exploited, even leading to post-harvest losses.
These techniques can be complemented and improved where there is supply and demand
knowledge through the use of machine learning and predictive analytics.

Stakeholders can rely on commodity price forecasts for the next one to three months or
more, derived from historical price trends and appropriate agriculture-related factors. The
existing big data on agriculture can be aggregated, processed and presented in
dashboards or using channels such as SMS, USSD and apps leading to better market price
knowledge. Therefore, the stakeholders in the agricultural value chain can adopt and
continue to use the application due to the benefits and perceived usefulness it accords them.
Farmers are able to be more productive and make smarter decisions on when and what to
plant depending on the demand that there will be. Traders get knowledge of best prices and
where demand is leading to improved profits.

Task Technology Fit (TTF) theory, advanced by Dale and Ronald (1995) and Dishaw et al.
(2004), argues that information system use and performance benefits are attained when an
information system is well-suited to the tasks that must be performed. Continued interest
in the application of TTF theory is therefore expected as it leads to satisfaction and
continuance intention. It highlights the importance of a suitable fit between the
representation of a problem and the tasks that must be performed to solve the problem. The
use of machine learning models fits the task of predicting agricultural commodity prices
due to the myriad of algorithms that can perform time series and regression.

The research targets various stakeholders i.e. farmers, consumers, traders, processors and
government. Farmers want to know the price of sukuma wiki in the near future to influence
their production decisions. They also want to know which locations fetch the best prices,
the best price for their crops and how prices trended in similar periods in the past. Traders
and processors want to gain knowledge on the best prices in the market while consumers
also want to know the cost of their commodities like vegetables and fruits. Government
wants to know agricultural commodity performance, price trends, supply and demand in
order to make planning decisions. The research utilized machine learning to build an
application that fulfills these stakeholder needs, filling a gap in market price knowledge
currently dominated by brokers.

The adoption of machine learning for prediction tasks will result in smarter decisions by all
stakeholders. Farmers will be able to make smart planting and harvesting decisions while
other stakeholders like customers, traders, processors and governments will also have
access to real-time market information to make smart decisions. This in turn leads to better
prices, better fruit and vegetable markets, productivity from smart farming and smart
decision-making.

2.7 Chapter Summary

The chapter has presented the literature review on the research topic. It covered review of
related literature and the conceptual framework based on the study objectives of identifying
features that influence agricultural commodity prices, the evaluation of relevant machine
learning models and the development of a machine learning tool for price prediction. The
research work builds on the existing studies done on food price predictions. These informed
the implementation of a model for the research work. The next section discusses the
research methodology and the proposed approach.

Chapter 3: Methodology

3.1 Introduction

This chapter presents the research methodology used to determine the data features
influencing sukuma wiki prices, the machine learning model selection approach and
implementation of a price prediction portal. The research design discusses the general
approach of the study which focused on modelling a complex process of using machine
learning techniques on prepared data from various sources to extract patterns and trends on
prices. It describes the iterative process flow for machine learning that consists of collecting
relevant data and data pre-processing tasks such as cleaning and preparing the data for
analysis. It also describes the process of model training through splitting of the data into
training and test sets, the evaluation and continuous improvement of the model. The
technological infrastructure used is highlighted and includes the popular scikit-learn
machine learning library.

The process of data collection from various sources that provide weather information,
inflation data, household data and actual prices is discussed. The identified data features
that influence pricing are highlighted. Sources such as the NAFIS weekly published prices,
the study's main dependent variable, are discussed, with the price trends from August 2015
to May 2019 presented. Survey results data obtained from an ASDSP report assessing the
market and behavior of prioritized agricultural commodities are considered. Weather is a
critical factor in crop yield production, and attributes such as average, maximum and
minimum temperatures and precipitation levels are discussed.
The chapter also presents the research procedures followed for data collection, clean-up
and organization, feature engineering, model fitting and prediction and finally prediction
tool development. The models most suited for prediction tasks and selected for training,
testing, evaluation and selection were linear regression, random forests, gradient boosting
and XGBoost. The metrics used for evaluating the accuracy of the models are also
discussed. Finally, the data analysis process using the scikit-learn machine learning library
and jupyter notebook provided some statistics on the data prepared defining such variables
like the mean, standard deviation and count of datasets.

3.2 Research Design

The goal of this research was to develop an application with a suitable machine learning
algorithm for predicting sukuma wiki prices in the urban counties of Nairobi, Kisumu and
Mombasa. The cities of Nairobi, Mombasa and Kisumu were selected for this research due
to the following factors:

i. Price data for sukuma wiki on the National Farmer Information Service website is
readily available for the three counties.
ii. The demand for and consumption of sukuma wiki are very high in urban areas
among consumers and traders.
iii. Urban counties have a bigger population concentrated within smaller areas than
rural areas, which leads to higher demand.
iv. The urban populace has more disposable income and is primarily consumer-
driven.
v. Urban areas have little to no farmland compared to rural areas and thereby provide
the main markets for fruits and vegetables.

The research used big data technologies such as machine learning, feature engineering and
predictive analytics for predicting sukuma wiki prices. The general approach focused on
modelling a complex process using machine learning techniques to explore the large
amount of data, finding correlations that are not apparent with traditional techniques, and
learning its patterns. This involved the design and development of a predictive analytics
engine to be used as a decision support system. Complex machine learning algorithms learn
from multiple data sources, collecting, measuring and aggregating data to extract hidden
patterns. Analytics is a key success factor in creating value out of this data.

Four machine learning models consisting of linear regression, random forests, gradient
boosting and XGBoost were trained using the organized sample dataset retrieved from
various sources, including weather information from services like AccuWeather, price
information from the National Farmers Information Service (NAFIS), the CEDIA-
Utawala survey, and socio-economic data from the Kenya National Bureau of Statistics
and the Central Bank of Kenya. The approach consisted of five steps presented in Figure 3-1.

Figure 3-1: Research Approach

The steps are described as follows:

1. Collecting data: Raw data was gathered from Excel, Access, text files and other
sources. This step of gathering past data formed the foundation of the machine
learning. The better the variety, density and volume of relevant data, the better the
learning prospects for the machine.
2. Pre-processing or preparing the data: Any analytical process thrives on the quality
of the data used. One needs to spend time determining the quality of data and then
taking steps for fixing issues such as missing data and treatment of outliers.
Exploratory analysis was done to study the nuances of the data in details thereby
enhancing the useful content of the data. This process involved pre-processing of
the training data, including data inspection, data cleaning, initial parameter
determination and the development of ontologies as a way of making a conceptual
model of the data.
3. Training a model: This step involved choosing the appropriate algorithm and
representation of data in the form of the model. The cleaned data was split into two
parts – train and test (proportion depended on the prerequisites); the first part
(training data) was used for developing the model. The second part (test data), was
used as a reference. The dataset was divided and the model set up.
4. Evaluating the model: To test the accuracy, the second part of the data (holdout /
test data) was used. This step determines the precision of the choice of algorithm
based on the outcome. A better test of a model's accuracy is to see its performance
on data that was not used at all during model building.

5. Improving the performance: This step involved choosing a different model
altogether or introducing more variables to augment the efficiency; this is why a
significant amount of time was spent on data collection and preparation. Test data
analysis was done to fine-tune the model.

The technological infrastructure used were:

i. The Scikit-Learn Machine Learning Library
ii. Jupyter Browser-based Notebook for Machine Learning
iii. MySQL Database
iv. Spring Boot Framework
v. Angular Frontend User Interface
vi. Charts.js for visualization

3.3 Data Collection

The data collected was primarily secondary data from online sources, aggregated and used
for the research work. The data sources included weather pattern information from the
AccuWeather website, which provides weather information for various cities around the
world. The weather information available includes maximum, minimum and average
temperatures per day, rainfall amounts, humidity and wind speed.

Other data sources included inflation data, primarily the annual average inflation and
twelve-month inflation retrieved from the Central Bank of Kenya and Kenya National
Bureau of Statistics websites, crop availability seasonality, and population and household
information from an ASDSP survey study on market demand and supply conducted in the
counties of Nairobi, Mombasa, Kisumu, Meru and Nandi. The features or variables
identified for the model development were:

i. Year
ii. Month
iii. Temperatures – average, highs and lows
iv. Precipitation or rainfall amount (mm)
v. Seasonality - peak, normal and low seasons
vi. Household sizes
vii. Total household demand in kilograms
viii. Annual average inflation
ix. Twelve-month inflation
x. NAFIS crop prices
xi. NAFIS crop price dates
xii. NAFIS commodity variety and unit sizes

The features as summarized in Appendix I (i) identified for the machine learning model
were sourced from the following data sources.

3.3.1 NAFIS Prices

The National Farmers Information Service (NAFIS) is a comprehensive information
service intended to serve farmers' needs throughout the country, including the rural areas
where internet access is limited. It provides agriculture and livestock information for
Kenyan farmers in partnership with the Kenya Government, the Meteorological
Department, the Agricultural Sector Development Support Programme (ASDSP), donor
organizations and related agriculture departments.

Agricultural commodities prices are published on the NAFIS site every week based on price
data collected in the various county markets throughout the country. Information on
commodities like maize, beans, potatoes, tomatoes, cabbages, among others in various
towns like Nairobi, Mombasa, Kisumu, Nakuru, Eldoret, among other towns is published.
The NAFIS service (NAFIS, 2018) publishes weekly in various Kenyan counties the
average, maximum and minimum prices for a 50 kilogram bag of sukuma wiki, also known
as kales as listed in Appendix I (iii).

3.3.2 ASDSP Research Survey Results

Survey data was obtained from a pilot market research study on prioritized agricultural
commodities, facilitated by the Agriculture Sector Development Support Programme
(ASDSP) in partnership with UNES and CEDIA Utawala, and conducted in 2016 in the
five counties of Nairobi, Mombasa, Kisumu, Nandi and Meru. The purpose was to assess
the market and behavior of prioritized agricultural commodities at the county, country,
regional and international levels and to support the development of the 29 prioritized
agricultural commodities in the 47 counties of Kenya, as listed in Appendix I (v) and
Appendix I (vi).

The exercise took place in the five counties and covered the wards in each of these counties
with a target population of producers, consumers, traders and processors. The research
study used a mix of both qualitative and quantitative research methods and techniques as
well as semi-structured questionnaires using GIS enabled tablets, focus group discussions
and key informant interviews. The objectives of the market survey were:

i. To determine the size of the market.
ii. To determine the pattern of the demand and supply.
iii. To establish the buying habits and motives.
iv. To establish the past and present trends for agricultural products.
v. To identify issues that impede market competitiveness and develop mitigation
measures.

The survey data that was used as datasets in the machine learning model were:

i. Seasonality - Peak, Normal and Low Seasons
ii. Consumer demand
iii. Household Sizes

The average selling price for a bunch, which may be composed of between 3 and 8 leaves,
as analyzed from the survey, ranged between 10 and 11 shillings per unit. The highest
price, of about 26 Kenya shillings, was noted in December 2015.

3.3.3 Weather Data

Weather and climate are inextricably linked to agricultural production and are arguably the
single most important factor in determining the supply of food products. Weather
forecasting enables farmers to manage the whole value chain, from land preparation and
planting to the management of crops and harvesting, under appropriate and optimum
conditions for increasing the productivity of the crops. Hence weather forecasting is very
important for farmers to make critical decisions on managing their crops properly and
improving their productivity.
The weather data was sourced from the AccuWeather site (Accu-Weather, 2018) as shown
in Appendix I (iv), which provides temperature and precipitation data. The average,
maximum and minimum temperatures and precipitation amounts were captured on a daily
basis. They were collected, cleaned, organized and used as independent variables in the
machine learning model.

3.3.4 Inflation Rates

Macro-economic factors like the inflation rates published by the Central Bank of Kenya
(CBK, 2018) and the Kenya National Bureau of Statistics during the period under review
were considered. Inflation, measured through the consumer price index (CPI), reflects
changes in the price level of a market basket of consumer goods and services purchased by
households. The CPI is a statistical estimate constructed using the prices of a sample of
representative items whose prices are collected periodically. The annual average and
twelve-month inflation rates were used as independent variables in the machine learning
dataset, as inflation affects food prices in the country, as listed in Appendix I (ii).

3.4 Research Procedures

The research procedure involved the following tasks.

i. Data collection, clean-up and organization
ii. Feature engineering
iii. Model fitting and prediction
iv. Development and implementation of price prediction application

3.4.1 Data Collection

Data was collected from both online sources and from an ASDSP research study. The data
was then cleaned and organized into a csv file for analysis, modelling and price prediction.
The data was obtained from the following sources:

i. NAFIS - National Farmer Information Service

The National Farmer Information Service (NAFIS) publishes agricultural commodity
prices weekly on their website, www.nafis.co.ke/marketinfomation/. This data is sourced
from markets country-wide in various counties in Kenya. The study used commodity price
datasets from the 3 main urban counties of Nairobi, Mombasa and Kisumu for a fifty-
kilogram (50 kg) bag of sukuma wiki, or kales as it is also known. This was for the period
between the years 2015 and 2018.

ii. Weather (Accu-weather)

The accu-weather website (https://www.accuweather.com/en/ke/kenya-weather) provides
weather forecasts for cities and countries worldwide. The weather is collected from weather
stations world-wide and contains past weather conditions as well as future predictions of
up-to ninety (90) days. The weather conditions published include daily high, low and
average temperatures, precipitation or rainfall amounts, rainfall duration, wind speed,
humidity levels, max UV levels, among others on an hourly basis and at various times of
the day. The research study retrieved these daily weather conditions from the website
coinciding with the NAFIS price publication dates.
iii. Inflation (CBK and KNBS)

Inflation data was retrieved from the Central Bank of Kenya and the Kenya National Bureau
of Statistics websites which provide macro-economic data like the annual average inflation
and twelve-month inflation for every month. The 12-month inflation, which is normally
considered as the inflation rate, is defined as the percentage change in the monthly
consumer price index (CPI) over twelve months. For example, the 12-month inflation rate
for November 2017 was the percentage change between the CPI of November 2017 and
that of November 2016. The annual average inflation is the percentage change in the annual
average CPI between the corresponding months, e.g. November 2017 and November 2016.
The inflation levels were collected for the years 2015 to 2018 as published on the two
websites.
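As a worked example with hypothetical CPI values (not actual KNBS figures), the 12-month inflation rate can be computed as follows:

```python
# Hypothetical CPI values used only to illustrate the calculation.
cpi_nov_2016 = 174.5
cpi_nov_2017 = 182.6

# 12-month inflation: percentage change in the CPI over the same month a year earlier
twelve_month_inflation = (cpi_nov_2017 - cpi_nov_2016) / cpi_nov_2016 * 100
print(f"12-month inflation: {twelve_month_inflation:.2f}%")
```

The annual average inflation follows the same formula, applied to the averages of the twelve monthly CPI values ending in the corresponding months.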

iv. ASDSP Report - Questionnaires, Interviews

Data from the ASDSP market research report geared towards understanding the market
situation of prioritized agro value chains at the county and sub-county levels was also used.
The ASDSP research study was conducted as a pilot market research study in five (5) of
the forty-seven (47) counties in Kenya during the period from July 2015 to August 2016.
The research conducted in Nairobi, Mombasa, Kisumu, Nandi and Meru used a mix of
both qualitative and quantitative research methods and techniques, such as structured
questionnaires and interviews, on the various actor categories comprising farmers,
consumers, traders and processors. The ASDSP data had been collected mainly through interviews,
literature review and secondary information, and physical observations. The research study
used a subset of this data that included the household sizes and vegetable demand for each
county, crop seasons and prices per unit.

The data collected from the different sources was then cleaned and organized together into
a comma-separated values (CSV) file for analysis, aligned with the NAFIS prices, which
form the dependent variable used for training the price prediction model.

3.4.2 Feature engineering

The data collected and organized was pre-processed using the scikit-learn machine
learning library. This involved dropping records with missing or null values, as well as
features not needed for the modelling. It also involved one-hot encoding categorical (non-
numerical) values into their numerical equivalents, thus creating new features, and scaling
the data using a standard scaler.
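These steps can be sketched with pandas and scikit-learn; the column names and values below are illustrative, not the exact dataset schema:

```python
# Sketch of the pre-processing steps with pandas and scikit-learn; the column
# names and values are illustrative, not the exact dataset schema.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "avg_temp": [22.1, 23.4, None, 21.8],
    "rainfall_mm": [5.0, 0.0, 12.3, 3.1],
    "season": ["peak", "low", "normal", "peak"],
    "price": [1800, 2100, 1950, 1700],
})

df = df.dropna()                              # drop records with missing values
df = pd.get_dummies(df, columns=["season"])   # one-hot encode categorical values

features = df.drop(columns=["price"])         # "price" is the dependent variable
scaled = StandardScaler().fit_transform(features)  # scale with a standard scaler
print(features.columns.tolist())
print(scaled.shape)
```

Note how one-hot encoding creates a new binary column per category, which is the "creating new features" effect described above.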

3.4.3 Model fitting and prediction

The task required the selection, evaluation and use of machine learning models suited for
predictive analytics: regression algorithms, a bagging algorithm and algorithms from the
decision tree family. The Jupyter data science notebook and the scikit-learn machine
learning library were utilized for this task. The following machine learning models were
tested, optimized, evaluated and compared to find the best performing model that gave the
best accuracy on the datasets specified.

i. Linear Regression

ii. Random Forests

iii. Gradient Boosting

iv. XGBoost

This modelling followed the initial exploratory analysis done on past data and involved
testing and evaluating the performance accuracy of each model on the dataset. The data
was divided into both training and test datasets in the ratio of 67 percent training data
against 33 percent test data. Training involved fitting the training data into the model and
predicting the sukuma wiki prices against the NAFIS published prices.
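A minimal sketch of this training step, using synthetic data in place of the real dataset (XGBoost is shown commented out since it is installed separately from scikit-learn):

```python
# Sketch of the 67/33 split and fitting of the candidate regressors on
# synthetic data. XGBoost is commented out as it is installed separately.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(0, 0.5, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(random_state=42),
    "gradient boosting": GradientBoostingRegressor(random_state=42),
    # "xgboost": xgboost.XGBRegressor(random_state=42),  # if installed
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test R^2 = {model.score(X_test, y_test):.3f}")
```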

The test data was then used to evaluate the accuracy based on metrics such as root mean
squared error, r-squared and mean absolute error. The model that produced the best
accuracy was then hyper-parameter tuned and tested against new data representing future
dates. This model was finally used in building a price prediction portal. The performance
was measured against the following criteria:

i. Root Mean Squared Error (RMSE) — measures how close the data is to the fitted
regression line.
ii. Mean Squared Error (MSE) — measures the average of the squares of the errors or
deviations i.e., the difference between the estimator and what is estimated.
iii. Mean Absolute Error (MAE) — measures the average of the difference between the
original values and the predicted values. It gives a measure of how far the
predictions were from the actual output.
iv. R-Squared, also known as the coefficient of determination, varies between 0 and 100
percent and is the proportion of the variance in the dependent variable that is
predictable from the independent variable(s). It is also defined as "(total
variance explained by model) / total variance." At 100 percent, the model explains
all of the variance, with no unexplained variance at all. A low value represents a
low level of correlation, meaning the regression model performs poorly.
The above metrics can be expressed mathematically as in Figure 3-2:

Figure 3-2: Regression metrics mathematical formulas (Source: DataTechNotes, 2019)
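The same metrics can be computed with scikit-learn; the actual and predicted prices below are illustrative values only:

```python
# The evaluation metrics computed with scikit-learn on illustrative price values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1800.0, 2100.0, 1950.0, 1700.0])  # illustrative actual prices
y_pred = np.array([1750.0, 2050.0, 2000.0, 1780.0])  # illustrative predictions

mse = mean_squared_error(y_true, y_pred)   # average of squared errors
rmse = np.sqrt(mse)                        # root mean squared error
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
r2 = r2_score(y_true, y_pred)              # variance explained (0.85 = 85 percent)
print(f"MSE={mse:.1f}  RMSE={rmse:.2f}  MAE={mae:.1f}  R^2={r2:.3f}")
```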

Once the model was selected, it was optimized for the prediction problem by tuning the
model hyper-parameters. Hyper-parameters are settings, such as the number of trees, which
affect model performance by altering the balance between under-fitting and over-fitting.
Neither an under-fit nor an over-fit model generalizes well to the test data. The research
implemented random search with cross validation.

i. Random Search is a technique used to select hyper-parameters where a grid is
defined and then different combinations are randomly sampled.

ii. Cross Validation is a technique used to evaluate a selected combination of hyper-
parameters. K-Fold Cross Validation involves dividing the training data into K
folds, then performing an iterative process where the model is trained on K-1 of the
folds and its performance is evaluated on the Kth fold. This process is repeated
K times and, at the end of K-fold cross validation, the average error over the K
iterations is taken as the final performance measure. K-Fold Cross Validation with
K = 5 is as shown in Figure 3-3.
Figure 3-3: K-Fold Cross Validation with K = 5 (Source: Bettyty, 2016)

The entire process of performing random search with cross validation was:
1. Set up a grid of hyper-parameters to evaluate
2. Randomly sample a combination of hyper-parameters
3. Create a model with the selected combination
4. Evaluate the model using K-fold cross validation
5. Decide which hyper-parameters worked the best
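The five steps above can be sketched with scikit-learn's RandomizedSearchCV, which combines the random sampling of hyper-parameter combinations with K-fold cross validation; the grid values and data are illustrative:

```python
# Sketch of random search with 5-fold cross validation over an illustrative
# hyper-parameter grid for a random forest; the data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = 2 * X[:, 0] + rng.normal(0, 0.3, 150)

param_grid = {
    "n_estimators": [50, 100, 200],     # number of trees
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=5,        # randomly sample 5 hyper-parameter combinations
    cv=5,            # evaluate each with K-fold cross validation, K = 5
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)  # the combination that worked best
```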

3.4.4 Prediction tool development

An agricultural commodity price prediction portal was developed. The portal has a
dashboard for displaying prices to users, an upload area for uploading data in CSV format
and a web service for consuming the deployed machine learning model. The portal was
developed using appropriate frameworks: Angular for the front-end visualizations and
upload capabilities, Spring Boot for the back-end APIs, a Flask API for serving the
machine learning model and a MySQL database to store data. An agile software development
approach consisting of requirements analysis, database design, system design,
implementation, testing and deployment was used. The web portal is a hosted solution for
easier access for users such as administrators, farmers, consumers, traders, processors,
among other stakeholders.
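A minimal sketch of the Flask web service serving a trained model, assuming a hypothetical /predict endpoint and a saved model file (shown commented out):

```python
# Minimal sketch of a Flask web service exposing the trained model; the
# endpoint name, input features and model file name are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)
# model = pickle.load(open("price_model.pkl", "rb"))  # hypothetical saved model

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()  # e.g. {"avg_temp": 22.1, "rainfall_mm": 5.0}
    # price = float(model.predict([list(features.values())])[0])
    price = 0.0                    # placeholder until a real model is loaded
    return jsonify({"predicted_price": price})

# Development server: app.run(port=5000)
```

The Spring Boot back-end would then call this endpoint and the Angular front-end would render the returned prediction on the dashboard.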

3.5 Data Analysis

In order to predict the sukuma wiki prices in Nairobi, the dataset was sourced from various
data sources. This dataset was based on the period from July 2015 to May 2019. The data
contained 714 observations and 20 features were used for the analysis. The data was pre-
processed to ensure the usage of only the features that would have the greatest effect on the
prediction model.

The data processing and analysis was done using the Jupyter notebook. Jupyter is a
browser-based data science editor used for data pre-processing, feature extraction, data
analysis and visualization, and incorporates various Python libraries used for machine
learning modelling. It features libraries such as pandas, NumPy, scikit-learn, statsmodels
and matplotlib, which were used in this research work. The data was loaded by reading the
CSV file containing the data into the Jupyter notebook for analysis; sample data is shown
in Figure 3-4.

Figure 3-4: Research data sample selected features
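The loading step described above is a single pandas call. A minimal sketch follows; the file name and column names are illustrative, with an in-memory CSV standing in for the research dataset:

```python
import io
import pandas as pd

# Illustrative CSV content standing in for the research dataset;
# in the notebook this would be: df = pd.read_csv("sukuma_data.csv")
csv_data = io.StringIO(
    "nafis_price,avg_temp,precipitation,season\n"
    "1500,22,3.1,peak\n"
    "1700,25,1.2,low\n"
)
df = pd.read_csv(csv_data)
df.head()  # inspect the first rows, as in Figure 3-4
```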

After dropping the features not needed for the price analysis and one-hot encoding the
categorical (textual) variables into their numerical form, extra features were created from
the seasons feature, as shown in Figure 3-5.

Figure 3-5: Final data sample after encoding the seasons feature
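One-hot encoding of the categorical seasons feature can be done with pandas' get_dummies function. A minimal sketch under illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "nafis_price": [1500, 1700, 1600],
    "season": ["peak", "low", "normal"],
})

# Expand the single categorical column into one binary column per season
encoded = pd.get_dummies(df, columns=["season"])
# encoded now has season_low, season_normal and season_peak columns
```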

The summary statistics for the numeric variables were as shown in Figure 3-6. There were
714 observations available, with the associated count, mean, standard deviation, minimum
and maximum for each feature variable, together with rows for the 25%, 50% and 75%
percentiles.

Figure 3-6: Research Data Summary Statistics
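The statistics in Figure 3-6 correspond to pandas' describe(), which reports the count, mean, standard deviation, minimum, maximum and the 25/50/75 percentiles for each numeric column. A minimal sketch with stand-in values:

```python
import pandas as pd

# Stand-in column; the research data frame would be used here instead
df = pd.DataFrame({"nafis_price": [1500, 1700, 1600, 1400]})
summary = df.describe()
# summary rows: count, mean, std, min, 25%, 50%, 75%, max
```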

3.6 Chapter Summary

This chapter has covered the methodology that was used to develop, train, test and evaluate
the machine learning models. It also covered the different data sources consisting of online
data and survey data, and the machine learning models that were used in the research work.
Finally, it also highlighted the criteria used to measure model performance and select the
best performing model with the least errors: the root mean squared error, the mean squared
error, and the bias and variance of the model. The summary statistics for the 714 data
records collected were also analyzed. The next chapter describes the implementation of the
sukuma wiki price prediction application.
Chapter 4: System Implementation

4.1 Introduction

This chapter presents the design and implementation of the agricultural commodity price
prediction application based on the results of the algorithm with the best accuracy applied
on the identified data features. It covers the requirements analysis for the system, the user
scenarios, the modelling and design process and the system implementation. The user
scenarios show how different stakeholders like farmers, consumers, traders, government
agencies, data analysts and any other interested parties interact with the system to achieve
their goals. The user scenarios also guided the human interaction with the system and
ensured the system was designed and developed to fulfill the needs of the stakeholders. The
requirements analysis covers the users of the system and the scope of the system
implementation.

The modelling and design presents the organization of the data in the database and the
system architecture design. The application uses a MySQL database and the database
schema lists the tables with their columns and their relationships generated using the
MySQL Workbench tool. The design describes the architecture of the system highlighting
the system components and data flows.

Process flows depicting the functionality and data flows are described. This covers the
high-level process from the CSV file upload, the invocation of the file upload service and
model services to perform Extract, Transform and Loading (ETL) of data. It describes the
data service which is invoked to fetch data to be displayed on the portal dashboard. The
model service exposes the best performing Gradient Boosting machine learning model that
was selected in the previous stage of machine learning model analysis. Detailed process
flows describe the machine learning model processes for the initial training and testing
phase on existing data to prediction modelling on new data.

The system functionality is presented with screenshots of the various user interfaces
available to the user to interact with. The screenshots demonstrate the upload, search and
display of data on the system's dashboards. The chapter also presents the technologies used
to develop the system and their respective usage. Finally, the price prediction application
was tested with past data and future data to get commodity price predictions for the next
three months.

4.2 Analysis

The sukuma wiki commodity price prediction system serves the need to have a simplified
application that provides agricultural commodity prices to various stakeholders such as
consumers, farmers, traders, processors and government. The prediction tool has
functionality to load new data for analysis in an ETL process, execute the model to process
the data and display predictions and relevant information to the end users. The following
are the requirements that the application fulfilled:

i. The system enables the upload of a CSV file into a folder.

ii. The system facilitates the upload of the raw dataset for training and testing.

iii. The system exposes an Application Programming Interface (API) for serving the
machine learning model.

iv. The system displays prediction data using visual aids such as charts, line graphs
and bar graphs.

v. The system displays other data features such as weather, inflation and population data.

vi. The system displays data for different counties.

vii. The system has search capabilities for the feature datasets and predictions.

4.3 Modeling and Design

The system was modeled and designed using the software design lifecycle and an agile
methodology. This involved performing the following tasks:

i. Modelling user scenarios and their interactions with the system.

ii. Evaluation of the relevant tools and technologies to use.

iii. Scoping of the tasks to be done.

iv. Design of the database.

v. Design of the system architecture.


vi. Design of the process flows.

vii. Breaking down of the components into actual development tasks.

viii. Development of the application.

ix. Testing of the application.

x. Deployment of the application for end-users to interact with.

4.3.1 User Scenarios

There are many potential users of the system such as farmers, consumers, traders,
government and policy makers. The uses of the application are:

i. A data analyst prepares and loads data in CSV files for analysis and prediction.

ii. A farmer wants to know what the price of sukuma wiki will be in one, two or three
months to inform their planting decisions.

iii. The trader and consumer want to know the trends of Sukuma wiki prices and
weather conditions over the last few months.

iv. The government and policy makers want to know the effect of the weather,
inflation, population and macro-economic conditions on commodity prices.

v. All stakeholders accessing the dashboard to view past trends on sukuma wiki prices.
This will in future also cover other fruits and vegetables.

4.3.2 Database Schema (ERD)

The data, including the features and predictions, is stored in a MySQL database. The
Entity Relationship Diagram (ERD) represents the schema visually with data such as:

i. NAFIS prices

ii. Weather data

iii. Inflation data

iv. Population data

v. Raw data and prediction results

vi. Future price predictions

The database schema is shown in Figure 4-1 and Figure 4-2.

Figure 4-1: Database Schema

Figure 4-2: Database Schema

Figure 4-1 shows the lookup tables while Figure 4-2 shows the tables that store the raw and
predicted data. Table 4-1 lists the tables for the price prediction application and describes
the database schema implemented, showing what information is held in each table of the
database.

Table 4-1: Database Tables

No.  Table               Description
1.   counties            Holds county information such as county name and code.
2.   weather             Holds weather information such as average, high and low
                         temperatures, and precipitation.
3.   nafis               Holds National Farmer Information Service data such as the
                         NAFIS price and the date of publishing.
4.   population          Holds population data such as population counts and
                         household sizes.
5.   inflation           Holds macro-economic data, i.e. monthly inflation rates and
                         average annual inflation.
6.   raws                Holds the unprocessed uploaded raw CSV data.
7.   predictions         Holds predicted sukuma wiki prices against the raw NAFIS
                         prices.
8.   future_predictions  Holds data on future predictions of sukuma wiki prices.

4.4 Proof of Concept

The system was implemented following a micro-service architecture and agile approach.
The system consisted of the following components:

i. Extract, Transform and Load (ETL) process to upload the data from csv files into
the portal
ii. Machine Learning pipeline to predict commodity prices
iii. Data Visualization of the raw data and predictions

4.4.1 System Architecture

The flow is as presented below.

Figure 4-4: System Architecture

Process flow explained:

1. A CSV file with feature data is uploaded into the portal.

2. The uploaded file is forwarded to an upload service to be saved and processed.

3. The data is consumed by the model service, which pre-processes the data, fits the model and makes predictions.

4. The raw and predicted data are saved in the database while the trained model is
saved as a pickle file on disk.

5. The portal visually displays the predicted prices and any other useful information
using charts and tables with data fetched from the data service.

The flow of data is shown in Figure 4-5.

Figure 4-5: System Sequence Diagram

4.4.2 Process Flow

Once the data upload has been completed, the model service is invoked to process the files.
The predictive modelling process involves two steps:

i. Training and Test Process – Past data is trained, tested and evaluated for best
accuracy.

ii. Prediction Process – New forecast data is uploaded and fit into the trained model
for prediction.

4.4.2.1 Training and Test Process Flow

The following process in Figure 4-6 explains the flow of information when the training and
testing on past data is done.

i. File upload
A csv file is uploaded in the portal upload area and saved into an upload folder.
ii. Read and extract data
The data in the uploaded csv file is then read using the Pandas library and the data
extracted into memory.
iii. Save into raws table
The extracted data is saved into the raws table and other associated tables in a
MySQL database. This data is then fetched and displayed in the various dashboards.
iv. Encode categorical features
Categorical features like seasons (peak, normal, low) are encoded into a binary matrix
since the algorithms only work with numerical data. This results in several new features
(season peak, season normal and season low) that are set to 1 whenever the corresponding
season applies and 0 otherwise.
v. Drop unnecessary features, empty rows and non-numbers
Features not needed for the prediction training and test tasks are dropped. These
features include county, units, kg, variety, commodity and NAFIS dates. Data rows
which are empty or contain null values are dropped, while some missing values are
imputed with the feature mean. The algorithm requires all data fields to be filled. The
NAFIS price is assigned as the dependent (Y) variable.
vi. Split train-test data
The remaining data features are then split into training data and test data with a test
size of 0.33, that is, 67 per cent of the data for training and 33 per cent for testing.
The model is trained using the training dataset to find the best accuracy scores. The
test data is then used to evaluate how well the trained model generalizes to new data.

Figure 4-6: Training and Test Process Flow

vii. Scale data to normalize
The feature variables are then scaled to normalize them and avoid some bigger
figures overpowering the smaller features. Features such as the household sizes and
total demand are scaled to standardize the dataset with features such as inflation
rates, temperature and precipitation which are in single-digit and double-digit
amounts. Scaling is done using the scikit-learn preprocessing StandardScaler
function by removing the mean and scaling to unit variance.
viii. Fit and Prediction
The scaled data is then fitted into the model and the prediction process takes place.
This is done by calling the model's fit and predict functions to do the predictive
modelling. The independent variables, comprising the temperatures, precipitation,
inflation, household sizes, demand amounts and the dummy season variables, are
trained against the dependent variable, the NAFIS sukuma wiki price. The resulting
predictions are saved into a predictions table in the MySQL database.
ix. Save prediction results
The predicted results of predicted sukuma wiki prices are then saved into the
predictions table on the MySQL database. These will be queried and fetched to be
displayed on the portal's dashboards using charts, line and bar graphs.
x. Save trained model onto disk
The trained model is saved as a pickle file onto the disk using the joblib library used
by scikit-learn. This results in a .pkl file that is used for future predictions.

4.4.2.2 Price Prediction Process Flow

The process explains the flow of information when the prediction on new future data is
done. The model saved on disk in the previous step is used to do predictions on new data
that doesn't have the dependent NAFIS price feature variable.

Figure 4-7: Price Prediction Process Flow

i. Upload new data file


New data is uploaded with all the features forecasted for future dates, excluding
the dependent NAFIS price variable. Based on the filename of the CSV, which
contains the "predict" keyword, the predict endpoint of the model service is called
to process this new data.

ii. Load model from disk

The predict service loads the model saved in step 1(x) above from disk using the
scikit-learn joblib library.

iii. Encode categorical features

Categorical features like season, and any other are encoded into their numerical
form to create new features.

iv. Drop un-needed features, empty rows and NaN

Unneeded features like variety, commodity, NAFIS dates, units and kg are dropped
from the data-frame. Empty rows and rows with nulls are dropped, while the means
of other numerical variables are assigned if data is missing.

v. Scale data to standardize

The data is scaled to standardize it so that all features are normalized for analysis.

vi. Predict future prices

The new scaled data is fitted into the model and prediction is done.

vii. Save into future predictions table

The predicted data is saved into the future predictions table in the MySQL database
for display in the portal dashboards.
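The prediction flow above reduces to loading the pickled model and calling predict on the new feature rows. A minimal sketch; the file name is illustrative and a simple linear model trained on stand-in data takes the place of the saved Gradient Boosting model:

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in for the model trained and pickled in step 1(x) above
trained = LinearRegression().fit(np.arange(10).reshape(-1, 1), np.arange(10) * 100.0)
joblib.dump(trained, "model.pkl")

# ii. load the saved model from disk
model = joblib.load("model.pkl")

# vi. predict future prices on new, already encoded and scaled, feature rows
new_data = np.array([[12.0], [13.0]])
future_prices = model.predict(new_data)
```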

4.4.3 System Components

The system was built using the following technologies:

i. JAVA Spring Boot Framework

ii. Angular 2

iii. Chartjs

iv. MySQL Database

v. Python

vi. Flask RESTful Framework

vii. Jupyter Notebook (Anaconda)

viii. Python libraries – Scikit-Learn, pandas, numpy, matplotlib

4.4.3.1 Dashboards

The dashboard fetches data from the saved raw data and predictions data tables to display
various types of information to the end-users. There are specialized queries implemented
using the spring boot JPA framework that retrieves this data. The queries fetch the
following data:

i. Average predicted prices for the next 3 months by county.

ii. Monthly average NAFIS versus predicted prices.

iii. Daily predicted versus NAFIS prices per county.

iv. NAFIS versus predicted price trends over several years.

v. NAFIS price trends by county.

vi. Population and household trends by county.

vii. Temperature trends by county.

viii. Inflation trends.

Figure 4-8: Dashboard Service Class Diagram (APPENDIX I (v))

The dashboard is developed and implemented using the Angular 2 JavaScript framework. It has
the following components:

i. The model classes, i.e. PredictionsModel, FuturePredictionsModel and RawsModel,
define the field values with corresponding getter and setter functions.

ii. The service classes, i.e. PredictionsService, FuturePredictionsService and
RawsService, consume the Data Service API end-points to fetch data to be
displayed on the dashboard.

iii. The Dashboard Module and the Dashboard Components classes import and load all
classes including the service components, charts library and required angular
modules.
iv. The Dashboard Routing class provides all the routes for the dashboard and services.

v. The Chartjs library provides table and chart components i.e. bar graphs, line graphs
and tables.

vi. The Dashboard HTML Component provides the view for the web portal displaying
the fetched data in charts and tables.

4.4.3.2 Upload Service

The application is a web portal that provides an upload page to upload data from CSV files
into an upload directory. The upload service is a micro-service API built using the Java
Spring Boot framework. It exposes an API that receives the uploaded CSV data file and
saves it in an upload folder. It then invokes the model service to load and process the data
in the uploaded CSV file.

Figure 4-9: Upload Service Class Diagram (APPENDIX I (iv))

The upload service is a Java micro-service that has the following components:

i. The FileController class exposes a web service API that receives uploaded csv
files. It sends the uploaded file to the FileStorageService class and to the
ModelService class, which consumes the Model Service API.

ii. The FileStorageService class saves the file in the upload directory.

iii. The UploadFileResponse class is the bean class that has the file fields and getters
and setters.

iv. The FileStorageException and MyFileNotFoundException classes handle
file-related exceptions.

4.4.3.3 Model Service


The model service is a Flask-Python micro-service that serves the machine-learning
model and is exposed to the portal. The model service is invoked by the upload service
after a file upload to process the file. The Flask micro-framework wraps the model and
exposes one endpoint to process training and test data and another to process new data
uploaded for prediction.

Figure 4-10: Model Service Class Diagram (APPENDIX I (ii))

The model service has the following functionality:

i. The model_ml() function exposes an API consumed by the upload service to
perform model training and testing when csv data is uploaded into the application.
It invokes the saveRawData() and savePredictionData() functions to save the
uploaded raw data and the predicted data respectively.

ii. The predict_ml() model function exposes an API consumed by the upload service
to perform prediction when csv data for prediction is uploaded into the application.
It saves the predicted data by invoking the saveFuturePredictionData() function.
iii. The saveRawData() function saves raw data uploaded into the application.
iv. The savePredictionData() function saves predicted and raw data into the
application.
v. The saveFuturePredictionData() function saves future prediction data into the
application.
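A minimal Flask sketch of such a model service is shown below. The route names, payload shapes and returned values are illustrative placeholders, not the exact API of the implemented service; the real endpoints would read the uploaded CSV, run the model and write to MySQL:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical routes mirroring the model_ml() and predict_ml()
# functions described above.

@app.route("/model", methods=["POST"])
def model_ml():
    # Real service: read the uploaded csv, train and test the model,
    # then call saveRawData() and savePredictionData().
    payload = request.get_json()
    return jsonify({"status": "trained", "rows": len(payload["rows"])})

@app.route("/predict", methods=["POST"])
def predict_ml():
    # Real service: load the pickled model, predict future prices,
    # then call saveFuturePredictionData().
    payload = request.get_json()
    return jsonify({"predictions": [1500.0 for _ in payload["rows"]]})
```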
4.4.3.4 Data Service

The data service exposes APIs for fetching the saved raw and prediction data for display
on the portal dashboards. The data service accesses the MySQL database and presents the
results of queries that return the required information to be displayed on the portal.
Sample data fetched includes daily predictions, monthly averages and raw data. The data
service is built using Spring Boot, a Java-based framework for building enterprise web
applications.

Figure 4-11: Data Service Class Diagram (APPENDIX I (iii))

The data service provides API end-points that are used to fetch data stored in the
application’s database for display in the dashboard. These are:

i. The FuturePredictionsController, PredictionsController and RawsController
classes provide API end-points accessible to the dashboard functions.

ii. The FuturePredictionsRepository, PredictionsRepository and RawsRepository
classes provide repository interfaces that contain the queries to fetch specific data
from the database.

iii. The FuturePredictions, Predictions and Raws classes are the java bean classes
with the respective fields and getter and setter functions.

4.5 Chapter Summary

This chapter described the implementation of the sukuma wiki price prediction application.
It covered the software development lifecycle used to develop the application: the analysis
of commodity price prediction requirements, the use cases, and the design of the entity
relationship diagram with the database schema. The system components, consisting of the
file upload functionality, prediction, data storage and data visualization, were covered, as
were the process flows for data processing, from the file upload of raw data through
scaling and model fitting to prediction. The system fulfills the task of predicting sukuma
wiki prices in the counties of Nairobi, Mombasa and Kisumu. The next chapter discusses
the results and findings of the research work.

Chapter 5: System Performance, Results and Findings

5.1 Introduction

Chapter five discusses the results and findings of the development of the machine
learning tool for the prediction of agricultural commodity prices. It describes the influence
of the data selected and the performance of the machine learning models evaluated. It also
describes and evaluates the features of the prediction application.

5.2 Factors and data sources that influence sukuma wiki prices

Features for the model were selected from various data sources including NAFIS published
commodity prices, weather information such as temperature and precipitation, average
annual and monthly inflation, household sizes and consumer demand. Others included data
from a research survey conducted to determine market demand for various commodities of
which the Nairobi dataset was used.

Machine learning involves making assumptions and hypotheses about the data and testing
them by performing tasks. Initially, the research made the following intuitive assumptions
for each of the features.

i. Higher temperatures lead to higher sukuma wiki prices. This is because higher
temperatures lower the supply of sukuma wiki while demand from consumers and
traders remains high. Sukuma wiki is a crop that grows best in cool, moist conditions
at altitudes of 800 - 2200 meters above sea level. The crop grows best at temperatures
between 4 and 21°C (39 - 70°F) and also requires at least 6 hours of direct sunlight
daily. Thus, temperature was assumed to be directly proportional to the commodity price.
ii. Higher precipitation results in lower prices due to a higher supply of sukuma wiki.
Naturally, sukuma wiki requires sufficient amounts of moisture for optimal
production and grows well in cool moist conditions. Well-distributed rainfall
amount of between 30-500mm is ideal for optimum yield though irrigation is
recommended where rainfall is inadequate. They are inversely proportional
variables.

iii. Larger household sizes mean increased demand for sukuma wiki, hence higher
prices. Sukuma wiki is a favorite staple food in Kenya and the members of the
household will consume more sukuma wiki. They are directly proportional.
iv. Periods of high inflation mean a rise in commodity prices, including fruit and
vegetable prices, while low inflation causes a drop in prices. Inflation is directly
proportional to sukuma wiki prices.
v. Peak seasons for sukuma wiki during the months of April, August and December
mean that there is increased supply of vegetables, while low seasons signify lower
sukuma wiki production. Normal seasons represent a stable supply of sukuma wiki.
Peak seasons were therefore assumed to correspond to lower prices (an inverse
relationship), and low seasons to higher prices.

The data was organized, cleaned and prepared into a comma-separated values (CSV) file
for upload and analysis using the various algorithms, i.e. linear regression, random forests,
gradient boosting and XGBoost.

The raw secondary data trends for the period 2015 to 2018, and 2017 to 2018 for some
counties, are represented in the listed graphs. These are:

i. NAFIS prices data for the Nairobi, Mombasa and Kisumu counties.
ii. Temperature data for the Nairobi, Mombasa and Kisumu counties.
iii. Precipitation data for the Nairobi, Mombasa and Kisumu counties.
iv. Annual and twelve-month inflation data for Kenya

NAFIS price data for the three counties for the period from 2015 to 2018 was used as the
dependent variable of the machine learning model. Price trends for the period from 2015
to 2018 are shown in Figures 5-1, 5-2 and 5-3.

Figure 5-1: NAFIS Price Trends in Nairobi for the period 2015 to 2018

NAFIS prices were relatively stable over the 2015 to 2018 duration with moderate
fluctuations. Prices ranged between 900 Kenya shillings and 2,000 Kenya shillings.
The highest prices were experienced in 2017, but prices stabilized in 2018.

Figure 5-2: NAFIS Price Trends in Mombasa for the period 2017 to 2018

Figure 5-3: NAFIS Price Trends in Kisumu for the period 2017 to 2018

Temperature trends for the period were analyzed for each county. Average temperatures
showed a trend of falling from the hottest months, seen in February, March, August and
December with an average of 26 degrees centigrade, to the lowest temperature of 18
degrees centigrade in June 2016.

Figure 5-4: Average Temperature Trends for the period 2015 to 2018 in Nairobi

Figure 5-5: Average Temperature Trends for the period 2017 to 2018 in Mombasa

Figure 5-6: Average Temperature Trends for the period 2017 to 2018 in Kisumu

The rainfall amount influences the production amount of Sukuma wiki. Precipitation trends
for the period were also analyzed for each county. The average rainfall or precipitation
witnessed during the period was as shown in the Figures 5-7, 5-8 and 5-9.

Figure 5-7: Precipitation Trends for the period 2015 to 2018 in Nairobi County

Figure 5-8: Precipitation Trends for the period 2017 to 2018 in Mombasa County

Figure 5-9: Precipitation Trends for the period 2017 to 2018 in Kisumu County

Annual average inflation showed an almost flat trend for the duration from
January 2016 to April 2018, while the month-on-month inflation rate had wild swings
during the same duration. Between January and December 2016, annual
inflation rates reduced from 6.77 to 6.32, while for the period of January 2017 to February
2018 they increased from 6.26 to 7.4, with the highest rates recorded between July 2017 and
October 2017. The annual average inflation and 12-month inflation rates for the period 2015
to 2018 are represented in Figure 5-10 and Figure 5-11, showing an increase in inflation in
2017 and a leveling off afterwards.

Figure 5-10: 12-Month average inflation rates for Kenya from 2015 to 2018

Figure 5-11: Annual average inflation rates for Kenya from 2015 to 2018

Table 5-1 shows the average values of average temperatures, minimum temperatures,
maximum temperatures, precipitation amounts, average annual inflation and twelve-month
inflation values for the collated data covering the period from June 2015 to July 2018.
These were the averages for the data features influencing sukuma wiki commodity prices
that were used to train and test the model and select the model with the best accuracy.

Table 5-1: Features Averages for Period 2015 to 2018

County    Average Temp.  Minimum Temp.  Maximum Temp.  Precipitation  Annual Average Inflation  Twelve-Month Inflation
Nairobi   29             14             48             2.87           6.9                       6.4
Mombasa   30             11             35             6.15           6.9                       6.3
Kisumu    22             10             32             3.68           6.9                       6.2

Because the algorithms work only with numerical variables, the categorical variables
such as season and month were converted to their numerical representation. The seasons
were defined as:

i. 1 - Peak Season
ii. 2 - Normal Season

iii. 3 - Low Season

The NAFIS date was also converted to its numerical representation, with a month feature
variable added; for example, January was assigned 1, February 2, and so on up to
December, assigned 12.
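Both conversions can be done in pandas. A minimal sketch, with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "season": ["Peak", "Normal", "Low"],
    "nafis_date": ["2017-01-15", "2017-02-10", "2017-12-01"],
})

# Map the seasons to 1/2/3 as defined above
df["season_code"] = df["season"].map({"Peak": 1, "Normal": 2, "Low": 3})

# Derive the numeric month feature from the NAFIS date (January = 1, ...)
df["month"] = pd.to_datetime(df["nafis_date"]).dt.month
```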

A histogram, presented in Figure 5-12, was plotted for each numeric variable to get
insight into the data being dealt with, giving the following plots.

Figure 5-12: Data Histograms

(a) Annual Average Inflation, (b) Average Temperature, (c) Household Sizes,
(d) Maximum Temperatures, (e) Minimum Temperatures, (f) NAFIS prices

The dependent variable used for prediction is the NAFIS price, which represents the
published National Farmer Information Service daily price for a 50 kg bag of sukuma
wiki. A correlation analysis was done to gain insight into how much each independent
variable correlated with the NAFIS price dependent variable, with the following results.

Table 5-2: Correlation Values

Feature Correlation Value


Average Temperature 0.463213
Minimum temperature 0.378239
Maximum temperature 0.371553
Low Season 0.223264
Precipitation in mm 0.013284
Normal Season -0.120415
Peak Season -0.168187
Twelve-month inflation -0.205665
Annual average inflation -0.372485
Total household demand in kilograms -0.474938
Household sizes -0.474938

The NAFIS price showed a tendency to decrease as the total household demand in kg,
household sizes and annual average inflation increased, as reflected in their negative
correlation values. The average temperature showed the strongest positive correlation
with the NAFIS price, indicating that warmer months were associated with higher
sukuma wiki prices.

Coefficients close to zero indicated little linear correlation, while negative coefficients
indicated an inverse relationship. The variables that mainly influence and correlate with
the NAFIS prices are total household demand in kilograms, household sizes, annual
average inflation, average temperature and season.
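Correlations like those in Table 5-2 come from pandas' corr() on the data frame. A sketch with stand-in data constructed so that demand is inversely related to price, mimicking the negative correlation found:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
demand = rng.normal(50, 10, n)
df = pd.DataFrame({
    "household_demand_kg": demand,
    # Price falls as demand rises, plus noise
    "nafis_price": 2000 - 10 * demand + rng.normal(0, 50, n),
})

# Correlation of every feature with the NAFIS price, sorted as in Table 5-2
correlations = df.corr()["nafis_price"].sort_values(ascending=False)
```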

The data was split into two sets consisting of training and testing sets in order to train the
model properly and validate it. The training set contained the values that were used to train
the model while the testing set contained values that were used to validate the model and
get a score for how well the model was trained using the training set.
The train_test_split function in the scikit-learn library was used to split the data into
training and test data. It takes the following parameters:

X: The input feature variables read from the sukuma wiki data CSV file.

Y: The output, which is the National Farmer Information Service price for sukuma
wiki in Nairobi.

test_size: The proportion of the data to be used as test data, about 33 percent in this
case.

random_state: Controls the random selection of rows; if not set, the function selects
different rows for the train and test samples on each run.

The split function returned four values which were the inputs used for training and testing,
and the outputs used for training and testing. These are depicted in Figure 5-13 below:

Figure 5-13: Train_test_split function

X_train: These are input features representing the training data i.e. 67%.

X_test: These are the input features representing the test data i.e. 33%.

Y_train: These are the NAFIS prices representing the training data set i.e. 67%.

Y_test: These are the NAFIS prices representing the test data set i.e. 33%.
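The call described above, with the test_size value of 0.33 from the text (the feature matrix and prices are stand-in values):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(300).reshape(100, 3)  # stand-in input feature variables
Y = np.arange(100)                  # stand-in NAFIS prices

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42
)
# 67 rows for training, 33 rows for testing
```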

The evaluation metrics used for predictions in regression (continuous data) problems were:

i. Mean Absolute Error (MAE) is the mean of the absolute value of the errors.
ii. Mean Squared Error (MSE) is the mean of the squared errors.
iii. Root Mean Squared Error (RMSE) is the square root of the mean of the squared
errors.
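All three metrics are available through scikit-learn. A minimal illustration with stand-in true and predicted prices:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1500.0, 1600.0, 1700.0])  # stand-in actual NAFIS prices
y_pred = np.array([1490.0, 1620.0, 1700.0])  # stand-in predicted prices

mae = mean_absolute_error(y_true, y_pred)   # mean of the absolute errors
mse = mean_squared_error(y_true, y_pred)    # mean of the squared errors
rmse = np.sqrt(mse)                         # square root of the MSE
```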

To ensure that features are weighted on the same scale, feature scaling was done using the
scikit-learn data preprocessing StandardScaler library. The feature standardization involves
removing the mean and scaling to unit variance using the StandardScaler function applied
on the X_train and X_test variables before the model fitting. The model had 20 features
(variables) and, to measure the predictive power of each feature, the feature importance of
each variable was computed, yielding the results shown in Table 5-3 and Figure 5-14.

Table 5-3: Feature F-Scores

Feature                                    F-Score (%)
Total household demand in kilograms (f0)   37.44
Household sizes (f3)                       32.95
Average temperature (f4)                   16.88
Annual average inflation (f2)               4.86
High temperature (f7)                       4.15
Twelve-month inflation (f5)                 2.18
Precipitation (f6)                          1.54

Figure 5-14: Feature Importance

The F-Score measured each feature's importance by considering how many decisions (tree
splits) relied on that particular feature and how much each such decision improved the
result, for example by reducing the mean squared error.

The most important features were total household demand, household sizes, average
temperature, annual average inflation and twelve-month inflation. The least important
feature was precipitation, which means that precipitation levels had little effect on
Sukuma wiki prices. This could be due to farmers using other sources of water such as
irrigation.
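The thesis reports XGBoost-style F-Scores; an analogous per-feature importance can be obtained from scikit-learn's impurity-based feature_importances_, sketched here on synthetic data in which feature 0 is constructed to dominate.

```python
# Sketch of extracting per-feature importances (as percentages) from a
# fitted gradient boosting model; data is synthetic and illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Feature 0 carries most of the signal, so it should rank first.
y = 5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
importance_pct = 100 * model.feature_importances_  # sums to 100
```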

5.3 Evaluation and selection of the best-performing prediction algorithm

Price prediction tasks are regression problems since they involve continuous data over a
period of time. Based on this and the associated literature, regression-related and time
series algorithms were preferred in this research work. The research identified the
following algorithms to be tested for developing the price prediction machine learning model:

i. Linear Regression
ii. Random Forests
iii. Gradient Boosting
iv. XGBoost algorithm

Feature engineering was performed on the data before executing the machine learning
algorithms. This involved data preprocessing to drop features deemed unnecessary, impute
missing values, scale the features to normalize the data and select the most important
features. Scaling using the StandardScaler function of scikit-learn was applied so that no
single feature dominates the others and skews the analysis. The features selected
consisted of eighteen variables, which were expanded to twenty variables after the seasons
variable, consisting of peak season, normal season and low season, was converted into
numerical variables in a matrix. Peak season was represented by (1,0,0), normal season by
(0,1,0) and low season by (0,0,1). Feature selection resulted in the usage of the
following features based on their importance computed by the algorithms:

i. total household demand in kilograms


ii. household sizes
iii. season
iv. annual average inflation
v. twelve-month inflation
vi. average temperature
vii. precipitation
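The season encoding described above is one-hot encoding; a sketch using pandas get_dummies (the column names here are illustrative):

```python
# One-hot encoding of the season variable: each row becomes a 0/1
# vector, e.g. a peak-season row maps to season_peak=1 and zeros
# in the other two season columns.
import pandas as pd

seasons = pd.DataFrame({"season": ["peak", "normal", "low", "peak"]})
encoded = pd.get_dummies(seasons["season"], prefix="season")
```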

The data was split into training and test data in the ratio of 66.7% and 33.3% respectively.
Finally, each model was fitted to the training data and prediction analysis was performed,
with the resulting predicted values compared against actual values. The resulting models
were then tested against the test data to get the predicted values using the best line of
fit and to ensure the models did not under-fit or over-fit new data.
i. Linear Regression

A linear regression model was fitted on the training data and evaluated on both the
training and test data sets, producing the graphs shown in Figure 5-15:

Figure 5-15: Linear Regression Model Predictions

The smooth straight red line shown in the figure represents the best fit of the predicted
values against the actual NAFIS prices. The intercept and coefficients for the linear
regression were as shown in Table 5-4.
Table 5-4: Coefficients for Linear Regression

Feature                                   Coefficient
Total household demand in kilograms 223.38635874
Household sizes -332.23499439
Average annual inflation -190.67358241
Twelve month inflation -35.52786212
Average temperature 24.96797401
Maximum temperature 25.94204207
Minimum temperature 69.89716704
Precipitation in mm -6.64009988
Low Season 45.88749351
Normal season 18.52770864
Peak season -61.30130957

The intercept was found to be: 1340.6779661692185

The linear regression model utilizes the formula Y = b0 + b1x1 + b2x2 + ... + bnxn, with
an intercept (b0) and coefficients applied to the input features. In our case, the
coefficients for the individual features are interpreted as:

Y = 1538.07106 + (24.96797401 x avg_temp) + (25.94204207 x max_temp) +
(69.89716704 x min_temp) + ((-6.64009988) x precipitation_mm) +
((-190.67358241) x annual_avg_inflation) + ((-35.52786212) x twelve_month_inflation) +
((-332.23499439) x household_sizes) + (223.38635874 x total_household_demand_kg) +
(45.88749351 x season_low) + (18.52770864 x season_normal) +
((-61.30130957) x season_peak)
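Evaluated by hand, the prediction is simply the intercept plus the dot product of the coefficients and the feature vector. A sketch using the intercept and coefficients reported above, with purely illustrative feature values:

```python
# prediction = intercept + coefficients . features
# Coefficients and intercept are those reported in the text; the
# feature values below are illustrative only.
import numpy as np

intercept = 1340.6779661692185
coefficients = np.array([
    223.38635874,   # total household demand (kg)
    -332.23499439,  # household sizes
    -190.67358241,  # annual average inflation
    -35.52786212,   # twelve-month inflation
    24.96797401,    # average temperature
    25.94204207,    # maximum temperature
    69.89716704,    # minimum temperature
    -6.64009988,    # precipitation (mm)
    45.88749351,    # low season indicator
    18.52770864,    # normal season indicator
    -61.30130957,   # peak season indicator
])

features = np.array([3.2, 1.4, 5.8, 5.6, 24.0, 29.0, 18.0, 6.0, 0.0, 1.0, 0.0])
predicted_price = intercept + coefficients.dot(features)
```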

This meant that a unit increase in precipitation, annual average inflation, twelve-month
inflation, household sizes or the peak-season indicator was associated with a decrease in
Sukuma wiki prices proportional to the corresponding coefficient. Likewise, a unit
increase in average temperature, maximum temperature, minimum temperature, total household
demand or either of the normal- and low-season indicators was associated with an increase
in Sukuma wiki prices. A regularization process applied through these coefficients and the
intercept helps to avoid under-fitting or over-fitting the model to the data and features.
The performance metrics for linear regression were as shown in Table 5-5:
for linear regression were as shown in Table 5-5:

Table 5-5: Performance Metrics for Linear Regression

                                 1st Round (500 records)   2nd Round (714 records)
Metric                           Training    Test          Training    Test
R-squared                        0.19        0.26          0.38        0.45
Mean Absolute Error (MAE)        261.0       252.0         308.0       267.0
Mean Squared Error (MSE)         132383.0    105668.0      155263.0    131618.0
Root Mean Squared Error (RMSE)   363.85      325.07        394.03      362.79

This means that in the model, 26% to 45% of the variability in NAFIS price (Y) was
explained (predicted) by the features used (X). From the value of the RMSE, the model was
able to predict the price in the test set to within 325.07 to 362.79 Kenya Shillings of
the actual NAFIS price. The test set metrics closely matched the training set, indicating
that the model did not over-fit or under-fit the data and can be used effectively on new data.

ii. Random Forests

A random forest model was fitted to the data and evaluated, producing the graphs in
Figure 5-16.

Figure 5-16: Random Forests Model Predictions


Table 5-6: Performance Metrics for Random Forests

                                 1st Round (500 records)   2nd Round (714 records)
Metric                           Training    Test          Training    Test
R-squared                        0.84        0.87          0.9         0.86
Mean Absolute Error (MAE)        110.0       98.0          109.0       121.0
Mean Squared Error (MSE)         26789.0     18795.0       23996.0     33689.0
Root Mean Squared Error (RMSE)   163.67      137.09        154.91      183.55

This means that in the model, 86% to 87% of the variability in NAFIS price (Y) was
explained (predicted) by the features used (X). From the value of the RMSE, the model was
able to predict the price in the test set to within 137.09 to 183.55 Kenya Shillings of
the actual NAFIS price. The test set metrics closely matched the training set, indicating
that the model did not over-fit or under-fit the data and can be used effectively on new data.

iii. Gradient Boosting

A Gradient Boosting regression model, GradientBoostingRegressor, was fitted to the data
and evaluated, producing the graph in Figure 5-17.

Figure 5-17: Gradient Boosting Model Predictions

The model was evaluated and produced the following metrics:


Table 5-7: Performance Metrics for Gradient Boosting

                                 1st Round (500 records)   2nd Round (714 records)
Metric                           Training    Test          Training    Test
R-squared                        0.89        0.96          0.92        0.96
Mean Absolute Error (MAE)        100.0       53.0          103.0       68.0
Mean Squared Error (MSE)         18176.0     5814.0        20400.0     9195.0
Root Mean Squared Error (RMSE)   134.82      76.25         142.83      95.89

This means that in the model, 96% of the variability in NAFIS price (Y) can be explained
(predicted) by the features used (X). From the value of the RMSE, our model was able to
predict the price in the test set to within 76.25 to 95.89 Kenya Shillings of the actual
NAFIS price. Increasing the number of data records did not materially change the test-set
performance. The test set metrics closely matched the training set, indicating that the
model did not over-fit or under-fit the data and can be used effectively on new data.

iv. XGBoost

An XGBoost regression model, XGBRegressor, was fitted to the data and evaluated,
producing the graph displayed in Figure 5-18.

Figure 5-18: XGBoost Model Predictions

Table 5-8: Performance Metrics for XGBoost

                                 1st Round (500 records)   2nd Round (714 records)
Metric                           Training    Test          Training    Test
R-squared                        0.86        0.95          0.9         0.95
Mean Absolute Error (MAE)        108.0       60.0          115.0       80.0
Mean Squared Error (MSE)         22238.0     6719.0        25678.0     12265.0
Root Mean Squared Error (RMSE)   149.12      81.97         160.24      110.75

This means that in the model, 95% of the variability in NAFIS price (Y) was explained
(predicted) by the features used (X). From the value of the RMSE, the model was able to
predict the price in the test set to within 81.97 to 110.75 Kenya Shillings of the actual
NAFIS price. Increasing the number of data records did not materially change the test-set
performance. The test set metrics closely matched the training set, indicating that the
model did not over-fit or under-fit the data and can be used effectively on new data.

Metrics were computed to evaluate the efficacy of the models and select the best
performing one. Tables 5-9 and 5-10 compare the accuracy of the models in terms of mean
absolute error, mean squared error, root mean squared error and R-squared.
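The comparison across models can be sketched as a single loop over scikit-learn estimators on synthetic data. XGBoost lives in the separate xgboost package and would slot into the same loop, so it is omitted from this sketch.

```python
# Fitting several regressors and collecting the four comparison metrics.
# Data is synthetic and illustrative, not the thesis dataset.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(scale=0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

results = {}
for name, model in [
    ("Linear Regression", LinearRegression()),
    ("Random Forests", RandomForestRegressor(random_state=0)),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=0)),
]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": model.score(X_test, y_test),
    }
```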

Table 5-9: Performance Metrics for 500 records

                    MAE             MSE                  RMSE             R-Squared
Model               Train   Test    Train     Test       Train   Test     Train   Test
Linear Regression   261.0   252.0   132383.0  105668.0   363.85  325.07   0.19    0.26
Random Forests      110.0   98.0    26789.0   18795.0    163.67  137.09   0.84    0.87
Gradient Boosting   100.0   53.0    18176.0   5814.0     134.82  76.25    0.89    0.96
XGBoost             108.0   60.0    22238.0   6719.0     149.12  81.97    0.86    0.95

Table 5-10: Performance Metrics for 714 Records

                    MAE             MSE                  RMSE             R-Squared
Model               Train   Test    Train     Test       Train   Test     Train   Test
Linear Regression   308.0   267.0   155263.0  131618.0   394.03  362.79   0.38    0.45
Random Forests      109.0   121.0   23996.0   33689.0    154.91  183.55   0.9     0.86
Gradient Boosting   103.0   68.0    20400.0   9195.0     142.83  95.89    0.92    0.96
XGBoost             115.0   80.0    25678.0   12265.0    160.24  110.75   0.9     0.95

The evaluation metrics, based especially on the R-Squared and Root Mean Squared Error
(RMSE) values, showed that Gradient Boosting gave the best accuracy, followed closely by
the XGBoost and Random Forests algorithms. The Linear Regression algorithm performed
poorly on this dataset.

These predictions showed very slight variations between the NAFIS prices and the predicted
prices, as displayed in the line charts below, which show the price variations between the
different models over the period from December 2015 to July 2018.

Figure 5-19: Comparison of NAFIS Prices vs Gradient Boosting

Figure 5-20: Comparison of NAFIS Prices vs Linear Regression

Figure 5-21: Comparison of NAFIS Prices vs Random Forests

Figure 5-22: Comparison of NAFIS Prices vs XGBoost

The four models had varying levels of performance over the period. XGBoost and Gradient
Boosting showed similar performance and performed better than both Random Forests and
Linear Regression. Thus, based on the above trends and each model's RMSE, Gradient
Boosting was recommended as the best performing model for predicting sukuma wiki prices
and, by extension, prices of other vegetable and fruit commodities.

The selected Gradient Boosting algorithm was further hyper-parameter tuned to optimize
it for the prediction task and avoid model under-fitting or over-fitting. Random search with
cross validation was implemented to select the optimal hyper-parameters for the gradient
boosting regressor. The parameters that were tuned were:

i. loss: the loss function to minimize i.e.


loss = ['ls', 'lad', 'huber']
ii. n_estimators: the number of weak learners or decision trees to use i.e.
n_estimators = [100, 500, 900, 1100, 1500]
iii. max_depth: the maximum depth of each decision tree i.e.
max_depth = [2, 3, 5, 10, 15]
iv. min_samples_leaf: the minimum number of examples required at a leaf node of the
decision tree i.e.
min_samples_leaf = [1, 2, 4, 6, 8]
v. min_samples_split: the minimum number of examples required to split a node of
the decision tree i.e.
min_samples_split = [2, 4, 6, 10]
vi. max_features: the maximum number of features to use for splitting nodes.
max_features = ['auto', 'sqrt', 'log2', None]

First, a hyper-parameter grid was defined; then sets of hyper-parameters were randomly
sampled from the grid and evaluated using 4-fold cross-validation over 25 different
combinations, as shown in Figure 5-23. After performing the randomized search, the
combination of settings with the best performance was selected.

Figure 5-23: Hyper-parameter tuning

Scikit-learn maximizes its scoring metrics, so the negative mean absolute error was used
for evaluation; a better score is therefore closer to 0. The results of the
hyper-parameter tuning produced the performance depicted in Table 5-11.

Table 5-11: Results of gradient boosting hyper-parameter tuning

No  Rank test  Mean fit  Mean score  Mean test  Mean train  Param  Param
    score      time      time        score      score       loss   max depth
13  1          4.8687    0.0114      -214.9128  -88.4113    lad    5
9   2          1.1339    0.0022      -225.3845  -95.6151    huber  3
0   3          0.0519    0.0007      -225.7016  -124.8873   ls     5
11  4          13.830    0.0237      -227.4063  -55.9299    lad    10
21  5          2.1527    0.0054      -227.9262  -70.0162    huber  5

Table 5-11: Results of gradient boosting hyper-parameter tuning (continued)

No  Param max  Param min     Param min      Param n     split0 test  split0 train  split1 test  split1 train
    features   samples leaf  samples split  estimators  score        score         score        score
13  sqrt       1             10             1500        -218.86      -92.24        -209.95      -87.73
9   auto       2             4              500         -233.40      -93.92        -208.85      -99.00
0   auto       6             2              100         -229.35      -130.48       -211.34      -127.11
11  log2       4             2              1500        -244.90      -58.46        -224.36      -55.04
21  log2       4             6              500         -240.02      -76.59        -213.03      -70.54

Table 5-11: Results of gradient boosting hyper-parameter tuning (continued)

No  split2 test  split2 train  split3 test  split3 train  Std fit  Std score  Std test  Std train
    score        score         score        score         time     time       score     score
13  -233.85      -84.76        -196.97      -88.89        0.7639   0.0043     13.4242   2.6777
9   -246.65      -91.91        -212.62      -97.61        0.0366   0.0004     15.4363   2.8287
0   -253.70      -118.30       -208.39      -123.64       0.0117   0.0004     18.0491   4.5055
11  -244.46      -52.21        -195.89      -57.98        1.1217   0.0032     19.9934   2.5102
21  -254.15      -64.97        -204.49      -67.94        0.0994   0.0025     20.0322   4.2794

The best score (rank 1 in the test-score ranking) was achieved by 1,500 decision trees
with a minimum samples split of 10. This configuration had a Mean Test Score of -214.9128
and a test-score standard deviation of 13.4242, despite having a long fit time on the
data. These grid search parameters were chosen for our grid since they were close to the
optimal values.

An experiment was done to further improve the model performance by changing the number of
estimators (decision trees) while holding the rest of the parameters steady. The results
were as shown in Table 5-12.

Table 5-12: Results of gradient boosting hyper-parameter tuning on trees

No  Rank test  Mean fit  Mean score  Mean test  Mean train  Param n
    score      time      time        score      score       estimators
13  1          2.2757    0.0052      -232.4202  -111.5332   750
14  2          2.6254    0.0052      -232.4972  -111.1048   800
12  3          2.7719    0.0045      -232.5638  -112.4699   700
11  4          2.7282    0.0042      -232.8292  -113.3254   650
10  5          2.3080    0.0044      -233.2879  -114.6126   600

Table 5-12: Results of gradient boosting hyper-parameter tuning on trees (continued)

No  params                  split0 test  split0 train  split1 test  split1 train  split2 test  split2 train
                            score        score         score        score         score        score
13  {'n_estimators': 750}   -238.9985    -113.6961     -204.9674    -109.2530     -261.5593    -107.8112
14  {'n_estimators': 800}   -238.9653    -113.2889     -205.0698    -109.0631     -262.0623    -107.2321
12  {'n_estimators': 700}   -238.1398    -114.9003     -204.8838    -110.1027     -261.6610    -108.4356
11  {'n_estimators': 650}   -238.6203    -116.0888     -204.9902    -110.9860     -261.9759    -109.0174
10  {'n_estimators': 600}   -239.7224    -118.0727     -204.9569    -112.0012     -262.0333    -109.8770

Table 5-12: Results of gradient boosting hyper-parameter tuning on trees (continued)

No  split3 test  split3 train  Std fit  Std score  Std test  Std train
    score        score         time     time       score     score
13  -224.1555    -115.3724     0.5229   0.0004     20.7021   3.1012
14  -223.8914    -114.8350     0.2849   0.0004     20.8701   3.0761
12  -225.5707    -116.4409     0.3698   0.0005     20.5719   3.2998
11  -225.7305    -117.2094     0.1572   0.0004     20.6668   3.4188
10  -226.4391    -118.4994     0.4151   0.0004     20.7199   3.7524

This experiment with a reduced number of tree estimators, however, did not improve the
model performance, as evidenced by the lower Mean Test Score of -232.4202 for 750
estimators compared to -214.9128 for 1,500 estimators. The training and testing errors as
the number of trees increased indicated hardly any performance improvement, as shown in
Figure 5-24.

Figure 5-24: Gradient boosting performance vs number of trees

The plot in Figure 5-24 shows clearly that the model was overfitting. The training error
was significantly lower than the testing error, showing that the model learnt the training
data very well but was unable to generalize well to the test data. Both the training and
test errors decreased as the number of trees increased, but the training error decreased
more rapidly than the testing error.

Since the model performed well on the training data but did not achieve the same
performance on the test data, the overfitting can be addressed by gathering more training
data or decreasing the model complexity through hyper-parameter tuning or regularization.
Options for the gradient boosting regressor include reducing the number of trees, reducing
the maximum depth of each tree, and increasing the minimum number of samples in a leaf node.

Figure 5-25 plots the distribution of actual sukuma wiki prices and predicted prices on the
test set.

Figure 5-25: Distribution of actual vs predicted sukuma wiki prices on the test set

The distributions look nearly the same, although the density of predicted values is closer
to the median of the test values. This means the model is more accurate at predicting
values close to the median than extreme values.

A histogram of the residuals in Figure 5-26 shows that the residuals are close to normally
distributed, with a few outliers on the low end. These indicate errors where the model
estimate was far below the actual price value.
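Residuals are simply actual minus predicted prices, and the histogram in Figure 5-26 is a binning of them; the values in this sketch are illustrative.

```python
# Residuals (actual - predicted) and their histogram bins.
import numpy as np

actual = np.array([1200.0, 1250.0, 1300.0, 800.0, 2400.0])
predicted = np.array([1180.0, 1290.0, 1310.0, 950.0, 2300.0])

residuals = actual - predicted
counts, bin_edges = np.histogram(residuals, bins=5)
```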

Figure 5-26: Distribution of residuals

The results show that the gradient boosting algorithm is applicable to our sukuma wiki
price prediction task, with the final model able to predict prices to within 229.50 Kenya
shillings. Hyper-parameter tuning was also able to improve the model performance, although
at a considerable cost in testing time. This confirms that proper feature engineering and
additional data gathering have a higher pay-off than hyper-parameter tuning.

However, there are various options that can be used to improve the performance of the
models, such as:

i. Adding more features (variables).
ii. Adding more data to cover a wider duration, for example, several years.
iii. Extensive hyper-parameter tuning.
iv. Better data collection, organization, clean-up and preprocessing before analysis.

5.4 Development and testing of sukuma wiki price prediction application

The agricultural price prediction portal was implemented successfully for use in predicting
commodity prices according to defined specifications. The training data of approximately
867 records covering the period from December 2015 to May 2019 was used with the
prediction data covering the five-month period from June 2019 to October 2019.

i. Data File Upload

Users are able to upload csv files with feature data into the portal which saves the files in
the upload directory and subsequently invokes the model service to process the saved file.
The upload page is displayed in Figure 5-27. Figure 5-28 and Figure 5-29 display the upload
of csv files containing the training dataset and the prediction dataset respectively.

Figure 5-27: Portal CSV Upload Screenshot

Figure 5-28: Portal Training Data CSV File Upload Screenshot

Figure 5-29: Portal Prediction Data CSV File Upload Screenshot

ii. Model Service

The model service, which is exposed via an API, loads the uploaded csv file, reads and
extracts the data, and invokes the gradient boosting model to perform predictions on the
past data, saving the gradient-boosting model to disk. This data, comprising the raw data
and the predicted prices, is saved in the relevant raws and predictions tables for
display. The forecasted future data is then uploaded into the portal and saved in the
upload folder. Finally, the saved model is loaded and executed to predict future sukuma
wiki prices, which are saved into the future predictions table.
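The save/load cycle the model service performs can be sketched with joblib, a common choice for persisting scikit-learn models; the file name and data here are illustrative, not the portal's actual paths.

```python
# Persist a fitted model to disk, then reload it for future predictions.
import numpy as np
from joblib import dump, load
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=80)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
dump(model, "gb_price_model.joblib")      # save after training

restored = load("gb_price_model.joblib")  # reload inside the service
future_prices = restored.predict(X[:5])
```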

iii. Dashboard

The dashboard page consists of line and bar charts displaying varied information such as
the average prices for the next three months and the price trends for the last year and
the last month.

Figure 5-30 presents the averages of the predicted prices for the three months from August
2019 to October 2019 in the three counties of Nairobi, Mombasa and Kisumu. The prices are
predicted to be stable across the three counties at around 2,800 Kenya shillings, with
marginal differences.

Figure 5-30: Portal August to October 2019 Average Predictions Screenshot

Figure 5-31 presents a line graph showing the predicted price trends from 21st May 2019 to
October 2019. Sukuma wiki prices ranged from 2,845 to 2,890 Kenya shillings, which shows a
stable price regime during this period. This can be associated with a stable average
inflation rate of 5.8 percent, an average temperature of 24 degrees centigrade and
precipitation ranging between 1 mm and 12 mm.

Figure 5-31: Portal May to October 2019 Predictions Line Graph Screenshot

Figure 5-32 shows the comparison between the actual NAFIS prices and the predicted prices
from December 2015 to May 2019. Prices have been increasing, signaling rising demand for
sukuma wiki driven by a growing population. The predicted prices matched the fluctuations
of the actual prices with minimal residual errors.

Figure 5-32: Portal NAFIS Prices vs Predicted Prices Trends 2015 to 2019

Nairobi Data

The dashboard for Nairobi displays a line graph for the trends and a table view.

Figure 5-33: Portal Nairobi NAFIS Prices vs Predicted Price Trends

The line graph showed that sukuma wiki prices in Nairobi had maintained a steady price
range of around 1,200 Kenya shillings from April 2018 to October 2018. Thereafter, prices
dropped to reach a low of 800 Kenya shillings in November 2018 before beginning a
gradual rise to highs of 2,400 Kenya shillings. The predicted price closely matched the
actual prices with an average residual of 120 Kenya shillings which indicates that the model
gives good predictions.

Mombasa Data

The dashboard for Mombasa displays a line graph for the trends and a table view.

Figure 5-34: Portal Mombasa NAFIS Prices vs Predicted Price Trends

Kisumu Data

The dashboard for Kisumu displays a line graph for the trends and a table.

Figure 5-35: Portal Kisumu NAFIS Prices vs Predicted Price Trends

Mombasa and Kisumu prices for sukuma wiki showed an unstable trend over the past year,
fluctuating across the peak, normal and low seasons. This was occasioned by the erratic
rainfall and high temperatures in these two counties, despite the lower rate of inflation
in this period. However, the difference between the NAFIS prices and the predicted prices
was very minimal, showing a high prediction accuracy for the model.

Prices followed an erratic trend, which was against the research study's hypothesis. The
price increase can be attributed to the growth in household sizes over the years since the
start of the study period, when prices averaged between 800 and 1,200 Kenya shillings. In
addition, inflation has gone down over the period, which has helped keep prices steady in
the last few months. The actual and predicted prices in relation to temperature were
inconsistent, as prices varied across high, normal and low temperatures.

In some cases, high-temperature periods had low sukuma wiki prices, while in other cases
they had high prices. This may be explained by farmers using greenhouses or irrigation to
produce sukuma wiki. Nairobi had the lowest prices, depicting consistent supplies of
sukuma wiki throughout the year and proximity to sukuma wiki growing zones, as compared to
Mombasa and Kisumu, whose higher prices could be caused by issues such as higher transport
costs and more extreme weather.

The predictions for the period between May and October 2019 show prices stabilizing within
the 2,800 to 2,900 Kenya shillings range. Nairobi maintained a price range of around 2,800
Kenya shillings, while Mombasa and Kisumu had oscillating prices during the same period.
This can be attributed mainly to average temperatures and precipitation remaining constant
through the period in Nairobi, whereas they fluctuated in Mombasa and Kisumu. The other
factors such as household sizes, total demand and inflation remained constant during this
period and thus did not have much effect on the prices.

5.5 Chapter Summary

This chapter covered the results and findings of the research work. Trends on the raw data
for the three counties were analyzed and displayed in graphs. It described the influence and
importance of the data selected on price prediction showing the correlation of this data to
the dependent NAFIS price variable. It also discussed the performance of the machine
learning models using metrics such as the root mean squared error and r-squared.

The gradient boosting algorithm provided the best accuracy among the four algorithms
evaluated. Cross-validation tests and hyper-parameter tuning were further done on the
gradient boosting algorithm to find the optimal parameters that improve performance.
Consequently, the resulting model was utilized in the price prediction application for
sukuma wiki prices. The prediction application was finally implemented and presented with
relevant charts and tables showing predicted sukuma wiki prices for the successive three
months and a comparison of past prices and their predictions. The next chapter concludes
the research work.

Chapter 6: Discussion, Conclusions and Recommendations

6.1 Introduction

This chapter summarized the major findings on the development of a machine learning tool
for the prediction of agricultural commodity prices with the discussion focusing on each of
the objectives. The research objectives included the study of the features and data sources
that influence sukuma wiki prices, the evaluation and selection of the best performing
machine learning algorithms and the development of a sukuma wiki price prediction
machine learning tool. It gave the conclusions based on the findings and recommendations
for further research and development following the limitations of the study.

6.2 Summary

6.2.1 Purpose of the study

The purpose of the study was to develop and test a price prediction application for
agricultural commodities at the county level, specifically for sukuma wiki.

6.2.2 Specific Objectives

The research focused on attaining specific objectives with commitment to achieve overall
expectations as detailed below:

i. To identify the features and data sources that influence sukuma wiki prices.
ii. To evaluate different algorithms and select the best-performing algorithm for
sukuma wiki price prediction.
iii. To develop and test a prediction application that utilizes the selected algorithm to
predict and visually display sukuma wiki prices for Nairobi, Mombasa and
Kisumu counties.

6.2.3 Research Methodology

Chapter 3 introduced the research methodology used in the implementation of the sukuma
wiki price prediction application. The research involved collecting data from various
sources such as the weekly sukuma wiki prices published by the National Farmer
Information Service which was the main dependent variable. Other data features collected
include weather information such as temperatures and precipitation, inflation data,
population data such as household sizes, crop seasons, and the Agricultural Sector
Development Support Scheme report on value chains in Nairobi, Mombasa and Kisumu.
This data formed the data features used by the price prediction machine learning application
to predict sukuma wiki prices.

The data was then cleaned, organized and prepared for analysis using the browser-based
Jupyter Notebooks and Scikit-Learn library. The data was prepared by removing records
with missing values and scaling the data so that they have equal weighting. This included
creating an extra set of data features from existing features such as breaking seasons into
peak, normal and low seasons. The data was then fitted into various machine learning
algorithms including linear regression, random forests, gradient boosting and XGBoost
which were then used to predict. The data was split into training and test data in the
ratio of approximately 67 per cent for training and 33 per cent for testing. The models
were evaluated and
tested to choose the model which gave the best accuracy.

The gradient boosting algorithm gave the most accurate predicted prices compared to the
actual prices and was thus chosen to implement the price prediction application. The
application was built with modules for uploading CSV files, storing data from the uploaded
file in a database, executing the price prediction model and displaying the results in
charts and tables for end-users such as farmers, traders, processors, consumers and
government.

6.2.4 Major Findings

The implementation of this project provided insight into the use of machine learning to
predict food prices according to the stated research objectives. The research found out that
there are various data features that can be used for price prediction with the most influential
being weather and household population size. The gradient boosting algorithm produced
the best accuracy among the four algorithms based on the available datasets and was thus
used to implement the sukuma wiki price prediction application as a model service. This
means that in the model, 96% of the variability in NAFIS price (Y) can be explained
(predicted) using the features used (X). From the value of the RMSE, our model was able to
predict the price in the test set to within 76.25 to 95.89 Kenya Shillings of the actual
NAFIS price.

6.3 Discussion

A sukuma wiki price prediction application was developed after analyzing various
regression and decision tree algorithms on a variety of datasets for the urban Kenyan
counties of Nairobi, Mombasa and Kisumu. The study focused on three key objectives
which were the identification of features that influence sukuma wiki prices, evaluation and
selection of the best-performing prediction algorithm from several algorithms, and the
development and evaluation of a price prediction application for predicting sukuma wiki
prices into the future. The conclusions were therefore drawn with references to the defined
objectives.

6.3.1 Factors and data sources that influence sukuma wiki prices

The study analyzed the data features that influence sukuma wiki prices in the three Kenyan
Counties of Nairobi, Mombasa and Kisumu and used data spanning the period from July
2015 to May 2019. The data was collected from multiple online sources like the National
Farmer Information Service (NAFIS), weather data such as the average, maximum and
minimum temperatures from the accu-weather website, and inflation data from the Central
Bank of Kenya (CBK) website, and finally, population and demand data from the
Agriculture Sector Development Support Programme (ASDSP).

The primary dependent variable was the NAFIS sukuma wiki price for a fifty (50) kilogram
bag, collated on multiple random days each week, while the rest of the data formed the
independent variables. It is often stated in data science that about eighty percent (80%) of
project time is spent on data collection, clean-up and analysis, and a comparable effort was
expended in collecting and preparing the data for this study. The data was collected, cleaned
and organized in a CSV document keyed on the NAFIS price reporting dates. Similar
studies conducted in the USA by Gro Intelligence used data from multiple sources, such as
the USDA, environmental indicators, climate signals and other variables correlated with
crop yield, for corn price prediction. Other studies also identified weather conditions,
consumer price indexes and population attributes as factors affecting crop prices.
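The collation step described above can be sketched as pandas merges keyed on the reporting date; the column names below are illustrative, not the study's exact headers:

```python
# Hedged sketch: aligning price, weather and inflation records on the NAFIS
# reporting date, as done when organizing the study CSV. All column names
# and values here are illustrative stand-ins.
import pandas as pd

prices = pd.DataFrame({"nafis_date": ["02/12/2015", "09/12/2015"],
                       "nafis_prices": [1250, 1200]})
weather = pd.DataFrame({"nafis_date": ["02/12/2015", "09/12/2015"],
                        "avg_temp": [19, 19], "precip_mm": [7.4, 7.4]})
inflation = pd.DataFrame({"nafis_date": ["02/12/2015", "09/12/2015"],
                          "annual_avg_inflation": [6.58, 6.58]})

# Inner-join the three sources on the reporting date
merged = (prices.merge(weather, on="nafis_date")
                .merge(inflation, on="nafis_date"))
```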

Exploratory data analysis, conducted using the scikit-learn library (a Python-based
machine learning library with many pre-built algorithms), established that household
demand and weather were major determinants of sukuma wiki pricing. Because demand
from households, which average five members, has been increasing every year with the
rising population, the price of sukuma wiki rose sharply over the period. According to the
KNBS 2009 census, Nairobi, with 17 sub-counties, had a population of 3,138,369 and
985,016 households. The population has grown at an average rate of 4% per year and was
expected to rise above 4 million people with over 1.4 million households in 2019.

Mombasa grew from 298,896 households with a total demand of 1,660,117 kilograms in
2015 to 306,323 households with a total demand of 1,862,254 kilograms in July 2018.
Similarly, Kisumu grew from 266,453 households with a total demand of 1,630,692
kilograms in 2015 to 273,074 households with a total demand of 1,671,213 kilograms in
July 2018. The demand was calculated from the number of households and the per-
household demand of about 6.12 kg determined by the ASDSP survey conducted between
July 2015 and August 2016.
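The demand calculation described above (households multiplied by the ASDSP per-household figure of 6.12 kg) can be written directly; note that the published county totals differ slightly from this simple product, so this sketch shows only the stated formula:

```python
# Hedged sketch of the stated demand formula: households x 6.12 kg per
# household (ASDSP survey figure). Published totals differ slightly,
# presumably due to rounding or period adjustments.
PER_HOUSEHOLD_KG = 6.12

def county_demand(households: int) -> float:
    """Approximate total sukuma wiki demand in kilograms for a county."""
    return households * PER_HOUSEHOLD_KG

# e.g. Mombasa in July 2018 had 306,323 households
mombasa_demand = county_demand(306_323)
```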

The price, however, fluctuated with the weather seasons, going up during the peak months
of April, August and December, and down during the low months of January, February,
May, June and July. The peak seasons coincided with periods of low temperature and high
precipitation, and vice versa for the low seasons. Inflation rates remained nearly constant
throughout the study period, which meant they had little influence on pricing. The humidity
and wind-speed features were not considered in this study because their values did not
change over the study period. Forecast weather data covering May 2019 to October 2019
was also collected.
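Deriving the categorical season feature from the month, following the peak and low months named above, might look like the following sketch (months not listed are assumed "normal" here, matching the season_low/season_normal/season_peak encoding used in the model code):

```python
# Hedged sketch: map a calendar month to the study's season category.
# The "normal" bucket for unlisted months is an assumption consistent
# with the one-hot columns season_low, season_normal and season_peak.
PEAK_MONTHS = {4, 8, 12}        # April, August, December
LOW_MONTHS = {1, 2, 5, 6, 7}    # January, February, May, June, July

def season_of(month: int) -> str:
    """Return the season label for a month number (1-12)."""
    if month in PEAK_MONTHS:
        return "peak"
    if month in LOW_MONTHS:
        return "low"
    return "normal"
```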

NAFIS prices ranged from a high of 3,500 Kenya Shillings to a low of 800 Kenya Shillings
in the study data and had been steadily rising over the years. The ASDSP report covering
the one-year period between July 2015 and August 2016 reported a consistent unit price
per bunch, which did not vary enough to influence the study findings. The study used a
dataset of about 750 records combining the Nairobi, Mombasa and Kisumu data with the
specified factors.

6.3.2 Evaluation and selection of the best-performing prediction algorithm

The study identified four widely-used regression algorithms that are appropriate for
prediction tasks: linear regression, the random forests bagging algorithm, and the boosted
decision tree algorithms gradient boosting and XGBoost. Evaluation and prediction were
performed on two datasets, one with 577 records and one with 749 records. Time-series
algorithms such as ARIMA were not used for this study because of the inconsistent state
of the data; time-series algorithms work best on continuous and consistently-sampled data.
Artificial neural networks were also not considered because of the compute power needed
and the time they would take to train and predict.

The study established that the boosted decision tree algorithms, gradient boosting and
XGBoost, performed well on the available data. Explanatory power, measured by
R-squared, ranged between 0.92 and 0.96 for the gradient boosting algorithm, the best
result for the sukuma wiki price model; that is, 92% to 96% of the variance in sukuma wiki
prices was explained by this model. It was closely followed by the XGBoost algorithm with
a similar range of 0.90 to 0.95, showing that decision tree-based algorithms outperformed
regression-based algorithms like linear regression and bagging algorithms like random
forests on the available data.

In this study, the linear regression algorithm had the weakest performance, with R-squared
values of 0.45 and 0.38 on the training and test data respectively. It also had the highest
RMSE values, 362.79 and 394.03 for the training and test data, indicating the largest
deviation from the actual prices. Gradient boosting algorithms are therefore recommended
as the most suitable models for sukuma wiki, and other fruit and vegetable, commodity
price prediction tasks.
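The comparison described above can be sketched by fitting each scikit-learn candidate on the same train/test split and recording R-squared and RMSE. Synthetic data stands in for the study dataset, and XGBoost is omitted to keep the sketch within scikit-learn:

```python
# Hedged sketch of the model comparison on synthetic stand-in data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=750, n_features=8, noise=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=7)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=7),
    "gradient_boosting": GradientBoostingRegressor(random_state=7),
}
scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = (r2_score(y_te, pred),
                    np.sqrt(mean_squared_error(y_te, pred)))
```

The model with the highest R-squared and lowest RMSE on the held-out set is selected, which is how gradient boosting was chosen in the study.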

6.3.3 Development and testing of a sukuma wiki price prediction application

The research study culminated in the development of a web-based agricultural commodity
price prediction portal used to perform price prediction and to visualize the predicted prices
and associated data. The portal features an easily-accessible dashboard and an
administration section for uploading a CSV data file, running price predictions and
presenting the results. The main portal dashboard displays sukuma wiki price information,
both for 3 months ahead and for past periods, for the 3 counties of Nairobi, Mombasa and
Kisumu.
The average predicted prices for sukuma wiki over the next 3 months were 2,870 Kenya
Shillings in Nairobi, 2,880 in Mombasa and 2,890 in Kisumu.

The portal dashboard also displays a line graph showing the trend of prices over the next 3
months and the predicted versus actual price comparisons for the last year in the 3 counties.
It also displays, in tables for easy access, the last month's actual and predicted price
comparisons for selected dates. Search capabilities using filters such as date ranges,
counties and commodity types are to be added in the future; the searchable data would
include price information, weather, inflation, household sizes and demand over any date
range.

The prototype application features an upload page where a user uploads prepared data in a
prescribed format; the application stores this data and uses the deployed model to predict
sukuma wiki prices. The portal was developed following an agile minimum viable product
(MVP) approach using suitable technologies: Angular 2 for the frontend, Chart.js for the
charts and tables, Java Spring Boot for RESTful back-end data access, Flask for serving
the machine learning model built with the scikit-learn Python library, and MySQL for data
storage. The application can be used by different stakeholders such as farmers, consumers,
traders, processors, government policy makers and any other interested agricultural
stakeholder.

The model was tested in the urban counties of Nairobi, Mombasa and Kisumu because of
extensive data availability, especially historical data. The application can easily be
extended to cover other agricultural commodities and other counties within Kenya, within
Africa and globally. Even when data is missing, for example when weather or price
information is incomplete or unavailable, there are supervised machine learning techniques
that can learn patterns despite the gaps; for example, one could fill in the missing data by
referring to nearby samples.
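The "nearby samples" idea mentioned above corresponds to k-nearest-neighbours imputation; a minimal sketch with scikit-learn's KNNImputer on toy weather values:

```python
# Hedged sketch: fill a missing temperature reading from the 2 nearest
# rows (by feature distance). Values are illustrative toy data.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[19.0, 7.4],
              [22.0, 15.0],
              [np.nan, 14.0],   # missing temperature reading
              [24.0, 4.9]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```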

6.4 Conclusions

The use of machine learning techniques such as predictive analytics in determining
agricultural commodity prices is on the rise. The machine learning application implemented
in this research work was geared appropriately for sukuma wiki price prediction and is a
useful tool for stakeholders such as farmers, consumers, traders, processors and policy
makers.

Building it required the right kind and amount of data sets and features, the testing and
evaluation of appropriate algorithms for modelling, and the development of an appropriate
system. The features to be used had to be identified and the data collected from the most
appropriate sources, whether online or through surveys and reports. The collected data
required cleaning and organizing so that the most appropriate and influential features could
be used for modelling. The lack of adequate data, especially for agricultural commodities
like vegetables and fruits which are highly seasonal and unpredictable, makes price
prediction extremely difficult.

Because there is no single best algorithm for a given machine learning task, several
algorithms need to be tested and their performance metrics evaluated to find the best-
performing one. The algorithms require hyper-parameter tuning and testing against the
most appropriate features, and the one with the best accuracy for the scenario is selected
for use.
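Hyper-parameter tuning of this kind can be sketched with scikit-learn's GridSearchCV; the parameter values below are illustrative, not the study's final settings:

```python
# Hedged sketch: exhaustive search over a small, illustrative parameter
# grid with 3-fold cross-validation, scored by R-squared.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=5, random_state=7)
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=7),
    param_grid={"n_estimators": [100, 200], "learning_rate": [0.05, 0.1]},
    scoring="r2",
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_   # settings of the best-scoring combination
```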

The machine learning tool developed should fit the price prediction task by allowing users,
who are important stakeholders, to load data onto the system and access information in
visually appealing and useful ways. This can be through various channels such as smart-
phones, SMS or USSD, or any other means that reaches the widest audience. Adequate
available data and the most appropriate model lead to the successful development and
implementation of a price prediction machine learning tool.

6.4.1 Factors and data sources that influence sukuma wiki prices.

The research evaluated various types of data that influence food pricing, with a specific
focus on sukuma wiki as a highly-consumed vegetable in Kenya. The data collected and
organized for the 3 urban counties of Nairobi, Mombasa and Kisumu consisted of weather
data such as temperatures and rainfall, monthly inflation rates, findings from an ASDSP
survey, seasonal food availability, and population data. These data sets had an impact on
food prices, with their variations causing oscillations between the high and low sukuma
wiki prices.

6.4.2 Best-performing algorithm for sukuma wiki price prediction.

The research evaluated and tested 4 different machine learning algorithms particularly
suited to predictive analytics: linear regression, random forests, gradient boosting and
XGBoost, adapted from the scikit-learn library. These were evaluated and tested using the
browser-based Jupyter Notebook commonly used by data scientists for data analytics tasks.
The data was loaded from prepared comma-separated (CSV) files, feature-engineered to
remove missing values, and scaled to normalize the data.
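The preparation steps named above (removing missing values, then scaling) can be sketched as follows; the column names are illustrative:

```python
# Hedged sketch: drop incomplete rows, then standardize the remaining
# features to zero mean and unit variance, as done before model fitting.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"avg_temp": [19.0, 22.0, np.nan, 24.0],
                   "precip_mm": [7.4, 15.0, 10.1, 4.9]})
df = df.dropna()                              # remove rows with missing values
scaled = StandardScaler().fit_transform(df)   # zero mean, unit variance
```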

6.4.3 Development and testing of a sukuma wiki price prediction application

The research study led to the development and testing of a browser-based machine learning
application geared towards sukuma wiki price prediction in Nairobi, Mombasa and Kisumu
counties. The application was developed using various tools and programming languages.
The application developed is composed of an upload area for CSV data, a database module
for data storage, a model service for prediction, and a dashboard with charts and tables to
display the data in a visually-informative format. The application can be used to predict
fruits and vegetable prices for other agricultural commodities apart from sukuma wiki and
in various geographical locations.

6.5 Recommendations and Future Work

6.5.1 Recommendations for the research

The research study focused on four algorithms, consisting of linear regression, random
forests, gradient boosting and XGBoost, with a limited range of data features. To achieve
better accuracy, the following are recommended:

Wider data coverage

The study tested a dataset of only 714 records from the 3 counties. More extensive coverage
is needed, with efforts to collect data spanning several more past years and covering the
other counties in Kenya. The study could also be widened to include other vegetable and
fruit commodities, as the same factors that influence sukuma wiki also influence these other
crops. More evenly-spaced and consistent data would also improve the prediction
modelling.

More data features

There is a dearth of data that would aid the prediction tasks. To enrich the datasets used for
prediction, the model would capture more variability with more signals such as:

i. Petrol prices
ii. Pests and diseases
iii. Cost of inputs such as fertilizer and seeds, among others
iv. Chi-squared values
v. Population density
vi. Total Production amount
vii. Location - Latitude and Longitude
viii. Climatic zones
ix. Transport Modes and the available road network
x. Wastage
xi. Proximity to markets
xii. Urban / Rural characteristics
xiii. Gender - Male / Female
xiv. Household heads or decision makers
xv. Household Income levels
xvi. Expenditure on food items
xvii. Education levels
xviii. Importation volumes
xix. Political sentiment and situation
xx. Farm sizes
xxi. Trader quantities per season

Hyper-parameter tuning of existing algorithms and evaluation of other algorithms

Predictions for vegetable prices can be enhanced by intensive hyper-parameter tuning of
the algorithms to find the best settings, by combining predictions from different models,
and by evaluating other time-series algorithms. This would lead to better price information
for the commodities in the wholesale markets.
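Combining predictions from different models can be sketched with scikit-learn's VotingRegressor, which averages its members' predictions; synthetic data stands in for the study dataset:

```python
# Hedged sketch: average the predictions of a boosted-tree model and a
# linear model; member choice and data are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=5, random_state=7)
ensemble = VotingRegressor([
    ("gb", GradientBoostingRegressor(random_state=7)),
    ("lr", LinearRegression()),
])
pred = ensemble.fit(X, y).predict(X)  # mean of the member predictions
```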

6.5.2 Future work

Develop channels for users to access the price data, e.g. USSD or SMS-based services

The prediction application is currently a web-based portal that is only accessible online
through a browser. Other channels such as SMS, USSD and mobile apps could be
developed so that it is easily accessible to a larger share of the main study targets of
farmers, consumers, traders, processors and government stakeholders.

Evaluation of Deep Learning algorithms

Deep learning approaches such as convolutional neural networks (CNNs), built with
frameworks like TensorFlow and Keras, have emerged that can give better accuracy.
However, they require more powerful computer processing and take longer to train.
Further research using these approaches should be conducted.
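As a rough illustration of a neural-network regressor, here is a sketch using scikit-learn's MLPRegressor rather than TensorFlow or Keras, to stay within the study's toolset; the architecture and iteration count are arbitrary choices, not tuned settings:

```python
# Hedged sketch: a small feed-forward neural network for regression.
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=7)
net = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=7)
net.fit(X, y)
score = net.score(X, y)  # R-squared on the training data
```

A full deep learning study would use a dedicated framework, larger data, and a proper train/test split; this only shows the shape of the approach.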

Data Collection and Cleaning Tools

The prediction tool can be automated by implementing application programming interfaces
(APIs) that scrape websites or consume available APIs to retrieve and collect data. Sites
such as AccuWeather provide APIs for fetching their data under paid plans. Other sources,
such as the NAFIS site which provides CSV and image file downloads, the Central Bank
and similar sites, could expose APIs for consumers to retrieve their data. Better data
ingestion, data cleaning and data preparation tools are needed to collect, organize and
prepare the available data into a highly-usable form.
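A small ingestion and cleaning step of the kind recommended above might normalize column names and parse dates on a raw CSV export; the sample row and headers below are illustrative:

```python
# Hedged sketch: read a raw CSV export, normalize column names, and
# parse the day/month/year dates used by the study's data files.
import io
import pandas as pd

raw = io.StringIO("County,NAFIS Date,NAFIS Prices\nNairobi,02/12/2015,1250\n")
df = pd.read_csv(raw)
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df["nafis_date"] = pd.to_datetime(df["nafis_date"], format="%d/%m/%Y")
```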

References
Accu-weather. (2018). Accu Weather. Retrieved from www.accu-weather.com

Adhikari, R., & Agrawal, R. K. (2013). An Introductory Study on Time Series Modeling
and Forecasting. arXiv preprint arXiv:1302.6613, 1–68.

Alif, A., Shukanya, I., & Afee, T. (2018). Crop prediction based on geographical and
climatic data using machine learning and deep learning. (December). Retrieved
from http://dspace.bracu.ac.bd/xmlui/handle/10361/11429

Baffes, J., & Dennis, A. (2013). Long-Term Drivers of Food Prices. The World Bank
Development Prospects Group, (May).

Borychowski, M., & Czyżewski, A. (2016). Determinants of prices increase of agricultural
commodities in a global context. Management, 19(2), 152–167.
https://doi.org/10.1515/manment-2015-0020

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.

Byrne, J. P., Fazio, G., & Fiess, N. (2010). Primary Commodity Prices : Co-movements ,
Common Factors and Fundamentals. 1–33.

Cai, Y., Moore, K., Pellegrini, A., Elhaddad, A., Townsend, C., Solak, H., & Semret, N.
(2017). Crop yield predictions - high resolution statistical model for intra-season
forecasts applied to corn in the US. Gro Intelligence, Inc.

CBK. (2018). Central Bank of Kenya Inflation Rates. Retrieved from
https://www.centralbank.go.ke/statistics/inflation-rates/

Chen, T., & Guestrin, C. (2016). XGBoost : A Scalable Tree Boosting System.

Chioka. (2013). Differences between L1 and L2 as Loss Function and Regularization.
Retrieved from http://www.chioka.in/differences-between-l1-and-l2-as-loss-
function-and-regularization/

Cutler, A. (2010). Random Forests for Regression and Classification.

Dale, L., & Ronald, L. (1995). Task-technology fit and individual performance.

Davis, F. D., Bagozzi, R. P., & Warshaw, P. R. (1989). User Acceptance of Computer
Technology: A Comparison of Two Theoretical Models. Management Science,
35(8), 982-1003.

Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of
information technology. MIS Quarterly.

Dishaw, M., Strong, D., & Bandy, D. B. (2004). The Impact of Task-Technology Fit in
Technology Acceptance and Utilization Models The Impact of Task-Technology Fit
in Technology Acceptance and Utilization Models.

Food, A. A. (2017). All About Food. Retrieved from
http://allaboutfood.aitc.ca/article/5-factors-that-affect-food-prices.php

Frelat, R., Lopez-Ridaura, S., Giller, K. E., Herrero, M., Douxchamps, S., Djurfeldt, A.
A., … van Wijk, M. T. (2015). Drivers of household food availability in sub-Saharan
Africa based on big data from small farms. Proceedings of the National Academy of
Sciences, 113(2), 458–463. https://doi.org/10.1073/pnas.1518384112

Gathondu, E. K. (2014). Modeling of Wholesale Prices for Selected Vegetables Using
Time Series Models in Kenya. University of Nairobi.

Hanson, K., Robinson, S., & Schluter, G. (1993). Sectoral Effects of a World Oil Price
Shock : Economywide Linkages to the Agricultural Sector. 18(1), 96–116.

Kaur, K. (2016). Machine Learning: Applications in Indian Agriculture. 5(4), 342–344.
https://doi.org/10.17148/IJARCCE.2016.5487

Kempenaar, C., Lokhorst, C., Bleumer, E. J. B., Veerkamp, R. F., Been, T., Evert, F. K.
van, … Noorbergen, H. (2016). Big data analysis for smart farming.

Kretschmer, B., Bowyer, C., & Buckwell, A. (2012). EU BioFuel Use and Agricultural
Commodity Prices -A Review of the Evidence Base - ActionAid.

Louppe, G. (2014). Understanding Random Forests Form Theory to Practice. (July).

Lukyamuzi, A., John, N., & George Washington, O. (2015). A Dynamic Model for
Prediction of Food Insecurity. 1–56.

NAFIS. (2018). National Farmer Information Service - NAFIS. Retrieved from
www.nafis.go.ke

Nelson, C. (2008). What’s Driving Food Prices. Farm Foundation, (July).

Noi, P. T., Degener, J., & Kappas, M. (2017). Comparison of multiple linear regression,
cubist regression, and random forest algorithms to estimate daily air surface
temperature from dynamic combinations of MODIS LST data. Remote Sensing, 9(5).
https://doi.org/10.3390/rs9050398

Okori, W., Obua, J., & Quinn, J. (2011). Machine Learning Classification Technique for
Famine Prediction. Proceedings of the World Congress on Engineering, II(1), 4–9.
Retrieved from http://www.iaeng.org/publication/WCE2011/WCE2011_pp991-
996.pdf [Accessed 3rd Aug 2014]

Otunaiya, A. O., & Shittu, A. M. (2014). Complete household demand system of
vegetables in Ogun State, Nigeria. 2014(11), 509–516.

Parag, R. (2017). Linear Regression. Retrieved from
https://towardsdatascience.com/simple-linear-regression-2421076a5892

Paradkar, M. (2017). Forecasting Markets using eXtreme Gradient Boosting (XGBoost).
Retrieved from https://www.quantinsti.com/blog/

Pavlyshenko, B. M. (2016). Linear, Machine Learning and Probabilistic Approaches for
Time Series Analysis. (August).

Pedregosa, F., Weiss, R., & Brucher, M. (2011). Scikit-learn: Machine Learning in
Python. Journal of Machine Learning Research, 12, 2825–2830.

Radhika, Y., & Shashi, M. (2013). Atmospheric Temperature Prediction using Support
Vector Machines. International Journal of Computer Theory and Engineering, 1(1),
55–58. https://doi.org/10.7763/ijcte.2009.v1.9

Ramesh, D., & Vardhan, B. V. (2013). Data Mining Techniques and Applications to
Agricultural Yield Data. International Journal of Advanced Research in Computer
and Communication Engineering, 2(9), 3477–3480.
https://doi.org/http://dx.doi.org/10.5120/16620-6472

Reinhart, C., & Borensztein, E. (2009). The determinants of commodity prices. (13870).

Samuel, A. (1959). Machine Learning. Retrieved from
https://en.wikipedia.org/wiki/Machine_learning

Santhanam, R. (2016). Comparative Study of XGBoost4j and Gradient Boosting for
Linear Regression. International Journal of Control Theory and Applications, 9(40),
1131–1142. https://doi.org/10.13140/RG.2.2.36040.21767

Shukla, M., & Jharkharia, S. (2011). Applicability of ARIMA Models in Wholesale
Vegetable Market: An Investigation. 1125–1130.

Sujjaviriyasup, T., & Pitiruek, K. (2013). Agricultural product forecasting using machine
learning approach. International Journal of Mathematical Analysis, 7(37–40), 1869–
1875. https://doi.org/10.12988/ijma.2013.35113

Techwave. (2016). Leveraging Big Data in Agriculture. Retrieved from
https://techwave.net/leveraging-bigdata-in-agriculture/

Tongai Hu. (2012). Factors Influencing Price of Agricultural Products and Stability
Countermeasures. Asian Agricultural Research.

UNES, A. (2016). Kenya Agricultural Sector Value Chain Market Research Survey : A
pilot study of Nairobi County November 2016.

Wikipedia. (2018a). Consumer Price Index - CPI. Retrieved from
https://en.wikipedia.org/wiki/Consumer_price_index

Wolfert, S., Ge, L., Verdouw, C., & Bogaardt, M. J. (2017). Big Data in Smart Farming –
A review. Agricultural Systems, 153, 69–80.
https://doi.org/10.1016/j.agsy.2017.01.023

Yu, H., Lo, H., Hsieh, H., Lou, J., Mckenzie, T. G., Chang, P. T., … Lin, C. (2010).
Feature Engineering and Classifier Ensemble for KDD Cup 2010. 1–12.

APPENDIX I: Sample Datasets

i. Final Research Data

Final excel data prepared for training and testing the data.

County  Date  Variety  Commodity  Unit  Kg  NAFIS Prices  NAFIS Season
Nairobi 02/12/2015 Horticulture Kales Bag 50 1250 peak
Nairobi 09/12/2015 Horticulture Kales Bag 50 1200 peak
Nairobi 10/12/2015 Horticulture Kales Bag 50 1200 peak
Nairobi 14/12/2015 Horticulture Kales Bag 50 1200 peak
Nairobi 28/12/2015 Horticulture Kales Bag 50 1200 peak
Nairobi 29/12/2015 Horticulture Kales Bag 50 1000 peak
Nairobi 03/01/2016 Horticulture Kales Bag 50 1200 low
Nairobi 04/01/2016 Horticulture Kales Bag 50 1250 low
Nairobi 07/01/2016 Horticulture Kales Bag 50 1600 low
Nairobi 09/01/2016 Horticulture Kales Bag 50 1600 low
Nairobi 11/01/2016 Horticulture Kales Bag 50 1300 low
Nairobi 02/02/2016 Horticulture Kales Bag 50 1200 low
Nairobi 03/02/2016 Horticulture Kales Bag 50 1200 low
Nairobi 11/02/2016 Horticulture Kales Bag 50 1250 low
Nairobi 15/02/2016 Horticulture Kales Bag 50 1100 low
Nairobi 16/02/2016 Horticulture Kales Bag 50 1100 low
Nairobi 19/02/2016 Horticulture Kales Bag 50 1000 low
Nairobi 29/02/2016 Horticulture Kales Bag 50 1000 low
Nairobi 02/03/2016 Horticulture Kales Bag 50 1200 low

Avg Temp  Max Temp  Min Temp  Precipitation mm  Annual Average Inflation  Twelve Month Inflation  Household Sizes  Total Household Demand kg
19 26 10 7.4 6.58 8.01 1259617 7708857
19 26 10 7.4 6.58 8.01 1259617 7708857
19 26 10 7.4 6.58 8.01 1259617 7708857
19 26 10 7.4 6.58 8.01 1259617 7708857
19 26 10 7.4 6.58 8.01 1259617 7708857
19 26 10 7.4 6.58 8.01 1259617 7708857
22 28 15 15.04 6.77 7.78 1263995 7735650
22 28 15 15.04 6.77 7.78 1263995 7735650
22 28 15 15.04 6.77 7.78 1263995 7735650
22 28 15 15.04 6.77 7.78 1263995 7735650
22 28 15 15.04 6.77 7.78 1263995 7735650
22 29 13 10.1 6.87 6.84 1268388 7762535
22 29 13 10.1 6.87 6.84 1268388 7762535
22 29 13 10.1 6.87 6.84 1268388 7762535
22 29 13 10.1 6.87 6.84 1268388 7762535
22 29 13 10.1 6.87 6.84 1268388 7762535
22 29 13 10.1 6.87 6.84 1268388 7762535
22 29 13 10.1 6.87 6.84 1268388 7762535
24 31 14 4.9 6.88 6.45 1272796 7789512

ii. Inflation Rates

This is the inflation rate data gathered from the Central Bank of Kenya and Kenya
National Bureau of Statistics websites.

Year  Month  Annual Average Inflation  12-Month Inflation
2015 January 6.74 5.53
2015 February 6.63 5.61
2015 March 6.63 6.31
2015 April 6.69 7.08
2015 May 6.65 6.87
2015 June 6.63 7.03
2015 July 6.54 6.62
2015 August 6.34 5.84
2015 September 6.29 5.97
2015 October 6.31 6.72
2015 November 6.42 7.32
2015 December 6.58 8.01
2016 January 6.77 7.78
2016 February 6.87 6.84
2016 March 6.88 6.45
2016 April 6.72 5.27
2016 May 6.59 5
2016 June 6.46 5.8
2016 July 6.44 6.4
2016 August 6.47 6.26
2016 September 6.5 6.34
2016 October 6.48 6.47
2016 November 6.43 6.68
2016 December 6.3 6.35
2017 January 6.26 6.99

2017 February 6.43 9.04
2017 March 6.76 10.28
2017 April 7.2 11.48
2017 May 7.84 11.7
2017 June 8.13 9.21
2017 July 8.21 7.47
2017 August 8.36 8.04
2017 September 8.4 7.06
2017 October 8.33 5.72
2017 November 8.15 4.73
2017 December 7.98 4.5
2018 January 7.79 4.83
2018 February 7.4 4.46
2018 March 6.89 4.18
2018 April 6.24 3.73
2018 May 5.61 3.95
2018 June 5.2 4.28
2018 July 4.95 4.35

iii. NAFIS Prices - Sample Data

This is a sample of the commodity prices published weekly on the National Farmer
Information Service (NAFIS) website for different counties.

iv. Accu-Weather

This is sample data from the accu-weather website showing the daily weather conditions in
Nairobi for the month of June 2018. The temperatures were converted to degrees Celsius
from the published degrees Fahrenheit.

County  NAFIS Date  Average Temperature  Maximum Temperature  Minimum Temperature  Precipitation mm
Nairobi 01/05/2018 24 26 13 1
Nairobi 03/05/2018 23 25 18 48
Nairobi 04/05/2018 24 26 17 0
Nairobi 08/05/2018 22 23 17 0
Nairobi 09/05/2018 22 23 16 0
Nairobi 10/05/2018 22 23 15 4
Nairobi 11/05/2018 22 23 16 3
Nairobi 14/05/2018 22 25 16 0
Nairobi 17/05/2018 22 24 16 3
Nairobi 21/05/2018 22 22 16 0
Nairobi 23/05/2018 22 22 16 13
Nairobi 24/05/2018 22 23 16 0
Nairobi 30/05/2018 21 24 13 0

v. ASDSP Report on Population and Trends

The ASDSP report provided the population and household trends from the year 1979 up to
projections for the year 2017. The report projected a household size of four persons with a
consumption of 6.12 kilograms of sukuma wiki per household, as calculated from the
visualization below.

vi. ASDSP Price Trends

APPENDIX II: Sample Machine Learning Model Code
This section shows the summary data statistics and code snapshots of the model service,
dashboard service, upload service and the data service for the application implemented.

i. Statistics

Summary statistics on the gathered dataset using the df.describe() function from the
pandas library.

df.describe()

ii. Model Service

SmartFlaskAPI.py

# GRADIENT BOOSTING
from sklearn import ensemble
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import preprocessing
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.externals import joblib
import calendar
import time
import datetime
from time import strptime
import pymysql
import mysql.connector
import sqlalchemy
from sqlalchemy import create_engine
from flask import request, url_for
from flask_api import FlaskAPI, status, exceptions

app = FlaskAPI(__name__)

@app.route("/model", methods=['GET', 'POST'])
def model_ml():
    """
    Train the model and predict commodity prices.
    """
    df = pd.read_csv('upload_dir\\Sukuma_Wiki_Counties_Cumulative.csv',
                     encoding='ISO-8859-1')

    # save raw data
    saveRawData(df)

    # one-hot encode the categorical variable season; produces season_low,
    # season_normal and season_peak
    df = pd.get_dummies(df, columns=["season"])
    sf = df

    # drop rows with missing values
    sf.dropna(inplace=True)

    Y = sf['nafis_prices']
    X = sf.drop(['nafis_prices', 'nafis_date', 'variety', 'commodity',
                 'unit', 'kg', 'county'], axis=1)

    seed = 7
    test_size = 0.33
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=test_size, random_state=seed)

    # Standardize features by removing the mean and scaling to unit variance
    # using StandardScaler(), then apply the scaling to X_train and X_test
    std_scale = StandardScaler().fit(X_train)
    X_train_scaled = std_scale.transform(X_train)
    X_test_scaled = std_scale.transform(X_test)

    # Instantiate and fit the Gradient Boosting model
    model = ensemble.GradientBoostingRegressor()
    reg_scaled = model.fit(X_train_scaled, y_train)
    y_train_scaled_fit = reg_scaled.predict(X_train_scaled)
    y_pred = reg_scaled.predict(X_test_scaled)

    X_test.loc[:, 'predictions'] = y_pred
    X_train.loc[:, 'predictions'] = y_train_scaled_fit
    df_row_merged = pd.concat([X_train, X_test], ignore_index=False)

    # Recreate the dataframe with the descriptive columns
    df_row_merged['nafis_date'] = df['nafis_date']
    df_row_merged['nafis_prices'] = df['nafis_prices']
    df_row_merged['variety'] = df['variety']
    df_row_merged['commodity'] = df['commodity']
    df_row_merged['unit'] = df['unit']
    df_row_merged['county'] = df['county']
    df_row_merged['kg'] = df['kg']
    df = pd.DataFrame(df_row_merged)

    # save predicted data
    savePredictionData(df)

    # save the model
    joblib.dump(model, 'gradient_boosting_final_model.pkl')

    # respond with a simple status payload (return added so the view is valid)
    return {"status": "model trained and saved"}

@app.route("/predict", methods=['GET', 'POST'])
def model_predict():
    """
    Predict commodity prices using the saved model.
    """
    df = pd.read_csv('upload_dir\\Sukuma_Wiki_Cumulative_Predict.csv',
                     encoding='ISO-8859-1')

    # one-hot encode the categorical variable season; produces season_low,
    # season_normal and season_peak
    df = pd.get_dummies(df, columns=["season"])
    sf = df.drop(['nafis_date', 'variety', 'commodity', 'unit', 'kg',
                  'county'], axis=1)

    # Load the model saved in the current working directory
    joblib_file = "gradient_boosting_final_model.pkl"
    joblib_model = joblib.load(joblib_file)

    Ypredict = joblib_model.predict(sf)
    print("Predicted: ", Ypredict)

    df['predictions'] = Ypredict
    df['prediction_date'] = pd.to_datetime(df['nafis_date'], format='%d/%m/%Y')
    df.to_csv('gradient_boosting_predicted_final_results.csv')

    # save future predicted data
    saveFuturePredictionData(df)

    # respond with a simple status payload (return added so the view is valid)
    return {"status": "future predictions saved"}

def saveRawData(df):
    engine = create_engine("mysql+mysqlconnector://root:root@localhost/smart_agri")
    df.to_sql(name='raws', con=engine, if_exists='append', index=False)

def savePredictionData(df):
    engine = create_engine("mysql+mysqlconnector://root:root@localhost/smart_agri")
    df.to_sql(name='predictions', con=engine, if_exists='append', index=False)

def saveFuturePredictionData(df):
    engine = create_engine("mysql+mysqlconnector://root:root@localhost/smart_agri")
    df.to_sql(name='future_predictions', con=engine, if_exists='append', index=False)

if __name__ == "__main__":
    app.run(debug=True)

iii. Data Service

(a) FuturePredictionsController.java

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.validation.annotation.Validated;
import org.springframework.web.bind.annotation.*;

import com.smartagri.model.FuturePredictions;
import com.smartagri.repository.FuturePredictionsRepository;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@RestController
public class FuturePredictionsController {

    private final Logger LOG = LoggerFactory.getLogger(getClass().getName());

    private final FuturePredictionsRepository futurePredictionsRepository;

    @Autowired
    public FuturePredictionsController(FuturePredictionsRepository futurePredictionsRepository) {
        this.futurePredictionsRepository = futurePredictionsRepository;
    }

    @RequestMapping(value = "/futurepredictions/averagenairobifuturepredictions", method = RequestMethod.GET)
    public List<Map> getAverageNairobiFuturePredictions() {
        List<Object[]> result = futurePredictionsRepository.getNairobiFuturePredictionAverages();
        LOG.info("Getting Average Nairobi Future Predictions Prices Data: {}.", result);

        Map<String, String> map = null;
        List<Map> futurePredictions = new ArrayList<Map>();

        if (result != null && !result.isEmpty()) {
            for (Object[] object : result) {
                map = new HashMap<String, String>();
                map.put("county", String.valueOf(object[0]));
                map.put("prediction", String.valueOf(object[1]));
                // Add to list
                futurePredictions.add(map);
            }
        }
        return futurePredictions;
    }

    @RequestMapping(value = "/futurepredictions/averagemombasafuturepredictions", method = RequestMethod.GET)
    public List<Map> getMombasaAverageFuturePredictions() {
        List<Object[]> result = futurePredictionsRepository.getMombasaFuturePredictionAverages();

        Map<String, String> map = null;
        List<Map> futurePredictions = new ArrayList<Map>();

        if (result != null && !result.isEmpty()) {
            for (Object[] object : result) {
                map = new HashMap<String, String>();
                map.put("county", String.valueOf(object[0]));
                map.put("prediction", String.valueOf(object[1]));
                // Add to list
                futurePredictions.add(map);
            }
        }
        return futurePredictions;
    }

    @RequestMapping(value = "/futurepredictions/averagekisumufuturepredictions", method = RequestMethod.GET)
    public List<Map> getKisumuAverageFuturePredictions() {
        List<Object[]> result = futurePredictionsRepository.getKisumuAverageFuturePredictions();

        Map<String, String> map = null;
        List<Map> futurePredictions = new ArrayList<Map>();

        if (result != null && !result.isEmpty()) {
            for (Object[] object : result) {
                map = new HashMap<String, String>();
                map.put("county", String.valueOf(object[0]));
                map.put("prediction", String.valueOf(object[1]));
                // Add to list
                futurePredictions.add(map);
            }
        }
        return futurePredictions;
    }

    @RequestMapping(value = "/futurepredictions/dailyfuturepredictions", method = RequestMethod.GET)
    public List<Map> getDailyFuturePredictions() {
        List<Object[]> result = futurePredictionsRepository.getDailyFuturePredictions();

        Map<String, String> map = null;
        List<Map> futurePredictions = new ArrayList<Map>();

        if (result != null && !result.isEmpty()) {
            for (Object[] object : result) {
                map = new HashMap<String, String>();
                map.put("county", String.valueOf(object[0]));
                map.put("prediction_date", String.valueOf(object[1]));
                map.put("prediction_year", String.valueOf(object[2]));
                map.put("prediction_month", String.valueOf(object[3]));
                map.put("prediction_month_int", String.valueOf(object[4]));
                map.put("prediction_price", String.valueOf(object[5]));
                // Add to list
                futurePredictions.add(map);
            }
        }
        return futurePredictions;
    }
}

(b) FuturePredictionsRepository.java

import com.smartagri.model.FuturePredictions;
import org.springframework.data.jpa.repository.*;
import org.springframework.stereotype.Repository;

import java.util.List;

/**
 * Spring Data JPA repository for the Future Predictions entity.
 */
@Repository
public interface FuturePredictionsRepository extends JpaRepository<FuturePredictions, Long> {

    // Get the average amount of future prediction
    String futurePredictionAverages = "SELECT county, (FLOOR(AVG(predictions) / 10) * 10) AS "
            + "avg_future_prediction FROM FuturePredictions";

    @Query(value = futurePredictionAverages)
    List<Object[]> getFuturePredictionAverages();

    // Get the average amount of future prediction for Nairobi
    String futureNairobiPredictionAverages = "SELECT county, (FLOOR(AVG(predictions) / 10) * 10) AS "
            + "avg_future_prediction FROM FuturePredictions WHERE county = 'Nairobi'";

    @Query(value = futureNairobiPredictionAverages)
    List<Object[]> getNairobiFuturePredictionAverages();

    // Get the average amount of future prediction for Mombasa
    String futureMombasaPredictionAverages = "SELECT county, (FLOOR(AVG(predictions) / 10) * 10) AS "
            + "avg_future_prediction FROM FuturePredictions WHERE county = 'Mombasa'";

    @Query(value = futureMombasaPredictionAverages)
    List<Object[]> getMombasaFuturePredictionAverages();

    // Get the average amount of future prediction for Kisumu
    String futureKisumuPredictionAverages = "SELECT county, (FLOOR(AVG(predictions) / 10) * 10) AS "
            + "avg_future_prediction FROM FuturePredictions WHERE county = 'Kisumu'";

    @Query(value = futureKisumuPredictionAverages)
    List<Object[]> getKisumuAverageFuturePredictions();

    // Get the daily future price prediction by counties
    String dailyFuturePredictions = "SELECT county, DATE(prediction_date) AS prediction_date, "
            + "YEAR(prediction_date) AS prediction_year, "
            + "MONTHNAME(prediction_date) AS prediction_month, MONTH(prediction_date) AS prediction_month_int, "
            + "(FLOOR(predictions)) AS prediction_price FROM FuturePredictions ORDER BY DATE(prediction_date) ASC";

    @Query(value = dailyFuturePredictions)
    List<Object[]> getDailyFuturePredictions();
}
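The county-average queries above round the mean predicted price down to the nearest ten shillings with FLOOR(AVG(predictions) / 10) * 10, so the dashboard shows a conservative round figure rather than a false-precision decimal. The same rounding, sketched in Python with made-up sample prices:

```python
import math


def avg_rounded_down_to_ten(predictions):
    """Mirror of the repository's FLOOR(AVG(predictions) / 10) * 10."""
    return math.floor(sum(predictions) / len(predictions) / 10) * 10


# Hypothetical per-day predicted prices for one county
print(avg_rounded_down_to_ten([38.0, 42.5, 47.0]))  # 40
```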

iv. Upload Service

(a) FileController.java

@RestController
@RequestMapping(value = { "/fileupload/" })
public class FileController {

    private static final Logger logger = LoggerFactory.getLogger(FileController.class);

    @Autowired
    private FileStorageService fileStorageService;

    @Autowired
    private URLProperties urlProperties;

    @Value("${url.model-train-api}")
    private String modelTrainAPI;

    @Value("${url.model-predict-api}")
    private String modelPredictAPI;

    @Autowired
    private ModelService modelService;

    @PostMapping("/uploadFile")
    public UploadFileResponse uploadFile(@RequestParam("file") MultipartFile file) {
        ResponseEntity<String> modelWebServiceResp = null;

        String fileName = fileStorageService.storeFile(file);
        String fileDownloadUri = ServletUriComponentsBuilder.fromCurrentContextPath()
                .path("/downloadFile/")
                .path(fileName)
                .toUriString();

        // On successful file upload, invoke the predict or train Flask API to process the request
        if (StringUtils.containsIgnoreCase(fileName, "predict")) {
            // invoke the predict API
            modelWebServiceResp = modelService.sendPredictModel(this.modelPredictAPI);
        } else {
            // invoke the train API
            modelWebServiceResp = modelService.sendTrainModel(this.modelTrainAPI);
        }

        logger.info("Model Web Service Response: " + modelWebServiceResp);

        return new UploadFileResponse(fileName, fileDownloadUri,
                file.getContentType(), file.getSize());
    }

    @PostMapping("/uploadMultipleFiles")
    public List<UploadFileResponse> uploadMultipleFiles(@RequestParam("files") MultipartFile[] files) {
        return Arrays.asList(files)
                .stream()
                .map(file -> uploadFile(file))
                .collect(Collectors.toList());
    }
}
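FileController routes each upload by checking whether the file name contains "predict" (case-insensitively, via Commons Lang's StringUtils.containsIgnoreCase): prediction files trigger the Flask /predict API, everything else triggers training. The same decision, sketched in Python with an illustrative helper name:

```python
def route_for_upload(file_name):
    """Pick the model endpoint the way uploadFile() does: names containing
    'predict' (any case) go to the predict API, all others to train."""
    return "predict" if "predict" in file_name.lower() else "train"


print(route_for_upload("Sukuma_Wiki_Cumulative_Predict.csv"))  # predict
print(route_for_upload("Sukuma_Wiki_Train.csv"))               # train
```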

(b) FileStorageService.java

@Service
public class FileStorageService {

    private final Path fileStorageLocation;

    @Autowired
    public FileStorageService(FileStorageProperties fileStorageProperties) {
        this.fileStorageLocation = Paths.get(fileStorageProperties.getUploadDir())
                .toAbsolutePath().normalize();

        try {
            Files.createDirectories(this.fileStorageLocation);
        } catch (Exception ex) {
            throw new FileStorageException("Could not create the directory where the uploaded files will be stored.", ex);
        }
    }

    public String storeFile(MultipartFile file) {
        // Normalize file name
        String fileName = StringUtils.cleanPath(file.getOriginalFilename());

        try {
            // Check if the file's name contains invalid characters
            if (fileName.contains("..")) {
                throw new FileStorageException("Sorry! Filename contains invalid path sequence " + fileName);
            }

            // Copy file to the target location (Replacing existing file with the same name)
            Path targetLocation = this.fileStorageLocation.resolve(fileName);
            Files.copy(file.getInputStream(), targetLocation, StandardCopyOption.REPLACE_EXISTING);

            return fileName;
        } catch (IOException ex) {
            throw new FileStorageException("Could not store file " + fileName + ". Please try again!", ex);
        }
    }

    public Resource loadFileAsResource(String fileName) {
        try {
            Path filePath = this.fileStorageLocation.resolve(fileName).normalize();
            Resource resource = new UrlResource(filePath.toUri());

            if (resource.exists()) {
                return resource;
            } else {
                throw new MyFileNotFoundException("File not found " + fileName);
            }
        } catch (MalformedURLException ex) {
            throw new MyFileNotFoundException("File not found " + fileName, ex);
        }
    }
}
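storeFile() rejects any file name containing a ".." path sequence before resolving it inside the upload directory, which blocks path-traversal uploads such as "../etc/passwd". A pure-Python sketch of the same guard (the function name is illustrative):

```python
def is_safe_filename(file_name):
    """Reject names with a '..' path sequence, as storeFile() does."""
    return ".." not in file_name


print(is_safe_filename("Sukuma_Wiki_Train.csv"))  # True
print(is_safe_filename("../etc/passwd"))          # False
```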

(c) ModelService.java

@Service
public class ModelService {

    private static final Logger logger = LoggerFactory.getLogger(ModelService.class);

    @Autowired
    RestTemplate restTemplate;

    private String modelPredictAPI;

    private String modelTrainAPI;

    @Autowired
    public ModelService(URLProperties urlProperties) {
        this.modelPredictAPI = urlProperties.getModelPredictAPI();
        this.modelTrainAPI = urlProperties.getModelTrainAPI();
    }

    /*
     * Send Predict Model
     */
    public ResponseEntity<String> sendPredictModel(String modelPredictAPI) {
        logger.info("Processing sendPredictModel() :: Model Predict Url: " + modelPredictAPI);

        try {
            HttpHeaders headers = new HttpHeaders();
            headers.setAccept(Arrays.asList(MediaType.APPLICATION_JSON));
            HttpEntity<String> entity = new HttpEntity<String>("parameters", headers);

            ResponseEntity<String> result = restTemplate.exchange(modelPredictAPI,
                    HttpMethod.GET, entity, String.class);

            logger.info("sendPredictModel ::: Result: " + result);
        } catch (RestClientException e) {
            logger.info("sendPredictModel ::: RestClientException " + e.getMessage());
        } catch (Exception e) {
            logger.info("sendPredictModel ::: Exception " + e.getMessage());
        }
        return new ResponseEntity<String>(HttpStatus.OK);
    }

    /*
     * Send Train Model
     */
    public ResponseEntity<String> sendTrainModel(String modelTrainAPI) {
        logger.info("Processing sendTrainModel() :: Model Train Url: " + modelTrainAPI);

        try {
            HttpHeaders headers = new HttpHeaders();
            headers.setAccept(Arrays.asList(MediaType.APPLICATION_JSON));
            HttpEntity<String> entity = new HttpEntity<String>("parameters", headers);

            ResponseEntity<String> result = restTemplate.exchange(modelTrainAPI,
                    HttpMethod.GET, entity, String.class);

            logger.info("sendTrainModel ::: Result: " + result);
        } catch (RestClientException e) {
            logger.info("sendTrainModel ::: RestClientException " + e.getMessage());
        } catch (Exception e) {
            logger.info("sendTrainModel ::: Exception " + e.getMessage());
        }
        return new ResponseEntity<String>(HttpStatus.OK);
    }
}
v. Dashboard Service

(a) DashboardComponent.html

<!-- Nairobi -->
<div style="display: flex;">
  <div class="col col-xl-6 col-lg-12">
    <div class="card mb-3">
      <div class="card-header"><b>Nairobi</b> - Nafis Prices vs Predicted Prices Trends: Last 1 Year</div>
      <div class="card-body">
        <div *ngIf="avgNairobiPredictionsPastChart">
          <canvas id="avg_nrb_past_canvas">{{ avgNairobiPredictionsPastChart }}</canvas>
        </div>
      </div>
    </div>
  </div>
  <div class="col col-xl-6 col-lg-12">
    <div class="card mb-3">
      <div class="card-header"><b>Nairobi</b> - Nafis Prices vs Predicted Prices: Last 1 Month</div>
      <div class="card-body table-responsive">
        <table class="table table-sm">
          <thead>
            <tr>
              <th scope="col">Date</th>
              <th scope="col">Nafis Price</th>
              <th scope="col">Predicted Price</th>
              <!-- <th scope="col">Average Temperature</th> -->
            </tr>
          </thead>
          <tbody>
            <tr *ngFor="let prediction of dailyNairobiNafisVSPredictionsData | async">
              <td>{{prediction.nafis_date}}</td>
              <td>{{prediction.nafisprices}}</td>
              <td>{{prediction.predictions}}</td>
              <!-- <td>{{prediction.avg_temp}}</td> -->
            </tr>
          </tbody>
        </table>
      </div>
    </div>
  </div>
</div>

<!-- Mombasa -->
<div style="display: flex;">
  <div class="col col-xl-6 col-lg-12">
    <div class="card mb-3">
      <div class="card-header"><b>Mombasa</b> - Nafis Prices vs Predicted Prices Trends: Last 1 Year</div>
      <div class="card-body">
        <div *ngIf="avgMombasaPredictionsPastChart">
          <canvas id="avg_msa_past_canvas">{{ avgMombasaPredictionsPastChart }}</canvas>
        </div>
      </div>
    </div>
  </div>
  <div class="col col-xl-6 col-lg-12">
    <div class="card mb-3">
      <div class="card-header"><b>Mombasa</b> - Nafis Prices vs Predicted Prices: Last 1 Month</div>
      <div class="card-body table-responsive">
        <table class="table table-sm">
          <thead>
            <tr>
              <th scope="col">Date</th>
              <th scope="col">Nafis Price</th>
              <th scope="col">Predicted Price</th>
              <!-- <th scope="col">Average Temperature</th> -->
            </tr>
          </thead>
          <tbody>
            <tr *ngFor="let prediction of dailyMombasaNafisVSPredictionsData | async">
              <td>{{prediction.nafis_date}}</td>
              <td>{{prediction.nafisprices}}</td>
              <td>{{prediction.predictions}}</td>
              <!-- <td>{{prediction.avg_temp}}</td> -->
            </tr>
          </tbody>
        </table>
      </div>
    </div>
  </div>
</div>

<!-- Kisumu -->
<div style="display: flex;">
  <div class="col col-xl-6 col-lg-12">
    <div class="card mb-3">
      <div class="card-header"><b>Kisumu</b> - Nafis Prices vs Predicted Prices Trends: Last 1 Year</div>
      <div class="card-body">
        <div *ngIf="avgKisumuPredictionsPastChart">
          <canvas id="avg_ksm_past_canvas">{{ avgKisumuPredictionsPastChart }}</canvas>
        </div>
      </div>
    </div>
  </div>
  <div class="col col-xl-6 col-lg-12">
    <div class="card mb-3">
      <div class="card-header"><b>Kisumu</b> - Nafis Prices vs Predicted Prices: Last 1 Month</div>
      <div class="card-body table-responsive">
        <table class="table table-sm">
          <thead>
            <tr>
              <th scope="col">Date</th>
              <th scope="col">Nafis Price</th>
              <th scope="col">Predicted Price</th>
              <!-- <th scope="col">Average Temperature</th> -->
            </tr>
          </thead>
          <tbody>
            <tr *ngFor="let prediction of dailyKisumuNafisVSPredictionsData | async">
              <td>{{prediction.nafis_date}}</td>
              <td>{{prediction.nafisprices}}</td>
              <td>{{prediction.predictions}}</td>
              <!-- <td>{{prediction.avg_temp}}</td> -->
            </tr>
          </tbody>
        </table>
      </div>
    </div>
  </div>
</div>
</div>
(b) Dashboard.Components.ts

// Get Daily Future Prices Chart Data
getDailyFuturePredictionsChartData() {
  this.futurePredictionsService.getDailyFuturePredictData()
    .subscribe(res => {
      console.log(res);

      res.forEach((futureprediction) => {
        let predictMonth = futureprediction.prediction_month;
        this.dailyFuturePredictMonth.push(predictMonth);

        let predictionDate = futureprediction.prediction_date;
        this.dailyFuturePredictDate.push(predictionDate);

        let county = futureprediction.county;
        let predicted_price = futureprediction.prediction_price;
        this.dailyFuturePredictNairobi.push(predicted_price);
      });

      this.dailyFuturePredictChart = new Chart('dailycanvas', {
        type: 'line',
        data: {
          labels: this.dailyFuturePredictDate,
          datasets: [
            {
              data: this.dailyFuturePredictNairobi,
              borderColor: "#3cba9f",
              fill: false,
              label: "August - October Predicted Prices",
            },
            /* {
              data: this.dailyFuturePredictMombasa,
              borderColor: "#0000ff",
              fill: false,
              label: "Mombasa",
            },
            {
              data: this.dailyFuturePredictKisumu,
              borderColor: "#FFA500",
              fill: false,
              label: "Kisumu",
            }, */
          ],
        },
        options: {
          legend: {
            display: true
          },
          scales: {
            xAxes: [{
              display: true
            }],
            yAxes: [{
              display: true
            }],
          }
        }
      });
    });
}
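The subscribe callback above turns the list of daily prediction records into parallel arrays: dates for the chart's x-axis labels and prices for the dataset values. The same transformation, sketched in Python (record keys follow the /dailyfuturepredictions response built by the data service; the sample values are made up):

```python
def chart_series(records):
    """Split prediction records into parallel label/data lists, as the
    forEach in getDailyFuturePredictionsChartData() does."""
    labels = [r["prediction_date"] for r in records]
    data = [r["prediction_price"] for r in records]
    return labels, data


records = [{"prediction_date": "2019-08-01", "prediction_price": "40"},
           {"prediction_date": "2019-08-02", "prediction_price": "45"}]
labels, data = chart_series(records)
print(labels)  # ['2019-08-01', '2019-08-02']
print(data)    # ['40', '45']
```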
