
Mao 1

Joel Mao

Ms. Leila Chawkat

Intern/Mentor Program Period 6

22 December 2022

Incorporation of Machine Learning Techniques in Algorithmic Trading

Abstract

Visualizing data is a necessary aspect of conveying information in an effective and efficient

manner. This is especially true for machine learning models, whose predictions and true

values are difficult to display appropriately with conventional data visualization methods. In this paper,

we examine the steps to create a confusion matrix, as well as the challenges involved.

Table of Contents

I. Introduction

II. Literature Review

III. Data Collection

IV. Conclusion

References

I. Introduction

Data visualization is and has been used by a wide variety of people, ranging from

businessmen and salespeople to scientists and researchers. As such, it plays a vital role in the

communication of data points, as well as in summarizing the conclusions and results that can be

drawn from such data. In this investigation, one data visualization technique in

particular, the confusion matrix, is examined in depth. Confusion matrices deal with the

predicted and true values of any predictive model, most commonly machine learning models.

These models make predictions, which are then checked against confirmed sets. Confusion

matrices take these predictions and place them into a table alongside the true values, allowing

for a visualization of the correct and incorrect predictions so that the creator can understand

which areas of the model may be weaker than others.

II. Literature Review

Confusion matrices are n x n matrices covering n classes, each with

its own column and row. The X and Y axes may be labeled according to the user's preferences, but

one axis typically holds the predicted classes, while the other holds the true classes. Creating the

confusion matrix requires several steps, and a multi-class problem involves different scenarios

than a situation with only a True and False set of predictions.
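
The row-and-column layout described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code used in this investigation: the function name build_confusion_matrix, the three class names, and the label lists are all hypothetical, and the sketch assumes that rows hold the true classes and columns hold the predicted classes.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Tally (true, predicted) pairs into an n x n table.

    Assumption: rows are true classes, columns are predicted classes.
    """
    index = {label: position for position, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        cm[index[truth]][index[guess]] += 1
    return cm


# Hypothetical class names and labels, used only for illustration.
labels = ["car", "drone", "person"]
y_true = ["car", "car", "drone", "person", "person", "drone"]
y_pred = ["car", "drone", "drone", "person", "car", "drone"]

cm = build_confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, cm):
    print(f"{label:>6}: {row}")
```

Each diagonal cell counts correct predictions for one class, while every off-diagonal cell counts one specific kind of mistake, which is what makes the multi-class case richer than the True/False case.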

Understanding how predictions are classified first gives users the foundation

necessary to set up the confusion matrix and its different result labels. There exist four labels:

True Positives, True Negatives, False Positives, and False Negatives. True Positives are where

the model correctly predicts the “positive” class (Google), while True Negatives are the

counterpart, where the model correctly predicts the “negative” class

(Oracle). In contrast, False Positives are where the model incorrectly predicts the “positive”

class (Google), and False Negatives are where the model incorrectly predicts the “negative” class

(Oracle).
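
As a small illustration of the four labels, the sketch below counts them for a binary problem. The function name count_outcomes and the sample label lists are hypothetical and chosen only to make the definitions concrete; the value 1 is assumed to stand for the “positive” class.

```python
def count_outcomes(y_true, y_pred, positive=1):
    """Count True/False Positives and Negatives for a binary problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            # The model predicted "positive": correct only if the truth is positive.
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # The model predicted "negative": correct only if the truth is negative.
            counts["TN" if truth != positive else "FN"] += 1
    return counts


# Hypothetical labels: 1 is the "positive" class, 0 is the "negative" class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

outcomes = count_outcomes(y_true, y_pred)
print(outcomes)
```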

Creation of the confusion matrix typically takes place in a programming

environment, such as Python or Java. Confusion matrices created in Python and

Java can commonly draw on built-in metrics (Brownlee), meaning that there are prebuilt functions

and modules written by others that can be reused. This makes it much easier to

create matrices from prebuilt “templates” that simplify the process. The standard for

data science and confusion matrices is typically Python, according to Oracle, due to the flexibility

of Python in conforming data to a user's preferences. In addition, scikit-learn (sklearn) and other Python

packages have made the coding and scripting relatively simple (Shin), with few

tweaks or large adjustments needed.

One of the most appealing features of confusion matrices is the variety of

metrics available to evaluate the performance of the model through the confusion matrix. The

most common metric is accuracy, which is the

fraction of the total predictions that the model correctly predicted (Mohajon). Another common

metric is precision, which focuses specifically on the positive class and is the fraction

of the predicted positives that were truly positive (Mohajon). Other

important metrics are recall, or the portion of the truly positive cases that were correctly predicted

as positive (Vidiyala), and F1, which combines the precision and recall scores as

their harmonic mean; it ranges from 0 to 1, with 1

indicating perfect precision and recall (Hernandez).
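
The four metrics above follow directly from the counts of true and false positives and negatives. The sketch below is one minimal way to compute them; the function name classification_metrics and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here rather than one taken from the cited sources.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from outcome counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the model is right on 85 of 100 predictions, yet recall (0.800) is lower than precision (about 0.889), which shows why a single accuracy figure can hide where a model is weak.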



Confusion matrices can be applied to a wide range of problems, especially in

machine learning. One of the largest applications lies within classification problems, such as

predicting population genetic variants (Indeed). Classifying different items and

understanding the abilities of predictive models for specific classes makes it possible to determine

which classes a model classifies better or worse. Other applications lie in cancer

patient modeling in healthcare, to understand whether the models are effective (Shin), as

well as in business model predictions, such as predicting whether a client will purchase or not

(Hernandez).

III. Data Collection

Variables

● Availability of Resources online (Greater than 10)

● Availability of coding IDEs

● Difficulties/problems with each available tutorial/instructional guide

Procedure

Testing will be conducted by building a complete confusion matrix based

on sample data from a machine learning model, and evaluating the steps taken

to complete the process. In addition, perhaps the most valuable part of the experiment will

be determining the difficulties posed by creating and using a custom confusion matrix, whether

they are inherent issues with the functionality of a confusion matrix or issues with the available resources and

the information they contain, and whether those resources prove thorough enough to guide a user through

the process without issues. The data used will be a sampled list from a JSON file, containing

mock predictions from a sensor, with eight classes of predictions.



Data Collection

IDEs Available:

● Jupyter Notebook **

● Google Colaboratory

● PyCharm

● Visual Studio Code

Online Confusion Matrix Resources

● https://www.w3schools.com/python/python_ml_confusion_matrix.asp

○ Creates confusion matrix from random numbers

○ Uses sklearn metrics to create/compile numbers and form confusion matrix

○ Some issues may be in attempting to increase the number of classes, and seeing

how the confusion matrix is able to expand and populate these new classes with

values.

● https://stackoverflow.com/questions/2148543/how-to-write-a-confusion-matrix

○ Open Source thread on creating confusion matrix

○ Seems long and inefficient in adding elements to the confusion matrix.

○ Uses scikit-learn, which is the most efficient and easy-to-use confusion matrix

creator

Cell Programming

This section imported the various modules needed to support the later

cells. Pandas was necessary for formatting the data frame, and because the data was stored in a

JSON file, the json module was imported to load it into the data frame. Matplotlib was

used for the visual confusion matrix, while scikit-learn (sklearn) was imported for the confusion

matrix processing and the actual plotting.

This cell displays the format of the data. This specific data file was formatted as a nested dictionary

with 4 levels: Classes, Predictions Class, Categories of

Predictions, and numerical Prediction Values. This process was rather simple, as it involved

simply finding a Python loading script for a JSON file.
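
A loading step of this kind might look like the sketch below. The original data file is not reproduced in this paper, so the key names ("Classes", "Predictions", "Categories", "Values"), the class names other than “Ground Motor Vehicle”, the numbers, and the file name predictions.json are all assumptions made only to give the example a four-level shape. The sketch writes the sample to a temporary file first so that it can run on its own.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical four-level layout: Classes -> class name -> Predictions -> lists.
sample = {
    "Classes": {
        "Ground Motor Vehicle": {
            "Predictions": {
                "Categories": ["Ground Motor Vehicle", "Aircraft"],
                "Values": [9, 1],
            }
        },
        "Aircraft": {
            "Predictions": {
                "Categories": ["Ground Motor Vehicle", "Aircraft"],
                "Values": [2, 8],
            }
        },
    }
}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "predictions.json"  # hypothetical file name
    path.write_text(json.dumps(sample), encoding="utf-8")

    # The loading step itself: json.load turns the file into nested dicts and lists.
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)

print(type(data).__name__)    # the top level is a dictionary
print(list(data["Classes"]))  # the class names stored in the file
```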



This portion was the most complicated, with issues arising as to how to access the

different areas of the dictionary. The intention was to access each individual result of the

prediction for each class. However, with the dictionary format of the data, it was rather difficult

to access each definition and sub definition. For example, the class of “Ground Motor Vehicle”

would have a Predictions subclass with its own definition, possessing 9 elements of its own

defining the Predictions the model made of what it thought the “Ground Motor Vehicle” was.

This was solved by organizing the different subclasses of the data, and understanding which

variables were changeable and which needed to remain static. Because the

prediction values for each class were stored in an array within the dictionary, it became clear that

a for loop could be used to collect each numerical prediction value in

turn. These values were then placed into a

numerical matrix named cm, denoting its future usage within the formal

confusion matrix.
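
The loop described above can be sketched as follows. Only the matrix name cm and the class “Ground Motor Vehicle” come from this paper; the key names, the other class names, and the counts are hypothetical stand-ins for the real file, and the actual cell may have been organized differently.

```python
# Hypothetical nested data in the assumed Classes -> Predictions -> lists layout.
data = {
    "Classes": {
        "Ground Motor Vehicle": {
            "Predictions": {
                "Categories": ["Ground Motor Vehicle", "Aircraft", "Person"],
                "Values": [8, 1, 1],
            }
        },
        "Aircraft": {
            "Predictions": {
                "Categories": ["Ground Motor Vehicle", "Aircraft", "Person"],
                "Values": [2, 6, 2],
            }
        },
        "Person": {
            "Predictions": {
                "Categories": ["Ground Motor Vehicle", "Aircraft", "Person"],
                "Values": [0, 3, 7],
            }
        },
    }
}

labels = list(data["Classes"])  # fixed class order, shared by rows and columns
cm = []
for true_label in labels:
    prediction = data["Classes"][true_label]["Predictions"]
    # Pair each category name with its count so the column order is explicit.
    counts = dict(zip(prediction["Categories"], prediction["Values"]))
    cm.append([counts.get(predicted_label, 0) for predicted_label in labels])

for label, row in zip(labels, cm):
    print(f"{label:>20}: {row}")
```

Looking counts up by category name, rather than trusting the order of the array, keeps the columns aligned even if one class lists its categories in a different order.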

This final cell gathered all of the information and placed it into the

actual confusion matrix. As can be seen in the image above, the confusion matrix has nine

different classes, with each class occupying a row and a column. The rows denote the true

objects, while the columns represent the predictions made by the model. This part of the coding

was relatively simple, as it used a short line of script that took in the premade confusion matrix

data array and converted it into a more formal colored confusion matrix, revealing the

tendencies and patterns of the predictive model.
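
One way such a plotting step can be written is with scikit-learn's ConfusionMatrixDisplay, which accepts a premade count array directly. The sketch below is an assumption about how the step could look, not a copy of the original cell: the three class names and the counts are hypothetical, and the "Agg" backend line is included only so the example runs without a display (it can be dropped inside Jupyter Notebook).

```python
import io

import matplotlib

matplotlib.use("Agg")  # off-screen rendering; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay

# Hypothetical premade matrix: rows are true classes, columns are predictions.
labels = ["Ground Motor Vehicle", "Aircraft", "Person"]
cm = np.array([[8, 1, 1], [2, 6, 2], [0, 3, 7]])

fig, ax = plt.subplots(figsize=(6, 5))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(ax=ax, cmap="Blues", xticks_rotation=45)  # colored grid with counts
ax.set_title("Confusion matrix (hypothetical counts)")
fig.tight_layout()

# Render to an in-memory PNG; in a notebook, plt.show() would display it instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Because the display object takes the finished array, no true and predicted label lists are required at this stage.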



Rationale

The IDE chosen for creating the confusion matrix was Jupyter Notebook. This IDE

was chosen due to its strength and widespread use within the coding community. Jupyter

Notebook is also commonly used for creating and testing machine learning models. Its

“cells” allow code to be organized and sectioned off, making it easier to

identify which areas are functioning or malfunctioning and simplifying the process of

isolating and retesting code.

For this experiment, complete documentation of every step that was

coded was crucial. A variety of different functions needed to come together in order to create

the confusion matrix, and by researching the specifics of using a Python dictionary, as well as the

formatting of confusion matrices, the process could be clearly documented.

Analysis

Several difficulties arose in coding the program; some

were resolvable with an online search, while others required some intuition to find

solutions. A relatively large number of resources were available for creating confusion

matrices in Python, likely due to the rising usage of machine learning and predictive models such as

the one evaluated in this experiment. However, one major topic lacking in

online resources was the conversion of the dictionary stored in the JSON file. It was somewhat

challenging to convert the file into a readable format, as well as to understand exactly how to

access and retrieve information from the dictionary. This issue can be resolved with some level

of intuition, by testing the different levels with a print function and then accessing the

dictionary with the appropriate functions.
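
The level-by-level inspection described here can also be automated. The helper below, describe, is a hypothetical name introduced for illustration, and the sample dictionary uses assumed key names rather than the real file's; the sketch simply walks the nested structure and reports the keys and value types found at each depth.

```python
def describe(node, depth=0, max_depth=4):
    """Return one summary line per key, indented by nesting level."""
    lines = []
    indent = "  " * depth
    if isinstance(node, dict) and depth < max_depth:
        for key, value in node.items():
            lines.append(f"{indent}{key!r}: {type(value).__name__}")
            lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(node, list):
        lines.append(f"{indent}list of {len(node)} item(s)")
    return lines


# Hypothetical sample with the same four-level shape as the data described above.
data = {
    "Classes": {
        "Ground Motor Vehicle": {
            "Predictions": {
                "Categories": ["Ground Motor Vehicle", "Aircraft"],
                "Values": [9, 1],
            }
        }
    }
}

lines = describe(data)
print("\n".join(lines))
```

Once the printed outline shows which keys lead to the arrays of values, the same keys can be used to index into the dictionary directly.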



Besides this initial challenge, the programming of the confusion matrix went rather

smoothly. The most helpful resource was StackExchange, which is an open forum-based website

for coding and programming related questions. With many other programmers on the website

providing feedback and strategies on how to tackle various problems, many solutions have been

suggested, and were thus implemented within the programming of the confusion matrix. This

also assisted with some of the more niche issues that were encountered that could not be

answered by the generic online resources found. One such issue was that many of these online

resources needed to use predicted and true values of the predictions from the classes in order to

format and create the confusion matrix, using the column classes of true objects and row classes

of predicted classes. However, creating the confusion matrix from the provided values

and their corresponding classes and predictions required a workaround: a

different approach to providing and formatting the confusion matrix data. This was

likely because many of the guides were intended for use within the predictive

model code itself, so that the confusion matrix would be created alongside the results

of the predictive model. In this specific case, where the results had already been provided and

categorized, the task was somewhat more difficult because few guides

covered the organization of existing data rather than the calculation of true/false predictions into a

confusion matrix.
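
For readers who would rather follow guides that expect lists of true and predicted labels, one possible workaround (not necessarily the one used in this investigation) is to expand the precomputed counts back into such lists. The function name expand_counts, the class names, and the counts below are hypothetical, and the sketch assumes rows are true classes and columns are predicted classes.

```python
from collections import Counter


def expand_counts(cm, labels):
    """Turn a count matrix into parallel lists of true and predicted labels."""
    y_true, y_pred = [], []
    for row_index, true_label in enumerate(labels):
        for column_index, predicted_label in enumerate(labels):
            count = cm[row_index][column_index]
            y_true.extend([true_label] * count)
            y_pred.extend([predicted_label] * count)
    return y_true, y_pred


# Hypothetical precomputed counts.
labels = ["car", "drone", "person"]
cm = [[8, 1, 1], [2, 6, 2], [0, 3, 7]]

y_true, y_pred = expand_counts(cm, labels)

# Sanity check: tallying the expanded pairs should reproduce the original matrix.
pair_counts = Counter(zip(y_true, y_pred))
rebuilt = [[pair_counts[(t, p)] for p in labels] for t in labels]
print(len(y_true), rebuilt == cm)
```

The expanded lists can then be passed to any tool that builds a confusion matrix from labels, at the cost of holding one list entry per prediction in memory.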

With the popularity of confusion matrices in evaluating the performance of predictive

models rising, as well as the overall increase in interest of machine learning models,

understanding how to effectively create, utilize, and analyze confusion matrices has become ever

more important. By going through the process of collecting various resources available to help

new users and programmers overcome their issues with creating confusion matrices, as well as

providing a basis of code for anyone to utilize and adjust for their own purposes, the

process has been simplified and compiled for easier usage in the future. Thus, with a

combination of intuition as well as resourceful research on various online websites, creating a

confusion matrix with any style and format of data is well within the bounds of achievability.

IV. Conclusion

This scientific paper discusses the necessary steps leading to the creation of a confusion

matrix, as well as the importance of each step and the confusion matrix as a whole. In order to

have an effective method of analyzing the results of a predictive model, we have looked closely

at the resources available to consumers in creating such a confusion matrix to evaluate the

performance of their own machine learning models. The steps are quite clear and widely

available online, from the classification of labels such as the true-positive and true-negative

labels to the metrics used in actually measuring and evaluating the

performance of the models. This analysis of the steps needed to create the confusion matrix

provides centralized research on the details of what exactly a confusion

matrix is, as well as the advantages and disadvantages of the process of creating one.

The next major research direction involves the combination of several different data visualization

techniques to be able to cover all facets and aspects of the data itself.

References

[1] Brownlee, J. (2020, August 14). What is a confusion matrix in machine learning.

MachineLearningMastery.com. Retrieved December 10, 2022, from

https://machinelearningmastery.com/confusion-matrix-machine-learning/

[2] Genesis. “Confusion Matrix and ROC Curve.” From The GENESIS, 26 June 2018,

https://www.fromthegenesis.com/confusion-matrix-and-roc-curve/.

[3] Google Developers. “Classification: True vs. False and Positive vs. Negative | Machine

Learning | Google Developers.” Google, Google,

https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative#:~:text=A%20true%20positive%20is%20an,incorrectly%20predicts%20the%20positive%20class.

[4] Hernández, Pablo. “Mine Is Better: Metrics for Evaluating Your (and Others) Machine

Learning Models.” Datascience.aero, 7 Jan. 2021,

https://datascience.aero/metrics-evaluating-machine-learning/.

[5] Indeed Editorial Team. (2022, October 3). What is a confusion matrix? (plus how to calculate

one). Indeed.com Career Guide. Retrieved December 11, 2022, from

https://www.indeed.com/career-advice/career-development/confusion-matrix

[6] Manliguez, Cinmayii. (2016). Generalized Confusion Matrix for Multiple Classes.

10.13140/RG.2.2.31150.51523.

[7] Mohajon, Joydwip. “Confusion Matrix for Your Multi-Class Machine Learning Model.”

Medium, Towards Data Science, 24 July 2021,

https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826.

[8] Ng, Andrew, and Kian Katanforoosh. “Advanced Evaluation Metrics.” Section 8 (Week 8), 1

Jan. 2022, https://cs230.stanford.edu/section/8/.

[9] Shin, Terence. “Understanding the Confusion Matrix and How to Implement It in Python.”

Medium, Towards Data Science, 4 Dec. 2021,

https://towardsdatascience.com/understanding-the-confusion-matrix-and-how-to-implement-it-in-python-319202e0fe4d.

[10] Tayabali, S. (2020, December 11). A simple guide to building a confusion matrix.

Blogs.oracle.com. Retrieved December 10, 2022, from

https://blogs.oracle.com/ai-and-datascience/post/a-simple-guide-to-building-a-confusion-matrix

[11] Vidiyala, Ramya. “Confusion Matrix in a Nutshell.” Medium, Analytics Vidhya, 25 May

2020, https://medium.com/analytics-vidhya/a-z-of-confusion-matrix-under-5-mins-147c1b4467ab.
