
Jack Sebahar
Dr. Luo
Math 424
9 December 2023

Automated ML Model Training and Comparison


Abstract
The increasing prominence of artificial intelligence (AI) and machine learning (ML) underscores
the significance of factors influencing their accuracy. This project addresses the challenge of determining
the optimal "data split" for training, testing, and validation sets, a crucial aspect of model development.
Utilizing TensorFlow and OpenMPI, my project automates the training and testing process to enable the
rapid comparison of multiple models, enhancing efficiency through parallelization.
The training portion involves user input for the number of models to compare, data source
location, split proportions, and output directory. The "train" method is implemented using TensorFlow,
and OpenMPI is leveraged to concurrently train models with different splits. Evaluation results, including
elapsed time, are presented, demonstrating simultaneous training.
The model comparison phase, also employing OpenMPI, prompts the user for the model storage
location, a sample image, and the model labels. Prediction values are returned in alphabetical order and linked with
user-input labels for model evaluation. The project successfully achieves its goals, allowing concurrent
training and testing of multiple models.
An important consideration is the assumption that the number of ranks corresponds to the number
of models. The project's efficiency is demonstrated by comparing parallel and sequential executions.
While effective for a limited number of threads and small datasets, scaling to numerous models and large
datasets may require substantial computing power. Future implementation could explore "Distributed
Training with TensorFlow" for optimized parallelization tailored to TensorFlow.

Problem
The popularity of artificial intelligence and machine learning has brought attention to factors
influencing the accuracy of these systems. One key factor is the "data split," where data is divided into
training, testing, and validation sets. Training data, the largest segment, is used to train the model, while
validation data tunes hyperparameters, and testing data evaluates the final performance. Although a
common split is 80% training, 10% validation, and 10% testing, the user determines the allocation,
impacting model accuracy. The challenge lies in determining the optimal split, which depends on various
factors, making it difficult without expert knowledge. To address this, my project automates the training
and testing process using TensorFlow and OpenMPI, aiming to quickly compare multiple models and
enhance efficiency by leveraging the correct number of threads.
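Although my project's own split logic is not reproduced here, the following minimal Python sketch illustrates what a three-way split such as 80/10/10 means in practice; the split_dataset helper is hypothetical and only meant to show the idea.

    import random

    def split_dataset(samples, train_frac, val_frac, test_frac, seed=0):
        # Shuffle the samples, then carve them into training, validation,
        # and testing lists according to the given proportions.
        assert abs(train_frac + val_frac + test_frac - 1.0) < 1e-6
        samples = list(samples)
        random.Random(seed).shuffle(samples)
        n_train = int(len(samples) * train_frac)
        n_val = int(len(samples) * val_frac)
        return (samples[:n_train],
                samples[n_train:n_train + n_val],
                samples[n_train + n_val:])

    # The common 80/10/10 split mentioned above, applied to 1,000 samples.
    train, val, test = split_dataset(range(1000), 0.8, 0.1, 0.1)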

Process
My first task was to design the training portion of the code. This portion was specifically
designed for image classification, but the process is much the same for other types of models and could
easily be adapted. This code was in charge of a few things:

1. Ask the user how many models they want to compare
I limited this to 10 models because I am not sure of the processing-power implications of a higher
number, and I don't see much of an advantage to comparing that many anyway.
2. Ask the user where their data is coming from
The user needs to provide their data, stored in a folder with one subfolder for each class.
3. Retrieve the split for each model.
In order to test each model, each one needs a different split. This is where the user provides the
proportions of training, validation, and testing data for their different models, given in the form
[X X X] for training, validation, and testing. These values are given as decimals and must sum to 1.
4. Specify the output directory
The models need somewhere to go, so I allowed the user to specify this as well. A rough sketch of this
input step follows.
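The exact prompts in my program are not reproduced here, but the following Python sketch illustrates the input-gathering step described in the four items above; the prompt wording and variable names are assumptions based on the description rather than the project's exact code.

    num_models = min(int(input("How many models to compare (max 10)? ")), 10)
    data_dir = input("Path to the data folder (one subfolder per class): ")
    output_dir = input("Output directory for the trained models: ")

    splits = []
    for i in range(num_models):
        # Splits are entered as three decimals for training, validation,
        # and testing, e.g. "0.8 0.1 0.1"; they must sum to 1.
        parts = [float(x) for x in input(f"Split for model {i + 1} [train val test]: ").split()]
        assert len(parts) == 3 and abs(sum(parts) - 1.0) < 1e-6
        splits.append(parts)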

Once everything is set by the user, the training process can begin. I created a method called
“train.” Most of this code is TensorFlow-related, like initializing the model and starting the training
process. But one important thing to note is that this training method takes the training split as a parameter
in the form of an array. Using OpenMPI, I created a rank for each model being trained. Each rank calls
the training method once, with its own split parameter. For instance, if the user wanted to train
two models, one with an 80%, 10%, 10% split, and one with a 90%, 0%, 10% split, one of the ranks
would call train with a parameter of [80, 10, 10] and the other would call train with a parameter of [90, 0,
10].
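A minimal sketch of this rank-per-model dispatch, using mpi4py to drive OpenMPI, is shown below. The train function here is only a placeholder for the TensorFlow training routine, and the script assumes it is launched with one rank per model (for example, mpirun -np 2 python train_models.py, where the file name is illustrative).

    from mpi4py import MPI

    def train(split):
        # Placeholder for the TensorFlow routine described above: build the
        # model, train it on the given split, and save it to the output folder.
        print(f"training a model with split {split}")

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # One split per rank, matching the two-model example in the text.
    splits = [[80, 10, 10], [90, 0, 10]]

    # Each rank trains exactly one model with its own split.
    train(splits[rank])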
Once the training method begins, the data gets run through ten training epochs, which is ten
iterations through the samples. Once all ten iterations are performed, you get a brief evaluation of the
model. With OpenMPI, it is evident that all models are trained at the same time, since you can see
multiple epochs running at once. Once the epochs are finished, you get a summary of the elapsed time for
each model and the total elapsed time (see results), and each model will be located in the specified
directory.
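The per-model and total elapsed times could be measured roughly as in the sketch below, using mpi4py's MPI.Wtime and a gather to rank 0; this is an assumption about how such timing might be implemented rather than a copy of my code.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    start = MPI.Wtime()
    # ... each rank trains its model here ...
    elapsed = MPI.Wtime() - start

    # Collect every rank's elapsed time on rank 0; the slowest rank bounds
    # the total wall-clock time of the parallel run.
    all_times = comm.gather(elapsed, root=0)
    if rank == 0:
        for r, t in enumerate(all_times):
            print(f"rank {r}: {t:.1f} s")
        print(f"total (slowest rank): {max(all_times):.1f} s")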
The second part of my project was the model comparison, also using OpenMPI. This code was in
charge of doing the following:

1. Ask the user where the models are stored
2. Ask the user what sample image they want to compare the models with
3. Ask the user what labels their models contain

The code takes a directory of models, so one thing I needed to do was create a list, named
“model_files,” of all of the models in the directory. Once the list is obtained, I assign a rank to each model in
the list the user wants to compare. Each rank calls a method called “classify_image,” which takes
parameters “image_path,” “labels,” and “model_path.” The first two are the same for each rank, but the
model that is input as a parameter is different for each rank. Each model then gets run through the
interpreter concurrently, and predictions are returned for each model.
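A sketch of this per-rank comparison step is shown below; the body of classify_image, the model file names, the sample image path, and the preprocessing details are assumptions based on the description above rather than the project's exact code, and the script again assumes one rank per model.

    import numpy as np
    import tensorflow as tf
    from mpi4py import MPI

    def classify_image(image_path, labels, model_path):
        # Load one model into the TensorFlow Lite interpreter.
        interpreter = tf.lite.Interpreter(model_path=model_path)
        interpreter.allocate_tensors()
        inp = interpreter.get_input_details()[0]
        out = interpreter.get_output_details()[0]

        # Resize the sample image to the model's expected input shape.
        h, w = inp["shape"][1], inp["shape"][2]
        img = tf.keras.utils.load_img(image_path, target_size=(h, w))
        batch = np.expand_dims(tf.keras.utils.img_to_array(img), 0).astype(inp["dtype"])

        interpreter.set_tensor(inp["index"], batch)
        interpreter.invoke()
        # The labels parameter mirrors the description above; the scores are
        # paired with the (sorted) labels afterwards, as shown in the next sketch.
        return interpreter.get_tensor(out["index"])[0]

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    model_files = ["model_0.tflite", "model_1.tflite"]   # hypothetical file names
    labels = ["daisies", "dandelions", "roses", "sunflowers", "tulips"]
    predictions = classify_image("sample.jpg", labels, model_files[rank])
    print(f"rank {rank}: {predictions}")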
When using the TensorFlow Interpreter, I was only able to get it to return the prediction values
and not the label with its corresponding prediction. I did, however, know that the predictions are returned
alphabetically. For example, I used the flower classification model to classify daisies, dandelions, roses,
sunflowers, and tulips. Predictions are returned in that exact order. In order to assign a label to its
prediction, I prompted the user to input their labels. This may not be the best way to do it, but it works for
the purposes of the project. Since I know the predictions are returned alphabetically, I used the sort()
method to sort the user inputs and link them with the predictions (see results). It is clear that each
model produces different confidence values.
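Concretely, the pairing works as in the following sketch; the label spellings and prediction values are illustrative only.

    labels = ["tulips", "roses", "daisies", "sunflowers", "dandelions"]  # user input, in any order
    predictions = [0.02, 0.85, 0.05, 0.06, 0.02]                         # returned alphabetically (example values)

    labels.sort()  # now: daisies, dandelions, roses, sunflowers, tulips
    for label, score in zip(labels, predictions):
        print(f"{label}: {score:.2f}")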

Results (Training Code)

Above, this is what normal training epochs look like; they get run one at a time.

Above, this is what parallelized training epochs look like. As you can see, it runs three epochs at the same
time, since I’m training three models.

In the above example, I trained three models sequentially. The training splits were [70 15 15], [60 20 20],
and [50 25 25]. As you can see, the total execution time was about 82.3 seconds, and each epoch runs
one at a time. Everything printed above the execution time is TensorFlow-related output.

This is the exact same program run in parallel. I used the exact same training splits, but as you can see, the
execution time was roughly halved, to about 41 seconds. It is evident that OpenMPI was parallelizing the
code, because the total execution time was only slightly higher than that of the slowest rank, and multiple
epochs can be seen running at the same time.

This is what the output looks like. Each model is contained in its respective folder. The first number
corresponds to the rank that trained the model, and the next three numbers represent the training split for
each.

Results (Model Comparison Code)

Above, we have the output of the model comparison code run sequentially. As you can see, each model
returns slightly different confidence scores based on its training parameters. The total execution time of
this program was about 0.69 seconds.

Lastly, this is the output from comparing the models in parallel. In this example, each rank tests a
different model. The confidence values are the same as previously, except now the execution time was
down to 0.34 seconds, highlighting the effectiveness of OpenMPI.
It is apparent in both programs that utilizing OpenMPI decreases the execution time. For both
examples, I used a relatively small dataset and trained only a few models. The time reduction would be even
greater with more complex data, assuming sufficient computing power is available.

Conclusion
Overall, my project worked exactly as intended. Using the model training program, I was able to
successfully train multiple models concurrently using OpenMPI with different training splits. Using the
second program, I was able to test the models against a sample image and return the predictions to the
user concurrently. There is also an observable execution time speed-up when using OpenMPI. One
important note is that both of my programs assume the number of ranks is equal to the number of models
the user wants to train or test. For example, to train three models, the program needs to be run with three
ranks; the same goes for the testing program. This is because I had difficulty distributing models across
ranks when there were fewer ranks than models: if there were two ranks and the user wanted to train four
models, each rank would have to train two models, and I was never able to get that working as intended.
However, it is safe to assume that the user knows how many models they want to train before running the
program, so this is not a significant limitation. In order to test the
efficiency of parallelization, I created a duplicate of each program that ran sequentially instead of
concurrently. By doing so, I was able to view the execution time differences between the two. This project
worked for a relatively small number of threads and a small amount of training data. I am uncertain what
would happen if you wanted to train many models on thousands of samples, but it would certainly
require substantial computing power. If I were to actually implement this project, I would probably use
“Distributed Training with TensorFlow” to achieve parallelization as it is built directly for TensorFlow
and probably optimizes the parallelization process.
