Jack Sebahar
Dr. Luo
Math 424
9 December 2023
Problem
The popularity of artificial intelligence and machine learning has brought attention to factors
influencing the accuracy of these systems. One key factor is the "data split," where data is divided into
training, testing, and validation sets. Training data, the largest segment, is used to train the model, while
validation data tunes hyperparameters, and testing data evaluates the final performance. Although a
common split is 80% training, 10% validation, and 10% testing, the user determines the allocation,
impacting model accuracy. The challenge lies in determining the optimal split, which depends on various
factors, making it difficult without expert knowledge. To address this, my project automates the training
and testing process using TensorFlow and OpenMPI, aiming to quickly compare multiple models and
enhance efficiency by leveraging the correct number of threads.
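As a concrete illustration of how a percentage split maps to sample counts (the function name and layout here are mine, not part of the project code):

```python
def split_sizes(n_samples, split):
    """Convert a percentage split like [80, 10, 10] into
    (training, validation, testing) sample counts."""
    n_train = n_samples * split[0] // 100
    n_val = n_samples * split[1] // 100
    n_test = n_samples - n_train - n_val  # remainder goes to testing
    return n_train, n_val, n_test

# For 1,000 samples with the common 80/10/10 split:
# split_sizes(1000, [80, 10, 10]) -> (800, 100, 100)
```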
Process
My first task was to design the training portion of the code. This portion was written specifically for image classification, but the process is largely the same for other types of models, so I could easily adapt it if I wanted. This code was in charge of a few things.
Once everything is set by the user, the training process can begin. I created a method called
“train.” Most of this code is TensorFlow-related, like initializing the model and starting the training
process. One important thing to note is that this training method takes the training split as a parameter in the form of an array. Using OpenMPI, I created a rank for each model to be trained, and each rank calls the training method once with its own split parameter. For instance, if the user wanted to train two models, one with an 80%, 10%, 10% split and one with a 90%, 0%, 10% split, one rank would call train with a parameter of [80, 10, 10] and the other would call train with a parameter of [90, 0, 10].
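The rank-to-split dispatch described above might look like the following sketch (this assumes mpi4py as the Python binding for OpenMPI; the helper name, the fallback, and the example splits are mine):

```python
def split_for_rank(rank, splits):
    """Each MPI rank trains exactly one model, using its own split."""
    if rank >= len(splits):
        raise ValueError("launch with one rank per training split")
    return splits[rank]

try:
    from mpi4py import MPI  # assumed OpenMPI binding
    rank = MPI.COMM_WORLD.Get_rank()
except ImportError:
    rank = 0  # fall back to a single "rank" when MPI is unavailable

splits = [[80, 10, 10], [90, 0, 10]]  # one entry per model/rank
# train(split_for_rank(rank, splits))  # each rank trains its model
```

Launched as `mpirun -np 2 python train.py`, each of the two ranks picks up a different split and trains its model independently.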
Once the training method begins, the data is run through ten training epochs, i.e., ten iterations through the samples. After all ten iterations, you get a brief evaluation of the model. When running with OpenMPI, it is evident that all models are trained at the same time, since multiple epochs can be seen running at once. Once the epochs are finished, you get a summary of the elapsed time for
each model and the total elapsed time (see results), and each model will be located in the specified
directory.
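The per-model timing summary can be produced with a simple wrapper like this (a sketch; `train_fn` stands in for the TensorFlow training call, whose details are not shown here):

```python
import time

def timed_train(train_fn, split):
    """Run one model's training and record how long it took."""
    start = time.perf_counter()
    result = train_fn(split)
    elapsed = time.perf_counter() - start
    print(f"split {split}: {elapsed:.1f} s elapsed")
    return result, elapsed
```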
The second part of my project was the model comparison, also using OpenMPI. This code was in charge of the following tasks.
The code takes a directory of models, so one thing I needed to do was build a list, named “model_files,” of all of the models in that directory. Once the list is obtained, I assign a rank to each model in the list that the user wants to compare. Each rank calls a method called “classify_image,” which takes
parameters “image_path,” “labels,” and “model_path.” The first two are the same for each rank, but the
model that is input as a parameter is different for each rank. Each model then gets run through the
interpreter concurrently, and predictions are returned for each model.
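Building `model_files` and handing one model to each rank could be sketched as follows (the file extension and helper name are my assumptions, not taken from the project code):

```python
import os

def list_model_files(model_dir, extension=".tflite"):
    """Collect model files from a directory in a deterministic order,
    so that every rank sees the same list."""
    return sorted(
        os.path.join(model_dir, f)
        for f in os.listdir(model_dir)
        if f.endswith(extension)
    )

# Under MPI, rank r would then call something like:
# classify_image(image_path, labels, model_path=model_files[r])
```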
When using the TensorFlow interpreter, I was only able to get it to return the prediction values, not each label paired with its corresponding prediction. I did, however, know that the predictions are returned in alphabetical order of the class labels. For example, I used the flower classification model to classify daisies, dandelions, roses, sunflowers, and tulips, and predictions are returned in exactly that order. To assign each label to its prediction, I prompted the user to input their labels. This may not be the best way to do it, but it works for the purposes of the project. Since I know the predictions are returned alphabetically, I used the sort() method to sort the user inputs and link them with the predictions (see results). It is clear that each model reaches different accuracy levels.
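Pairing the sorted labels with the interpreter's outputs, as described, amounts to something like this (the function name is mine):

```python
def label_predictions(user_labels, predictions):
    """Attach class labels to raw prediction scores.

    The interpreter returns scores in alphabetical order of the
    class names, so sorting the user's labels lines them up."""
    return dict(zip(sorted(user_labels), predictions))

# e.g. for the flower model, the first score belongs to "daisies",
# the second to "dandelions", and so on.
```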
Above is what normal, sequential training epochs look like: they run one at a time.
Above is what parallelized training epochs look like. As you can see, three epochs run at the same time, since I am training three models.
In the above example, I trained three models sequentially. The training splits were [70, 15, 15], [60, 20, 20], and [50, 25, 25]. As you can see, the total execution time was about 82.3 seconds, and each epoch runs one at a time. Everything above the execution time consists of TensorFlow-related output.
This is the exact same program run in parallel. I used the exact same training splits, but as you can see, the execution time was roughly halved, at 41 seconds. It is evident that OpenMPI was parallelizing the code, because the total execution time was only slightly higher than that of the slowest thread, and multiple epochs can be seen running at the same time.
This is what the output looks like. Each model is contained in its respective folder. The first number
corresponds to the rank that trained the model, and the next three numbers represent the training split for
each.
Above, we have the output of the model comparison code run sequentially. As you can see, each model
returns slightly different confidence scores based on its training parameters. The total execution time of
this program was about 0.69 seconds.
Lastly, this is the output from comparing the models in parallel. In this example, each rank tests a
different model. The confidence values are the same as previously, except now the execution time was
down to 0.34 seconds, highlighting the effectiveness of OpenMPI.
It is apparent in both programs that utilizing OpenMPI decreases the execution time. In both examples, I used a relatively small data set and trained only a few models. The time reduction would be even greater with more complex data, assuming you have a sufficiently powerful computer.
Conclusion
Overall, my project worked exactly as intended. Using the model training program, I was able to successfully train multiple models concurrently with OpenMPI, each with a different training split. Using the second program, I was able to test the models against a sample image and return the predictions to the user concurrently, with an observable execution-time speed-up from OpenMPI. One important note is that both of my programs assume the number of ranks is equal to the number of models the user wants to train or test. For example, to train three models, the program must be run with three ranks; the same goes for the testing program. This is because I had difficulty distributing ranks when there were fewer ranks than models: if there were two ranks and the user wanted to train four models, each rank would have to train two models, and this never worked as intended. It is safe to assume, though, that the user knows how many models they want to train before running the program, so this is not really an issue. To test the efficiency of parallelization, I created a duplicate of each program that ran sequentially instead of concurrently, which let me compare the execution times of the two versions. This project worked for a relatively small number of threads and a small amount of training data. I am uncertain what would happen if you wanted to train many models on thousands of samples, but it would certainly require a very powerful computer. If I were to actually deploy this project, I would probably use “Distributed Training with TensorFlow” to achieve parallelization, since it is built directly for TensorFlow and likely optimizes the parallelization process.
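The one-rank-per-model assumption noted above could be enforced with a small guard at startup (a sketch of mine, not part of the original code):

```python
def check_rank_count(n_ranks, n_models):
    """Fail fast if the launch violates the one-rank-per-model assumption."""
    if n_ranks != n_models:
        raise SystemExit(
            f"run with {n_models} ranks (one per model), got {n_ranks}"
        )

# Under MPI this would be called as something like:
# check_rank_count(MPI.COMM_WORLD.Get_size(), len(model_files))
```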