
MACHINE LEARNING MODEL TO PREDICT SOLAR FLARES
Eugene Kang
Literature Review

I used the following resources in my literature review:

◦ Ahmed, O., Qahwaji, R., Colak, T., Higgins, P., Gallagher, P., & Bloomfield, D. (2013). Solar Flare Prediction Using Advanced Feature Extraction, Machine Learning, and Feature Selection. Solar Physics, 283(1), 157–175. https://doi.org/10.1007/s11207-011-9896-1

◦ Daren Yu, Xin Huang, Huaning Wang, & Yanmei Cui. (2009). Short-Term Solar Flare Prediction Using a Sequential Supervised Learning Method. Solar Physics, 255(1), 91–105. https://doi.org/10.1007/s11207-009-9318-9

◦ Cinto, T., Gradvohl, A. L. S., Coelho, G. P., & da Silva, A. E. A. (2020). Solar Flare Forecasting Using Time Series and Extreme Gradient Boosting Ensembles. Solar Physics, 295(7). https://doi.org/10.1007/s11207-020-01661-9

These papers helped me in my research by providing a pathway of what other data scientists did to approach similar problems. The case study of time series data for solar flares (Cinto et al., 2020) was extremely helpful in providing a starting point from which I could branch out. As this field is relatively new, innovations are always happening. The general consensus among these studies is that Total Photospheric Energy Density was one of the more prominent features in flare forecasting.
Feedback Loop

01 Visualize Data
02 Train Machine Learning Model
03 Make Improvements
About the Data
◦ Space Weather ANalytics for Solar Flares, or SWAN-SF, is the dataset that I am going to be
using to train the Machine Learning (ML) model.
◦ It contains 5 partitions: different splits of the dataset over different lengths of time.
◦ These partitions will be referred to as p1, p2, p3, p4, and p5.

◦ It contains different features that contribute to the flare and nonflare outputs, 1 and 0
respectively.
◦ Ahmed, O., Qahwaji, R., Colak, T., Higgins, P., Gallagher, P., & Bloomfield, D. (2013). Solar Flare Prediction Using Advanced Feature Extraction, Machine Learning, and Feature Selection. Solar Physics, 283(1), 157–175. https://doi.org/10.1007/s11207-011-9896-1
◦ Daren Yu, Xin Huang, Huaning Wang, & Yanmei Cui. (2009). Short-Term Solar Flare Prediction Using a Sequential Supervised Learning Method. Solar Physics, 255(1), 91–105. https://doi.org/10.1007/s11207-009-9318-9
◦ Cinto, T., Gradvohl, A. L. S., Coelho, G. P., & da Silva, A. E. A. (2020). Solar Flare Forecasting Using Time Series and Extreme Gradient Boosting Ensembles. Solar Physics, 295(7). https://doi.org/10.1007/s11207-020-01661-9
Integrated Development Environment (IDE) of choice
◦ An IDE is software for coding applications that combines multiple developer tools into a
single, convenient graphical user interface (GUI).

◦ The IDE of choice for this project is PyCharm, a Python-dedicated IDE providing a myriad
of quality-of-life features for software developers.
Libraries
◦ Python libraries are sets of prewritten methods that allow software engineers to avoid
reinventing the wheel (coding something from scratch that is already available).

◦ In this project I will be using multiple Python libraries, such as …

◦ Matplotlib: a Python plotting library built on the numerical extension NumPy. It provides
an object-oriented API for creating plots.

◦ Seaborn (imported as sns): a Python visualization library. It is like Matplotlib but has
more attractive graphs and more sophisticated functions.

◦ Pandas (imported as pd): a Python software library for data manipulation and analysis. It
offers data structures and operations for manipulating numerical tables.

◦ Scikit-learn (imported as sklearn): a machine learning Python library. It features various
classification, regression, and clustering algorithms.
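As a quick reference, a minimal sketch of the import aliases used above (the specific tree import is an assumption about which scikit-learn module this project needs):

```python
# Common import aliases for the libraries described above.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier  # assumed entry point for this project
```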
Visualizing the Data
◦ In data science, visualizing data is important, as it can reveal otherwise unknown
insights. This is especially prominent in larger sets of data, so we do it before starting
to train our Machine Learning (ML) model.

◦ In this project I will be using the Python library Matplotlib, a simple yet powerful set
of functions for data visualization. I will also be using Seaborn, a more sophisticated
data visualization tool.

◦ I will be using the specific functions…


◦ sns.boxplot()
◦ functions from the plt module, including:
◦ plt.title()
◦ plt.xlabel()
◦ plt.ylabel()
◦ plt.subplots()
◦ etc.
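A minimal sketch of how these functions fit together (the file name is hypothetical; "label" as the 0/1 output column and TOTPOT as a SWAN-SF feature are assumptions):

```python
# Hedged sketch: one boxplot of a feature split by flare label.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("p1.csv")  # hypothetical file name for partition 1
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=df, x="label", y="TOTPOT", ax=ax)
plt.title("TOTPOT by output class")
plt.xlabel("output (0 = nonflare, 1 = flare)")
plt.ylabel("TOTPOT")
plt.show()
```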
Graphical Analysis – nonflare/flare distribution
When creating the graph of the distribution of nonflare vs. flare outputs, we can see that
the data is heavily skewed toward the nonflare outputs rather than the flared ones. This
causes a class imbalance, which is a big challenge in data science. I will use
undersampling to combat this problem.
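A minimal sketch of the distribution plot described here (file and column names are assumptions):

```python
# Hedged sketch: bar chart of nonflare (0) vs. flare (1) counts.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("p1.csv")  # hypothetical file name
sns.countplot(x="label", data=df)
plt.title("Nonflare vs. flare distribution")
plt.show()
```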
Graphical Analysis – feature distribution and ranges (1/2)
Graphical Analysis – feature distribution and ranges (2/2)
◦ Here I have created 8 different boxplots to represent the ranges of the subclasses of the
features.

◦ Upon further inspection of these graphs, it is obvious that the features are not
distributed over a single range. This causes problems when the machine learning model is
trying to identify patterns, due to one feature suppressing the others. To address this, we
have to scale our dataset through normalization.

◦ An interesting note is that the TOTPOT features have unusually large ranges compared to
the other features.
Normalization
◦ Normalization is a technique in statistics that allows data to be rescaled so that all
values end up in a range of 0 to 1. The formula is shown below…
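x_scaled = (x - x_min) / (x_max - x_min)

A minimal sketch of this min-max scaling using scikit-learn (the file name and the dropped-column convention are assumptions carried over from the training steps):

```python
# Hedged sketch: rescale every feature column into [0, 1].
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("p1.csv")    # hypothetical file name
features = df.iloc[:, :-2]    # assumes the last two columns are non-features
scaled = pd.DataFrame(MinMaxScaler().fit_transform(features),
                      columns=features.columns)
```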
Training the Machine Learning Model
◦ Now that we have a better understanding of our data, we can start training our machine learning model. To
do this we have to… (a code sketch of this loop follows the list)
◦ 1. Import the training and testing datasets; these will be p1 and p2 respectively.
◦ 2. Clean and normalize the data. This will consist of removing the last two columns of the data and
normalizing the rest. I will be using different normalizations for the two classifiers.
◦ 3. Undersample the data 11 different times – 10 times for p1 and once for p2.
◦ 4. Split the 10 undersampled data frames into the independent variables and the dependent variable. Do
this with the undersampled p2 as well.
◦ 5. Create and train 10 different decision trees, one for each undersampled p1 data frame.
◦ 6. Generate predicted results via the independent variables of the undersampled p2 dataset.
◦ 7. Generate 10 different F1-scores based on the dependent variable of the undersampled p2 dataset and
the predicted results.
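A hedged sketch of steps 3–7 (the undersample helper, file names, and the "label" column are assumptions about how the data is laid out):

```python
# Hedged sketch of the undersample/train/score loop described above.
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

def undersample(df, label_col="label", seed=None):
    """Sample the majority (nonflare) class down to the minority class size."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0].sample(n=len(pos), random_state=seed)
    return pd.concat([pos, neg])

p1 = pd.read_csv("p1_normalized.csv")  # hypothetical file names
p2 = pd.read_csv("p2_normalized.csv")

p2_bal = undersample(p2, seed=0)       # the single p2 undersample (step 3)
X_test, y_test = p2_bal.drop(columns="label"), p2_bal["label"]

scores = []
for seed in range(10):                 # steps 3-7 for the 10 p1 undersamples
    train = undersample(p1, seed=seed)
    X_train, y_train = train.drop(columns="label"), train["label"]
    model = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    scores.append(f1_score(y_test, model.predict(X_test)))

print(sum(scores) / len(scores))       # mean F1-score across the 10 models
```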

Some things that are important for this are covered on the following slides.

[Flowchart: classifier]
About : Decision Tree (DT)
◦ A decision tree is a flowchart-like tree structure where an internal node represents a feature (or
attribute), a branch represents a decision rule, and each leaf node represents an outcome. The topmost
node in a decision tree is known as the root node. The tree learns to partition the data on the basis of
attribute values, in a recursive manner called recursive partitioning. This flowchart-like structure
helps in decision making; its visualization, like a flowchart diagram, easily mimics human-level
thinking. That is why decision trees are easy to understand and interpret.
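To make the flowchart analogy concrete, a small sketch that prints a fitted tree's decision rules (toy Iris data for illustration, not the SWAN-SF dataset):

```python
# Hedged sketch: print a shallow tree's decision rules as text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)
print(export_text(model, feature_names=iris.feature_names))
```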
About: Confusion Matrix
◦ A confusion matrix is a 2x2 grid that contains the true positive (tp), false positive
(fp), true negative (tn), and false negative (fn) counts. These values correspond to the
correctly and incorrectly classified positive and negative outputs. Not only is this a
useful tool for calculating accuracy, but it also allows data scientists to see where the
ML model gets confused. This is especially useful in supervised learning frameworks.
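A minimal sketch of extracting these four counts with scikit-learn (toy labels for illustration):

```python
# Hedged sketch: unpack tn/fp/fn/tp from a 2x2 confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]  # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]  # toy model predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)        # 2 1 2 1
```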
About: Accuracy, Precision, Recall, and F1-score
◦ Accuracy – The number of samples correctly classified out of all the samples present in
the test set. ((tp+tn)/(tp+fp+tn+fn))

◦ Precision – The number of samples actually belonging to the positive class out of all the
samples that were predicted to be of the positive class by the model. (tp/(tp+fp))

◦ Recall – The number of samples correctly predicted as belonging to the positive class out
of all the samples that actually belong to the positive class. (tp/(tp+fn))

◦ F1-Score – The harmonic mean of the precision and recall scores obtained from the positive
class. ((2*precision*recall)/(precision+recall)) ~a measure of a model's accuracy on a
dataset~ This will also be the metric I will be using for comparison.
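Continuing the toy confusion-matrix counts above, a quick worked check of all four formulas:

```python
# Hedged sketch: the four metrics from the toy counts tp=2, fp=1, tn=2, fn=1.
tp, fp, tn, fn = 2, 1, 2, 1
accuracy = (tp + tn) / (tp + fp + tn + fn)            # 4/6 ~= 0.667
precision = tp / (tp + fp)                            # 2/3 ~= 0.667
recall = tp / (tp + fn)                               # 2/3 ~= 0.667
f1 = (2 * precision * recall) / (precision + recall)  # ~= 0.667
print(accuracy, precision, recall, f1)
```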
Classifier #1 (updated)
◦ In this instance, the ML model was trained using the Decision Tree classifier and a 0-1 normalized
dataset. The ML model was trained on p1 of the dataset and was tested on p2. I undersampled the
data in 10 different instances and trained 10 different machine learning models based on these
instances. I did this by separating the data frame horizontally from where the positive values of
the dependent variable begin. Then I undersampled the data by the fraction that can be described
as (positive values/negative values). Finally, I joined the two data frames via the concat
function. This left me with 10 F1-scores, and I then took the average of these scores. Default
parameters of the sklearn.tree.DecisionTreeClassifier function were used.

◦ Decision Tree
◦ Depth is 16
◦ Number of leaves is 81

◦ F1-score (mean) ~ 69.7% (standard deviation ≈ 0.026)

◦ Analysis – This classifier had average performance.
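As an aside, the depth and leaf count reported above can be read off any fitted scikit-learn tree; a minimal sketch on toy data (not the SWAN-SF values):

```python
# Hedged sketch: inspect a fitted tree's depth and number of leaves.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy stand-in data
model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.get_depth(), model.get_n_leaves())
```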


Classifier # 2
◦ This classifier was trained on p1 and tested on p2. I scaled the dataset via log
normalization. To do this, I took the absolute value of each value within the data frame;
to make every log argument at least one, I used log(1 + x) within NumPy. I undersampled the
scaled dataset 10 different times and trained 10 different Decision Tree classifiers. From
these classifiers I generated the F1-scores and took the mean. No parameters of
sklearn.tree.DecisionTreeClassifier were modified.
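A minimal sketch of this log scaling (the file name and dropped-column convention are assumptions):

```python
# Hedged sketch: log-normalize via log(1 + |x|) using NumPy's log1p.
import numpy as np
import pandas as pd

df = pd.read_csv("p1.csv")             # hypothetical file name
features = df.iloc[:, :-2]             # assumes the last two columns are non-features
log_scaled = np.log1p(features.abs())  # every value >= 0 after this
```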

◦ F1-score ~ 83.327% (standard deviation ≈ 0.044)

◦ Decision Tree hyperparameters
◦ Depth is 22
◦ Number of leaves is 104

◦ Analysis ~ This classifier did better than the first one.


Deliverables
◦ In this project I have created two models to predict solar flares.

◦ A boxplot comparing their performance is on the left.
Coding Skills that I have learned
Encapsulation – One of the 4 pillars of Object-Oriented Programming (OOP). It allows
developers to "encapsulate" variables and methods in a "class", which allows for cleaner
code.

Version Control (Git) – One of the things that I have learned that I am most excited about
is version control. It is very helpful for collaborating developers, as it allows them to
code alongside each other as well as save previous versions of code that can be rolled
back to.
◦ GitHub Project Link: https://github.com/moxifo/flare_forcasting_HHS_sciencefair_2022
Conclusion
◦ The Machine Learning model had the desired performance, being within the desirable F1-score
range of 70.0–90.0%. The first iteration of the model was trained with 0-1 normalized data and
had a mean F1-score of 0.697. The current iteration was trained with log normalization. This
yielded a mean F1-score of 0.837, a noticeable increase of 20.09%. This suggests that log
normalization is better for boosting model performance. In the future, modifying the
parameters within the classifier may help the model more clearly comprehend the nuances in
the data. Branching out from a Decision Tree classifier to other classification models could
also provide more insight into the nature of the SWAN-SF dataset, such as how different
models process the data. A Support Vector Machine could be a promising choice for improving
the F1-score. One of my major weaknesses is that my scope of analysis was not broad enough to
make a conclusive statement on the exact methodology for the most optimal machine learning
model. If a researcher were to continue this further, a combination of different data scaling
techniques and classifiers could be tested to make a concrete conclusion.
Goals for the Future
◦ Become more familiar with coding concepts such as OOP.

◦ Start learning more languages, as well as frontend and backend development.

◦ Make my first contribution to an open-source project.

◦ Become a seasoned developer.

◦ Start a Programming club at my school.

◦ Spread my love for programming to others.

◦ Get accepted into the Georgia Governor's Honors Program for coding.

◦ Land a coding internship at Microsoft.

◦ Get accepted into a T20 school.

◦ Make a difference in the world.


Acknowledgments
◦ Dr. Azim Ahmadzadeh – Georgia State University

◦ Nancy Curran – Harrison High School
