Professional Documents
Culture Documents
Copy of Project
Copy of Project
Stellar Classification
Imports
In [1]:
import numpy as np
import pandas as pd
import graphviz
In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Deelane/CST383---Final/main/star_cl
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 1/18
2/25/22, 7:35 PM Copy_of_Project
The dataset has no null entries, but we will still need to check for outliers
In [4]:
df.describe()
Note the -9999 values as min for columns u, g, and z. Is this missing/bad data?
In [5]:
print(df['class'].value_counts())
sns.countplot('class', data=df);
df.reset_index(drop=True, inplace=True)
GALAXY 59445
STAR 21594
QSO 18961
C:\Users\Jordan\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pa
ss the following variable as a keyword arg: x. From version 0.12, the only valid positio
nal argument will be `data`, and passing other arguments without an explicit keyword wil
l result in an error or misinterpretation.
warnings.warn(
As you can see, this is an unbalanced dataset. There are significantly more 'GALAXY' entries than
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 2/18
2/25/22, 7:35 PM Copy_of_Project
In [6]:
corr = df.corr()
Remaining columns
Columns ['u', 'g', 'z'] and ['r', 'i'] correlate to the other columns in almost excactly same way. 'r' and 'i'
are different enough with regards to redshift (our biggest predictor) so we will keep them.
In [8]:
sns.heatmap(df.corr());
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 3/18
2/25/22, 7:35 PM Copy_of_Project
Final columns
In [9]:
df.drop(columns=['u', 'z'], inplace=True)
sns.heatmap(df.corr());
Pairplot
'redshift', 'g', and 'redshift' vs 'g' show to be very good at classifying, with each class clearly having
its own value range for each.
In [10]:
sns.pairplot(df, hue='class')
<seaborn.axisgrid.PairGrid at 0x211eb05a6a0>
Out[10]:
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 4/18
2/25/22, 7:35 PM Copy_of_Project
enc_df = pd.DataFrame(enc.fit_transform(df[['class']]).toarray())
In [12]:
targets = ['GALAXY', 'QSO', 'STAR']
# run 1
# run 2
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 5/18
2/25/22, 7:35 PM Copy_of_Project
# run 3
# run 4
# run 5
Getting Accuracy
In [13]:
for p in predictors:
# X = df[[p]].values
# run 2
# X = df[[p, 'redshift']].values
# run 3
# run 4
# run 5
y = df[targets].values
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn.fit(X_train, y_train)
# get predictions
predictions = knn.predict(X_test)
predictions_enc = pd.DataFrame(predictions).idxmax(axis=1)
y_test_enc = pd.DataFrame(y_test).idxmax(axis=1)
alpha 0.9565
delta 0.9546
plate 0.9620666666666666
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 6/18
2/25/22, 7:35 PM Copy_of_Project
X = df[predictors].values
y = df[targets].values
Choosing a model
KNN Exploration
In [16]:
#WARNING, COSTLY COMPUTATIONS
#use gridsearch on cross validation to find the best n and p for knn
n_sqrt = round(np.sqrt(X_train.shape[0]))
knnCV.fit(X_train, y_train)
print(knnCV.best_params_)
{'n_neighbors': 3, 'p': 1}
Learning Curves
In [17]:
from sklearn.model_selection import learning_curve
#cell height
for i in range(len(n_neighbors_list)):
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 7/18
2/25/22, 7:35 PM Copy_of_Project
axs[i].legend()
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 8/18
2/25/22, 7:35 PM Copy_of_Project
X = df[predictors].values
y = df['class'].values
# The other solvers are giving us Convergence warnings. This is another reason we stuck
grid = [{'solver':['liblinear', 'sag', 'saga']}]
clf_cv.fit(X_train, y_train)
print(clf_cv.best_params_)
C:\Users\Jordan\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:328: Convergenc
eWarning: The max_iter was reached which means the coef_ did not converge
C:\Users\Jordan\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:328: Convergenc
eWarning: The max_iter was reached which means the coef_ did not converge
C:\Users\Jordan\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:328: Convergenc
eWarning: The max_iter was reached which means the coef_ did not converge
C:\Users\Jordan\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:328: Convergenc
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 9/18
2/25/22, 7:35 PM Copy_of_Project
eWarning: The max_iter was reached which means the coef_ did not converge
C:\Users\Jordan\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:328: Convergenc
eWarning: The max_iter was reached which means the coef_ did not converge
C:\Users\Jordan\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:328: Convergenc
eWarning: The max_iter was reached which means the coef_ did not converge
C:\Users\Jordan\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:328: Convergenc
eWarning: The max_iter was reached which means the coef_ did not converge
C:\Users\Jordan\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:328: Convergenc
eWarning: The max_iter was reached which means the coef_ did not converge
C:\Users\Jordan\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:328: Convergenc
eWarning: The max_iter was reached which means the coef_ did not converge
C:\Users\Jordan\anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:328: Convergenc
eWarning: The max_iter was reached which means the coef_ did not converge
{'solver': 'liblinear'}
Drop labels
In [20]:
df.drop(columns=['class_cat'], inplace=True)
params = {'max_depth':np.arange(2,11)}
clf_cv.fit(X_train, y_train)
# print results
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 10/18
2/25/22, 7:35 PM Copy_of_Project
X = df[predictors].values
y = df[targets].values
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
kNN Classification
In [24]:
knn = KNeighborsClassifier(weights='distance', metric='manhattan')
knn.fit(X_train, y_train)
# get predictions
predictions = knn.predict(X_test)
predictions_enc = pd.DataFrame(predictions).idxmax(axis=1)
y_test_enc = pd.DataFrame(y_test).idxmax(axis=1)
# print accuracy
print(classification_report(y_test_enc, predictions_enc))
cm = confusion_matrix(y_test_enc, predictions_enc)
print(cm)
display = ConfusionMatrixDisplay(cm).plot()
Accuracy: 0.95145
[ 225 1 4060]]
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 11/18
2/25/22, 7:35 PM Copy_of_Project
clf.fit(X_train, y_train)
# get predictions
predictions = clf.predict(X_test)
predictions_enc = pd.DataFrame(predictions).idxmax(axis=1)
y_test_enc = pd.DataFrame(y_test).idxmax(axis=1)
# print accuracy
print(classification_report(y_test_enc, predictions_enc))
cm = confusion_matrix(y_test_enc, predictions_enc)
print(cm)
display = ConfusionMatrixDisplay(cm).plot()
Accuracy: 0.9652
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 12/18
2/25/22, 7:35 PM Copy_of_Project
[ 483 3327 0]
[ 6 0 4280]]
In [26]:
target_names = ['GALAXY', 'QSO', 'STAR']
dot_data = export_graphviz(clf, precision=2, feature_names=predictors, proportion=True,
graph = graphviz.Source(dot_data)
graph
Out[26]:
value = [[0.41, 0.59]
[0.81, 0.19]
[0.78, 0.22]]
redshift <= -0.8 redshift <= -0.79 redshift <= 0.18 g <= 0.84
gini = 0.0 gini = 0.07 gini = 0.08 gini = 0.07
samples = 13.7% samples = 8.4% samples = 61.9% samples = 15.9%
value = [[1.0, 0.0] value = [[0.94, 0.06] value = [[0.06, 0.94] value = [[0.95, 0.05]
[1.0, 0.0] [1.0, 0.0] [0.94, 0.06] [0.05, 0.95]
[0.0, 1.0]] [0.06, 0.94]] [1.0, 0.0]] [1.0, 0.0]]
redshift <= -0.79 i <= -1.78 redshift <= -0.79 plate <= 0.7 g <= 0.56 g <= 0.7 redshift <= 0.88
gini = 0.0
gini = 0.0 gini = 0.08 gini = 0.01 gini = 0.05 gini = 0.23 gini = 0.02 gini = 0.32
samples = 0.0%
samples = 13.7% samples = 0.4% samples = 8.0% samples = 52.9% samples = 9.0% samples = 14.5% samples = 1.4%
value = [[0.0, 1.0]
value = [[1.0, 0.0] value = [[0.06, 0.94] value = [[0.99, 0.01] value = [[0.04, 0.96] value = [[0.22, 0.78] value = [[0.98, 0.02] value = [[0.59, 0.41]
[1.0, 0.0]
[1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [0.96, 0.04] [0.78, 0.22] [0.02, 0.98] [0.41, 0.59]
[1.0, 0.0]]
[0.0, 1.0]] [0.94, 0.06]] [0.01, 0.99]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]]
g <= 2.33 g <= -0.29 redshift <= -0.79 plate <= -1.13 redshift <= -0.79 g <= -1.58 i <= 0.66 g <= 0.29 g <= 0.28 g <= 0.78 g <= 0.5 redshift <= 0.88 plate <= 1.37 redshift <= 1.49
gini = 0.1 gini = 0.0 gini = 0.29 gini = 0.06 gini = 0.01 gini = 0.26 gini = 0.03 gini = 0.15 gini = 0.23 gini = 0.07 gini = 0.02 gini = 0.14 gini = 0.26 gini = 0.19
samples = 0.1% samples = 13.7% samples = 0.0% samples = 0.4% samples = 7.9% samples = 0.1% samples = 47.3% samples = 5.6% samples = 2.0% samples = 7.0% samples = 13.6% samples = 0.9% samples = 0.6% samples = 0.8%
value = [[0.92, 0.08] value = [[1.0, 0.0] value = [[0.32, 0.68] value = [[0.05, 0.95] value = [[1.0, 0.0] value = [[0.75, 0.25] value = [[0.03, 0.97] value = [[0.13, 0.87] value = [[0.78, 0.22] value = [[0.05, 0.95] value = [[0.99, 0.01] value = [[0.89, 0.11] value = [[0.27, 0.73] value = [[0.83, 0.17]
[1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [0.99, 0.01] [0.97, 0.03] [0.87, 0.13] [0.22, 0.78] [0.95, 0.05] [0.01, 0.99] [0.11, 0.89] [0.73, 0.27] [0.17, 0.83]
[0.08, 0.92]] [0.0, 1.0]] [0.68, 0.32]] [0.95, 0.05]] [0.0, 1.0]] [0.26, 0.74]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]]
gini = 0.08 gini = 0.0 gini = 0.0 gini = 0.01 gini = 0.15 gini = 0.0 gini = 0.21 gini = 0.04 gini = 0.0 gini = 0.06 gini = 0.0 gini = 0.17 gini = 0.03 gini = 0.18 gini = 0.32 gini = 0.09 gini = 0.1 gini = 0.33 gini = 0.22 gini = 0.04 gini = 0.01 gini = 0.05 gini = 0.32 gini = 0.07 gini = 0.33 gini = 0.13 gini = 0.29 gini = 0.04
samples = 0.1% samples = 0.0% samples = 7.9% samples = 5.8% samples = 0.0% samples = 0.0% samples = 0.0% samples = 0.4% samples = 7.6% samples = 0.2% samples = 0.0% samples = 0.1% samples = 46.1% samples = 1.2% samples = 1.1% samples = 4.5% samples = 1.3% samples = 0.7% samples = 0.9% samples = 6.1% samples = 11.0% samples = 2.6% samples = 0.2% samples = 0.7% samples = 0.2% samples = 0.4% samples = 0.4% samples = 0.4%
value = [[0.94, 0.06] value = [[0.0, 1.0] value = [[1.0, 0.0] value = [[1.0, 0.0] value = [[0.13, 0.87] value = [[1.0, 0.0] value = [[0.19, 0.81] value = [[0.03, 0.97] value = [[1.0, 0.0] value = [[0.95, 0.05] value = [[0.0, 1.0] value = [[0.86, 0.14] value = [[0.02, 0.98] value = [[0.16, 0.84] value = [[0.39, 0.61] value = [[0.07, 0.93] value = [[0.91, 0.09] value = [[0.54, 0.46] value = [[0.2, 0.8] value = [[0.03, 0.97] value = [[1.0, 0.0] value = [[0.96, 0.04] value = [[0.6, 0.4] value = [[0.95, 0.05] value = [[0.56, 0.44] value = [[0.1, 0.9] value = [[0.67, 0.33] value = [[0.97, 0.03]
[1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [1.0, 0.0] [0.99, 0.01] [0.98, 0.02] [0.84, 0.16] [0.61, 0.39] [0.93, 0.07] [0.09, 0.91] [0.46, 0.54] [0.8, 0.2] [0.97, 0.03] [0.0, 1.0] [0.04, 0.96] [0.4, 0.6] [0.05, 0.95] [0.44, 0.56] [0.9, 0.1] [0.33, 0.67] [0.03, 0.97]
[0.06, 0.94]] [1.0, 0.0]] [0.0, 1.0]] [0.0, 1.0]] [0.87, 0.13]] [0.0, 1.0]] [0.81, 0.19]] [0.97, 0.03]] [0.0, 1.0]] [0.05, 0.95]] [1.0, 0.0]] [0.15, 0.85]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]] [1.0, 0.0]]
SVC
SVC does not require one-hot encoding, so we will instead use the 'class' column as our target.
In [27]:
# Set target
target = ['class']
# Set y
y = df[target].values.ravel()
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [28]:
svc = SVC()
svc.fit(X_train, y_train)
predictions = svc.predict(X_test)
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 13/18
2/25/22, 7:35 PM Copy_of_Project
# print accuracy
print(classification_report(y_test, predictions))
cm = confusion_matrix(y_test_enc, predictions_enc)
print(cm)
display = ConfusionMatrixDisplay(cm).plot()
Accuracy: 0.9625
[ 483 3327 0]
[ 6 0 4280]]
Logistic Regression
In [29]:
clf = LogisticRegression(solver='liblinear')
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
# print accuracy
print(classification_report(y_test, predictions))
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 14/18
2/25/22, 7:35 PM Copy_of_Project
cm = confusion_matrix(y_test_enc, predictions_enc)
print(cm)
display = ConfusionMatrixDisplay(cm).plot()
Accuracy: 0.93425
[ 483 3327 0]
[ 6 0 4280]]
Chosen Model
Show decision tree code, graph of hits vs misses and where the majority occurred
Accuracy
In [30]:
#Label encoding
df['class_cat'] = df['class'].astype('category').cat.codes
X = df[predictors]
y = df['class_cat']
#TTS
#default
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
train_predict = clf.predict(X_train)
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 15/18
2/25/22, 7:35 PM Copy_of_Project
test_predict = clf.predict(X_test)
train_accuracy = 1 - train_error
test_accuracy = 1 - test_error
Misclassifications
Per the previous section's confusion matrix, it can be seen that the majority of misclassifications are
GALAXY/QSO misclassifications. Furthermore, these misclassifications tend to happen in a specific
redshift and g range. Let's take a look at them
In [31]:
#Data
redshift_range_upper = 1.7
redshift_range_lower = 0.05
g_range_upper = 23.5
g_range_lower = 18.5
#Misclassifications
shared_indices = []
if (index in df_small.index):
shared_indices.append(index)
misclassified_in_df_small = df_small.loc[shared_indices]
misclassified_in_df_small['misclassified'] = misclassified_in_df_small['class'].apply(l
sns.scatterplot(x='redshift', y='g', hue='misclassified', data=misclassified_in_df_smal
df.drop(columns=['class_cat'], inplace=True)
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 16/18
2/25/22, 7:35 PM Copy_of_Project
In [32]:
#2016 SDSS data
df_sdss_2016 = pd.read_csv(r"C:\Users\Jordan\Downloads\archive\sdss.csv")
df_sdss_2016.drop_duplicates(subset=['obj_ID'], inplace=True)
df_og = pd.read_csv("https://raw.githubusercontent.com/Deelane/CST383---Final/main/star_
obj_id_set = set(df_sdss_2016['obj_ID'])
ids_to_drop = []
if obj_id in obj_id_set:
ids_to_drop.append(obj_id)
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 17/18
2/25/22, 7:35 PM Copy_of_Project
df_sdss_2016.reset_index(drop=True, inplace=True)
df_sdss_2016['class_cat'] = df_sdss_2016['class'].astype('category').cat.codes
X = df_sdss_2016[predictors].values
y = df_sdss_2016['class_cat'].values
#Using our already trained model to predict all 732,646 new objects
test_predict = clf.predict(X)
test_accuracy = 1 - test_error
Accuracy: 98.29%
file:///C:/Users/Jordan/Downloads/Copy_of_Project.html 18/18