
Scott Underwood

02/17/2023

MS&E 226 Project Part II – Met with Isha Thapa on 02/21/2023 at 5:00 pm
The first thing I wanted to address in this part of the project was finding a solution for the variables
with abnormal patterns mentioned in Part I, namely the azimuth and the low/medium cloud cover
variables. Looking first at azimuth, I plotted a histogram of azimuth values in the dataset (Fig. 2 of the
Appendix), confirming my observation from Part I that there are no azimuth observations between
approximately 170-180 degrees or 190-195 degrees. My intuition is that some of the observations in these ranges were
rounded to an azimuth of about 180 degrees, as it is a round number for the azimuth angle (it corresponds to the
sun being directly south, which is common in the northern hemisphere). Because I plan to apply a sine
transformation to this variable (more details further on), and sin(180) is close to sin(170) and
sin(190), I left azimuth as is.
Looking at the low and medium cloud cover variables, both show elevated counts at values of 0, 10,
and 100 percent, as shown in the histogram in Fig. 3 of the Appendix. The large counts at 0 and 100 make sense,
as cloud cover of 0 percent or 100 percent is common, but the significantly higher count at 10 percent
in each of these variables is not natural. Similar to the azimuth case, my suspicion is that observations
tended to be rounded to 10 percent as a round number representing slight cloud cover. I plan to use only the
total cloud cover variable in my final model in order to avoid this issue.
To build the models, I opted to apply a log transformation to all the variables in my dataset that
only take on non-negative values; this is every variable except precipitation, temperature, and pressure.
Because many of the variables are zero for many entries, I applied a log(x+1) transformation, so
that all zero values remain zero after transformation (rather than becoming undefined). I then opted to use cross-
validation as the first screening method for potential models, with k = 10 folds to balance
computational cost against thorough validation.
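As a minimal sketch of this preprocessing step (column names mirror those used in the Appendix code, where the untransformed columns are instead selected by index, so treat the names here as illustrative):

# apply log(x + 1) to the non-negative covariates, leaving the precipitation
# indicator, temperature, and pressure on their original scales
library(cvTools)   # provides cvFolds() and cvFit() for cross-validation
no_log <- c("precipitation", "temperature_2_m_above_gnd",
            "mean_sea_level_pressure_MSL")
to_log <- setdiff(colnames(train), no_log)
train.normalized <- cbind(log(train[to_log] + 1), train[no_log])
# 10-fold cross-validation splits, reused for every candidate model
cv.folds <- cvFolds(nrow(train.normalized), K = 10)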
Continuous – generated_power_kw
Looking first at the continuous response variable, generated_power_kw, I created a baseline model
using all covariates other than the binary precipitation variable. The mean power generation in the
training data, 1136 kW, will be used for comparison against the mean of each model's predicted values. The
baseline model resulted in a cross-validated prediction error of 2.37 kW, which will serve as the baseline for
comparison, and a mean predicted power generation of 1105.4 kW.
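A minimal sketch of this baseline, continuing from the preprocessing sketch above (the formula simply drops the binary precipitation column; cvFit comes from the cvTools package, as in the Appendix code):

# baseline linear model on all covariates, scored with 10-fold CV
fm.base <- lm(generated_power_kw ~ . - precipitation, data = train.normalized)
cv.base <- cvFit(fm.base, folds = cv.folds, data = train.normalized,
                 y = train.normalized$generated_power_kw)
# back-transform from the log scale to report the error and mean in kW
print(paste('Baseline prediction error:', exp(cv.base$cv) - 1))
print(paste('Baseline prediction mean:',
            mean(exp(predict(fm.base, train.normalized)) - 1)))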
My goal is to find a model that minimizes the prediction error and results in a mean predicted
power generation as close to the true mean as possible, without overpredicting. Overpredicting is
undesirable because this model would be used for power planning purposes, and power planning must
always be done for a worst-case scenario. If we overpredicted power generation, the power planners
might under-allocate other generation resources, which could result in power outages. Therefore,
underprediction is preferred to overprediction.
I next used the lasso to identify the covariates that would be best to include in the model. Performing
cross-validated lasso with cv.glmnet on all covariates, the optimal lambda value is found to be 0.007,
which gives a prediction mean of 1096 kW and a prediction error of 2.34 kW. However, a model with
this lambda and all covariates results in only two covariates being dropped. This presents a tradeoff:
more covariates appear to produce a better-scoring model, but the added complexity may increase the
variance of the predictions when the model is applied to the test dataset.
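A hedged sketch of this step (it mirrors the lasso code in the Appendix; the columns excluded from the design matrix are selected there by index, so the names below are an assumption):

# cross-validated lasso over all covariates (alpha = 1 selects the lasso penalty)
library(glmnet)
covariate_cols <- setdiff(colnames(train.normalized),
                          c("generated_power_kw", "precipitation"))
x <- data.matrix(train.normalized[, covariate_cols])
y <- train.normalized$generated_power_kw
cv_model <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_model$lambda.min      # about 0.007 on this data
best_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)
coef(best_model)                        # coefficients shrunk to zero are the dropped covariates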
The next model used all of the variables that appeared to be correlated with the generated power,
as identified in Part I of the project. This model gave a prediction error of 2.59 kW and a mean power
generation of 1053 kW, both worse than using all covariates or the lasso. Next, I tried a model that used the
same correlated variables but with sine transformations for the angle variables (azimuth, zenith, angle of
incidence), to reflect their real-world effect on solar power. This improved the
metrics, achieving a prediction error of 2.34 kW and a mean of 1070 kW. I then added total cloud
cover to this model, since intuitively I would expect cloud cover to be correlated with solar power; this
resulted in a prediction error of 2.29 kW and a mean of 1103 kW, approaching the performance of the model
that used all covariates.
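A hedged sketch of the sine-transformed model with total cloud cover added (the covariate list here is illustrative; the final fitted formula, including the interaction terms introduced below, is in the Appendix):

# correlated covariates, with the angle variables entered as sine terms
# (degrees converted to radians), plus total cloud cover
fm.sine <- lm(generated_power_kw ~ I(sin(angle_of_incidence * pi / 180)) +
                I(sin(zenith * pi / 180)) + I(sin(azimuth * pi / 180)) +
                shortwave_radiation_backwards_sfc +
                temperature_2_m_above_gnd + total_precipitation_sfc +
                snowfall_amount_sfc + total_cloud_cover_sfc,
              data = train.normalized)
cv.sine <- cvFit(fm.sine, folds = cv.folds, data = train.normalized,
                 y = train.normalized$generated_power_kw)
exp(cv.sine$cv) - 1   # cross-validated prediction error in kW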
Lastly, keeping the above model with correlated variables, the sine-transformed angle terms, and cloud cover, I
introduced a number of interaction terms that I thought could plausibly influence power generation,
namely between radiation and each of temperature, angle of incidence, azimuth, and zenith. This had an
interesting impact on the metrics: a prediction error of 1.96 kW (significantly lower) and a mean of 1055
kW (significantly worse). I also wanted to try a model using only the covariates that appear in various
equations for calculating solar power generation, drawing on prior experience. This resulted in
a worse prediction error of 2.47 kW but a mean of 1090 kW that is closer to the true value.
I checked the residual plots of all of the models to see how well they fit the data. I found
that none of the residual plots appeared randomly distributed; each showed a clear pattern.
Through trial and error, such as introducing the interaction terms described above, I was able to minimize
the amount of structure in the residual plot, resulting in Fig. 4 in the Appendix. My theory is that this
dataset lacks data about shading of the solar panels, which has a large impact on power generation. After
testing many different models, I settled on one that uses the correlated variables, with the sine terms
and the interactions described above. The code for the model is contained in the Appendix; it
results in a prediction error of 2.2 kW and a prediction mean of 1070 kW.
When using this model on the test data, I would anticipate a prediction error that is slightly higher
than the 2.2 kW listed above, perhaps about 3 kW, as the model is fairly complex and the estimated error
is therefore likely optimistic. Due to the complexity of the model, I’d also expect the variance to be fairly
high, but the model should have fairly low bias.
Classification – precipitation
For the classification model, I opted to create a binary variable 'precipitation', which takes a value
of zero if there was no precipitation and one if there was any precipitation at all. This threshold
was chosen because it is valuable to predict whether it will rain at all, so people can plan
accordingly, such as packing a rain jacket. I chose to build logistic regression models and compare the 0-1
loss and Type II error of each one. The 0-1 loss gives an indication of overall accuracy, while the
Type II error is an important metric because we want to minimize the number of times we forecast no rain
(precipitation = 0) when it actually rains (precipitation = 1). Another tool I will use to assess models is the ROC
curve and its area under the curve (AUC).
The optimal model will minimize both the 0-1 loss and the Type II error, while keeping the Type I
error within a reasonable range. As a baseline, I ran a model using all covariates with a probability
threshold of 0.5, which resulted in a 0-1 loss of 0.06, a Type I error of 0.013, a Type II error of 0.58, and
an AUC of 0.928. The 0-1 loss is in an acceptable range, but the Type II error is far too high, so the
primary goal in model experimentation is to minimize the Type II error while keeping the 0-1 loss and
Type I error in a reasonable range.
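A hedged sketch of this baseline (it assumes the binary response is derived from total_precipitation_sfc, which is then excluded from the covariates; the metric definitions match the Appendix code):

# binary precipitation indicator (1 if there was any precipitation at all)
train.normalized$precipitation <- as.integer(train$total_precipitation_sfc > 0)
# baseline logistic regression on all remaining covariates
fm.all <- glm(precipitation ~ . - total_precipitation_sfc,
              data = train.normalized, family = binomial())
# 0-1 loss and Type I/II error at a probability threshold of 0.5
confusion <- table(fitted(fm.all) > 0.5, train.normalized$precipitation)
loss   <- (confusion[2, 1] + confusion[1, 2]) / nrow(train.normalized)
typeI  <- confusion[2, 1] / (confusion[2, 1] + confusion[1, 1])   # false positive rate
typeII <- confusion[1, 2] / (confusion[1, 2] + confusion[2, 2])   # false negative rate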
The next classification model I tried used all the covariates I had identified in Part I as seemingly
correlated with the precipitation level, with a probability threshold of 0.5. This model gave a 0-1 loss of
0.08, a Type I error of 0.007, a Type II error of 0.92, and an AUC of 0.82. Seeing the large disparity between the
Type I and Type II errors, I changed the probability threshold to 0.1, which resulted in a 0-1 loss of 0.2, a
Type I error of 0.2, and a Type II error of 0.28. Because precipitation cannot occur without the presence of
clouds, I then added the total cloud cover to the above model. The resulting model, with a probability
threshold of 0.5, gave an AUC of 0.91 with a 0-1 loss of 0.06, a Type I error of 0.01, and a Type II error of 0.66.
Once again adjusting the probability threshold to 0.1 resulted in a 0-1 loss of 0.14, a Type I error of 0.14,
and a Type II error of 0.21.
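A minimal sketch of this threshold adjustment (fm.cloud is a hypothetical name for the correlated-covariates-plus-cloud-cover model; the metric definitions match the Appendix code):

# score the same fitted logistic model at different probability thresholds
classify_at <- function(fm, threshold, data = train.normalized) {
  confusion <- table(fitted(fm) > threshold, data$precipitation)
  c(loss   = (confusion[2, 1] + confusion[1, 2]) / nrow(data),        # 0-1 loss
    typeI  = confusion[2, 1] / (confusion[2, 1] + confusion[1, 1]),
    typeII = confusion[1, 2] / (confusion[1, 2] + confusion[2, 2]))
}
classify_at(fm.cloud, 0.5)   # default threshold
classify_at(fm.cloud, 0.1)   # lower threshold trades Type I error for a lower Type II error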
I also wanted to create a model that used my prior knowledge and intuition about weather to
select covariates that I thought would be correlated with precipitation levels. This model gave a 0-1 loss of
0.06, a Type I error of 0.02, a Type II error of 0.59, and an AUC of 0.92. The last model I explored
introduced an interaction term between the relative humidity and the total cloud cover, as humid days
with high cloud cover seem most likely to have precipitation. I opted to introduce this term in the intuitive
model, as that had the best balance between errors and AUC thus far. Adding the interaction term to this
model gave a 0-1 loss of 0.06, a Type I error of 0.02, a Type II error of 0.58, and an AUC of 0.92 with a
probability threshold of 0.5.
I selected this last model, which includes all covariates that intuitively seemed to influence
precipitation levels along with an interaction term between cloud cover and humidity. It was
selected due to its high AUC, low 0-1 loss, and relatively low Type I and Type II errors. The ROC curve for
this model is shown below in Fig. 1. To find an optimal balance between Type I and Type II error, I
tested a number of different probability thresholds, aiming for a value that accurately identifies true
positives while maintaining overall accuracy. This corresponds to the red star
in Fig. 1, which maximizes the sensitivity without lowering the specificity too much. Through this
testing, I settled on a threshold of 0.1, which provides a 0-1 loss of 0.14 with a
Type I error of 0.13 and a Type II error of 0.18. The code used to generate and assess the selected model
is contained in the Appendix.

Figure 1: ROC curve for logistic regression classification model.
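As a hedged sketch of how this threshold comparison can be read off the ROC curve (it assumes fm is the fitted logistic model from the Appendix code; roc() and coords() come from the pROC package):

# sensitivity/specificity at candidate probability thresholds
library(pROC)
my_roc <- roc(train.normalized$precipitation, fitted(fm))
coords(my_roc, x = c(0.5, 0.2, 0.1), input = "threshold",
       ret = c("threshold", "sensitivity", "specificity"))
plot(my_roc)   # the chosen threshold corresponds to the red star in Fig. 1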

When using this classification model on the test dataset, I would expect the 0-1 loss to be slightly
higher than the 0.14 observed on the training dataset. This is because the selected model is
fairly complex, with many covariates and an interaction term, so the training error estimate is likely
optimistic. However, I hope and expect that the Type II error for the test dataset will remain around 0.18,
as the probability threshold was tuned to strike a balance between Type I and Type II error, independent of
the 0-1 loss.
Appendix

Figure 2: Histogram of azimuth values.

Figure 3: Histogram of low_cloud_cover_low_cld_lay values.

Figure 4: Residual plot of chosen linear regression model.


Continuous Code:

# use a log(x + 1) transformation for covariates that only take on non-negative values
cols <- colnames(train)
to_normalize <- cols[-c(1, 3, 22)]   # columns 1, 3, 22 (precipitation, temperature, pressure) stay untransformed
no_normalize <- cols[c(1, 3, 22)]
train.normalized <- cbind(log(train[to_normalize] + 1), train[no_normalize])

# calculate mean power generation to compare against model predictions
power.mean <- mean(train$generated_power_kw)
msg <- paste('Mean power generation:', power.mean)
print(msg)

# run lasso regression with all covariates
library(glmnet)
train.normalized.x <- train.normalized[-c(19, 22)]   # drop the response and the binary precipitation column
cols <- colnames(train.normalized.x)
y <- train.normalized$generated_power_kw
x <- data.matrix(train.normalized.x[, cols])
cv_model <- cv.glmnet(x, y, alpha = 1)               # cross-validated lasso
best_lambda <- cv_model$lambda.min
best_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)
pred <- predict(best_model, s = best_lambda, newx = data.matrix(train.normalized[, cols]))
msg <- paste('Prediction mean for lasso:', mean(exp(pred) - 1))
print(msg)

# build best linear regression model
fm <- lm(generated_power_kw ~ I(sin(angle_of_incidence*pi/180)) +
I(sin(zenith*pi/180)) + shortwave_radiation_backwards_sfc +
I(sin(azimuth*pi/180)) +
total_precipitation_sfc + snowfall_amount_sfc +
total_cloud_cover_sfc +
temperature_2_m_above_gnd +
shortwave_radiation_backwards_sfc:total_cloud_cover_sfc +
shortwave_radiation_backwards_sfc:angle_of_incidence +
shortwave_radiation_backwards_sfc:zenith +
shortwave_radiation_backwards_sfc:azimuth, data=train.normalized)

# plot residuals of the fitted model (Fig. 4 in the Appendix)
library(ggplot2)
qplot(fitted(fm), residuals(fm), alpha = I(0.1)) + xlim(2.5, 8.5)

# create cv folds to use
library(cvTools)
cv.folds <- cvFolds(nrow(train.normalized), 10)
cv.out <- cvFit(fm, folds = cv.folds, data = train.normalized,
                y = train.normalized$generated_power_kw)

# get and print errors (back-transformed from the log scale)
CV.err <- exp(cv.out$cv) - 1
msg <- paste('Prediction error:', CV.err)
print(msg)
pred <- predict(fm, train.normalized)
msg <- paste('Prediction mean:', mean(exp(pred) - 1))
print(msg)

Classification Code:

# model with intuition-selected covariates and an interaction term between
# relative humidity and total cloud cover
fm <- glm(precipitation ~
            relative_humidity_2_m_above_gnd:total_cloud_cover_sfc +
            generated_power_kw + high_cloud_cover_high_cld_lay +
            low_cloud_cover_low_cld_lay + medium_cloud_cover_mid_cld_lay +
            total_cloud_cover_sfc + mean_sea_level_pressure_MSL +
            relative_humidity_2_m_above_gnd +
            shortwave_radiation_backwards_sfc +
            temperature_2_m_above_gnd + wind_direction_10_m_above_gnd +
            wind_direction_80_m_above_gnd + wind_direction_900_mb +
            wind_gust_10_m_above_gnd +
            wind_speed_10_m_above_gnd + wind_speed_80_m_above_gnd +
            wind_speed_900_mb, data = train.normalized, family = binomial())
fitted_model <- fitted(fm)
fitted_model <- fitted(fm)

# calculate and print 0-1 error, Type I/II error, and TPR/TNR at a 0.1 threshold
confusion <- table(fitted_model > 0.1, train.normalized$precipitation)
loss <- (confusion[2,1] + confusion[1,2]) / length(train.normalized$precipitation)
FNR <- confusion[1,2] / (confusion[1,2] + confusion[2,2])   # Type II error
FPR <- confusion[2,1] / (confusion[2,1] + confusion[1,1])   # Type I error
TPR <- confusion[2,2] / (confusion[1,2] + confusion[2,2])   # sensitivity
TNR <- confusion[1,1] / (confusion[1,1] + confusion[2,1])   # specificity
msg <- paste('0-1 loss:', loss, 'Type I error:', FPR, 'Type II error:', FNR)
print(msg)

msg <- paste('Sensitivity:', TPR, 'Specificity:', TNR)
print(msg)

# calculate ROC curve and AUC
library(pROC)
roc_data <- data.frame(fit = fitted_model, obs = train.normalized$precipitation)
my_roc <- roc(roc_data$obs ~ roc_data$fit, plot = FALSE)

cat("AUC = ", toString(auc(my_roc)))

options(repr.plot.width = 4, repr.plot.height = 4)
plot(my_roc)
