Compulsory Activity 2
This assignment focuses on the creation and analysis of two models in R, with the purpose of identifying whether or not a specific speech is of US origin. The data used, ‘Ungd.Rdata’, consists of several speeches from the UN General Debate, held at different points in time. The dataframe includes seven variables, of which ‘row_id’ (numeric speech id), ‘text’ (content of the speech) and ‘is_us’ (binary variable where 1 indicates US origin and 0 indicates non-US origin) are used to create the ridge- and lasso-regularized regression models.
set.seed(100000)
uid <- unique(ungd_sample$row_id)
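The train/test split itself is not shown in the excerpt above; a minimal sketch of an id-based split, assuming an 80/20 ratio (which would leave 89 test speeches, consistent with the confusion matrices later on — `ungd_demo` is a stand-in for the real `ungd_sample` dataframe):

```r
# Sketch of an id-based train/test split (the 80/20 ratio is an assumption;
# ungd_demo stands in for the real ungd_sample dataframe).
set.seed(100000)
ungd_demo <- data.frame(row_id = 1:441, is_us = rbinom(441, 1, 0.1))
uid <- unique(ungd_demo$row_id)
train_ids <- sample(uid, floor(0.8 * length(uid)))        # 352 ids for training
train <- ungd_demo[ungd_demo$row_id %in% train_ids, ]
test  <- ungd_demo[!ungd_demo$row_id %in% train_ids, ]    # 89 speeches held out
```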
Tobias Dons 6/11-2020
ridge$lambda.min
## [1] 2.696501
ridge$lambda.1se
## [1] 213.6931
lasso$lambda.min
## [1] 0.005675876
lasso$lambda.1se
## [1] 0.04823095
The creation of the two models can be seen above. The key difference between the two models relates to the previously mentioned bias-variance trade-off. The ridge model penalises the size of the parameter values through the objective: sum of squared residuals + λ · slope², which reduces the complexity and thereby the variance. The ‘most useful’ lambda (ridge$lambda.1se = 27.0844) is given by the most regularized model such that the error is within one standard error of the minimum.
Where the ridge model penalises by shrinkage, the lasso model erases certain parameters’ influence entirely by setting their coefficients to 0 (dictated by lambda) when they contribute little to the model. This is because the lasso penalty uses the absolute value of the slope rather than its square. This means that in a model where only a few parameters are significant, the ‘lasso’ is generally more applicable, whereas when the model consists of several equally good (or more similar) predictors, the ‘ridge’ should be superior.
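The different behaviour of the two penalties can be illustrated in the simplest possible setting: for a single standardized predictor, the ridge solution rescales the least-squares slope while the lasso soft-thresholds it. A minimal sketch (the slope 0.3 and λ = 0.5 are made-up numbers for illustration):

```r
# Single-predictor illustration: minimise (1/2)(b - ols)^2 + penalty(b).
# Ridge penalty (lambda/2)*b^2 rescales; lasso penalty lambda*|b| thresholds.
ols <- 0.3; lambda <- 0.5
ridge_coef <- ols / (1 + lambda)                     # shrunk towards 0, never exactly 0
lasso_coef <- sign(ols) * max(abs(ols) - lambda, 0)  # soft threshold: exactly 0 here
c(ridge = ridge_coef, lasso = lasso_coef)            # 0.2 and 0
```

Because |ols| < λ in this example, the lasso coefficient is exactly zero while the ridge coefficient is merely shrunk, which is the mechanism behind the sparsity discussed above.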
The previously mentioned minimum token frequency (2) might impact the conclusion on whether to use the ridge or the lasso, as a higher frequency requirement could lead to a higher percentage of significant terms, tilting the conclusion in favour of the ‘ridge’.
[Figure: cross-validation error curves for the ridge (left) and lasso (right) models]
Depicted above are the two plotted models. As shown, there is a positive correlation between the value of lambda (the penalty) and the error, suggesting that with little regularization the training data would be susceptible to overfitting. The vertical lines indicate the lambda value with the minimum error and the largest lambda whose error is within one standard error of that minimum. To reach a conclusion on which model fits the data best, further comparison is needed.
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.944
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.955
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 1
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.833
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.375
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.625
## [1] 0.5432456
## [1] 0.9699478
## [1] 0.7232887
## [1] 0.9823456
As seen above, the lasso scores better when comparing accuracy, recall, and F-score, whereas the ridge model scores better when comparing precision.
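The F-scores follow directly from the precision and recall reported above; a quick sanity check using the harmonic-mean definition F = 2PR/(P + R):

```r
# F-score as the harmonic mean of precision and recall
f1 <- function(p, r) 2 * p * r / (p + r)
f1(1, 0.375)      # ridge: ~0.545
f1(0.833, 0.625)  # lasso: ~0.714
```

Note how the ridge's perfect precision cannot compensate for its low recall: the harmonic mean is dominated by the smaller of the two values.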
## [1] 56
## [1] 46574
sum(beta == 0)
## [1] 46544
## coef word
## 1 1.1618740 peac_endur
## 2 1.0319813 ask_congress
## 3 0.8189179 control_agenda
## 4 0.6292725 liberti
## 5 0.5605610 today_unit
## coef word
## 1 0 privileg
## 2 0 extend
## 3 0 warm
## 4 0 congratul
## 5 0 unit
As we concluded (although without thorough argumentation due to space constraints), the lasso was the better model, and we will therefore solely calculate the best predictive terms using the lasso. Usage of the ridge can be found in the appendix. Listed above are the 5 top predictors of US origin; the second list shows terms whose coefficients the lasso has set to exactly 0 (only 30 of the 46,574 terms retain a nonzero coefficient). These predictors seem reasonable, and it is believable that unigrams and bigrams such as “peac_endur” and “liberti” are characteristic of American speeches. Using this model we cannot definitively point out a ‘worst’ predictor; rather, several predictors are less significant than our chosen level of lambda allows, and so have their coefficients set to exactly 0. This, as mentioned, would not have been the case had we used the ridge model.
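The ranking above can be reproduced from any named coefficient vector; a minimal sketch with a stand-in vector (in the assignment the real `beta` would come from `coef(lasso, s = "lambda.1se")`):

```r
# Stand-in coefficient vector; in the assignment these come from the lasso fit.
beta <- c(peac_endur = 1.16, ask_congress = 1.03, unit = 0,
          liberti = 0.63, privileg = 0, grievanc = 0.36)
top <- sort(beta[beta != 0], decreasing = TRUE)   # drop zeroed-out terms
data.frame(coef = top, word = names(top))         # best predictors first
```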
To briefly comment on decisions made throughout the assignment, several minor changes could
have changed the outcome, such as ‘minimum word frequency’, ngrams amount, and even the
seed, even the conclusion on the best model might be up for debate, but commenting on these
subjects would require a bit more space and maybe time as well.
Comp2.r
Tobias Dons
2020-11-05
#Compulsory activity 2
library(tidyverse)
## ── Attaching packages ──────────────────────────── tidyverse 1.3.0 ──
## ── Conflicts ─────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(glmnet)
##
## Attaching package: 'Matrix'
library(quanteda)
##
## Attaching package: 'quanteda'
library(yardstick)
##
## Attaching package: 'yardstick'
library(knitr)
dim(ungd_sample)
## [1] 441   7
dim(speeches_dfm)
speeches_dfm[1:6, 1:6]
#Ridge
ridge <- cv.glmnet(train_dfm, docvars(train_dfm, "is_us"),
                   family = "binomial", alpha = 0, nfolds = 5, # alpha = 0: ridge penalty
                   parallel = TRUE, intercept = TRUE,
                   type.measure = "class")
summary(ridge)
ridge$lambda.min
## [1] 7.713741
ridge$lambda.1se
## [1] 27.08442
plot(ridge)
#Lasso
lasso <- cv.glmnet(train_dfm, docvars(train_dfm, "is_us"),
                   family = "binomial", alpha = 1, nfolds = 5, # alpha = 1: lasso penalty
                   parallel = TRUE, intercept = TRUE,
                   type.measure = "class")
summary(lasso)
lasso$lambda.min
## [1] 0.005316782
lasso$lambda.1se
## [1] 0.01781974
plot(lasso)
##
## preds_ridge 0 1
## 0 81 5
## 1 0 3
##
## preds_lasso 0 1
## 0 80 3
## 1 1 5
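The accuracy and recall figures can be read directly off these matrices (rows are predictions, columns the true is_us label); a quick check:

```r
# Ridge matrix: 81 & 5 on the prediction-0 row, 0 & 3 on the prediction-1 row
ridge_acc    <- (81 + 3) / (81 + 5 + 0 + 3)  # correct / total, ~0.944
ridge_recall <- 3 / (5 + 3)                  # 3 of the 8 true US speeches found
# Lasso matrix: 80 & 3 on the prediction-0 row, 1 & 5 on the prediction-1 row
lasso_acc    <- (80 + 5) / (80 + 3 + 1 + 5)  # ~0.955
lasso_recall <- 5 / (3 + 5)                  # 0.625
```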
p_lasso = factor(preds_lasso))
results
## 48 0 0 0
## 49 0 0 0
## (rows 50-88 identical: all zero)
## 89 0 0 0
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.944
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.955
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 1
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.833
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.375
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.625
## [1] 0.5454545
## [1] 0.9700599
## [1] 0.7142857
## [1] 0.9756098
#_________________________________________________________
#Using the lasso-model to produce top predictive terms for US
best.lambda <- which(lasso$lambda == lasso$lambda.1se)
best.lambda
## [1] 56
## [1] 46574
sum(beta == 0)
## [1] 46544
## coef word
## 1 1.1618740 peac_endur
## 2 1.0319813 ask_congress
## 3 0.8189179 control_agenda
## 4 0.6292725 liberti
## 5 0.5605610 today_unit
## 6 0.5211970 world_democraci
## 7 0.4708195 full_scope
## 8 0.4265196 terribl_weapon
## 9 0.4186460 everi_nation
## 10 0.3610864 grievanc
## coef word
## 1 0 privileg
## 2 0 extend
## 3 0 warm
## 4 0 congratul
## 5 0 unit
## 6 0 state
## 7 0 deleg
## 8 0 elect
## 9 0 presid
## 10 0 twenti
## [1] 0
#_________________________________________________________
#Using the ridge-model to produce top predictive terms US
best.lambda1 <- which(ridge$lambda == ridge$lambda.1se)
best.lambda1
## [1] 47
## [1] 46574
sum(beta1 == 0)
## [1] 152
## coef word
## 1 0.02154828 intern_landscap
## 2 0.02154798 emerg_within
## 3 0.02154761 hard_earn
## 4 0.02154760 involv_question
## 5 0.02053956 intern_oversight
## 6 0.01941480 slowli_sure
## 7 0.01941431 institut_secur
## 8 0.01871609 food_unit
## 9 0.01871432 caus_second
## 10 0.01871428 nation_victim
## coef word
## 1 -0.002699953 term_need
## 2 -0.002699920 improv_infrastructur
## 3 -0.002699918 bring_uniqu
## 4 -0.002699918 govern_right
## 5 -0.002699905 govern_move
## 6 -0.002698543 way_give
## 7 -0.002698541 secur_architectur
## 8 -0.002698525 press_ahead
## 9 -0.002671338 mean_creat
## 10 -0.002671309 process_east
#Converting dfm to df
speeches_df <- convert(speeches_dfm, to = "data.frame")
sum(speeches_df$peac_endur)
## [1] 3
sum(speeches_df$ask_congress)
## [1] 4
sum(speeches_df$control_agenda)
## [1] 4
sum(speeches_df$liberti)
## [1] 92
sum(speeches_df$today_unit)
## [1] 20
sum(speeches_df$world_democraci)
## [1] 3
sum(speeches_df$full_scope)
## [1] 3
sum(speeches_df$terribl_weapon)
## [1] 4
sum(speeches_df$everi_nation)
## [1] 74
sum(speeches_df$grievanc)
## [1] 11
sum(speeches_df$intern_landscap)
## [1] 3
sum(speeches_df$emerg_within)
## [1] 3
sum(speeches_df$hard_earn)
## [1] 4
sum(speeches_df$involv_question)
## [1] 4
sum(speeches_df$intern_oversight)
## [1] 4
sum(speeches_df$slowli_sure)
## [1] 5
sum(speeches_df$institut_secur)
## [1] 3
sum(speeches_df$food_unit)
## [1] 3
sum(speeches_df$caus_second)
## [1] 3
sum(speeches_df$nation_victim)
## [1] 3