
• U of MN Department of Political Science

• Professor Jane Lawrence Summer


• Researching SEC 10K Risk Factors

Project Overview
“What percentage of a company's Risk Factors are
focused on specific topics, e.g. employee wages?”
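
As a small illustration of the quantity in the question above, here is a minimal sketch that turns per-paragraph topic labels into percentages. The labels and the helper name `topic_shares` are hypothetical and used only for illustration; in the project the labels would come from a model, not be typed in by hand.

```python
from collections import Counter

# Hypothetical topic labels, one per risk-factor paragraph in a single filing.
paragraph_topics = ["wages", "litigation", "wages", "interest rates", "wages"]


def topic_shares(labels):
    """Return the percentage of paragraphs assigned to each topic."""
    counts = Counter(labels)
    total = len(labels)
    return {topic: 100.0 * n / total for topic, n in counts.items()}


shares = topic_shares(paragraph_topics)
for topic, pct in sorted(shares.items()):
    print(f"{topic}: {pct:.1f}%")
```

With these made-up labels, three of the five paragraphs are about wages, so that topic's share is 60%.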

2019 U of MN Data Analytics & Visualization Boot Camp 2


Why Machine Learning?

• < 10 Hand-coded Samples
• 3,000 Scraped Text Files
• Table of Results



Researched Similar Projects

• Summarization of Corporate Risk Factor Disclosure Through Topic Modeling
  - Research paper published in 2012
  - Sent-LDA modeling
• Andrewsthompson.co
  - Clustering 143k articles with KMeans
• Brandonrose.org
  - KMeans clustering
  - LDA modeling



Challenges

- Training an unsupervised model with an RNN was going to take DAYS
- Data was scraped, so it required extensive cleaning
- Text was not broken into paragraphs
- So many different path options:
  - Not sure which, if any, would be successful
  - Which should we choose?



Pipeline (flow diagram; stages as recoverable from the slide)

• Raw Data
• Munge Docs / Words
• TF-IDF Vectorizer
• KMeans Clusters
• Validating clusters w/ metrics (silhouette score)
• GloVe (global vectors to represent words)
• Classification Algorithms: classify documents
• Predict / Verify Documents
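
The vectorize, cluster, and validate stages named on this slide can be sketched with scikit-learn, which appears in the project's library list. This is a minimal sketch, not the project's actual code: the six sample sentences, the choice of two clusters, and all variable names are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical stand-ins for cleaned risk-factor paragraphs.
docs = [
    "Rising employee wages and labor costs could reduce our margins.",
    "Labor shortages may force us to increase employee wages.",
    "Higher wages and benefits for employees raise operating costs.",
    "Changes in interest rates could increase our borrowing costs.",
    "Our debt carries variable interest rates that may rise.",
    "Interest rate increases would raise the cost of our debt.",
]

# Turn each document into a TF-IDF weighted term vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Group the vectors into clusters (the cluster count is an assumption here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Silhouette score ranges from -1 to 1; higher means better-separated clusters.
score = silhouette_score(X, labels)
print("cluster labels:", labels)
print("silhouette score:", round(score, 3))
```

In practice the cluster count would be chosen by comparing silhouette scores across several values of `n_clusters` rather than fixed in advance.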
Python libraries used:

collections
datetime
future
gensim
heapq
math
matplotlib
nltk
numpy
operator
pandas
Path
re
scipy
seaborn
sklearn
string
sys
time
tqdm
