4 Steps of Using Latent Dirichlet Allocation (LDA) for Topic Modeling in NLP

Published: Jan 15, 2022 · Reading time: 9 min · Omdena Real World Tutorials

Jump to section

* An introduction to the LDA algorithm
* Step 1: Data collection
* Step 2: Preprocessing
* Step 3: Model implementation
  * 3.1. Training
  * 3.2. Improving preprocessing
* Step 4: Visualization
* Summary
* References

Topic Modeling is a technique you have probably heard of many times if you are into Natural Language Processing (NLP). Topic Modeling in NLP is commonly used for document clustering, not only for text analysis but also in search and recommendation engines.

This tutorial will guide you through implementing its most popular algorithm, Latent Dirichlet Allocation (LDA), step by step in the context of a complete pipeline. First, we will learn about the inner workings of LDA. Then, we will use scikit-learn for data preprocessing and model implementation, and pyLDAvis for visualization. As a little extra, we will also do our own data collection with newspaper3k.

Sounds good? Let's start!

Author: Jessica Becerra Formoso

What is Latent Dirichlet Allocation?
Latent Dirichlet Allocation (LDA) is an unsupervised algorithm that assigns each document a value for each defined topic (say we decide to look for 5 different topics in our corpus). Latent is another word for hidden (i.e., features that cannot be directly measured), while Dirichlet is a type of probability distribution.

LDA considers each document as a mix of topics and each topic as a mix of words. It randomly assigns each word to a topic, then iterates over the topics and words, evaluating how often each word occurs in a topic and which other words it co-occurs with. This follows a similar line of thought to how we humans reason about topics, which makes LDA easy to interpret and one of the most popular methods out there. The trickiest part, though, is figuring out the optimal number of topics and iterations.

Latent Dirichlet Allocation is not to be confused with Linear Discriminant Analysis (also abbreviated LDA), a supervised dimensionality-reduction technique used for classification or for preprocessing high-dimensional data.

Step 1: Data collection

To spice things up, let's use our own dataset! For this, we will use the newspaper3k library, a wonderful tool for easy article scraping.

```python
!pip install newspaper3k

import newspaper
from newspaper import Article
```

We will use the build function to collect the URLs on our chosen news website's main page.

```python
# Save URLs from the main page.
news = newspaper.build("https://www.theguardian.com/us", memoize_articles=False)
```

By passing memoize_articles=False, we ensure that, if we call the function a second time, all the URLs are collected again. Otherwise, only the new URLs would be returned.
We can check news.size() to get the number of collected news URLs; in our case, 143.

Next, we simply pass each URL through Article(), call download() and parse(), and finally get the article's text. We also add a length condition to avoid storing some previously spotted exceptions. That way, we ensure that only long texts enter our dataset.

```python
texts = []

# For each URL, get the corresponding article.
for article in news.articles:
    article = Article(article.url)
    article.download()
    article.parse()
    # Keep the text only if it has more than 60 characters -- to avoid undesired exceptions.
    if len(article.text) > 60:
        texts.append(article.text)
```

After running these lines, the total number of stored news articles is 132.

Step 2: Preprocessing

The next step is to prepare the input data for the LDA model. LDA takes a document-term matrix as input.

We will use Bag of Words, specifically the CountVectorizer implementation from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# `stopwords` is a list of stopwords defined previously.
# The max_df / min_df values here are illustrative thresholds for filtering
# very frequent and very rare terms.
bow_vectorizer = CountVectorizer(stop_words=stopwords, lowercase=True,
                                 max_df=0.9, min_df=0.05)
bow_matrix = bow_vectorizer.fit_transform(texts)
```

There are a couple of things to mention here. First, it is essential not to forget to remove stopwords. We set lowercase=True for increased normalization, and we set the document-frequency parameters to filter out high-frequency words (common words not in the stopwords list that do not add much meaning) as well as very low-frequency terms. Our resulting Bag of Words has a shape of (132, 438).

With that in place, it is time to use the LDA algorithm.

Step 3: Model implementation

3.1. Training

Using scikit-learn's implementation of this algorithm is really easy.
However, this abstraction can make it really difficult to understand what is going on behind the scenes. It is important to have at least some intuition of how the algorithms we use actually work, so let's recap the explanations from the introduction.

```python
from sklearn.decomposition import LatentDirichletAllocation as LDA

lda_bow = LDA(n_components=5, random_state=42)
lda_bow.fit(bow_matrix)
```

LDA needs three inputs: a document-term matrix, the number of topics we estimate the documents should have, and the number of iterations for the model to figure out the optimal words-per-topic combinations. n_components corresponds to the number of topics; here, 5 as a first guess. The number of iterations is 10 by default, so we can omit that parameter. Having the configuration of our LDA model set up under the lda_bow variable, we fit (train) it on the BOW matrix.

```python
lda_bow.transform(bow_matrix[:2])
```

By calling transform, we get to see the results of the trained model, which gives us a good picture of how it actually works. We pass only the first two rows of our BOW matrix as an example.

```python
array([[0.76662544, 0.01858679, 0.0183296 , 0.17813906, 0.01831911],
       [0.00103261, 0.00102449, 0.001021  , 0.00102753, 0.99589436]])
```

As you can see, we have 5 values in each of the two vectors. Each value represents a topic (remember we told the model to find 5 different topics). Specifically, it illustrates how much of that topic is covered in that document (vector). This makes sense since a document is usually made up of several (sub)topics.
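Since each row returned by transform is a per-document topic distribution, its five values should sum to (approximately) one. A quick sanity check on the two rows printed above:

```python
# The two rows printed above, copied as plain lists.
doc_topics = [
    [0.76662544, 0.01858679, 0.0183296, 0.17813906, 0.01831911],
    [0.00103261, 0.00102449, 0.001021, 0.00102753, 0.99589436],
]

# Each row is a probability distribution over the 5 topics.
for row in doc_topics:
    assert abs(sum(row) - 1.0) < 1e-6

# Dominant topic per document.
print([max(range(5), key=row.__getitem__) for row in doc_topics])  # prints [0, 4]
```

So the first document is mostly about topic 0 (with a secondary share of topic 3), while the second is almost entirely about topic 4.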
Let's now print the most common words for each topic.

```python
for idx, topic in enumerate(lda_bow.components_):
    print(f"Top 5 words in Topic #{idx}:")
    print([bow_vectorizer.get_feature_names()[i] for i in topic.argsort()[-5:]])
    print("")
```

The output looks like this:

Top 5 words in Topic #0:
['time', 'years', 'life', 'says', 'like']
Top 5 words in Topic #1:
['public', 'york', 'new', 'police', 'trump']
Top 5 words in Topic #2:
['white', 'decision', 'international', 'black', 'uk']
Top 5 words in Topic #3:
['like', 'year', 'food', 'police', 'city']
Top 5 words in Topic #4:
['bill', 'democrats', 'rights', 'voting', 'biden']

This type of inspection is actually an excellent indicator of how well our topic model is being trained. Having words such as "like" or "says" does not provide much meaning. One way around this is to lemmatize and to add these undesired words to our stopwords list. Let's improve our current model next.

3.2. Improving preprocessing

Coming back to the preprocessing step is very common and often necessary. After all, Machine Learning is an iterative process. In our case, we need to improve our Bag of Words so that it does not take into account some very frequent words that could not be filtered out with the previous approach. Furthermore, it would be good to add a lemmatizer to avoid counting the same word under different forms.

For the first case, we just need to add our new list of stopwords to the already defined set of stopwords. For the second, though, CountVectorizer does not integrate a lemmatizer, so we have to create our own lemmatizer class and pass it to the tokenizer parameter. No need to worry much here: scikit-learn has you covered with their documentation on how to customize your vectorizer in this particular case.
```python
import nltk
nltk.download("punkt")
nltk.download("wordnet")

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        # Keep only alphabetic tokens with more than one letter.
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)
                if (t.isalpha() and len(t) > 1)]
```

First, we download some necessary packages and import the corresponding dependencies. The LemmaTokenizer class is the same as in the documentation, except for two extra conditions we add to account only for tokens with alphabetic characters and with more than one letter. Otherwise, your topics will be flooded with punctuation and other undesired tokens.

Now, we only have to pass our new parameter to the vectorizer. The rest remains as before.

```python
bow_vectorizer = CountVectorizer(stop_words=stopwords, tokenizer=LemmaTokenizer())
bow_matrix = bow_vectorizer.fit_transform(texts)
```

If we run everything again, we see that the most common words for our topics do indeed change.

Top 5 words in Topic #0:
['experience', 'event', 'life', 'year', 'city']
Top 5 words in Topic #1:
['republican', 'voting', 'right', 'trump', 'biden']
Top 5 words in Topic #2:
['film', 'life', 'new', 'time', 'year']
Top 5 words in Topic #3:
['year', 'vaccine', 'food', 'city', 'police']
Top 5 words in Topic #4:
['week', 'governor', 'new', 'state', 'woman']

That is looking good, well done!

Step 4: Visualization

One last step in our Topic Modeling analysis has to be visualization. One popular tool for interactive plotting of Latent Dirichlet Allocation results is pyLDAvis.

```python
!pip install pyldavis

import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

# Build the interactive panel from the fitted model, the
# document-term matrix, and the vectorizer.
pyLDAvis.sklearn.prepare(lda_bow, bow_matrix, bow_vectorizer)
```

Make sure to import the corresponding module for the main library you are using for Topic Modeling (in our case, scikit-learn). Again, this step will help us determine how well our model is performing.
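Besides visual inspection, candidate topic counts can be compared quantitatively, for example via the model's perplexity on the document-term matrix (lower is generally better). The sketch below is a minimal, self-contained illustration: it builds a toy stand-in corpus, whereas in the real pipeline you would pass the `bow_matrix` from Step 2.

```python
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in corpus; the real pipeline would reuse `texts` / `bow_matrix`.
docs = [
    "police city year food vaccine",
    "trump biden voting republican right",
    "film life new time year",
] * 10
matrix = CountVectorizer().fit_transform(docs)

# Fit one model per candidate topic count and compare perplexity.
scores = {}
for n in (2, 3, 5):
    lda = LDA(n_components=n, random_state=42).fit(matrix)
    scores[n] = lda.perplexity(matrix)

print(min(scores, key=scores.get))  # topic count with the lowest perplexity
```

Perplexity is only a rough guide; combining it with the pyLDAvis inspection below usually gives a better feel for the right number of topics.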
Let's take a look at the visualizations as they were before improving our vectorizer with the lemmatizer. tps omdena.combogiatent-arichet-alocaton! erat 1490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors Tope Modeling in NLP Sii0oln [Ome Ie [ole le Piel In ea la lelelaia le MISE. wyN ry 0D timccoemsre mi ara a EEE EE ee Ee NLP Topic modeling - Source: Omdena There are two main parts to pyLDAvis. On the left side, the intertopic Distance Map shows each topic as a bubble. The bigger the bubble, the higher the number of documents in our corpus belonging to that topic. The more distanced the bubbles are from each other, the more different their topics are. On the right side, the Top-30 Most Relevant Terms for Topic N consist of a barplot with two indicators. In blue, the total frequency of that word in the corpus, and in red, the frequency of that word in that topic. tps omdena.combogiatent-arichet-alocaton! sor 1490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors: Tope Modeling in NLP” NLP Topic modeling - Source: Omdena It seems we did not have a bad result after all! Let's see how it shows after lemmatization. The sizes of the bubbles are more irregular among them, and Topic 1 has a very large bubble that overlaps in great part with Topic 5. One thing we could explore further is the number of topics. It possibly is that five topics are much for our limited dataset. After some tweaking, we conclude that 3 topics without lemmatizer gives the best results for our case. The topics may still not make entire sense, or may sound repetitive or weak to us. There is no wrong in that. On the other hand, gathering more data can help the variety of our results and solidify the output. Feel free to explore with a larger amount of news articles or with your previously scraped tweets from Part 1 Summary In this tutorial, we learned about Latent Dirichlet Allocation. 
We built some intuition of the whole process and are ready to improve our first outputs by observing the effect of several parameters in our LDA implementation with the help of pyLDAvis. Now it's time to put this into practice. Happy coding!

References

* newspaper3k documentation: https://github.com/codelucas/newspaper
* scikit-learn LDA documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
* pyLDAvis documentation: https://github.com/bmabey/pyLDAvis
* Video tutorial on LDA with Gensim: https://www.youtube.com/watch?v=NYkbqzTIW3w
