1. We choose a default number of clusters, k, to begin with.
2. Each input data point goes into a cluster depending on its proximity to the initial seed points.
3. Each of the k clusters now has multiple data points, and we compute the centroid/seed point/mean of each cluster.
4. The data points are reassigned to clusters depending on the proximity of each point to the seed points of each cluster.
5. After the reassignment of the data points into their respective clusters, we again compute the seed point/mean of each cluster.
6. Reassignment of the data points happens again, depending on proximity to the seed points calculated in step 5.
7. We again compute the seed point/mean (centroid) of each cluster.
8. Reassignment of the data points happens depending on proximity to the cluster means calculated in step 7.
9. This process continues until no further reassignment takes place, i.e. every point remains in the same cluster.
NLP #CountVectorizer helps tokenize the documents; it converts text to vectors by counting how many times each token appears in each document.