Topic modeling is a text-mining approach for identifying the topics or subjects present in a dataset. In TDM Studio, topic modeling can be applied to both newspaper content and dissertation and thesis content, in support of several different objectives. For example:
In the example below, we use LDA to analyze a set of 8,851 newspaper articles: every article published by the New York Times in September 2001. How does the news cycle change in response to the tragic terrorist attacks of September 11? How does this differ from one newspaper to another?
LDA (Latent Dirichlet Allocation) is a generative model that attempts to discover ‘latent’, or hidden, topics within a collection of documents. The only observed variable in the model is the occurrence of words in documents. The number of topics is supplied by the user (in TDM Studio, via the ‘Number of Topics’ dropdown) and affects the resulting topic model.
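For readers who want the model written out, one common statement of the LDA generative process is sketched below. The symbols are chosen here for illustration and are not taken from TDM Studio: K is the number of topics, D the number of documents, N_d the number of words in document d, and α and η are the Dirichlet prior parameters. Only the words w are observed; the topic-word distributions β, the document-topic distributions θ, and the per-word topic assignments z are all latent.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} &\sim \mathrm{Categorical}(\beta_{z_{d,n}}), & n &= 1,\dots,N_d
\end{align*}
```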
For TDM Studio, we use scikit-learn’s implementation of Latent Dirichlet Allocation.
The scikit-learn User Guide gives further details on how the word and topic distributions are computed.
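As a rough illustration of what fitting this model looks like, the sketch below runs scikit-learn's LatentDirichletAllocation on a tiny hand-made count matrix. The matrix, the choice of two topics, and the random seed are assumptions made for the example; the settings TDM Studio actually uses are not stated here.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (assumed data): 4 documents x 6 vocabulary terms.
counts = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 1, 2, 3, 2],
])

# n_components plays the role of the user-supplied number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# One row per document, one column per topic; each row sums to 1.
doc_topic = lda.fit_transform(counts)

# components_ holds unnormalized topic-word weights; normalizing each row
# gives a word distribution for each topic.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

print(doc_topic.round(2))
print(topic_word.round(2))
```

Changing n_components changes both matrices, which is why the number of topics has such a visible effect on the result.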
To prepare documents for topic modeling, we rely on scikit-learn’s CountVectorizer.
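The following minimal sketch shows how CountVectorizer turns raw text into the word-count matrix that LDA consumes. The two sentences and the stop_words="english" setting are assumptions for illustration; the options TDM Studio passes to CountVectorizer are not documented here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical input documents.
docs = [
    "The market fell sharply.",
    "Markets rallied as the market recovered.",
]

# By default CountVectorizer lowercases text and splits it into word tokens.
# Removing English stop words is an assumed setting, not a documented one.
vectorizer = CountVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

# vocabulary_ maps each term to its column index in the matrix.
vocab = vectorizer.vocabulary_

print(sorted(vocab))
print(matrix.toarray())
```

Each row of the matrix is one document and each column one vocabulary term, so word order is discarded and only counts remain.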
For newspaper articles, we use the title, abstract, and full text as input. Because dissertations and theses are often hundreds of pages long, we use only their title and abstract as input.
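The field selection described above could be sketched as follows. The record layout, the field names (doc_type, title, abstract, full_text), and the function name are all hypothetical; they only illustrate the rule that full text is included for newspaper articles and left out for dissertations and theses.

```python
def build_input_text(record: dict) -> str:
    """Join the fields used as topic-model input for one document."""
    if record.get("doc_type") == "dissertation":
        # Dissertations and theses: title and abstract only.
        fields = ("title", "abstract")
    else:
        # Newspaper articles: title, abstract, and full text.
        fields = ("title", "abstract", "full_text")
    parts = [record.get(name, "") for name in fields]
    # Skip missing or empty fields so they do not add stray spaces.
    return " ".join(p.strip() for p in parts if p and p.strip())


news = {
    "doc_type": "newspaper",
    "title": "Markets reopen",
    "abstract": "Trading resumes.",
    "full_text": "Stocks fell at the opening bell.",
}
thesis = {
    "doc_type": "dissertation",
    "title": "Topic models",
    "abstract": "A study of LDA.",
    "full_text": "Several hundred pages of text.",
}

print(build_input_text(news))
print(build_input_text(thesis))
```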
For each topic, we list the ten words with the highest probability for that topic. These words often, though not always, give an indication of what the topic is about.
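Picking the highest-probability words for each topic amounts to sorting each topic's word weights. The sketch below assumes a topic-word matrix given as plain lists (with scikit-learn these weights would come from the fitted model's components_ attribute); the vocabulary, the weights, and the function name are made up for the example.

```python
def top_words(topic_word, vocabulary, n_words=10):
    """Return the n_words highest-weighted vocabulary terms for each topic."""
    result = []
    for weights in topic_word:
        # Indices of terms sorted from highest to lowest weight.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n_words]])
    return result


# Hypothetical vocabulary and per-topic word probabilities.
vocabulary = ["attack", "city", "game", "market", "season", "team"]
topic_word = [
    [0.40, 0.30, 0.05, 0.15, 0.05, 0.05],
    [0.05, 0.05, 0.35, 0.05, 0.20, 0.30],
]

print(top_words(topic_word, vocabulary, n_words=3))
# → [['attack', 'city', 'market'], ['game', 'team', 'season']]
```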
Clicking a topic card displays a list of up to fifty documents related to the selected topic: the documents for which that topic has a high probability of occurring. Clicking the title of a document opens a new window with its full text.
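One simple way to build such a list is to rank documents by their probability for the selected topic and keep the first fifty. This is an assumption about the selection rule, which the description above does not spell out (a probability threshold would be another option); the function name and data are hypothetical.

```python
def top_documents(doc_topic, titles, topic, limit=50):
    """Return up to `limit` (title, probability) pairs, highest probability first."""
    ranked = sorted(
        range(len(titles)),
        key=lambda i: doc_topic[i][topic],
        reverse=True,
    )
    return [(titles[i], doc_topic[i][topic]) for i in ranked[:limit]]


# Hypothetical document-topic probabilities: one row per document.
titles = ["Article A", "Article B", "Article C"]
doc_topic = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]

print(top_documents(doc_topic, titles, topic=0, limit=2))
# → [('Article A', 0.9), ('Article C', 0.6)]
```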
Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, pp.993-1022.
Hall, D., Jurafsky, D. and Manning, C.D., 2008, October. Studying the history of ideas using topic models. In Proceedings of the 2008 conference on empirical methods in natural language processing (pp. 363-371).
Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S. and Blei, D.M., 2009, December. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems (Vol. 22, pp. 288-296).
Dieng, A.B., Ruiz, F.J. and Blei, D.M., 2019. The dynamic embedded topic model. arXiv preprint arXiv:1907.05545.