This topic modeling script is an example that uses matrix factorization to detect topics within a dataset of newspaper documents, which we built by searching for the terms COVID OR Coronavirus. This one is written in Python, as you can see in the upper right corner. You can write your scripts in either R or Python within a Jupyter notebook.
Topic modeling is just one example of text mining, but it provides us with methods to organize, understand, and summarize large collections of textual information.
Topic modeling can be described as a method for finding a group of words (i.e., a topic) from a collection of documents that best represents the information in the collection.
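To make the matrix factorization idea concrete, here is a minimal sketch in Python. It is not the script shown in the demo. It assumes a tiny made-up corpus and uses a plain NumPy implementation of non-negative matrix factorization with multiplicative updates. The `docs` list, the `nmf` helper, and all of its parameters are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy corpus; the real newspaper dataset is not reproduced here.
docs = [
    "virus cases health italy spread outbreak",
    "virus outbreak spread italy cases",
    "health virus cases outbreak",
    "economy debt crisis gdp",
    "gdp economy crisis debt economy",
    "debt crisis economy",
]

# Build a document-term count matrix V (documents x vocabulary).
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: j for j, word in enumerate(vocab)}
V = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for word in doc.split():
        V[i, index[word]] += 1


def nmf(V, n_topics, n_iter=500, seed=0, eps=1e-9):
    """Approximate V by W @ H with non-negative W and H (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics)) + eps  # document-topic weights
    H = rng.random((n_topics, V.shape[1])) + eps  # topic-word weights
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


W, H = nmf(V, n_topics=2)

# Each row of H is a topic: list its highest-weighted words.
for k, row in enumerate(H):
    top_words = [vocab[j] for j in np.argsort(row)[::-1][:4]]
    print(f"topic {k}: {', '.join(top_words)}")

# Each row of W holds one document's topic weights; take the strongest topic.
dominant = W.argmax(axis=1)
```

Each row of `H` is the group of words that makes up a topic, and each row of `W` says how strongly a document expresses each topic. A real script would more likely rely on a library implementation and a weighting scheme such as TF-IDF, but the two output matrices play the same roles.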
This topic modeling example produces a series of visualizations and scores the recurrence of the listed topics within documents across time. A red line is also plotted alongside the topic scores to represent the number of new COVID cases in the UK over the same time period.
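One way to picture the scoring across time is to add up a topic's weight over all documents published in the same week. The sketch below assumes that approach; the demo does not say exactly how its scores are computed. The dates, the weights, and the `weekly_topic_score` helper are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (publication date, weight of one topic in that document) pairs.
doc_topic_weights = [
    (date(2020, 2, 24), 0.8),
    (date(2020, 2, 26), 0.6),
    (date(2020, 3, 3), 0.9),
    (date(2020, 3, 5), 0.7),
    (date(2020, 3, 6), 0.5),
    (date(2020, 3, 30), 0.2),
]


def weekly_topic_score(pairs):
    """Sum a topic's document weights per ISO (year, week)."""
    totals = defaultdict(float)
    for day, weight in pairs:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week)] += weight
    return dict(sorted(totals.items()))


scores = weekly_topic_score(doc_topic_weights)
peak_week = max(scores, key=scores.get)
print(scores)
print("peak week:", peak_week)
```

Plotting `scores` against the week gives a line like the topic line in the demo, and a case count series for the same weeks could be drawn on a second axis for comparison.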
In this example, you can see the topics of virus, cases, health, Italy, spread, and outbreak. By comparing the peaks of the blue and red lines on this graph, we could possibly interpret that the virus peaked in Italy several weeks before it peaked in the UK. The blue line represents the presence of these topics in the news, and the Italy cases were being heavily discussed in the news at that time.
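The peak comparison described here can also be done numerically. The sketch below assumes two weekly series keyed by week number; every value is made up for illustration and is not taken from the demo or from real case data. The `peak_week_of` helper is hypothetical as well.

```python
# Hypothetical weekly values keyed by week number; not real data.
topic_score = {8: 0.4, 9: 1.4, 10: 2.1, 11: 1.6, 12: 1.0, 13: 0.7, 14: 0.5}
case_count = {8: 1, 9: 2, 10: 4, 11: 7, 12: 12, 13: 15, 14: 11}


def peak_week_of(series):
    """Return the week whose value is largest."""
    return max(series, key=series.get)


# Positive lag: the topic peaked in the news before the case count peaked.
lag_weeks = peak_week_of(case_count) - peak_week_of(topic_score)
print(f"topic peak leads case peak by {lag_weeks} weeks")
```

A lag like this is only a rough starting point. It shows that two curves peak at different times, not why they do.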
Of course, this is just the beginning of the research, but it gives the researcher an idea of connected topics for further investigation.
If we scroll down further, you can also see the topics of economy, debt, crisis, and GDP being heavily reported. This reporting directly precedes the peak of cases in the UK (at least at the point of developing this model), because the financial impacts were already significant, and already being reported, as COVID made its way across the globe.
These are simple topics, but again they can give the researcher a good idea of the connections in the data, either to shape an interesting research topic or to dig further into the analysis.
From this point, you can export the tables and data behind these graphs, the visualizations themselves, the script, and any derivative data. The only thing that cannot be exported is the full text, or any consumptive information that would allow the researcher to reconstruct the full text.