TDM Studio

A text and data mining solution for research, teaching and learning

Creating your dataset

Start by clicking Create New Dataset.
You can create a new dataset in one of two ways:

  1. Choosing specific publication titles
  2. Choosing by ProQuest database name

Choosing your publications

In this dataset, we are going to add content from 5 sources. The second one we chose was the Times of India, and we chose both the online version and the current version. At the bottom of the screen, the system keeps track of all the publication titles chosen so far, until you move to the next step in creating the dataset.


We wanted to get a diverse view, so we also chose the Canadian Press.


And finally, the Sydney Morning Herald.

Now we move to the next step, Refining the Content, to narrow the dataset so that it meets the conditions of the research question.

Refining your search

Now you can see that the dataset contains over 13 million documents, and you will need to refine it to fewer than 2 million documents. If you are interested in more than 2 million, the easiest way to break up the set is by time period or, if necessary, by individual publication.
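As a rough illustration of breaking up a large set by time period, the sketch below groups consecutive years into batches that each stay under the 2 million document limit. The function name `batch_years` and the per-year counts are hypothetical and used only for illustration; in practice you would read the counts from the document totals shown while refining your search.

```python
from typing import Dict, List

MAX_DOCS = 2_000_000  # dataset size limit stated in this guide


def batch_years(counts_by_year: Dict[int, int], limit: int = MAX_DOCS) -> List[List[int]]:
    """Group consecutive years into batches whose totals stay under `limit`."""
    batches: List[List[int]] = []
    current: List[int] = []
    running = 0
    for year in sorted(counts_by_year):
        n = counts_by_year[year]
        if n >= limit:
            # A single year is already too large; it needs a finer split,
            # for example by individual publication.
            raise ValueError(f"{year} alone has {n} documents")
        if current and running + n >= limit:
            # Adding this year would reach the limit, so start a new batch.
            batches.append(current)
            current, running = [], 0
        current.append(year)
        running += n
    if current:
        batches.append(current)
    return batches


# Hypothetical per-year counts, for illustration only.
example_counts = {2018: 900_000, 2019: 800_000, 2020: 1_500_000, 2021: 400_000}
print(batch_years(example_counts))  # [[2018, 2019], [2020, 2021]]
```

Each batch would then become its own dataset, created with the matching date range in the date published filter.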


We enter the search terms COVID OR coronavirus. This search box supports full Boolean searching, so the researcher has control over the search query.

To learn more about Boolean operators, check our guide.

You can also refine by date published, source type and document type. All of these capabilities put the power of data into the hands of researchers and allow them to curate a dataset specific to their research in ways that were not previously possible. Then click Review Dataset to complete the process.


Here you will name the dataset and add a description that will help you identify it later among the other 9 possible datasets in your workbench.

The confirmation screen indicates that your dataset is now being created and takes you back to the dashboard. The dataset is not created instantaneously; it needs some processing time. Depending on the size of the dataset, processing can take from about an hour to just under a day. We process approximately 100,000 records per hour.
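To get a feel for how long to wait, the approximate rate of 100,000 records per hour can be turned into a quick estimate. This is a minimal sketch that assumes the rate stays roughly constant and ignores any time spent waiting in the queue; the function name `estimated_hours` is made up for this example.

```python
RECORDS_PER_HOUR = 100_000  # approximate rate quoted in this guide


def estimated_hours(num_documents: int, rate: int = RECORDS_PER_HOUR) -> int:
    """Return the estimated processing time in whole hours, rounded up."""
    if num_documents < 0 or rate <= 0:
        raise ValueError("num_documents must be >= 0 and rate must be > 0")
    # Ceiling division using integers only, to avoid floating-point rounding.
    return -(-num_documents // rate)


print(estimated_hours(250_000))    # 3
print(estimated_hours(2_000_000))  # 20
```

A dataset at the 2 million document maximum works out to about 20 hours, which matches the "just under a day" figure above.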

Now you can see the dataset on the dashboard, queued for processing. Processing continues whether you are logged in or not. If your dataset is well below a million documents, check back in a few hours. If you have created a dataset at the maximum of 2 million documents, check back in a day and it should be ready to go. The workbench shows the file location, which you will need in your Jupyter notebook in order to access the dataset.
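Once the dataset is ready, a first step in the notebook might be to confirm that the files are where the workbench says they are. The sketch below uses only the Python standard library; the helper name `list_dataset_files` and the placeholder path are assumptions, so substitute the file location shown on your own workbench.

```python
from pathlib import Path
from typing import List


def list_dataset_files(dataset_dir: str, pattern: str = "*") -> List[Path]:
    """Return the files in `dataset_dir` that match `pattern`, sorted by name."""
    root = Path(dataset_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset folder not found: {root}")
    # Keep regular files only; subfolders are skipped.
    return sorted(p for p in root.glob(pattern) if p.is_file())


# Placeholder path: replace it with the file location from your workbench.
# files = list_dataset_files("/path/shown/on/your/workbench")
# print(len(files), "files; first few:", [f.name for f in files[:5]])
```

Comparing the file count with the document count on the dashboard is a quick check that processing has finished.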