The dropdown offers two options: Select Publication Titles and Select ProQuest Databases.
The list of publications is based on two things: what the university subscribes to and the content that ProQuest has cleared for Text and Data Mining Rights. This is important because you can move forward with confidence that your content has been rights cleared, a process that can otherwise take a significant amount of time to negotiate.
If you have a specific title you would like to use (e.g. New York Times or The Wall Street Journal), use the search box in the upper right-hand corner to filter by publication.
Here are some things to note as you are searching for your publications:
In this example we are going to select three news titles with international coverage.
The second title is the Times of India. In this case the selection is the current print version rather than the online version.
As you select your titles, the number of titles that will be included in the dataset is noted at the bottom of the page.
The third publication selected is the Canadian Press news wire feed.
You can add as many publications as you would like to each dataset you create. Once you have selected all the publications that you wish to use, click the Next: Refine Content button to proceed to the next step in creating your dataset.
In the refine content step you can search for topics or words that match your research objective. This is an important step since a dataset can only contain up to 2 million records. You can limit your search to specific fields in documents using Boolean operators and qualifiers such as language or location. For more information on the search capabilities you can check the ProQuest Platform LibGuide.
A number of filters are available to winnow your dataset, including full text, dates, source type, and document type. Select the appropriate filters and click apply.
When searching for specific events, try to include as many variations of the event name as possible. For example, a search for “2009 Lancaster Mayoral Elections” might limit your results, but searching for “2009 Lancaster elections” or “mayoral elections” can return more articles that discuss the same subject without matching the exact wording.
As you apply searches and filters, a sample of the documents that match your criteria is displayed as a preview.
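One way to keep track of the variations is to assemble them into a single Boolean query before pasting it into the search box. The following is a minimal sketch in Python; the helper name build_or_query is hypothetical, and it assumes that the search box accepts quoted phrases joined with OR, which you should confirm in the ProQuest Platform LibGuide.

```python
def build_or_query(phrases):
    """Join search phrases with OR, quoting each one so it is treated as a phrase.

    Hypothetical helper for illustration; it only builds a string and does
    not call any ProQuest or TDM Studio service.
    """
    cleaned = [p.strip() for p in phrases if p.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty phrase is required")
    return " OR ".join(f'"{p}"' for p in cleaned)


# Variations of the event name taken from the example in the text.
query = build_or_query([
    "2009 Lancaster Mayoral Elections",
    "2009 Lancaster elections",
    "mayoral elections",
])
print(query)
```

The printed string can then be copied into the search box in the refine content step.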
When you are satisfied with the dataset that you have created, you can start the process to create your dataset by clicking the Next: Review Dataset button on the bottom-right.
A summary of your dataset appears in the upper left. If you want to modify it, click on Refine Content on the progress bar. Enter a name for your dataset and an optional description.
Click Create Dataset when you are ready to generate it.
Once you have created your dataset, you are returned to your dashboard, and the dataset you just defined displays with the status of In-Progress. TDM Studio processes 100,000-200,000 documents an hour.
You can estimate the processing time based on the number of documents in your dataset. The status will be updated to Ready for Jupyter when it is finished processing.
Once the status updates, the documents will automatically begin transferring to the Jupyter Notebook development environment at the same rate as before. We recommend creating datasets hours or even days in advance if you plan on building large datasets.
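As a rough planning aid, the waiting time can be estimated from the document count and the 100,000-200,000 documents per hour rate quoted above. The sketch below is only an illustration: the function name estimate_hours is hypothetical, and it assumes that the transfer to Jupyter starts after processing finishes and runs at the same rate, so the total wait is about twice the processing time.

```python
def estimate_hours(num_documents, slow_rate=100_000, fast_rate=200_000):
    """Return (best_case_hours, worst_case_hours) for one pass over the dataset.

    The default rates are the 100,000-200,000 documents per hour range
    quoted in this guide; actual throughput may differ.
    """
    if num_documents < 0:
        raise ValueError("num_documents must not be negative")
    return num_documents / fast_rate, num_documents / slow_rate


# Example: a dataset at the 2 million record limit.
best, worst = estimate_hours(2_000_000)
print(f"Processing: {best:.0f} to {worst:.0f} hours")

# Assumption: transfer follows processing at the same rate, doubling the wait.
print(f"Processing plus transfer: {2 * best:.0f} to {2 * worst:.0f} hours")
```

For a dataset at the 2 million record limit this gives 10 to 20 hours of processing, or 20 to 40 hours including the transfer, which is why building large datasets well in advance is recommended.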