LibGuides: TDM Studio: Creating your dataset

Creating your dataset

You can choose between two options from the Create New Dataset dropdown: Select Publication Titles and Select ProQuest Databases.

Select Publication Titles: allows you to limit your search to specific publications, e.g. New York Times or The Washington Post
Select ProQuest Databases: allows you to limit your search to specific ProQuest databases, e.g. ProQuest Dissertations and Theses – Global

Choosing your publications

The list of publications is based on two things: what the university subscribes to and the content that ProQuest has cleared for text and data mining rights. This matters because you can move forward with confidence that your content has already been rights cleared, a process that can otherwise take a significant amount of time to negotiate.

If you have a specific title you would like to use (e.g. New York Times or The Wall Street Journal), use the search box in the upper right-hand corner to filter the list of publications.

Here are some things to note as you are searching for your publications:

Sometimes there will be multiple entries for the same publication title. Pay attention to the Source Type column when selecting your publication to make sure that you are selecting the right one. For example, a search for New York Times can return multiple matches with different source types. Some newspapers are delivered via electronic feed, while others are scanned, so OCR quality may vary.
You can use the Full Text column on the far right to determine whether your selected publication contains full text or not.
Make sure that the publications that you select cover the period that you want. Some publications are split between historical and current versions, so it may be necessary to select different or multiple ones depending on the time span you want covered.
If you are selecting multiple publications of the same name (their current and historical versions), try to generate your dataset starting from the most recent version and working back chronologically. For example, if a publication has historical coverage from 2000 to 2005 and current coverage from 2003 to the present, it may be better to generate your dataset with the current coverage first, and then add the historical coverage, limiting it to 2000-2003 during the content refinement step.
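The date arithmetic in the last point can be sketched in a few lines of Python. This is only an illustration for planning your date filters, not part of TDM Studio: the function name historical_range is hypothetical, and it assumes coverage is tracked in whole years and that the historical range should end at the year the current version begins, as in the example above.

```python
from typing import Optional, Tuple


def historical_range(historical: Tuple[int, int], current_start: int) -> Optional[Tuple[int, int]]:
    """Return the years to keep from the historical version.

    The range is cut off at the year the current version begins
    (an assumption taken from the 2000-2003 example in the guide).
    Returns None when the current version starts at or before the
    historical one, so no historical dataset is needed.
    """
    start, end = historical
    if current_start <= start:
        return None
    return (start, min(end, current_start))


# Historical coverage 2000-2005, current coverage from 2003 onward.
print(historical_range((2000, 2005), 2003))  # (2000, 2003)
```

If the two versions do not overlap at all, the function returns the historical range unchanged.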

In this example we are going to select three news titles with international coverage. The first is the New York Times newspaper beginning in 1980.

The second title is the Times of India. In this case the selection is the current print version rather than the online version.

As you select your titles, the number of titles that will be included in the dataset is shown at the bottom of the page.

The third publication selected is the Canadian Press news wire feed.

You can add as many publications as you would like to each dataset you create. Once you have selected all the publications that you wish to use, click the Next: Refine Content button to proceed to the next step in creating your dataset.

Refining your content

In the refine content step you can search for topics or words that match your research objective. This is an important step since a dataset can only contain up to 2 million records. You can limit your search to specific fields in documents using Boolean operators and qualifiers such as language or location. For more information on the search capabilities you can check the ProQuest Platform LibGuide.

A number of filters are available to narrow your dataset, including full text, dates, source type, and document type. Select the appropriate filters and click Apply.

When searching for specific events, try to include as many variations of the event name as possible. For example, a search for “2009 Lancaster Mayoral Elections” might limit your results, but searching for “2009 Lancaster elections” or “mayoral elections” can return more articles that discuss the same subject without matching the exact words.
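One way to keep track of such variations is to assemble them into a single search string before pasting it into the search box. The sketch below is a minimal illustration: the helper name build_or_query is hypothetical, and it assumes the search box accepts quoted phrases joined with the Boolean operator OR. Check the ProQuest Platform LibGuide for the exact syntax before relying on it.

```python
from typing import Iterable


def build_or_query(variants: Iterable[str]) -> str:
    """Join phrase variants into one string of quoted phrases separated by OR.

    Blank entries and exact duplicates are dropped; order is preserved.
    """
    cleaned = []
    for variant in variants:
        phrase = variant.strip()
        if phrase and phrase not in cleaned:
            cleaned.append(phrase)
    return " OR ".join(f'"{phrase}"' for phrase in cleaned)


query = build_or_query([
    "2009 Lancaster elections",
    "mayoral elections",
])
print(query)  # "2009 Lancaster elections" OR "mayoral elections"
```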

As you apply searches and filters, a sample of documents that match your criteria is displayed as a preview.

When you are satisfied with your searches and filters, click the Next: Review Dataset button on the bottom right to continue.

A summary of your dataset appears in the upper left. If you want to modify it, click on Refine Content on the progress bar. Enter a name for your dataset and an optional description.

Click Create Dataset when you are ready to generate it.

Dataset processing

Once you have created your dataset, you are returned to your dashboard and the dataset you just defined displays with the status of In-Progress. TDM Studio processes 100,000-200,000 documents an hour.

You can estimate the processing time based on the number of documents in your dataset. The status will be updated to Ready for Jupyter when it is finished processing.

Once the status updates, the documents will begin to transfer automatically to the Jupyter Notebook development environment. This happens at the same rate as processing. If you plan on building large datasets, we recommend creating them hours, or even days, in advance.
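As a rough planning aid, the rates quoted above can be turned into a time estimate. The sketch below is an illustration only, not a TDM Studio tool: the function names are hypothetical, and it assumes that processing and the transfer to Jupyter each run at 100,000 to 200,000 documents an hour and happen one after the other.

```python
from typing import Tuple

# Rates quoted in this guide, in documents per hour.
RATE_LOW = 100_000
RATE_HIGH = 200_000


def estimate_phase_hours(num_documents: int) -> Tuple[float, float]:
    """Return (best case, worst case) hours for a single phase."""
    if num_documents < 0:
        raise ValueError("num_documents must not be negative")
    return (num_documents / RATE_HIGH, num_documents / RATE_LOW)


def estimate_total_hours(num_documents: int) -> Tuple[float, float]:
    """Return (best, worst) hours for processing plus transfer.

    Assumes both phases run at the same rate, one after the other.
    """
    best, worst = estimate_phase_hours(num_documents)
    return (2 * best, 2 * worst)


# A dataset at the 2 million record limit.
best, worst = estimate_phase_hours(2_000_000)
print(f"Processing: {best:.0f} to {worst:.0f} hours")  # Processing: 10 to 20 hours
best, worst = estimate_total_hours(2_000_000)
print(f"Processing and transfer: {best:.0f} to {worst:.0f} hours")  # Processing and transfer: 20 to 40 hours
```

Under these assumptions, a dataset at the maximum size could take between one and two days to be ready in Jupyter, which is why creating large datasets well in advance is worthwhile.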