Training Data Generation

10 - Training Data Provider

Gold data contains the ground truth. Re-use available resources, but be careful that the dataset matches your purpose.

Gold data is data of very high quality, as close to the ground truth as you can get. This is the data you want to use when training a new language model. Some data providers sell such high-quality datasets. However, whether an off-the-shelf dataset is usable depends on how well it matches your NLP task, language, domain and tag schema. I’m skeptical of third-party datasets, as they almost never match your purpose, unless you use them for a standard NLP task, for a demo, or as a supplement to your own training dataset.
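One quick way to judge the tag-schema fit is to compare the tags that actually occur in a candidate dataset with the tags your task needs. The following is a minimal sketch, not a definitive check: the function name `tag_schema_report`, the sample tags and the target schema are all hypothetical and only serve as illustration.

```python
def tag_schema_report(dataset_tags, target_schema):
    """Compare the tags found in a dataset with the tags a task needs.

    Both arguments are iterables of tag strings (hypothetical inputs).
    """
    found = set(dataset_tags)
    target = set(target_schema)
    return {
        # Tags your task needs but the dataset never uses.
        "missing": sorted(target - found),
        # Tags the dataset uses but your task does not need.
        "extra": sorted(found - target),
        # Share of your target tags that the dataset covers.
        "coverage": len(target & found) / len(target) if target else 0.0,
    }


# Hypothetical example: tags from a third-party NER dataset vs. your own schema.
third_party_tags = ["PER", "ORG", "LOC", "PER", "MISC"]
my_schema = ["PER", "ORG", "PRODUCT"]

report = tag_schema_report(third_party_tags, my_schema)
print(report)
```

A low coverage value or a long list of missing tags suggests the dataset would need re-annotation before it matches your purpose. Task, language and domain fit still have to be judged separately.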

The Big Bad NLP database is a good starting point for an overview of company and research datasets for various tasks.

This article is part of the project Periodic Table of NLP Tasks. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks.