10 - Training Data Provider

Gold data contains the ground truth. Re-use available resources, but be careful that the dataset matches your purpose.

Rob van Zoest
Founder @ innerdoc.com | NLP Expert-Engineer-Enthusiast | Writes about how to get value from textual data | Lives in the Netherlands | Loves to travel around the globe | Dutchman | rob@innerdoc.com

11 Sep 2020

Gold data is data of very high quality, about as close to the ground truth as you can get. This is the data you want to use when training a new Language Model. Some Data Providers sell these high-quality datasets. However, whether an off-the-shelf dataset is usable depends on how well it matches your NLP-Task, Language, Domain and Tag-schema. I’m skeptical of third-party datasets, as they almost never match your purpose. The exceptions are a standard NLP task, a demo, or using the dataset in addition to your own training data.
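As a rough illustration of checking a third-party dataset against the four criteria above, here is a minimal Python sketch. The `DatasetProfile` class, the `fit_report` and `is_usable` helpers, and the example values are all hypothetical and only serve the illustration; a real check would also look at annotation guidelines and data quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetProfile:
    """Hypothetical description of a dataset, or of what a project needs."""
    task: str        # NLP task, e.g. "ner"
    language: str    # language code, e.g. "nl"
    domain: str      # text domain, e.g. "legal"
    tags: frozenset  # tag schema, e.g. {"PER", "ORG", "LOC"}


def fit_report(needed: DatasetProfile, candidate: DatasetProfile) -> dict:
    """Compare a candidate dataset with the project's needs on the four criteria."""
    missing_tags = needed.tags - candidate.tags
    return {
        "task": needed.task == candidate.task,
        "language": needed.language == candidate.language,
        "domain": needed.domain == candidate.domain,
        "tag_schema": not missing_tags,  # True only if every needed tag is covered
        "missing_tags": sorted(missing_tags),
    }


def is_usable(report: dict) -> bool:
    """In this sketch a dataset is usable as-is only if all four criteria match."""
    return all(report[key] for key in ("task", "language", "domain", "tag_schema"))


# Made-up example: a Dutch legal NER project versus a Dutch news NER dataset.
needed = DatasetProfile("ner", "nl", "legal", frozenset({"PER", "ORG", "LAW"}))
candidate = DatasetProfile("ner", "nl", "news", frozenset({"PER", "ORG", "LOC"}))

report = fit_report(needed, candidate)
print(report)
print("usable as-is:", is_usable(report))
```

In this made-up case the task and language match, but the domain differs and the `LAW` tag is missing, so the dataset would at best be a supplement to your own training data.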

A good starting point is the Big Bad NLP Database, which gives an overview of company and research datasets for various tasks.

This article is part of the Periodic Table of NLP Tasks project, which aims to systemize NLP tasks.
