What is a translation corpus?

Like most machine-learning systems, machine translation (MT) needs large amounts of training data to produce accurate, natural-sounding output.

A translation corpus is a large, structured collection of source texts paired with their translations in another language. Machine translation engines are often trained on datasets created by human translators to achieve high-quality output.
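
As a rough illustration of what such a corpus can look like in practice, the sketch below reads aligned segment pairs from tab-separated text, one source segment and one target segment per line. The file layout, the `SegmentPair` type, and the `read_aligned_segments` function are assumptions made for this example only; they do not describe Gengo's actual delivery format.

```python
import csv
import io
from typing import List, NamedTuple


class SegmentPair(NamedTuple):
    """One aligned unit of a translation corpus (hypothetical type)."""
    source: str
    target: str


def read_aligned_segments(stream) -> List[SegmentPair]:
    """Read tab-separated source/target rows, skipping incomplete ones."""
    pairs = []
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # not a source/target pair
        source, target = row[0].strip(), row[1].strip()
        if source and target:
            pairs.append(SegmentPair(source, target))
    return pairs


# Small in-memory sample standing in for a real corpus file.
sample = io.StringIO(
    "Where is the station?\t駅はどこですか？\n"
    "Thank you very much.\tどうもありがとうございます。\n"
    "This row has no translation\t\n"
)
pairs = read_aligned_segments(sample)
```

Each resulting pair can then be fed to a training pipeline as one source/target example; rows without a translation are dropped rather than guessed at.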

For companies seeking to improve their machine translation engines, Gengo can source large amounts of translation data across 70+ language pairs. Our crowd of 22,000+ human translators will deliver the volume you need to build and train an effective machine translation system.

Why Gengo?

Scale

We can quickly prepare massive translation datasets of 500K+ segments per language pair, with minimal lead time.

Quality

All data is translated by human translators (no post-edited machine translation, or PEMT), cleanly segmented, and aligned for easy input into your system.

Value

We offer clear, competitive per-segment pricing based on the volume and language(s) you need.

Content categories

Looking for a particular type of content? Gengo segments content into 23 different categories:

  • Art & Entertainment
  • Automotive
  • Business & Industrial
  • Human Resources
  • Education
  • Family & Parenting
  • Finance
  • Food & Drink
  • Medicine, Health & Fitness
  • Hobbies & Interests
  • Home & Garden
  • Law, Govt & Politics
  • News
  • Pets
  • Real Estate
  • Religion & Spirituality
  • Science
  • Retail
  • Society & Culture
  • Sports
  • Style & Fashion
  • Technology & Computing
  • Travel

Additional services

Machine translation retraining

We can identify and correct errors in your machine translation output to produce natural, error-free translations.

Custom corpora

Need a particular language pair or content type? We can create tailored corpora built specifically for your system.

Interested in a Gengo translation corpus?

Speak to one of our account managers.