What is a parallel text corpus?

Like most machine-learning systems, machine translation (MT) requires massive amounts of training data to produce accurate, natural-sounding output.

A parallel text corpus is a large, structured collection of texts in one language paired with their translations in another. Machine translation algorithms are often trained on parallel corpora created by human translators in order to achieve high-quality output.
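
As a rough illustration of the structure described above, a parallel corpus can be modeled as a list of source/target segment pairs. This is a minimal sketch: the `SegmentPair` class, the `to_tsv` helper, the tab-separated layout, and the sample sentences are all assumptions made for the example, not Gengo's actual delivery format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentPair:
    """One aligned unit: a source segment and its human translation."""
    source: str
    target: str


# Illustrative English-Japanese pairs (made up for this example).
corpus = [
    SegmentPair("Where is the station?", "駅はどこですか？"),
    SegmentPair("Thank you very much.", "どうもありがとうございます。"),
]


def to_tsv(pairs):
    """Serialize pairs as one 'source<TAB>target' line per segment."""
    return "\n".join(f"{p.source}\t{p.target}" for p in pairs)


if __name__ == "__main__":
    print(to_tsv(corpus))
```

Keeping one pair per line is a common convention because many MT training toolkits read line-aligned input, but the format you need depends on your own system.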

Gengo can source large amounts of parallel text data across 70+ language pairs for companies seeking to improve or develop machine translation engines. Our crowd of 22,000+ human translators will deliver the volume you need to build and train an effective machine translation system.

 

Why Gengo?

 

Scale

We can quickly prepare translation datasets of 500K+ segments per language pair, with minimal lead time.
 

Quality

All data is translated by human translators, with no post-edited machine translation (PEMT), then cleanly segmented and aligned for easy input into your system.
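
To show what "aligned" means in practice, here is a minimal sketch of a sanity check you might run on incoming data before training. It assumes the corpus arrives as two line-aligned lists of segments, which is an assumption for this example; the function name `find_misaligned` is hypothetical.

```python
def find_misaligned(source_lines, target_lines):
    """Return indices of pairs where either side is empty or whitespace-only.

    Raises ValueError if the two sides differ in length, since line-aligned
    data must have exactly one target segment per source segment.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"segment counts differ: {len(source_lines)} source "
            f"vs {len(target_lines)} target"
        )
    return [
        i
        for i, (src, tgt) in enumerate(zip(source_lines, target_lines))
        if not src.strip() or not tgt.strip()
    ]


if __name__ == "__main__":
    source = ["Good morning.", "See you tomorrow."]
    target = ["Bonjour.", ""]
    print(find_misaligned(source, target))
```

A check like this only catches structural problems such as missing or empty segments. It says nothing about translation quality.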
 

Value

We offer clear, competitive per-segment pricing based on the volume and language(s) you need.
 

Content categories

Looking for a parallel corpus for a particular type of content? Gengo organizes content into 23 categories:

 
  • Art & Entertainment
  • Automotive
  • Business & Industrial
  • Human Resources
  • Education
  • Family & Parenting
  • Finance
  • Food & Drink
  • Medicine, Health & Fitness
  • Hobbies & Interests
  • Home & Garden
  • Law, Govt & Politics
  • News
  • Pets
  • Real Estate
  • Religion & Spirituality
  • Science
  • Retail
  • Society & Culture
  • Sports
  • Style & Fashion
  • Technology & Computing
  • Travel
 
 

Additional services

Machine translation retraining

We can identify and correct errors in your machine translation output to produce natural, error-free translations.
 

Custom parallel corpora

Need a parallel text corpus for a particular language pair or content type? We can build parallel corpora tailored to your system.

Interested in a Gengo translation corpus?

Speak to one of our account managers.