Home » Translation and Localisation » Translation Terms: How to Calculate Unique and Repeated Words

Translation Terms: How to Calculate Unique and Repeated Words

August 27, 2020
| Ruju Patel

Last updated on January 28, 2022

Translation and Localisation

Seasoned professionals from the translation or language services industry are often guilty of using complicated terms that can confuse the client.

A client’s relationship with a language service provider (LSP) usually starts with requesting for a price estimate or quote. The first aspect to consider is the factors that determine the cost of translation – these range from the language pair to the category of text, and more. The second aspect is the pricing model or how a client will be charged. The most common method involves defining a per word price for the source text. This is where it can get tricky if you’re a first time client or new to the industry.

To make it easier for you to understand how word counts are calculated, here’s a sneak peek into our transparent pricing model. Through this post, you’ll learn what constitutes repeated words and how a long term partnership with an LSP can help you save on translation costs.

A Brief Background

Anyone who’s used a document authoring software like Microsoft Word knows how an electronic word count works. Unfortunately, it’s not as simple when it comes to word counts in translation files.

LSPs generate word counts using CAT (computer-aided translation) tools. The history of CAT tools goes back to the Cold War when intelligence data needed to be translated quickly. There were even attempts to create a machine to replace human translators; however, when that failed, the ’60s and ’70s saw an era where perspectives changed. Instead of replacing human translators, researchers agreed that the new system needed to aid translators with the translation process giving rise to the modern CAT tool.

CAT tools are essentially language databases, although they can’t automatically translate any source text. Interestingly, they don’t even have an inbuilt dictionary or memory of terms. The memory found in the tool is created over time when a translator adds the source text and its corresponding human translations to the database. When there is no stored memory, like in the case of a first time client, the tool looks only for text repetitions or ‘repeated words’ within the source document that is being translated.

Units of Translation: The Difference Between Words, Sentences and Segments

The biggest misconception related to translation word counts is that any word found more than once in the text, is considered, a ‘repeated’ word. While that may be true by definition, repeated words are calculated slightly differently in this context. When we translate, we don’t proceed word by word. We usually translate one sentence before moving to the next. Logically, this helps us understand and retain the context of the content. Since what constitutes a sentence can differ from language to language, we refer to this breakdown of text as ‘segments’.

What are Segments?

A segment is the generally accepted unit used in translation. Once the source text is imported into a CAT tool using the file format filter, the tool breaks down the text into segments through a quiet, behind-the-scenes process known as segmentation. Since sentence boundary detection differs between languages, segments are created using segmentation rules that are based on regular expressions or regex. These are patterns that have commonly become a part of the written text. For example, the end of a segment is marked by a definitive punctuation mark. If that is followed by a space and a word starting with a capital letter, it is the start of a new segment. Although general segmentation rules apply, segments can be edited manually or be predefined, like in the examples below:

Line breaks in a document
Bullet points
Word limits
Content in text boxes for desktop publishing or multimedia files
Separate strings in technical files
Content separated by cell breaks in XLS files

Identifying Unique and Repeated Segments

Since a CAT tool processes the source text and breaks it down into segments, it is those segments that it compares against one another to find repetitions or matches in a particular text. Generally, when the words, punctuation and format within a segment are identical to another segment, it is considered to be a repeated segment. Such segments are identified by the CAT tool so that the translator’s work is reduced and the same translation can be used multiple times.

If there is even a slight difference between two segments, each of them would be considered unique. Here are examples of segments appearing chronologically in a source document:

My name is Bob, and I’m from the UK. (Unique)
My name is Bob. (Unique)

Even though the words ‘My name is Bob’ appear in both segments, we look at each segment as a whole. For that reason, both of the segments above are unique.

Let’s consider another example:

My name is Bob (Unique)
My name is Bob. (Unique)
My name is Bob. (Repeated)

Here, the words in all the segments are the same; however, the first segment differs in punctuation. There is no full stop after the word ‘Bob’, and therefore, it is considered unique. The second and third segments are identical, and so, we consider the first time the segment is used as unique and any subsequent use as repetition. You’d wonder how a small punctuation mark can make such a great difference. Since we always look at segments as a whole and not through the individual words that build them, even a minor change can alter the meaning and context.

Calculating Total Repeated Words:

Now that we know how unique and repeated segments are identified, it’s essential to understand how they impact the cost of translation. For that, we need to return to our original case at hand – repeated words.

Once you understand the concept of repeated segments, calculating repeated words is easy.

To calculate the repeated words for a segment, multiply the number of times the segment is repeated by the number of words within the segment. To get the total number of repeated words, you will need to follow the same calculation for each unique segment that is repeated and add the answers. Fortunately, CAT tools do these calculations for you.

Pricing Benefits

The per-word translation cost for repeated words is generally less than the price of unique words. This is because the translator only has to translate the associated segments once, thereby reducing the time and effort it would take to translate the same content multiple times.

Let’s consider a client needs document translation services for 500-word document, and 100 of those are repeated. The client, in that case, will be charged a discounted rate per word for the 100 repeated words and the standard rate per word for the 400 unique words. What’s more? The client can benefit from even more significant savings by partnering with the LSP long-term. This is because all the repeated segments that are identified are stored in what’s called a translation memory and used for all subsequent translation orders – but that’s a discussion for another day!

Let the Saving Begin!

Now that you know exactly how unique and repeated word counts are calculated, it’s time to take the plunge! Partner with a translation and localisation company that uses an accurate and reliable CAT tool can ensure long-term pricing benefits and offers discounted pricing on repeated words from the word go!

Ruju Patel

Ruju Patel is a Content Marketing Specialist and the former Content Manager at Translate By Humans. She has a background in Psychology and Business and previous experience as an HR professional. She is a people’s person and a pro at drafting contracts, policies, and articles. When she’s not creating digital marketing strategies, you’ll find her reading with her cats or combining her love for food and culture by traveling around the world!

Ruju Patel

Explore more topics:

Latest Blogs

How Empathy Wins In Crafting User-Centric Experiences ft. Michela From Wix? November 21, 2023
Video Remote Interpretation: The Unsung Hero Overcoming 95% Challenges in Charities & NGOs November 3, 2023
A 100% Localisation Win-Win With Cristina From Meta October 31, 2023

RIGHT BANNER – SAME WIDTH HEIGHT CAN VARY

Blog