
The importance of context and colloquial register in NLP translators based on Artificial Intelligence

The accuracy of Google Translate, DeepL and Bing in translating the colloquial register of language is about 29.52%, 36.96% and 17.39% respectively. The development of machine translation has fostered a lack of interest in knowing a language, its culture and its creativity.

Published on Jun 19, 2022


A traditional model based on Artificial Intelligence for natural language processing has been developed, relying on an RNN and a large dataset. The model makes its greatest contribution by improving accuracy in the translation of informal language. The methodology consists of studying the training model that each translator applies (Google Translate, DeepL and Bing). To do this, we designed a dataset that includes more than 200 records of poetic verses and 230 popular sentences, distinguishing between folk sayings and proverbs. Our aim is to train a model and validate it against the results obtained from machine translators.

Key words: #AI #ArtificialIntelligence #machinetranslation #neuralnetworks #activationfunctions #translationmodels

Note: for better readability, complete sentences taken from English (neologisms) are set in italics. Other terms are sometimes italicized to mark their English etymology, either because they are proper nouns or because they name specific elements or components of one or more of the systems described in this paper.

1.- Introduction

In 1954, machine translation started out with trial translations between English and Russian. Over the years, strategies such as interlingua and transfer were developed to obtain translations that preserved the features of the source language and could be implemented in practice. Machine translation had its age of splendour in 1997 with the emergence of Babel Fish, which gave free access to machine translation via the Internet [1].

In 2013, machine translation underwent successive transformations, for example the renewal of conditional language modelling, which allowed word predictions to be conditioned on the source text. It was only then that the main multilingual machine translation systems, such as Google Translate, started using models based on transformers [2]. Currently, machine learning and natural language processing are core elements of machine translation.

2.- Machine learning models

These types of models are trained to recognize patterns. To train them, we use a dataset and an algorithm that extracts information from the data. Once the model is trained, it can be applied to data it has not seen before and make predictions about it [3].

Deep learning is a relatively new field of machine learning research. Broadly speaking, it comprises a family of algorithms designed for machine learning [4]. Their main objective is to build neural networks that imitate the way the human brain analyses and learns, for instance when interpreting data such as images, sounds and texts. Deep learning algorithms contribute to the development of artificial neural networks [5].

“An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron receives signals, then processes them and can signal neurons connected to it” [6].

In the next section, we see how convolutional and recurrent neural networks work.

2.2 Convolutional neural networks (CNN)

A convolutional neural network is a type of ANN trained with supervised learning that processes its layers by imitating the human cerebral cortex. It is used to identify different characteristics in the inputs in order to pinpoint objects. To that end, CNNs have many hidden layers, which follow a hierarchy and scan the input from left to right [7]. The first layers detect lines and curves, and deeper layers become progressively more specialized until they recognize complex shapes such as a face or an animal.

Image 1

Image credit: structure of convolutional neural networks by Diana Díaz, 2022 (.) URL: (image published with the required authorization, with date: 17/05/2022)

A CNN receives an input and returns as many outputs as we have defined. In the previous case, in which a convolutional neural network is used to classify images, the input is the image that we want to classify. This input image passes through filters, which are detailed in the next subsection.

2.2.1 Filters and intermediate layers - convolutional neural networks (CNN)

Convolutional layer

Convolutional layers take groups of neighbouring pixels from the input image and operate on them mathematically (scalar product) with a small matrix called the kernel. The kernel, which has a fixed size, for instance 3x3 pixels, slides across all the input neurons (from left to right and from top to bottom), generating a new output matrix [8].

Subsampling

Subsampling is a process that reduces the number of neurons in the next layer without losing the most important features detected by each filter [9].

MaxPooling layer

MaxPooling layers are subsampling processes in which dimensionality is reduced by sliding a window across the image. For example, with a 10x10 matrix and a 2x2 MaxPooling layer, we traverse the matrix from left to right and from top to bottom. The output of this layer is a single value for every 4 pixels: the maximum of each window [10].
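The two operations described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the 10x10 "image", the 3x3 kernel values and the function names `convolve2d` and `max_pool` are assumptions made for the example, not taken from the systems discussed in this paper.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (left to right, top to bottom).

    No padding, stride 1. As in most deep learning libraries, the kernel
    is not flipped, so each output is a plain dot product with the patch.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch under it.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


def max_pool(matrix, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(matrix) - size + 1, size):
        row = []
        for j in range(0, len(matrix[0]) - size + 1, size):
            window = [matrix[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 10x10 "image" whose pixel values are simply 0..99.
image = [[10 * r + c for c in range(10)] for r in range(10)]
# Hypothetical 3x3 kernel that responds to horizontal changes in intensity.
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

feature_map = convolve2d(image, kernel)  # 8x8 output matrix
pooled = max_pool(image, size=2)         # 5x5 matrix: one value per 4 pixels
print(len(feature_map), len(feature_map[0]))
print(len(pooled), len(pooled[0]))
```

With a 10x10 input, the 2x2 pooling yields a 5x5 matrix, which matches the "one value for every 4 pixels" description.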

Image 2

Image credit: maxPooling layer by Adrián Hernández, 2022

Softmax layer

Softmax converts a vector of values into a probability distribution. The elements of the output vector lie in the range (0,1) and sum to 1. Each vector is handled independently. The axis argument determines along which axis of the input the function is applied. This function is often used as the activation of the last layer in a classification network because the result can be interpreted as a probability distribution [11].
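As a minimal sketch of this behaviour, the function below applies softmax to a single vector in plain Python. The input scores are hypothetical, and subtracting the maximum is a common numerical-stability trick rather than something described in the source.

```python
import math


def softmax(values):
    """Convert a list of real values into a probability distribution."""
    # Subtracting the maximum avoids overflow and does not change the result.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical outputs of the last layer
probs = softmax(scores)
print(probs, sum(probs))
```

Every element of `probs` lies strictly between 0 and 1, and the largest score receives the largest probability.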

2.3 Recurrent neural networks (RNNs)

A recurrent neural network does not have a fixed layer structure; the connections between neurons are arbitrary. It includes feedback connections, through which neurons within a layer feed their output back to themselves. At each timestep, the recurrent neuron receives the input from the previous layer as well as its own output from the previous timestep in order to generate an output.

The name of these networks comes from a set of algorithms that exchange information between neurons in a complex way. Because of their characteristics, these networks can propagate information forward in time, which is equivalent to predicting events [12]. Recurrent networks are very powerful for anything involving sequential analysis, such as the analysis of texts, sounds and videos.
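The recurrence can be illustrated with a single step function that combines the current input with the previous hidden state. This is a sketch under assumptions: the tanh activation, the layer sizes and the weight values are hypothetical choices for the example.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One timestep: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(w_x[i][j] * x[j] for j in range(len(x)))
        total += sum(w_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


# Hypothetical sizes and weights: 2 input features, 3 hidden units.
w_x = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
w_h = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0, 0.0, 0.0]  # initial hidden state
states = []
for x in sequence:
    # The same weights are reused at every timestep; only h changes.
    h = rnn_step(x, h, w_x, w_h, b)
    states.append(h)
print(states[-1])
```

The hidden state `h` is what carries information forward in time from one element of the sequence to the next.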

Image 3

Image credit: recurrent neural networks by Diana Díaz, 2022 (.) Based on:

Backpropagation is a gradient computation method used in supervised learning algorithms to train artificial neural networks. The method uses a two-phase cycle of propagation and adaptation. When a pattern is applied at the input as a stimulus, it propagates from the first layer through the following ones to generate an output. The output signal is compared with the desired output and an error signal is calculated for each of the outputs [13].
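The two phases can be seen in a deliberately tiny network with one hidden neuron and one output neuron. This is only a sketch: the squared-error loss, the tanh activation, the learning rate and the initial weights are assumptions chosen to keep the example short.

```python
import math


def train_step(x, target, w1, w2, lr=0.1):
    """One propagation-adaptation cycle for the network y = w2 * tanh(w1 * x)."""
    # Forward phase: the stimulus propagates through the layers.
    h = math.tanh(w1 * x)
    y = w2 * h
    error = y - target
    loss = 0.5 * error ** 2
    # Backward phase: the error signal is propagated back with the chain rule.
    grad_w2 = error * h
    grad_h = error * w2
    grad_w1 = grad_h * (1.0 - h ** 2) * x
    # Adaptation: each weight moves against its gradient.
    return w1 - lr * grad_w1, w2 - lr * grad_w2, loss


w1, w2 = 0.5, 0.3  # hypothetical initial weights
losses = []
for _ in range(200):
    w1, w2, loss = train_step(x=1.0, target=0.8, w1=w1, w2=w2)
    losses.append(loss)
print(losses[0], losses[-1])
```

After repeated cycles the error between the output and the desired output shrinks, which is the behaviour the paragraph describes.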

There are different kinds of neural networks depending on the number of hidden layers and the way feedback is provided. The most widely used are long short-term memory networks (LSTM).

2.3.1 Long short-term memory networks (LSTM)

Long short-term memory networks are recurrent neural networks that extend their memory to learn from important experiences that took place a long time ago. They follow a series of steps to decide which information is stored and which is discarded. This solves a problem of conventional recurrent networks, whose gradients tend to grow enormously or to vanish over time. Information is stored or not depending on its importance. Importance is assigned through weights, which are also learnt by the algorithm. An LSTM neuron has three “information gates”: input gate, forget gate and output gate. These gates determine whether a new input is allowed in, whether information is removed because it is not important, and whether it affects the output at the current timestep [14].
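The three gates can be made concrete with one step of an LSTM cell written in plain Python. This follows the standard LSTM formulation; the sizes (1 input feature, 2 hidden units) and the random weights standing in for learnt ones are assumptions for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, vector, bias):
    """Compute W·v + b for a list-of-lists matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b_i
            for row, b_i in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, p):
    z = x + h_prev  # concatenate the input with the previous hidden state
    f = [sigmoid(v) for v in affine(p["W_f"], z, p["b_f"])]    # forget gate
    i = [sigmoid(v) for v in affine(p["W_i"], z, p["b_i"])]    # input gate
    o = [sigmoid(v) for v in affine(p["W_o"], z, p["b_o"])]    # output gate
    g = [math.tanh(v) for v in affine(p["W_g"], z, p["b_g"])]  # candidate memory
    # Forget part of the old memory, then add the gated new information.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # The output gate decides how much of the memory reaches the output.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Hypothetical parameters: random values stand in for learnt weights.
rng = random.Random(0)
params = {}
for gate in "fiog":
    params["W_" + gate] = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
    params["b_" + gate] = [0.0, 0.0]

h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The cell state `c` is the long-term memory; the gates are themselves small learnt layers, which is why the weights decide what counts as important.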

Image 4

Image credit: LSTM by Diana Díaz, 2022 (.). Based on URL:

3.- Activation functions

Activation functions determine whether and how strongly a neuron of the network is activated according to its weighted input. They remove linearity from the neural network, improving accuracy and the final output. Input values correspond to the abscissa axis (X) and output values to the ordinate axis (Y) [15].


ELU

This activation function returns the input unchanged for positive values and exp(x) - 1 for all input values less than 0. It is defined as:

f(x) = exp(x) - 1 if x < 0
f(x) = x if x ≥ 0

For example, for an input value x = 0, the output of the function is 0. For an input value x = -1, the output is exp(-1) - 1, approximately -0.6321.

Image 5

Image credit: ELU activation function by Adrián Hernández, 2022


RELU

This activation function returns the input unchanged for positive values and 0 for values less than or equal to 0. It is defined as:

f(x) = max(0,x)

Image 6

Image credit: RELU activation function by Adrián Hernández, 2022

Leaky RELU

This activation function is a variant of RELU that admits negative values. It is defined as:

f(x) = x if x > 0
f(x) = x*α if x ≤ 0

α is a small constant chosen beforehand. With this function, positive input values are returned unchanged and negative values are scaled by α instead of being set to 0.

For example:

x=5 and α=0.01 -> f(5) = 5

x=-5 and α=0.01 -> f(-5) = -5*0.01 = -0.05

Image 7

Image credit: Leaky RELU activation function by Adrián Hernández, 2022


Linear

The linear activation function returns as output the same input it receives. It is defined as: f(x) = x

For example:

x=1 -> f(1) = 1

Image 8

Image credit: linear activation function by Adrián Hernández, 2022


Sigmoid

The sigmoid function returns a value close to 0 for small inputs (< -5) and a value close to 1 for large inputs (> 5). Sigmoid is equivalent to a two-element softmax in which the second element is assumed to be 0. The sigmoid function always returns a value between 0 and 1.

Its function is described:

f(x) = 1 / (1 + exp(-x))

For instance:

x=5 -> f(5) = 1 / (1+exp(-5)) = 0.9933

x=-10 -> f(-10) = 1 / (1+exp(10)) = 0.0000454
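The activation functions of this section can be checked numerically with a short sketch. Leaky ReLU is written here in its usual form, with negative inputs scaled by a small α; the default α = 0.01 is an assumption for the example. The last line checks the stated equivalence between sigmoid and a two-element softmax whose second element is 0.

```python
import math


def elu(x):
    return x if x >= 0 else math.exp(x) - 1.0


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # Positive inputs pass unchanged; negative inputs are scaled by alpha.
    return x if x > 0 else alpha * x


def linear(x):
    return x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(round(elu(-1), 4))      # -0.6321
print(round(sigmoid(5), 4))   # 0.9933
print(sigmoid(-10))           # close to 0
# Sigmoid as a two-element softmax whose second element is 0.
print(abs(softmax([5.0, 0.0])[0] - sigmoid(5.0)) < 1e-12)
```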

Image 9

Image credit: sigmoid activation function by Adrián Hernández, 2022

4.- Natural language models

Natural language processing is a field of Artificial Intelligence that studies communication between machines and humans through natural languages, such as English and Spanish [16]. By natural language we mean human language used for communicative purposes, governed by principles of economy and optimization [17].

Depending on the aim of the application, some of the linguistic components used in natural language processing are:

  • Morphology. It is a linguistic branch that studies the internal structure of words [18].

  • Syntax. It is the part of grammar that studies how words are combined and the syntagmatic and paradigmatic relations between them [19].

  • Semantics. It is focused on diverse aspects such as meaning, sense and the interpretation of linguistic signs [20].

  • Pragmatics. It is related to the influence of context in meaning interpretation [21].

4.1 Masked language modeling (MLM)

Masked language modeling is so named because a certain percentage of the words in a sentence are hidden and the model is expected to predict them from the other words in the sentence. The model is bidirectional because the representation of the masked word is learnt from the words that appear on both its left and its right. A clear example of this modeling is BERT [22]:

Image 10

Image credit: masked language modeling by Diana Díaz, 2022 (.). URL: (image published with the required authorization, with date: 17/05/2022)
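The masking step of this approach can be sketched without any model at all. In the example below, the `[MASK]` token, the 15% ratio and the function name `mask_tokens` are assumptions for illustration; the sketch only prepares the training example and does not predict anything.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Hide a percentage of the tokens; return the masked sentence and the targets."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the word the model must recover
        masked[pos] = MASK
    return masked, targets


tokens = "the smiles that win the tints that glow".split()
masked, targets = mask_tokens(tokens)
for pos, word in targets.items():
    # Bidirectional context: words on both sides of the hidden position.
    left, right = masked[:pos], masked[pos + 1:]
    print(word, left, right)
```

The model sees `left` and `right` together, which is what makes the approach bidirectional.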

4.2 Causal language model (CLM)

The causal language model follows the same pattern as the previous one, that is, it predicts hidden words, but in this case only the preceding words, read from left to right, are taken into account to repeat the process. For this reason, the modeling is unidirectional. An example is DeepL [23]:

Image 11

Image credit: causal language modeling by Diana Díaz, 2022 (.). URL: (image published with the required authorization, with date: 17/05/2022)
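The left-to-right restriction can be expressed either as a triangular mask or as a list of (context, next word) training pairs. The following sketch shows both; the function names and the sample sentence are assumptions chosen for illustration.

```python
def causal_mask(n):
    """mask[i][j] is 1 when position i may look at position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def next_word_examples(tokens):
    """Each word is predicted only from the words to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "like him with friends".split()
mask = causal_mask(len(tokens))
examples = next_word_examples(tokens)
for row in mask:
    print(row)
for context, target in examples:
    print(context, "->", target)
```

Each row of the mask has ones only up to its own position, so no word can use information from the words that follow it.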

5.- Machine translation model

5.1 Machine translation (MT)

Machine translation is the process by which computer software is used to translate a text from one natural language into another, such as English or Spanish [24].

Machine translation falls mainly into two types:

  • Rule-based machine translation. It uses bilingual dictionaries and works by creating linguistic rules. Its weakness is exceptions to grammatical rules, because it cannot translate a sentence that is not covered by a grammar book [25].

  • Corpus-based machine translation, since 1989. It is the translation method used today and has become one of the most widespread systems. It consists of using large bilingual parallel corpora to build the translation system. Statistical machine translation is a notable example [26].

5.2 Statistical machine translation (SMT)

Statistical machine translation is defined as the translation of a text by a computer that has learnt how to translate from a huge quantity of translated texts by using statistics [27]. It is a kind of corpus-based machine translation. The translation engine uses large volumes of corpora and parallel texts, both bilingual and monolingual. A clear example of statistical machine translation is the Bing Microsoft Translate system [28].

Image 12

Image credit: Statistical machine translation by Diana Díaz, 2022 (.) URL: (image published with the required authorization, with date: 17/05/2022)

5.3 Neural machine translation (NMT)

Neural machine translation is an approach to machine translation based on deep learning and artificial neural network algorithms. Big enterprises such as Google, Microsoft and DeepL have shown interest in this kind of translation and have created projects involving neural machine translation to evaluate its performance. An example is the transformer, a Google prototype discussed later on. It is worth highlighting its capacity for self-learning, particularly when the machine has to deal with the syntax, semantics, lexicon and cultural references that differ from one language to another. In contrast to statistical machine translation, “neural machine translation models require a fraction of the memory needed by statistical machine translation models” [29].

Image 13

Image credit: neural machine translation (NMT) by Diana Díaz, 2022 (.). Based on URL:

5.4 Comparison of MT, SMT and NMT

Image 14

Image credit: comparison table of MT, SMT and NMT by Diana Díaz, 2022 (.) URL: (image published with the required authorization, with date: 18/05/2022)

Eight parameters are used to compare the different types of machine translation: quality, coherence, consistency, fluency, space, cost, grammatical knowledge and grammatical exceptions. In quality, coherence and consistency, NMT and MT are better than SMT. With regard to capacity and fluency, MT is the oldest one. Although costs are relative, the cheapest one is SMT. Finally, with respect to grammar, MT is the only one that knows grammar, because SMT does not know it and NMT only replicates it. Conversely, MT is the only one that has problems with exceptions to grammatical rules.

6.- Models used by the main online translators

6.1 Google Translate

Short description

It was launched in April 2006 as a statistical machine translation service and became a reference for online translation. In 2016, Google changed its translation method to GNMT, Google Neural Machine Translation. This type of NMT was created to improve on the translation quality of other NMT systems, which tended to have difficulty translating some words as well as speed and accuracy problems (Quoc Le and Schuster, 2016). It uses techniques based on deep learning to guarantee greater contextual accuracy. Currently, this service offers translations in 103 languages and has more than 500 million users [30].

Model: BERT (transformer)

Google Translate uses a model based on the transformer architecture. This model has two parts: input and output. According to the RAE dictionary, input refers to the set of data that is introduced into a computer system [31], while output refers to the information that comes out processed by a computer system [32].

The encoder maps an input sequence to a sequence of continuous representations, which is fed to a decoder. The decoder receives the output of the encoder together with its own previous output and generates an output sequence.

In an encoder-decoder model, the output is compared with the input on the basis of the training that the architecture has previously received. That is, when an image enters an encoder-decoder block, it is encoded according to the training; at the output of the block it is decoded and compared with the original image by measuring the margin of error between them. If the output image corresponds to the training, the margin of error will be minimal. In this model, we can distinguish two blocks: embedding (input and output) and Nx. Nx integrates add & norm, multi-head attention and feed forward, in the input part as well as in the output part. The output also has a linear function, and softmax is used as the last layer, as in classifiers based on neural networks.

Image 15

Image credit: transformer by Diana Díaz, 2022 (.) URL: (image published with the required authorization, with date: 17/05/2022)

The embedding block provides several layers for keeping track of the words. In the input part, the second block, multi-head attention, implements a self-attention mechanism with several heads, each of which receives a linearly projected version of the queries, keys and values and produces an output in parallel; these outputs are then used to generate a final result. The encoder is designed to attend to all the words in the input sequence, regardless of their position in the sequence. The decoder, by contrast, attends only to the preceding words, so the prediction of a word at a given position can only depend on the known outputs for the words that precede it in the sequence.

In the decoder part, there is a further multi-head attention layer, which receives the queries from the previous decoder sublayer and the keys and values from the output of the encoder. In this way, the decoder can attend to all the words in the input sequence.
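The attention mechanism described above can be sketched in plain Python as scaled dot-product attention applied by several heads. This is an illustrative sketch under assumptions: the sizes (3 tokens, 4 features, 2 heads), the random projection matrices and the helper names are hypothetical, and the final output projection of the transformer is omitted for brevity.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention(q, k, v, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T
    weights = []
    for i, row in enumerate(scores):
        scaled = [s / math.sqrt(d) for s in row]
        if causal:
            # Decoder-style masking: position i ignores the positions after it.
            scaled = [s if j <= i else -1e9 for j, s in enumerate(scaled)]
        weights.append(softmax(scaled))
    return matmul(weights, v), weights


def multi_head_self_attention(x, heads, causal=False):
    """Each head projects x to queries, keys and values; outputs are concatenated."""
    outputs = []
    for w_q, w_k, w_v in heads:
        out, _ = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v), causal)
        outputs.append(out)
    return [sum((head[i] for head in outputs), []) for i in range(len(x))]


rng = random.Random(0)


def rand_matrix(rows, cols):
    # Random values stand in for learnt embeddings and projections.
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


x = rand_matrix(3, 4)
heads = [(rand_matrix(4, 2), rand_matrix(4, 2), rand_matrix(4, 2)) for _ in range(2)]
encoder_out = multi_head_self_attention(x, heads)               # attends to all words
decoder_out = multi_head_self_attention(x, heads, causal=True)  # only preceding words
_, causal_weights = attention(x, x, x, causal=True)
print(causal_weights[0])
```

With `causal=True`, the first position can only attend to itself, so its attention weights are 1 for itself and 0 for the later words.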

The third block, feed forward, which is similar on both sides, is a fully connected feed-forward network that consists of two linear transformations with a rectified linear activation function (ReLU) between them.

Finally, the output of the decoder passes through a fully connected layer, followed by a softmax activation function, to generate a prediction for the next word in the output sequence.
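As a rough sketch of these last two steps, the code below applies a feed-forward block (linear, ReLU, linear) to a decoder vector and then a linear layer plus softmax over a toy vocabulary. All sizes, weights and vocabulary entries are hypothetical values chosen for the example.

```python
import math


def linear_layer(vector, weights, bias):
    """Fully connected layer; weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def feed_forward(vector, w1, b1, w2, b2):
    """Two linear transformations with a ReLU between them."""
    hidden = [max(0.0, h) for h in linear_layer(vector, w1, b1)]
    return linear_layer(hidden, w2, b2)


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


vocabulary = ["las", "sonrisas", "que", "ganan"]  # hypothetical target vocabulary
decoder_state = [0.2, -0.4, 0.7]                   # hypothetical decoder vector

# Hypothetical weights: 3 -> 4 -> 3 for the feed-forward block.
w1 = [[0.5, -0.1, 0.3], [-0.2, 0.4, 0.1], [0.3, 0.3, -0.5], [0.1, -0.6, 0.2]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.2, -0.3, 0.1, 0.4], [-0.1, 0.2, 0.3, -0.2], [0.4, 0.1, -0.2, 0.3]]
b2 = [0.0, 0.0, 0.0]
# Hypothetical output projection: one row per vocabulary entry.
w_out = [[0.3, -0.2, 0.5], [-0.4, 0.1, 0.2], [0.2, 0.6, -0.1], [0.1, -0.3, 0.4]]
b_out = [0.0, 0.0, 0.0, 0.0]

ffn_out = feed_forward(decoder_state, w1, b1, w2, b2)
probs = softmax(linear_layer(ffn_out, w_out, b_out))
next_word = vocabulary[probs.index(max(probs))]
print(probs, next_word)
```

The predicted next word is simply the vocabulary entry with the highest probability.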

6.2 DeepL

Short description

DeepL is supported by the online dictionary Linguee. It is based on multilingual parallel corpora and on existing AI and machine learning algorithms. Online translation in DeepL is the result of combining machine learning and convolutional neural networks with a large online database of searches and translations, used as references to train the translation engine and improve it [30].

Most publicly available translation systems are direct modifications of the transformer architecture. DeepL's neural networks also contain parts of this architecture, such as attention mechanisms, which are used to model long-range interactions, for example within a text in NLP [33].

To explain the model that DeepL uses, it is important to bear in mind its similarities with the transformer model [34]. In principle, it has the same parts, that is, embedding (input and output), add & norm, multi-head attention and feed forward, plus the linear and softmax functions at the output.

DeepL is composed of three blocks:

  • Token and Position Embedding

  • MultiHeadSelfAttention

  • MultiHeadSelfAttention.casual_attention_mask

Image 16

Image credit: model that DeepL uses by Diana Díaz, 2022 (.) URL: (image published with the required authorization, with date: 17/05/2022)

The embedding block has several layers for keeping track of the words, as in the transformer. However, the position embedding is trained, unlike in the transformer, where the position embedding is computed with sine and cosine functions.

Image 17

Image credit: DeepL-Transformer_4 by Diana Díaz, 2022 (.) URL: (image published with the required authorization, with date: 17/05/2022)

The sine function represents the variation of the ordinate of a point as a function of its angle x. The sine function has the equation f(x) = A sin(x). The cosine function represents the variation of the abscissa of a point as a function of its angle x. The cosine function has the equation f(x) = A cos(x).
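For reference, the fixed sine and cosine position embedding of the original transformer can be sketched as follows. The formula (sine on even dimensions, cosine on odd ones, base 10000) comes from the original transformer description rather than from this article, and the sizes used are arbitrary.

```python
import math


def positional_encoding(num_positions, d_model):
    """Fixed sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


encoding = positional_encoding(num_positions=4, d_model=6)
for row in encoding:
    print([round(v, 3) for v in row])
```

Nothing in this table is learnt, which is the contrast with a trained position embedding: the values depend only on the position and the dimension.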

Image 18

Image credit: cosine function by Diana Díaz, 2022 (.) (Created in Geogebra)

Another important difference is that the encoder-decoder method is not applied. MultiHeadSelfAttention is the block that implements the transformer decoder through numerous layers.

Image 19

Image credit: DeepL-Transformer_1 by Diana Díaz, 2022 (.) URL: (image published with the required authorization, with date: 17/05/2022)

It combines an embedding with an arbitrary number of multi-head attention and feed forward layers. The model “pushes” the contextual vector from this part through a linear layer and then transforms it into a probability distribution over words with a softmax function.

Image 20

Image credit: DeepL-Transformer_2 by Diana Díaz, 2022 (.) URL: (image published with the required authorization, with date: 17/05/2022)

6.3 Bing Microsoft Translator

Short description

Bing Microsoft Translate is a service created by Microsoft that allows users to translate texts or complete web pages into different languages [35]. It is a translator that uses two technologies: statistical machine translation and neural machine translation based on Artificial Intelligence [36].

The Zcode model architecture uses transfer learning in two ways. First, the model is trained on many languages so that knowledge is transferred across languages. Second, multitask training is used to transfer knowledge between tasks. For instance, the machine translation task can help the natural language understanding task. We can see in this link the image of Zcode model:

DeltaLM: encoder-decoder pre-training for translating from one language to another by augmenting pre-trained multilingual encoders. The experiments carried out show that DeltaLM outperforms several strong baselines in natural language generation and translation tasks, including machine translation, text summarization, data-to-text conversion and question generation [37]. Code and pre-trained models are available in this link:

DeltaLM graphical representation:

Image 21

Image credit: DeltaLM model by Diana Díaz, 2022 (.) URL:  (image published with the required authorization, with date: 23/05/2022)

It is possible to see the graphical representation of ZCode-DeltaLM in this link: 

7. Output dataset to train, validate and test

To translate linguistic elements such as idioms and sayings, it is necessary to know both the source language and the target language; otherwise, the translation is just a string of words with no meaning [38]. Multilingual translation systems such as Google Translate work on the basis of the information available in their databases. Unless an idiom is popular, the translator gives a literal result, which changes the meaning of the sentence [39]. Some examples:

I) In English language, the idiom hit a club is translated into Spanish as darse un garbeo. Google Translate: golpear un club, DeepL: golpear un club and Bing Microsoft Translate: golpear un club.

II) In English, the modern saying dealt it is translated into Spanish as el que lo huele, debajo lo tiene. The machine translators render it as follows. Google Translate: tratado, DeepL: lo repartió and Bing Microsoft Translate: lo repartí.

The same occurs with literary genres; poetry, for example, has also posed a difficulty for machine translation. This is because literal rendering accounts for more than 80% of the results, setting aside the essence of poetry: escape for the reader's mind, beauty, sense of aesthetics, rhyme, use of rhetorical figures and communication of a message [40]. Some examples:

I) In the poem She Walks in Beauty by Lord Byron, line 15, The smiles that win, the tints that glow, is translated into Spanish as las sonrisas que triunfan, los matices que refulgen. Google Translate gives: las sonrisas que ganan, los tintes que brillan, DeepL: las sonrisas que ganan, los tintes que brillan and Microsoft Translate: las sonrisas que ganan y los tintes que brillan.

II) In the Sonnet XXIX by William Shakespeare, the sixth verse Featur’d like him, like him with friends possess’d is translated into Spanish as deseando ser mejor compañía para disfrutar de la amistad. Google Translate: Tiene como él, como lo posee él con amigas, DeepL: Como un aspecto él de él, como el de los amigos que posee and Bing Microsoft Translate: Como él, como él con amigos poseídos.

To carry out the study, we created a dataset that combines more than 200 records of poetic verses and 230 popular expressions. Our aim is to train a model, validate the results and compare them with the results from machine translators. Our datasets can be viewed here:

8. Accuracy comparison between Google (BERT), DeepL and Bing using informal language and emotions

In translating our dataset, we compare the accuracy of the results from the three most widely used models. With poetic verses, it is important to attend to the context, because it tells us more than the lines alone. The methodology consists of interpreting the compositions from English into Spanish once the data has been split: 70% (training), 20% (validation) and 10% (test). The results obtained from the three machine translators reflect literal rendering and the loss of context, so that in the end we have a series of words without any sense, much less a poetic composition. The table shows the accuracy of each translator, with Google Translate standing out:

Table 1

Google Translate


Bing Microsoft Translate




In relation to idioms, popular expressions and sayings, the accuracy obtained is low for the three models, except for DeepL:

Table 2

Google Translate


Bing Microsoft Translate




In this case, the three translators improve their accuracy compared with the poetic compositions. Bing Microsoft Translate is again the worst one.
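The 70% / 20% / 10% split mentioned in this section can be sketched as follows. The placeholder records, the random seed and the function name `split_dataset` are assumptions; the actual dataset and splitting procedure used in the study are not shown in the text.

```python
import random


def split_dataset(records, train_pct=70, validation_pct=20, seed=42):
    """Shuffle the records and split them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, about 10%
    return train, validation, test


# Placeholder records standing in for the 230 popular sentences.
records = [f"sentence {i}" for i in range(230)]
train_set, val_set, test_set = split_dataset(records)
print(len(train_set), len(val_set), len(test_set))
```

For 230 records this gives 161 training, 46 validation and 23 test examples, with no record appearing in more than one subset.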

9. Development of a model based on artificial intelligence and personalized for NLP

The RNN-simple model has this structure:

Image 22

Image credit: simple RNN model by Adrián Hernández, 2022

Using our dataset and training the model for 30 epochs, we get an accuracy of 53.65%, which rises to 58.41% with 60 epochs, in the translation of popular expressions, and an accuracy of 39.83%, which increases by 2.94 points to 42.77% with 60 epochs, in the translation of poetry.

The model has around 800,000 trainable parameters in three different layers, plus a dropout layer to reduce or eliminate the possibility of dying neurons during training.

9.1 Development of a traditional model based on artificial intelligence and personalized for NLP relying on RNN with an embedding block

The RNN-embedding model has this structure:

Image 23

Image credit: Embedding RNN model by Adrián Hernández, 2022

Using our dataset and training the model for 30 epochs, we get an accuracy of 71.52%, which increases to 73.27% with 60 epochs, in the translation of popular expressions, and an accuracy of 62.46%, which increases to 68.26% with 60 epochs, in the translation of poems.

The model has around one million trainable parameters in four layers and uses an embedding block.

9.2 Development of a transformer model based on artificial intelligence and personalized for machine translation (informal language and emotions)

The transformer model presents this structure:

Image 24

Image credit: transformer model by Adrián Hernández, 2022

Using our dataset and training the model for 30 epochs, we get an accuracy of 81.02%, which increases to 87.56% with 60 epochs, in the translation of popular sentences, and an accuracy of 71.51%, which increases to 76.33% with 60 epochs, in the translation of poetry.

The model has around 20 million trainable parameters in three layers, as well as an encoder-decoder structure.

Accuracy can improve if we fine-tune each of the layers, increase the number of layers or pre-process the input dataset of our model.

10. Conclusions

Currently, Google Translate, DeepL and Bing Translate are not known for excelling at poetic and informal language.

Image 25

Image credit: accuracy in the translation of popular sentences (in terms of percentage) by Adrián Hernández, 2022

Their accuracy on the informal language and poetry of our dataset is below 30%. They are outperformed by less complex architectures such as the simple RNN:

Image 26

Image credit: accuracy of translation in poetry (in terms of percentage) by Adrián Hernández, 2022

With these data, and given that Google Translate, DeepL and Bing Translate use the transformer architecture, we can deduce that the problem of machine translation lies not only in the architecture of the model but also in the lack of attention that translation companies pay to oral language, the wisest route to learning a new language. We also conclude that Artificial Intelligence remains very dependent on the training dataset and on how information is processed.

11. Acknowledgments

We thank the organizations and individuals who made this project possible by contributing up-to-date information and testimonies that enhance the value of this research.

D. Juan Ignacio Bagnato -

D. Prakhar Mishra -

D. Kevin Knight -

D. ª Carla Parra Escartín -

D. Víctor Busqué -

D. Dongdong Zhang -

D. Roberto Rivas Couce -
