The importance of context and colloquial register in NLP translators based on Artificial Intelligence
The accuracy of Google Translate, DeepL and Bing for the translation of the colloquial register of language is about 29.52%, 36.96% and 17.39% respectively. The development of machine translation has fostered a growing disinterest in getting to know the language, its culture and its creativity.
Abstract
A traditional Artificial Intelligence model for natural language processing has been developed, relying on an RNN and a large dataset. The model makes its greatest contribution by improving the accuracy of the translation of informal language. The methodology consists of studying the training model applied by each of the translators studied (Google Translate, DeepL and Bing). To do this, we have designed a dataset that includes more than 200 records of poetic verses and 230 popular expressions, distinguishing between folk sayings and proverbs. Our aim is to train a model and validate the results obtained from the machine translators.
Note: for better readability, italics have been used, in an orthodox way, for complete sentences taken from English (neologisms). Other terms have sometimes been written in italics to note their English etymology, either because they are proper nouns or because they name specific elements or components of one or more systems described in this paper.
1.- Introduction
In 1954, the start of machine translation involved test translation activities between English and Russian. Over the years, strategies such as interlingua and transfer were developed in order to obtain translations that preserved the features of the original language and had a practical implementation. The history of machine translation had its age of splendour in 1997 with the emergence of Babel Fish, which allowed free access to machine translation via the Internet [1].
In 2013, machine translation underwent successive transformations that boosted the task, for example the renewal of conditional language modelling, allowing word predictions to be conditioned on the original text. It was only then that the main multilingual machine translation systems, such as Google Translate, started using transformer-based models [2]. Currently, machine learning and natural language processing are core elements of machine translation.
2.- Machine learning models
These types of models are trained to recognise patterns. To carry out the training, we use a dataset and an algorithm that extracts information from those data. Once the model has been trained, it is possible to process previously unseen data and make predictions about it [3].
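As a minimal illustration of this train-then-predict workflow (a hypothetical example using scikit-learn and the Iris dataset, unrelated to the translation study described later), a model can be fitted on known data and then queried on previously unseen data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Example dataset: flower measurements (features) and species labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The algorithm extracts the patterns present in the training data.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Once trained, the model makes predictions about previously unseen data.
predictions = model.predict(X_test)
print("Accuracy on unseen data:", model.score(X_test, y_test))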
Deep learning is a recent field within machine learning research. By and large, it comprises types of algorithms designed for machine learning [4]. Their main objective is to simulate a neural network for analysis and learning in the manner of the human brain, for instance in the interpretation of data: images, sounds and texts. Deep learning algorithms contribute to the development of artificial neural networks [5].
“An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron receives signals, then processes them and can signal neurons connected to it” [6].
In the following subsections, we see how convolutional and recurrent neural networks work.
2.2 Convolutional neural networks (CNN)
A convolutional neural network is a type of ANN with supervised learning that processes its layers by imitating the human cerebral cortex. It is used to identify different characteristics in the inputs in order to recognise objects. To that end, CNNs have many hidden layers that follow a hierarchy, processing the input from left to right [7]. The first layers can detect lines and curves, and the layers specialise as they go deeper, until they recognise complex shapes such as a face or an animal.
A CNN receives an input and returns as many outputs as we have defined. In the case mentioned above, in which a convolutional neural network is used to classify images, the input is the image we are trying to classify. This input image passes through filters, which are detailed in the next subsection.
2.2.1 Filters and intermediate layers - convolutional neural networks (CNN)
2.2.1.1 Convolutional layer
Convolutional layers take groups of pixels close to each other in the input image and operate on them mathematically (scalar product) against a small matrix called the kernel. The kernel, which has a defined size, for instance 3x3 pixels, runs over all the input neurons (from left to right and from top to bottom), generating a new output matrix [8].
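A minimal sketch of this sliding-kernel operation, assuming a grayscale image stored as a NumPy matrix (illustrative code, not a production implementation):

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel from left to right and from top to bottom,
    # computing a scalar product at each position.
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)
    return output

image = np.random.rand(10, 10)          # example 10x10 input
kernel = np.ones((3, 3)) / 9.0          # example 3x3 kernel
print(convolve2d(image, kernel).shape)  # (8, 8) output matrix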
2.2.1.2 Subsampling
Subsampling is a process through which we intend to reduce the number of neurons in the next layer without losing the most important features detected by each filter [9].
2.2.1.3 MaxPooling layer
MaxPooling layers are subsampling processes in which dimensionality is reduced by running over the image. For example, with a 10x10 matrix and a 2x2 MaxPooling layer, we run over the matrix from left to right and from top to bottom. The output of this layer will be a single value for each group of 4 pixels: the maximum value found in each iteration [10].
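A minimal sketch of this 2x2 max pooling over a 10x10 matrix (illustrative NumPy code):

import numpy as np

def max_pool_2x2(matrix):
    # Keep only the maximum of each non-overlapping 2x2 block (4 pixels).
    h, w = matrix.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            pooled[i // 2, j // 2] = matrix[i:i + 2, j:j + 2].max()
    return pooled

m = np.arange(100).reshape(10, 10)   # example 10x10 matrix
print(max_pool_2x2(m).shape)         # (5, 5): one value per group of 4 pixels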
2.2.1.4 Softmax layer
Softmax converts a vector of values into a probability distribution. The elements of the output vector lie in the range (0, 1) and sum to 1. Each vector is handled independently. The axis argument determines along which axis of the input the function is applied. This function is often used as the activation of the last layer of a classification network because the result can be interpreted as a probability distribution [11].
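A minimal NumPy sketch of this behaviour (illustrative only):

import numpy as np

def softmax(x, axis=-1):
    # Subtracting the maximum keeps the exponentials numerically stable.
    shifted = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return shifted / np.sum(shifted, axis=axis, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])
probabilities = softmax(scores)
print(probabilities, probabilities.sum())   # values in (0, 1) that sum to 1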
2.3 Recurrent neural networks (RNNs)
A recurrent neural network does not have a fixed layer structure; instead, the connections between neurons can be arbitrary. It includes feedback connections between neurons within the layers. At each time step, the recurrent neuron receives the input from the previous layer, as well as its own previous output, in order to generate a new output.
The name of these networks comes from the way information is exchanged, in a complex fashion, between neurons. Due to these characteristics, such networks can propagate information forward in time, which is equivalent to predicting events [12]. Recurrent networks are powerful for everything related to sequential analysis, such as the analysis of texts, sounds and videos.
Backpropagation is a gradient calculation method used in supervised learning algorithms to train artificial neural networks. The method uses a propagation-adaptation cycle with two phases. Once a pattern has been applied to the input as a stimulus, it is propagated from the first layer through the following ones to generate an output. The output signal is then compared with the desired output and the error signal is calculated for each of the outputs [13].
There are different kinds of recurrent neural networks depending on the number of hidden layers and the way feedback is provided. The most widely used are long short-term memory networks (LSTM).
2.3.1 Long-short term memory networks (LSTM)
Long short-term memory networks are recurrent neural networks that expand their memory in order to learn from important experiences that took place a long time ago. They follow a series of steps to decide which information is stored and which is discarded. This provides a solution to a problem of conventional recurrent networks, whose signals tend to grow enormously or to dissipate over time. Whether information is stored or not depends on its importance, and the assignment of importance is decided by weights, which are also learnt by the algorithm. An LSTM neuron has three “information gates”: the input gate, the forget gate and the output gate. These gates determine whether a new input is allowed in, whether the information is removed because it is not important, or whether it affects the output at the current time step [14].
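A minimal Keras sketch of a sequence model built around an LSTM layer, whose cells apply the three gates described above (layer sizes and vocabulary size are assumed, illustrative values):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),                # variable-length token sequence
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),   # token vectors
    tf.keras.layers.LSTM(128),                                   # gated memory over the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),              # example binary decision
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()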
3.- Activation functions
Activation functions are responsible for activating the output of a neuron in the neural network according to its result. This procedure removes linearity from the neural network, that is, it introduces non-linearity, improving the accuracy and the final output. The input values correspond to the abscissa axis (X) and the output values to the ordinate axis (Y) [15].
ELU
This activation function returns the same value for positive input values and a value of e^x − 1 for all input values less than 0. Its mathematical function is:
f(x) = e^x − 1 if x < 0; f(x) = x if x ≥ 0
For example, for an input value x = 0, the output of the function is 0. For an input value x = −1, the output is e^(−1) − 1 ≈ −0.6321.
RELU
This activation function returns the same value for positive values and returns 0 for values less than or equal to 0. Its mathematical function is:
f(x) = max(0,x)
Leaky RELU
This activation function is a variant of the RELU activation function, but it allows negative values to pass through attenuated. Its function is:
f(x) = x if x > 0; f(x) = α·x if x ≤ 0
α is a small number chosen in advance (for example, 0.01). With this function, positive values keep the same value in the output, while negative values are scaled down by α instead of being set to 0.
For example:
x = 5 and α = 0.01 -> f(5) = 5
x = −5 and α = 0.01 -> f(−5) = −5 · 0.01 = −0.05
Linear
The linear activation function returns as output the same input it receives. Its function is: f(x) = x
For example:
x=1 -> f(1) = 1
Sigmoid
For small input values (less than −5), the sigmoid function returns a value close to 0, and for large values (greater than 5) the result of the function is close to 1. Sigmoid is equivalent to a two-element softmax, where the second element is assumed to be 0. The sigmoid function always returns a value between 0 and 1.
Its function is:
f(x) = 1 / (1 + exp(-x))
For instance:
x=5 -> f(5) = 1 / (1+exp(-5))= 0.9933
x=-10 -> f(-10) = 1 / (1+exp(10)) ≈ 0.0000454
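The activation functions described in this section can be sketched in NumPy as follows (illustrative implementations; the small α = 0.01 for Leaky RELU is an assumed default):

import numpy as np

def elu(x):
    return np.where(x < 0, np.exp(x) - 1, x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(elu(-1.0))          # about -0.6321
print(relu(-3.0))         # 0.0
print(leaky_relu(-5.0))   # -0.05
print(sigmoid(5.0))       # about 0.9933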
4.- Natural language models
Natural language processing is a field of Artificial Intelligence that researches communication between machines and humans through natural languages, such as English and Spanish [16]. When we talk about natural language, we refer to human language used with communicative aims, subject to principles of economy and optimisation [17].
Depending on the aim of the application, some of the linguistic components used in natural language processing are:
Morphology. It is a linguistic branch that studies the internal structure of words [18].
Syntax. It is the part of grammar that studies how words combine and the syntagmatic and paradigmatic relations between them [19].
Semantics. It is focused on diverse aspects such as meaning, sense and the interpretation of linguistic signs [20].
Pragmatics. It is related to the influence of context in meaning interpretation [21].
4.1 Masked language modeling (MLM)
The name masked language modeling is due to the fact that a certain percentage of the words in a sentence is masked and the model is expected to predict those words from the other words in the sentence. This modeling is bidirectional because the representation of the masked word is learnt from the words that appear both to its left and to its right. A clear example of this modeling is BERT [22].
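A hedged illustration of this idea, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint (assumptions not taken from this paper): the model predicts the masked word using context from both sides.

from transformers import pipeline

# Fill-mask pipeline: BERT predicts the hidden token from its bidirectional context.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("Paris is the [MASK] of France."):
    print(candidate["token_str"], round(candidate["score"], 3))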
4.2 Causal language model (CLM)
The causal language model follows the same pattern as the previous one, that is, it predicts masked words, but in this case only the words that occur before them, in left-to-right order, are taken into account. For this reason, the modeling is unidirectional. An example is DeepL [23].
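As a hedged, independent illustration of causal (left-to-right) prediction, assuming the transformers library and the public gpt2 checkpoint (neither is mentioned in this paper):

from transformers import pipeline

# Text-generation pipeline: each new word is predicted only from the words to its left.
generator = pipeline("text-generation", model="gpt2")
print(generator("Machine translation is", max_new_tokens=15)[0]["generated_text"])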
5.- Machine translation model
5.1 Machine translation (MT)
It is the process by which computer software is used to translate a text from one natural language, such as English or Spanish, into another [24].
Mainly, in machine translation, we can distinguish two types:
Rule-based machine translation. Bilingual dictionaries are used and linguistic rules are created by hand. Its handicap is the exceptions to grammatical rules, because such a system cannot translate a sentence correctly if its structure does not appear in a grammar book [25].
Corpus-based machine translation, in use since 1989. It is a translation method still used today, having become one of the most widespread systems. It consists of using large bilingual and parallel corpora to create the translation system. Within this approach, statistical machine translation is notable [26].
5.2 Statistical machine translation (SMT)
Statistical machine translation is defined as the translation of a text by a computer that has learnt how to translate, using statistics, on the basis of a huge quantity of translated texts [27]. It is a kind of corpus-based machine translation. The translation engine uses large volumes of corpora and parallel texts, both bilingual and monolingual. A clear example of statistical machine translation is the Bing Microsoft Translator system [28].
5.3 Neural machine translation (NMT)
It is a translation system based on artificial neural network algorithms. Neural machine translation is an approach to machine translation based on deep learning and neural networks. Large companies such as Google, Microsoft and DeepL have shown interest in this kind of translation and have created projects involving neural machine translation to evaluate its performance. An example is the transformer, a Google prototype we will talk about later on. It is interesting to highlight its capacity for self-learning, particularly when the machine has to deal with the syntax, semantics, lexicon and cultural references that differ from one language to another. In contrast to statistical machine translation, in neural machine translation “the neural machine translation models require a fraction of the memory needed by statistical machine translation models” [29].
5.4 Comparison of MT, SMT and NMT
There are several parameters with which to compare the different types of machine translation: quality, coherence, consistency, fluency, space, cost, grammatical knowledge and grammatical exceptions. In quality, coherence and consistency, NMT and MT are better than SMT. With regard to space and fluency, MT is the oldest approach. Although costs are relative, the cheapest one is SMT. Finally, with respect to grammar, MT is the only one that knows grammar, since SMT does not know it and NMT only replicates it. Conversely, MT is the only one that has problems with grammatical exceptions.
6.- Model used by the main translators on the Internet
6.1 Google Translate
Short description
It was launched in April 2006, becoming a reference for online translation; it initially emerged as a statistical machine translation system. In 2016, Google changed its translation method to GNMT, Google Neural Machine Translation. This type of NMT was created to improve translation quality over other NMT systems, which tended to have difficulties translating some words, as well as speed and accuracy problems (Quoc Le and Schuster, 2016). It uses techniques based on deep learning to guarantee greater contextual accuracy. Currently, this service offers translations in 103 languages to more than 500 million users [30].
Model: BERT (transformer)
Google Translate uses a model based on the transformer architecture. This model has two parts: input and output. According to the RAE dictionary, input refers to the set of data introduced into a computer system [31], while output refers to the information that leaves a computer system once processed [32].
The encoder maps an input sequence to a sequence of continuous representations, which feeds the decoder. The decoder receives the output of the encoder together with its own previous output and generates an output sequence.
In an encoder-decoder model, the output is compared against the input on the basis of the training that the architecture has previously received. That is, when an image enters an encoder-decoder block, it is encoded according to the training; at the output of the block it is decoded and compared with the original image, measuring the margin of error between them. If the output image corresponds to the training, the margin of error will be minimal. In this model, we can distinguish the following blocks: embedding (input and output), add & norm, multi-head attention and feed forward. The Nx factor indicates that the stack of add & norm, multi-head attention and feed forward blocks is repeated, both on the input (encoder) side and on the output (decoder) side. The output side also has a linear function, and the softmax function is used as the last layer in classifiers based on neural networks.
The embedding block provides scaled embedding layers that represent the words. In the input part, the second block, multi-head attention, implements a self-attention mechanism with several heads that receive linearly projected versions of the queries, keys and values to produce parallel outputs, which are then combined to generate a final result. The encoder is designed to attend to all the words in the input sequence, regardless of their position in the sequence. In the decoder, by contrast, the prediction of a word at a given position can only depend on the known outputs for the words that precede it in the sequence.
In the decoder part, we can observe a layer that constitutes the multi-head attention mechanism; it receives the queries from the previous decoder sublayer and the keys and values from the output of the encoder. In this way, the decoder can pay attention to all the words in the input sequence.
The third block, feed forward, similar on both sides, is a fully connected feed-forward network consisting of two linear transformations with the rectified linear activation function (ReLU) between them.
Finally, the output of the decoder passes through a fully connected layer, followed by a softmax activation function, to generate a prediction for the next word in the output sequence.
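The following is a minimal NumPy sketch of the scaled dot-product attention that each multi-head block applies to its linearly projected queries, keys and values (illustrative only, not Google Translate's production code):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity between queries and keys
    weights = softmax(scores, axis=-1)  # attention distribution over positions
    return weights @ V                  # weighted sum of the values

seq_len, d_model = 4, 8                 # assumed toy dimensions
Q = np.random.rand(seq_len, d_model)
K = np.random.rand(seq_len, d_model)
V = np.random.rand(seq_len, d_model)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)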
6.2 DeepL
Short description
DeepL is supported by the online dictionary Linguee. It is based on multilingual parallel corpora and on existing AI and machine learning algorithms. Online translation in DeepL is the result of a combination of machine learning and convolutional neural networks, as well as a large online database of searches and translations used as references to train and improve the translation engine [30].
Model
Most publicly available translation systems are direct modifications of the transformer architecture. DeepL's neural networks also include parts of this architecture, such as attention mechanisms, which are used to model long-range interactions, for example within a text in NLP [33].
To explain the model that DeepL uses, it is important to bear in mind its similarities with the transformer model [34]. In principle, it has the same parts, that is, embedding (input and output), add & norm, multi-head attention and feed forward, with a linear and a softmax function at the output.
DeepL is composed of three blocks:
Token and Position Embedding
MultiHeadSelfAttention
MultiHeadSelfAttention.causal_attention_mask
The embedding block has layers to represent the words, as in the transformer. However, the position embedding is learnt during training, unlike in the transformer, where the positional embedding is computed with sine and cosine functions.
The sine function represents the variation of the ordinate of a point as a function of its angle x; its equation is f(x) = A sin(x). The cosine function represents the variation of the abscissa of a point as a function of its angle x; its equation is f(x) = A cos(x).
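A minimal sketch of that sine/cosine positional encoding (illustrative NumPy code with assumed dimensions): even dimensions use the sine and odd dimensions use the cosine.

import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]          # positions in the sequence
    i = np.arange(d_model)[None, :]            # embedding dimensions
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])   # sine on even indices
    encoding[:, 1::2] = np.cos(angles[:, 1::2])   # cosine on odd indices
    return encoding

print(positional_encoding(50, 16).shape)   # (50, 16)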
Another important difference is that the encoder-decoder scheme is not applied. MultiHeadSelfAttention is the block that implements the transformer decoder through numerous layers.
It combines an embedding with an arbitrary number of multi-head attention and feed-forward layers. The model “pushes” the contextual vector from this part through a linear layer and then transforms it into a probability distribution over words by means of a softmax function.
6.3 Bing Microsoft Translator
Short description
Bing Microsoft Translator is a service created by Microsoft that allows users to translate texts or complete web pages into different languages [35]. It is a translator that combines two technologies: statistical machine translation and neural machine translation based on Artificial Intelligence [36].
The Z-code model architecture uses transfer learning in two ways. Firstly, the model is trained on many languages so that knowledge is transferred across several languages. Secondly, multitask training is used to transfer knowledge between tasks; for instance, the machine translation task can help the natural language understanding task. An image of the Z-code model can be seen at this link: https://www.microsoft.com/en-us/research/project/project-zcode/
DeltaLM: pre-training of an encoder-decoder to translate from one language to another by augmenting pre-trained multilingual encoders. Experiments show that DeltaLM outperforms several strong baselines in natural language generation, translation tasks (machine translation), text summarisation, data-to-text conversion and question generation [37]. Code and pre-trained models are available at this link: https://aka.ms/deltalm
To translate certain linguistic elements such as idioms and sayings, it is necessary to know both the source language and the target one; otherwise, the translation would be sentences composed of words with no meaning [38]. Multilingual translation systems such as Google Translate work on the basis of the information available in their databases. Unless an idiom is popular, the translator provides literal results, modifying the meaning of the sentence [39]. Some examples:
I) In English, the idiom hit a club is translated into Spanish as darse un garbeo. Google Translate: golpear un club, DeepL: golpear un club and Bing Microsoft Translator: golpear un club.
II) In English, the modern saying dealt it is translated into Spanish as el que lo huele, debajo lo tiene. We check how it is rendered by the machine translators. Google Translate: tratado, DeepL: lo repartió and Bing Microsoft Translator: lo repartí.
The same occurs with literary genres; for example, poetry has also posed a difficulty for machine translation. This is because literality appears in more than 80% of the results, setting aside its essence: escape for the reader's mind, beauty, sense of aesthetics, rhyme, use of rhetorical figures and communication of the message [40]. Some examples:
I) In the poem She Walks in Beauty by Lord Byron, verse 15, The smiles that win, the tints that glow, is translated into Spanish as las sonrisas que triunfan, los matices que refulgen. Google Translate gives: las sonrisas que ganan, los tintes que brillan, DeepL: las sonrisas que ganan, los tintes que brillan and Bing Microsoft Translator: las sonrisas que ganan y los tintes que brillan.
II) In Sonnet XXIX by William Shakespeare, the sixth verse, Featur'd like him, like him with friends possess'd, is translated into Spanish as deseando ser mejor compañía para disfrutar de la amistad. Google Translate: Tiene como él, como lo posee él con amigas, DeepL: Como un aspecto él de él, como el de los amigos que posee and Bing Microsoft Translator: Como él, como él con amigos poseídos.
To carry out the study, we have created a dataset that combines more than 200 records of poetic verses and 230 popular expressions. Our aim is to train a model, validate the results and compare them with the results from the machine translators. Our datasets can be viewed here:
8. Accuracy comparison between Google (BERT), DeepL and Bing using informal language and emotions
In the translation of our dataset, we compare the accuracy of the results obtained by the three most widely used models. With regard to poetic verses, it is important to attend to the context, because it offers the possibility of knowing more than just the lines being read. The methodology consists of interpreting the composition from English into Spanish, once the data have been split: 70% (training), 20% (validation) and 10% (test). The results obtained by the three machine translators reflect literality and loss of context, so that in the end we have a series of words without sense, much less a poetic composition. The following table shows the accuracy of each translator, highlighting the work of Google Translate:
Table 1. Accuracy in the translation of poetic verses

Google Translate    DeepL      Bing Microsoft Translator
27.18%              18.45%     10.19%
In relation to idioms, popular expressions and sayings, the accuracy obtained is low for the three models, although DeepL performs best:
Table 2. Accuracy in the translation of idioms, popular expressions and sayings

Google Translate    DeepL      Bing Microsoft Translator
29.52%              36.96%     17.39%
In this case, the three translators improve their accuracy compared with the poetic compositions. Bing Microsoft Translator is once again the worst one.
9. Development of a model based on artificial intelligence and personalized for NLP
The RNN-simple model has this structure:
By using our dataset and training the model for 30 epochs, we obtain an accuracy of 53.65%, which rises to 58.41% with 60 epochs, in the translation of popular expressions; and an accuracy of 39.83%, which increases by 2.94 points to 42.77% with 60 epochs, in the translation of poetry.
The model has around 800,000 trainable parameters in three different layers, plus a dropout layer added to reduce or eliminate the possibility of dying neurons during training.
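As a purely hypothetical outline (the actual structure is the one shown in the figure above; every size below is an assumed value, not taken from the paper), a model of this kind could be expressed in Keras as:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 64)),                  # sequence of 64-dimensional vectors (assumed)
    tf.keras.layers.SimpleRNN(128, return_sequences=True),
    tf.keras.layers.Dropout(0.2),                      # dropout layer, as described above
    tf.keras.layers.SimpleRNN(128),
    tf.keras.layers.Dense(64, activation="softmax"),   # output distribution (assumed size)
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()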
9.1 Development of a traditional model based on artificial intelligence and personalized for NLP relying on RNN with an embedding block
The RNN-embedding model has this structure:
By using our dataset and training the model for 30 epochs, we obtain an accuracy of 71.52%, which increases to 73.27% with 60 epochs, in the translation of popular expressions; and an accuracy of 62.46%, which increases to 68.26% with 60 epochs, in the translation of poems.
The model has around one million trainable parameters in four layers, using an embedding block.
9.2 Development of a transformer model based on artificial intelligence and personalized for machine translation (informal language and emotions)
The transformer model presents this structure:
By using our dataset and training the model for 30 epochs, we obtain an accuracy of 81.02%, which increases to 87.56% with 60 epochs, in the translation of popular expressions; and an accuracy of 71.51%, which increases to 76.33% with 60 epochs, in the translation of poetry.
The model has around 20 million trainable parameters in three layers, as well as an encoder-decoder structure.
The accuracy can improve if we apply fine-tuning to each of the layers, increase the number of layers or pre-process the input dataset of our model.
10. Conclusions
Currently, Google Translate, DeepL and Bing Microsoft Translator are not known for excelling at poetic and informal language.
Their accuracy on the informal language and poetry in our dataset is, in most cases, below 30%. They are outperformed by less complex architectures such as the simple RNN.
Given these data, and considering that Google Translate, DeepL and Bing Translator use the transformer architecture, we can deduce that the problem of machine translation does not lie only in the architecture of the model, but rather in the lack of attention that translation companies pay to oral language, the wisest route to learning a new language. We also conclude that Artificial Intelligence remains highly dependent on the training dataset and on how the information is processed.
11. Acknowledgments
We thank the organisations and individuals that made this project possible by contributing up-to-date information and testimonies that enhance the value of this research.
«The history of machine translation begins in the 1950s. As early as 1949, Warren Weaver of the Rockefeller Foundation set up a cryptographic and language processing machine that was a precursor to the concept of machine translation» (Yugo, 2017).