Automatic evaluations are independent of the language. Although they provide practical measures of language quality, they say nothing about the content. Assessments based on human judgement are therefore a necessary complement.

In natural language generation (NLG), several metrics quantify the gap between a source text (written by a human being) and a target text (written by software). BLEU, ROUGE, METEOR, NIST and WER each assign a score by comparing the word sequences (N-grams) of the source text and the target text, and how often those sequences occur.
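To make the idea of comparing N-grams concrete, here is a minimal sketch in Python. The function names `ngrams` and `ngram_overlap` are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is a simplifying assumption; real evaluation tools tokenize more carefully.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word N-grams of a text as a list of tuples.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(source, target, n):
    """Count the N-grams shared by both texts, taking frequency into account."""
    source_counts = Counter(ngrams(source, n))
    target_counts = Counter(ngrams(target, n))
    # Counter intersection keeps, for each N-gram, the smaller of the two counts.
    return sum((source_counts & target_counts).values())


source = "the cat sat on the mat"
target = "the cat is on the mat"

print(ngram_overlap(source, target, 1))  # shared single words
print(ngram_overlap(source, target, 2))  # shared pairs of consecutive words
```

The metrics described in this article differ mainly in how they turn such overlap counts into a score.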

Test the automatic evaluation metrics

BLEU (Bilingual Evaluation Understudy) gives equal weight to all N-grams. A BLEU score of 1 means that the N-grams of the source and target texts match exactly. This metric was developed by IBM and is commonly used in machine translation.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the proportion of N-grams from a reference text that are found in the generated text. There are several ROUGE metrics. The most common is ROUGE-N, which calculates this proportion for N-grams of length N. The other variants (ROUGE-S, ROUGE-L, ROUGE-W, ROUGE-2 and ROUGE-SU) correspond to different methods of computation. ROUGE is commonly used to evaluate automatically generated text summaries.
NIST (named after the National Institute of Standards and Technology) is an adaptation of BLEU. While BLEU gives equal weight to all N-grams, NIST gives more importance to the less frequent N-grams. Of the metrics listed here, NIST is reported to correlate best with human judgments.
METEOR (Metric for Evaluation of Translation with Explicit Ordering) gives equal weight to all N-grams. Its formula combines a recall rate (how much of the source text is covered) and a precision rate (how much of the target text is relevant). This metric is based on explicit matching between the words of the source text and the target text, whether the match is the exact word or a morphological variant of the word.
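WER (Word Error Rate) is based on explicit correspondence between words (exact word or morphological variant): it counts the word-level substitutions, deletions and insertions needed to turn the target text into the source text. This metric is commonly used in the field of speech recognition.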
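As a rough illustration of how these scores differ, the sketch below computes three simplified quantities: an N-gram precision (the building block of BLEU), an N-gram recall (the idea behind ROUGE-N) and a word error rate. These are not the full published formulas. Complete BLEU combines several N-gram lengths and penalizes short outputs, and this WER matches exact words only, not morphological variants. The function names are hypothetical and tokenization is a plain whitespace split.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the N-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(reference, candidate, n):
    """Share of the candidate's N-grams that also occur in the reference."""
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    total = sum(cand.values())
    return sum((ref & cand).values()) / total if total else 0.0


def ngram_recall(reference, candidate, n):
    """Share of the reference's N-grams that also occur in the candidate."""
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    total = sum(ref.values())
    return sum((ref & cand).values()) / total if total else 0.0


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
candidate = "the cat on the mat"

print(ngram_precision(reference, candidate, 1))  # every candidate word is in the reference
print(ngram_recall(reference, candidate, 1))     # one reference word is missing
print(word_error_rate(reference, candidate))     # one deletion over six reference words
```

In this example the candidate drops a single word: precision stays perfect, while recall and the word error rate both register the omission. This is why a precision-oriented score and a recall-oriented score can rank the same output differently.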


Readability scores and edit distance