Automatic evaluations are independent of the language. While they can provide useful measures of language quality, they say nothing about the content, which is why evaluations based on human judgements are complementary.

In natural language generation (NLG), several metrics quantify how far a target text (written by software) is from a source text (written by a human being). The BLEU, ROUGE, METEOR, NIST and WER metrics each assign a score by comparing the source text with the target text, mostly by counting sequences of words (N-grams) and their frequencies.
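The N-gram comparison these metrics rely on can be illustrated with a minimal sketch. The helper name `ngrams` and the two example sentences are assumptions made for illustration only, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example texts, split on whitespace for simplicity.
source = "the cat sat on the mat".split()   # written by a human being
target = "the cat is on the mat".split()    # written by software

# Frequency of each bigram (N = 2) in both texts.
source_bigrams = Counter(ngrams(source, 2))
target_bigrams = Counter(ngrams(target, 2))

# Counter intersection keeps, for each bigram, the smaller of the two counts,
# which gives the bigrams the two texts have in common.
shared = source_bigrams & target_bigrams
print(sorted(shared))
```

Each metric then turns counts of this kind into a score in its own way, for example by dividing the shared count by the number of N-grams in the target text or in the source text.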

Test the automatic evaluation metrics

BLEU (Bilingual Evaluation Understudy) gives equal weight to all N-grams. A BLEU score of 1 means that the N-grams of the target text match those of the source text. This metric was developed by IBM and is commonly used in machine translation.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is recall-oriented: it rewards a generated text for covering a high proportion of the N-grams of a reference text. There are several ROUGE metrics. The most common is ROUGE-N, which calculates the proportion of N-grams of length N in a reference text that also appear in the generated text (ROUGE-2 is ROUGE-N with N = 2). The other variants (ROUGE-L, ROUGE-W, ROUGE-S and ROUGE-SU) differ in the method of computation. ROUGE is commonly used to evaluate automatically generated text summaries.
NIST (named after the National Institute of Standards and Technology) is an adaptation of BLEU. While BLEU gives equal weight to all N-grams, NIST gives more importance to the less frequent N-grams. NIST is reported to correlate well with human judgments.
METEOR (Metric for Evaluation of Translation with Explicit Ordering) works at the level of individual words and combines in its formula a recall rate (how much of the source text is covered) and a precision rate (how much of the target text is relevant). This metric is based on the principle of explicit matches between the words of the source text and those of the target text, where a match can be the exact word or a morphological variant of the word.
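WER (Word Error Rate) is based on an explicit word-by-word correspondence between the two texts: it counts the word substitutions, deletions and insertions needed to turn the target text into the source text, divided by the number of words in the source text. This metric is commonly used in the field of speech recognition.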
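To make the difference between precision-oriented, recall-oriented and error-rate metrics concrete, here is a simplified sketch of three building blocks. It is not a full implementation of BLEU or ROUGE (BLEU, for instance, also combines several N-gram orders and applies a brevity penalty). The function names and the example sentences are assumptions chosen for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(source, target, n):
    """BLEU-style building block: share of target n-grams also found in the source."""
    src, tgt = ngram_counts(source, n), ngram_counts(target, n)
    total = sum(tgt.values())
    return sum((src & tgt).values()) / total if total else 0.0

def rouge_n_recall(source, target, n):
    """ROUGE-N-style recall: share of source n-grams also found in the target."""
    src, tgt = ngram_counts(source, n), ngram_counts(target, n)
    total = sum(src.values())
    return sum((src & tgt).values()) / total if total else 0.0

def word_error_rate(source, target):
    """Word-level edit distance (substitutions, deletions, insertions)
    divided by the number of words in the source text."""
    # prev[j] holds the edit distance between the source words seen so far
    # and the first j target words.
    prev = list(range(len(target) + 1))
    for i, s_word in enumerate(source, start=1):
        curr = [i]
        for j, t_word in enumerate(target, start=1):
            cost = 0 if s_word == t_word else 1
            curr.append(min(prev[j] + 1,          # source word missing from target
                            curr[j - 1] + 1,      # extra word in target
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(source) if source else 0.0

# Hypothetical example: the generated text drops one word.
source = "the cat sat on the mat".split()
target = "the cat sat on mat".split()

print(clipped_precision(source, target, 2))  # 3 of the 4 target bigrams are shared
print(rouge_n_recall(source, target, 2))     # 3 of the 5 source bigrams are covered
print(word_error_rate(source, target))       # 1 edit over 6 source words
```

On this example the precision-style score is higher than the recall-style score, because the target text is shorter than the source text: everything it contains is mostly correct, but it does not cover the whole source.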


Readability scores and edit distance