A dataset for automatic Grammatical Error Correction in twelve languages

What is it about?

We introduce MultiGEC, a dataset for Grammatical Error Correction (GEC) in 12 European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. The data consists mostly of essays written by second-language learners of these languages, but also includes texts written by schoolchildren, heritage speakers and members of the general population. Every text comes with one or more corrected versions.

Why is it important?

MultiGEC was built for the MultiGEC-2025 shared task, which is part of a series of initiatives aimed at fostering interest in low-resource languages in the Natural Language Processing community. By making the dataset available to a broader public, we aim for it to have a longer-lasting impact and to aid the development of GEC systems ready for educational settings.

The following have contributed to this page:

Alexandr Rosen, Orphee De Clercq, Katrin Wisniewski, Jennifer-Carmen Frey, Elena Volodina, and Arianna Masciolini