RITA: a Phraseological dataset of CEFR Assignments and Exams for Italian as a Second Language
Valentina Franzoni (Supervision); Giulio Biondi (Member of the Collaboration Group); Alfredo Milani (Funding Acquisition)
2023
Abstract
RITA (Resource for Italian Tests Assessment) is a new NLP dataset of academic exam texts written in Italian by second-language learners seeking CEFR certification of their proficiency level. The RITA dataset is available for automatic processing in CSV and XML formats, subject to a citation agreement. In addition to the tests, RITA provides a variety of speech elements, annotations, and statistics, including phraseological units and their syntactic dependencies. The dataset consists of two corpora: one containing the analysis of the task assignments, and the other containing the analysis of the texts that learners wrote in response to those assignments. The work to be cited also describes the data collection and annotation process, the structure of the dataset, and the statistics computed to facilitate phraseological analysis of the texts. The RITA corpus is a collection of data about 3041 exam texts submitted by learners of Italian as a second language (L2) at levels B1 to C2 of the Common European Framework of Reference for Languages (CEFR), collected and transcribed by the Center for Language Evaluation and Certification (CVCL) at the University for Foreigners of Perugia. RITA is a valuable resource for researchers and educators interested in Italian phraseology, language assessment, and natural language processing. The RITA dataset was developed under the Italian Ministry of Research PRIN Project “PHRAME” (Grant no. 20178XXKFY) and is directly derived from the CELI Corpus collected in the same PHRAME Project. Information not included in RITA (such as the original raw text) can be obtained by interactively querying the CELI Corpus at https://apps.unistrapg.it/cqpweb/

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.