CEDEL2: Corpus Escrito del Español L2 (version 2)

Corpus design


CEDEL2 investigates how people learn Spanish. To that end, we collected a large database (a corpus) of written (and some spoken) texts produced by learners of Spanish. Such a database is called a ‘learner corpus’ or ‘L2 corpus’.

The corpus is intended for linguists, researchers and teachers/learners of Spanish, as well as those interested in other uses of learner corpora (computational linguists, course material designers, etc.).

Several thousand speakers have participated online from universities and schools all over the world (USA, UK, Japan, Spain, Italy, Germany, Greece, Russia, various Arabic-speaking countries, etc.). You can also participate online at http://learnercorpora.com

Corpus description

CEDEL2 (version 2) is a large corpus that contains samples of the language produced by learners of Spanish as a second language. For comparative purposes, it also contains a native control subcorpus of the language produced by native speakers of Spanish from different varieties (peninsular Spanish and all varieties of Latin American Spanish), so it can be used as a native corpus in its own right.

It contains an additional set of native control subcorpora by native speakers of different languages (English, Portuguese, Greek, Arabic, and Japanese). These are the L1s of some of the learners. In this way, researchers can also check whether the learner’s L1 is influencing their L2 Spanish (i.e., whether learners are transferring from their L1).

At this stage of CEDEL2 (version 2), we therefore have the following set of control subcorpora: subcorpora of type 1 (the learner’s mother tongue) and a subcorpus of type 2 (the learner’s target language, i.e., the Spanish native subcorpus). Future versions of CEDEL2 will add further control subcorpora, currently under development, so that there is a type 1 control subcorpus for every learner subcorpus.

Table: Native control subcorpora in CEDEL2 v.2

Native control subcorpus 1 (learner's mother tongue) | Learner subcorpus | Native control subcorpus 2 (learner's target language)
L1 English | L1 English-L2 Spanish | L1 Spanish
L1 Portuguese | L1 Portuguese-L2 Spanish | L1 Spanish
L1 Greek | L1 Greek-L2 Spanish | L1 Spanish
L1 Arabic | L1 Arabic-L2 Spanish | L1 Spanish
L1 Japanese | L1 Japanese-L2 Spanish | L1 Spanish
L1 German [under development] | L1 German-L2 Spanish | L1 Spanish
L1 Dutch [under development] | L1 Dutch-L2 Spanish | L1 Spanish
L1 Italian [under development] | L1 Italian-L2 Spanish | L1 Spanish
L1 French [under development] | L1 French-L2 Spanish | L1 Spanish
L1 Russian [under development] | L1 Russian-L2 Spanish | L1 Spanish
L1 Chinese [under development] | L1 Chinese-L2 Spanish | L1 Spanish

CEDEL2 history

CEDEL2 was designed and implemented by Cristóbal Lozano, who has directed the project since 2004. It originated at the Universidad Autónoma de Madrid in that year, and since 2006 it has been continued and implemented at the Universidad de Granada.

Online data collection started in 2006. Following the standards of Open Data Science, the first version (CEDEL2 v.1) was released in September 2017 under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, and the entire corpus has been freely and publicly available online ever since (http://cedel2version1.learnercorpora.com/). It contains 2,578 speakers and files in total, collected mainly by Lozano (2,405 speakers, of which 1,609 were L1 English-L2 Spanish learners and 796 Spanish-speaking natives) and by Athanasios Georgopoulos (173 L1 Greek-L2 Spanish learners).

For its second version (CEDEL2 v.2), the corpus has been expanding since 2017 with the inclusion of a large number of subcorpora and the addition to the project of both local (Universidad de Granada, UGR) and international collaborators (cf. list of collaborators in the tab ‘CEDEL2 team’). CEDEL2 v.2 currently contains 1,691 new files plus the existing 2,578 files from v.1, which amounts to a total of 4,269 written and spoken files coming from 4,166 participants. CEDEL2 v.2 amounts to over one million words, which makes it currently the largest L2 Spanish corpus of its kind (cf. the tab ‘Statistics’ for further details). CEDEL2 v.2 was publicly and freely released in July 2020 at http://cedel2.learnercorpora.com.

Corpus structure (CEDEL2 version 2): subcorpora

CEDEL2 is divided into two major components: the learner subcorpora and the native subcorpora. The 11 learner subcorpora consist of texts (mostly written, but some spoken) produced by learners of Spanish as a second language (L2). These learners are classified into subcorpora according to their mother tongue (i.e., their first language, L1). We have Indo-European languages, which are further subclassified into Germanic (English, German, Dutch), Romance (French, Italian, Portuguese), Hellenic (Greek) and Slavic (Russian). We also have East Asian languages (Japanese, Chinese) and Arabic. These typological similarities and differences make CEDEL2 an ideal L2 corpus for crosslinguistic comparisons that test for L1 influence on L2 Spanish. The subcorpora contain learners at all proficiency levels (beginner, intermediate, advanced).

Table: CEDEL2 learner subcorpora

L2 Spanish learner subcorpora | Words | Documents
L1 Arabic - L2 Spanish | 9,118 | 74
L1 English - L2 Spanish | 558,731 | 1,931
L1 Chinese - L2 Spanish | 4,373 | 22
L1 Dutch - L2 Spanish | 9,069 | 60
L1 French - L2 Spanish | 8,213 | 59
L1 German - L2 Spanish | 16,164 | 82
L1 Greek - L2 Spanish | 64,105 | 216
L1 Italian - L2 Spanish | 14,426 | 83
L1 Japanese - L2 Spanish | 23,049 | 243
L1 Portuguese - L2 Spanish | 21,662 | 164
L1 Russian - L2 Spanish | 16,117 | 101

The 6 native subcorpora serve as ‘control’ data and are used for comparative purposes. In particular, the L1 native Spanish subcorpus can be used as a traditional control subcorpus against which we can compare the language produced by the learners of L2 Spanish, especially to check ‘ultimate attainment’: whether very advanced and near-native learners of L2 Spanish can ultimately attain a native level. The L1 native Spanish subcorpus contains the language produced by native speakers of Spanish from Spain and from other Spanish-speaking countries (Mexico, Argentina, Colombia, etc.), so it can be used as a corpus of native Spanish in its own right.

Additionally, there are a few native control subcorpora which are used to investigate the properties of the learners’ native language (L1) as well as likely L1 transfer, i.e., whether the learners are transferring properties from their mother tongue (L1) onto their L2 Spanish. The subcorpora are: L1 native English, Portuguese, Greek, Japanese, and Arabic. In future versions of CEDEL2, there will be additional control corpora so that there is always a control corpus for every L1 of the learners.

Table: CEDEL2 native subcorpora

Native control subcorpora | Words | Documents
L1 Arabic | 1,465 | 6
L1 English | 40,805 | 172
L1 Greek | 2,031 | 12
L1 Japanese | 9,126 | 47
L1 Portuguese | 3,348 | 16
L1 Spanish | 304,211 | 1,112
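
The word counts in the two tables above can be tallied with a short script. The following is only an illustrative sketch: the names `LEARNER_WORDS`, `NATIVE_WORDS` and `total_words` are hypothetical, and the figures are copied by hand from the tables rather than read from the corpus files.

```python
# Word counts copied from the learner and native subcorpora tables above.
# All names below are illustrative; they are not part of CEDEL2 itself.
LEARNER_WORDS = {
    "Arabic": 9_118,
    "English": 558_731,
    "Chinese": 4_373,
    "Dutch": 9_069,
    "French": 8_213,
    "German": 16_164,
    "Greek": 64_105,
    "Italian": 14_426,
    "Japanese": 23_049,
    "Portuguese": 21_662,
    "Russian": 16_117,
}

NATIVE_WORDS = {
    "Arabic": 1_465,
    "English": 40_805,
    "Greek": 2_031,
    "Japanese": 9_126,
    "Portuguese": 3_348,
    "Spanish": 304_211,
}


def total_words(subcorpora):
    """Sum the word counts of a {language: words} table."""
    return sum(subcorpora.values())


learner_words = total_words(LEARNER_WORDS)
native_words = total_words(NATIVE_WORDS)
all_words = learner_words + native_words

print(f"Learner subcorpora: {learner_words:,} words")
print(f"Native subcorpora:  {native_words:,} words")
print(f"Whole corpus:       {all_words:,} words")
```

With the figures as listed, the learner subcorpora add up to 745,027 words and the native subcorpora to 360,986 words, giving 1,106,013 words in total, which is in line with the "over one million words" mentioned earlier.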

Corpus design: Tasks

Tasks 1-12 (see Table below) were used in CEDEL2 (version 1). For the enhancement of the CEDEL2 corpus in its 2nd version (CEDEL2 v.2), two of these tasks were kept (2 and 3), and two additional ones were added (13 and 14). Importantly, tasks are not associated with any particular proficiency level, i.e., learners can choose any task independently of their proficiency level.

Table: CEDEL2 tasks

Task number | Task title | Task description
1 | Region where you live | What is the region where you live like?
    ¿Cómo es la región donde vives?
2 | Famous Person | Talk about a famous person.
    Habla de una persona famosa.
3 | Film | Summarise a film you have seen recently.
    Resume una película que has visto recientemente.
4 | Last year holidays | What did you do during your holidays last summer?
    ¿Qué hiciste el año pasado durante las vacaciones?
5 | Future plans | What are your plans for the future?
    ¿Cuáles son tus planes para el futuro?
6 | Recent trip | Describe a trip you have recently made.
    Describe un viaje que has hecho recientemente.
7 | Experience | Talk about an experience you have recently had.
    Cuenta una experiencia que hayas vivido.
8 | Terrorism | Talk about the problem of terrorism in the world.
    Habla del problema del terrorismo en el mundo.
9 | Anti-smoking law | What do you think about the new anti-smoking law?
    ¿Qué opinas de la nueva ley anti-tabaco?
10 | Gay couples | Do you think gay couples should have the right to get married and adopt children?
    ¿Crees que las parejas gay tienen el derecho de casarse y adoptar niños?
11 | Marijuana legalization | Do you think marijuana should be legal?
    ¿Crees que la marihuana se debería legalizar?
12 | Immigration | Analyse the main aspects concerning immigration.
    Analiza los principales aspectos de la inmigración.
13 | Frog | Look at the following pictures and retell the story.
    Tell the story shown in the pictures. You can add new aspects to the story or ignore some aspects in the pictures. Your text should start “Un día / One day...” https://goo.gl/so3S6W
    Mira las siguientes ilustraciones. Narra una historia basada en las ilustraciones. Puedes añadir ideas nuevas o ignorar algunas que aparezcan en las ilustraciones. Por favor, comienza la historia con la frase: "Un día..." https://goo.gl/so3S6W
14 | Chaplin | Watch the following Chaplin video clip and retell the story.
    Watch the following Chaplin video clip (4 minutes). Summarise the story. You can watch the video clip more than once.
    Mira el siguiente video de Charles Chaplin (4 minutos). Haz un resumen de la historia. Puedes ver el video más de una vez. https://www.youtube.com/watch?v=4QkTNJFhu-g

Corpus design: Variables

CEDEL2 was designed with a second language acquisition (SLA) agenda in mind. For every participant, we collected a large number of variables that are essential for SLA researchers. There are two sets of variables: linguistic background variables and task variables.

Table: Learner’s variables (linguistic background and task)

Linguistic background variables
  1. L1 of the learner
  2. L1 of the learner’s father
  3. L1 of the learner’s mother
  4. Language(s) spoken at home
  5. Placement test score (1-43 points)
  6. Proficiency level (lower beginner up to upper advanced)
  7. Proficiency level self-evaluation on each skill in Spanish (speaking, listening, writing, reading)
  8. Proficiency level self-evaluation on each skill in additional foreign language (speaking, listening, writing, reading)
  9. Spanish language certificates held, if any
  10. Sex
  11. Age
  12. Age of exposure to L2 Spanish (AoE)
  13. Years studying Spanish (Length of instructed exposure)
  14. Stays in Spanish-speaking countries? (yes/no)
  15. Stay(s): Where?
  16. Stay(s): When? (period(s) of residence)
  17. Stay(s): How long? (length of residence)
  18. School/University/Educational institution (if any)
  19. Major degree (if any)
  20. Year at university/school (if any)

Task variables
  1. Task title
  2. Task text (written text/spoken text transcription/audio file)
  3. Approximate time to produce the task (in minutes)
  4. Where was the task done? (in class/outside class/both)
  5. Resources used to produce the task (help from Spanish native/bilingual dictionary/monolingual dictionary/spellchecker/grammar book/background readings/none)
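
For users who process the corpus programmatically, the learner variables listed above could be modelled as a simple record type. The sketch below is one possible way of doing so, not the format in which CEDEL2 distributes its data: the class `LearnerRecord`, its field names, the helper `filter_by_l1` and the two sample records are all hypothetical, and only a subset of the variables is included.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LearnerRecord:
    """One learner text with a subset of its variables (hypothetical names)."""

    l1: str                    # L1 of the learner
    placement_score: int       # placement test score (1-43 points)
    proficiency_level: str     # from "lower beginner" up to "upper advanced"
    age: int
    age_of_exposure: int       # age of exposure to L2 Spanish (AoE)
    years_studying: float      # length of instructed exposure
    task_title: str
    task_text: str
    minutes_on_task: Optional[int] = None
    resources_used: List[str] = field(default_factory=list)

    def __post_init__(self):
        # The placement test is scored from 1 to 43 points.
        if not 1 <= self.placement_score <= 43:
            raise ValueError("placement_score must be between 1 and 43")


def filter_by_l1(records, l1):
    """Keep only the records whose learner has the given L1."""
    return [r for r in records if r.l1 == l1]


# Invented sample records, for illustration only (not real corpus data).
sample = [
    LearnerRecord("English", 40, "upper advanced", 22, 12, 8.0,
                  "Film", "..."),
    LearnerRecord("Greek", 5, "lower beginner", 19, 18, 1.0,
                  "Famous Person", "...", resources_used=["bilingual dictionary"]),
]

english_learners = filter_by_l1(sample, "English")
print(len(english_learners))
```

Keeping the background variables and the task variables in the same record makes it straightforward to select texts by learner profile (for example, by L1 or by placement score) before analysing them.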

Table: Native’s variables (linguistic background and task)

Linguistic background variables
  1. L1 of the native speaker
  2. L1 variety
  3. L1 of the native speaker’s father
  4. L1 of the native speaker’s mother
  5. Language(s) spoken at home
  6. Proficiency level self-evaluation on each skill in foreign language (speaking, listening, writing, reading)
  7. Proficiency level self-evaluation on each skill in additional foreign language (speaking, listening, writing, reading)
  8. Sex
  9. Age
  10. School/University/Educational institution (if any)
  11. Major degree (if any)
  12. Year at university/school (if any)

Task variables
  1. Task title
  2. Task text (written text/spoken text/audio file)
  3. Approximate time to produce the task (in minutes)
  4. Resources used to produce the task (Monolingual dictionary/Spellchecker/Grammar book/Background readings about the task topic (newspapers, internet, TV, etc.))

Corpus design: Proficiency level

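CEDEL2 contains data from learners of Spanish at all proficiency levels (beginner, intermediate, advanced). Unlike other learner corpora that do not contain a standardised measure of the learner’s proficiency, CEDEL2 uses two proficiency-level measurements: a placement test (scored from 1 to 43 points) and the learner’s self-evaluation of their proficiency in each skill (speaking, listening, writing, reading).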

Data collection

Written data were collected via online forms, which means that participants could take part from anywhere in the world. To ensure that all participants understood the forms and the instructions correctly, forms were written in their native language. To see the different forms, please visit http://learnercorpora.com.

Spoken data were collected in situ at the Universidad de Granada in a quiet room with the help of special audio recording equipment (Audio Technica AT2020: Cardioid condenser microphone, 74 dB, 1 kHz at 1 Pa) to guarantee the best possible sound quality in the audio files. This ensures that phoneticians can download the audio files to perform fine-grained acoustic analyses. For consistency, all data collectors followed the same protocol during oral recordings. The audio files were transcribed orthographically and converted into text files which are searchable and downloadable; the audio files can also be downloaded. See the transcription conventions in the tab ‘User Guide’ > ‘Transcription conventions’.

Other twin corpora

For comparative purposes, we created a twin corpus called COREFL (CORpus of English as a Foreign Language), which was designed following the CEDEL2 principles. It contains data from L1 Spanish-L2 English and L1 German-L2 English in such a way that users can do ‘mirror image’ comparisons. For example, a given linguistic phenomenon can be explored in both directions (L1↔L2): in L1 English-L2 Spanish (CEDEL2 subcorpus) vs. L1 Spanish-L2 English (COREFL subcorpus). Likewise, L1 German-L2 Spanish (CEDEL2) can be compared with L1 German-L2 English (COREFL). This corpus design feature is called ‘bidirectionality’.

Additionally, COREFL contains a subcorpus of native English, which is shared across CEDEL2 (as a corpus of the learner’s L1) and COREFL (as a corpus of the target language being acquired).
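
The comparisons described above can be made explicit by treating each learner subcorpus as an (L1, L2) pair. The sketch below is purely illustrative: the names `CEDEL2_PAIRS`, `COREFL_PAIRS`, `mirror_pairs` and `shared_l1` are hypothetical, and the pairs are taken from the subcorpora named in this description.

```python
# (L1, L2) pairs of the learner subcorpora named in the text.
# Variable and function names are illustrative only.
CEDEL2_PAIRS = {
    (l1, "Spanish")
    for l1 in [
        "Arabic", "Chinese", "Dutch", "English", "French", "German",
        "Greek", "Italian", "Japanese", "Portuguese", "Russian",
    ]
}
COREFL_PAIRS = {("Spanish", "English"), ("German", "English")}


def mirror_pairs(corpus_a, corpus_b):
    """Pairs in corpus_a whose reverse (L2, L1) appears in corpus_b."""
    return sorted(pair for pair in corpus_a if (pair[1], pair[0]) in corpus_b)


def shared_l1(corpus_a, corpus_b):
    """L1s that occur in both corpora."""
    return sorted({l1 for l1, _ in corpus_a} & {l1 for l1, _ in corpus_b})


print(mirror_pairs(CEDEL2_PAIRS, COREFL_PAIRS))  # [('English', 'Spanish')]
print(shared_l1(CEDEL2_PAIRS, COREFL_PAIRS))     # ['German']
```

The first result corresponds to the ‘mirror image’ comparison (L1 English-L2 Spanish vs. L1 Spanish-L2 English); the second corresponds to the German learners, who share an L1 across the two corpora but differ in target language.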

COREFL: Lozano, C., Díaz-Negrillo, A., & Callies, M. (2020). Designing and compiling a learner corpus of written and spoken narratives: COREFL. In C. Bongartz & J. Torregrossa (Eds.), What’s in a Narrative? Variation in Story-Telling at the Interface between Language and Literacy (pp. 9-32). Bern: Peter Lang.

Finally, WriCLE (Written Corpus of Learner English) is a similar L1 Spanish - L2 English corpus.

WriCLE: Rollinson, P., & Mendikoetxea, A. (2010). Learner corpora and second language acquisition: Introducing WRICLE. In J. L. Bueno Alonso, D. González Álvarez, U. Kirsten Torrado, A. E. Martínez Insua, J. Pérez-Guerra, E. Rama Martínez, & R. Rodríguez Vázquez (Eds.), Analizar datos > Describir variación / Analysing Data > Describing Variation (pp. 1-12). Universidade de Vigo (Servizo de Publicacións).

Other L2 Spanish corpora

CEDEL2 is in line with other international projects where large learner corpora are being created. Of particular interest are SPLLOC (Spanish Learner Language Oral Corpus) at the University of Southampton (UK), CAES (Corpus de Aprendices del Español) at Universidad de Santiago de Compostela (Spain), and LANGSNAP (Language and Social Networks Abroad Project). For a repository of L2 Spanish corpora, see the Indexador de Corpus de Aprendices de Español.