CEDEL2: Corpus Escrito del Español L2 (version 3)

v3.0
Oct. 2025

Corpus design

Rationale

In CEDEL2 we investigate how people learn Spanish. That is why we collected a large database (=corpus) of written and spoken texts produced by learners of Spanish. This is called a 'learner corpus' or 'L2 corpus'.

The corpus is intended to be a useful tool for linguists, researchers and teachers/learners of Spanish, as well as those interested in other uses of learner corpora (computational linguists, course material designers, etc.).

Several thousand speakers have participated online from universities and schools all over the world (USA, UK, Japan, Spain, Italy, Germany, Greece, Russia, various Arabic countries, etc.). You can also participate online at http://learnercorpora.com

Corpus description

CEDEL2 (version 3) is a large corpus that contains samples of the language produced by learners of Spanish as a second language. For comparative purposes, it also contains a native control subcorpus of the language produced by native speakers of Spanish from different varieties (peninsular Spanish and all varieties of Latin American Spanish), so it can be used as a native corpus in its own right.

It contains an additional set of native control subcorpora by native speakers of different languages (English, Portuguese, Greek, Arabic, Chinese, Japanese, etc.). These are the mother tongues (i.e., the first languages, L1s) of some of the learners. In this way, researchers can check whether the learners' L1 is influencing their L2 Spanish (i.e., whether learners are transferring from their L1).

Therefore, at this stage of CEDEL2 (version 3), we have the following set of control subcorpora: subcorpora of type 1 (the learners' mother tongue) and a subcorpus of type 2 (the learners' target language, i.e., the Spanish native subcorpus). The additional control subcorpora currently under development will be added in future versions of CEDEL2, so that there is a type 1 control subcorpus for every learner subcorpus. For details on statistics, see the Statistics section.

Table: Native control subcorpora in CEDEL2 v.3 (number of words between brackets)

Native control subcorpus 1 (learner's mother tongue) | Learner subcorpus | Native control subcorpus 2 (learner's target language)
L1 English (51,954) | L1 English-L2 Spanish (589,763) | L1 Spanish (445,955)
L1 Japanese (11,533) | L1 Japanese-L2 Spanish (39,732) | L1 Spanish (445,955)
L1 Greek (2,438) | L1 Greek-L2 Spanish (96,297) | L1 Spanish (445,955)
L1 Italian (1,667) | L1 Italian-L2 Spanish (39,235) | L1 Spanish (445,955)
L1 Russian (4,725) | L1 Russian-L2 Spanish (33,589) | L1 Spanish (445,955)
L1 German (25,996) | L1 German-L2 Spanish (20,723) | L1 Spanish (445,955)
L1 Portuguese (7,801) | L1 Portuguese-L2 Spanish (17,578) | L1 Spanish (445,955)
L1 French (1,795) | L1 French-L2 Spanish (17,085) | L1 Spanish (445,955)
L1 Dutch (under development) | L1 Dutch-L2 Spanish (14,665) | L1 Spanish (445,955)
L1 Arabic (1,883) | L1 Arabic-L2 Spanish (12,958) | L1 Spanish (445,955)
L1 Estonian (under development) | L1 Estonian-L2 Spanish (11,852) | L1 Spanish (445,955)
L1 Chinese (1,968) | L1 Chinese-L2 Spanish (10,823) | L1 Spanish (445,955)
L1 Polish (under development) | L1 Polish-L2 Spanish (8,576) | L1 Spanish (445,955)
L1 Turkish (2,287) | L1 Turkish-L2 Spanish (5,023) | L1 Spanish (445,955)
L1 Vietnamese (1,247) | L1 Vietnamese-L2 Spanish (3,358) | L1 Spanish (445,955)

CEDEL2 history

CEDEL2 was designed, created and implemented by Cristóbal Lozano, who has directed the project since its inception in 2005. It originated at the Universidad Autónoma de Madrid in 2005 and since 2006 it has been developed at the Universidad de Granada.

Online data collection started in 2006. Following the standards of Open Data Science, the first version (CEDEL2 v.1) was released in September 2017 under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, and the entire corpus has been freely and publicly available online ever since. For its second version (CEDEL2 v.2), the corpus grew in number of words and subcorpora, and both local (Universidad de Granada, UGR) and international collaborators joined the project (cf. list of collaborators in the tab ‘CEDEL2 team’). In its current version (CEDEL2 v.3), it contains 4,247 learners of Spanish from 15 typologically diverse L1s and 2,313 native speakers (of Spanish and of most of the learners' L1s), amounting to a total of 6,560 files. It is probably one of the largest L2 Spanish corpora of its kind (cf. the tab ‘Statistics’ for further details). The table below presents a quick overview of the different phases CEDEL2 has gone through:

Table: The development of CEDEL2

Phase | Years | Team | No. of speakers | Web interface
1st | 2006-2016 | C. Lozano | L1 Eng-L2 Spa [1600]; L1 Greek-L2 Spa [173]; Spanish natives [800]; 2,573 total files | May 2016: CEDEL2 beta; Sept 2017: CEDEL2 v. 1
2nd | 2017-2020 | C. Lozano & BilinguaLab team | Learners [3034]: Arab, Chi, Dutch, Eng, French, German, Greek, Italian, Jap, Port, Russian; Natives [1365]: Arab, Eng, Greek, Jap, Port, Spa; 4,399 total files | Sept 2020: CEDEL2 v. 2
3rd | 2020-2025 | C. Lozano, BilinguaLab team & NLPgo team | Learners [4247]: Arab, Chi, Dutch, Eng, Estonian, French, German, Greek, Italian, Jap, Polish, Port, Russian, Turkish, Vietnamese; Natives [2313]: Arab, Chi, Eng, French, German, Greek, Italian, Jap, Port, Russian, Spa, Turkish, Vietnamese; 6,560 total files | Oct 2025: CEDEL2 v. 3

Corpus structure (CEDEL2 version 3): subcorpora

CEDEL2 is divided into two major components: the learner subcorpora and the native subcorpora. The 15 learner subcorpora consist of texts (mostly written, but many spoken) produced by learners of Spanish as a second language (L2). These learners are classified into subcorpora according to their mother tongue (i.e., their first language, L1). We have Indo-European languages, which are further subclassified into Germanic (English, German, Dutch), Romance (French, Italian, Portuguese), Hellenic (Greek) and Slavic (Russian, Polish). We also have East-Asian languages (Japanese, Chinese, Vietnamese), Uralic (Estonian), Turkic (Turkish) and Semitic (Arabic). All these typological similarities and differences make CEDEL2 an ideal L2 corpus for crosslinguistic comparisons to test for L1 influence on L2 Spanish. These subcorpora contain varied learner profiles: all proficiency levels (beginner, intermediate, advanced), different ages, different lengths of residence (LoR) in Spanish-speaking countries, varied ages of onset (AoO) to L2 Spanish, and different learning environments (university learners, high school learners, naturalistic learners).

Table: CEDEL2 learner subcorpora

L2 Spanish learner subcorpora | Words | Documents
L1 Arabic - L2 Spanish | 12,958 | 105
L1 Chinese - L2 Spanish | 10,823 | 61
L1 Dutch - L2 Spanish | 14,665 | 92
L1 English - L2 Spanish | 589,763 | 2,139
L1 Estonian - L2 Spanish | 11,852 | 73
L1 French - L2 Spanish | 17,085 | 118
L1 German - L2 Spanish | 20,723 | 100
L1 Greek - L2 Spanish | 96,279 | 418
L1 Italian - L2 Spanish | 39,235 | 242
L1 Japanese - L2 Spanish | 39,732 | 434
L1 Polish - L2 Spanish | 8,576 | 63
L1 Portuguese - L2 Spanish | 17,578 | 121
L1 Russian - L2 Spanish | 33,589 | 221
L1 Turkish - L2 Spanish | 5,023 | 38
L1 Vietnamese - L2 Spanish | 3,358 | 22
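
As a quick sanity check on the figures in the table above, the following minimal Python sketch sums the word and document counts and computes the mean text length per subcorpus. The counts are copied from the table; the names `LEARNER_SUBCORPORA`, `totals` and `mean_words_per_document` are illustrative assumptions and not part of any CEDEL2 tool.

```python
# (words, documents) per learner subcorpus, copied from the table above.
# All identifiers here are illustrative; CEDEL2 does not ship this code.
LEARNER_SUBCORPORA = {
    "Arabic": (12_958, 105),
    "Chinese": (10_823, 61),
    "Dutch": (14_665, 92),
    "English": (589_763, 2_139),
    "Estonian": (11_852, 73),
    "French": (17_085, 118),
    "German": (20_723, 100),
    "Greek": (96_279, 418),
    "Italian": (39_235, 242),
    "Japanese": (39_732, 434),
    "Polish": (8_576, 63),
    "Portuguese": (17_578, 121),
    "Russian": (33_589, 221),
    "Turkish": (5_023, 38),
    "Vietnamese": (3_358, 22),
}


def totals(subcorpora):
    """Return (total words, total documents) across all subcorpora."""
    words = sum(w for w, _ in subcorpora.values())
    docs = sum(d for _, d in subcorpora.values())
    return words, docs


def mean_words_per_document(subcorpora):
    """Return the mean number of words per document for each L1."""
    return {l1: round(w / d, 1) for l1, (w, d) in subcorpora.items()}


total_words, total_docs = totals(LEARNER_SUBCORPORA)
means = mean_words_per_document(LEARNER_SUBCORPORA)
print(total_words, total_docs)
print(means["English"])
```

The document total obtained this way (4,247) agrees with the number of learners reported for CEDEL2 v.3, which suggests one document per learner.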

The 13 native subcorpora serve as 'control' data and are used for comparative purposes. In particular, the L1 native Spanish subcorpus amounts to nearly half a million words (445,955 words, 1,705 documents). It can be used as a traditional control subcorpus against which we can compare the language produced by the learners of L2 Spanish, especially to check 'ultimate attainment': whether very advanced and near-native learners of L2 Spanish can ultimately attain a native level. The L1 native Spanish subcorpus contains the language produced by native speakers of Spanish from Spain and from other Spanish-speaking countries (Mexico, Argentina, Colombia, etc.), so, given its large size, it can be used as a corpus of native Spanish in its own right.

Additionally, there are 12 other native control subcorpora, which are used to investigate the properties of the learners' native language (L1) as well as likely L1 transfer, i.e., whether the learners are transferring properties from their mother tongue (L1) onto their L2 Spanish. The subcorpora are listed below.

Table: CEDEL2 native subcorpora

Native control subcorpora | Words | Documents
L1 Arabic | 1,883 | 8
L1 Chinese | 1,968 | 6
L1 English | 51,954 | 230
L1 French | 1,795 | 5
L1 German | 25,996 | 103
L1 Greek | 24,380 | 109
L1 Italian | 1,667 | 8
L1 Japanese | 11,533 | 57
L1 Portuguese | 7,801 | 36
L1 Russian | 4,725 | 20
L1 Spanish | 445,955 | 1,705
L1 Turkish | 2,287 | 21
L1 Vietnamese | 1,247 | 5
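
To see which learner subcorpora do not yet have a matching type 1 control subcorpus (the learners' mother tongue), the two language lists can be compared as sets. This is only a sketch: the language names come from the learner and native tables in this section, and the variable names are assumptions made for illustration.

```python
# L1s of the 15 learner subcorpora (from the learner subcorpora table).
LEARNER_L1S = {
    "Arabic", "Chinese", "Dutch", "English", "Estonian", "French",
    "German", "Greek", "Italian", "Japanese", "Polish", "Portuguese",
    "Russian", "Turkish", "Vietnamese",
}

# Languages of the 13 native control subcorpora (from the table above).
NATIVE_L1S = {
    "Arabic", "Chinese", "English", "French", "German", "Greek",
    "Italian", "Japanese", "Portuguese", "Russian", "Spanish",
    "Turkish", "Vietnamese",
}

# Learner L1s with no native control subcorpus of the same language.
missing_controls = sorted(LEARNER_L1S - NATIVE_L1S)

# Native subcorpora that are not the L1 of any learner subcorpus
# (only Spanish, the target language, i.e. the type 2 control).
target_language_controls = sorted(NATIVE_L1S - LEARNER_L1S)

print(missing_controls)          # ['Dutch', 'Estonian', 'Polish']
print(target_language_controls)  # ['Spanish']
```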

Corpus design: Tasks

Tasks 1-12 (see Table below) were used in CEDEL2 (version 1). For the enhancement of the CEDEL2 corpus in subsequent versions (CEDEL2 v.2 and v.3), two of these tasks were kept (2 and 3), and two additional ones were added (13 and 14). Importantly, tasks are not associated with any particular proficiency level, i.e., learners can in principle choose any task independently of their proficiency level.

Table: CEDEL2 tasks

Task number | Task title | Task description
1 | Region where you live | What is the region where you live like? / ¿Cómo es la región donde vives?
2 | Famous Person | Talk about a famous person. / Habla de una persona famosa.
3 | Film | Summarise a film you have seen recently. / Resume una película que has visto recientemente.
4 | Last year holidays | What did you do during your holidays last summer? / ¿Qué hiciste el año pasado durante las vacaciones?
5 | Future plans | What are your plans for the future? / ¿Cuáles son tus planes para el futuro?
6 | Recent trip | Describe a trip you have recently made. / Describe un viaje que has hecho recientemente.
7 | Experience | Talk about an experience you have recently had. / Cuenta una experiencia que hayas vivido.
8 | Terrorism | Talk about the problem of terrorism in the world. / Habla del problema del terrorismo en el mundo.
9 | Anti-smoking law | What do you think about the new anti-smoking law? / ¿Qué opinas de la nueva ley anti-tabaco?
10 | Gay couples | Do you think gay couples should have the right to get married and adopt children? / ¿Crees que las parejas gay tienen el derecho de casarse y adoptar niños?
11 | Marijuana legalization | Do you think marijuana should be legal? / ¿Crees que la marihuana se debería legalizar?
12 | Immigration | Analyse the main aspects concerning immigration. / Analiza los principales aspectos de la inmigración.
13 | Frog | Look at the following pictures and retell the story. Tell the story shown in the pictures. You can add new aspects to the story or ignore some aspects in the pictures. Your text should start: “Un día / One day...” https://goo.gl/so3S6W / Mira las siguientes ilustraciones. Narra una historia basada en las ilustraciones. Puedes añadir ideas nuevas o ignorar algunas que aparezcan en las ilustraciones. Por favor, comienza la historia con la frase: "Un día..." https://goo.gl/so3S6W
14 | Chaplin | Watch the following Chaplin video clip and retell the story. Watch the following Chaplin video clip (4 minutes). Summarise the story. You can watch the video clip more than once. https://www.youtube.com/watch?v=4QkTNJFhu-g / Mira el siguiente video de Charles Chaplin (4 minutos). Haz un resumen de la historia. Puedes ver el video más de una vez. https://www.youtube.com/watch?v=4QkTNJFhu-g

Corpus design: Variables

CEDEL2 was designed with a second language acquisition (SLA) agenda in mind. For every participant, we collected a large number of variables that are essential for SLA researchers. There are two sets of variables: linguistic background variables and task variables.

Table: Learner’s variables (linguistic background and task)

Linguistic background variables:
  1. L1 of the learner
  2. L1 of the learner’s father
  3. L1 of the learner’s mother
  4. Language(s) spoken at home
  5. Placement test score (1-43 points)
  6. Proficiency level (lower beginner up to upper advanced)
  7. Proficiency level self-evaluation on each skill in Spanish (speaking, listening, writing, reading)
  8. Proficiency level self-evaluation on each skill in additional foreign language (speaking, listening, writing, reading)
  9. Spanish language certificates held, if any
  10. Sex
  11. Age
  12. Age of exposure to L2 Spanish (AoE)
  13. Years studying Spanish (Length of Instruction, LoI)
  14. Stays in Spanish-speaking countries? (yes/no)
  15. Stay(s): Where?
  16. Stay(s): When? (period(s) of residence)
  17. Stay(s): How long? (length of residence)
  18. School/University/Educational institution (if any)
  19. Major degree (if any)
  20. Year at university/school (if any)

Task variables:
  1. Task title
  2. Task text (written text/spoken text transcription/audio file)
  3. Approximate time to produce the task (in minutes)
  4. Where was the task done? (in class/outside class/both)
  5. Resources used to produce the task (help from Spanish native/bilingual dictionary/monolingual dictionary/spellchecker/grammar book/background readings/none)
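
For researchers who want to filter texts by these variables in their own scripts, one possible way to represent a learner text is sketched below. This is a hedged illustration only: the class `LearnerRecord`, its field names, the helper `select` and the toy records are all hypothetical and do not reflect the actual CEDEL2 download format; only the variable meanings and the 1-43 score range come from the list above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class LearnerRecord:
    """One learner text with a subset of the variables listed above.

    Field names are hypothetical, chosen only to mirror the variable list.
    """
    l1: str                 # L1 of the learner
    placement_score: int    # placement test score (1-43 points)
    proficiency_level: str  # from "lower beginner" up to "upper advanced"
    age: int
    age_of_exposure: int    # AoE to L2 Spanish
    years_studying: float   # length of instruction (LoI)
    stayed_abroad: bool     # stays in Spanish-speaking countries (yes/no)
    task_title: str
    task_location: str      # "in class", "outside class" or "both"
    text: str

    def __post_init__(self) -> None:
        # The placement test is scored from 1 to 43 points.
        if not 1 <= self.placement_score <= 43:
            raise ValueError("placement_score must be between 1 and 43")


def select(records: Iterable[LearnerRecord],
           l1: Optional[str] = None,
           min_score: Optional[int] = None,
           task_title: Optional[str] = None) -> List[LearnerRecord]:
    """Keep only the records that match every criterion that is not None."""
    selected = []
    for record in records:
        if l1 is not None and record.l1 != l1:
            continue
        if min_score is not None and record.placement_score < min_score:
            continue
        if task_title is not None and record.task_title != task_title:
            continue
        selected.append(record)
    return selected


# Invented toy records for illustration, NOT real CEDEL2 data.
toy_records = [
    LearnerRecord("English", 40, "upper advanced", 24, 14, 8.0, True,
                  "Film", "outside class", "..."),
    LearnerRecord("English", 12, "lower beginner", 19, 18, 1.0, False,
                  "Frog", "in class", "..."),
    LearnerRecord("Japanese", 35, "upper advanced", 22, 18, 4.0, True,
                  "Frog", "in class", "..."),
]

# The cut-off of 30 is arbitrary here, not an official CEDEL2 threshold.
high_scoring_english = select(toy_records, l1="English", min_score=30)
print(len(high_scoring_english))
```

Validating the score range at construction time is a design choice of this sketch; it simply catches data-entry slips early when records are built by hand or parsed from a file.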

Table: Native’s variables (linguistic background and task)

Linguistic background variables:
  1. L1 of the native speaker
  2. L1 variety
  3. L1 of the native speaker’s father
  4. L1 of the native speaker’s mother
  5. Language(s) spoken at home
  6. Proficiency level self-evaluation on each skill in foreign language (speaking, listening, writing, reading)
  7. Proficiency level self-evaluation on each skill in additional foreign language (speaking, listening, writing, reading)
  8. Sex
  9. Age
  10. School/University/Educational institution (if any)
  11. Major degree (if any)
  12. Year at university/school (if any)

Task variables:
  1. Task title
  2. Task text (written text/spoken text transcription/audio file)
  3. Approximate time to produce the task (in minutes)
  4. Resources used to produce the task (monolingual dictionary/spellchecker/grammar book/background readings about the task topic (newspapers, internet, TV, etc.))

Corpus design: Proficiency level

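CEDEL2 contains data from learners of Spanish at all proficiency levels (beginner, intermediate, advanced). Unlike other learner corpora that do not contain a standardised measure of the learners’ proficiency, CEDEL2 uses two proficiency-level measurements: a placement test score (1-43 points) and the learner's self-evaluation of their proficiency in each skill (speaking, listening, writing, reading).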

Data collection

Written data were collected via online forms, which means that participants could participate from anywhere in the world. To ensure that all participants understood the forms and the instructions correctly, forms were written in their native language. To see the latest version of the forms, please visit http://learnercorpora.com.

Spoken data were collected in two ways: (i) online via Google Meet or (ii) in situ at the Universidad de Granada, in a quiet room with dedicated audio recording equipment (Audio Technica AT2020: cardioid condenser microphone, 74 dB, 1 kHz at 1 Pa) to guarantee the best possible sound quality in the audio files. For consistency, all data collectors followed a protocol during oral recordings. The audio files were orthographically transcribed and converted into searchable text files. Both the audio files and their transcriptions can be freely downloaded. See the transcription conventions in the tab 'User Guide' > ‘Transcription conventions’.

Other twin corpora

For comparative purposes, in BilinguaLab we created a twin corpus called COREFL (CORpus of English as a Foreign Language), which was designed following the CEDEL2 principles. It contains data from L2 English learners with a wide variety of L1s. Its design allows users to do 'mirror image' comparisons. For example, a given linguistic phenomenon can be explored in both directions (L1⇔L2): L1 English-L2 Spanish (CEDEL2 subcorpus) vs. L1 Spanish-L2 English (COREFL subcorpus). This corpus design feature is called 'bidirectionality'. Other combinations are also possible, e.g., L1 Japanese-L2 Spanish vs L1 Japanese-L2 English, etc. (i.e., same mother tongue but different L2s being acquired).
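
The two comparison designs described above (mirror-image pairs, and same L1 with different L2s) can be expressed as simple lookups over subcorpus labels. The sketch below is illustrative only: `Subcorpus`, `mirror_image`, `same_l1_different_l2` and the `AVAILABLE` list are assumptions built from the examples in the paragraph, not an inventory of either corpus or an existing API.

```python
from typing import List, NamedTuple, Optional


class Subcorpus(NamedTuple):
    corpus: str  # "CEDEL2" or "COREFL"
    l1: str      # mother tongue of the speakers
    l2: str      # language being acquired


# Only the subcorpora named as examples in the text above.
AVAILABLE = [
    Subcorpus("CEDEL2", "English", "Spanish"),
    Subcorpus("CEDEL2", "Japanese", "Spanish"),
    Subcorpus("COREFL", "Spanish", "English"),
    Subcorpus("COREFL", "Japanese", "English"),
]


def mirror_image(sub: Subcorpus,
                 available: List[Subcorpus]) -> Optional[Subcorpus]:
    """Return the subcorpus with L1 and L2 swapped (bidirectionality)."""
    for candidate in available:
        if candidate.l1 == sub.l2 and candidate.l2 == sub.l1:
            return candidate
    return None


def same_l1_different_l2(sub: Subcorpus,
                         available: List[Subcorpus]) -> List[Subcorpus]:
    """Return subcorpora sharing the L1 but targeting another L2."""
    return [c for c in available if c.l1 == sub.l1 and c.l2 != sub.l2]


english_spanish = Subcorpus("CEDEL2", "English", "Spanish")
japanese_spanish = Subcorpus("CEDEL2", "Japanese", "Spanish")

print(mirror_image(english_spanish, AVAILABLE))
print(same_l1_different_l2(japanese_spanish, AVAILABLE))
```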

We are also developing JFLCorp, a corpus of L2 Japanese with natives from different backgrounds. It also follows the same design criteria as CEDEL2.

Other L2 Spanish corpora

CEDEL2 is in line with other international projects where large learner corpora are being created. Of particular interest are SPLLOC (Spanish Learner Language Oral Corpus) at the University of Southampton (UK), CAES (Corpus de Aprendices del Español) at Universidad de Santiago de Compostela (Spain), LANGSNAP (Language and Social Networks Abroad Project), and COWS-L2H (Corpus of Written Spanish, L2 and Heritage Speakers). For a repository of L2 Spanish corpora, see the Indexador de Corpus de Aprendices de Español.
