Corpus design

Rationale

CEDEL2 investigates how people learn Spanish. To that end, we collected a large database (a corpus) of written and spoken texts produced by learners of Spanish. Such a database is called a 'learner corpus' or 'L2 corpus'.

The corpus is intended to be a useful tool for linguists, researchers and teachers/learners of Spanish, as well as those interested in other uses of learner corpora (computational linguists, course material designers, etc.).

Several thousand speakers have participated online from universities and schools all over the world (USA, UK, Japan, Spain, Italy, Germany, Greece, Russia, various Arabic-speaking countries, etc.). You can also participate online at http://learnercorpora.com

Corpus description

CEDEL2 (version 3) is a large corpus that contains samples of the language produced by learners of Spanish as a second language. For comparative purposes, it also contains a native control subcorpus of language produced by native speakers of different varieties of Spanish (peninsular Spanish and all varieties of Latin American Spanish), so it can be used as a native corpus in its own right.

It contains an additional set of native control subcorpora produced by native speakers of other languages (English, Portuguese, Greek, Arabic, Chinese, Japanese, etc.). These are the mother tongues (i.e., the first languages, L1s) of some of the learners. In this way, researchers can check whether the learners' L1 is influencing their L2 Spanish (i.e., whether learners are transferring from their L1).

Therefore, at this stage of CEDEL2 (version 3), we have the following set of control subcorpora: subcorpora of type 1 (the learners' mother tongue) and a subcorpus of type 2 (the learners' target language, i.e., the Spanish native subcorpus). Future versions of CEDEL2 will add further control subcorpora, currently under development, so that there is a type 1 control subcorpus for every learner subcorpus. For detailed statistics, see the Statistics section.

Table: Native control subcorpora in CEDEL2 v.3 (number of words in brackets)

Native control subcorpus 1 (learner's mother tongue)	Learner subcorpus	Native control subcorpus 2 (learner's target language)
L1 English (51,954)	L1 English-L2 Spanish (589,763)	L1 Spanish (445,955)
L1 Japanese (11,533)	L1 Japanese-L2 Spanish (39,732)	L1 Spanish (445,955)
L1 Greek (2,438)	L1 Greek-L2 Spanish (96,297)	L1 Spanish (445,955)
L1 Italian (1,667)	L1 Italian-L2 Spanish (39,235)	L1 Spanish (445,955)
L1 Russian (4,725)	L1 Russian-L2 Spanish (33,589)	L1 Spanish (445,955)
L1 German (25,996)	L1 German-L2 Spanish (20,723)	L1 Spanish (445,955)
L1 Portuguese (7,801)	L1 Portuguese-L2 Spanish (17,578)	L1 Spanish (445,955)
L1 French (1,795)	L1 French-L2 Spanish (17,085)	L1 Spanish (445,955)
L1 Dutch (under development)	L1 Dutch-L2 Spanish (14,665)	L1 Spanish (445,955)
L1 Arabic (1,883)	L1 Arabic-L2 Spanish (12,958)	L1 Spanish (445,955)
L1 Estonian (under development)	L1 Estonian-L2 Spanish (11,852)	L1 Spanish (445,955)
L1 Chinese (1,968)	L1 Chinese-L2 Spanish (10,823)	L1 Spanish (445,955)
L1 Polish (under development)	L1 Polish-L2 Spanish (8,576)	L1 Spanish (445,955)
L1 Turkish (2,287)	L1 Turkish-L2 Spanish (5,023)	L1 Spanish (445,955)
L1 Vietnamese (1,247)	L1 Vietnamese-L2 Spanish (3,358)	L1 Spanish (445,955)

CEDEL2 history

CEDEL2 was designed, created and implemented by Cristóbal Lozano, who has directed the project since its inception in 2005. It originated at the Universidad Autónoma de Madrid in 2005 and since 2006 it has been developed at the Universidad de Granada.

Online data collection started in 2006. Following the standards of Open Data Science, the first version (CEDEL2 v.1) was released in September 2017 under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, and the entire corpus has been freely and publicly available online ever since. For its second version (CEDEL2 v.2), the corpus grew in number of words and subcorpora, and both local (Universidad de Granada, UGR) and international collaborators joined the project (cf. the list of collaborators in the tab ‘CEDEL2 team’). In its current version (CEDEL2 v.3), it contains texts from 4,247 learners of Spanish with 15 typologically diverse L1s and from 2,313 native speakers, amounting to a total of 6,560 files. It is probably one of the largest L2 Spanish corpora of its kind (cf. the tab ‘Statistics’ for further details). The table below presents a quick overview of the different phases CEDEL2 has gone through:

Table: The development of CEDEL2

Phase	Years	Team	No. of speakers	Web interface
1st	2006-2016	C. Lozano	L1 Eng-L2 Spa [1600]; L1 Greek-L2 Spa [173]; Spanish natives [800]; 2,573 total files	May 2016: CEDEL2 beta; Sept 2017: CEDEL2 v. 1
2nd	2017-2020	C. Lozano & BilinguaLab team	Learners [3034]: Arab, Chi, Dutch, Eng, French, German, Greek, Italian, Jap, Port, Russian; Natives [1365]: Arab, Eng, Greek, Jap, Port, Spa; 4,399 total files	Sept 2020: CEDEL2 v. 2
3rd	2020-2025	C. Lozano, BilinguaLab team & NLPgo team	Learners [4247]: Arab, Chi, Dutch, Eng, Estonian, French, German, Greek, Italian, Jap, Polish, Port, Russian, Turkish, Vietnamese; Natives [2313]: Arab, Chi, Eng, French, German, Greek, Italian, Jap, Port, Russian, Spa, Turkish, Vietnamese; 6,560 total files	Oct 2025: CEDEL2 v. 3

Corpus structure (CEDEL2 version 3): subcorpora

CEDEL2 is divided into two major components: the learner subcorpora and the native subcorpora. The 15 learner subcorpora consist of texts (mostly written, but many spoken) produced by learners of Spanish as a second language (L2). These learners are classified into subcorpora according to their mother tongue (i.e., their first language, L1). We have Indo-European languages, which are further subclassified into Germanic (English, German, Dutch), Romance (French, Italian, Portuguese), Hellenic (Greek) and Slavic (Russian, Polish). We also have East-Asian languages (Japanese, Chinese, Korean, Vietnamese), Uralic (Estonian), Turkic (Turkish) and Semitic (Arabic). All these typological similarities and differences make CEDEL2 an ideal L2 corpus for crosslinguistic comparisons to test for L1 influence on L2 Spanish. These subcorpora contain varied learner profiles: all proficiency levels (beginner, intermediate, advanced), different ages, different lengths of residence (LoR) in Spanish-speaking countries, varied ages of onset (AoO) of L2 Spanish, and different learning environments (university learners, high school learners, naturalistic learners).

Table: CEDEL2 learner subcorpora

L2 Spanish learner subcorpora	Words	Documents
L1 Arabic - L2 Spanish	12,958	105
L1 Chinese - L2 Spanish	10,823	61
L1 Dutch - L2 Spanish	14,665	92
L1 English - L2 Spanish	589,763	2,139
L1 Estonian - L2 Spanish	11,852	73
L1 French - L2 Spanish	17,085	118
L1 German - L2 Spanish	20,723	100
L1 Greek - L2 Spanish	96,279	418
L1 Italian - L2 Spanish	39,235	242
L1 Japanese - L2 Spanish	39,732	434
L1 Polish - L2 Spanish	8,576	63
L1 Portuguese - L2 Spanish	17,578	121
L1 Russian - L2 Spanish	33,589	221
L1 Turkish - L2 Spanish	5,023	38
L1 Vietnamese - L2 Spanish	3,358	22

The 13 native subcorpora serve as 'control' data and are used for comparative purposes. In particular, the L1 native Spanish subcorpus amounts to nearly half a million words (445,955 words, 1,705 documents). It can be used as a traditional control subcorpus against which we can compare the language produced by the learners of L2 Spanish, especially to check 'ultimate attainment': whether very advanced and near-native learners of L2 Spanish can ultimately attain a native level. The L1 native Spanish subcorpus contains the language produced by native speakers of Spanish from Spain and from other Spanish-speaking countries (Mexico, Argentina, Colombia, etc.), so, given its large size, it can be used as a corpus of native Spanish in its own right.

Additionally, there are 12 other native control subcorpora, which are used to investigate the properties of the learners' native language (L1) as well as likely L1 transfer, i.e., whether the learners are transferring properties from their mother tongue (L1) onto their L2 Spanish. The subcorpora are listed below.

Table: CEDEL2 native subcorpora

Native control subcorpora	Words	Documents
L1 Arabic	1,883	8
L1 Chinese	1,968	6
L1 English	51,954	230
L1 French	1,795	5
L1 German	25,996	103
L1 Greek	24,380	109
L1 Italian	1,667	8
L1 Japanese	11,533	57
L1 Portuguese	7,801	36
L1 Russian	4,725	20
L1 Spanish	445,955	1,705
L1 Turkish	2,287	21
L1 Vietnamese	1,247	5
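
The document counts in the two subcorpus tables can be cross-checked against the totals reported for CEDEL2 v.3 (4,247 learner files, 2,313 native files, 6,560 files overall). The following is a minimal sketch in Python: the figures are copied from the tables above, while the names `LEARNER_DOCS`, `NATIVE_DOCS`, `corpus_totals` and `missing_l1_controls` are hypothetical and not part of any CEDEL2 tool.

```python
# Document counts copied from the CEDEL2 v.3 subcorpus tables.
# All names below are illustrative assumptions, not a CEDEL2 API.
LEARNER_DOCS = {
    "Arabic": 105, "Chinese": 61, "Dutch": 92, "English": 2139,
    "Estonian": 73, "French": 118, "German": 100, "Greek": 418,
    "Italian": 242, "Japanese": 434, "Polish": 63, "Portuguese": 121,
    "Russian": 221, "Turkish": 38, "Vietnamese": 22,
}
NATIVE_DOCS = {
    "Arabic": 8, "Chinese": 6, "English": 230, "French": 5,
    "German": 103, "Greek": 109, "Italian": 8, "Japanese": 57,
    "Portuguese": 36, "Russian": 20, "Spanish": 1705, "Turkish": 21,
    "Vietnamese": 5,
}


def corpus_totals(learner_docs, native_docs):
    """Return (learner files, native files, all files)."""
    learners = sum(learner_docs.values())
    natives = sum(native_docs.values())
    return learners, natives, learners + natives


def missing_l1_controls(learner_docs, native_docs):
    """List L1s that have a learner subcorpus but no native (type 1) control yet."""
    return sorted(set(learner_docs) - set(native_docs))


print(corpus_totals(LEARNER_DOCS, NATIVE_DOCS))        # (4247, 2313, 6560)
print(missing_l1_controls(LEARNER_DOCS, NATIVE_DOCS))  # ['Dutch', 'Estonian', 'Polish']
```

The three L1s returned by the second function are the ones marked as under development in the table of native control subcorpora.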

Corpus design: Tasks

Tasks 1-12 (see Table below) were used in CEDEL2 (version 1). For the enhancement of the CEDEL2 corpus in subsequent versions (CEDEL2 v.2 and v.3), two of these tasks were kept (2 and 3), and two additional ones were added (13 and 14). Importantly, tasks are not associated with any particular proficiency level, i.e., learners can in principle choose any task independently of their proficiency level.

Table: CEDEL2 tasks

Task number	Task title	Task description
1	Region where you live	What is the region where you live like? ¿Cómo es la región donde vives?
2	Famous Person	Talk about a famous person. Habla de una persona famosa.
3	Film	Summarise a film you have seen recently. Resume una película que has visto recientemente.
4	Last year holidays	What did you do during your holidays last summer? ¿Qué hiciste el año pasado durante las vacaciones?
5	Future plans	What are your plans for the future? ¿Cuáles son tus planes para el futuro?
6	Recent trip	Describe a trip you have recently made. Describe un viaje que has hecho recientemente.
7	Experience	Talk about an experience you have recently had. Cuenta una experiencia que hayas vivido.
8	Terrorism	Talk about the problem of terrorism in the world. Habla del problema del terrorismo en el mundo.
9	Anti-smoking law	What do you think about the new anti-smoking law? ¿Qué opinas de la nueva ley anti-tabaco?
10	Gay couples	Do you think gay couples should have the right to get married and adopt children? ¿Crees que las parejas gay tienen el derecho de casarse y adoptar niños?
11	Marijuana legalization	Do you think marijuana should be legal? ¿Crees que la marihuana se debería legalizar?
12	Immigration	Analyse the main aspects concerning immigration. Analiza los principales aspectos de la inmigración.
13	Frog	Look at the following pictures and retell the story. Tell the story shown in the pictures. You can add new aspects to the story or ignore some aspects in the pictures. Your text should start “Un día / One day...” https://goo.gl/so3S6W Mira las siguientes ilustraciones. Narra una historia basada en las ilustraciones. Puedes añadir ideas nuevas o ignorar algunas que aparezcan en las ilustraciones. Por favor, comienza la historia con la frase: "Un día..." https://goo.gl/so3S6W
14	Chaplin	Watch the following Chaplin video clip and retell the story. Watch the following Chaplin video clip (4 minutes). Summarise the story. You can watch the video clip more than once. https://www.youtube.com/watch?v=4QkTNJFhu-g Mira el siguiente video de Charles Chaplin (4 minutos). Haz un resumen de la historia. Puedes ver el video más de una vez. https://www.youtube.com/watch?v=4QkTNJFhu-g

Corpus design: Variables

CEDEL2 was designed with a second language acquisition (SLA) agenda in mind. For every participant, we collected a large number of variables that are essential for SLA researchers. There are two sets of variables: linguistic background variables and task variables.

Table: Learner’s variables (linguistic background and task)

Linguistic background variables	Task variables
L1 of the learner; L1 of the learner’s father; L1 of the learner’s mother; Language(s) spoken at home; Placement test score (1-43 points); Proficiency level (lower beginner up to upper advanced); Proficiency level self-evaluation on each skill in Spanish (speaking, listening, writing, reading); Proficiency level self-evaluation on each skill in additional foreign language (speaking, listening, writing, reading); Spanish language certificates held, if any; Sex; Age; Age of exposure to L2 Spanish (AoE); Years studying Spanish (Length of Instruction, LoI); Stays in Spanish-speaking countries? (yes/no); Stay(s): Where?; Stay(s): When? (period(s) of residence); Stay(s): How long? (length of residence); School/University/Educational institution (if any); Major degree (if any); Year at university/school (if any)	Task title; Task text (written text/spoken text transcription/audio file); Approximate time to produce the task (in minutes); Where was the task done? (in class/outside class/both); Resources used to produce the task (help from Spanish native/bilingual dictionary/monolingual dictionary/spellchecker/grammar book/background readings/none)

Table: Native’s variables (linguistic background and task)

Linguistic background variables	Task variables
L1 of the native speaker; L1 variety; L1 of the native speaker’s father; L1 of the native speaker’s mother; Language(s) spoken at home; Proficiency level self-evaluation on each skill in foreign language (speaking, listening, writing, reading); Proficiency level self-evaluation on each skill in additional foreign language (speaking, listening, writing, reading); Sex; Age; School/University/Educational institution (if any); Major degree (if any); Year at university/school (if any)	Task title; Task text (written text/spoken text transcription/audio file); Approximate time to produce the task (in minutes); Resources used to produce the task (monolingual dictionary/spellchecker/grammar book/background readings about the task topic (newspapers, internet, TV, etc.))

Corpus design: Proficiency level

CEDEL2 contains data from learners of Spanish at all proficiency levels (beginner, intermediate, advanced). Unlike other learner corpora that do not contain a standardised measure of the learners’ proficiency, CEDEL2 uses two proficiency-level measurements:

Objective measurement: Learners were administered a 43-point standardised placement test (University of Wisconsin, 1998)*, which objectively measures their proficiency level. We classify them according to the following six levels:

Proficiency level	Placement test score	Corresponding % score
Lower beginner	0-12	0%-28%
Upper beginner	13-20	30%-47%
Lower intermediate	21-28	49%-65%
Upper intermediate	29-35	67%-81%
Lower advanced	36-40	84%-93%
Upper advanced	41-43	95%-100%

*University of Wisconsin. (1998). The University of Wisconsin College-Level Placement Test: Spanish (Grammar) Form 96M. University of Wisconsin Press. http://testing.wisc.edu/centerpages/spanishtest.html
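
The mapping from placement test score to proficiency level in the table above can be expressed as a small lookup. The following Python sketch is illustrative only: the score bands are taken from the table, but the names `LEVEL_BANDS`, `proficiency_level` and `percent_score` are hypothetical, and rounding the percentage to the nearest whole number is an assumption that happens to reproduce the percentages shown.

```python
MAX_SCORE = 43  # the placement test is scored out of 43 points

# (lowest score, highest score, level), copied from the table above.
LEVEL_BANDS = [
    (0, 12, "Lower beginner"),
    (13, 20, "Upper beginner"),
    (21, 28, "Lower intermediate"),
    (29, 35, "Upper intermediate"),
    (36, 40, "Lower advanced"),
    (41, 43, "Upper advanced"),
]


def proficiency_level(score: int) -> str:
    """Return the proficiency level for a whole-number placement test score."""
    if not isinstance(score, int) or not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must be an integer between 0 and {MAX_SCORE}")
    for low, high, label in LEVEL_BANDS:
        if low <= score <= high:
            return label
    raise AssertionError("unreachable: the bands cover every score from 0 to 43")


def percent_score(score: int) -> int:
    """Return the score as a percentage of 43, rounded to a whole number (assumed)."""
    return round(100 * score / MAX_SCORE)


print(proficiency_level(30), percent_score(30))  # Upper intermediate 70
```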

Subjective measurement: Learners self-rate their proficiency in Spanish for each of the four skills (speaking, listening, reading, writing) according to a six-point ordinal scale. The subjective measurement for each skill is then transformed into a 1-6 numeric scale and a new variable is created called ‘Proficiency self-assessment’, which is an average of the four observations. For example, suppose a learner self-rates their Spanish as follows: speaking A1, listening B1, reading A2, writing A1. These ordinal values are transformed into their corresponding numeric values: 1, 3, 2, 1. The final average for the variable ‘proficiency self-assessment’ is 1.75 (out of a maximum of 6).

Self-rating ordinal scale	Corresponding numeric value
Lower beginner (A1)	1
Upper beginner (A2)	2
Lower intermediate (B1)	3
Upper intermediate (B2)	4
Lower advanced (C1)	5
Upper advanced (C2)	6
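
The averaging behind the ‘Proficiency self-assessment’ variable can be sketched as follows, using the numeric values from the table above. This is a minimal illustration in Python, not the project's own code: the names `SELF_RATING_VALUES`, `SKILLS` and `proficiency_self_assessment` are hypothetical, as is the choice to key the ratings by CEFR label.

```python
# Ordinal self-ratings and their numeric values, as in the table above.
SELF_RATING_VALUES = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}
SKILLS = ("speaking", "listening", "reading", "writing")


def proficiency_self_assessment(ratings: dict) -> float:
    """Average the four self-rated skills on the 1-6 numeric scale."""
    missing = [skill for skill in SKILLS if skill not in ratings]
    if missing:
        raise ValueError(f"missing self-rating for: {', '.join(missing)}")
    values = [SELF_RATING_VALUES[ratings[skill]] for skill in SKILLS]
    return sum(values) / len(values)


# The worked example from the text: A1, B1, A2, A1 become 1, 3, 2, 1.
example = {"speaking": "A1", "listening": "B1", "reading": "A2", "writing": "A1"}
print(proficiency_self_assessment(example))  # 1.75
```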

Additionally, learners report any Spanish language certificates they hold (e.g., DELE B1). Finally, learners also report any additional foreign languages they know (other than Spanish) and self-rate each of the four skills in those languages on the six-point subjective scale above.

Data collection

Written data were collected via online forms, so participants could take part from anywhere in the world. To ensure that all participants understood the forms and the instructions correctly, the forms were written in their native language. To see the latest version of the forms, please visit http://learnercorpora.com.

Spoken data were collected in two ways: (i) online via Google Meet or (ii) in situ at the Universidad de Granada, in a quiet room with dedicated audio recording equipment (Audio Technica AT2020: cardioid condenser microphone, 74 dB, 1 kHz at 1 Pa) to guarantee the best possible sound quality in the audio files. For consistency, all data collectors followed a protocol during oral recordings. The audio files were orthographically transcribed and converted into searchable text files; both the audio files and their transcriptions can be freely downloaded. See the transcription conventions in the tab 'User Guide' > ‘Transcription conventions’.

Other twin corpora

For comparative purposes, in BilinguaLab we created a twin corpus called COREFL (CORpus of English as a Foreign Language), which was designed following the CEDEL2 principles. It contains data from L2 English learners with a wide variety of L1s. Its design allows users to do 'mirror image' comparisons. For example, a given linguistic phenomenon can be explored in both directions (L1⇔L2): L1 English-L2 Spanish (CEDEL2 subcorpus) vs. L1 Spanish-L2 English (COREFL subcorpus). This corpus design feature is called 'bidirectionality'. Other combinations are also possible, e.g., L1 Japanese-L2 Spanish vs L1 Japanese-L2 English, etc. (i.e., same mother tongue but different L2s being acquired).

We are also developing JFLCorp, a corpus of L2 Japanese with natives from different backgrounds. It also follows the same design criteria as CEDEL2.

Other L2 Spanish corpora

CEDEL2 is in line with other international projects where large learner corpora are being created. Of particular interest are SPLLOC (Spanish Learner Language Oral Corpus) at the University of Southampton (UK), CAES (Corpus de Aprendices del Español) at Universidad de Santiago de Compostela (Spain), LANGSNAP (Language and Social Networks Abroad Project), and COWS-L2H (Corpus of Written Spanish, L2 and Heritage Speakers). For a repository of L2 Spanish corpora, see the Indexador de Corpus de Aprendices de Español.


CEDEL2: Corpus Escrito del Español L2 (version 3)
