Corpus design

Rationale

In CEDEL2 we investigate how people learn Spanish. To that end, we have collected a large database (a corpus) of written (and some spoken) texts produced by learners of Spanish. This is called a ‘learner corpus’ or ‘L2 corpus’.

The corpus is intended to benefit linguists, researchers, and teachers and learners of Spanish, as well as those interested in other uses of learner corpora (computational linguists, course material designers, etc.).

Several thousand speakers have participated online from universities and schools all over the world (USA, UK, Japan, Spain, Italy, Germany, Greece, Russia, various Arabic-speaking countries, etc.). You can also participate online at http://learnercorpora.com

Corpus description

CEDEL2 (version 2) is a large corpus that contains samples of the language produced by learners of Spanish as a second language. For comparative purposes, it also contains a native control subcorpus of the language produced by native speakers of different varieties of Spanish (peninsular Spanish and all varieties of Latin American Spanish), which can be used as a native corpus in its own right.

It contains an additional set of native control subcorpora by native speakers of different languages (English, Portuguese, Greek, Arabic, and Japanese). These are the mother tongues (i.e., the first languages, L1s) of some of the learners. In this way, researchers can also check whether the learners’ L1 is influencing their L2 Spanish (i.e., whether learners are transferring from their L1).

Therefore, at this stage of CEDEL2 (version 2), we have two kinds of control subcorpora: several subcorpora of type 1 (the learners’ mother tongue) and one subcorpus of type 2 (the learners’ target language, i.e., the Spanish native subcorpus). Future versions of CEDEL2 will add the control subcorpora currently under development, so that there is a type 1 control subcorpus for every learner subcorpus.

Table: Native control subcorpora in CEDEL2 v.2

Native control subcorpus 1 (learner's mother tongue)	Learner subcorpus	Native control subcorpus 2 (learner's target language)
L1 English	L1 English-L2 Spanish	L1 Spanish
L1 Portuguese	L1 Portuguese-L2 Spanish	L1 Spanish
L1 Greek	L1 Greek-L2 Spanish	L1 Spanish
L1 Arabic	L1 Arabic-L2 Spanish	L1 Spanish
L1 Japanese	L1 Japanese-L2 Spanish	L1 Spanish
L1 German [under development]	L1 German-L2 Spanish	L1 Spanish
L1 Dutch [under development]	L1 Dutch-L2 Spanish	L1 Spanish
L1 Italian [under development]	L1 Italian-L2 Spanish	L1 Spanish
L1 French [under development]	L1 French-L2 Spanish	L1 Spanish
L1 Russian [under development]	L1 Russian-L2 Spanish	L1 Spanish
L1 Chinese [under development]	L1 Chinese-L2 Spanish	L1 Spanish

CEDEL2 history

CEDEL2 was designed and implemented by Cristóbal Lozano, who has directed the project since 2004. It originated at the Universidad Autónoma de Madrid that year, and since 2006 it has continued at the Universidad de Granada.

Online data collection started in 2006. Following the standards of Open Data Science, the first version (CEDEL2 v.1) was released in September 2017 under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, and the entire corpus has been freely and publicly available online ever since. It contained 2,578 speakers and files in total, collected mainly by Lozano (2,405 speakers, of which 1,609 were L1 English-L2 Spanish learners and 796 Spanish-speaking natives) and by Athanasios Georgopoulos (173 L1 Greek-L2 Spanish learners).

For its second version (CEDEL2 v.2), the corpus has been expanding since 2017 with the inclusion of a large number of subcorpora and the incorporation into the project of both local (Universidad de Granada, UGR) and international collaborators (cf. list of collaborators in the tab ‘CEDEL2 team’). CEDEL2 v.2 currently contains 1,821 new files plus the existing 2,578 files from v.1, for a total of 4,399 written and spoken files from 4,261 participants. CEDEL2 v.2 amounts to over one million words, which currently makes it the largest L2 Spanish corpus of its kind (cf. the tab ‘Statistics’ for further details). CEDEL2 v.2 (this version) was publicly and freely released in July 2020 at http://cedel2.learnercorpora.com.

Corpus structure (CEDEL2 version 2): subcorpora

CEDEL2 is divided into two major components: the learner subcorpora and the native subcorpora. The 11 learner subcorpora consist of texts (mostly written, but some spoken) produced by learners of Spanish as a second language (L2). These learners are classified into subcorpora according to their mother tongue (i.e., their first language, L1). The L1s include Indo-European languages, which are further subclassified into Germanic (English, German, Dutch), Romance (French, Italian, Portuguese), Hellenic (Greek) and Slavic (Russian), as well as East Asian languages (Japanese, Chinese) and Arabic. These typological similarities and differences make CEDEL2 an ideal L2 corpus for crosslinguistic comparisons that test for L1 influence on L2 Spanish. The subcorpora contain learner texts at all proficiency levels (beginner, intermediate, advanced).

Table: CEDEL2 learner subcorpora

L2 Spanish learner subcorpora	Words	Documents
L1 Arabic - L2 Spanish	9,118	74
L1 English - L2 Spanish	558,731	1,931
L1 Chinese - L2 Spanish	4,373	22
L1 Dutch - L2 Spanish	9,069	60
L1 French - L2 Spanish	8,136	58
L1 German - L2 Spanish	16,164	82
L1 Greek - L2 Spanish	64,105	216
L1 Italian - L2 Spanish	14,426	83
L1 Japanese - L2 Spanish	23,049	243
L1 Portuguese - L2 Spanish	21,662	164
L1 Russian - L2 Spanish	16,117	101

The 6 native subcorpora serve as ‘control’ data and are used for comparative purposes. In particular, the L1 native Spanish subcorpus can be used as a traditional control subcorpus against which we can compare the language produced by the learners of L2 Spanish, especially to check ‘ultimate attainment’: whether very advanced and near-native learners of L2 Spanish can ultimately attain a native level. The L1 native Spanish subcorpus contains the language produced by native speakers of Spanish from Spain and from other Spanish-speaking countries (Mexico, Argentina, Colombia, etc.), so it can be used as a corpus of native Spanish in its own right.

Additionally, there are a few native control subcorpora which are used to investigate the properties of the learners’ native language (L1) as well as likely L1 transfer, i.e., whether the learners are transferring properties from their mother tongue (L1) onto their L2 Spanish. These subcorpora are: L1 native English, Portuguese, Greek, Japanese, and Arabic. Future versions of CEDEL2 will include additional control corpora so that there is a control corpus for every L1 of the learners.

Table: CEDEL2 native subcorpora

Native control subcorpora	Words	Documents
L1 Arabic	1,465	6
L1 English	40,805	172
L1 Greek	2,031	12
L1 Japanese	9,126	47
L1 Portuguese	3,348	16
L1 Spanish	304,211	1,112
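
As a rough cross-check of the two tables above, the following Python sketch sums the word and document counts per component. The figures are copied from the tables; the dictionary and function names are purely illustrative and are not part of any CEDEL2 tooling.

```python
# (words, documents) per subcorpus, copied from the two tables above.
# Keys are the speakers' L1; all names here are illustrative assumptions.
LEARNER = {
    "Arabic": (9_118, 74),
    "English": (558_731, 1_931),
    "Chinese": (4_373, 22),
    "Dutch": (9_069, 60),
    "French": (8_136, 58),
    "German": (16_164, 82),
    "Greek": (64_105, 216),
    "Italian": (14_426, 83),
    "Japanese": (23_049, 243),
    "Portuguese": (21_662, 164),
    "Russian": (16_117, 101),
}
NATIVE = {
    "Arabic": (1_465, 6),
    "English": (40_805, 172),
    "Greek": (2_031, 12),
    "Japanese": (9_126, 47),
    "Portuguese": (3_348, 16),
    "Spanish": (304_211, 1_112),
}


def totals(table):
    """Return (total words, total documents) for one table."""
    words = sum(w for w, _ in table.values())
    docs = sum(d for _, d in table.values())
    return words, docs


learner_words, learner_docs = totals(LEARNER)
native_words, native_docs = totals(NATIVE)
print(learner_words + native_words)  # 1105936
print(learner_docs + native_docs)    # 4399
```

The document total agrees with the 4,399 files reported for CEDEL2 v.2, and the word total is consistent with the figure of over one million words.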

Corpus design: Tasks

Tasks 1-12 (see table below) were used in CEDEL2 (version 1). For the second version (CEDEL2 v.2), two of these tasks were kept (2 and 3) and two new ones were added (13 and 14). Importantly, tasks are not associated with any particular proficiency level, i.e., learners can in principle choose any task regardless of their proficiency level.

Table: CEDEL2 tasks

Task number	Task title	Task description
1	Region where you live	What is the region where you live like? ¿Cómo es la región donde vives?
2	Famous Person	Talk about a famous person. Habla de una persona famosa.
3	Film	Summarise a film you have seen recently. Resume una película que has visto recientemente.
4	Last year holidays	What did you do during your holidays last summer? ¿Qué hiciste el año pasado durante las vacaciones?
5	Future plans	What are your plans for the future? ¿Cuáles son tus planes para el futuro?
6	Recent trip	Describe a trip you have recently made. Describe un viaje que has hecho recientemente.
7	Experience	Talk about an experience you have recently had. Cuenta una experiencia que hayas vivido.
8	Terrorism	Talk about the problem of terrorism in the world. Habla del problema del terrorismo en el mundo.
9	Anti-smoking law	What do you think about the new anti-smoking law? ¿Qué opinas de la nueva ley anti-tabaco?
10	Gay couples	Do you think gay couples should have the right to get married and adopt children? ¿Crees que las parejas gay tienen el derecho de casarse y adoptar niños?
11	Marijuana legalization	Do you think marijuana should be legal? ¿Crees que la marihuana se debería legalizar?
12	Immigration	Analyse the main aspects concerning immigration. Analiza los principales aspectos de la inmigración.
13	Frog	Look at the following pictures and retell the story. Tell the story shown in the pictures. You can add new aspects to the story or ignore some aspects in the pictures. Your text should start “Un día / One day...” https://goo.gl/so3S6W Mira las siguientes ilustraciones. Narra una historia basada en las ilustraciones. Puedes añadir ideas nuevas o ignorar algunas que aparezcan en las ilustraciones. Por favor, comienza la historia con la frase: "Un día..." https://goo.gl/so3S6W
14	Chaplin	Watch the following Chaplin video clip and retell the story. Watch the following Chaplin video clip (4 minutes). Summarise the story. You can watch the video clip more than once. https://www.youtube.com/watch?v=4QkTNJFhu-g Mira el siguiente video de Charles Chaplin (4 minutos). Haz un resumen de la historia. Puedes ver el video más de una vez. https://www.youtube.com/watch?v=4QkTNJFhu-g

Corpus design: Variables

CEDEL2 was designed with a second language acquisition (SLA) agenda in mind. For every participant, we collected a large number of variables that are essential for SLA researchers. There are two sets of variables: linguistic background variables and task variables.

Table: Learner’s variables (linguistic background and task)

Linguistic background variables	Task variables
L1 of the learner; L1 of the learner’s father; L1 of the learner’s mother; Language(s) spoken at home; Placement test score (1-43 points); Proficiency level (lower beginner up to upper advanced); Proficiency level self-evaluation on each skill in Spanish (speaking, listening, writing, reading); Proficiency level self-evaluation on each skill in additional foreign language (speaking, listening, writing, reading); Spanish language certificates held, if any; Sex; Age; Age of exposure to L2 Spanish (AoE); Years studying Spanish (Length of Instruction, LoI); Stays in Spanish-speaking countries? (yes/no); Stay(s): Where?; Stay(s): When? (period(s) of residence); Stay(s): How long? (length of residence); School/University/Educational institution (if any); Major degree (if any); Year at university/school (if any)	Task title; Task text (written text/spoken text transcription/audio file); Approximate time to produce the task (in minutes); Where was the task done? (in class/outside class/both); Resources used to produce the task (help from Spanish native/bilingual dictionary/monolingual dictionary/spellchecker/grammar book/background readings/none)

Table: Native’s variables (linguistic background and task)

Linguistic background variables	Task variables
L1 of the native speaker; L1 variety; L1 of the native speaker’s father; L1 of the native speaker’s mother; Language(s) spoken at home; Proficiency level self-evaluation on each skill in foreign language (speaking, listening, writing, reading); Proficiency level self-evaluation on each skill in additional foreign language (speaking, listening, writing, reading); Sex; Age; School/University/Educational institution (if any); Major degree (if any); Year at university/school (if any)	Task title; Task text (written text/spoken text transcription/audio file); Approximate time to produce the task (in minutes); Resources used to produce the task (monolingual dictionary/spellchecker/grammar book/background readings about the task topic (newspapers, internet, TV, etc.))

Corpus design: Proficiency level

CEDEL2 contains data from learners of Spanish at all proficiency levels (beginner, intermediate, advanced). Unlike other learner corpora that do not contain a standardised measure of the learners’ proficiency, CEDEL2 uses two proficiency-level measurements:

Objective measurement: Learners were administered a 43-point standardised placement test (University of Wisconsin, 1998)*, which objectively measures their proficiency level. We classify them according to the following six levels:

Proficiency level	Placement test score	Corresponding % score
Lower beginner	0-12	0%-28%
Upper beginner	13-20	30%-47%
Lower intermediate	21-28	49%-65%
Upper intermediate	29-35	67%-81%
Lower advanced	36-40	84%-93%
Upper advanced	41-43	95%-100%

*University of Wisconsin. (1998). The University of Wisconsin College-Level Placement Test: Spanish (Grammar) Form 96M. University of Wisconsin Press. http://testing.wisc.edu/centerpages/spanishtest.html
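
To make the banding concrete, here is a minimal Python sketch that maps a placement test score to its proficiency level and percentage, using only the score ranges from the table above. The names `BANDS`, `placement_level` and `percent` are hypothetical and do not come from the CEDEL2 project itself.

```python
# Score bands copied from the table above: (lowest score, highest score, level).
# All identifiers are illustrative assumptions, not CEDEL2 code.
BANDS = [
    (0, 12, "Lower beginner"),
    (13, 20, "Upper beginner"),
    (21, 28, "Lower intermediate"),
    (29, 35, "Upper intermediate"),
    (36, 40, "Lower advanced"),
    (41, 43, "Upper advanced"),
]
MAX_SCORE = 43


def placement_level(score: int) -> str:
    """Return the proficiency level for a placement test score (0-43)."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must be between 0 and {MAX_SCORE}")
    for low, high, level in BANDS:
        if low <= score <= high:
            return level
    raise AssertionError("bands should cover every valid score")


def percent(score: int) -> int:
    """Return the score as a percentage of the maximum, rounded to an integer."""
    return round(100 * score / MAX_SCORE)


print(placement_level(30), percent(30))  # Upper intermediate 70
```

Rounding each band boundary with `percent` reproduces the percentage column of the table (for example, 12 points gives 28% and 36 points gives 84%).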

Subjective measurement: Learners self-rate their proficiency in Spanish for each of the four skills (speaking, listening, reading, writing) on a six-point ordinal scale. The rating for each skill is then converted to a 1-6 numeric scale, and a new variable called ‘Proficiency self-assessment’ is created, which is the mean of the four ratings. For example, suppose a learner self-rates their Spanish as follows: speaking A1, listening B1, reading A2, writing A1. These ordinal values are converted to their corresponding numeric values: 1, 3, 2, 1. The resulting value of ‘Proficiency self-assessment’ is (1 + 3 + 2 + 1) / 4 = 1.75 (out of a maximum of 6).

Self-rating ordinal scale	Corresponding numeric value
Lower beginner (A1)	1
Upper beginner (A2)	2
Lower intermediate (B1)	3
Upper intermediate (B2)	4
Lower advanced (C1)	5
Upper advanced (C2)	6
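
The conversion and averaging described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the names `RATING_TO_NUMERIC` and `self_assessment` are assumptions, not identifiers from the corpus.

```python
# Ordinal self-rating labels and their numeric values, from the table above.
# Identifiers are hypothetical and used for illustration only.
RATING_TO_NUMERIC = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}


def self_assessment(speaking: str, listening: str, reading: str, writing: str) -> float:
    """Return the mean of the four skill self-ratings on the 1-6 scale."""
    ratings = (speaking, listening, reading, writing)
    return sum(RATING_TO_NUMERIC[r] for r in ratings) / len(ratings)


# The worked example from the text: speaking A1, listening B1, reading A2, writing A1.
print(self_assessment("A1", "B1", "A2", "A1"))  # 1.75
```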

Additionally, learners report any Spanish language certificates they hold (e.g., DELE B1). Finally, learners also report any additional foreign languages they know (other than Spanish) and self-rate each of the skills on the six-point subjective scale above.

Data collection

Written data were collected via online forms, which means that participants could participate from anywhere in the world. To ensure that all participants understood the forms and the instructions correctly, forms were written in their native language. To see the different forms, please visit http://learnercorpora.com.

Spoken data were collected in situ at the Universidad de Granada in a quiet room with the help of special audio recording equipment (Audio Technica AT2020: Cardioid condenser microphone, 74 dB, 1 kHz at 1 Pa) to guarantee the best quality of sound possible in the audio files. This ensures that phoneticians can download the audio files to perform fine-grained acoustic analyses. For consistency, a protocol was followed by all data collectors during oral recordings. The audio files were transcribed orthographically and converted into text files which are searchable and downloadable, though the audio files can also be downloaded. See the transcription conventions in the tab ‘User Guide’ > ‘Transcription conventions’.

Other twin corpora

For comparative purposes, we created a twin corpus called COREFL (CORpus of English as a Foreign Language), which was designed following the CEDEL2 principles. It contains data from L1 Spanish-L2 English and L1 German-L2 English in such a way that users can do ‘mirror image’ comparisons. For example, a given linguistic phenomenon can be explored in both directions (L1↔L2): L1 English-L2 Spanish (CEDEL2 subcorpus) vs. L1 Spanish-L2 English (COREFL subcorpus). Likewise, L1 German-L2 Spanish (CEDEL2) can be compared with L1 German-L2 English (COREFL). This corpus design feature is called ‘bidirectionality’.

Additionally, COREFL contains a subcorpus of native English, which is shared across CEDEL2 (as a corpus of the learners’ L1) and COREFL (as a corpus of the target language being acquired).

COREFL: Lozano, C., Díaz-Negrillo, A., & Callies, M. (2020). Designing and compiling a learner corpus of written and spoken narratives: COREFL. In C. Bongartz & J. Torregrossa (Eds.), What’s in a Narrative? Variation in Story-Telling at the Interface between Language and Literacy (pp. 9-32). Bern: Peter Lang.

Finally, WriCLE (Written Corpus of Learner English) is a similar L1 Spanish-L2 English corpus that we developed in earlier projects.

WriCLE: Rollinson, P., & Mendikoetxea, A. (2010). Learner corpora and second language acquisition: Introducing WRICLE. In J. L. Bueno Alonso, D. González Álvarez, U. Kirsten Torrado, A. E. Martínez Insua, J. Pérez-Guerra, E. Rama Martínez, & R. Rodríguez Vázquez (Eds.), Analizar datos > Describir variación / Analysing Data > Describing Variation (pp. 1-12). Universidade de Vigo (Servizo de Publicacións).

Other L2 Spanish corpora

CEDEL2 is in line with other international projects where large learner corpora are being created. Of particular interest are SPLLOC (Spanish Learner Language Oral Corpus) at the University of Southampton (UK), CAES (Corpus de Aprendices del Español) at Universidad de Santiago de Compostela (Spain), and LANGSNAP (Language and Social Networks Abroad Project). For a repository of L2 Spanish corpora, see the Indexador de Corpus de Aprendices de Español.

CEDEL2: Corpus Escrito del Español L2 (version 2)