Corpus design
Rationale
In CEDEL2 we investigate how people learn Spanish. To that end, we collected a large database (a corpus) of written (and some spoken) texts produced by learners of Spanish. This is called a ‘learner corpus’ or ‘L2 corpus’.
The corpus is intended to benefit linguists, researchers and teachers/learners of Spanish, as well as those interested in other uses of learner corpora (computational linguists, course material designers, etc.).
Several thousand speakers have participated online from universities and schools all over the world (USA, UK, Japan, Spain, Italy, Germany, Greece, Russia, various Arabic countries, etc.). You can also participate online at http://learnercorpora.com
Corpus description
CEDEL2 (version 2) is a large corpus that contains samples of the language produced by learners of Spanish as a second language. For comparative purposes, it also contains a native control subcorpus of the language produced by native speakers of Spanish from different varieties (peninsular Spanish and all varieties of Latin American Spanish), so it can be used as a native corpus in its own right.
It contains an additional set of native control subcorpora by native speakers of different languages (English, Portuguese, Greek, Arabic, and Japanese). These are the mother tongues (i.e., the first languages, L1s) of some of the learners. In this way, researchers can also check whether the learners’ L1 is influencing their L2 Spanish (i.e., whether learners are transferring from their L1).
Therefore, at this stage of CEDEL2 (version 2), we have two types of control subcorpora: type 1 subcorpora (the learners’ mother tongues) and a type 2 subcorpus (the learners’ target language, i.e., the Spanish native subcorpus). Future versions of CEDEL2 will add further control subcorpora, currently under development, so that there is a type 1 control subcorpus for every learner subcorpus.
Table: Native control subcorpora in CEDEL2 v.2
Native control subcorpus 1 (learner's mother tongue) | Learner subcorpus | Native control subcorpus 2 (learner's target language) |
---|---|---|
L1 English | L1 English-L2 Spanish | L1 Spanish |
L1 Portuguese | L1 Portuguese-L2 Spanish | L1 Spanish |
L1 Greek | L1 Greek-L2 Spanish | L1 Spanish |
L1 Arabic | L1 Arabic-L2 Spanish | L1 Spanish |
L1 Japanese | L1 Japanese-L2 Spanish | L1 Spanish |
L1 German [under development] | L1 German-L2 Spanish | L1 Spanish |
L1 Dutch [under development] | L1 Dutch-L2 Spanish | L1 Spanish |
L1 Italian [under development] | L1 Italian-L2 Spanish | L1 Spanish |
L1 French [under development] | L1 French-L2 Spanish | L1 Spanish |
L1 Russian [under development] | L1 Russian-L2 Spanish | L1 Spanish |
L1 Chinese [under development] | L1 Chinese-L2 Spanish | L1 Spanish |
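The design in the table above can be sketched as a small lookup that, for a given learner L1, returns the learner subcorpus and the control subcorpora that go with it. This is only an illustration: the names `AVAILABLE_L1_CONTROLS`, `LEARNER_L1S` and `comparison_set` are hypothetical and are not part of any CEDEL2 tool.

```python
# L1s whose type 1 control subcorpus is already available in CEDEL2 v.2.
AVAILABLE_L1_CONTROLS = {"English", "Portuguese", "Greek", "Arabic", "Japanese"}

# L1s whose type 1 control subcorpus is listed as "under development".
UNDER_DEVELOPMENT = {"German", "Dutch", "Italian", "French", "Russian", "Chinese"}

LEARNER_L1S = AVAILABLE_L1_CONTROLS | UNDER_DEVELOPMENT


def comparison_set(l1: str) -> dict:
    """Return the subcorpora relevant to one learner L1 (hypothetical helper)."""
    if l1 not in LEARNER_L1S:
        raise ValueError(f"No learner subcorpus for L1 {l1!r}")
    return {
        "learner": f"L1 {l1}-L2 Spanish",
        # Type 1 control: the learners' mother tongue, if already available.
        "control_l1": f"L1 {l1}" if l1 in AVAILABLE_L1_CONTROLS else None,
        # Type 2 control: the target language, shared by all learner subcorpora.
        "control_target": "L1 Spanish",
    }


print(comparison_set("Greek"))
print(comparison_set("German"))
```

For an L1 such as German, the sketch returns `None` for the type 1 control, reflecting that this subcorpus is still under development in version 2.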
CEDEL2 history
CEDEL2 was designed and implemented by Cristóbal Lozano, who has directed the project since 2004. It originated at the Universidad Autónoma de Madrid in that year and since 2006 it has been continued and implemented at the Universidad de Granada.
Online data collection started in 2006. Following the standards of Open Data Science, the first version (CEDEL2 v.1) was released in September 2017 with a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License and the entire corpus has been freely and publicly available online ever since. It contained 2,578 speakers and files in total, collected mainly by Lozano (2,405 speakers, of which 1,609 were L1 English-L2 Spanish learners and 796 Spanish-speaking natives) and by Athanasios Georgopoulos (173 L1 Greek-L2 Spanish learners).
For its second version (CEDEL2 v.2), the corpus has been expanding since 2017 with the inclusion of a large number of subcorpora and the incorporation into the project of both local (Universidad de Granada, UGR) and international collaborators (cf. list of collaborators in the tab ‘CEDEL2 team’). CEDEL2 v.2 currently contains 1,821 new files plus the existing 2,578 files from v.1, for a total of 4,399 written and spoken files from 4,261 participants. CEDEL2 v.2 amounts to over one million words, which currently makes it the largest L2 Spanish corpus of its kind (cf. the tab ‘Statistics’ for further details). CEDEL2 v.2 (this version) was publicly and freely released in July 2020 at http://cedel2.learnercorpora.com.
Corpus structure (CEDEL2 version 2): subcorpora
CEDEL2 is divided into two major components: the learner subcorpora and the native subcorpora. The 11 learner subcorpora consist of texts (mostly written, but some spoken) produced by learners of Spanish as a second language (L2). These learners are classified into subcorpora according to their mother tongue (i.e., their first language, L1). We have Indo-European languages, which are further subclassified into Germanic (English, German, Dutch), Romance (French, Italian, Portuguese), Hellenic (Greek) and Slavic (Russian). We also have East-Asian languages (Japanese, Chinese) and Arabic. All these typological similarities and differences make CEDEL2 an ideal L2 corpus for crosslinguistic comparisons to test for L1 influence on L2 Spanish. These subcorpora contain learner texts of all proficiency levels (beginner, intermediate, advanced).
Table: CEDEL2 learner subcorpora
L2 Spanish learner subcorpora | Words | Documents |
---|---|---|
L1 Arabic - L2 Spanish | 9,118 | 74 |
L1 English - L2 Spanish | 558,731 | 1,931 |
L1 Chinese - L2 Spanish | 4,373 | 22 |
L1 Dutch - L2 Spanish | 9,069 | 60 |
L1 French - L2 Spanish | 8,136 | 58 |
L1 German - L2 Spanish | 16,164 | 82 |
L1 Greek - L2 Spanish | 64,105 | 216 |
L1 Italian - L2 Spanish | 14,426 | 83 |
L1 Japanese - L2 Spanish | 23,049 | 243 |
L1 Portuguese - L2 Spanish | 21,662 | 164 |
L1 Russian - L2 Spanish | 16,117 | 101 |
The 6 native subcorpora serve as ‘control’ data and are used for comparative purposes. In particular, the L1 native Spanish subcorpus can be used as a traditional control subcorpus against which we can compare the language produced by the learners of L2 Spanish, especially to check ‘ultimate attainment’: whether very advanced and near-native learners of L2 Spanish can ultimately attain a native level. The L1 native Spanish subcorpus contains the language produced by native speakers of Spanish from Spain and from other Spanish-speaking countries (Mexico, Argentina, Colombia, etc.), so it can be used as a corpus of native Spanish in its own right.
Additionally, there are several native control subcorpora, which are used to investigate the properties of the learners’ native language (L1) as well as likely L1 transfer, i.e., whether the learners are transferring properties from their mother tongue (L1) onto their L2 Spanish. These subcorpora are: L1 native English, Portuguese, Greek, Japanese, and Arabic. Future versions of CEDEL2 will include additional control corpora so that there is a control corpus for every learner L1.
Table: CEDEL2 native subcorpora
Native control subcorpora | Words | Documents |
---|---|---|
L1 Arabic | 1,465 | 6 |
L1 English | 40,805 | 172 |
L1 Greek | 2,031 | 12 |
L1 Japanese | 9,126 | 47 |
L1 Portuguese | 3,348 | 16 |
L1 Spanish | 304,211 | 1,112 |
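As a quick consistency check, the word and document counts in the two tables above can be summed and compared with the overall figures reported for CEDEL2 v.2 (4,399 files and over one million words). The sketch below only re-uses the numbers from the tables; the names `LEARNER`, `NATIVE` and `totals` are illustrative.

```python
# (words, documents) per subcorpus, copied from the two tables above.
LEARNER = {
    "Arabic": (9_118, 74),
    "English": (558_731, 1_931),
    "Chinese": (4_373, 22),
    "Dutch": (9_069, 60),
    "French": (8_136, 58),
    "German": (16_164, 82),
    "Greek": (64_105, 216),
    "Italian": (14_426, 83),
    "Japanese": (23_049, 243),
    "Portuguese": (21_662, 164),
    "Russian": (16_117, 101),
}
NATIVE = {
    "Arabic": (1_465, 6),
    "English": (40_805, 172),
    "Greek": (2_031, 12),
    "Japanese": (9_126, 47),
    "Portuguese": (3_348, 16),
    "Spanish": (304_211, 1_112),
}


def totals(subcorpora: dict) -> tuple:
    """Return (total words, total documents) for a table of subcorpora."""
    words = sum(w for w, _ in subcorpora.values())
    docs = sum(d for _, d in subcorpora.values())
    return words, docs


learner_words, learner_docs = totals(LEARNER)
native_words, native_docs = totals(NATIVE)

print(f"Learner: {learner_words:,} words in {learner_docs:,} documents")
print(f"Native:  {native_words:,} words in {native_docs:,} documents")
print(f"Total:   {learner_words + native_words:,} words "
      f"in {learner_docs + native_docs:,} documents")
```

The learner subcorpora add up to 744,950 words in 3,034 documents and the native subcorpora to 360,986 words in 1,365 documents, i.e. 1,105,936 words in 4,399 documents overall, which agrees with the totals given in the corpus history.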
Corpus design: Tasks
Tasks 1-12 (see Table below) were used in CEDEL2 (version 1). For the enhancement of the CEDEL2 corpus in its 2nd version (CEDEL2 v.2), two of these tasks were kept (2 and 3), and two additional ones were added (13 and 14). Importantly, tasks are not associated with any particular proficiency level, i.e., learners can in principle choose any task independently of their proficiency level.
Table: CEDEL2 tasks
Task number | Task title | Task description |
---|---|---|
1 | Region where you live | What is the region where you live like? ¿Cómo es la región donde vives? |
2 | Famous Person | Talk about a famous person. Habla de una persona famosa. |
3 | Film | Summarise a film you have seen recently. Resume una película que has visto recientemente. |
4 | Last year holidays | What did you do during your holidays last summer? ¿Qué hiciste el año pasado durante las vacaciones? |
5 | Future plans | What are your plans for the future? ¿Cuáles son tus planes para el futuro? |
6 | Recent trip | Describe a trip you have recently made. Describe un viaje que has hecho recientemente. |
7 | Experience | Talk about an experience you have recently had. Cuenta una experiencia que hayas vivido. |
8 | Terrorism | Talk about the problem of terrorism in the world. Habla del problema del terrorismo en el mundo. |
9 | Anti-smoking law | What do you think about the new anti-smoking law? ¿Qué opinas de la nueva ley anti-tabaco? |
10 | Gay couples | Do you think gay couples should have the right to get married and adopt children? ¿Crees que las parejas gay tienen el derecho de casarse y adoptar niños? |
11 | Marijuana legalization | Do you think marijuana should be legal? ¿Crees que la marihuana se debería legalizar? |
12 | Immigration | Analyse the main aspects concerning immigration. Analiza los principales aspectos de la inmigración. |
13 | Frog | Look at the following pictures and retell the story. Tell the story shown in the pictures. You can add new aspects to the story or ignore some aspects in the pictures. Your text should start “Un día / One day...” https://goo.gl/so3S6W Mira las siguientes ilustraciones. Narra una historia basada en las ilustraciones. Puedes añadir ideas nuevas o ignorar algunas que aparezcan en las ilustraciones. Por favor, comienza la historia con la frase: "Un día..." https://goo.gl/so3S6W |
14 | Chaplin | Watch the following Chaplin video clip and retell the story. Watch the following Chaplin video clip (4 minutes). Summarise the story. You can watch the video clip more than once. https://www.youtube.com/watch?v=4QkTNJFhu-g Mira el siguiente video de Charles Chaplin (4 minutos). Haz un resumen de la historia. Puedes ver el video más de una vez. https://www.youtube.com/watch?v=4QkTNJFhu-g |
Corpus design: Variables
CEDEL2 was designed with a second language acquisition (SLA) agenda in mind. For every participant, we collected a large number of variables that are essential for SLA researchers. There are two sets of variables: linguistic background variables and task variables.
Table: Learner’s variables (linguistic background and task)
Linguistic background variables | Task variables |
---|---|
|
|
Table: Native’s variables (linguistic background and task)
Linguistic background variables | Task variables |
---|---|
|
|
Corpus design: Proficiency level
CEDEL2 contains data from learners of Spanish at all proficiency levels (beginner, intermediate, advanced). Unlike other learner corpora that do not contain a standardised measure of the learners’ proficiency, CEDEL2 uses two proficiency-level measurements:
- Objective measurement: Learners were administered a 43-point standardised placement test (University of Wisconsin, 1998)*, which objectively measures their proficiency level. We classify them according to the following six levels:
Proficiency level | Placement test score | Corresponding % score |
---|---|---|
Lower beginner | 0-12 | 0%-28% |
Upper beginner | 13-20 | 30%-47% |
Lower intermediate | 21-28 | 49%-65% |
Upper intermediate | 29-35 | 67%-81% |
Lower advanced | 36-40 | 84%-93% |
Upper advanced | 41-43 | 95%-100% |
- Subjective measurement: Learners self-rate their proficiency in Spanish for each of the four skills (speaking, listening, reading, writing) according to a six-point ordinal scale. The subjective measurement for each skill is then transformed into a 1-6 numeric scale and a new variable is created called ‘Proficiency self-assessment’, which is an average of the four observations. For example, suppose a learner self-rates their Spanish as follows: speaking A1, listening B1, reading A2, writing A1. These ordinal values are transformed into their corresponding numeric values: 1, 3, 2, 1. The final average for the variable ‘proficiency self-assessment’ is 1.75 (out of a maximum of 6).
Self-rating ordinal scale | Corresponding numeric value |
---|---|
Lower beginner (A1) | 1 |
Upper beginner (A2) | 2 |
Lower intermediate (B1) | 3 |
Upper intermediate (B2) | 4 |
Lower advanced (C1) | 5 |
Upper advanced (C2) | 6 |
Additionally, learners report on any Spanish language certificates they may hold (e.g., DELE B1). Finally, learners also report on any other additional foreign languages they know (other than Spanish) and self-rate themselves on each of the skills according to the 6-point subjective scale above.
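The two measurements described above can be sketched in a few lines of Python. This is a minimal illustration of the score bands and the averaging step as stated in this section, not the code used by the CEDEL2 team; the names `placement_level` and `self_assessment` are hypothetical.

```python
# Upper bound of each placement-test band (out of 43), as listed above.
PLACEMENT_BANDS = [
    (12, "Lower beginner"),
    (20, "Upper beginner"),
    (28, "Lower intermediate"),
    (35, "Upper intermediate"),
    (40, "Lower advanced"),
    (43, "Upper advanced"),
]

# Numeric value of each self-rating on the six-point ordinal scale.
SELF_RATING = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}


def placement_level(score: int) -> str:
    """Map a placement test score (0-43) to one of the six proficiency levels."""
    if not 0 <= score <= 43:
        raise ValueError("Placement test score must be between 0 and 43")
    for upper_bound, level in PLACEMENT_BANDS:
        if score <= upper_bound:
            return level
    raise AssertionError("unreachable: all scores up to 43 are covered")


def self_assessment(speaking: str, listening: str, reading: str, writing: str) -> float:
    """Average the four self-ratings after converting them to the 1-6 scale."""
    ratings = (speaking, listening, reading, writing)
    return sum(SELF_RATING[r] for r in ratings) / len(ratings)


print(placement_level(30))                       # Upper intermediate
print(self_assessment("A1", "B1", "A2", "A1"))   # 1.75
```

The second call reproduces the worked example in the text: the ratings A1, B1, A2, A1 become 1, 3, 2, 1, whose average is 1.75.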
Data collection
Written data were collected via online forms, so participants could take part from anywhere in the world. To ensure that all participants understood the forms and the instructions correctly, the forms were written in their native language. To see the different forms, please visit http://learnercorpora.com.
Spoken data were collected in situ at the Universidad de Granada in a quiet room with the help of special audio recording equipment (Audio Technica AT2020: Cardioid condenser microphone, 74 dB, 1 kHz at 1 Pa) to guarantee the best quality of sound possible in the audio files. This ensures that phoneticians can download the audio files to perform fine-grained acoustic analyses. For consistency, a protocol was followed by all data collectors during oral recordings. The audio files were transcribed orthographically and converted into text files which are searchable and downloadable, though the audio files can also be downloaded. See the transcription conventions in the tab ‘User Guide’ > ‘Transcription conventions’.
Other twin corpora
For comparative purposes, we created a twin corpus called COREFL (CORpus of English as a Foreign Language), which was designed following the CEDEL2 principles. It contains data from L1 Spanish-L2 English and L1 German-L2 English in such a way that users can do ‘mirror image’ comparisons. For example, a given linguistic phenomenon can be explored in both directions (L1↔L2): L1 English-L2 Spanish (CEDEL2 subcorpus) vs. L1 Spanish-L2 English (COREFL subcorpus). Similarly, L1 German-L2 Spanish (CEDEL2) can be compared with L1 German-L2 English (COREFL). This corpus design feature is called ‘bidirectionality’.
Additionally, COREFL contains a subcorpus of native English, which is shared across CEDEL2 (as a corpus of the learners’ L1) and COREFL (as a corpus of the target language being acquired).
COREFL: Lozano, C., Díaz-Negrillo, A., & Callies, M. (2020). Designing and compiling a learner corpus of written and spoken narratives: COREFL. In C. Bongartz & J. Torregrossa (Eds.), What’s in a Narrative? Variation in Story-Telling at the Interface between Language and Literacy (pp. 9-32). Bern: Peter Lang.
Finally, WriCLE (Written Corpus of Learner English) is a similar L1 Spanish - L2 English corpus that we have developed in earlier projects.
WriCLE: Rollinson, P., & Mendikoetxea, A. (2010). Learner corpora and second language acquisition: Introducing WRICLE. In J. L. Bueno Alonso, D. González Álvarez, U. Kirsten Torrado, A. E. Martínez Insua, J. Pérez-Guerra, E. Rama Martínez, & R. Rodríguez Vázquez (Eds.), Analizar datos > Describir variación / Analysing Data > Describing Variation (pp. 1-12). Universidade de Vigo (Servizo de Publicacións).
Other L2 Spanish corpora
CEDEL2 is in line with other international projects where large learner corpora are being created. Of particular interest are SPLLOC (Spanish Learner Language Oral Corpus) at the University of Southampton (UK), CAES (Corpus de Aprendices del Español) at Universidad de Santiago de Compostela (Spain), and LANGSNAP (Language and Social Networks Abroad Project). For a repository of L2 Spanish corpora, see the Indexador de Corpus de Aprendices de Español.