CEDEL2: Corpus Escrito del Español L2 (version 2)

Sept. 2020

Transcription conventions

Transcription of the spoken data

The transcripts are orthographic and record only basic properties of spoken language. For example, pauses are marked but their length is not. The spoken language features marked in the transcripts include pauses, false starts and incomprehensible words, among others. The aim is for the transcription to be as legible as possible to a wide range of users.

In this version of CEDEL2 (version 2), transcriptions are provided only when the spoken texts are in Spanish (i.e., Spanish native subcorpus and all L2 Spanish subcorpora) or in English (i.e., English native subcorpus).

Table: Transcription conventions

| Phenomenon | Code | Explanation | Examples |
|---|---|---|---|
| Empty pauses | / | Used only for very obvious pauses that show a clear flat line in the waveform, regardless of their length. A pause may coincide with a clause boundary (i.e., the end of a clause), but often it does not. | (1) Chaplin dice que / el viejo necesita tomar al bebé (2) da la bebé a un / viejo hombre |
| Filled pauses | uh (English); eh (Spanish) | The sound produced in the filled pause may be of different kinds, including uh, eh, er, em, erm, etc. | (1) el vídeo termina con el hombre / eh feliz (2) al principio del vídeo eh Charlie Chaplin está andando por la calle |
| Non-linguistic occurrences | hhh | Unspecified non-linguistic occurrence, which can include laughing, coughing, clearing one’s throat, sighing and deep breathing. | (1) Charlie Chaplin está andando por la calle hhh y un una cosa eh se cayó por el cielo (2) y por eso lo pone hhh el bebé en este coche |
| Incomprehensible or unintelligible word(s) | xxx | | (1) el hombre encuentra un / eh / papel con un / una frase xxx / ayuda por / la pobre bebé |
| False starts and cut-off words | = | The symbol marks a false start or a cut-off word and is inserted immediately after the unfinished word. | (1) Chaplin sa= sacó el bebé (2) el mujer tiene mucho más espacio / pa= en su carrito para el hhh otro bebé |
| Repetitions | | Not tagged or marked in any way. The transcription simply reflects what the speaker says. A repetition may involve a single word or a sequence of words. | (1) Charles hhh di un bebé / a otra otra hhh hombre (2) y se pone eh se pone (3) y se lo lleva / hhh eh se lo lleva |
| Rewordings or reformulations | | Not tagged or marked in any way. The transcription simply reflects what the speaker says. | (1) mete a ese niño en el al niño en el coche capota (2) están tirando basura por todos los hhh por todas partes |
| Capitalization | | Capital letters are used for proper names and for acronyms. | Charles Chaplin |
| Sound lengthening | | Lengthened phonemes are not transcribed or annotated in any way. | |
| Intonation and punctuation | | The transcription does not include any of the standard punctuation used in written language, such as a full stop (.) to mark the boundary between sentences, a comma (,) to indicate a pause, or a question mark (?) to indicate rising/falling intonation. | |
| Foreign word(s) and codeswitches | | They are transcribed as such. | [No examples attested in the L2 Spanish spoken data, though there are cases in the written texts] |
| Contractions | | English: contractions are transcribed as such. Spanish: no contractions used. | I’m, there’s, they’re, don’t, wanna, gonna, gotta, kinda, ‘cos, ‘n’ (as a contracted form of ‘and’: he saw a man walking around ‘n’ asked the man). |
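
The codes above lend themselves to simple automatic processing. The following Python sketch is a hypothetical illustration, not part of the CEDEL2 tooling: it assumes that every code (/, eh, uh, hhh, xxx) appears as a separate whitespace-delimited token and that a cut-off word ends in =. The function names and label names are invented for this example.

```python
# Hypothetical helper, not part of CEDEL2: labels each token of a transcript
# line according to the codes listed in the conventions table.
# Assumption: codes are separated from words by whitespace.

FILLED_PAUSES = {"eh", "uh"}  # eh (Spanish), uh (English)


def classify_token(token):
    """Return an illustrative label for one whitespace-separated token."""
    if token == "/":
        return "empty_pause"
    if token in FILLED_PAUSES:
        return "filled_pause"
    if token == "hhh":
        return "non_linguistic"
    if token == "xxx":
        return "unintelligible"
    if len(token) > 1 and token.endswith("="):
        return "cut_off"
    return "word"


def annotate(line):
    """Pair every token in a transcript line with its label."""
    return [(token, classify_token(token)) for token in line.split()]


def lexical_words(line):
    """Keep only the tokens that are not transcription codes or cut-offs."""
    return [token for token, label in annotate(line) if label == "word"]


if __name__ == "__main__":
    print(annotate("Chaplin sa= sacó el bebé"))
    print(lexical_words("el vídeo termina con el hombre / eh feliz"))
```

Because repetitions and reformulations are not marked, a filter of this kind leaves them in the output as ordinary words.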

Transliteration (Japanese native written subcorpus)

The data for the Japanese native subcorpus was gathered in the Japanese kana-kanji majiri script. The original texts remain available, but they were also transliterated into the Latin script (rōmaji), with spaces introduced between words (wakachigaki), for analytical, statistical and comparative purposes. This makes it possible, for example, to establish word counts and type/token ratios that are comparable with those of other subcorpora within CEDEL2. Technically, the procedure is closer to transcription than to transliteration, because there is no one-to-one match between the original Japanese written representations and those in the Latin script; the word ‘transliteration’ is nevertheless used for practical purposes.
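
As an illustration of why word spacing matters, the sketch below computes a word count and a type/token ratio by splitting on whitespace, which is only meaningful once wakachigaki has been applied. It is a minimal example under stated assumptions, not the procedure used by the CEDEL2 team: tokens are taken to be whitespace-delimited, case is ignored, and the function names are invented here.

```python
# Minimal sketch, not the CEDEL2 procedure: word count and type/token ratio
# for a text whose words are already separated by spaces.
# Assumptions: whitespace tokenization, case-insensitive comparison of types.


def word_count(text):
    """Number of whitespace-separated tokens."""
    return len(text.split())


def type_token_ratio(text):
    """Distinct tokens divided by total tokens (0.0 for an empty text)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Illustrative string, not a sentence from the corpus.
    sample = "kono ko kono hito"
    print(word_count(sample), type_token_ratio(sample))
```

Without spaces between words, the same string would count as a single token, so the ratio could not be compared with those of the other subcorpora.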

These are the criteria used for transliterating from the Japanese kana-kanji majiri writing system to the Latin script (rōmaji) and wakachigaki (spacing between words):

Table: Transliteration from native Japanese

| Phenomenon | Description | Examples |
|---|---|---|
| The mora ん / ン | The letter n for the Japanese mora ん / ン is kept before bilabial consonants such as p, b or m. | janpu; akan |
| Vowel elongation | Vowel elongations are indicated with a macron above the lengthened vowel, except for the elongation of i, which is represented by repeating the letter. | rōjin, but kawaii |
| Aphaeresis/syncopation | Aphaeresis or syncopation characteristic of informal language style is marked with an apostrophe (’). | modotte ’ta > modotte ita; kurumarete ’ru > kurumarete iru; iru n’ da > iru no da |
| Hyphens | Hyphens are used for honorific forms of address such as -san, -kun, -chan, etc. They are not used to form compound words. | kero-san, but akachan (lexicalized to mean “baby”) |
| Verb inflection | Verb inflectional morphemes are attached, not hyphenated. | suterarete [sute (“throw away”) + rare (passive voice) + te (gerundive)] |
| Auxiliary verbs following a gerundive | Auxiliary verbs iru, aru, aruku, iku, kuru, shimau, etc. following a lexical verb in its gerundive form (-te / -de) are detached when the combination is interpreted as periphrastic (inchoative, durative, conclusive, etc.). Otherwise, the combination is regarded as a single lexical unit. | Detached: aruite iru [aruku (“walk”) + iru (“be”)]. Single unit: yattekuru [yatte (“do”) + kuru (“come”)], meaning “pop up”, “come along”, “come around”. |
| Compound verbs | Compound verbs with the first verb in its ren’yōkei 連用形 form (e.g., aruki, from aruku) are written as single lexical units. When the verb combination is interpreted as two juxtaposed clauses, the verbs are written separately. | Single unit: arukisaru [aruki + saru]. Separate: mitsuke yomu, where two separate and sequential actions are described. |
| Noun + suru | A space is added between the auxiliary verb suru, used to create denominal verbs, and the preceding noun. | gekido suru |
| Particles | Particles are all detached from the preceding and/or following words, except for those that are lexicalized. | Detached: tegami ga; hitotachi ni; akachan o; obasan wa; hitome de. Lexicalized: dokoka; itsuka; dareka; ikuraka; nanika. |
| Subordinating particle to | The subordinating particle to is always detached from the preceding word. | fu to miru; noseyō to suru |
| Demonstratives | Demonstratives are separated from the following noun. | kono ko; sono go; sono mama; ano hito |
| 先程 | saki and hodo are written separately. | saki hodo |
| ~では | The de wa form is rendered separately in order to distinguish the gerundive from the topic particle. | kazoku de wa nai |
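
Some of the criteria above leave visible marks on individual tokens: a hyphen before an honorific, an apostrophe in a reduced form, a macron on a long vowel. The following Python sketch shows how such marks might be detected in a transliterated text. It is a hypothetical illustration and not part of CEDEL2. The honorific list contains only the three suffixes named in the table (the table's "etc." means it is incomplete), long i written as ii is deliberately not detected because the letter sequence alone is ambiguous, and all names in the code are invented for this example.

```python
import re

# Hypothetical sketch, not part of CEDEL2: detects surface markers that the
# transliteration criteria place on individual romaji tokens.
# Assumption: only the honorific suffixes named in the table are covered.
HONORIFIC = re.compile(r".+-(san|kun|chan)$")
MACRON_VOWELS = set("āēōū")  # long i is written ii and is not detected here
APOSTROPHES = ("'", "\u2019")  # straight and typographic apostrophe


def surface_markers(token):
    """Return the set of convention markers visible on one token."""
    token = token.lower()
    markers = set()
    if HONORIFIC.match(token):
        markers.add("honorific")
    if any(mark in token for mark in APOSTROPHES):
        markers.add("reduced_form")
    if any(char in MACRON_VOWELS for char in token):
        markers.add("long_vowel")
    return markers


if __name__ == "__main__":
    for token in ["kero-san", "akachan", "rōjin", "\u2019ta", "noseyō"]:
        print(token, sorted(surface_markers(token)))
```

Note that akachan yields no marker, in line with the table: it is lexicalized and written without a hyphen.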