Transcription conventions

Transcription of the spoken data

The transcripts are orthographic and record only basic properties of spoken language. For example, pauses are marked but their length is not. The features marked include pauses, false starts and incomprehensible words, among others. The aim is for the transcription to be as legible as possible to a wide range of users.

In this version of CEDEL2 (version 2), transcriptions are provided only when the spoken texts are in Spanish (i.e., Spanish native subcorpus and all L2 Spanish subcorpora) or in English (i.e., English native subcorpus).

Table: Transcription conventions

Phenomenon	Code	Explanation	Examples
Empty pauses	/	Used only for very obvious pauses that show a clear flat line in the waveform, regardless of their length. A pause may coincide with a clause boundary (i.e., the end of a clause) but often it does not.	(1) Chaplin dice que / el viejo necesita tomar al bebé (2) da la bebé a un / viejo hombre
Filled pauses	uh (English) eh (Spanish)	The sound produced in the filled pause may be of different kinds, including uh, eh, er, em, erm, etc.	(1) el vídeo termina con el hombre / eh feliz (2) al principio del vídeo eh Charlie Chaplin está andando por la calle
Non-linguistic occurrences	hhh	Unspecified non-linguistic occurrence, which can include: laughing, coughing, clearing one’s throat, sighing, deep breathing.	(1) Charlie Chaplin está andando por la calle hhh y un una cosa eh se cayó por el cielo (2) y por eso lo pone hhh el bebé en este coche
Incomprehensible or unintelligible word(s)	xxx		(1) el hombre encuentra un / eh / papel con un / una frase xxx / ayuda por / la pobre bebé
False starts and Cut-off words	=	The symbol marks a false start or a cut-off word and is inserted immediately after the unfinished word.	(1) Chaplin sa= sacó el bebé (2) el mujer tiene mucho más espacio / pa= en su carrito para el hhh otro bebé
Repetitions		They are not tagged or marked in any way. The transcription simply reflects what the speaker says. A repetition may involve a single word or a sequence of words.	(1) Charles hhh di un bebé / a otra otra hhh hombre (2) y se pone eh se pone (3) y se lo lleva / hhh eh se lo lleva
Rewordings or Reformulations		They are not tagged or marked in any way. The transcription simply reflects what the speaker says.	(1) mete a ese niño en el al niño en el coche capota (2) están tirando basura por todos los hhh por todas partes
Capitalization		Capital letters are used for proper names and for acronyms.	Charles Chaplin London USA
Sound lengthening		Lengthened phonemes are not transcribed or annotated in any way.
Intonation and punctuation		The transcription does not include any of the standard punctuation used in written language, such as a full stop (.) to mark the boundary between sentences, a comma (,) to indicate a pause, or a question mark (?) to indicate rising/falling intonation.
Foreign word(s) and Codeswitches		They are transcribed as such.	[No examples attested in the L2 Spanish spoken data, though there are cases in the written texts]
Contractions	’	English: contractions are transcribed as such. Spanish: no contractions used.	I’m, there’s, they’re, don’t, wanna, gonna, gotta, kinda, ‘cos, ‘n’ (as a contracted form of ‘and’: he saw a man walking around ‘n’ asked the man).
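
As an illustration of how the codes in the table above might be handled in practice, the following Python sketch labels each token of a transcript line. It is only a sketch under stated assumptions: the function names and label strings are hypothetical and not part of CEDEL2, tokens are assumed to be separated by whitespace, and eh/uh are assumed to always stand for filled pauses.

```python
# Hypothetical helper: label tokens according to the transcription codes
# listed in the table (/, uh/eh, hhh, xxx, and a trailing =).
FILLED_PAUSES = {"uh", "eh"}  # uh (English), eh (Spanish)


def classify_token(token: str) -> str:
    """Return an illustrative label for one whitespace-separated token."""
    if token == "/":
        return "empty_pause"
    if token in FILLED_PAUSES:
        return "filled_pause"
    if token == "hhh":
        return "non_linguistic"
    if token == "xxx":
        return "unintelligible"
    if len(token) > 1 and token.endswith("="):
        return "cut_off"  # false start or cut-off word, e.g. sa=
    return "word"


def annotate(line: str) -> list[tuple[str, str]]:
    """Pair every token in a transcript line with its label."""
    return [(tok, classify_token(tok)) for tok in line.split()]


def lexical_words(line: str) -> list[str]:
    """Keep only the tokens that are not transcription codes."""
    return [tok for tok, label in annotate(line) if label == "word"]


if __name__ == "__main__":
    for tok, label in annotate("Chaplin sa= sacó el bebé"):
        print(tok, label)
```

Because repetitions and reformulations are not marked in the transcripts, a filter of this kind leaves them in place; only the explicit codes are separated from the words.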

Transliteration (Japanese native written subcorpus)

The data for the Japanese native subcorpus was gathered in the Japanese kana-kanji majiri script. The original texts remain available, but they were also transliterated into the Latin script (rōmaji), with spaces introduced between words (wakachigaki), for analytical, statistical and comparative purposes. This makes it possible, for example, to establish word counts and type/token ratios that are comparable with those of other subcorpora within CEDEL2. Technically, the procedure is closer to transcription than to transliteration, because there is no one-to-one match between the original Japanese written representations and those in the Latin script; the word ‘transliteration’ is used for practical purposes.
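
To show why word spacing matters for these measures, here is a minimal Python sketch of a word count and type/token ratio computed over a wakachigaki-spaced rōmaji string. The function name and the sample sentence are invented for illustration and do not come from the corpus; the sketch assumes that a token is simply a whitespace-separated string and that case differences are ignored.

```python
def type_token_ratio(text: str) -> float:
    """Distinct tokens divided by total tokens, on whitespace-split text."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # avoid division by zero on empty input
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Invented sample, spaced following the conventions described here.
    sample = "kono ko wa akachan de wa nai"
    print(len(sample.split()), type_token_ratio(sample))
```

Without the added spaces, the original kana-kanji text would offer no whitespace boundaries to split on, so a count of this simple kind would not be possible.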

These are the criteria used for transliterating from the Japanese kana-kanji majiri writing system to the Latin script (rōmaji) and wakachigaki (spacing between words):

The Revised Hepburn Romanization system has been followed since it is the most widely used today for Japanese discourse transcription (cf. Miyata and MacWhinney, 2016).
Additionally, the following decisions have been made regarding Japanese phonotactics and wakachigaki (word separation):

Table: Transliteration from native Japanese

Phenomenon	Description	Examples
The mora ん / ン	The letter n for the Japanese mora ん / ン is kept before bilabial consonants such as p, b or m.	janpu; akanbō
Vowel elongation	Vowel elongations are indicated with a macron above the lengthened vowel, except for the elongation of i, which is represented via a repetition of the letter.	rōjin, but kawaii
Aphaeresis/syncopation	Aphaeresis or syncopation characteristic of informal language style are captured via an apostrophe (’).	modotte ’ta > modotte ita; kurumarete ’ru > kurumarete iru; iru n’ da > iru no da
Hyphens	Hyphens are used for honorific forms of address such as -san, -kun, -chan, etc. They are not used to form compound words.	kero-san, but akachan (lexicalized to mean “baby”)
Verb inflection	Verb inflectional morphemes are not hyphenated but attached.	suterarete [sute (“throw away”) + rare (passive voice) + te (gerundive)]
Auxiliary verbs following a gerundive	Auxiliary verbs iru, aru, aruku, iku, kuru, shimau, etc. following a lexical verb in its gerundive form (-te / -de) are detached when the combination is interpreted as periphrastic (inchoative, durative, conclusive, etc.). Otherwise, the combination is written as a single lexical unit.	aruite iru [aruku (“walk”) + iru (“be”)] but yattekuru [yatte (“do”) + kuru (“come”)] meaning “pop up”, “come along”, “come around”.
Compound verbs	Compound verbs with the first verb in its ren’yōkei 連用形 (continuative) form (aruki < aruku) are written as single lexical units. When the verb combination is interpreted as two juxtaposed clauses, the verbs are written separately.	arukisaru [aruki + saru] but mitsuke yomu, where two separate and sequential actions are described
Noun + suru	A space is added between the auxiliary verb suru used to create denominal verbs and the preceding noun.	gekido suru
Particles	Particles are all detached from the preceding and/or following words except for those that are lexicalized.	tegami ga; hitotachi ni; akachan o; obasan wa; hitome de; but dokoka; itsuka; dareka; ikuraka; nanika.
Subordinating particle to	Subordinating particle to is always detached from the preceding word.	fu to miru; noseyō to suru
Demonstratives	Demonstratives are separated from the following noun.	kono ko; sono go; sono mama; ano hito.
先程	saki and hodo are written separately.	saki hodo
～では	The de wa form is rendered separately in order to distinguish the gerundive from the topic particle.	kazoku de wa nai
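
Some of the conventions in the table above are mechanical enough to be checked automatically. The Python sketch below flags three of them in a rōmaji token: m written before p, b or m, a macron on i, and a hyphen that does not introduce an honorific. It is only an illustration: the function names and messages are hypothetical, the honorific set is limited to the three suffixes named in the table (which ends in "etc."), and the check cannot tell whether a form such as akachan is lexicalized.

```python
import re
import unicodedata

# Only the suffixes named in the table; the real list is open-ended.
HONORIFICS = {"san", "kun", "chan"}


def check_token(token: str) -> list[str]:
    """Return illustrative warnings for one romanized token."""
    # Normalize so that i + combining macron is compared as a single ī.
    t = unicodedata.normalize("NFC", token).lower()
    problems = []
    if re.search(r"m[pbm]", t):
        problems.append("n is kept before p, b or m (e.g. janpu)")
    if "ī" in t:
        problems.append("long i is written ii, not with a macron")
    if "-" in t and t.rsplit("-", 1)[1] not in HONORIFICS:
        problems.append("hyphens are reserved for honorific suffixes")
    return problems


def check_line(line: str) -> dict[str, list[str]]:
    """Map each problematic token in a spaced line to its warnings."""
    report = {}
    for tok in line.split():
        problems = check_token(tok)
        if problems:
            report[tok] = problems
    return report


if __name__ == "__main__":
    print(check_line("kero-san wa jampu suru"))
```

Rules that depend on interpretation, such as whether a verb combination is periphrastic or lexicalized, are not covered by a check of this kind and still require a human decision.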


CEDEL2: Corpus Escrito del Español L2 (version 2)
