Transcription conventions

Transcription of the spoken data

Whereas the CEDEL2 v.2 transcriptions used a basic set of transcription conventions, CEDEL2 v.3 follows a 'plain text' protocol in which no transcription symbols are used. Words are transcribed as they are heard, in lowercase and without any punctuation. Repetitions are transcribed as such (e.g., un hombre que estaba caminando en el en la calle).

These are the main reasons for this 'plain text' policy:

Adding spoken marks (such as '/' for empty pauses and 'eh' for filled pauses) hides the transcribed text from certain tag-based morphosyntactic searches. For example, a search for ARTfem NOUNmasc (la problema, la coche) would not return any hits if the sequences had been transcribed as la / eh / problema or la / coche.
The new transcription method makes BilinguaLab's written and spoken corpora directly comparable in morphosyntactic searches.
The BilinguaLab corpora were designed with lexicon and morphosyntax researchers in mind rather than phonologists or phoneticians, so oral marks add little value for researchers interested in lexical or morphosyntactic aspects, and only limited value for those interested in spoken aspects. Researchers interested in oral/phonetic aspects can always download the audio files and produce their own fine-grained annotated transcriptions, which will always be more precise than the transcription conventions used for CEDEL2 (version 2).
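
The first reason above can be illustrated with a minimal Python sketch. It assumes tokens are stored as (word, tag) pairs and that a tag search matches strictly adjacent tokens; the function name and the PAUSE and FILLED tags are hypothetical, and this is not the actual CEDEL2 search engine.

```python
# Minimal sketch: a search for two adjacent tags over (word, tag) pairs.
# Tag names ARTfem / NOUNmasc come from the example in the text;
# PAUSE and FILLED are hypothetical tags used only for illustration.

def find_tag_bigrams(tokens, first_tag, second_tag):
    """Return the word pairs whose tags are strictly adjacent."""
    hits = []
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        if t1 == first_tag and t2 == second_tag:
            hits.append((w1, w2))
    return hits

# Plain-text transcription: article and noun are adjacent.
plain = [("la", "ARTfem"), ("problema", "NOUNmasc")]

# Transcription with pause marks: the marks sit between article and noun.
marked = [
    ("la", "ARTfem"),
    ("/", "PAUSE"),
    ("eh", "FILLED"),
    ("/", "PAUSE"),
    ("problema", "NOUNmasc"),
]

plain_hits = find_tag_bigrams(plain, "ARTfem", "NOUNmasc")
marked_hits = find_tag_bigrams(marked, "ARTfem", "NOUNmasc")
print(plain_hits)   # [('la', 'problema')]
print(marked_hits)  # []
```

The marked version yields no hit because the pause tokens break the adjacency that the tag sequence relies on.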

The only code used is XXX, which marks a word that the transcriber could not understand or interpret (it appears in lowercase, xxx, in the transcripts). The following are two examples of transcribed audio files:

Filename: EN_SP_14_19_0.5_14_GFH

un hombre está de pie en el camino las cosas están cayendo a su alrededor el hombre encuentra un bebé no sabe de dónde vino trata de encontrar a su madre cuando no puede encontrar a su madre se la da a extraños es xxx por la policía por tratar de abandonar al bebé piensa el dejar al bebé al xxx de la carretera luego encuentra una nota que dice que el bebé es huérfano al finale él decide quedarse con el bebé

Filename: EN_SP_28_21_4_14_SD

hay un hombre que estaba caminando en el en la calle cuando encontró un bebé y entonces él no quería el bebé y ver un una mujer y le da el bebé a ella pero ella no quiere el bebé también y lucha luchó con él y le da el bebé a él entonces él encontró un otro hombre y el hombre no quería el bebé también tampoco y finalmente él decidió finalmente él encontró un una carta con el bebé y leí en el en la carta por favor amar y cuidar a a mi a mi bebé y él decidió cuidar al niño
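
As a rough illustration of the 'plain text' protocol, the Python sketch below checks that a transcript contains only lowercase alphabetic tokens and counts the unintelligible-word code. The function names are hypothetical, and the sketch assumes the text is Unicode-composed (NFC), so that accented letters such as í and é are single alphabetic characters.

```python
# Minimal sketch: check a transcript against the plain-text protocol
# (lowercase, no punctuation, no transcription symbols).
# Function names are hypothetical; text is assumed to be NFC-normalised.

def check_plain_text(transcript):
    """Return the tokens that break the plain-text protocol."""
    problems = []
    for token in transcript.split():
        # str.isalpha() accepts accented letters such as 'í' and 'ñ'.
        if not token.isalpha() or token != token.lower():
            problems.append(token)
    return problems

def count_unintelligible(transcript, code="xxx"):
    """Count occurrences of the unintelligible-word code."""
    return transcript.split().count(code)

# Fragment taken from the first example transcript above.
sample = "es xxx por la policía por tratar de abandonar al bebé"
# Fragment using the pause marks that the protocol avoids.
marked = "la / eh / problema, la / coche"

print(check_plain_text(sample))      # []
print(count_unintelligible(sample))  # 1
print(check_plain_text(marked))      # ['/', '/', 'problema,', '/']
```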

Transliteration (Japanese native written subcorpus)

The data for the Japanese native subcorpus was gathered in the Japanese kana-kanji majiri script. Although the original texts are also available, they were transliterated into the Latin script (rōmaji), introducing spaces between words (wakachigaki), for analytical, statistical and comparative purposes, thus making it possible, for example, to establish word counts and type/token ratios that are comparable with other subcorpora within CEDEL2. Although, technically, the procedure is more of a transcription than a transliteration because there is no one-to-one match between the original Japanese written representations and those in the Latin script, the word ‘transliteration’ is used for practical purposes.
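
The Python sketch below shows, under simple assumptions, why spacing matters for these counts: once words are separated by spaces, tokens and types can be counted by splitting on whitespace. The function name is hypothetical, and the sample string is a concatenation of fragments from the examples on this page, not a real sentence from the corpus.

```python
from collections import Counter

# Minimal sketch: word count and type/token ratio for space-delimited text.
# The function name is hypothetical; the sample is built from fragments
# used as examples on this page and is not a real corpus sentence.

def type_token_ratio(text):
    """Return (tokens, types, ratio) for whitespace-delimited text."""
    tokens = text.split()
    if not tokens:
        return 0, 0, 0.0
    types = Counter(tokens)
    return len(tokens), len(types), len(types) / len(tokens)

romaji = "tegami ga akachan o kono ko akachan o"
n_tokens, n_types, ttr = type_token_ratio(romaji)
print(n_tokens, n_types, ttr)  # 8 6 0.75

# The original script has no spaces, so whitespace splitting sees one token.
original = "手紙が赤ちゃんを"
print(len(original.split()))  # 1
```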

These are the criteria used for transliterating from the Japanese kana-kanji majiri writing system to the Latin script (rōmaji) and wakachigaki (spacing between words):

The Revised Hepburn Romanization system has been followed since it is the most widely used today for Japanese discourse transcription (cf. Miyata and MacWhinney, 2016).
Additionally, the following decisions have been made regarding Japanese phonotactics and wakachigaki (word separation):

Table: Transliteration from native Japanese

Phenomenon	Description	Examples
The mora ん / ン	The letter n for the Japanese mora ん / ン is kept before bilabial consonants such as p, b or m.	janpu; akanbō
Vowel elongation	Vowel elongations are indicated with a macron above the lengthened vowel, except for the elongation of i, which is represented via a repetition of the letter.	rōjin, but kawaii
Aphaeresis/syncopation	Aphaeresis or syncopation characteristic of informal language style are captured via an apostrophe (’).	modotte ’ta > modotte ita; kurumarete ’ru > kurumarete iru; iru n’ da > iru no da
Hyphens	Hyphens are used for honorific forms of address such as -san, -kun, -chan, etc. They are not used to form compound words.	kero-san, but akachan (lexicalized to mean “baby”)
Verb inflection	Verb inflectional morphemes are not hyphenated but attached.	suterarete [sute (“throw away”) + rare (passive voice) + te (gerundive)]
Auxiliary verbs following a gerundive	Auxiliary verbs iru, aru, aruku, iku, kuru, shimau, etc. following a lexical verb in its gerundive form (-te / -de) are detached when the combination is interpreted as periphrastic (inchoative, durative, conclusive, etc.). Otherwise, the combination is written as a single lexical unit.	aruite iru [aruku (“walk”) + iru (“be”)] but yattekuru [yatte (“do”) + kuru (“come”)] meaning “pop up”, “come along”, “come around”.
Compound verbs	Compound verbs with the first verb in its ren'yōkei 連用形 form (aruki > aruku) are written as single lexical units. When the verb combination is interpreted as two juxtaposed clauses, the verbs are written separately.	arukisaru [aruki + saru] but mitsuke yomu, where two separate and sequential actions are described
Noun + suru	A space is added between the auxiliary verb suru used to create denominal verbs and the preceding noun.	gekido suru
Particles	Particles are all detached from the preceding and/or following words except for those that are lexicalized.	tegami ga; hitotachi ni; akachan o; obasan wa; hitome de; but dokoka; itsuka; dareka; ikuraka; nanika.
Subordinating particle to	Subordinating particle to is always detached from the preceding word.	fu to miru; noseyō to suru
Demonstratives	Demonstratives are separated from the following noun.	kono ko; sono go; sono mama; ano hito.
先程	saki and hodo are written separately.	saki hodo
～では	The de wa form is rendered separately in order to distinguish the gerundive from the topic particle.	kazoku de wa nai
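
Two of the conventions in the table lend themselves to a simple automatic check. The Python sketch below is a heuristic under stated assumptions, not a tool used by the CEDEL2 team: it flags tokens that write m before p, b or m (where the table expects n) and hyphens that do not introduce one of the honorifics listed in the table. The function name and the honorific set are hypothetical, and the set is incomplete because the table ends its list with "etc.".

```python
import re

# Minimal sketch: heuristic checks for two rows of the table above.
# HONORIFICS holds only the suffixes named in the table; the table's
# list is open-ended ("etc."), so this set is an assumption.
HONORIFICS = {"san", "kun", "chan"}

def lint_token(token):
    """Return a list of warnings for one space-delimited rōmaji token."""
    warnings = []
    # The mora ん is written n, so m before p/b/m is unexpected.
    if re.search(r"m[pbm]", token):
        warnings.append("m before p/b/m: expected n")
    # Hyphens are reserved for honorific forms of address.
    if "-" in token:
        suffix = token.rsplit("-", 1)[1]
        if suffix not in HONORIFICS:
            warnings.append("hyphen not followed by a known honorific")
    return warnings

print(lint_token("janpu"))        # []
print(lint_token("jampu"))        # ['m before p/b/m: expected n']
print(lint_token("kero-san"))     # []
print(lint_token("sute-rarete"))  # ['hyphen not followed by a known honorific']
```

Rules that depend on interpretation, such as whether a verb combination is periphrastic, cannot be checked this way and would still require a human annotator.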

CEDEL2: Corpus Escrito del Español L2 (version 3)
