Transcription conventions
Transcription of the spoken data
While the transcriptions of CEDEL2 v.2 used basic transcription conventions, for CEDEL2 v.3 we opted to follow a 'plain text' protocol, whereby no transcription symbols are used. Words are transcribed as they are heard, in lowercase and without any punctuation. Repetitions are transcribed as such (e.g., un hombre que estaba caminando en el en la calle).
These are the main reasons for this 'plain text' policy:
- Adding spoken-language marks (such as '/' for empty pauses and 'eh' for filled pauses) hides the transcribed text from certain tag-based morphosyntactic searches. For example, a search for ARTfem NOUNmasc (la problema, la coche) would not return any hits if those sequences were transcribed as la / eh / problema or la / coche.
- The new transcription method makes the results of morphosyntactic searches comparable across BilinguaLab's written and spoken corpora.
- The BilinguaLab corpora were designed with lexicon/morphosyntax researchers in mind rather than phonologists/phoneticians, so oral marks add little value either for researchers interested in lexical or morphosyntactic aspects or for those interested in spoken aspects. Researchers interested in oral/phonetic aspects can always download the audio files and produce their own fine-grained annotated transcriptions, which will always be more precise than the transcription conventions we used for CEDEL2 (version 2).
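The first reason above can be illustrated with a minimal sketch of an adjacent-tag search. The tag names (ARTfem, NOUNmasc, PAUSE), the (word, tag) data layout and the function `find_tag_bigrams` are assumptions made for illustration only; they do not describe the actual CEDEL2 search engine or tagset.

```python
# Hypothetical (word, tag) sequences; the tag labels are illustrative only.

def find_tag_bigrams(tagged, first_tag, second_tag):
    """Return adjacent word pairs whose tags match (first_tag, second_tag)."""
    hits = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 == first_tag and t2 == second_tag:
            hits.append((w1, w2))
    return hits

# Plain-text transcription: article and noun are adjacent.
plain = [("la", "ARTfem"), ("problema", "NOUNmasc")]

# Transcription with pause marks: the marks sit between article and noun.
marked = [
    ("la", "ARTfem"),
    ("/", "PAUSE"),
    ("eh", "PAUSE"),
    ("/", "PAUSE"),
    ("problema", "NOUNmasc"),
]

plain_hits = find_tag_bigrams(plain, "ARTfem", "NOUNmasc")
marked_hits = find_tag_bigrams(marked, "ARTfem", "NOUNmasc")
print(plain_hits)
print(marked_hits)
```

In this sketch the plain sequence yields one hit and the marked sequence yields none, because the intervening pause tokens break the adjacency that the search relies on.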
The only code used is XXX, which marks a word that the transcriber could not understand or interpret (it appears as xxx in the transcripts, in line with the lowercase protocol). Two examples of transcribed audio files follow:
Filename: EN_SP_14_19_0.5_14_GFH
un hombre está de pie en el camino las cosas están cayendo a su alrededor el hombre encuentra un bebé no sabe de dónde vino trata de encontrar a su madre cuando no puede encontrar a su madre se la da a extraños es xxx por la policía por tratar de abandonar al bebé piensa el dejar al bebé al xxx de la carretera luego encuentra una nota que dice que el bebé es huérfano al finale él decide quedarse con el bebé
Filename: EN_SP_28_21_4_14_SD
hay un hombre que estaba caminando en el en la calle cuando encontró un bebé y entonces él no quería el bebé y ver un una mujer y le da el bebé a ella pero ella no quiere el bebé también y lucha luchó con él y le da el bebé a él entonces él encontró un otro hombre y el hombre no quería el bebé también tampoco y finalmente él decidió finalmente él encontró un una carta con el bebé y leí en el en la carta por favor amar y cuidar a a mi a mi bebé y él decidió cuidar al niño
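Because the transcripts are plain lowercase text with a single code, they can be split into word tokens on whitespace alone. The following is a minimal sketch, not part of the corpus tooling; the function names `tokenize` and `unintelligible_ratio` are hypothetical, and the sample is a fragment of the first transcript above.

```python
def tokenize(transcript):
    """Split a plain-text transcript into word tokens on whitespace."""
    return transcript.split()

def unintelligible_ratio(tokens, code="xxx"):
    """Share of tokens that carry the unintelligible-word code."""
    if not tokens:
        return 0.0
    flagged = sum(1 for token in tokens if token.lower() == code)
    return flagged / len(tokens)

# Fragment taken from the transcript of EN_SP_14_19_0.5_14_GFH.
sample = "es xxx por la policía por tratar de abandonar al bebé"
tokens = tokenize(sample)

print(len(tokens))
print(unintelligible_ratio(tokens))
```

Comparing against the lowercased token means the same check works whether the code is written XXX or xxx.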
Transliteration (Japanese native written subcorpus)
The data for the Japanese native subcorpus was gathered in the Japanese kana-kanji majiri script. Although the original texts are also available, they were transliterated into the Latin script (rōmaji), introducing spaces between words (wakachigaki), for analytical, statistical and comparative purposes, thus making it possible, for example, to establish word counts and type/token ratios that are comparable with other subcorpora within CEDEL2. Although, technically, the procedure is more of a transcription than a transliteration because there is no one-to-one match between the original Japanese written representations and those in the Latin script, the word ‘transliteration’ is used for practical purposes.
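Once words are separated by spaces, word counts and type/token ratios can be computed in the same way as for the other subcorpora. The sketch below assumes simple whitespace tokenisation; the function name `type_token_ratio` is hypothetical, and the rōmaji string is an invented illustration, not corpus data.

```python
def type_token_ratio(text):
    """Number of distinct word forms divided by the total number of words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Invented sample, spaced word by word (wakachigaki).
spaced = "obasan wa akachan o mite iru akachan wa naite iru"

# The same sample without spaces cannot be counted word by word.
unspaced = spaced.replace(" ", "")

print(len(spaced.split()), type_token_ratio(spaced))
print(len(unspaced.split()), type_token_ratio(unspaced))
```

Without word spacing the whole string counts as a single token, which is why the transliterated version is the one used for these statistics.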
These are the criteria used for transliterating from the Japanese kana-kanji majiri writing system to the Latin script (rōmaji) and wakachigaki (spacing between words):
- The Revised Hepburn Romanization system has been followed since it is the most widely used today for Japanese discourse transcription (cf. Miyata and MacWhinney, 2016).
- Additionally, the following decisions have been made regarding Japanese phonotactics and wakachigaki (word separation):
Table: Transliteration from native Japanese
| Phenomenon | Description | Examples |
|---|---|---|
| The mora ん / ン | The letter n for the Japanese mora ん / ン is kept before bilabial consonants such as p, b or m. | janpu; akanbō |
| Vowel elongation | Vowel elongations are indicated with a macron above the lengthened vowel, except for the elongation of i, which is represented via a repetition of the letter. | rōjin, but kawaii |
| Aphaeresis/syncopation | Aphaeresis or syncopation characteristic of informal style is marked with an apostrophe (’). | modotte ’ta (< modotte ita); kurumarete ’ru (< kurumarete iru); iru n’ da (< iru no da) |
| Hyphens | Hyphens are used for honorific forms of address such as -san, -kun, -chan, etc. They are not used to form compound words. | kero-san, but akachan (lexicalized to mean “baby”) |
| Verb inflection | Verb inflectional morphemes are not hyphenated but attached. | suterarete [sute (“throw away”) + rare (passive voice) + te (gerundive)] |
| Auxiliary verbs following a gerundive | Auxiliary verbs iru, aru, aruku, iku, kuru, shimau, etc. following a lexical verb in its gerundive form (-te / -de) are detached when the combination is interpreted as periphrastic (inchoative, durative, conclusive, etc.). Otherwise, the combination is regarded as a single lexical unit. | aruite iru [aruku (“walk”) + iru (“be”)] but yattekuru [yatte (“do”) + kuru (“come”)] meaning “pop up”, “come along”, “come around”. |
| Compound verbs | Compound verbs with the first verb in its ren'yōkei 連用形 form (e.g. aruki < aruku) are written as single lexical units. When the verb combination is interpreted as two juxtaposed clauses, the verbs are written separately. | arukisaru [aruki + saru] but mitsuke yomu, where two separate and sequential actions are described |
| Noun + suru | A space is added between the auxiliary verb suru used to create denominal verbs and the preceding noun. | gekido suru |
| Particles | Particles are all detached from the preceding and/or following words except for those that are lexicalized. | tegami ga; hitotachi ni; akachan o; obasan wa; hitome de; but dokoka; itsuka; dareka; ikuraka; nanika. |
| Subordinating particle to | Subordinating particle to is always detached from the preceding word. | fu to miru; noseyō to suru |
| Demonstratives | Demonstratives are separated from the following noun. | kono ko; sono go; sono mama; ano hito. |
| 先程 | saki and hodo are written separately. | saki hodo |
| ~では | The de wa form is rendered separately in order to distinguish the gerundive from the topic particle. | kazoku de wa nai |
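
Since contracted forms are marked with an apostrophe and kept as separate tokens, a user who needs the full forms for a search could restore them with a simple lookup. The following is a minimal sketch under that assumption: the mapping covers only the three contractions listed in the table, and the names `EXPANSIONS` and `expand_contractions` are hypothetical rather than part of the corpus tooling.

```python
APOSTROPHE = "\u2019"  # right single quotation mark, as printed in the table

# Contracted token -> full form, limited to the three examples in the table.
EXPANSIONS = {
    APOSTROPHE + "ta": "ita",
    APOSTROPHE + "ru": "iru",
    "n" + APOSTROPHE: "no",
}

def expand_contractions(text):
    """Replace known contracted tokens with their full forms."""
    expanded = []
    for token in text.split():
        # Accept a straight apostrophe as well as the typographic one.
        key = token.replace("'", APOSTROPHE)
        expanded.append(EXPANSIONS.get(key, token))
    return " ".join(expanded)

examples = ["modotte \u2019ta", "kurumarete \u2019ru", "iru n\u2019 da"]
restored = [expand_contractions(example) for example in examples]
print(restored)
```

Tokens that are not in the mapping are left unchanged, so text without contractions passes through as is.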