What is it about?

This paper examines key issues in compiling spoken language corpora in a computer-mediated communication (CMC) environment, using data from CASE (forthcoming), the Corpus of Academic Spoken English in an international context, which is currently being compiled at Saarland University, Germany, in cooperation with partners from different countries. Based on preliminary findings, new recommendations concerning data collection, treatment, compilation, and transcription are put forward to supplement existing best practice as presented in Wynne (2005). Our main general recommendation is the use of Skype as a suitable tool for collecting spoken data for linguistic analysis, which moves the recording of spoken data out of a restricted laboratory setting. During anonymisation, special care has to be taken with the video component, while multimodal features are preserved for analysis. We recommend adding a number of annotation elements as early as the transcription stage, particularly the CMC-related discourse features of overlap, echo, interference, and pauses and the English as a Lingua Franca (ELF) features of non-standard language and code-switching, as well as prosodic, paralinguistic, and non-verbal annotation. Additionally, we propose a layered corpus design in order to allow researchers to focus on specific annotation features.

The following have contributed to this page:

Selina Schmidt, Stefan Diemer, and Marie-Louise Brunner