The Problems of Transcription

The purpose of legal transcription is to endow instances of speech –ranging from court proceedings, through corporate meetings, to private conversations monitored by law enforcement– with the stringency of the written record. In this post, we will take a closer look at issues that frequently arise in this endeavor, with particular focus on the instance of a court setting. We intend to show that, while some derive from underlying technology, others are inherent to language and communication themselves, and thus require human judgment to be addressed.

1. The Transcriber’s Questions

Who is speaking?

Speech attribution can be challenging in the context of an audio recording or series of recordings that are not associated with any images. Audio files feature no tags or other indices of speakers’ identity, and voices are often hard to distinguish –sometimes even the speaker’s gender will be unclear. The fact that multiple speakers often overlap is an additional complication.

Unless the listener already knows a speaker’s voice and can recognize it directly, the problem can be solved through semantic inference –if a speaker at some point introduces or otherwise refers to themselves, or is addressed by name by someone else– or pragmatic inference –when a given speaker is expected to open proceedings or respond at a given time as per the rules of procedure in a formal setting, such as a trial or a wedding.

Speaker identification is subject to additional challenges in a project involving long-duration or multiple audio sources. Owing to the scope of the project, few participants will hear the entirety of the material, being confined to manageable sections instead. To make sense of what they are hearing, listeners must then resort to background information, such as the schedule of hearings, media coverage, or case files. They will also cross-reference multiple audio sources, including files from different dates, comparing passages in them to identify recurring speakers.

In such cases, thoughtful project management will be key to ensuring that a task is not excessively atomized and that all participants have access to, and benefit from, each other’s input.

What does the speaker mean?

Sense-making starts at the most basic level of spoken language: that of identifying which sounds are meaningful and which are not. The distinction between phones and phonemes, that is, between non-meaningful and meaningful sounds, is essential in natural languages. Consider how Japanese speakers often have difficulty distinguishing between the /l/ and /r/ sounds; or how the difference between the pronunciation of “house” as a noun and “house” as a verb in English (the difference between an unvoiced /s/ phoneme and a voiced /z/ phoneme) is lost on many non-native English speakers.

Sense-making is based on pattern recognition. Utterances can only be understood properly if the circumstances in which they are uttered are known. Furthermore, any given term is only understood in relation to the other terms alongside which it appears. In humans, pattern recognition takes place by matching perceived information against our memories – which are most often semantic memories, that is, memories that refer to the general knowledge of the world that we have acquired during our lives (word meanings, concepts, ideas), and ultimately through our bodily experience. This stands in contrast to machine learning models, such as the LLMs (large language models) on which programs like ChatGPT are based, where pattern recognition is statistical, based on the weights given to the connections between the processed data. Speech recognition software works similarly, breaking down recorded speech into individual sounds and using statistical algorithms to find the most likely fit for each sound within a word and sentence. Linguist Noam Chomsky addressed this distinction concisely in a recent piece for the NYT. (We discuss this matter further in our AI Glosses series.)
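To make the idea of a statistical “most likely fit” concrete, here is a deliberately toy sketch in Python. It is not how any particular speech recognition product works: the two score tables, their values, and the function name are all hypothetical, chosen only to show how a score for the sound can be combined with a score for the surrounding words.

```python
# Hypothetical scores: how well each candidate word matches the sound heard.
# Homophones such as "council" and "counsel" match equally well.
ACOUSTIC_FIT = {"council": 0.5, "counsel": 0.5}

# Hypothetical scores: how plausible each candidate is after the previous word.
CONTEXT_FIT = {
    ("defense", "counsel"): 0.3,
    ("defense", "council"): 0.001,
}


def most_likely_word(previous_word, candidates):
    """Pick the candidate with the highest combined (sound x context) score."""
    def combined_score(word):
        acoustic = ACOUSTIC_FIT.get(word, 0.0)
        context = CONTEXT_FIT.get((previous_word, word), 0.0)
        return acoustic * context

    return max(candidates, key=combined_score)


print(most_likely_word("defense", ["council", "counsel"]))
```

The choice here rests entirely on how often words have been seen together, not on any understanding of what a defense counsel is, which is the contrast with human sense-making drawn above.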

Human beings try to understand our environment, for better or worse. When gaps in data appear, the mind tends to fill them in based on past experience. However, apophenia – finding meaning in the connections between unrelated things, making sense where there is none – is a very real risk. As William Gibson put it, “Homo sapiens is about pattern recognition […]. Both a gift and a trap”.^[1]

Writer Janet Malcolm once discussed a trial involving the work of avant-garde sculptor Richard Serra. One of the witnesses in this trial, an art history professor, made a statement, and proceeded to give “[an] elegant lecture on, as the transcript has it, ‘minibalist’ sculpture”. “The phonetic spellings that leap off the pages of the transcript”, continues Malcolm “– ‘Grancoozi’, ‘Saint Gordons’, ‘DeSuveral’, ‘DeEppilo’, ‘Modelwell’, ‘Manwhole’ – testify to the gap that exists between the ordinary literate American and the tiny group of people who are the advanced art public”.^[2]

When the court stenographer was unable to understand the term “minimalist” or identify artists such as Brancusi, di Suvero, and Motherwell, they simply gave the phonetic transcription of what they heard, rather than leaving a blank – a common response.

This risk of overinterpretation – or just plain “making it up” – can be avoided through linguistic training, as well as, again, contextual knowledge (e.g., by being aware of any relevant names and proper nouns mentioned in the audio).
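As a small illustration of how contextual knowledge can be put to work, the sketch below compares a phonetic spelling against a list of names known to be relevant to the case and suggests the closest one. It is a minimal example under our own assumptions, not a description of any existing tool: the function name and the glossary are hypothetical (the terms are taken from the Malcolm example above), and the matching relies on Python’s standard difflib module with its default similarity cutoff of 0.6.

```python
from difflib import get_close_matches

# Hypothetical case glossary, compiled before transcription begins.
KNOWN_TERMS = ["minimalist", "Brancusi", "di Suvero", "Motherwell"]


def suggest_term(heard, known_terms=KNOWN_TERMS, cutoff=0.6):
    """Return the closest glossary term to what was heard, or None."""
    # Compare in lowercase, then map back to the glossary's own spelling.
    by_lowercase = {term.lower(): term for term in known_terms}
    matches = get_close_matches(
        heard.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[matches[0]] if matches else None


for heard in ["minibalist", "Modelwell", "DeSuveral"]:
    print(heard, "->", suggest_term(heard))
```

A suggestion of this kind is only a prompt for the transcriber. Spellings that fall below the cutoff return None and are left for a human to resolve, which is where judgment and background research come in.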

Is the speech direct or is it being reported?

This problem, as we shall see below, is particularly relevant in a court setting, where it is crucial to determine whether what is being said is original speech, a quotation from a written source, or a reference to previous statements by other speakers.

How should speech be transcribed?

Last but not least, there is the question of transcription itself – how to render recorded sound legible, in a manner that captures all the above nuances. To our surprise, our research for the transcription of a high-profile criminal trial that amounted to over 780 hours of audio revealed that no universal convention exists for this purpose in any single European language –let alone across two, or all of them. Consequently, we developed our own.

2. Transcription in a court setting

For all its formality, a court setting is no exception to the above: verbal exchanges therein often lend themselves to misunderstandings with material implications for all involved. In many English-speaking countries, there are fairly well-established requirements for court reporters in terms of their education and professional standards^[3], and it is relatively easy to find trained transcribers. However, this is not the case in every jurisdiction or every language.

In our experience, the main types of systematic mistakes found when reviewing court transcripts are the following:

Punctuation. A major issue that goes well beyond omitting punctuation marks. Because no obvious convention for transcribing oral language in court exists, when several transcribers are engaged in the same project, each tends to render inherently oral language features in a different way (including, very often, through overuse and misuse of ellipses).
Confusing speakers’ names, so that sometimes the same speaker is recorded as both asking questions and answering them.
Identifying speakers. Names can be spelled differently by different transcribers. The use of identifying letters can lead to confusion (e.g., “x1”, “x2”, and so on, leading to the issue of “x17” on the last page referring to the same individual as “x1” on page one). This is often particularly acute during the early stages of a project, when transcribers, several of whom may be assigned to the same document, are unfamiliar with the speakers.
Failing to identify speakers altogether.
“Editorial” omissions, such as the exclusion of speakers’ hesitations (“eh…”, “ummmm”, “.”, etc.), as transcribers are often instructed to leave these out for the sake of clarity. However, in a court setting, every hesitation, indeed every occurrence at court, must be faithfully recorded. To take a real-case example, the incursion of a feral cat into the courtroom (!) once had to be addressed in order to make sense of the speakers’ reactions. In another case, one of the defendants repeatedly made comments of a veiled insulting nature against the prosecutor, eliciting laughter among a largely partial audience. The cause of the eventual intervention by the judge would have been lost without reference to said laughter.
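The speaker-labeling problems listed above (variant spellings, and provisional labels such as “x1” and “x17” that turn out to be the same person) can be reduced by keeping a single shared list of speakers for the whole project. The Python sketch below shows one possible shape for such a list. It is an illustration under our own assumptions rather than a description of an established tool: the class, its method names, and the “PROSECUTOR” label are all hypothetical.

```python
class SpeakerRegistry:
    """Maps provisional labels and spelling variants to one canonical name."""

    def __init__(self):
        self._canonical = {}

    @staticmethod
    def _key(label):
        # Ignore case and extra spaces so "X17" and " x17 " are the same label.
        return " ".join(label.lower().split())

    def register(self, canonical, *aliases):
        """Record a canonical name together with any labels used for it."""
        for label in (canonical, *aliases):
            key = self._key(label)
            existing = self._canonical.get(key)
            if existing is not None and existing != canonical:
                # Two transcribers have used one label for different people.
                raise ValueError(
                    f"{label!r} is already assigned to {existing!r}"
                )
            self._canonical[key] = canonical

    def resolve(self, label):
        """Return the canonical name, or flag the label for review."""
        return self._canonical.get(
            self._key(label), f"[UNIDENTIFIED: {label}]"
        )


registry = SpeakerRegistry()
registry.register("PROSECUTOR", "x1", "x17")
print(registry.resolve("X17"))
print(registry.resolve("x9"))
```

Unknown labels are flagged rather than guessed, and a label claimed for two different people raises an error, so that both cases reach a human reviewer instead of passing silently into the transcript.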

Transcription reinforces the known trope that good craftsmanship is invisible. Conversely, the consequences of falling short can be glaring.

^[1] William Gibson, Pattern Recognition, Penguin, 2005, p. 17.

^[2] Janet Malcolm, “A Girl of the Zeitgeist”, in Forty-One False Starts, Granta, 2014, p. 216.

^[3] For example, in the United Kingdom, accreditation by the British Institute of Verbatim Reporters (https://bivr.org.uk/) is required for court reporters. In the United States, some states require that court reporters hold a national or state certification, whereas certification is voluntary in other states. However, there are widely recognised professional associations –such as the National Court Reporters Association (https://www.ncra.org), the American Association of Electronic Reporters and Transcribers (https://aaert.org/), and the National Verbatim Reporters Association (https://nvra.org/)– whose goal is to provide education and certification and help set professional standards.