Boethius Translations

Transcription – the rendering of speech into writing – might seem fairly straightforward. We can all conjure the image of a stenographer, diligently taking dictation at a meeting. These days, AI-powered transcription tools are the latest milestone on the long road toward greater efficiency. Yet anyone who has transcribed voluminous or complex material knows that the process, even with the help of such tools, is more arduous than one might presume. Indeed, transcription today poses many challenges, as it has throughout its long history. To understand the current state of transcription, that history merits a closer look, and it is of further interest as a concomitant to the evolution of communication technologies generally. In some contexts, transcription can now be completed far faster and more accurately than ever, as some of its challenges have been greatly ameliorated, if not solved, by technological advances. However, other challenges, which we analyze in greater detail below, are more intractable, as they are inherent to language and communication themselves, and thus require sensible human intervention to overcome.

*

Speech – the expression of thoughts or feelings through articulate sounds – is a universal and defining capacity of Homo sapiens. In fact, the anatomical capacity to speak (rooted in the ability to form contrasting vowel sounds) is present in baboons and macaques, suggesting that the last common ancestor of these monkeys and humans already had the hardware, and that our evolutionary forebears therefore had some 27 million years to develop the cognitive power necessary to give voice to language. Writing, by contrast, at a mere 6,000 years old, is a relatively recent invention. The first attempts at rendering otherwise ephemeral language into concrete visible symbols were independently developed in ancient Egypt, China, Mesoamerica and Mesopotamia – the latter being the region where the first steps were taken in a long journey towards the most important transcription technology yet: the alphabet.

Before incorporating characters representing sounds, purely pictographic and logographic writing systems represented ideas and words through images and symbols. Because new words necessitated new characters, the number of characters in these early systems quickly grew unwieldy, learnable by few, while remaining conceptually limited – far from capable of representing the full range of meanings in spoken language. If, however, we observe the evolution of the ancient Sumerian writing system, we can follow a fascinating progression towards greater abstraction and efficiency: starting in the late Neolithic with small clay tokens directly symbolizing an object (an ox, for example); to the lines created by impressions of such tokens upon clay tablets; to the abandonment of the tokens in favor of drawing their outlines; through multiple stages of simplification of those lines until arriving at the stylus-formed, wedge-shaped – i.e., “cuneiform” – script. Ultimately consisting of some 1200 characters representing numerals and select nouns, Sumerian cuneiform served for basic administrative purposes, but it was only with the eventual incorporation of phonographic-syllabic characters that the invention of writing really took flight.

Cuneiform outlived the Sumerians, and it was thanks to the Akkadian appropriation of the script that the evolution of Western writing continued. Despite (or perhaps because of) Sumerian and Akkadian being unrelated languages, the Akkadians adapted cuneiform with some very potent changes. By halving the number of characters, dropping pictographs and adding syllabic phonographs, Akkadian cuneiform, although unsystematic and incomplete, became a vastly more functional orthography – capable, indeed, of producing epic works of literature and preserving the histories and myths of ancient Mesopotamia that are still with us today. Yet, for all its great feats, transcribing natural speech with Akkadian cuneiform would have been akin to performing surgery with an axe. Why the difficulty? To answer, let us pause to consider the fundamental differences between speech and writing.

Like speech, writing is an expression of language, but from the first scratches on stone until today, writing has differed immensely from speech, both formally and functionally. Writing is a craft – a communication technology, inevitably bound by formal conventions and under deliberate authorial control. Speech, by contrast, encompassing all verbal and gestural expression, is primordial: by nature unbounded, dialogic and extemporaneous. Consider the scope of speech: from the lallation of infants to the complex code-switching routinely carried out in multilingual conversations; a nervous hum and the roar of a crowd; glossolalia, neologisms, meaningful ellipses; stutters and grunts, songs and slang, et cetera. Consider also its velocity and ephemerality: typical speech pours forth into the air at rates averaging between 100 and 300 words per minute, leaving behind, if anything, only an immensely diminished and distorted trace in the unreliable memory of the listener.

To our ancestral scribes, the idea of faithfully capturing and representing all of that would have been as unthinkable as the images caught by today’s high-definition wildlife cinematographers would have been to the palaeolithic cave painters of Lascaux. Like the evolution of photography, the evolution of transcription technologies rests upon centuries of continuous developments and discrete discoveries made by individuals and communities around the world. For our purposes, following writing, the other key advances we will remark upon are those most illustrative of what was required to meet the challenges mentioned above, i.e., the scope, the speed and the ephemerality of speech.

*

Contemporary linguists rightly stress that the mid-20th-century conception of the progressive evolution of writing systems is overly simplistic and Eurocentric. If we take the cuneiform-to-alphabet lineage (i.e., pictograms → logograms → syllabic/consonantal phonograms → alphabet) as necessarily representing stages of advance, we could falsely imply that non-alphabetic scripts such as those used for Japanese and Chinese are relatively primitive, when in fact using an alphabet for such languages would be inefficient. Indeed, the perfection of a writing system depends entirely on its fitness to express a particular language, as well as the purpose of the writing. That said, when considering the objective of representing the full range of speech, the alphabet is undoubtedly one of the great achievements of the Western world, and the culmination of a long and richly varied refinement.

The alphabet is the descendant of a horde of predecessors, and its precise lineage is a matter of debate. Many cultures throughout the eastern Mediterranean developed quasi-alphabetic (phonological, syllabic and consonantal) writing systems, but when we speak of “the alphabet” we are referring to a very specific invention. As with so many things, the Western world owes its alphabet to the genius of the ancient Greeks. By creating, around 800 BCE, a series of characters representing only phonemes (the underlying constituents of syllables), and in particular by creating independent characters for vowels, they simultaneously minimized the number of characters to a very learnable 24 while maximizing their capability to represent the full range of phonological sounds, making it possible to spell out any utterance. The alphabet is therefore the first system that truly allows for writing all that might be said. It was also easily appropriated, totally or partially, by many different cultures and, with only minor adjustments, adapted to very different languages.

The alphabet having largely resolved the problem of range, there remained the problems of the velocity and transience of speech, the solutions to which would require a couple of millennia’s worth of further experimentation. In ancient times written records were of course made of the words of eminent orators, oracles and teachers, but the boundary between transcription and composition typically blurred beyond recognition. Because the objective was a polished final text serving the rhetorical or literary purposes of its author, accurate transcription was not of utmost importance. Generations of freshmen have puzzled in vain over the possibility of distinguishing “the real Socrates” from Plato’s prose, but Socrates’ own words were lost forever upon being spoken; thus the Socrates on trial as recounted by Xenophon will always seem a different man than in Plato’s portrayal, though both versions recount the same fundamentals of the proceedings.

*

Lingering briefly in ancient Greece, some centuries after the invention of the alphabet, our history continues with a contribution of the very same Xenophon, whose Notae Socrate is considered the first Western shorthand writing system. Ancient shorthand, somewhat ironically, returned to the use of vast collections of symbols requiring lengthy specialized training to master, but with the advantage of enabling writing that was both much faster, allowing for greater accuracy in taking dictation, and more economical with space, an important consideration given the relative scarcity of writing materials. Shorthand would become a bedrock of Greco-Roman culture: legal proceedings, marginalia, and everything from love letters to foundational philosophical texts could be written much faster by the author, or could be dictated to a trained slave or student, whose shorthand (often written on erasable wax tablets) could later be used to produce final drafts.

Latin shorthand (notae) may have derived from Greek influence or may have been independently introduced, as the most popular theory goes, by Cicero. Plutarch, referring to Cato’s 63 BCE senate speech against the conspirators of a coup d’état, describes what may be the first example of legislative transcription: “They say that this is the only speech of Cato which is preserved, and that it was owing to Cicero the consul who had previously instructed those clerks, who surpassed the rest in quick writing, in the use of certain signs which comprehended in their small and brief marks the force of many characters, and had placed them in different parts of the senate-house. For the Romans at this time were not used to employ nor did they possess what are called note-writers (σημειογράφοι), but it was on this occasion, as they say, that they were first established in a certain form.”[1]

The prevalence of such “note-writers” grew exponentially under the Roman Empire, to the point where wealthy Romans often kept one on their permanent staff. Early systems (such as the Notae Tironianae, named after Cicero’s secretary) involved thousands of symbols and were thus notoriously difficult to learn, but the Romans also developed more streamlined versions that could be taught in schools. From such schools arose a profession that is still with us today, as generations of specially trained shorthand writers (notarii) were tasked with recording the proceedings in courts of justice. In parallel, as is the case throughout the history of shorthand, more complicated systems persisted (the Notae Tironianae, for example, remained in use for over 1000 years), while others went in and out of fashion or were so particular that they served only a few users.

The Latin word Notae referred to shorthand used for transcribing speech, but from the start Notae (derived from Nota, meaning a mark or sign of any kind) carried a double meaning, as it also denoted any form of writing in cipher. In fact, the development and use of shorthand were driven as much by the need for concealment as by the need for transcription. Ample and varied advantage was taken of bespoke shorthand systems’ potential for secrecy: be it Julius Caesar and Augustus using it in their letters, Jews using it for messages during the Jewish–Roman Wars, or pagan alchemists using it for incantations and spells. Indeed, as Christianity came to dominate Europe, the widespread association of shorthand with magic and witchcraft would lead to its eventual prohibition, and for a time this once ubiquitous practice all but disappeared.

The rediscovery of ancient shorthand texts in the 15th century led to renewed interest, and by the 17th century several English-language systems had been developed, including some that could be adapted to other languages. In early modern England, competing systems vied for the loyalty not only of court and legislative reporters, but of journalists, diarists, playwrights and theatergoers, thanks to which many Shakespeare plays were preserved in shorthand. Learning shorthand was not only a professional asset, but could be put to all manner of less reputable purposes, such as piracy: clergymen, for example, were known to attend others’ services and use shorthand to plagiarize their competitors’ sermons. However, while popular among the highly educated, the geometric symbol systems then in use were abstruse and difficult to standardize, and it was only with the increased business and legal demands of the Industrial Revolution that more efficient and popularizable systems would appear.

At this point, it should be noted that while shorthand allowed for much faster writing, even the most efficient systems were still not fast enough to keep up with speech. Arriving at a final text was a lengthy, multi-step process, and completely faithful transcription was still out of reach. To record Cato’s speech in the Senate, Cicero needed multiple notarii in the gallery, because only by combining their (incomplete and imperfect) versions could a full draft be constructed (with Cicero, we assume, likely tweaking and polishing the final draft in the manner most advantageous to his purposes). To put things in perspective, consider 200 words per minute as a rough average rate of speech. Adult longhand writing today averages roughly 8 wpm, while a proficient trained shorthand writer (using the most efficient system still widely used in the 20th century) could average roughly 100 wpm. For live transcription, speed is of the essence, but the solution to this problem would not enter the scene until the dawn of the 20th century, as it required the confluence of the logic of shorthand with the (also long in coming) mechanical innovations culminating in the stenotype machine.

*

Writing machines had been attempted for nearly 300 years, the first (dubiously) attributed to an Italian printmaker whose 1575 scrittura tattile design was meant to facilitate writing and reading by the blind. Many impractical machines were built in the 19th century, some as big as pianos, all slower and more cumbersome than handwriting. A breakthrough came with the Danish Malling-Hansen Writing Ball of 1865, the first viable mechanical writing device. Gorgeous and portable, though delicate and difficult to master, the writing ball sold well in Europe. Famously, Friedrich Nietzsche, in the midst of losing his sight and mind, bought one and observed how using it changed both his writing style and, he felt, the manner in which he formed his thoughts.

Being artisanal and slow to produce, the writing ball was overtaken by the more commercially viable American-made Remingtons, which reached the market in 1874. The success of such typewriters led to the emergence of training programs and careers dedicated to their use, and professional typists could reach 60 to 90 wpm – impressive, but still too slow for transcription purposes. Demand for faster typing was increasingly widespread, and thus, starting around 1830, several “shorthand machines” – using a reduced number of keys, specialized stroke techniques and phonetic shorthand systems – were independently invented in Europe. By 1880 the Italian Senate had one in use for legislative reporting, and around 1913 the direct ancestor of the modern stenotype machine premiered. With these machines, the goal of writing at the speed of speech was finally achieved, as a professional stenotypist could reach speeds of up to 300 wpm.

*

There remained the problem of the ephemerality of speech. Due to human fallibility, complete accuracy remained hard to achieve even for the best stenotypists. Long meandering speeches, arguments with several people speaking at once, difficult hearing conditions, et cetera, could all lead to seriously flawed transcriptions, and with no other record to refer back to, transcriptions could be not only error-ridden but purposely altered – with possibly tremendous consequences in some contexts. In addition, given that (then as now) stenotypists typically needed to study shorthand for a couple of years, along with thousands of hours of physical practice to reach top speeds, professional transcription services were limited and costly. Fortunately, in parallel with the development of typewriters, the end of the 19th century would bring another game-changing innovation: audio recording.

In 1857 Édouard-Léon Scott de Martinville, a French printer and bookseller, patented the phonautograph, the first device to successfully record sound. This ingenious mechanism was never meant to play sound back; rather, it “transcribed” sound waves into visual, undulating lines drawn on smoke-blackened paper or glass. Imagining the potential of his mechanism, he wondered: “Will one be able, between two men brought together in a silent room, to cause to intervene a silent stenographer that preserves the discussion in its minutest details while adapting to the speed of the conversation?”[2] Interested in shorthand systems, the inventor envisioned that the etchings made by his recordings might someday be deciphered and read, thus serving as a form of natural stenography.

Unfortunately, Scott de Martinville never managed to decipher his sound-lines, the technology did not catch on, and his recordings went unheard until 2008, when UC Berkeley researchers succeeded in converting a phonautograph recording into a playable audio file. What they discovered was a rendition of a French folk song, now recognized as the earliest known intelligible recording of the human voice. The researchers initially misinterpreted it, thinking the voice was that of a woman or child singing at an ordinary tempo; once the proper playback speed was established, the voice turned out to be that of a man, probably Scott de Martinville himself, singing the song very slowly. His “Au clair de la lune, On n’y voit qu’un peu. On chercha la plume, On chercha du feu. En cherchant d’la sorte, Je n’sais c’qu’on trouva; Mais je sais qu’la porte Sur eux se ferma”[3] finally reached our ears – over 150 years later.

In 1877 Thomas Edison presented the phonograph, the first device to both record sound and play it back, and with the improvements of Alexander Graham Bell the technology became commercially viable. We now take the recording and playback of sound for granted, but when our great-grandparents first heard voice recordings – much as with the first reactions to photographs – many found it unsettling that the voices of the dead could be heard by posterity, likening the technology to the occult. Of course it spread like wildfire, and in its various iterations – from dictaphones on through digital recorders – the new technology went on to change the world in previously unimaginable ways. The audio recording of human speech proved another momentous step for the purposes of transcription. Now all manner of speech could be preserved in its entirety and transcribed at a later date, and with the unlimited ability to pause and play back, any literate person with a good ear (and enough time on their hands) could conceivably transcribe from a well-executed recording.

*

The major technological breakthroughs described above are what made transcription reliably possible, but by no means easy or fast. Later advances have thus all involved increasing the efficiency of the transcription process, a process which fundamentally remains much the same: first, live or recorded speech heard by a transcriptionist (or analyzed by AI) is noted in a rough, preliminary (possibly shorthand) text; AI-generated texts are then carefully checked to correct blatant errors; the many doubts and gaps in the preliminary text are addressed by the transcriptionist repeatedly consulting the audio recording (if available); finally, appropriate formatting is determined, inaudible portions are indicated, the transcriptionist writes out a full draft, and the draft is proofread at least once. Fortunately, as with so many processes, computing technology has greatly expedited all of those steps. Compared with the first writing machines, the ease of typing, error correction, formatting, storage and duplication allowed for by word-processing programs saves countless hours of labor. The internet made possible instant file transfers and the distribution of tasks across global remote work teams, while AI continues to improve and, when used responsibly by an attentive transcriptionist, can prevent a lot of wrist pain by generating that initial draft.

To conclude, we must expand: what we have addressed here pertains only to the mechanics of transferring speech into writing. While the challenges presented by what we identified as the range, speed and ephemerality of speech were largely met thanks to technological advances, there remains a whole other aspect of transcription. This interpretive aspect, as we might call it, involves a different set of challenges – challenges with which we at Boethius are particularly familiar. It is by understanding them that one understands the limitations of transcription technology, for these challenges, at least for the foreseeable future, form a formidable obstacle to the dream of seamless automated transcription.

Like automated translation, accurate machine transcription occurs only in highly controlled contexts. A radio broadcaster, for example, clearly enunciating a five-minute news article written in the standardized style of the Associated Press would likely be flawlessly transcribed by an AI program. But as soon as natural human speech of any length or complexity is involved, with all its individual and cultural vagaries, the material quickly thwarts the competence of our AI assistants, which nonetheless go on spouting wrong answers with unmatchable speed and confidence. In fact, as has been the case from the beginning of our history, transcription remains the work of highly literate professionals formed by years of training and practice with the latest tools.

[1] Plutarch, Life of Cato the Younger, George Long’s translation, p. 291.

[2] The Phonautograph Manuscripts of Édouard-Léon Scott de Martinville.

[3] By the light of the moon, you barely see. They looked for a pen, looked for a light. All that looking, I don’t know what they found; I only know the door closed behind them.