The four Industrial Radio Ballads, broadcast between 1958 and 1961, gave voice to working-class cultures and professions that often remained off the air, drawing on a history of ethnographic and documentary practices to share each community’s story. The portable tape recorder, newly available in Britain after World War II, made documenting cultural differences in remote areas far more accessible. Upon learning of the portable tape recorder’s arrival at the BBC, the documentary radio producer Charles Parker reached out to folk singer-songwriter Ewan MacColl (called “‘Britain’s leading folk singer’” (Cole 355) by critic Robert Shelton) to devise the Radio Ballad.
Using this state-of-the-art technology, the three creators of the Radio Ballads recorded in each community what MacColl described as “‘a vividness of speech, an element of fantasy, that he [believed] to have been lost in literary English’” (Cole 356). This disconnect between the written word of “literary English” and the spoken word of recorded “actuality” echoed the concept of ethnographic hearing loss – “a common late-nineteenth-century refrain about the inherently unfaithful relationship between the written word and its cultural sound sources” (Hochman 75).
Singing the Fishing, the third Radio Ballad, demonstrated the capabilities of the portable tape recorder. The remoteness of the herring fishing industry in East Anglia and Northeast Scotland – plus the ever-shifting location of the industry’s seafarers – heightened the urgency of recording these communities’ stories. However, these recordings have not always been interpreted correctly. The online transcript of Singing the Fishing, though mostly accurate, still does not account for every syllable of sung or spoken text in the Doric dialects of Northeast Scotland. Thus, ethnographic hearing loss continues to affect how we gather, perceive, and represent cultural differences today.
Over the last decade, the emergence of Artificial Intelligence (AI) technologies capable of notating cultural difference may have been even more groundbreaking than the birth of the portable tape recorder. But how capable are AI technologies of rectifying the failures of ethnographic hearing loss, compared to human beings?
To test these ideas, I used OpenAI’s Whisper, an automatic speech-to-text transcription tool that runs in Python. Using Word Error Rate, I measured the fidelity of Whisper’s automatic transcript of Singing the Fishing to the human-made transcript of the same episode. I predicted that the Whisper transcript would do a subpar job, landing at a Word Error Rate of around 0.5.
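For readers who want to reproduce this step, the transcription itself takes only a few lines of Python. The sketch below is a minimal, hypothetical version rather than my exact script: it assumes the openai-whisper package is installed and that the episode audio has been saved locally under a placeholder filename.

```python
# Minimal sketch of the transcription step (not the exact script used here).
# Assumes: pip install openai-whisper, and the episode audio saved locally
# under the hypothetical filename "singing_the_fishing.mp3".
import whisper

model = whisper.load_model("base")  # larger models are slower but more accurate
result = model.transcribe("singing_the_fishing.mp3")

# Print each segment with its start and end times, mirroring the
# bracketed time stamps in Whisper's output shown in the figures below.
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} --> {segment['end']:.2f}] {segment['text'].strip()}")
```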




Whisper took approximately five hours to process the entirety of Singing the Fishing. The first two figures above show the beginning and ending sections of Whisper’s output; the middle section has been left out for the sake of brevity. As soon as Whisper began generating output, I noticed that the tool struggled to transcribe most, if not all, of the sung text.
To create two easily comparable plain-text copies of the transcripts, I copied and pasted each one into TextEdit and exported both as .txt files. As shown in the third figure above, the Whisper transcript required an extra step before pasting into TextEdit: I used Microsoft Excel to separate the time stamps from the transcribed text. I entered the formula =TEXTSPLIT(B2, "] ") to divide the data into two columns, then clicked “Paste Special” and “Values” to batch paste the transcript text into Column O (to avoid any issues with text overlap).
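The same separation could also be scripted in Python rather than Excel. The sketch below is a hypothetical alternative to the TEXTSPLIT step: it assumes Whisper’s output was saved line by line to a placeholder file named whisper_output.txt, with each line beginning with a bracketed time stamp.

```python
# Hypothetical Python alternative to the Excel TEXTSPLIT step.
# Assumes each line of "whisper_output.txt" looks like:
# [00:01.000 --> 00:05.000] example transcript text
text_only = []
with open("whisper_output.txt", encoding="utf-8") as f:
    for line in f:
        # Split on the closing bracket, mirroring =TEXTSPLIT(B2, "] ")
        parts = line.split("] ", 1)
        text_only.append(parts[1].strip() if len(parts) == 2 else line.strip())

# Save the transcript text alone, one segment per line, as a .txt file
with open("whisper_transcript.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(text_only))
```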
I finished the process by running both plain-text copies of Singing the Fishing through a Word Error Rate calculation (see the sketch below), which yielded a value of 0.3248. On a scale from 0 to 1, this value indicated that the AI transcription was largely similar to the human-made transcription. Whisper did a better job of capturing the smaller details of the recorded audio than I had anticipated. However, even though Whisper was trained on speech from a variety of accents, it corrected many characteristics of the Doric dialect into “standard English.” One example: the /uː/ sound in “noo” (Doric) was corrected to the /aʊ/ sound in “now” (the “standard English” equivalent term). The bias toward “standard English” in Whisper not only reinvokes the Lomax family approach to making dialect more “acceptable”; it also furthers the development of ethnographic hearing loss. To work against this issue, AI engineers using Whisper must consider how to fine-tune the data sampling process. They could provide a more specific list of dialects that the end user can target when inputting audio into Whisper.
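For the comparison itself, a minimal sketch of the Word Error Rate calculation is shown below. It assumes the jiwer Python package and the two plain-text transcripts under hypothetical filenames; it is one common way to compute WER, not necessarily the exact script used for the figure above.

```python
# Minimal sketch of the Word Error Rate comparison (hypothetical filenames).
# Assumes: pip install jiwer
import jiwer

with open("human_transcript.txt", encoding="utf-8") as f:
    reference = f.read()   # the human-made transcript
with open("whisper_transcript.txt", encoding="utf-8") as f:
    hypothesis = f.read()  # Whisper's transcript

# Basic normalization so casing and punctuation do not inflate the error rate.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

error_rate = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"Word Error Rate: {error_rate:.4f}")  # 0 means identical word sequences
```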