Mat Janson Blanchet's academic works

Module 5 Assignment: Analyze media authoring tools and compare speech processing technology developments

Posted on August 25, 2020

Human-Computer Interaction for User Experience Design

Question 1

In Unit 1 of this module, you learned about some of the eye-tracking and media authoring tools that Professor Frédo Durand is involved with.

For this question, produce a paragraph of 400 words in which you analyze the usefulness and need for the media authoring tools covered in Unit 1 (Voice Script and Pentimento).


Let’s take a look at the different media authoring tools covered in this unit.

First, concerning Voice Script, it would be important to define the intended audience: who are the expected users of this tool? It would be necessary to consult with the different people who might use it, or otherwise to explicitly target one audience and cater to its specific needs.

I have a hard time believing that the film, advertising, or voice recording industry would adopt this workflow. Voice actors typically work with someone at the studio console who controls when to punch in the recording, while the actor performs elsewhere in a sound booth, doing their acting job. This is demonstrated in How Pokémon Is Dubbed From Japanese To English, in which voice actor Sarah Natochenny describes this process and explains why she must be free to follow the script or go off it. Voice Script seems to be meant for a one-person team, which does not scale to a professional setting.

Most writers tend to write their scripts in a word processor; does Voice Script offer the possibility to import such files? What about translations? Translated scripts are often exported as .srt files, a format used by YouTube, Vimeo, and other digital players. Is this another feature that has been considered?

The tool could probably use some improvement in how it imports and paces the sound: there were moments where words were placed too close together. Also, with automated voice selection, the punch-in points for word intonations were sometimes off, as was the tone. These are the kinds of details that a studio engineer and a voice actor can easily figure out manually; they may be harder for an automated tool.

However, Pentimento seems to be more promising. There are already many platforms for online classes—e.g. Khan Academy, PluralSight, Udemy—and there are many people who animate conference presentations—e.g. the RSA Animates series by the RSA event organizers, or Eva-Lotta Lamm, the illustrator.

Given the rise in popularity of remote work and remote teaching, such a tool could well be useful. What is lacking in the presentation of this tool is not how it works, but how people learn to use it. Elementary and high school teachers may not have any prior editing knowledge. While this tool leverages concepts that are familiar to designers and advanced users—vector graphics editing, time-based editing—how are these conveyed to the intended audience?

The main takeaway here is Prof. Durand's remark, made while discussing Voice Script, that “it’s important to have tools that offer a very flexible workflow.”

Question 2

In Unit 2 of this module, you learned about the fundamentals of speech processing, current developments in the field, and what the future of speech recognition may look like.

For this question, produce a paragraph of 400 words in which you compare past, present, and future developments in speech processing, focusing on the following:

– Consider the speech processing technology of the past and the present, and point out how specific developments in the past influenced the speech processing technology available today.

– Consider the needs for the speech processing technology that is available today. How will these needs change in the future? And what speech processing developments must occur to address these needs?


In the class notes, speech processing technology is “traced back to 1877, when Thomas Edison’s phonograph recorded and reproduced sound (Newville 2009).” What is not mentioned is the work of Alexander Graham Bell, famous for inventing the telephone, who was actually doing research to convert sound to a visual medium, so that the deaf could read sounds. While Edison’s invention, the phonograph, popularized recorded music, Bell’s work is more closely related to what is understood as speech recognition today.

Among the research done to synthesize voice, one path held that additive synthesis—broadly, the accumulation of different frequencies—could be used to create any sound. While this did not pan out exactly as expected, it led to, among other things, frequency modulation synthesis, which in turn allowed musicians to create all sorts of music from readily available synthesized sounds. Here too, this technological evolution was more about music and sound than actual speech.
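The idea behind additive synthesis can be sketched in a few lines of code: stack simple sinusoids at different frequencies and amplitudes to build a more complex timbre. This is only an illustrative toy, not any particular historical synthesizer; the sample rate, partial frequencies, and amplitudes below are arbitrary choices.

```python
import numpy as np

SAMPLE_RATE = 44_100  # samples per second, a common audio rate

def additive_tone(partials, duration=1.0, sample_rate=SAMPLE_RATE):
    """Sum sine partials, given as (frequency_hz, amplitude) pairs,
    into a single waveform: additive synthesis in its simplest form."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    wave = np.zeros_like(t)
    for freq, amp in partials:
        wave += amp * np.sin(2 * np.pi * freq * t)
    return wave

# A crude square-wave approximation: odd harmonics of 220 Hz,
# each at 1/n amplitude (the first terms of its Fourier series)
partials = [(220 * n, 1 / n) for n in (1, 3, 5, 7)]
tone = additive_tone(partials, duration=0.5)
```

The limitation the text alludes to is visible here: natural sounds, and speech especially, have partials whose frequencies and amplitudes change over time, so a fixed sum of sinusoids can only ever approximate them.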

As Dr. Glass explains in his videos and notes, sound was eventually translated to a spectrum graph and cut up into small segments, which allowed computers to recognize and catalog many sounds. Handling variations in a sound—e.g. accent, language, recording quality—then became a matter of training an algorithm with more data so it could provide a more accurate response. As Dr. Glass goes on to explain, algorithms can now do part of the work while the sound is being recorded; previously, recordings were static, and processing them was a long-term affair.
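The “cut up into small segments” step can be sketched as follows: slice the waveform into short overlapping frames and take the magnitude spectrum of each, producing the spectrogram-like representation that recognition systems analyze. This is a minimal sketch, not Dr. Glass’s actual pipeline; the frame size, hop length, and Hann window are conventional but arbitrary choices.

```python
import numpy as np

def frames_to_spectra(signal, frame_size=512, hop=256):
    """Slice a 1-D signal into overlapping windowed frames and return
    the magnitude spectrum of each frame (a basic spectrogram)."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_size] * window
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)  # shape: (n_frames, frame_size // 2 + 1)

# Example: one second of a 440 Hz sine sampled at 16 kHz
sr = 16_000
t = np.arange(sr) / sr
spec = frames_to_spectra(np.sin(2 * np.pi * 440 * t))
```

Each row of `spec` describes the frequency content of one short moment of sound; cataloging and comparing such rows is what lets an algorithm match incoming audio against known speech sounds, frame by frame, even while recording is still in progress.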

Now that speech recognition is functional enough to power “virtual assistants”, the issues that arise are behavioral. As Marshall McLuhan theorized, a medium need only exist to cause change; the intent behind the medium does not matter. So it is with virtual assistants: their mere existence seems to reiterate the stereotype that women must serve men, or at least be in a subservient position. There are quite a few articles and research papers on this subject—1, 2, 3, 4, 5, etc.—and so the next steps of work on speech recognition should not mainly be about technical capacity, but should instead focus on human behavior and psychology, and on how users experience interactions with a device that is signified as a female entity at their service.

Learning Outcomes

– LO2: Illustrate the advantages of improved media authoring tools.

– LO4: Compare past, present, and future developments in speech processing technology.
