
Integration of Speech Recognition-based Caption Editing System with Presentation Software

Written By

Kohtaroh Miyamoto, Masakazu Takizawa and Takashi Saito

Published: December 1st, 2009

DOI: 10.5772/7736

1. Introduction

1.1. Background

Recently, an increasing amount of e-Learning material including audio and presentation slides is being provided through the Internet or through private networks (intranets). Many hearing impaired people and senior citizens require captioning to understand such content. Captioning is a vital part of accessibility, and standards such as “WCAG 2.0” and “JIS X8341-3 5-4-d”, as well as laws such as “Section 508 of the Rehabilitation Act”, assure accessibility to publicly available content.


There are many ongoing efforts to enhance automated speech recognition, but here we focus on post-editing to assure accurate captioning for digital archives. We introduce the “IBM Caption Editing System with Presentation Integration” (hereafter CESPI), which is an extension of the IBM Caption Editing System (hereafter CES). CESPI includes all the functions of CES and adds the presentation integration functions.

CES encapsulates the speech recognition engine for transcribing audio into text (CES Recorder) and also provides various editing features for error correction (CES Master and CES Client). As shown in Figure 1, CESPI integrates presentation software in various ways for both the CES Recorder and the CES Master System. Figure 2 shows a sample output of CESPI, which is composed of video, captions, and a slide show of presentation images.

Figure 1.

CESPI receives audio input, and the CES Recorder transcribes the audio into text using the encapsulated speech recognition engine. The CES Master System and CES Client System allow collaborative editing. CESPI adds a presentation integration feature to both the CES Recorder and the CES Master System.

Figure 2.

Sample output of CESPI. The presentation slide image is on the left-hand side, the video image is on the upper right, and the caption is on the lower right.

1.2. Previous Methods

Automatic Speech Recognition (ASR) engines have improved in accuracy over time; McCowan et al. (2004) discuss in detail how error rates should be handled for speech recognition results. Many ASR-related programs are being introduced, most noticeably for university lecture transcription. Bain et al. (2005) introduce ASR technology and how it is being adopted by universities. Wald & Bain (2008) introduce the Liberated Learning Project, Kheir & Way (2007) the VUST project, and Hewitt et al. (2008) the SpeakView project. Itoh et al. (2008) introduce the Join Project, which is based on CES.

There are also some limited efforts to use ASR for transcribing television broadcasts. Lambourne et al. (2004) and Imai et al. (2005) discuss the difficulties in adapting ASR to broadcasting, since captioning must be completely accurate. Making corrections to the ASR transcriptions in near real-time introduces many challenges.

While most of these programs focus on real-time transcription, there is also strong demand for transcribing digital archives of lectures, videos, etc. To create such content, it is necessary to correct the transcription errors. Correcting these errors is a very labor-intensive task (hereafter we call this task caption editing, and a system for performing it a caption editing system). Our primary focus is therefore to provide a caption editing system that is highly efficient for the user.

Goto et al. (2007) introduce PodCastle, a service available on the Internet that transcribes podcast content by ASR. Users can correct erroneous words, basically by selecting from ASR candidates. Munteanu et al. (2008) introduce a wiki-like caption editing feature to enhance a webcast system.

1.3. Our Prior Work

We have previously introduced our CES technology and how it has been adopted in universities (Arakawa et al. 2006). We also introduced how the system supports collaboration between editors in different roles (Miyamoto 2005; Miyamoto & Takizawa 2009). Specifically, the master editor, who is responsible for the final output, uses the CES Master Editing System, while the client, who may be any novice user, uses the CES Client Editing System, and the two are connected by a network. We also showed how the caption editing steps can be improved using three major concepts: “complete audio synchronization”, “completely automatic audio control”, and “status marking”. As a result, we showed a 30.7% improvement in caption editing cost.

In CES, the output phrases (candidate caption lines) from the voice recognition engine are laid out vertically as individual lines along with timestamps. “Complete audio synchronization” means that the keyboard focus always matches the audio replay position. For example, if the audio is playing at one position (e.g. 5 seconds) while the keyboard input focus is at a different position (e.g. 10 seconds), it is extremely difficult to correct the erroneous words while listening to the audio. CES plays the audio in synchronization with the associated caption lines, so the audio focus always matches the caption line focus.

The second concept, “completely automatic audio control”, means that the audio is fully controlled by the system. Users are not required to “replay” and “stop” the audio manually (usually a huge number of times). When editing begins, the focus is set on the initial series of words, and the audio associated with that portion is replayed automatically. By comparing the audio with the transcribed words, the user determines whether the words are correct. If they are, the user presses the enter key to move the focus to the next series of words; if not, the user makes the correction. The audio is replayed repeatedly to urge the user to act. The replay stops automatically when the user types any key, since it is usually annoying to hear the audio while typing. A long pause in typing automatically restarts the audio. As a result, the user does not need to operate the audio replay at all and can concentrate solely on making corrections. Although this may seem obvious when written down, to our knowledge CES is currently the only system that has this feature.
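The control flow described above can be sketched as a small state machine. This is only an illustrative model, not the CES implementation: the class name, the event methods, and the 2-second pause threshold are assumptions, and no real audio is played.

```python
from dataclasses import dataclass


@dataclass
class AutoAudioController:
    """Toy model of automatic audio control for one editing session."""
    num_lines: int
    pause_threshold: float = 2.0  # assumed: seconds of typing silence before replay restarts
    focus: int = 0                # caption line holding both keyboard and audio focus
    playing: bool = True          # audio of the focused line loops until the user acts
    last_key_time: float = 0.0

    def on_key(self, now: float) -> None:
        """Any ordinary keystroke silences the audio while the user types."""
        self.playing = False
        self.last_key_time = now

    def on_tick(self, now: float) -> None:
        """Called periodically; a long pause in typing restarts the replay."""
        if not self.playing and now - self.last_key_time >= self.pause_threshold:
            self.playing = True

    def on_enter(self) -> None:
        """Enter confirms the line; focus and audio move together to the next line."""
        if self.focus < self.num_lines - 1:
            self.focus += 1
        self.playing = True


ctl = AutoAudioController(num_lines=3)
ctl.on_key(now=1.0)      # user starts typing: audio stops
stopped = ctl.playing
ctl.on_tick(now=2.0)     # only 1 s of silence: audio stays off
ctl.on_tick(now=3.5)     # 2.5 s of silence: replay restarts
restarted = ctl.playing
ctl.on_enter()           # line confirmed: move to the next line
print(stopped, restarted, ctl.focus, ctl.playing)
```

Keeping the keyboard focus and the audio position in one `focus` field is what makes the two impossible to desynchronize in this sketch.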

The last concept is “status marking”. The unverified lines are automatically distinguished from the corrected lines. As shown in Figure 3, each caption line in CES includes a button that is used to mark the status of that line, and the mark also corresponds to the font color. The marks have several useful meanings, but basically they make it easier to keep track of how far the caption editing has progressed. This is important in many cases, because tracking progress makes it possible to estimate the projected finishing time and to take appropriate action when the target deadline may be missed.

Figure 3.

The sample image of CES is shown.

Here in this example, all of the caption lines are initially marked as “unverified” (question mark “?”). As the corrections proceed the flags are automatically converted to “determined” (circles). Here, caption lines 1 to 17 are correct since they were either correctly transcribed by the voice recognition engine or they were corrected using the editing feature in CES. Caption lines 18 and later are still unverified.
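As a rough illustration of how status marks support progress tracking, the sketch below counts “determined” lines and extrapolates the remaining time linearly. The function names, the 25-line total, and the 34 minutes of elapsed time are hypothetical; only the 17 verified lines come from the example above.

```python
def progress(statuses):
    """Return (verified_count, total_count) for a list of status marks."""
    done = sum(1 for s in statuses if s == "determined")
    return done, len(statuses)


def projected_remaining_minutes(statuses, elapsed_minutes):
    """Linear extrapolation: assume the remaining lines take as long per line as the verified ones."""
    done, total = progress(statuses)
    if done == 0:
        return None  # no basis for an estimate yet
    return elapsed_minutes * (total - done) / done


# Lines 1 to 17 are "determined"; the remaining 8 lines are hypothetical.
statuses = ["determined"] * 17 + ["unverified"] * 8
print(progress(statuses))
print(projected_remaining_minutes(statuses, elapsed_minutes=34.0))
```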

Figure 4 illustrates how the system works from the (caption) editor’s perspective. The audio is played automatically, so the editor can focus on listening. As soon as the editor begins to type, the audio stops automatically. When the editor is unsure and pauses typing, the audio automatically starts to play again. Once the editor has made the necessary changes and hits the enter key, the keyboard and audio focus automatically move to the next target line.

Figure 4.

The figure shows how the caption editing task proceeds using CES. All the audio processing is automatic, and the user merely needs to focus on making the necessary corrections.

We also showed how external scripts can help reduce the caption editing work. The experimental results show, for example, that when the recognition rate is 60.9%, total editing hours decreased by 35% (Miyamoto 2006); the method of matching the script with the erroneous transcription results is introduced in Miyamoto (2008). Furthermore, we introduced a hybrid caption editing system (Miyamoto et al. 2007) which integrates a “line unit type” subsystem (efficient for editing when the speech recognition rate is relatively high) with a “word processor type” subsystem (efficient for editing when the recognition rate is relatively low). The input strings to the “word processor type” subsystem are matched to the “line unit type” subsystem. In experiments on caption editing efficiency under various speech recognition rate conditions, the hybrid system showed a significant advantage over previous caption editing methods in the number of interactions and in editing time. Another technique we introduced (Arakawa et al. 2008) makes real-time automated error corrections by using confidence scores of speech recognition and automated matching algorithms against sources such as text in presentation software or scripts.

Our work goes beyond the efficiency of caption editing and also discloses a method (Miyamoto & Ikawa 2008) to safely eliminate and dynamically distribute confidential information from multimedia content with audio, captions, video, and presentation materials.

In this chapter, we focus on the integration of presentation software with CES to increase the efficiency of caption editing. Presentation software provides many useful features for easily creating effective e-Learning content in the following 2 steps.

1) Prepare a presentation file by combining text, pictures, visual layout, and any other provided features.

2) Give the oral presentation using the slide show feature of the presentation software. At the same time, record the video with a video camera and/or record the audio of the oral presentation.


2. Preliminary Survey and Investigation

We conducted a survey to see whether the combination of video with audio, captions, and presentation slides (hereafter “multimedia composite”) is helpful in understanding the content. We created 4 multimedia composites and then allowed a total of 80 senior citizens and people with disabilities to freely view any content of interest. After viewing, we administered a survey asking whether the multimedia composite is useful. As shown in Table 1, 66.3% of the respondents answered either “Strongly Agree” or “Agree” (53 of 80), and favorable answers appeared in every age group. We therefore concluded that a multimedia composite is useful for better understanding in e-Learning.

Age Group | Strongly Agree | Agree | Disagree | Strongly Disagree
20s | 0 | 4 | 0 | 0
30s | 0 | 1 | 1 | 0
40s | 0 | 3 | 2 | 0
50s | 0 | 6 | 6 | 0
60s | 2 | 9 | 6 | 0
70s | 3 | 21 | 10 | 0
80 and higher | 2 | 2 | 2 | 0
Total | 7 | 46 | 27 | 0

Table 1.

Usefulness of Multimedia Composite.
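The totals and the share of favorable answers can be re-derived from the rows of Table 1. The sketch below is only a consistency check; the variable names are ours.

```python
# Rows of Table 1: (Strongly Agree, Agree, Disagree, Strongly Disagree)
table1 = {
    "20s": (0, 4, 0, 0),
    "30s": (0, 1, 1, 0),
    "40s": (0, 3, 2, 0),
    "50s": (0, 6, 6, 0),
    "60s": (2, 9, 6, 0),
    "70s": (3, 21, 10, 0),
    "80 and higher": (2, 2, 2, 0),
}

# Column sums reproduce the "Total" row.
totals = [sum(column) for column in zip(*table1.values())]
respondents = sum(totals)
favorable = totals[0] + totals[1]        # Strongly Agree + Agree
share = 100 * favorable / respondents    # percentage of favorable answers

print(totals, respondents, f"{share:.2f}%")
```

The computed share of 66.25% corresponds to the 66.3% reported in the text.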

Next, we conducted an investigation to see whether multimedia composites are captioned. We searched the Internet for multimedia composites and found that, out of 100 composites, only 21 were adequately captioned and 1 merely provided transcript text. (The conditions were that the web sites were free of charge, with a maximum of 5 composites per domain.) The main reason for this low rate of captioning seems to be the high labor cost. There are several approaches to captioning, but here we focus on using speech recognition technology. Unfortunately, the voice recognition accuracy rate is still not 100%, and therefore there is still a need for an effective caption editing system to correct the errors.

The conclusion of our preliminary survey and investigation is that in order to reduce the costs of captioning content with audio and presentation slides, there is a strong need for an effective caption editing tool. The presentation slides are mostly created by commercial presentation software. In this paper, we focus on a speech recognition error correction system which integrates a caption editing system with presentation software.


3. Problems and Apparatus

Based on the preliminary survey and investigation, we investigated the available caption editing tools that generate captions from audio, and identified 3 major problems. The three major problems between CES and presentation software were identified as “Content Layout Definitions”, “Editing Focus Linkage”, and “Exporting to Speaker Notes”. To address these problems, we extended our Caption Editing System (CES) to integrate it with Microsoft PowerPoint, creating our new Caption Editing System with Presentation Integration (CESPI). The architecture in terms of code interface is shown in Figure 5.

Figure 5.

The base platform is Microsoft Windows 2000/XP. The user interface of CESPI is built with Visual Basic V6.0. IBM ViaVoice engine control is implemented in Microsoft Visual C++ 6.0. The interface between ViaVoice and CESPI is the Speech Manager API (SMAPI) V7.0, and the interface between CESPI and Microsoft PowerPoint is Visual Basic for Applications (VBA) V6.0.

Finally, we conducted a field test to see whether the real-time transcription accuracy of state-of-the-art speech recognition is satisfactory. We transcribed 11 university lectures in real-time and obtained 81.8% accuracy. Unfortunately, we found that the speech recognition accuracy does not necessarily reach a satisfactory level: among the many observations, we received many comments that required the speech recognition rate to be at least 85%, and preferably 90%, to be satisfactory. We therefore conclude that a combined human-computer transcription method is needed to raise the accuracy (obviously without raising the cost by relying on many human resources).

3.1. Content Layout Definition

A multimedia composite consists of several visual components such as video, presentation images, and captions. These components need to be laid out effectively in position and size according to such parameters as font face, font size, maximum number of characters per line, presentation image size, video image size, resolution, overall size, and overlapping options. (Figure 6 shows a bad example with excessive space, overlap, and cutoff.) CES (and CESPI) supports the RealOne Player through the SMIL format and Windows Media Player through the SAMI format.
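For readers unfamiliar with SAMI, the following sketch renders timestamped caption lines as a minimal SAMI document. It is not the CESPI exporter: the function name, the class name `ENUSCC`, and the style rules are illustrative assumptions, and only the basic `SYNC`/`P` structure of the format is used.

```python
from html import escape


def to_sami(captions, title="Lecture"):
    """Render (start_ms, text) pairs as a minimal SAMI document string."""
    lines = [
        "<SAMI>",
        "<HEAD>",
        f"<TITLE>{escape(title)}</TITLE>",
        '<STYLE TYPE="text/css"><!--',
        "P { font-family: sans-serif; color: black; }",          # assumed styling
        ".ENUSCC { Name: English; lang: en-US; SAMIType: CC; }",  # assumed class name
        "--></STYLE>",
        "</HEAD>",
        "<BODY>",
    ]
    # One SYNC element per caption line; Start is the display time in milliseconds.
    for start_ms, text in sorted(captions):
        lines.append(f"<SYNC Start={start_ms}><P Class=ENUSCC>{escape(text)}")
    lines += ["</BODY>", "</SAMI>"]
    return "\n".join(lines)


doc = to_sami([(5000, "Captioning is a vital part of accessibility."),
               (0, "Welcome to the lecture.")])
print(doc)
```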

The task of effectively laying out these components manually can be quite time consuming. CESPI solves this problem by automatically laying out these components based on each parameter. As shown in Figure 7, CESPI also provides a layout customization feature which allows the user to easily change the details of the layout.
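One simple way to lay out the three components automatically is sketched below. It is not the CESPI algorithm: the placement rule (caption strip at the bottom, slide at the upper left, video at the upper right), the 4:3 aspect ratios, and the 100-pixel caption height are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int

    def overlaps(self, other: "Rect") -> bool:
        """True if the two rectangles share any interior area."""
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)

    def inside(self, win_w: int, win_h: int) -> bool:
        """True if the rectangle is not cut off by the window boundary."""
        return (self.x >= 0 and self.y >= 0
                and self.x + self.w <= win_w and self.y + self.h <= win_h)


def auto_layout(win_w, win_h, caption_h, slide_aspect=(4, 3), video_aspect=(4, 3)):
    """Place slide (upper left), video (upper right), and caption (bottom strip)."""
    top_h = win_h - caption_h                     # height left above the caption strip
    sw, sh = slide_aspect
    slide_w = min(top_h * sw // sh, win_w)        # slide fills the available height
    slide = Rect(0, 0, slide_w, slide_w * sh // sw)
    vw, vh = video_aspect
    video_w = win_w - slide_w                     # video takes the remaining width
    video = Rect(slide_w, 0, video_w, min(video_w * vh // vw, top_h))
    caption = Rect(0, top_h, win_w, caption_h)
    return {"slide": slide, "video": video, "caption": caption}


parts = auto_layout(win_w=800, win_h=460, caption_h=100)
for name, rect in parts.items():
    print(name, rect)
```

Integer aspect ratios are used instead of floating-point ones so that the pixel sizes are exact.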

Figure 6.

The figure shows a sample of ill defined layout, where presentation image is surrounded by excessive empty space. Video and caption overlap with the presentation image. The caption is being cut off by the window boundary.

Figure 7.

The figure shows the Change Content Layout dialog on the left hand side and the Select Layout Video + PPT + Caption dialog with the focus on the right hand side.

3.2. Editing Focus Linkage

While editing the captions of certain multimedia composites, it is useful to reference special terminology used in the presentation slides. Because caption editing tools and presentation software were separate applications, the operating system only allows one application to have the focus at one time. Therefore it was necessary to frequently switch the focus between these two applications. Also, the user had to change to the corresponding slide pages manually. CESPI solves this problem by automatically laying out the captions, page images, and page text in a single application window, which makes it easier to view and edit the captions. CESPI also automatically interlinks between the caption timestamps and the presentation page. In other words, the presentation page always corresponds to the focused caption. (Figure 8 shows the actual user interface of the CESPI Master Editing Subsystem.)

Figure 8.

The left-hand side shows the “Caption Line Text” (start time, state, choice, and caption). “Start Time” (1) is the 0-based timestamp at which the caption is displayed. “State” is indicated by 2, “Choice” by 3, and “Caption” by 4. The right-hand side shows the “Presentation Page”: the “Presentation Image” is indicated by 5 and the retrieved text by 6.
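The link between a caption timestamp and the presentation page can be sketched as a binary search over the times at which each page was first shown. This is an illustrative sketch: the function name and the assumption that slide-change times are available as a sorted list of seconds are ours.

```python
from bisect import bisect_right


def page_for_caption(caption_start, page_starts):
    """Return the 1-based slide page on screen at caption_start (seconds).

    page_starts must be sorted; page_starts[i] is when page i + 1 appeared.
    """
    return max(1, bisect_right(page_starts, caption_start))


page_starts = [0.0, 42.5, 120.0]   # hypothetical slide-change times
print(page_for_caption(5.0, page_starts))
print(page_for_caption(42.5, page_starts))
print(page_for_caption(300.0, page_starts))
```

When the focused caption changes, an editor built this way would look up the page once and display its image and retrieved text, so the user never switches slides manually.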

3.3. Speaker Notes Export

Using presentation software, a speaker may define narrative notes for each presentation page (the “speaker notes”). In many cases, a single presentation package used by one presenter will be later reused by another presenter. In such cases, since the captions and speaker notes are similar, it is efficient to use the initial caption. Previously, in order to export captions to speaker notes, manual operations such as moving to the proper page and then performing copy and paste operations were required. Therefore as illustrated in Figure 9, CESPI has a capability for automatically exporting the corresponding page of the caption into the speaker notes of the presentation package.

Figure 9.

Master caption is exported into the speaker notes portion of the presentation. The speaker notes can be referenced to the client caption.
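The grouping step behind such an export can be sketched as follows. The sketch only collects the caption lines that belong to each page into one notes string; writing the result into the presentation file would go through the presentation software's own interface and is not shown. The function name and the sample data are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def notes_by_page(captions, page_starts):
    """Group (start_seconds, text) caption lines into one notes string per page."""
    grouped = defaultdict(list)
    for start, text in sorted(captions):
        page = max(1, bisect_right(page_starts, start))  # page shown at this time
        grouped[page].append(text)
    return {page: " ".join(lines) for page, lines in grouped.items()}


captions = [
    (1.0, "Welcome everyone."),
    (50.0, "First, the background."),
    (6.5, "Today we cover captioning."),
]
notes = notes_by_page(captions, page_starts=[0.0, 42.5])
print(notes)
```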

3.4. Real-time Checker and Matching

Our technique for automated error correction (Arakawa et al. 2008) uses confidence scores of speech recognition together with matching algorithms (which match the transcribed result with text retrieved from the presentation software). In theory, if the confidence score is high, the system can show the speech recognition result directly. If the confidence score is low, the system can either let the user make corrections or try to match the result with the text retrieved from the presentation software. Unfortunately, we found that the confidence score is not reliable enough to detect each word’s correctness with high accuracy. We therefore propose making use of a human checker, who merely flags each transcribed result as correct or incorrect. It should be noted that humans (unlike computer systems) can sometimes be very slow in processing their assigned job, which can result in a long delay. The system therefore assures real-time caption presentation by routing a larger share of the words to the matching subsystem as the caption presentation starts to lag. Finally, when the captions are presented, the words that were automatically corrected by matching are shown in blue italics.
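The routing idea can be illustrated with a minimal sketch. It does not reproduce the matching algorithm of the cited work: the linear routing rule, the 10-second maximum delay, and the use of `difflib` string similarity as the matcher are all assumptions made for illustration.

```python
from difflib import get_close_matches


def matcher_share(delay_s, max_delay_s=10.0):
    """Fraction of words routed to automatic matching; rises linearly with the delay."""
    return min(1.0, max(0.0, delay_s / max_delay_s))


def match_to_slide(word, slide_words, cutoff=0.75):
    """Replace a word with the closest slide term if one is similar enough.

    Returns (word, corrected_flag); corrected words would be shown in blue italics.
    """
    hits = get_close_matches(word, slide_words, n=1, cutoff=cutoff)
    return (hits[0], True) if hits else (word, False)


slide_words = ["captioning", "presentation"]
print(matcher_share(0.0), matcher_share(5.0), matcher_share(20.0))
print(match_to_slide("captoning", slide_words))
print(match_to_slide("hello", slide_words))
```

With no delay every word waits for the human checker; once the delay reaches the assumed maximum, every word goes to the matcher so that the captions keep up with the speech.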


4. Results

An experiment was performed to measure the editing time under the following conditions.

1. Each editor used CES and CESPI on approximately 30 minutes of content each.

2. It is known that performance improves as users get used to a system, so 5 editors who already had enough experience with CES and CESPI were chosen to eliminate any inconsistencies due to the learning curve effect (Barloff 1971).

3. Each editor was also assigned different portions of the content for CES and CESPI so that memory from the previous content will not take effect.

4. The task consisted of correcting all the speech recognition errors, laying out the multimedia composite without any overlap or excessive blank space, and exporting the speaker notes to the appropriate page. (Conditions are shown in Table 2.)

Category | Conditions
Window Size | 800x460
Component Layout Position (Video) | Right Upper Position
Component Layout Position (Presentation Image) | Left Upper Position
Component Layout Position (Caption) | Bottom
Caption Font Charset | x-sjis
Caption Font Face |
Caption Font Color | black
Caption Font Size | +3
Other Conditions -1 | No Excessive Empty Space
Other Conditions -2 | No Overlap
Other Conditions -3 | No Cutoff

Table 2.

Various Conditions for the Experiment.

As shown in Table 3, the results showed that CESPI provided a 37.6% improvement in total editing time.

Item | CESPI | CES
Speech Recognition Rate | 81.4% | 80.8%
Average Content Time | 28 min 24 sec | 27 min 58 sec
Number of Characters | 9240 | 9221
Total Average Editing Time | 93 min 46 sec | 127 min 2 sec
Editing Time Average per Content Time | 3.30 | 4.54
Total Efficiency in Percentage | 37.6% | (N/A)

Table 3.

Result of the Experiment.
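The ratios in Table 3 can be re-derived from the reported times. The sketch below assumes that the 37.6% figure is the relative difference between the two editing-time ratios, taken with respect to CESPI; this interpretation reproduces the reported value but is our reading of the table.

```python
def seconds(minutes, secs):
    """Convert a 'min sec' pair from Table 3 into seconds."""
    return 60 * minutes + secs


# Editing time per unit of content time (a lower value means faster editing).
cespi_ratio = seconds(93, 46) / seconds(28, 24)
ces_ratio = seconds(127, 2) / seconds(27, 58)

# Relative improvement of CESPI over CES, in percent (assumed definition).
improvement = 100 * (ces_ratio / cespi_ratio - 1)

print(f"{cespi_ratio:.2f} {ces_ratio:.2f} {improvement:.1f}%")  # prints "3.30 4.54 37.6%"
```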

Figure 10 shows how the saved time is divided among “Content Layout Definition”, “Editing Focus Linkage”, and “Speaker Notes Export”. Content Layout Definition accounted for approximately half of the saved time, followed by Editing Focus Linkage, while Speaker Notes Export made the smallest difference.

Figure 10.

The figure shows that, of the improvement in editing time shown in Table 3, 50.3% is accounted for by Content Layout Definition, 31.1% by Editing Focus Linkage, and 18.6% by Speaker Notes Export.

Content Layout Definition saved the most time because, in CES, content layout required a great deal of trial-and-error editing, whereas in CESPI it required almost no time since the layout is done automatically.


5. Summary

The three major problems between CES and presentation software were identified as “Content Layout Definitions”, “Editing Focus Linkage”, and “Exporting to Speaker Notes”. This chapter has shown how CESPI solves each of these problems. The experiment showed a 37.6% efficiency improvement compared with the previous method. Among the 3 items, “Content Layout Definition” accounted for the largest improvement in time, followed by “Editing Focus Linkage”, with “Speaker Notes Export” last.

Currently, CESPI supports only Microsoft PowerPoint as presentation software. Supporting other presentation software is left as future work.



Many people participated in the preliminary survey, and we would like to thank all of them for sparing their time.


