Usefulness of Multimedia Composite.
1. Introduction
1.1. Background
Recently, an increasing amount of e-Learning material that includes audio and presentation slides is being provided through the Internet or through private networks (intranets). Many hearing-impaired people and senior citizens require captioning to understand such content. Captioning is a vital part of accessibility, and there are standards such as WCAG 2.0 (see http://www.w3.org/TR/WCAG20/) and Section 508 (see http://www.section508.gov/) to assure accessibility to publicly available content.
There are many ongoing efforts to enhance automated speech recognition, but here we focus on post-editing to assure accurate captioning for digital archives. We introduce the method of the “IBM Caption Editing System with Presentation Integration” (hereafter CESPI), which is an extension of the IBM Caption Editing System (hereafter CES; see http://www.alphaworks.ibm.com/tech/ces/).
CES encapsulates the speech recognition engine for transcribing audio into text (CES Recorder) and also provides various editing features for error correction (CES Master and CES Client). As shown in Figure 1, CESPI integrates presentation software in various ways for both the CES Recorder and the CES Master System. Figure 2 shows a sample output of CESPI, which is composed of video, captions, and a slide show of presentation images.
1.2. Previous Methods
Automatic Speech Recognition (ASR) engines have improved in accuracy. McCowan et al. (2004) discuss in detail how error rates should be handled for speech recognition results.
There have also been some limited efforts to use ASR for transcribing television broadcasts. Lambourne et al. (2004) and Imai et al. (2005) discuss the difficulties in adapting ASR to broadcasting, since captioning must be completely accurate. Making corrections to the ASR transcriptions in near real-time introduces many challenges.
While most of these programs focus on real-time transcription, there is also strong demand for transcribing digital archives of lectures, videos, etc. To create such content, it is necessary to correct the transcription errors. Correcting these errors is a very labor-intensive task (hereafter we call this task caption editing, and a system for performing it a caption editing system). Therefore, our primary focus is to provide a caption editing system that is highly efficient for the user.
Goto et al. (2007) introduce PodCastle (see http://podcastle.jp/).
1.3. Our Prior Work
We have introduced our CES technology and how it has been adopted in universities (Arakawa et al. 2006). We previously described how the system supports collaboration between editors in different roles (Miyamoto 2005; Miyamoto & Takizawa 2009). Specifically, the master editor, who is responsible for the final output, uses the CES Master Editing System, while the client, who may be any novice user, uses the CES Client Editing System; the two are connected over a network. We also showed how the caption editing steps can be improved using three major concepts: “complete audio synchronization”, “completely automatic audio control”, and “status marking”. As a result, we showed a 30.7% improvement in caption editing cost.
In CES, the output phrases from the voice recognition engine (the candidate caption lines) are laid out vertically as individual lines along with timestamps. “Complete audio synchronization” means that the keyboard focus always matches the audio replay position. For example, if the audio is playing at one position (e.g., 5 seconds) while the keyboard input focus is at a different position (e.g., 10 seconds), it is extremely difficult to correct erroneous words while listening to the audio. CES plays the audio in synchronization with the associated caption lines, so the audio focus always matches the caption line focus.
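The synchronization rule can be illustrated with a minimal sketch. It is not the CES implementation; the `CaptionLine` type, its fields, and the function `audio_span_for_focus` are hypothetical names introduced only for this example, which assumes that each caption line carries the start time of its phrase.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CaptionLine:
    start_sec: float  # timestamp at which the phrase begins in the audio
    text: str         # candidate caption text from the recognizer


def audio_span_for_focus(lines: List[CaptionLine], focus: int,
                         total_sec: float) -> Tuple[float, float]:
    """Return the (start, end) audio span to play for the focused line.

    The span runs from the focused line's timestamp to the next line's
    timestamp (or to the end of the audio for the last line), so the
    audio position cannot drift away from the keyboard focus.
    """
    start = lines[focus].start_sec
    end = lines[focus + 1].start_sec if focus + 1 < len(lines) else total_sec
    return start, end


# Hypothetical caption lines with timestamps at 0, 5, and 10 seconds.
lines = [
    CaptionLine(0.0, "welcome to the lecture"),
    CaptionLine(5.0, "today we discuss captioning"),
    CaptionLine(10.0, "let us begin"),
]
span = audio_span_for_focus(lines, focus=1, total_sec=14.0)
```

In this sketch the audio span is derived from the focus, never the other way around, which is one simple way to guarantee that the two cannot disagree.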
The second concept, “completely automatic audio control”, means that the audio is controlled fully automatically by the system. Users are not required to replay and stop the audio manually (which usually has to be done a huge number of times). As the editing begins, the focus is set on the initial series of words, and the audio associated with that portion is replayed automatically. By comparing the audio with the transcribed words, the user determines whether the words are correct. If they are, the user presses the Enter key to move the focus to the next series of words; if not, the user makes the correction. The audio is replayed repeatedly to prompt the user for action. The replay stops automatically when the user types any key, since it is usually annoying to hear the audio while typing. A long pause in typing automatically restarts the audio. As a result, the user does not need to operate the audio replay at all and can concentrate solely on making corrections. Although this may seem obvious when written down, as far as we know CES is currently the only system that has this feature.
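The control rules described above amount to a small state machine. The following sketch is only an illustration under stated assumptions: the class `AudioController`, its method names, and the 2-second pause threshold are hypothetical and are not taken from CES.

```python
from dataclasses import dataclass


@dataclass
class AudioController:
    """Toy model of automatic audio control (all names are hypothetical)."""
    num_lines: int
    pause_restart_sec: float = 2.0  # assumed idle time before replay resumes
    focus: int = 0
    playing: bool = True            # audio starts automatically on line 0
    idle_sec: float = 0.0
    done: bool = False

    def on_key(self, key: str) -> None:
        self.idle_sec = 0.0
        if key != "ENTER":
            # Any other key means the user is typing a correction: stop audio.
            self.playing = False
            return
        if self.focus + 1 < self.num_lines:
            # Line accepted: move the focus and replay the next line's audio.
            self.focus += 1
            self.playing = True
        else:
            # Last line accepted: nothing is left to replay.
            self.done = True
            self.playing = False

    def on_tick(self, elapsed_sec: float) -> None:
        """Advance the idle timer; a long pause in typing restarts the audio."""
        if self.done or self.playing:
            return
        self.idle_sec += elapsed_sec
        if self.idle_sec >= self.pause_restart_sec:
            self.playing = True
            self.idle_sec = 0.0


ctl = AudioController(num_lines=2)
ctl.on_key("a")              # typing stops the audio
stopped = ctl.playing
ctl.on_tick(2.5)             # a pause longer than the threshold restarts it
restarted = ctl.playing
ctl.on_key("ENTER")          # accept the line; focus and audio move on
```

The user never issues a play or stop command in this model; both follow from keystrokes and idle time alone.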
The last concept is “status marking”. The unverified lines are automatically distinguished from the corrected lines. As shown in Figure 3, each caption line in CES includes a button that is used to mark the status of that line. The mark also corresponds to the color of the font. The marks have several useful meanings, but basically they make it easier to keep track of how far the caption editing has progressed. This is important because it is often necessary to track the progress of the caption editing work, estimate the projected finishing time, and take appropriate action when the target deadline may be missed.
In this example, all of the caption lines are initially marked as “unverified” (question mark “?”). As the corrections proceed, the marks are automatically converted to “determined” (circles). Here, caption lines 1 to 17 are correct, since they were either correctly transcribed by the voice recognition engine or corrected using the editing feature in CES. Caption lines 18 and later are still unverified.
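As an illustration of how status marks support progress tracking, the sketch below computes the share of determined lines and a projected remaining time. The linear extrapolation, the mark symbols, and the session figures (34 lines, 20 minutes elapsed) are assumptions made for this example; the source does not state how CES estimates the finishing time.

```python
from typing import List

UNVERIFIED = "?"   # mark shown before a line has been checked
DETERMINED = "o"   # mark shown once a line is confirmed or corrected


def progress(marks: List[str]) -> float:
    """Fraction of caption lines already marked as determined."""
    return sum(m == DETERMINED for m in marks) / len(marks)


def projected_remaining_min(marks: List[str], elapsed_min: float) -> float:
    """Linear extrapolation: assume each remaining line takes as long,
    on average, as the lines finished so far."""
    done = sum(m == DETERMINED for m in marks)
    if done == 0:
        raise ValueError("no determined lines yet; cannot extrapolate")
    return elapsed_min * (len(marks) - done) / done


# Hypothetical session: lines 1 to 17 of 34 are determined after 20 minutes.
marks = [DETERMINED] * 17 + [UNVERIFIED] * 17
share_done = progress(marks)
remaining = projected_remaining_min(marks, elapsed_min=20.0)
```

Comparing `remaining` with the time left until the deadline is the kind of check that the marks make possible.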
Figure 4 illustrates how the system works from the (caption) editor’s perspective. The audio is played automatically, so the editor can focus on listening. As soon as the editor begins to type, the audio stops automatically. When the editor is unsure and pauses the typing, the audio automatically starts to play again. Once the editor makes the necessary changes and hits the Enter key, the keyboard and audio focus automatically move to the next target line.
We also showed how external scripts can help reduce the caption editing work. The experimental results show that, for example, when the recognition rate is 60.9%, total editing hours decreased by 35% (Miyamoto 2006); the method of matching the script with the erroneous transcription results was also introduced (Miyamoto 2008). Furthermore, we introduced a hybrid caption editing system (Miyamoto et al. 2007) that integrates a “line unit type” subsystem (efficient for editing when the speech recognition rate is relatively high) with a “word processor type” subsystem (efficient for editing when the recognition rate is relatively low). The input strings to the “word processor type” subsystem are matched to the “line unit type” subsystem. In experiments comparing caption editing efficiency against previous caption editing methods under various speech recognition rates, the hybrid system showed a significant advantage in the number of interactions and in editing time. Another technique we introduced (Arakawa et al. 2008) makes real-time automated error corrections by using the confidence scores of speech recognition together with algorithms that automatically match against sources such as text in presentation software or scripts.
Our work goes beyond the efficiency of caption editing and also discloses a method (Miyamoto & Ikawa 2008) to safely eliminate and dynamically distribute confidential information from multimedia content with audio, captions, video, and presentation materials.
In this chapter, we focus on the integration of presentation software with CES to increase the efficiency of caption editing. Presentation software provides many useful features for easily creating effective e-Learning content in the following 2 steps.
1) Prepare a presentation file by combining text, pictures, visual layout, and any other provided features.
2) Give the oral presentation using the slide show feature of the presentation software. At the same time, record the video with a video camera and/or record the audio of the oral presentation.
2. Preliminary Survey and Investigation
We conducted a survey to see whether the combination of video with audio, captions, and presentation slides (hereafter “multimedia composite”) is helpful in understanding the content. We created 4 multimedia composites and allowed a total of 80 senior citizens and people with disabilities to freely view any content of interest. After the viewing, we administered a survey asking whether the multimedia composite is useful. The results, shown in Table 1, indicate that 66.3% of the respondents answered either “Strongly Agree” or “Agree”, irrespective of age group. We therefore concluded that a multimedia composite is very useful for better understanding in e-Learning.
Age Group | Strongly Agree | Agree | Disagree | Strongly Disagree |
20s | 0 | 4 | 0 | 0 |
30s | 0 | 1 | 1 | 0 |
40s | 0 | 3 | 2 | 0 |
50s | 0 | 6 | 6 | 0 |
60s | 2 | 9 | 6 | 0 |
70s | 3 | 21 | 10 | 0 |
80 and higher | 2 | 2 | 2 | 0 |
Total | 7 | 46 | 27 | 0 |
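The totals in Table 1 and the 66.3% figure quoted in the text can be rechecked with a few lines of arithmetic. The sketch below only restates the table; the variable names are our own.

```python
# Rows of Table 1: (strongly agree, agree, disagree, strongly disagree)
table1 = {
    "20s": (0, 4, 0, 0),
    "30s": (0, 1, 1, 0),
    "40s": (0, 3, 2, 0),
    "50s": (0, 6, 6, 0),
    "60s": (2, 9, 6, 0),
    "70s": (3, 21, 10, 0),
    "80 and higher": (2, 2, 2, 0),
}

# Column totals, in the same order as the table header.
totals = [sum(column) for column in zip(*table1.values())]
respondents = sum(totals)

# Share of respondents answering "Strongly Agree" or "Agree".
positive_share = (totals[0] + totals[1]) / respondents
```

The column totals are 7, 46, 27, and 0 for 80 respondents, and `positive_share` is 53/80 = 0.6625, which the text reports as 66.3%.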
Next, we conducted an investigation to see whether multimedia composites are captioned. We searched the Internet for multimedia composites and found that, out of 100 composites, only 21 were adequately captioned and 1 merely provided a text transcript. (The conditions were that the web sites be free of charge, with a maximum of 5 composites per domain.) The main reason for this low rate of captioning appears to be the high labor cost. There are several approaches to captioning, but here we focus on using speech recognition technology. Unfortunately, the voice recognition accuracy rate is still not 100%, and therefore there is still a need for an effective caption editing system to correct the errors.
The conclusion of our preliminary survey and investigation is that in order to reduce the costs of captioning content with audio and presentation slides, there is a strong need for an effective caption editing tool. The presentation slides are mostly created by commercial presentation software. In this paper, we focus on a speech recognition error correction system which integrates a caption editing system with presentation software.
3. Problems and Apparatus
Based on the preliminary survey and investigation, we examined the available caption editing tools that generate captions from audio and identified three major problems between CES and presentation software: “Content Layout Definitions”, “Editing Focus Linkage”, and “Exporting to Speaker Notes”. To address these problems, we extended our Caption Editing System (CES) to integrate it with Microsoft PowerPoint, creating our new Caption Editing System with Presentation Integration (CESPI). The architecture in terms of code interface is shown in Figure 5.
Finally, we conducted a field test to see whether the real-time transcription accuracy of state-of-the-art speech recognition is satisfactory. We transcribed 11 university lectures in real time and obtained 81.8% accuracy. Unfortunately, we found that the speech recognition accuracy does not necessarily reach a satisfactory level. Among the many observations, we received many comments stating that the speech recognition rate needs to be at least 85%, and preferably 90%, to be satisfactory. We therefore conclude that a combined human and computer transcription method is needed to raise the accuracy (without raising the cost by relying on many human resources).
3.1. Content Layout Definition
A multimedia composite consists of several visual components such as video, presentation images, and captions. These components need to be laid out effectively in position and size according to parameters such as font face, font size, maximum number of characters per line, presentation image size, video image size, resolution, overall size, and overlapping options. (Figure 6 shows a bad example with excessive space, overlap, and cutoff.) CES (and CESPI) supports the RealOne Player through SMIL (see http://www.w3.org/AudioVideo/).
The task of effectively laying out these components manually can be quite time consuming. CESPI solves this problem by automatically laying out these components based on each parameter. As shown in Figure 7, CESPI also provides a layout customization feature which allows the user to easily change the details of the layout.
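To make the layout constraints concrete, the following sketch places the three components in the 800x460 window used in the experiment (presentation image upper left, video upper right, caption along the bottom) and checks for overlap, cutoff, and empty space. It is a simplified illustration, not the CESPI layout algorithm: `Rect`, `auto_layout`, `is_valid`, the video width of 320, and the caption height of 100 are all assumptions, and aspect ratios and font parameters are ignored.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share any interior area."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def auto_layout(win_w: int, win_h: int, video_w: int,
                caption_h: int) -> Dict[str, Rect]:
    """Image upper left, video upper right, caption strip at the bottom."""
    upper_h = win_h - caption_h
    return {
        "image": Rect(0, 0, win_w - video_w, upper_h),
        "video": Rect(win_w - video_w, 0, video_w, upper_h),
        "caption": Rect(0, upper_h, win_w, caption_h),
    }


def is_valid(layout: Dict[str, Rect], win_w: int, win_h: int) -> bool:
    """Check the three conditions: no cutoff, no overlap, no empty space."""
    rects = list(layout.values())
    no_cutoff = all(r.left >= 0 and r.top >= 0
                    and r.right <= win_w and r.bottom <= win_h for r in rects)
    no_overlap = not any(overlaps(a, b)
                         for i, a in enumerate(rects) for b in rects[i + 1:])
    no_empty_space = sum(r.width * r.height for r in rects) == win_w * win_h
    return no_cutoff and no_overlap and no_empty_space


layout = auto_layout(win_w=800, win_h=460, video_w=320, caption_h=100)
```

Because the rectangles are derived from the window size, changing one parameter (for example the caption height) moves the other components consistently, which is what removes the trial and error of manual placement.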
3.2. Editing Focus Linkage
While editing the captions of certain multimedia composites, it is useful to refer to the special terminology used in the presentation slides. Because caption editing tools and presentation software were separate applications, and the operating system allows only one application to have the focus at a time, it was necessary to switch the focus frequently between the two applications. The user also had to move to the corresponding slide page manually. CESPI solves this problem by automatically laying out the captions, page images, and page text in a single application window, which makes it easier to view and edit the captions. CESPI also automatically links the caption timestamps to the presentation pages. In other words, the displayed presentation page always corresponds to the focused caption. (Figure 8 shows the actual user interface of the CESPI Master Editing Subsystem.)
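One simple way to link a caption timestamp to a presentation page is a binary search over the times at which each page was first shown. The sketch below assumes that such slide-change times were recorded during the presentation; the function name `page_for_caption` and the sample times are hypothetical.

```python
from bisect import bisect_right
from typing import List


def page_for_caption(slide_start_secs: List[float], caption_sec: float) -> int:
    """Return the 1-based page that was on screen at `caption_sec`.

    `slide_start_secs[i]` is the time at which page i+1 was first shown.
    The list must be sorted in ascending order.
    """
    index = bisect_right(slide_start_secs, caption_sec)
    # A caption that starts before the first recorded slide maps to page 1.
    return max(index, 1)


# Hypothetical slide-change times for a three-page presentation.
slide_start_secs = [0.0, 42.0, 95.5]
page = page_for_caption(slide_start_secs, caption_sec=50.0)
```

Calling this function whenever the caption focus changes is enough to keep the displayed page in step with the focused caption.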
3.3. Speaker Notes Export
Using presentation software, a speaker may define narrative notes for each presentation page (the “speaker notes”). In many cases, a presentation package used by one presenter is later reused by another presenter. In such cases, since the captions and speaker notes are similar, it is efficient to reuse the initial captions as speaker notes. Previously, exporting captions to speaker notes required manual operations such as moving to the proper page and then copying and pasting. Therefore, as illustrated in Figure 9, CESPI can automatically export the captions corresponding to each page into the speaker notes of the presentation package.
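The grouping step behind such an export can be sketched as follows. The example only builds the note text per page from timestamped captions; writing the text into an actual presentation file depends on the presentation software's own interface and is deliberately left out. The function name and the sample data are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Tuple


def captions_to_speaker_notes(captions: List[Tuple[float, str]],
                              slide_start_secs: List[float]) -> Dict[int, str]:
    """Group corrected captions by page and join them into note text.

    `captions` holds (start time in seconds, text) pairs and
    `slide_start_secs[i]` is the time at which page i+1 was first shown.
    """
    notes = defaultdict(list)
    for start_sec, text in sorted(captions):
        page = max(bisect_right(slide_start_secs, start_sec), 1)
        notes[page].append(text)
    return {page: " ".join(texts) for page, texts in notes.items()}


# Hypothetical corrected captions and slide-change times.
captions = [
    (3.0, "Welcome to the lecture."),
    (20.0, "Today we discuss captioning."),
    (45.0, "Captions help many viewers."),
]
notes = captions_to_speaker_notes(captions, slide_start_secs=[0.0, 42.0])
```

Each entry of `notes` is the text that would otherwise be copied and pasted by hand into the speaker notes of the corresponding page.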
3.4. Real-time Checker and Matching
Our technique for automated error correction (Arakawa et al. 2008) uses the confidence scores of speech recognition together with matching algorithms (which match the transcribed result with text retrieved from presentation software). In theory, if the confidence score is high, the system can show the speech recognition result directly. If the confidence score is low, the system can either allow the user to make corrections or try to match the result with the text retrieved from the presentation software. Unfortunately, we found that the confidence score is not reliable enough to determine the correctness of each word with high accuracy. We therefore propose making use of a human checker, who merely flags each transcribed result as correct or incorrect. It should be noted that humans (unlike computer systems) can sometimes be very slow in processing their assigned job, which can result in a long delay. Therefore, to keep the caption presentation in real time, the system routes a larger share of the words to the matching subsystem as the caption presentation starts to lag. Finally, when the captions are presented, the words that were automatically corrected by matching are shown in blue italics.
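The two mechanisms in this section, delay-dependent routing and matching against slide text, can be sketched as follows. This is an illustration under our own assumptions and not the algorithm of Arakawa et al. (2008): the linear routing rule, the 10-second maximum delay, the similarity cutoff of 0.7, and both function names are hypothetical, and the matching uses approximate string matching from Python's standard `difflib` module as a stand-in.

```python
import difflib
from typing import List


def matcher_share(delay_sec: float, max_delay_sec: float = 10.0) -> float:
    """Fraction of words sent to automatic matching instead of the checker.

    Grows linearly from 0 (no delay) to 1 (delay at or beyond
    `max_delay_sec`), so a slow human checker cannot stall the captions.
    """
    return min(max(delay_sec / max_delay_sec, 0.0), 1.0)


def match_to_slide_text(word: str, slide_words: List[str],
                        cutoff: float = 0.7) -> str:
    """Replace `word` with the closest slide word, or keep it if none is close."""
    candidates = difflib.get_close_matches(word, slide_words, n=1, cutoff=cutoff)
    return candidates[0] if candidates else word


# Hypothetical vocabulary retrieved from the current presentation page.
slide_words = ["captioning", "presentation", "recognition"]
corrected = match_to_slide_text("captoning", slide_words)
unchanged = match_to_slide_text("lecture", slide_words)
share_at_5_sec = matcher_share(5.0)
```

With these assumed settings, a misrecognized word close to a slide term is replaced, an unrelated word is left alone, and half of the words bypass the human checker once the captions lag by 5 seconds.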
4. Results
An experiment was performed to measure the editing time under the following conditions.
1. Editors use CES and CESPI for approximately 30 minutes of content each.
2. It is known that performance improves as you get used to a task, so 5 editors who already had enough experience with CES and CESPI were chosen to eliminate any inconsistencies due to the learning curve effect (Barloff 1971).
3. Each editor was assigned different portions of the content for CES and CESPI so that memory of the previous content would have no effect.
4. The task consists of correcting all the speech recognition errors, laying out the multimedia composite without overlap or excessive blank space, and exporting the speaker notes to the appropriate page. (Conditions are shown in Table 2.)
Category | Conditions |
Window Size | 800x460 |
Component Layout Position (Video) | Right Upper Position |
Component Layout Position (Presentation Image) | Left Upper Position |
Component Layout Position (Caption) | Bottom |
Caption Font Charset | x-sjis |
Caption Font Face | |
Caption Font Color | black |
Caption Font Size | +3 |
Other Conditions -1 | No Excessive Empty Space |
Other Conditions -2 | No Overlap |
Other Conditions -3 | No Cutoff |
As shown in Table 3, the results showed that CESPI provided a 37.6% improvement in total editing time.
Item | CESPI | CES |
Speech Recognition Rate | 81.4% | 80.8% |
Average Content Time | 28min 24sec | 27min 58sec |
Number of Characters | 9240 | 9221 |
Total Average Editing Time | 93min 46sec | 127min 2sec |
Editing Time Average per Content Time | 3.30 | 4.54 |
Total Efficiency in Percentage | 37.6% | (N/A) |
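The ratios in Table 3 can be recomputed from the reported times. The sketch below also shows one reading of the 37.6% figure that is consistent with the table, namely the extra editing time per content time that CES needs relative to CESPI; the source does not state the formula, so this interpretation is an assumption.

```python
def minutes(m: int, s: int) -> float:
    """Convert a duration given as minutes and seconds to minutes."""
    return m + s / 60


# Editing time divided by content time, per system (values from Table 3).
cespi_ratio = minutes(93, 46) / minutes(28, 24)
ces_ratio = minutes(127, 2) / minutes(27, 58)

# Assumed reading of "efficiency": how much longer CES takes than CESPI
# per unit of content time.
improvement = ces_ratio / cespi_ratio - 1
```

The two ratios come out at about 3.30 and 4.54, matching the table, and `improvement` is about 0.376. Measured instead as a reduction relative to the CES time, the same data would give a smaller percentage, so the choice of baseline matters when quoting the result.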
Figure 10 shows how the saved time is divided among “Content Layout Definition”, “Editing Focus Linkage”, and “Speaker Notes Export”. Content Layout Definition accounted for approximately half of the saved time, followed by Editing Focus Linkage, while Speaker Notes Export made the smallest difference.
Content Layout Definition saved much time with CESPI because, with CES, content layout required a great deal of trial-and-error editing. With CESPI it required almost no time, since the layout can be done automatically.
5. Summary
The three major problems between CES and presentation software were identified as “Content Layout Definitions”, “Editing Focus Linkage”, and “Exporting to Speaker Notes”. This paper has shown how CESPI solves each of these problems. An experiment showed a 37.6% efficiency improvement compared with the previous method. Among the 3 items, “Content Layout Definition” accounted for the largest improvement in time, followed by “Editing Focus Linkage”, with “Speaker Notes Export” last.
Currently, CESPI supports only Microsoft PowerPoint as the presentation software. Future work will be to support other presentation software.
Acknowledgments
Many people participated in the preliminary survey, and we would like to thank all of them for sparing their time.
Notes
- See http://www.w3.org/TR/WCAG20/
- See http://www.section508.gov/
- See http://www.alphaworks.ibm.com/tech/ces/
- McCowan et al. (2004) discuss in detail how error rates should be handled for speech recognition results.
- See http://podcastle.jp/
- See http://www.w3.org/AudioVideo/