
In: Human-Computer Interaction, edited by Inaki Maurtua, ISBN 978-953-307-022-3, published December 1, 2009 under the CC BY-NC-SA 3.0 license. © The Author(s).

Chapter 11

Integration of Speech Recognition-based Caption Editing System with Presentation Software

By Kohtaroh Miyamoto, Masakazu Takizawa and Takashi Saito
DOI: 10.5772/7736



1. Introduction

1.1. Background

Recently an increasing amount of e-Learning material including audio and presentation slides is being provided through the Internet or through private networks (intranets). Many hearing impaired people and senior citizens require captioning to understand such content. Captioning is a vital part of accessibility, and there are national standards such as WCAG 2.0 [1] and JIS X8341-3 5-4-d, as well as laws such as Section 508 of the Rehabilitation Act [2].

There are many ongoing efforts to improve automated speech recognition, but here we focus on post-editing to assure accurate captioning for digital archives. We introduce the "IBM Caption Editing System with Presentation Integration" (hereafter CESPI), which is an extension of the IBM Caption Editing System (hereafter CES) [3]. CESPI includes all of the functions of CES, and is further extended with presentation integration functions.

CES encapsulates a speech recognition engine for transcribing audio into text (CES Recorder) and also provides various editing features for error correction (CES Master and CES Client). As shown in Figure 1, CESPI integrates presentation software in various ways for both the CES Recorder and the CES Master System. Figure 2 shows a sample output of CESPI, which is composed of video, captions, and a presentation slide show.

Figure 1. CESPI receives audio input, and the CES Recorder transcribes the audio into text using an encapsulated speech recognition engine. The CES Master System and CES Client System allow collaborative editing. CESPI adds presentation integration features to both the CES Recorder and the CES Master System.

Figure 2. A sample output of CESPI. The presentation slide image is on the left, the video image is on the upper right, and the caption is on the lower right.

1.2. Previous Methods

Automatic Speech Recognition (ASR) engines have improved their accuracy over time [4], and many ASR-related programs have been introduced. The most noticeable application is ASR for university lecture transcription. Bain et al. (2005) introduce the ASR technology and how it is being adopted in universities. Wald & Bain (2008) introduce the Liberated Learning Project, Kheir & Way (2007) introduce the VUST project, and Hewitt et al. (2005) introduce the SpeakView project. Itoh et al. (2008) introduce the JOIN Project, which is based on CES.

There are also some limited efforts to use ASR for transcribing television broadcasts. Lambourne et al. (2004) and Imai et al. (2007) discuss the difficulties of adapting ASR to broadcasting, since broadcast captioning must be completely accurate, and making corrections to the ASR transcriptions in near real-time introduces many challenges.

While most of these programs focus on real-time transcription, there is also strong demand for transcribing digital archives of lectures, videos, etc. In order to create such content, it is necessary to correct the transcription errors. Correcting these errors is a very labor-intensive task (hereafter we call this task caption editing, and the system that supports it a caption editing system). Therefore our primary focus is to provide a caption editing system that is highly efficient for the user.

Goto et al. (2007) introduce PodCastle [5], a service available on the Internet that transcribes podcast content by ASR; users can correct erroneous words, basically by selecting from ASR candidates. Munteanu et al. (2008) introduce a wiki-like caption editing feature to enhance a Webcast system.

1.3. Our Prior Work

We have introduced our CES technology and its adoption in universities (Arakawa et al., 2006). We previously described (Miyamoto, 2005; Miyamoto & Takizawa, 2009) how the system helps editors in different roles collaborate. Specifically, the master editor, who is responsible for the final output, uses the CES Master Editing System, while the client, who may be any novice user, uses the CES Client Editing System, and both are connected over a network. We also showed how the caption editing steps can be improved using three major concepts: "complete audio synchronization", "completely automatic audio control", and "status marking". As a result, we showed a 30.7% improvement in caption editing cost.

In CES, the output phrases from the voice recognition engine (the candidate caption lines) are laid out vertically as individual lines along with their timestamps. "Complete audio synchronization" means that the keyboard focus always matches the audio replay position. If the audio is playing at one position (e.g., 5 seconds) while the keyboard input focus is at a different position (e.g., 10 seconds), it becomes extremely difficult to correct erroneous words while listening to the audio. CES therefore plays the audio in synchronization with the associated caption lines, so that the audio focus always matches the caption line focus.
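A minimal sketch of this synchronization idea is shown below; the data structure and the seek callback are illustrative assumptions, not the actual CES implementation.

```python
from dataclasses import dataclass

@dataclass
class CaptionLine:
    start_ms: int   # timestamp at which this line's audio begins
    end_ms: int     # timestamp at which this line's audio ends
    text: str       # transcribed (possibly erroneous) caption text

def sync_audio_to_focus(lines, focused_index, seek):
    """Keep the audio replay position locked to the focused caption line.

    Whenever the keyboard focus moves to a caption line, the audio player is
    sought to that line's start time, so what the editor hears always matches
    the line being edited.
    """
    line = lines[focused_index]
    seek(line.start_ms)                  # jump the audio player to the line's start
    return line.start_ms, line.end_ms    # the range to replay (and loop) for this line
```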

The second concept, "completely automatic audio control", means that the audio is fully controlled by the system. Users are not required to "replay" and "stop" the audio manually (usually a huge number of times). As editing begins, the focus is set on the initial series of words, and the audio associated with that portion is replayed automatically. By comparing the audio with the transcribed words, the user determines whether the words are correct. If they are, the user presses the Enter key to move the focus to the next series of words; if not, the user makes the correction. The audio is replayed repeatedly to urge the user to act. The replay stops automatically when the user types any key, since it is usually annoying to hear the audio while typing, and a long pause in typing automatically restarts the audio. As a result the user does not need to operate the audio replay at all and can concentrate solely on making corrections. Although this may seem obvious in writing, to our knowledge CES is currently the only system with this feature.
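The behaviour described above can be sketched as a small controller; the 2-second idle threshold and the player interface are assumed values for illustration, not parameters taken from CES.

```python
import time

PAUSE_BEFORE_RESUME_SEC = 2.0   # assumed idle time after which replay restarts

class AutoAudioController:
    """Stops replay while the editor is typing and resumes it after a pause."""

    def __init__(self, player):
        self.player = player        # any object with play() / stop() methods
        self.last_keypress = None

    def on_keypress(self):
        # Any typing silences the audio so it does not disturb the editor.
        self.last_keypress = time.monotonic()
        self.player.stop()

    def on_tick(self):
        # Called periodically (e.g. from a UI timer). If the editor has been idle
        # long enough, restart the replay of the focused line to urge action.
        if (self.last_keypress is None or
                time.monotonic() - self.last_keypress >= PAUSE_BEFORE_RESUME_SEC):
            self.player.play()
```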

The last concept is "status marking". Unverified lines are automatically distinguished from corrected lines as shown in Figure 3. In CES, each caption line includes a button that marks the status of that line, and the mark also corresponds to the color of the font. The marks have several useful meanings, but basically they make it easy to keep track of how far the caption editing has progressed. This is important because tracking progress is needed to estimate the projected finishing time and to take appropriate action when the target deadline might be missed.

Figure 3. A sample screen image of CES.

In this example, all of the caption lines are initially marked as "unverified" (a question mark "?"). As the corrections proceed, the flags are automatically converted to "determined" (circles). Here, caption lines 1 to 17 are correct, since they were either correctly transcribed by the voice recognition engine or corrected using the editing features of CES. Caption lines 18 and later are still unverified.
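The bookkeeping behind these marks can be illustrated with a short sketch; the two states and the naive projection formula are assumptions used only to show how progress and a projected finishing time could be tracked.

```python
UNVERIFIED = "?"     # not yet checked by an editor
DETERMINED = "o"     # correct as transcribed, or corrected by an editor

def editing_progress(states, elapsed_minutes):
    """Return the fraction of determined lines and a naive projected total time.

    Assumes the editing speed so far is representative of the remaining work.
    """
    done = sum(1 for s in states if s == DETERMINED)
    fraction = done / len(states) if states else 0.0
    projected_total = elapsed_minutes / fraction if fraction > 0 else float("inf")
    return fraction, projected_total

# Example: 17 of 30 lines determined after 40 minutes of editing.
states = [DETERMINED] * 17 + [UNVERIFIED] * 13
print(editing_progress(states, 40))   # ~0.567 done, ~70.6 minutes projected in total
```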

Figure 4 illustrates how the system works from the (caption) editor's perspective. The audio is played automatically, so the editor focuses on listening. As soon as the editor begins to type, the audio stops automatically; when the editor is unsure and pauses typing, the audio automatically starts to play again. The editor makes the necessary changes (and hits the Enter key), and the keyboard and audio focus automatically move to the next target line.

Figure 4. How the caption editing task proceeds with CES. All of the audio processing is automatic, and the user merely needs to focus on making the necessary corrections.

We also showed how external scripts can help reduce the caption editing work. For example, experimental results show that when the recognition rate is 60.9%, the total editing hours decreased by 35% (Miyamoto, 2006), and a method for matching a script against the erroneous transcription results was also introduced (Miyamoto & Shoji, 2008). Furthermore, we introduced a hybrid caption editing system (Miyamoto et al., 2007) that integrates a "line unit type" editor (efficient at relatively high speech recognition rates) with a "word processor type" editor (efficient at relatively low recognition rates); the input strings of the "word processor type" subsystem are matched to the "line unit type" subsystem. In experiments across various speech recognition rates, the hybrid system showed a significant advantage over previous caption editing methods in the number of interactions and the editing time. Another technique we introduced (Arakawa et al., 2008) makes real-time automated error corrections by using confidence scores from speech recognition and automated matching algorithms against sources such as text in presentation software or scripts.
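As a rough illustration of the script-matching idea, the sketch below aligns erroneous ASR output with a prepared script using Python's difflib; this is not the matching algorithm of Miyamoto & Shoji (2008) or Arakawa et al. (2008), only a hedged stand-in for the general technique.

```python
from difflib import SequenceMatcher

def suggest_from_script(asr_words, script_words, min_ratio=0.6):
    """Propose script spans as corrections for non-matching ASR spans.

    Aligns the ASR word sequence against the prepared script and returns, for
    every non-matching ASR span, the corresponding script span as a candidate
    correction. If the overall similarity is too low, no suggestions are made.
    """
    matcher = SequenceMatcher(a=asr_words, b=script_words)
    suggestions = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            suggestions.append((asr_words[a0:a1], script_words[b0:b1]))
    return suggestions if matcher.ratio() >= min_ratio else []

asr = "the hybrid capture editing system shows advantage".split()
script = "the hybrid caption editing system shows a significant advantage".split()
print(suggest_from_script(asr, script))
# [(['capture'], ['caption']), ([], ['a', 'significant'])]
```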

Our work goes beyond the efficiency of caption editing and also discloses a method (Miyamoto & Ikawa, 2008) to safely eliminate and dynamically distribute confidential information from multimedia content with audio, captions, video, and presentation materials.

In this chapter, we focus on the integration of presentation software with CES to increase the efficiency of caption editing. Presentation software provides many useful features for easily creating effective e-Learning content in the following two steps.

1) Prepare the presentation file by combining text, pictures, visual layout, and any other provided features.

2) Give the oral presentation using the slide show feature of the presentation software, while recording the video with a camera and/or recording the presentation audio.

2. Preliminary Survey and Investigation

We conducted a survey to see whether the combination of video with audio, captions, and presentation slides (hereafter a "multimedia composite") is helpful in understanding the content. We created 4 multimedia composites and allowed a total of 80 senior citizens and people with disabilities to freely view any content of interest. After viewing, we administered a survey asking whether the multimedia composite is useful. The results, shown in Table 1, indicate that 66.3% answered either "Strongly Agree" or "Agree", regardless of age group. We therefore concluded that a multimedia composite is very useful for better understanding in e-Learning.

Age Group     | Strongly Agree | Agree | Disagree | Strongly Disagree
20s           | 0              | 4     | 0        | 0
30s           | 0              | 1     | 1        | 0
40s           | 0              | 3     | 2        | 0
50s           | 0              | 6     | 6        | 0
60s           | 2              | 9     | 6        | 0
70s           | 3              | 21    | 10       | 0
80 and higher | 2              | 2     | 2        | 0
Total         | 7              | 46    | 27       | 0

Table 1.

Usefulness of Multimedia Composite.

Next, we conducted an investigation to see whether multimedia composites are captioned. We searched the Internet for multimedia composites and found that, out of 100 composites, only 21 were adequately captioned, and 1 merely provided transcript text. (The conditions were free-of-charge web sites, with a maximum of 5 composites per domain.) The main reason for this low rate of captioning appears to be the high labor cost. There are several approaches to captioning, but here we focus on using speech recognition technology. Unfortunately, speech recognition accuracy is still not 100%, so there is still a need for an effective caption editing system to correct the errors.

The conclusion of our preliminary survey and investigation is that, in order to reduce the cost of captioning content with audio and presentation slides, there is a strong need for an effective caption editing tool. Presentation slides are mostly created with commercial presentation software. In this chapter, we focus on a speech recognition error correction system that integrates a caption editing system with presentation software.

3. Problems and Apparatus

Based on the preliminary survey and investigation, we examined the available caption editing tools that generate captions from audio and identified three major problems between CES and presentation software: "Content Layout Definition", "Editing Focus Linkage", and "Speaker Notes Export". To address these problems, we extended our Caption Editing System (CES) to integrate it with Microsoft PowerPoint, creating our new Caption Editing System with Presentation Integration (CESPI). The architecture in terms of code interfaces is shown in Figure 5.

Figure 5. The base platform is Microsoft Windows 2000/XP. The user interface of CESPI is built with Visual Basic 6.0, and the IBM ViaVoice engine control is implemented in Microsoft Visual C++ 6.0. The interface between ViaVoice and CESPI is the Speech Manager API (SMAPI) V7.0, and the interface between CESPI and Microsoft PowerPoint is Visual Basic for Applications (VBA) V6.0.

Finally, we conducted a field test to see whether the real-time transcription accuracy of state-of-the-art speech recognition is satisfactory. We transcribed 11 university lectures in real time and obtained 81.8% accuracy. Unfortunately, we found that the speech recognition accuracy does not necessarily reach a satisfactory level: we received many comments requiring the speech recognition rate to be at least 85%, and preferably 90%, to be satisfactory. We therefore conclude that a human-computer transcription method is needed to raise the accuracy (without raising costs by relying on large amounts of human labor).

3.1. Content Layout Definition

A multimedia composite consists of several visual components such as video, presentation images, and captions. These components need to be laid out effectively in position and size according to parameters such as font face, font size, maximum number of characters per line, presentation image size, video image size, resolution, overall size, and overlapping options. (Figure 6 shows a bad example with excessive empty space, overlap, and cutoff.) CES (and CESPI) supports RealOne Player via the SMIL [6] format and Windows Media Player via the SAMI format.

Laying out these components effectively by hand can be quite time consuming. CESPI solves this problem by automatically laying out the components based on each parameter. As shown in Figure 7, CESPI also provides a layout customization feature that allows the user to easily change the details of the layout.
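A simplified sketch of how non-overlapping regions might be derived from the layout parameters is shown below; the column/strip placement and the example sizes are assumptions for illustration, not the exact CESPI layout algorithm.

```python
def compute_layout(window_w, window_h, video_w, video_h,
                   caption_lines, caption_line_height):
    """Split the window into presentation, video, and caption regions.

    The presentation image takes the left column, the video the upper right,
    and the caption strip the bottom, so the regions do not overlap and no
    component is cut off by the window boundary.
    """
    caption_h = caption_lines * caption_line_height
    content_h = window_h - caption_h
    ppt_w = window_w - video_w
    return {
        "ppt":     {"left": 0,     "top": 0,         "width": ppt_w,    "height": content_h},
        "video":   {"left": ppt_w, "top": 0,         "width": video_w,  "height": video_h},
        "caption": {"left": 0,     "top": content_h, "width": window_w, "height": caption_h},
    }

# e.g. the 800x460 window used in the experiment, with an assumed 240x180 video
# and two caption lines of 30 px each.
print(compute_layout(800, 460, 240, 180, caption_lines=2, caption_line_height=30))
```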

Figure 6. A sample of an ill-defined layout: the presentation image is surrounded by excessive empty space, the video and caption overlap the presentation image, and the caption is cut off by the window boundary.

Figure 7. The Change Content Layout dialog is shown on the left, and the Select Layout (Video + PPT + Caption) dialog with the focus is shown on the right.

3.2. Editing Focus Linkage

While editing the captions of certain multimedia composites, it is useful to reference the special terminology used in the presentation slides. Because the caption editing tool and the presentation software were separate applications, and the operating system allows only one application to have the focus at a time, it was necessary to frequently switch the focus between the two applications, and the user also had to change to the corresponding slide page manually. CESPI solves this problem by laying out the captions, page images, and page text in a single application window, which makes it easier to view and edit the captions. CESPI also automatically links the caption timestamps to the presentation pages; in other words, the displayed presentation page always corresponds to the focused caption. (Figure 8 shows the actual user interface of the CESPI Master Editing Subsystem.)
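The linkage itself reduces to a lookup from a caption timestamp to the slide whose start time covers it; a minimal sketch (with assumed slide-change times) follows.

```python
from bisect import bisect_right

def page_for_timestamp(page_start_ms, caption_start_ms):
    """Return the 0-based slide index whose start time covers the caption time."""
    # page_start_ms must be sorted ascending; the first page starts at 0 ms.
    return bisect_right(page_start_ms, caption_start_ms) - 1

page_starts = [0, 65_000, 142_000, 300_000]      # assumed slide-change times
print(page_for_timestamp(page_starts, 150_000))  # -> 2 (the third slide)
```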

Figure 8. The left-hand side shows the "Caption Line Text" area (start time, state, choice, and caption): "Start Time" (1) is the zero-based timestamp at which the caption is displayed, "State" is (2), "Choice" is (3), and "Caption" is (4). The right-hand side shows the "Presentation Page": the presentation image is (5) and the retrieved text is (6).

3.3. Speaker Notes Export

Using presentation software, a speaker may define narrative notes for each presentation page (the "speaker notes"). In many cases, a presentation package used by one presenter will later be reused by another presenter. In such cases, since the captions and speaker notes are similar, it is efficient to reuse the initial captions. Previously, exporting captions to speaker notes required manual operations such as moving to the proper page and then copying and pasting. Therefore, as illustrated in Figure 9, CESPI can automatically export the captions of the corresponding page into the speaker notes of the presentation package.
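A sketch of the export step is shown below: captions are grouped by the page they fall into, and the concatenated text becomes that page's speaker notes. The grouping is plain Python; the actual write into PowerPoint (each slide's notes placeholder, e.g. via VBA or COM) is omitted, and the sample data is assumed.

```python
from bisect import bisect_right
from collections import defaultdict

def captions_to_speaker_notes(captions, page_start_ms):
    """Group caption lines by slide page and join them into per-slide notes text.

    `captions` is a list of (start_ms, text) tuples; `page_start_ms` is the
    sorted list of slide start times (the same list used for focus linkage).
    """
    notes = defaultdict(list)
    for start_ms, text in captions:
        slide_index = bisect_right(page_start_ms, start_ms) - 1
        notes[slide_index].append(text)
    return {slide: "\n".join(lines) for slide, lines in notes.items()}

captions = [(1_000, "Welcome to the lecture."), (70_000, "This slide shows the results.")]
print(captions_to_speaker_notes(captions, [0, 65_000]))
# {0: 'Welcome to the lecture.', 1: 'This slide shows the results.'}
```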

Figure 9. The master caption is exported into the speaker notes portion of the presentation. The speaker notes can then be referenced for the client caption.

3.4. Real-time Checker and Matching

Our technique for making automated error corrections (Arakawa et al., 2008) uses confidence scores from speech recognition together with matching algorithms (to match the transcription results with text retrieved from the presentation software). In theory, if the confidence score is high, the system can show the speech recognition result directly; if the confidence score is low, the system can either let the user make corrections or try to match the result against the text retrieved from the presentation software. Unfortunately, we found that the confidence score is not reliable enough to detect each word's correctness with high accuracy. We therefore propose making use of a human checker, who merely flags each transcribed result as correct or incorrect. It should be noted that a human (unlike a computer system) can sometimes be slow at processing the assigned job, which can result in a long delay. Therefore, the system preserves real-time captioning by routing a higher proportion of words to the matching subsystem as the caption presentation starts to lag. Finally, when the captions are presented, the words that were automatically corrected by matching are shown in blue italics.
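The routing idea can be sketched as follows; the confidence threshold and the way the delay raises the diversion ratio are assumptions made for illustration rather than the parameters of Arakawa et al. (2008).

```python
def route_word(confidence, delay_sec, high_conf=0.85, max_delay_sec=5.0):
    """Decide where a transcribed word goes in the real-time pipeline.

    High-confidence words are shown directly. Low-confidence words normally go
    to the human checker, but as the caption presentation falls behind, an
    increasing share is diverted to the automatic matching subsystem so that
    real-time presence is preserved.
    """
    if confidence >= high_conf:
        return "show_directly"
    # The share of low-confidence words diverted to matching grows with the delay.
    divert_ratio = min(delay_sec / max_delay_sec, 1.0)
    if confidence < high_conf * divert_ratio:
        return "auto_match"        # matched corrections are later shown in blue italics
    return "human_checker"

print(route_word(0.9, 0.0))   # show_directly
print(route_word(0.4, 0.5))   # human_checker (little delay)
print(route_word(0.4, 4.0))   # auto_match (presentation is lagging)
```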

4. Results

An experiment was performed to measure the editing time under the following conditions.

1. Each editor used CES and CESPI on approximately 30 minutes of content each.

2. It is known that editors become faster as they get used to a system (the learning curve effect; Baloff, 1971), so 5 editors who already had sufficient experience with both CES and CESPI were chosen, to eliminate any inconsistencies due to this effect.

3. Each editor was assigned different portions of the content for CES and for CESPI, so that memory of previously edited content would not affect the results.

4. The task consisted of correcting all of the speech recognition errors, laying out the multimedia composite without overlaps or excessive blank space, and exporting the speaker notes to the appropriate pages. (The conditions are shown in Table 2.)

Category                                       | Conditions
Window Size                                    | 800x460
Component Layout Position (Video)              | Right Upper Position
Component Layout Position (Presentation Image) | Left Upper Position
Component Layout Position (Caption)            | Bottom
Caption Font Charset                           | x-sjis
Caption Font Face                              |
Caption Font Color                             | black
Caption Font Size                              | +3
Other Conditions 1                             | No Excessive Empty Space
Other Conditions 2                             | No Overlap
Other Conditions 3                             | No Cutoff

Table 2.

Various Conditions for the Experiment.

As shown in Table 3, the results showed that CESPI provided a 37.6% improvement in total editing time. This figure is based on editing time per unit of content time: (4.54 - 3.30) / 3.30 ≈ 37.6%, i.e. CES required 37.6% more editing time per content minute than CESPI.

                                      | CESPI         | CES
Speech Recognition Rate               | 81.4%         | 80.8%
Average Content Time                  | 28 min 24 sec | 27 min 58 sec
Number of Characters                  | 9240          | 9221
Total Average Editing Time            | 93 min 46 sec | 127 min 2 sec
Editing Time Average per Content Time | 3.30          | 4.54
Total Efficiency in Percentage        | 37.6%         | (N/A)

Table 3.

Result of the Experiment.

Figure 10 shows how the saved editing time breaks down among "Content Layout Definition", "Editing Focus Linkage", and "Speaker Notes Export". Content Layout Definition accounted for approximately half of the saved time, followed by Editing Focus Linkage, with Speaker Notes Export contributing the least.

Figure 10. Of the improvement in editing time shown in Table 3, Content Layout Definition accounted for 50.3%, Editing Focus Linkage for 31.1%, and Speaker Notes Export for 18.6%.

Content Layout Definition saved the most time because the content layout required much trial-and-error editing in CES, whereas CESPI required practically no time since the layout is generated automatically.

5. Summary

The three major problems between CES and presentation software were identified as "Content Layout Definition", "Editing Focus Linkage", and "Speaker Notes Export". This chapter has shown how CESPI solves each of these problems. An experiment showed a 37.6% efficiency improvement compared with the previous method. Among the three items, "Content Layout Definition" accounted for the largest improvement, followed by "Editing Focus Linkage", with "Speaker Notes Export" last.

Currently CESPI supports only Microsoft PowerPoint as the presentation software. A future work item is to support other presentation software.

6. Acknowledgements

Many people participated in the preliminary survey, and we would like to thank them for sparing their time.

References

1 - Arakawa, K.; Miyamoto, K. & Negishi, K. (2006). Caption Editing System for University-Level Students with Hearing Disabilities, Closing the Gap 2006 Conference
2 - Arakawa, K.; Miyamoto, K. & Ohgane, T. (2008). Caption Correction Device, United States Patent Application 20080040111, International Business Machines Corporation
3 - Bain, K.; Basson, S.; Faisman, A. & Kanevsky, D. (2005). Accessibility, Transcription, and Access Everywhere, IBM Systems Journal: Accessibility, Vol. 44, No. 3
4 - Baloff, N. (1971). Extensions of the Learning Curve: Some Empirical Results, Operations Research Quarterly, Vol. 22, No. 4
5 - Goto, M.; Ogata, J. & Eto, K. (2007). PodCastle: A Web 2.0 Approach to Speech Recognition Research, in: Proceedings of the 8th Annual Conference of the International Speech Communication Association (Interspeech 2007), pp. 2397-2400
6 - Harvey, A. (2004). Nevada Legislature Develops Web Captioning System for the Hearing Impaired: While technology can provide some elegant solutions, cost is still a major factor, National Association of Legislative Information Technology Newsletter
7 - Hewitt, J.; Lyon, J.; Britton, C. & Mellor, B. (2005). SpeakView: Live Captioning of Lectures, in: Proceedings of Universal Access in Human Computer Interaction, Vol. 8, HCI International 2005
8 - Imai, T.; Sato, S.; Homma, S.; Onoe, K. & Kobayashi, A. (2007). Online Speech Detection and Dual-Gender Speech Recognition for Captioning Broadcast News, IEICE Transactions on Information and Systems, Vol. E90-D, No. 8, pp. 1286-1291
9 - Itoh, E.; Asahi, Y. & Aoki, T. (2008). Lecture Assisting Used by Speech Recognition for the University Students with Disabilities: JOIN Project, Trace of 3 Years, in: Proceedings of the Rehabilitation Engineering Society of Japan (RESJA 2008), pp. 241-242
10 - Kheir, R. & Way, T. (2007). Inclusion of Deaf Students in Computer Science Classes Using Real-Time Speech Transcription, in: Proceedings of the 12th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education (ITiCSE'07), Vol. 39, No. 3, pp. 261-265
11 - Lambourne, A.; Hewitt, J.; Lyon, C. & Warren, S. (2004). Speech-Based Real-Time Subtitling Services, International Journal of Speech Technology, Vol. 7, No. 4, pp. 269-279
12 - McCowan, I.; Moore, D.; Dines, J.; Gatica-Perez, D.; Flynn, M.; Wellner, M. & Boulard, H. (2004). On the Use of Information Retrieval Measures for Speech Recognition Evaluation, Institut Dalle Molle d'Intelligence Artificielle Perceptive (IDIAP) Research Report 04-73, pp. 1-13
13 - Miyamoto, K. (2005). Effective Master Client Closed Caption Editing System for Wide Range Workforce, in: Proceedings of Universal Access in Human Computer Interaction, Vol. 7, HCI International 2005
14 - Miyamoto, K. (2006). Reduction of Caption Editing Using Supplemental Text Information, Technology & Persons with Disabilities Conference (CSUN 2006)
15 - Miyamoto, K.; Arakawa, K. & Saito, T. (2007). Hybrid Caption Editing System by Integrating Line Unit Type and Word Processor Type, IEICE Transactions on Information and Systems, Vol. J90-D, No. 3, pp. 673-682
16 - Miyamoto, K. & Ikawa, Y. (2008). Apparatus and Method for Rendering Contents, Containing Sound Data, Moving Image Data and Static Image Data, Harmless, USPTO Application 20080262841, International Business Machines Corporation
17 - Miyamoto, K. & Shoji, M. (2008). Displaying Text of Speech in Synchronization with the Speech, United States Patent 20050203750, International Business Machines Corporation
18 - Miyamoto, K. & Takizawa, M. (2009). User Efficient Speech Recognition-based Caption Editing System, Technology & Persons with Disabilities Conference (CSUN 2009)
19 - Munteanu, C.; Baecker, R. & Penn, G. (2008). Collaborative Editing for Improved Usefulness and Usability of Transcript-Enhanced Webcasts, in: Proceedings of CHI 2008, ACM Press, pp. 373-382
20 - Wald, M. & Bain, K. (2008). Universal Access to Communication and Learning: The Role of Automatic Speech Recognition, Universal Access in the Information Society, Vol. 6, No. 4, pp. 435-447

Notes

[1] - See http://www.w3.org/TR/WCAG20/

[2] - See http://www.section508.gov/, to assure accessibility to publicly available content

[3] - See http://www.alphaworks.ibm.com/tech/ces/

[4] - McCowan et al. (2004) discuss in detail how error rates should be handled for speech recognition results.

[5] - See http://podcastle.jp/

[6] - See http://www.w3.org/AudioVideo/.
