OCR to enrich ASR
Automatic Speech Recognition (ASR) systems, especially those built on probabilistic models such as Hidden Markov Models (HMMs), rely heavily on their associated data and lexicon for good performance. In this project, done as part of my undergraduate summer research internship at the Institut de Recherche en Informatique de Toulouse (IRIT), Université Paul Sabatier, we set out to analyse how much ASR performance can be improved by incorporating the output of Optical Character Recognition (OCR) applied to the visual content that accompanies the speech.
We set out to study the impact of populating the lexicon of a speech processing system with OCR output obtained from the corresponding videos. To this end, we used openly available MOOC data for our experiments. Running ASR on these lectures for transcription and indexing is difficult because each video uses domain-specific vocabulary (jargon) that is absent from the general-purpose lexicons used to train speech recognition models. However, most of these videos also show text, either on slides or as handwritten scribbles on screen, which could benefit the speech recognition system if used to populate the lexicon in real time.
We began by creating a corpus of such videos along with their timestamped transcripts and the slides used, in PDF or other file formats. We used Apache Tika to extract text from these slides as part of the ground truth. We also implemented a semi-automatic GUI to annotate slide transitions with their timestamps in the video, giving an accurate temporal alignment with the ground truth for benchmarking OCR performance.
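Once slide transitions are annotated with timestamps, aligning any timed item (an OCR detection, a transcript segment) with the slide on screen reduces to a sorted lookup. The following is a minimal sketch of that idea, not the code of our annotation tool; the function names, the data layout and the example values are all hypothetical.

```python
from bisect import bisect_right


def slide_at(transition_times, t):
    """Return the index of the slide visible at time t (in seconds).

    transition_times[i] is the annotated time at which slide i appears;
    the list must be sorted in ascending order. Returns None if t falls
    before the first annotated transition.
    """
    i = bisect_right(transition_times, t) - 1
    return i if i >= 0 else None


def group_by_slide(transition_times, timed_items):
    """Group (timestamp, item) pairs by the slide visible at that time."""
    groups = {}
    for t, item in timed_items:
        idx = slide_at(transition_times, t)
        if idx is not None:
            groups.setdefault(idx, []).append(item)
    return groups


# Hypothetical annotations: three slides and four timed OCR detections.
transitions = [0.0, 42.5, 97.0]
ocr_hits = [(3.2, "Markov"), (50.0, "Viterbi"), (97.0, "lexicon"), (120.4, "Kaldi")]
by_slide = group_by_slide(transitions, ocr_hits)
```

Grouping detections per slide this way is what makes a per-slide comparison against the text extracted from the slide files possible.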
For video OCR we used the LOOV tool (Poignant et al.), which relies on classical image processing techniques for text detection: Sobel filtering and the Sauvola algorithm, followed by text tracking over consecutive frames to ensure that the text persists. The Tesseract OCR engine is then used for text recognition. The detections are averaged over shifted regions, and the Viterbi algorithm is applied, using the SRILM library, to select the most likely OCR output. We reimplemented parts of LOOV in Python, starting from a developer version of PyLOOV that had functional issues, and optimised it for our own use case.
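To give an idea of the binarisation step, here is a small pure-Python sketch of Sauvola's local threshold, T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the pixel values in a window around each pixel. This is an illustration of the standard formula only, not the LOOV or PyLOOV code; the window radius, k = 0.5 and R = 128 are assumed values for 8-bit images, and a real implementation would use an optimised image library instead of nested loops.

```python
from statistics import mean, pstdev


def sauvola_threshold(window, k=0.5, R=128.0):
    """Sauvola threshold for one window of grey levels (0-255)."""
    m = mean(window)
    s = pstdev(window)
    return m * (1.0 + k * (s / R - 1.0))


def binarize(image, radius=1, k=0.5, R=128.0):
    """Mark a pixel as text (1) when it is at or below its local threshold.

    image is a list of rows of grey levels; windows are clipped at the
    image borders. Dark text on a light background is assumed.
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            threshold = sauvola_threshold(window, k, R)
            row.append(1 if image[y][x] <= threshold else 0)
        out.append(row)
    return out


# Toy 3x3 frame: one dark pixel on a bright background.
frame = [[200, 200, 200], [200, 20, 200], [200, 200, 200]]
mask = binarize(frame)
```

Because the threshold adapts to local contrast, only the dark centre pixel of the toy frame is marked as text, while the bright background pixels next to it are not.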
We benchmarked our video OCR against the ground-truth annotations obtained from the slides, using recall and precision as metrics. We then identified domain-specific words that were present in the OCR output but not in the transcript, to get a rough estimate of the possible improvement. On a per-slide basis, between 2% and 20% (about 10% on average) of the words in the OCR output were absent from the transcripts. The HMM-based speech recognition model was trained with both the old and the updated lexicons using the Kaldi toolkit and, as expected, we observed an average improvement of 5% in ASR performance on our dataset, which consisted of heavily domain-oriented course lectures from online course websites such as Coursera and edX.
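The two measurements above can be sketched in a few lines. This is a simplified illustration rather than our evaluation script: it assumes words have already been normalised (for example lowercased) and compares sets of distinct words, and the function names and example words are made up.

```python
def precision_recall(ocr_words, truth_words):
    """Word-level precision and recall of OCR output against ground truth."""
    ocr, truth = set(ocr_words), set(truth_words)
    hits = len(ocr & truth)
    precision = hits / len(ocr) if ocr else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


def new_word_ratio(ocr_words, transcript_words):
    """Fraction of distinct OCR words that never occur in the transcript."""
    ocr = set(ocr_words)
    if not ocr:
        return 0.0
    return len(ocr - set(transcript_words)) / len(ocr)


# Made-up example for one slide ("viterbl" stands for an OCR error).
ocr = ["hidden", "markov", "model", "viterbl"]
truth = ["hidden", "markov", "model", "viterbi", "decoding"]
transcript = ["the", "hidden", "model", "is", "trained"]

p, r = precision_recall(ocr, truth)
ratio = new_word_ratio(ocr, transcript)
```

Here three of the four OCR words are correct and three of the five ground-truth words are found, so precision is 0.75 and recall is 0.6; two of the four OCR words are missing from the transcript, giving a ratio of 0.5. Note that an OCR error such as "viterbl" also counts as a new word, which is one reason OCR precision matters before the lexicon is updated.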
Such a tool, when integrated into an ASR system to update the lexicon in real time, could substantially improve the ASR output.