Sequence Modeling with CTC

diginova1 pts0 comments

Sequence Modeling with CTC

Sequence Modeling With CTC

A visual guide to Connectionist Temporal Classification, an algorithm used<br>to train deep neural networks in speech recognition, handwriting recognition<br>and other sequence problems.

How CTC collapsing works

For an input,<br>like speech

Predict a<br>sequence of<br>tokens

Use return to<br>input a blank (ϵ)(\epsilon)(ϵ)

Merge repeats,<br>drop ϵ\epsilonϵ<br>Final output

Authors

Affiliations

Awni Hannun

Stanford University

Published

Nov. 27, 2017

DOI

10.23915/distill.00008

Introduction

Consider speech recognition. We have a dataset of audio clips and<br>corresponding transcripts. Unfortunately, we don’t know how the characters in<br>the transcript align to the audio. This makes training a speech recognizer<br>harder than it might at first seem.

Without this alignment, the simple approaches aren’t available to us. We<br>could devise a rule like “one character corresponds to ten inputs”. But<br>people’s rates of speech vary, so this type of rule can always be broken.<br>Another alternative is to hand-align each character to its location in the<br>audio. From a modeling standpoint this works well — we’d know the ground truth<br>for each input time-step. However, for any reasonably sized dataset this is<br>prohibitively time consuming.

This problem doesn’t just turn up in speech recognition. We see it in many<br>other places. Handwriting recognition from images or sequences of pen strokes<br>is one example. Action labelling in videos is another.

Handwriting recognition: The input can be<br>(x,y)(x,y)(x,y) coordinates of a pen stroke or<br>pixels in an image.

Speech recognition: The input can be a spectrogram or some<br>other frequency based feature extractor.

Connectionist Temporal Classification (CTC) is a way to get around not<br>knowing the alignment between the input and the output. As we’ll see, it’s<br>especially well suited to applications like speech and handwriting<br>recognition.

To be a bit more formal, let’s consider mapping input sequences<br>X=[x1,x2,…,xT]X = [x_1, x_2, \ldots, x_T]X=[x1​,x2​,…,xT​], such as audio, to corresponding output<br>sequences Y=[y1,y2,…,yU]Y = [y_1, y_2, \ldots, y_U]Y=[y1​,y2​,…,yU​], such as transcripts.<br>We want to find an accurate mapping from XXX’s to YYY’s.

There are challenges which get in the way of us<br>using simpler supervised learning algorithms. In particular:

Both XXX and YYY<br>can vary in length.

The ratio of the lengths of XXX and YYY<br>can vary.

We don’t have an accurate alignment (correspondence of the elements) of<br>XXX and Y.Y.Y.

The CTC algorithm overcomes these challenges. For a given XXX<br>it gives us an output distribution over all possible YYY’s. We<br>can use this distribution either to infer a likely output or to assess<br>the probability of a given output.

Not all ways of computing the loss function and performing inference are<br>tractable. We’ll require that CTC do both of these efficiently.

Loss Function: For a given input, we’d like to train our<br>model to maximize the probability it assigns to the right answer. To do this,<br>we’ll need to efficiently compute the conditional probability<br>p(Y∣X).p(Y \mid X).p(Y∣X). The function p(Y∣X)p(Y \mid X)p(Y∣X) should<br>also be differentiable, so we can use gradient descent.

Inference: Naturally, after we’ve trained the model, we<br>want to use it to infer a likely YYY given an X.X.X.<br>This means solving

Y∗=argmaxYp(Y∣X).<br>Y^* \enspace =\enspace {\mathop{\text{argmax}}\limits_{Y}} \enspace p(Y \mid X).<br>Y∗=Yargmax​p(Y∣X).

Ideally Y∗Y^*Y∗ can be found efficiently. With CTC we’ll settle<br>for an approximate solution that’s not too expensive to find.

The Algorithm

The CTC algorithm can assign a probability for any YYY<br>given an X.X.X. The key to computing this probability is how CTC<br>thinks about alignments between inputs and outputs. We’ll start by looking at<br>these alignments and then show how to use them to compute the loss function and<br>perform inference.

Alignment

The CTC algorithm is alignment-free — it doesn’t require an<br>alignment between the input and the output. However, to get the probability of<br>an output given an input, CTC works by summing over the probability of all<br>possible alignments between the two. We need to understand what these<br>alignments are in order to understand how the loss function is ultimately<br>calculated.

To motivate the specific form of the CTC alignments, first consider a naive<br>approach. Let’s use an example. Assume the input has length six and Y=Y<br>=Y= [c, a, t]. One way to align XXX and YYY<br>is to assign an output character to each input step and collapse repeats.

This approach has two problems.

Often, it doesn’t make sense to force every input step to align to<br>some output. In speech recognition, for example, the input can have stretches<br>of silence with no corresponding output.

We have no way to produce outputs with multiple characters in a row.<br>Consider the alignment [h, h, e, l, l, l, o]. Collapsing repeats will<br>produce “helo” instead of “hello”.

To get around these problems, CTC introduces a new...

input output speech recognition alignment probability

Related Articles