Sequence Modeling with CTC
Sequence Modeling With CTC
A visual guide to Connectionist Temporal Classification, an algorithm used<br>to train deep neural networks in speech recognition, handwriting recognition<br>and other sequence problems.
How CTC collapsing works
For an input,<br>like speech
Predict a<br>sequence of<br>tokens
Use return to<br>input a blank (ϵ)(\epsilon)(ϵ)
Merge repeats,<br>drop ϵ\epsilonϵ<br>Final output
Authors
Affiliations
Awni Hannun
Stanford University
Published
Nov. 27, 2017
DOI
10.23915/distill.00008
Introduction
Consider speech recognition. We have a dataset of audio clips and<br>corresponding transcripts. Unfortunately, we don’t know how the characters in<br>the transcript align to the audio. This makes training a speech recognizer<br>harder than it might at first seem.
Without this alignment, the simple approaches aren’t available to us. We<br>could devise a rule like “one character corresponds to ten inputs”. But<br>people’s rates of speech vary, so this type of rule can always be broken.<br>Another alternative is to hand-align each character to its location in the<br>audio. From a modeling standpoint this works well — we’d know the ground truth<br>for each input time-step. However, for any reasonably sized dataset this is<br>prohibitively time consuming.
This problem doesn’t just turn up in speech recognition. We see it in many<br>other places. Handwriting recognition from images or sequences of pen strokes<br>is one example. Action labelling in videos is another.
Handwriting recognition: The input can be<br>(x,y)(x,y)(x,y) coordinates of a pen stroke or<br>pixels in an image.
Speech recognition: The input can be a spectrogram or some<br>other frequency based feature extractor.
Connectionist Temporal Classification (CTC) is a way to get around not<br>knowing the alignment between the input and the output. As we’ll see, it’s<br>especially well suited to applications like speech and handwriting<br>recognition.
To be a bit more formal, let’s consider mapping input sequences<br>X=[x1,x2,…,xT]X = [x_1, x_2, \ldots, x_T]X=[x1,x2,…,xT], such as audio, to corresponding output<br>sequences Y=[y1,y2,…,yU]Y = [y_1, y_2, \ldots, y_U]Y=[y1,y2,…,yU], such as transcripts.<br>We want to find an accurate mapping from XXX’s to YYY’s.
There are challenges which get in the way of us<br>using simpler supervised learning algorithms. In particular:
Both XXX and YYY<br>can vary in length.
The ratio of the lengths of XXX and YYY<br>can vary.
We don’t have an accurate alignment (correspondence of the elements) of<br>XXX and Y.Y.Y.
The CTC algorithm overcomes these challenges. For a given XXX<br>it gives us an output distribution over all possible YYY’s. We<br>can use this distribution either to infer a likely output or to assess<br>the probability of a given output.
Not all ways of computing the loss function and performing inference are<br>tractable. We’ll require that CTC do both of these efficiently.
Loss Function: For a given input, we’d like to train our<br>model to maximize the probability it assigns to the right answer. To do this,<br>we’ll need to efficiently compute the conditional probability<br>p(Y∣X).p(Y \mid X).p(Y∣X). The function p(Y∣X)p(Y \mid X)p(Y∣X) should<br>also be differentiable, so we can use gradient descent.
Inference: Naturally, after we’ve trained the model, we<br>want to use it to infer a likely YYY given an X.X.X.<br>This means solving
Y∗=argmaxYp(Y∣X).<br>Y^* \enspace =\enspace {\mathop{\text{argmax}}\limits_{Y}} \enspace p(Y \mid X).<br>Y∗=Yargmaxp(Y∣X).
Ideally Y∗Y^*Y∗ can be found efficiently. With CTC we’ll settle<br>for an approximate solution that’s not too expensive to find.
The Algorithm
The CTC algorithm can assign a probability for any YYY<br>given an X.X.X. The key to computing this probability is how CTC<br>thinks about alignments between inputs and outputs. We’ll start by looking at<br>these alignments and then show how to use them to compute the loss function and<br>perform inference.
Alignment
The CTC algorithm is alignment-free — it doesn’t require an<br>alignment between the input and the output. However, to get the probability of<br>an output given an input, CTC works by summing over the probability of all<br>possible alignments between the two. We need to understand what these<br>alignments are in order to understand how the loss function is ultimately<br>calculated.
To motivate the specific form of the CTC alignments, first consider a naive<br>approach. Let’s use an example. Assume the input has length six and Y=Y<br>=Y= [c, a, t]. One way to align XXX and YYY<br>is to assign an output character to each input step and collapse repeats.
This approach has two problems.
Often, it doesn’t make sense to force every input step to align to<br>some output. In speech recognition, for example, the input can have stretches<br>of silence with no corresponding output.
We have no way to produce outputs with multiple characters in a row.<br>Consider the alignment [h, h, e, l, l, l, o]. Collapsing repeats will<br>produce “helo” instead of “hello”.
To get around these problems, CTC introduces a new...