Sequence Modeling with CTC

A visual guide to Connectionist Temporal Classification, an algorithm used to train deep neural networks in speech recognition, handwriting recognition and other sequence problems.

Figure: How CTC collapsing works. For an input such as speech, the model predicts a sequence of tokens that may include a blank ($\epsilon$). Merging repeats and dropping $\epsilon$ gives the final output.

Author: Awni Hannun (Stanford University)

Published: Nov. 27, 2017

DOI: 10.23915/distill.00008

Introduction

Consider speech recognition. We have a dataset of audio clips and corresponding transcripts. Unfortunately, we don’t know how the characters in the transcript align to the audio. This makes training a speech recognizer harder than it might at first seem.

Without this alignment, the simple approaches aren’t available to us. We could devise a rule like “one character corresponds to ten inputs”. But people’s rates of speech vary, so this type of rule can always be broken. An alternative is to hand-align each character to its location in the audio. From a modeling standpoint this works well: we’d know the ground truth for each input time-step. However, for any reasonably sized dataset this is prohibitively time-consuming.

This problem doesn’t just turn up in speech recognition. We see it in many other places. Handwriting recognition from images or sequences of pen strokes is one example. Action labelling in videos is another.

Handwriting recognition: The input can be $(x, y)$ coordinates of a pen stroke or pixels in an image.

Speech recognition: The input can be a spectrogram or some other frequency-based feature.

Connectionist Temporal Classification (CTC) is a way to get around not knowing the alignment between the input and the output. As we’ll see, it’s especially well suited to applications like speech and handwriting recognition.

To be a bit more formal, let’s consider mapping input sequences $X = [x_1, x_2, \ldots, x_T]$, such as audio, to corresponding output sequences $Y = [y_1, y_2, \ldots, y_U]$, such as transcripts. We want to find an accurate mapping from $X$’s to $Y$’s.

Several challenges get in the way of using simpler supervised learning algorithms. In particular:

Both $X$ and $Y$ can vary in length.

The ratio of the lengths of $X$ and $Y$ can vary.

We don’t have an accurate alignment (correspondence of the elements) of $X$ and $Y$.

The CTC algorithm overcomes these challenges. For a given $X$ it gives us an output distribution over all possible $Y$’s. We can use this distribution either to infer a likely output or to assess the probability of a given output.

Not all ways of computing the loss function and performing inference are tractable. We’ll require that CTC do both of these efficiently.

Loss Function: For a given input, we’d like to train our model to maximize the probability it assigns to the right answer. To do this, we’ll need to efficiently compute the conditional probability $p(Y \mid X)$. The function $p(Y \mid X)$ should also be differentiable, so we can use gradient descent.
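Written out over a whole training set, this objective is commonly expressed as a negative log-likelihood to be minimized. The symbol $\mathcal{D}$ for the training set is notation assumed here; it is not defined in this section.

$$\sum_{(X, Y) \in \mathcal{D}} -\log p(Y \mid X)$$

Minimizing this sum is equivalent to maximizing the product of the probabilities $p(Y \mid X)$ over the training set, because the logarithm is monotonic.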

Inference: Naturally, after we’ve trained the model, we want to use it to infer a likely $Y$ given an $X$. This means solving

$$Y^* = \operatorname*{argmax}_{Y} \; p(Y \mid X).$$

Ideally $Y^*$ can be found efficiently. With CTC we’ll settle for an approximate solution that’s not too expensive to find.
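To see why an exact search is expensive, here is a minimal sketch of a brute-force argmax over every candidate output up to a maximum length. Everything in it is illustrative: `brute_force_argmax` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for $p(Y \mid X)$, not a trained model.

```python
from itertools import product


def brute_force_argmax(score, alphabet, max_len):
    """Score every sequence over `alphabet` up to `max_len`; return the best.

    Also returns how many candidates were scored, to show the search size.
    """
    best, best_score, num_scored = None, float("-inf"), 0
    for length in range(max_len + 1):
        for candidate in product(alphabet, repeat=length):
            num_scored += 1
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, num_scored


def toy_score(candidate, target=("c", "a", "t")):
    """Hypothetical stand-in for p(Y | X): rewards matches with `target`."""
    matches = sum(a == b for a, b in zip(candidate, target))
    return matches - abs(len(candidate) - len(target))


best, num_scored = brute_force_argmax(toy_score, ["a", "c", "t"], max_len=4)
print(best, num_scored)
```

Even with three symbols and outputs of at most four characters, the loop scores 1 + 3 + 9 + 27 + 81 = 121 candidates. The count grows exponentially with the maximum length, which is one way to see why an approximate search is attractive.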

The Algorithm

The CTC algorithm can assign a probability for any $Y$ given an $X$. The key to computing this probability is how CTC thinks about alignments between inputs and outputs. We’ll start by looking at these alignments and then show how to use them to compute the loss function and perform inference.

Alignment

The CTC algorithm is alignment-free: it doesn’t require an alignment between the input and the output. However, to get the probability of an output given an input, CTC sums the probabilities of all possible alignments between the two. We need to understand what these alignments are in order to understand how the loss function is ultimately calculated.
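In symbols, the sum over alignments can be sketched as follows. The notation is assumed here rather than taken from this section: $\mathcal{A}_{X,Y}$ stands for the set of valid alignments between $X$ and $Y$, and $A = [a_1, \ldots, a_T]$ is a single alignment.

$$p(Y \mid X) = \sum_{A \in \mathcal{A}_{X,Y}} p(A \mid X)$$

Under CTC's usual assumption that the per-step outputs are conditionally independent given $X$, each alignment's probability factors as $p(A \mid X) = \prod_{t=1}^{T} p_t(a_t \mid X)$.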

To motivate the specific form of the CTC alignments, first consider a naive approach. Let’s use an example. Assume the input has length six and $Y =$ [c, a, t]. One way to align $X$ and $Y$ is to assign an output character to each input step and collapse repeats.
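The naive scheme can be made concrete with a short sketch. The helper names `collapse_repeats` and `naive_alignments` are illustrative and not from the article. The code brute-forces every length-six assignment of the characters c, a, t and keeps those that collapse to the target.

```python
from itertools import groupby, product


def collapse_repeats(alignment):
    """Merge consecutive duplicate symbols, e.g. c c a a a t -> c a t."""
    return [symbol for symbol, _ in groupby(alignment)]


def naive_alignments(target, num_steps):
    """Brute-force all per-step assignments that collapse to `target`.

    Illustration only: this enumerates len(alphabet) ** num_steps candidates.
    """
    alphabet = sorted(set(target))
    return [
        list(candidate)
        for candidate in product(alphabet, repeat=num_steps)
        if collapse_repeats(candidate) == list(target)
    ]


alignments = naive_alignments(["c", "a", "t"], num_steps=6)
print(len(alignments))  # 10
for alignment in alignments:
    print(" ".join(alignment))
```

Ten of the 729 possible assignments collapse to [c, a, t]: each one is a run of c's, then a's, then t's, with every run at least one step long.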

This approach has two problems.

Often, it doesn’t make sense to force every input step to align to some output. In speech recognition, for example, the input can have stretches of silence with no corresponding output.

We have no way to produce outputs with the same character twice in a row. Consider the alignment [h, h, e, l, l, l, o]. Collapsing repeats will produce “helo” instead of “hello”.
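The second problem is easy to reproduce. This is a minimal sketch of the collapse-repeats rule only; the function name `collapse_repeats` is illustrative.

```python
from itertools import groupby


def collapse_repeats(alignment):
    """Merge consecutive duplicate symbols and join them into a string."""
    return "".join(symbol for symbol, _ in groupby(alignment))


alignment = ["h", "h", "e", "l", "l", "l", "o"]
print(collapse_repeats(alignment))  # helo
```

Because every run of identical symbols collapses to a single symbol, no alignment under this rule can produce a doubled letter such as the "ll" in "hello".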

To get around these problems, CTC introduces a new...
