SaySynth: A Brief History of Speaking Machines


Brian Abelson

These are expanded notes from a talk I gave at composition.codes on December 21, 2025. Slides here. Video here.

SaySynth is a synthesizer I built on top of macOS’s text-to-speech framework, better known through its command-line front end, the say command. But to explain why I built it and why I think it matters, I want to take a detour through the history of speaking machines more broadly.
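For readers who have never used it, here is a minimal sketch of driving the say command from Python. This is not SaySynth's code, just an illustration of the tool it builds on. The helper names `build_say_command` and `speak` are hypothetical; the `-v` (voice), `-r` (rate in words per minute), and `-o` (output file) flags are the ones say documents in its man page.

```python
import shutil
import subprocess


def build_say_command(text, voice=None, rate_wpm=None, out_path=None):
    """Build an argv list for macOS's `say` command-line tool."""
    cmd = ["say"]
    if voice is not None:
        cmd += ["-v", voice]          # -v selects a voice by name
    if rate_wpm is not None:
        cmd += ["-r", str(rate_wpm)]  # -r sets the speaking rate in words per minute
    if out_path is not None:
        cmd += ["-o", out_path]       # -o writes audio to a file instead of playing it
    cmd.append(text)
    return cmd


def speak(text, **options):
    """Run `say` if it is installed (macOS only); return False otherwise."""
    if shutil.which("say") is None:
        return False
    subprocess.run(build_say_command(text, **options), check=True)
    return True
```

On a Mac, something like `speak("Daisy, Daisy", voice="Fred", rate_wpm=120)` should speak aloud, assuming a voice named Fred is installed; `say -v '?'` lists the voices actually available.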

A Typology of Speaking Machines

There are roughly four kinds of speaking machines that have existed over time:

Mechanical — Literally physical: bellows forcing air through a reed, with different knobs, valves, and whistles shaping different formants and phonemes. The human operator is part of the instrument.

Formant/Rule-Based — More like a synthesizer: an oscillator and a comb filter simulating the resonant shape of the vocal tract. The system models the acoustics of speech without recording any actual speech.

Sample-Based (Concatenative) — From something as crude as a toy with a phonograph inside, all the way to sophisticated “diphone” synthesizers that splice together recordings of every possible phoneme transition. GPS voices and automated customer service phone lines of the ’90s and 2000s were built this way.

Generative (Neural/AI) — What most people think of today. These are basically sample-based systems taken to an extreme: instead of recordings of phoneme pairs, you’re dealing with individual digital samples predicted by a neural network, sample by sample.

A Brief History

Von Kempelen’s Speaking Machine (1773)

The first speaking machine most people point to. An operator pushes air through a reed and moves their hand around a piece of leather to simulate the shape of the vocal tract, while separate whistles handle noisier consonants like S and T. Crude, but the basic architecture — oscillator source, shaped by something simulating a vocal tract — is essentially what we still see in formant synthesizers today.
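The source-plus-filter architecture is easy to sketch in code. The example below is a toy under stated assumptions, not a model of von Kempelen's machine or of any particular synthesizer: a sawtooth wave stands in for the reed, a cascade of two-pole resonators stands in for the vocal tract, and the formant frequencies and bandwidths are rough illustrative values for an "ah"-like vowel. All names (`sawtooth`, `resonator`, `vowel`, `AH`) are hypothetical.

```python
import math

SAMPLE_RATE = 16000  # Hz; an arbitrary choice for this sketch


def sawtooth(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Harmonically rich source signal, standing in for the reed or glottis."""
    out = []
    phase = 0.0
    for _ in range(n_samples):
        out.append(2.0 * phase - 1.0)
        phase = (phase + freq_hz / sample_rate) % 1.0
    return out


def resonator(signal, center_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole resonant filter that emphasizes one formant frequency."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * center_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def vowel(formants, pitch_hz=110.0, seconds=0.5):
    """Run the source through one resonator per (center, bandwidth) pair."""
    signal = sawtooth(pitch_hz, int(seconds * SAMPLE_RATE))
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz)
    return signal


# Illustrative formant values only, not measurements.
AH = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = vowel(AH)
```

Changing `pitch_hz` changes the perceived pitch without changing the vowel, while changing the formant list changes the vowel without changing the pitch. That separation is the operator's job on the mechanical machines: one hand on the air supply, the other on the leather.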

Joseph Faber’s Euphonia (1845)

Faber developed von Kempelen’s design into something far more sophisticated: sixteen keys, each generating a different phoneme. You can start to see the importance of the operator in these systems. To make the machine seem less threatening, Faber put a woman’s face on the front of it and, reportedly, sometimes hung a dress in front of the machinery. I suspect this had the opposite of its intended effect.

Edison Talking Dolls (1890s)

Not quite a speaking machine in the traditional sense, but the first sample-based one: a doll with a miniature phonograph inside playing back recordings of children’s rhymes. Edison thought embedding recorded voices in a toy would help people get comfortable with the technology. The preserved recordings suggest he was mistaken.

VODER (1939)

Demonstrated at the 1939 World’s Fair, the VODER was genuinely remarkable for its time — a monophonic synthesizer with an oscillator, a noise generator, and a set of controls for shaping phonemes in real time, with pitch controlled by a foot pedal. What I find most interesting about it is that its “impressiveness” was entirely dependent on its operators, women known as “Voderettes,” who trained for years to produce intelligible speech. The inventor got all the credit. The operators are largely nameless to history.

MUSA — Multichannel Speaking Automaton (1978)

Developed in Italy, MUSA was one of the first practical diphone synthesizers; its developers even pressed a vinyl record of the results. It used recordings of every possible phoneme transition (around 2,000 combinations) and applied DSP to smooth the joins between them. This approach became dominant in commercial TTS through the ’90s and 2000s.
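The core idea of diphone synthesis can be sketched in a few lines: look up one recording per adjacent phoneme pair, then splice the recordings with some smoothing at each seam. The sketch below is an assumption-laden toy, not MUSA's actual signal processing: it uses a plain linear crossfade as the smoothing step, sine bursts as placeholder "recordings", and hypothetical names throughout (`crossfade_concat`, `synthesize`, `bank`).

```python
import math


def crossfade_concat(units, overlap):
    """Join audio units end to end, linearly crossfading `overlap` samples at each seam."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        for i in range(n):
            w = (i + 1) / (n + 1)  # fade-in weight for the incoming unit
            j = len(out) - n + i
            out[j] = (1.0 - w) * out[j] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize(phonemes, diphone_bank, overlap=64):
    """Look up one recording per adjacent phoneme pair and splice them together."""
    pairs = list(zip(phonemes, phonemes[1:]))
    units = [diphone_bank[pair] for pair in pairs]
    return crossfade_concat(units, overlap)


def _tone(freq_hz, n=400, sample_rate=16000):
    """Placeholder 'recording': a short sine burst standing in for real diphone audio."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


bank = {("h", "e"): _tone(300), ("e", "l"): _tone(400), ("l", "o"): _tone(500)}
audio = synthesize(["h", "e", "l", "o"], bank)
```

The dictionary keyed on phoneme pairs is also why the inventory grows so quickly: every transition the language allows needs its own entry, which is how you end up with a bank on the order of the 2,000 combinations mentioned above.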

S.A.M. — Software Automatic Mouth (1982)

One of the first commercially available software-only speech synthesizers, released for the Commodore 64, Atari, and Apple II. What makes SAM notable is that it exposed controls for pitch, speed, and inflection to the user. The company that made it later provided the technology underlying the Macintosh’s MacinTalk, which is where this story gets personal.

Two Recurring Patterns

Before moving on, it’s worth noting two things that recur throughout this history.

Speaking machines are often demonstrated through singing. From HAL 9000 singing “Daisy Bell” in 2001: A Space Odyssey to Siri, singing has always been the ultimate proof-of-concept for TTS, because it forces the system to handle pitch variation, rhythm, and expressiveness. But there’s an implicit claim embedded in this: that singing is the pinnacle of human linguistic expression, and that a speaking machine isn’t truly “human” unless it can sing.

Speaking machines encode the biases of the culture that produces them. Faber put a female face on his Euphonia to make it seem less threatening. The Voderettes trained for years and are now forgotten. Most AI assistants today are female-coded by default. This isn’t incidental — it reflects a consistent, uncomfortable pattern in how we try to make machines seem...
