AES67 Audio-over-IP on the ESP32-P4



AES67/RAVENNA is the audio transport behind a lot of broadcast and live-sound infrastructure, and it normally runs on dedicated silicon or Linux boxes. This article is about getting a working, PTP-synchronized AES67 endpoint onto an ESP32-P4: how the clock sync, the low-latency receive path, and the I2S playout fit together, what the measured latency is, and where the edges are.

What AES67 is, and why this is an odd place to run it

AES67 is an interoperability standard for moving uncompressed audio over a
normal IP network: PCM samples in RTP, sessions described in SDP and
announced over SAP, and (the part that makes it hard) every device locked
to a shared clock by PTP (IEEE-1588) so that streams from different sources
stay sample-aligned. It is what a lot of broadcast trucks, studios, and
live-sound rigs use under the hood, and it interoperates with the RAVENNA
and Dante (AES67 mode) ecosystems.

The hardware that speaks it is usually a dedicated Dante chip, an FPGA, or a
Linux machine with a good NIC. None of those is a microcontroller. The
ESP32-P4 is interesting here because it has two things that make AES67
plausible on a device of this size: a RISC-V core fast enough to convert and move
audio samples in software, and an Ethernet MAC with IEEE-1588 hardware
timestamping, which is the one feature you cannot fake if you want real
PTP sync.

So I built an AES67/RAVENNA endpoint as an ESP-IDF component for the ESP32-P4. It
synchronizes to an external PTP grandmaster (or becomes one), sends and
receives multichannel RTP audio, discovers and is discovered over SAP/SDP,
and plays out through I2S to a DAC.
It interoperates on real hardware with
Merging Technologies SIENNA, with the
aes67-linux-daemon running
on a Raspberry Pi, and with a standalone AES67 hardware speaker/amplifier.
End-to-end latency, best case, is about 0.7 ms.

This article walks through the three parts that actually matter (the clock,
getting packets off the wire fast, and playing them out without jitter) and
is honest about what's solid and what isn't.

The clock is the whole problem

When discussing network audio, latency is the key metric people quote. But the
thing that's genuinely hard is agreement on time. Every AES67 device has
to run its media clock from the same PTP grandmaster, to
sub-microsecond accuracy, or audio from two sources drifts apart and you get
clicks at the seams.

PTP gets that accuracy by timestamping sync packets in hardware, at the MAC,
the instant they cross the wire; software timestamps carry too much jitter
from interrupt latency and scheduling. The ESP32-P4 EMAC has this unit, and
ESP-IDF exposes it through a clock abstraction (esp_eth_clock_gettime,
esp_eth_clock_settime). On top of that I run a small IEEE-1588 daemon
(a port of the NuttX ptpd) that does the protocol: best-master-clock
selection, sync/follow-up/delay-request exchange, and the servo that
disciplines the local clock to the master.

A couple of details worth pulling out:

The PTP identity comes from the MAC address. A PTP clock needs a
unique 64-bit identity; AES67 devices derive it from the 48-bit MAC by the
standard EUI-64 expansion (insert FF:FE in the middle). The node also
uses that to detect when it has been elected grandmaster: it compares
the announced grandmaster ID against its own MAC-derived identity.

It can be master or slave. If there's a better clock on the network,
the node locks to it.
If there isn't, best-master-clock selection promotes
this node to grandmaster and it serves time to everyone else. That's not a
mode switch you configure; it falls out of the protocol.

Once the clock is disciplined, RTP timestamps are just a projection of PTP
time onto the media rate: rtp_ts = ptp_time_ns * sample_rate / 1e9. Get
the clock right and the rest of the timing is arithmetic.

Getting RTP off the wire without paying for the network stack

The receive path is where the latency budget is won or lost. A normal
sockets path (EMAC interrupt, into lwIP, IP/UDP demux, copy into a socket
buffer, wake the reader task) adds buffering and scheduling delay at every
hop, and on a device of this size that's a meaningful fraction of a millisecond
plus jitter you can't predict.

AES67 RTP is multicast UDP on a known port, which means you can recognize it
extremely early. The component reads frames at L2 (via esp_vfs_l2tap),
ahead of the socket layer. The example goes one step further and installs a
hook in the Ethernet driver's receive callback itself, the function that
runs in the driver before lwIP is ever called:

/* Runs for EVERY Ethernet frame, in the driver, before lwIP.
 * RTP multicast on...
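
Stepping back to the clock section for a moment: the two pieces of arithmetic described there, the EUI-64 identity and the projection of PTP time onto the media rate, are small enough to sketch in plain C. This is a minimal sketch and not the component's code. The helper names mac_to_clock_identity and ptp_to_rtp_ts are hypothetical, and the sketch assumes PTP time arrives as separate seconds and nanoseconds fields and that the stream uses no RTP timestamp offset. The split matters because ptp_time_ns * sample_rate does not fit in 64 bits for wall-clock PTP times, so the product is formed per field and then truncated to the 32-bit RTP timestamp.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the component): expand a 48-bit MAC into a
 * 64-bit PTP clock identity by inserting FF:FE between the first three and
 * the last three bytes, as the article describes. */
static void mac_to_clock_identity(const uint8_t mac[6], uint8_t id[8])
{
    memcpy(&id[0], &mac[0], 3);   /* first three MAC bytes */
    id[3] = 0xFF;
    id[4] = 0xFE;
    memcpy(&id[5], &mac[3], 3);   /* last three MAC bytes */
}

/* Hypothetical helper (not from the component): project PTP time onto the
 * media clock, rtp_ts = ptp_time_ns * sample_rate / 1e9, without overflowing
 * 64 bits. Whole seconds and the sub-second remainder are scaled separately;
 * the final cast keeps the low 32 bits, i.e. the result modulo 2^32. */
static uint32_t ptp_to_rtp_ts(uint64_t sec, uint32_t nsec, uint32_t sample_rate)
{
    uint64_t whole = sec * sample_rate;                               /* samples in whole seconds */
    uint64_t frac  = ((uint64_t)nsec * sample_rate) / 1000000000ULL;  /* samples in the remainder */
    return (uint32_t)(whole + frac);
}
```

At 48 kHz, one millisecond of PTP time maps to 48 samples, and the 32-bit result wraps about every 24.9 hours (2^32 / 48000 seconds), so any comparison of RTP timestamps has to be done modulo 2^32.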
