Scaling Akvorado BMP RIB with sharding
To associate routing information—like AS paths or BGP communities—to flows,<br>Akvorado can import routes through the BGP Monitoring Protocol (BMP). As<br>the Internet routing table contains more than 1 million routes, Akvorado<br>needs to scale to tens of millions of routes .1 This has been a<br>long-standing challenge,2 but I expect this issue is now fixed by using<br>RIB sharding , a method that splits the routing database into several parts<br>to enable concurrent updates.
Each router exporting flows doesn’t need to send its routes. When<br>Akvorado does not find a route from a specific device, it falls back to a<br>route sent by another device. It is up to the operator to decide if this<br>is a good enough approximation. ❦
I made many attempts to scale the BMP component. See for example<br>PR #254, PR #255, PR #278, PR #2244, and PR #2245.<br>Despite these efforts, this component remained problematic for some users.<br>See discussion #2287 as the latest example. ❦
Previous implementation<br>Storing routes in a map
Interning routes
Why does it not scale?
RIB sharding<br>First step: basic sharding
Second step: lock-free reads
Previous implementation#
Akvorado connects 2 elements to build its RIB:
a prefix tree , and
a list of routes attached to each prefix.
Akvorado BMP RIB implementation without sharding. One single read/write lock.<br>In the diagram above, the RIB stores five IPv4 prefixes and two IPv6 prefixes.<br>One of them, 2001:db8:1::/48, contains three routes:
from peer 3, next hop 2001:db8::3:1, AS 65402, AS path 65402, community<br>65402:31,
from peer 4, next hop 2001:db8::4:1, same ASN, AS path, and community,
from peer 5, next hop 2001:db8::5:1, AS 65402, AS path 65401 65402,<br>community 65402:31.
The rib structure is defined in Go as follows:
type rib struct {<br>tree *bart.Table[prefixIndex]<br>routes map[routeKey]route<br>nlris *intern.Pool[nlri]<br>nextHops *intern.Pool[nextHop]<br>rtas *intern.Pool[routeAttributes]<br>nextPrefixID prefixIndex<br>freePrefixIDs []prefixIndex
The prefix tree uses the bart package, an adaptation of Donald Knuth’s ART<br>algorithm. The benchmarks demonstrate it outperforms other packages for<br>lookups, insertions, and memory usage.3 Plus, the author is quite<br>helpful.
It keeps improving: bart 0.28.0 features a new<br>implementation that trades a bit of memory for greater lookup performance. I<br>did not test it yet, as I have been preparing this blog post for a couple<br>of months already. ❦
Storing routes in a map#
The list of routes for each prefix is not stored directly in the prefix tree:<br>it would put too much pressure on the garbage collector by allocating per-prefix<br>arrays.
Instead, the RIB assigns a unique 32-bit prefix identifier for each prefix,<br>either by picking the last available prefix identifier from the freePrefixIDs<br>array if any, or using the nextPrefixID value before incrementing it. Then,<br>the routes are stored in the routes map, leveraging the optimized Swiss<br>table in Go. To retrieve routes attached to a prefix, we look them up<br>one by one in the routes map with a 64-bit key combining the 32-bit prefix<br>index with a 32-bit route index matching the position of the route in the list.<br>Akvorado scans routes from the first to the last to find the best one.4 It<br>knows there is no more route if the route key returns no result.
Akvorado prefers the route matching the exact next hop. Otherwise, it<br>falls back to any other route. This is an approximation. An alternative<br>would be to have one prefix tree for each BGP peer but it would require<br>configuring all routers to export their routes. pmacct’s BMP daemon<br>implements this approach. ❦
type prefixIndex uint32<br>type routeIndex uint32<br>type routeKey uint64
Interning routes#
A route contains a BGP peer identifier, a partial NLRI5, the next hop, and the<br>attributes.
If we consider the BGP RIB as a database, the Network Layer<br>Reachability Information (NLRI) is the primary key. Its content depends on<br>the BGP family. With IPv4 or IPv6 unicast, this is the prefix. For VPNv4 and<br>VPNv6 families, it includes the route distinguisher. If you enable the<br>ADD-PATH extension, the NLRI also contains a path identifier.
In our implementation, we don’t store the prefix as we get it from the<br>looked-up IP address using the separately-stored prefix length. ❦
type route struct {<br>peer uint32<br>nlri intern.Reference[nlri]<br>nextHop intern.Reference[nextHop]<br>attributes intern.Reference[routeAttributes]<br>prefixLen uint8
type nlri struct {<br>family bgp.Family<br>path uint32<br>rd RD<br>type nextHop netip.Addr<br>type routeAttributes struct {<br>asn uint32<br>asPath []uint32<br>communities []uint32<br>largeCommunities []bgp.LargeCommunity
To save memory and allocations, NLRI, next hops, and route attributes are<br>“interned:” a 32-bit integer replaces the real value. The mechanism predates the<br>unique package introduced in Go 1.23. We keep it because it has<br>different trade-offs:
It uses explicit reference counting instead of relying on weak pointers.
It works with non-comparable values...