Add VXLAN/EVPN support with flood list management #504
grep -q '10.' can match any 3 characters that start with '10' (e.g. 100, 10a, 10:, etc.). So it can match parts of IPv6 link local addresses. Use full address and -F/--fixed-strings to avoid any special regexp characters. We want verbatim match. Fixes: 74228b7 ("cli: add address flush command") Signed-off-by: Robin Jarry <rjarry@redhat.com>
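The difference between a regular-expression match and a verbatim match can be sketched in C with POSIX `regcomp()` and `strstr()`. The helper names and the sample addresses in the usage note are made up for illustration and are not taken from the smoke tests.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Return true if the POSIX basic regular expression matches anywhere in text.
// This mirrors what `grep -q PATTERN` does.
static bool regex_match(const char *pattern, const char *text) {
	regex_t re;
	bool found;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_NOSUB) != 0)
		return false;
	found = regexec(&re, text, 0, NULL, 0) == 0;
	regfree(&re);
	return found;
}

// Return true if needle appears verbatim in text, like `grep -qF NEEDLE`.
static bool fixed_match(const char *needle, const char *text) {
	assert(needle != NULL && text != NULL);
	return strstr(text, needle) != NULL;
}
```

With a link-local address such as `fe80::210:ff:fe00:1/64`, `regex_match("10.", ...)` succeeds because `.` matches the `:` after `10`, whereas `fixed_match("10.0.0.1", ...)` does not.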
There is no "name" argument available when creating an interface. The name is the first argument. Fixes: 9d5152f ("smoke: add VRF configuration tests") Signed-off-by: Robin Jarry <rjarry@redhat.com>
When removing a port which is the xconnect peer of another one,
iface_from_id(iface->domain_id) will return NULL since the interface was
deleted.
Program terminated with signal SIGSEGV, Segmentation fault.
xconnect_process at modules/infra/datapath/xconnect.c:36
if (peer->type == GR_IFACE_TYPE_PORT) {
__rte_node_process at subprojects/dpdk/lib/graph/rte_graph_worker_common.h:216
rte_graph_walk_rtc at subprojects/dpdk/lib/graph/rte_graph_model_rtc.h:42
rte_graph_walk at subprojects/dpdk/lib/graph/rte_graph_worker.h:38
gr_datapath_loop at modules/infra/datapath/main_loop.c:252
Check the return value and drop the packet in that case.
Signed-off-by: Robin Jarry <rjarry@redhat.com>
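A minimal sketch of the fix described above, using hypothetical, simplified stand-ins for the real interface types and lookup function (`iface_lookup`, `xconnect_next_edge` and the edge names are assumptions, not the actual grout code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical, simplified stand-ins for the real grout types.
enum iface_type { IFACE_TYPE_PORT, IFACE_TYPE_OTHER };
enum edge { EDGE_OUTPUT, EDGE_DROP };

struct iface {
	uint16_t id;
	uint16_t domain_id; // id of the xconnect peer
	enum iface_type type;
};

#define MAX_IFACES 8
static struct iface *ifaces[MAX_IFACES];

// Lookup that returns NULL for unknown or deleted interfaces.
static struct iface *iface_lookup(uint16_t id) {
	return id < MAX_IFACES ? ifaces[id] : NULL;
}

// Pick the next edge for a packet received on iface.
static enum edge xconnect_next_edge(const struct iface *iface) {
	const struct iface *peer;

	assert(iface != NULL);
	peer = iface_lookup(iface->domain_id);
	if (peer == NULL)
		return EDGE_DROP; // peer was removed: drop instead of dereferencing NULL
	if (peer->type == IFACE_TYPE_PORT)
		return EDGE_OUTPUT;
	return EDGE_DROP;
}
```

The only point of the sketch is the order of operations: the lookup result is checked before `peer->type` is read.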
When an interface leaves VRF mode (e.g. reconfigured as cross-connect), any IPv4 and IPv6 addresses previously configured on it become invalid. Likewise, when an interface moves to a different VRF, its addresses belong to the old VRF and need to be removed. Subscribe to GR_EVENT_IFACE_POST_RECONFIG in both IPv4 and IPv6 address modules. On reconfiguration, flush all addresses when the interface is no longer in VRF mode or has moved to a different VRF. For IPv6, also reinitialize link-local and well-known multicast addresses when entering VRF mode or changing VRFs. Extend the IPv6 add/del smoke test to exercise VRF reassignment and cross-connect mode transitions. Signed-off-by: Robin Jarry <rjarry@redhat.com>
When an interface is removed, GR_EVENT_IFACE_PRE_REMOVE is handled by both nexthop_iface_cleanup() in nexthop.c and address-family-specific handlers in ip/control/address.c and ip6/control/address.c (added in 6a1362c "ip,ip6: flush addresses on interface mode change"). The event handler execution order is not guaranteed. If nexthop_iface_cleanup() runs first, it destroys local address nexthops by decrementing their ref_count to zero. When the address-family handler runs next, it accesses already-freed nexthops via nexthop_info_l3(), leading to use-after-free. Skip local address nexthops (NH_LOCAL_ADDR_FLAGS) in nh_cleanup_interface_cb(), leaving their cleanup to addr4_delete() and addr6_delete() which properly remove them from the per-interface address vector and handle associated routes. Signed-off-by: Robin Jarry <rjarry@redhat.com>
When a bond is destroyed, its member ports are detached but remain without a VRF assignment. When a port is destroyed, its peer interfaces (other ports whose domain_id points to this port) lose their domain reference. In both cases, reassign the orphaned ports to the default VRF and fire GR_EVENT_IFACE_POST_RECONFIG so that address-family handlers can flush stale addresses and reinitialize as needed. Export vrf_default_get_or_create() so it can be used from bond and port teardown paths. Signed-off-by: Robin Jarry <rjarry@redhat.com>
When rte_rcu_qsbr_dq_enqueue() fails in DQ mode, the deleted key slot is never freed and becomes permanently leaked. Also, when rte_hash_add_key_data() overwrites an existing key, the old data pointer is silently lost. With RCU-protected readers still potentially accessing the old data, there is no safe way to free it.
Add two patches from an upstream series [1]:
- Fall back to synchronous reclamation instead of only logging an error when the RCU defer queue enqueue fails on key deletion.
- When RCU is configured with a free_key_data_func callback, automatically defer-free the old data pointer on overwrite.
The third patch from that series (adding a new rte_hash_replace API) is not needed since the free_key_data_func callback is sufficient.
[1] https://patches.dpdk.org/project/dpdk/list/?series=37352
Signed-off-by: Robin Jarry <rjarry@redhat.com>
When outputting on a VLAN interface, the local iface variable is reassigned to the parent interface after VLAN tag insertion. The subsequent UP status check and TX stats increment then use this reassigned pointer, accounting them on the parent instead of the original VLAN interface. Use d->iface which still references the original VLAN interface for the status check and stats increment. Fixes: 7701685 ("port: add dedicated port_tx functions") Signed-off-by: Robin Jarry <rjarry@redhat.com>
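The accounting fix can be sketched as follows. The `struct iface` layout and the `iface_output` signature are hypothetical simplifications; the real function works on mbufs and per-packet data (`d->iface`), which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical, simplified interface object.
struct iface {
	uint16_t id;
	bool up;
	struct iface *parent; // non-NULL for VLAN sub-interfaces
	uint64_t tx_packets;
};

// Return the interface the packet is finally transmitted on,
// or NULL when the packet must be dropped.
static struct iface *iface_output(struct iface *orig) {
	struct iface *iface = orig;

	assert(orig != NULL);
	if (iface->parent != NULL)
		iface = iface->parent; // after VLAN tag insertion, continue on the parent

	// Status check and TX accounting use the original interface,
	// not the reassigned local variable.
	if (!orig->up)
		return NULL;
	orig->tx_packets++;

	return iface;
}
```

After one call on an up VLAN interface, the VLAN counter is 1 and the parent counter stays at 0.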
Bridge members that are not VLAN interfaces (trunk ports) need to carry the VLAN ID through the output path so that the Ethernet header can be built with the correct 802.1Q tag. iface_output unconditionally clears d->vlan_id to zero for non-VLAN interfaces, discarding the VLAN ID set during input processing. Only set d->vlan_id when the output interface is actually a VLAN type. Clear it instead at the points where it is no longer needed: in eth_output after the Ethernet header has been built, and in the control plane injection path where no VLAN context exists. Signed-off-by: Robin Jarry <rjarry@redhat.com>
A future change will require calling control_queue_push() from gr_event_push() which lives in main/. If control_queue stays in the infra module, this would create a circular dependency between main and infra. Move control_queue.c and gr_control_queue.h to main/ and replace the event-based drain mechanism with explicit control_queue_drain() calls from iface_destroy() and nexthop_destroy() after the RCU sync. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Pass rte_lcore_id() to rte_rcu_qsbr_synchronize() instead of RTE_QSBR_THRID_INVALID to exclude the calling thread from the quiescent state wait. This is needed to allow creating objects from datapath workers. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Event notifications must be processed on the control plane thread. Modify gr_event_push() to detect when it is called from a datapath worker and use the control queue to defer the notification to the control plane event loop. This enables datapath nodes (such as bridge MAC learning) to create MAC entries on the fly without blocking the control plane. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Introduce a new l2 module with a bridge interface type that allows grouping multiple member interfaces (ports, VLANs, bonds) into a single L2 broadcast domain. The bridge maintains a list of members and supports configurable MAC learning, BUM traffic flooding, per-bridge ageing timer and a custom MAC address. Members are switched to GR_IFACE_MODE_BRIDGE when attached and restored to the default VRF when the bridge is destroyed. FDB management and datapath nodes for actual packet forwarding will follow in subsequent commits. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Implement a forwarding database backed by an RCU-protected rte_hash with a dedicated rte_mempool for entries. The hash is configured with a free_key_data_func callback so that deleted entries are automatically returned to the pool after RCU synchronization. Entries can be added/deleted/flushed via the API and can also be dynamically learned from the datapath via fdb_learn(). A periodic ageing timer evicts learned entries that have not been refreshed within the bridge ageing_time. Static entries configured by the user are never aged out. FDB entries associated with a member or bridge are automatically purged on detach or bridge destruction. The FDB table size defaults to 4096 entries and can be changed at runtime via the config set/get API, provided the table is empty. Signed-off-by: Robin Jarry <rjarry@redhat.com>
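The ageing rule (learned entries expire, static entries do not) can be illustrated with a small sketch. The table layout, the `FDB_F_STATIC` value and the `fdb_age` helper are assumptions for illustration; the actual implementation is backed by rte_hash and a mempool, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FDB_F_STATIC 0x1 // hypothetical flag value

struct fdb_entry {
	bool in_use;
	uint8_t flags;
	uint64_t last_seen; // seconds, refreshed whenever the MAC is seen again
};

// Evict learned entries that were not refreshed within ageing_time seconds.
// Static entries are never aged out. Returns the number of evicted entries.
static size_t fdb_age(struct fdb_entry *tbl, size_t n, uint64_t now, uint64_t ageing_time) {
	size_t evicted = 0;

	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct fdb_entry *e = &tbl[i];

		if (!e->in_use || (e->flags & FDB_F_STATIC))
			continue;
		if (now >= e->last_seen && now - e->last_seen > ageing_time) {
			e->in_use = false;
			evicted++;
		}
	}
	return evicted;
}
```

A periodic timer would call `fdb_age()` with the bridge's `ageing_time`; how often the real timer fires is not stated in the commit message.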
Add bridge_input and bridge_flood datapath nodes. bridge_input receives packets from member interfaces via GR_IFACE_MODE_BRIDGE. It learns source MAC addresses into the FDB (unless GR_BRIDGE_F_NO_LEARN is set), then looks up the destination. Known unicast destinations are forwarded to the learned output interface. Unknown unicast, broadcast and multicast are sent to bridge_flood. Hairpin packets (destination is the source interface) are dropped. When the destination is the bridge interface itself, packets are sent to eth_input for local processing. bridge_flood replicates each packet to all bridge members except the ingress interface, and to the bridge interface itself. The first output reuses the original mbuf, subsequent ones are cloned. When GR_BRIDGE_F_NO_FLOOD is set, the packet is dropped instead. Signed-off-by: Robin Jarry <rjarry@redhat.com>
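The forwarding decision taken by bridge_input can be summarized in a short sketch. The edge names, `IFACE_ID_UNDEF` and the function signatures are hypothetical; learning, VLANs and the FDB lookup itself are omitted, and the order of the hairpin and local checks is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum bridge_edge { EDGE_OUTPUT, EDGE_FLOOD, EDGE_LOCAL, EDGE_DROP };

#define IFACE_ID_UNDEF 0 // hypothetical "no FDB entry found" marker

// A MAC address is unicast when the least significant bit of its
// first byte (the group bit) is clear.
static bool mac_is_unicast(const uint8_t mac[6]) {
	return (mac[0] & 0x01) == 0;
}

// dst_iface_id is the result of the FDB lookup for the destination MAC,
// or IFACE_ID_UNDEF when the destination is unknown.
static enum bridge_edge bridge_next_edge(
	bool dst_is_unicast,
	uint16_t dst_iface_id,
	uint16_t src_iface_id,
	uint16_t bridge_iface_id
) {
	assert(src_iface_id != IFACE_ID_UNDEF);
	if (!dst_is_unicast || dst_iface_id == IFACE_ID_UNDEF)
		return EDGE_FLOOD; // broadcast, multicast or unknown unicast
	if (dst_iface_id == src_iface_id)
		return EDGE_DROP; // hairpin: destination is the ingress interface
	if (dst_iface_id == bridge_iface_id)
		return EDGE_LOCAL; // destined to the bridge itself (eth_input)
	return EDGE_OUTPUT; // known unicast
}
```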
Create a bridge with three member ports and verify L2 forwarding between namespaces, L3 reachability to the bridge interface address, and overwriting a dynamic FDB entry with a static one. Also check that detaching a member and deleting the bridge properly clean up FDB entries. Signed-off-by: Robin Jarry <rjarry@redhat.com>
iface_create copies the requested configuration into the iface struct early via iface->base = conf->base. When the interface is created with GR_IFACE_F_UP, the flag is already set by the time iface_set_up_down runs. The down-to-up transition condition (!(flags & UP) && up) evaluates to false, so GR_IFACE_S_RUNNING is never set and GR_EVENT_IFACE_STATUS_UP is never pushed. This only affects logical interfaces (bridges, VXLAN, VLANs). Physical ports are not affected because their set_up_down callback manages the running state independently via the DPDK link status event. This prevents FRR from seeing logical interfaces as operationally up (IFF_RUNNING), which in turn prevents EVPN from advertising IMET routes for VXLAN interfaces. Clear the UP flag before calling iface_set_up_down so the transition fires normally. Fixes: 9a61e92 ("iface: send status events on admin state changes") Signed-off-by: Robin Jarry <rjarry@redhat.com>
📝 Walkthrough
This pull request introduces comprehensive Layer 2 (L2) support with new bridge and VXLAN interface types, Forwarding Database (FDB) management, and VXLAN flood (VTEP) capabilities. The control plane adds bridge member management, FDB learning and aging, and flood entry tracking. The datapath implements bridge and VXLAN packet processing nodes with learning, flooding, and tunnel encapsulation/decapsulation. FRR integration extends MAC and VTEP support through dplane operations. The event system is refactored to defer notifications via a control queue. New CLI modules enable bridge, VXLAN, FDB, and flood management. Infrastructure is extended with new interface types, VRF handling improvements, and QSBR synchronization updates. Integration tests validate bridge connectivity, VXLAN tunneling, and EVPN/VXLAN interoperability.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@modules/infra/control/group_nexthop.c`:
- Line 152: The call to rte_rcu_qsbr_synchronize(gr_datapath_rcu(),
rte_lcore_id()) is using rte_lcore_id() from a control thread that is not
registered as a QSBR reader; replace the second argument with
RTE_QSBR_THRID_INVALID so the call becomes
rte_rcu_qsbr_synchronize(gr_datapath_rcu(), RTE_QSBR_THRID_INVALID) whenever
invoked from control-plane threads (same change for any other control-plane
calls that pass rte_lcore_id()); ensure only datapath reader threads keep using
their registered thread IDs (registration happens via
rte_rcu_qsbr_thread_register in the datapath main loop).
In `@modules/l2/cli/vxlan.c`:
- Around line 73-77: arg_vrf currently returns 0 when the user omits the
ENCAP_VRF argument, but the code treats 0 as success and unconditionally sets
GR_VXLAN_SET_ENCAP_VRF, causing encap_vrf to be overwritten; fix by storing the
arg_vrf return value (e.g. int ret = arg_vrf(c, p, "ENCAP_VRF",
&vxlan->encap_vrf_id)), return on ret < 0, and only set set_attrs |=
GR_VXLAN_SET_ENCAP_VRF when ret > 0 (meaning the user actually supplied
ENCAP_VRF), leaving vxlan->encap_vrf_id untouched when the argument is absent.
In `@modules/l2/control/bridge.c`:
- Around line 60-77: bridge_detach_member currently resets member->mode to
GR_IFACE_MODE_VRF but leaves member->vrf_id as GR_VRF_ID_UNDEF; update
bridge_detach_member to restore the member's VRF by calling
vrf_default_get_or_create() and assigning the returned vrf id to member->vrf_id
and incrementing its refcount via vrf_incref (mirroring bridge_fini behavior),
then set member->mode = GR_IFACE_MODE_VRF so the detached iface has a valid VRF.
In `@modules/l2/control/vxlan.c`:
- Around line 281-287: The vtep_flood_del function mutates the shared
flood_vteps array in-place (swap-and-decrement) without RCU protection, causing
a data-race with datapath readers; change vtep_flood_del to follow the
copy-on-write + RCU pattern used by vtep_flood_add: allocate a new flood_vteps
buffer, copy entries from the old array excluding entry->vtep.addr (preserving
order if add does), set the new pointer and updated n_flood_vteps atomically
(using the same RCU/atomic swap helper used by vtep_flood_add), schedule the old
buffer to be freed after the RCU grace period, and keep the
gr_event_push(GR_EVENT_FLOOD_DEL, entry) call; reference vtep_flood_del,
vtep_flood_add, flood_vteps, n_flood_vteps, and gr_event_push when making the
change.
- Around line 50-83: The delete uses cur->encap_vrf_id after it was overwritten,
so rte_hash_del_key is built with the new encap_vrf_id instead of the old one;
fix by capturing the old encap_vrf_id (and old vni if needed) before mutating
cur (e.g., read old_vrf = cur->encap_vrf_id and build cur_key from old_vrf and
cur->vni) or postpone assigning cur->encap_vrf_id until after the hash
delete/add sequence; update the code around cur->encap_vrf_id, cur_key,
rte_hash_del_key, next_key and rte_hash_add_key_data accordingly so the deletion
targets the original {old_vni, old_vrf}.
In `@modules/l2/datapath/vxlan_output.c`:
- Around line 75-79: vxlan_output currently assigns ip_output_mbuf_data(m)->nh =
fib4_lookup(...) without checking for NULL and always sends packets to
IP_OUTPUT; change vxlan_output to check the result of fib4_lookup (the value
stored in ip_output_mbuf_data(m)->nh) and if it is NULL enqueue the packet to
the BAD_NEXTHOP edge (the declared but unused BAD_NEXTHOP path) instead of
forwarding to IP_OUTPUT, otherwise continue to set edge = IP_OUTPUT and enqueue
as before; update the enqueue logic around rte_node_enqueue_x1(graph, node,
edge, m) so the chosen edge reflects this NULL-check.
🧹 Nitpick comments (3)
frr/if_grout.c (1)
369-378: Variable `add` shadows outer `bool add` on line 356.
`struct gr_fdb_add_req *add` (line 370) shadows the `bool add` declared at line 356. This works correctly due to block scoping, but it's a latent maintenance trap: a future refactor could easily reference the wrong `add`.
Proposed fix: rename inner variable
 	if (add) {
-		struct gr_fdb_add_req *add = req;
-		add->exist_ok = true;
-		add->fdb.iface_id = ifindex_frr_to_grout(dplane_ctx_get_ifindex(ctx));
-		add->fdb.bridge_id = ifindex_frr_to_grout(dplane_ctx_mac_get_br_ifindex(ctx));
-		add->fdb.vlan_id = dplane_ctx_mac_get_vlan(ctx);
-		add->fdb.flags = dplane_ctx_mac_get_dp_static(ctx) ? GR_FDB_F_STATIC : 0;
-		memcpy(&add->fdb.mac, dplane_ctx_mac_get_addr(ctx), sizeof(add->fdb.mac));
-		add->fdb.vtep = dplane_ctx_mac_get_vtep_ip(ctx)->s_addr;
+		struct gr_fdb_add_req *add_req = req;
+		add_req->exist_ok = true;
+		add_req->fdb.iface_id = ifindex_frr_to_grout(dplane_ctx_get_ifindex(ctx));
+		add_req->fdb.bridge_id = ifindex_frr_to_grout(dplane_ctx_mac_get_br_ifindex(ctx));
+		add_req->fdb.vlan_id = dplane_ctx_mac_get_vlan(ctx);
+		add_req->fdb.flags = dplane_ctx_mac_get_dp_static(ctx) ? GR_FDB_F_STATIC : 0;
+		memcpy(&add_req->fdb.mac, dplane_ctx_mac_get_addr(ctx), sizeof(add_req->fdb.mac));
+		add_req->fdb.vtep = dplane_ctx_mac_get_vtep_ip(ctx)->s_addr;
 		req_type = GR_FDB_ADD;
modules/l2/api/gr_l2.h (1)
44-49: Bit 36 skipped in VXLAN reconfiguration flags.
`GR_VXLAN_SET_LOCAL` is bit 35, `GR_VXLAN_SET_MAC` jumps to bit 37. Bit 36 is unused. If intentional (reserved for a future attribute), no problem. If a typo, it won't cause a bug now but could cause confusion later.
modules/l2/control/fdb.c (1)
329-346: Redundant `fdb_max_entries` assignment.
Line 342 sets `fdb_max_entries = req->max_entries`, but `fdb_reconfig` (line 79) already does the same assignment. Harmless, but the duplicate write could be removed.
	if (set_attrs & GR_VXLAN_SET_ENCAP_VRF) {
		uint16_t vrf = next->encap_vrf_id;
		uint16_t old = cur->encap_vrf_id;

		if (vrf == GR_VRF_ID_UNDEF)
			vrf = vrf_default_get_or_create();

		if (vrf != old && vrf_incref(vrf) < 0)
			return -errno;

		if (old != GR_VRF_ID_UNDEF)
			vrf_decref(old);

		cur->encap_vrf_id = vrf;
	}

	if (set_attrs & (GR_VXLAN_SET_VNI | GR_VXLAN_SET_ENCAP_VRF)) {
		const struct vxlan_key next_key = {rte_cpu_to_be_32(next->vni), next->encap_vrf_id};
		const struct vxlan_key cur_key = {rte_cpu_to_be_32(cur->vni), cur->encap_vrf_id};

		if (rte_hash_lookup(vxlan_hash, &next_key) >= 0)
			return errno_set(EADDRINUSE);

		if (next->vni == 0 || next->vni > 0xffffff)
			return errno_set(ERANGE);

		rte_hash_del_key(vxlan_hash, &cur_key);

		ret = rte_hash_add_key_data(vxlan_hash, &next_key, iface);
		if (ret < 0)
			return errno_log(-ret, "rte_hash_add_key_data");

		cur->vni = next->vni;
	}
Bug: cur_key for hash deletion uses the already-updated encap_vrf_id.
When GR_VXLAN_SET_ENCAP_VRF is set, line 63 updates cur->encap_vrf_id to the new value before the hash key update block at line 66. Consequently, cur_key on line 68 is built with the new encap_vrf_id, not the old one. The rte_hash_del_key on line 76 attempts to delete a key that doesn't exist yet, leaving the old hash entry ({old_vni, old_vrf}) as a stale orphan.
Proposed fix: capture old key before mutating cur
+ // Save old key components before ENCAP_VRF update may overwrite them.
+ const struct vxlan_key old_key = {rte_cpu_to_be_32(cur->vni), cur->encap_vrf_id};
+
if (set_attrs & GR_VXLAN_SET_ENCAP_VRF) {
uint16_t vrf = next->encap_vrf_id;
uint16_t old = cur->encap_vrf_id;
if (vrf == GR_VRF_ID_UNDEF)
vrf = vrf_default_get_or_create();
if (vrf != old && vrf_incref(vrf) < 0)
return -errno;
if (old != GR_VRF_ID_UNDEF)
vrf_decref(old);
cur->encap_vrf_id = vrf;
}
if (set_attrs & (GR_VXLAN_SET_VNI | GR_VXLAN_SET_ENCAP_VRF)) {
const struct vxlan_key next_key = {rte_cpu_to_be_32(next->vni), next->encap_vrf_id};
- const struct vxlan_key cur_key = {rte_cpu_to_be_32(cur->vni), cur->encap_vrf_id};
if (rte_hash_lookup(vxlan_hash, &next_key) >= 0)
return errno_set(EADDRINUSE);
if (next->vni == 0 || next->vni > 0xffffff)
return errno_set(ERANGE);
- rte_hash_del_key(vxlan_hash, &cur_key);
+ rte_hash_del_key(vxlan_hash, &old_key);
ret = rte_hash_add_key_data(vxlan_hash, &next_key, iface);
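The ordering problem described in this comment can be reproduced outside DPDK with a tiny array standing in for rte_hash. Everything below (the `table_*` helpers, `struct vxlan_info`, `vxlan_rekey`) is a hypothetical simplification that only illustrates capturing the old key before the object is mutated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical, simplified stand-ins for struct vxlan_key and rte_hash.
struct vxlan_key {
	uint32_t vni;
	uint16_t vrf_id;
};

#define TABLE_SIZE 8
static struct {
	bool used;
	struct vxlan_key key;
} table[TABLE_SIZE];

static int table_find(const struct vxlan_key *k) {
	for (int i = 0; i < TABLE_SIZE; i++) {
		if (table[i].used && table[i].key.vni == k->vni && table[i].key.vrf_id == k->vrf_id)
			return i;
	}
	return -1;
}

static int table_add(const struct vxlan_key *k) {
	for (int i = 0; i < TABLE_SIZE; i++) {
		if (!table[i].used) {
			table[i].used = true;
			table[i].key = *k;
			return i;
		}
	}
	return -1;
}

static void table_del(const struct vxlan_key *k) {
	int i = table_find(k);
	if (i >= 0)
		table[i].used = false;
}

static int table_count(void) {
	int n = 0;
	for (int i = 0; i < TABLE_SIZE; i++)
		n += table[i].used ? 1 : 0;
	return n;
}

struct vxlan_info {
	uint32_t vni;
	uint16_t encap_vrf_id;
};

static int vxlan_rekey(struct vxlan_info *cur, uint32_t vni, uint16_t vrf_id) {
	// Snapshot the old key first: cur is about to be overwritten.
	const struct vxlan_key old_key = {cur->vni, cur->encap_vrf_id};
	const struct vxlan_key new_key = {vni, vrf_id};

	if (table_find(&new_key) >= 0)
		return -1; // new key already in use

	cur->encap_vrf_id = vrf_id;
	cur->vni = vni;

	table_del(&old_key); // deletes the entry that is actually in the table
	return table_add(&new_key);
}
```

Had `old_key` been built after the assignments to `cur`, `table_del()` would look for the new key, find nothing, and the table would end up with two entries for the same interface.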
	ip_output_mbuf_data(m)->nh = fib4_lookup(vxlan->encap_vrf_id, d->vtep);

	edge = IP_OUTPUT;
next:
	rte_node_enqueue_x1(graph, node, edge, m);
🧩 Analysis chain
🏁 Script executed:
# Check the vxlan_output.c file around the lines in question
cat -n modules/l2/datapath/vxlan_output.c | sed -n '50,90p'
Repository: DPDK/grout
Length of output: 1387
🏁 Script executed:
# Search for fib4_lookup definition
rg -n 'fib4_lookup' --type=c -A 5 | head -80
Repository: DPDK/grout
Length of output: 4748
🏁 Script executed:
# Check for BAD_NEXTHOP edge declaration and usage in vxlan_output.c
rg -n 'BAD_NEXTHOP' modules/l2/datapath/vxlan_output.c
Repository: DPDK/grout
Length of output: 119
🏁 Script executed:
# Look for edge declarations in vxlan_output.c
rg -n 'enum.*edge|edge.*=' modules/l2/datapath/vxlan_output.c -i | head -20
Repository: DPDK/grout
Length of output: 130
🏁 Script executed:
# Check fib4_lookup implementation
cat -n modules/ip/control/route.c | sed -n '119,145p'
Repository: DPDK/grout
Length of output: 892
🏁 Script executed:
# Check if ip_output can handle NULL nexthop
rg -n 'ip_output' modules/ip/datapath/ -A 20 | grep -A 20 'process.*ip_output' | head -40
Repository: DPDK/grout
Length of output: 1282
🏁 Script executed:
# Search for ip_output process function
fd 'ip_output' --type=f modules/ip/datapath/ && grep -n 'process.*ip_output\|ip_output_process' modules/ip/datapath/*.c
Repository: DPDK/grout
Length of output: 286
🏁 Script executed:
# Look at what happens with NULL nexthop in ip_output
rg -n '->nh' modules/ip/datapath/ip_output.c -C 2 | head -50
Repository: DPDK/grout
Length of output: 458
🏁 Script executed:
# Check ip_output_process function to see how it handles nexthop
cat -n modules/ip/datapath/ip_output.c | sed -n '58,120p'
Repository: DPDK/grout
Length of output: 1904
🏁 Script executed:
# Look for how nexthop is dereferenced in ip_output
rg -n 'nh\-' modules/ip/datapath/ip_output.c -C 2 | head -60
Repository: DPDK/grout
Length of output: 277
🏁 Script executed:
# Check if there's NULL checking for nexthop in ip_output
rg -n 'nh == NULL|nh != NULL' modules/ip/datapath/ip_output.c
Repository: DPDK/grout
Length of output: 76
Missing NULL check for fib4_lookup result — error handling deferred to downstream node.
fib4_lookup can return NULL when no route is found, but vxlan_output doesn't check this locally. Instead, the packet proceeds to ip_output, which detects the NULL and routes it via NO_ROUTE to an error handler. While this works, it's inefficient and inconsistent with other modules (dnat44, icmp_output, ip_input). The BAD_NEXTHOP edge is declared but never used.
Proposed fix
ip_output_mbuf_data(m)->nh = fib4_lookup(vxlan->encap_vrf_id, d->vtep);
- edge = IP_OUTPUT;
+ if (unlikely(ip_output_mbuf_data(m)->nh == NULL)) {
+ edge = BAD_NEXTHOP;
+ } else {
+ edge = IP_OUTPUT;
+ }
Introduce the VXLAN interface type for the L2 module. A VXLAN interface carries a VNI (VXLAN Network Identifier), a local VTEP address used as the outer IP source, an encapsulation VRF for underlay routing, and a configurable UDP destination port (default 4789). VXLAN interfaces are keyed by (VNI, encap_vrf_id) in a lockfree RCU-protected hash table so that the datapath can resolve incoming tunneled packets to the correct interface without locks. VXLAN interfaces are intended to be attached to a bridge domain. All L2 traffic entering the bridge is forwarded transparently over the VXLAN tunnel. The local VTEP address must already be configured in the encapsulation VRF. Signed-off-by: Robin Jarry <rjarry@redhat.com>
VXLAN uses UDP port 4789 by default but allows configuring a custom destination port per interface. Allow the control plane to register additional UDP ports at runtime as aliases for an already registered port, reusing the same datapath edge. Use reference counting so that multiple interfaces sharing the same non-default port do not interfere with each other during teardown. Signed-off-by: Robin Jarry <rjarry@redhat.com>
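The reference-counted alias idea can be sketched with two flat arrays indexed by port number. All names below (`l4_port_register`, `l4_alias_add`, `l4_alias_del`, `port_edge`) are hypothetical and only illustrate the counting behaviour described in the commit message.

```c
#include <assert.h>
#include <stdint.h>

// Edge associated with each UDP port; 0 means "not registered".
static uint8_t port_edge[65536];
// Number of users of each alias port.
static uint16_t alias_refs[65536];

// Register a primary port with its datapath edge.
static int l4_port_register(uint16_t port, uint8_t edge) {
	if (edge == 0 || port_edge[port] != 0)
		return -1;
	port_edge[port] = edge;
	return 0;
}

// Make alias deliver to the same edge as target and take a reference.
static int l4_alias_add(uint16_t alias, uint16_t target) {
	if (port_edge[target] == 0)
		return -1; // target must already be registered
	if (alias_refs[alias] == 0) {
		if (port_edge[alias] != 0)
			return -1; // alias collides with a primary port
		port_edge[alias] = port_edge[target];
	} else if (port_edge[alias] != port_edge[target]) {
		return -1; // alias already points to another edge
	}
	alias_refs[alias]++;
	return 0;
}

// Drop one reference; the alias disappears with its last user.
static int l4_alias_del(uint16_t alias) {
	if (alias_refs[alias] == 0)
		return -1;
	if (--alias_refs[alias] == 0)
		port_edge[alias] = 0;
	return 0;
}
```

Two interfaces sharing the same non-default port each take one reference, so deleting one of them leaves the alias in place for the other.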
Wire up the VXLAN interface's configurable destination port to the L4 input node. When a non-default port is configured, register it as an alias for the standard VXLAN port (4789) so that the datapath delivers matching UDP packets to the vxlan_input node. Unregister the alias when the port changes or the interface is destroyed. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Introduce a transport-agnostic flood list framework for BUM traffic (Broadcast, Unknown unicast, Multicast). In EVPN, each PE maintains a flooding list built from IMET routes (RFC 8365, RFC 9572). The entries in this list differ depending on the overlay encapsulation: VXLAN uses a remote VTEP IPv4 address and a VNI, while SRv6 would use a 128-bit SID. The API defines a gr_flood_entry structure with a type discriminant and a union, allowing future encapsulation types (e.g. SRv6 SIDs) to be added without changing the API request types. A dispatch layer in control/flood.c routes add/del/list operations to type-specific callbacks registered at init time. Implement the VXLAN VTEP flood type (GR_FLOOD_T_VTEP). Each VXLAN interface maintains a per-VNI array of remote VTEP addresses used by the vxlan_flood datapath node for ingress replication. The array is replaced atomically with an RCU synchronization barrier so that the datapath never sees a partially updated list. CLI commands are exposed under "flood vtep add/del/show". Signed-off-by: Robin Jarry <rjarry@redhat.com>
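A reduced model of the type discriminant, union and dispatch layer is sketched below. The names are hypothetical (the real API uses `gr_flood_entry` and `GR_FLOOD_T_VTEP`), and the per-VNI arrays and RCU replacement are deliberately left out: the VTEP list here is a single global array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum flood_type { FLOOD_T_VTEP = 1, FLOOD_T_COUNT };

struct flood_entry {
	enum flood_type type;
	union {
		struct {
			uint32_t vni;
			uint32_t addr; // remote VTEP IPv4 address
		} vtep;
		// A future type (e.g. an SRv6 SID) would add a member here.
	};
};

struct flood_ops {
	int (*add)(const struct flood_entry *);
	int (*del)(const struct flood_entry *);
};

static const struct flood_ops *ops_by_type[FLOOD_T_COUNT];

static void flood_type_register(enum flood_type type, const struct flood_ops *ops) {
	assert((unsigned)type < (unsigned)FLOOD_T_COUNT);
	ops_by_type[type] = ops;
}

static const struct flood_ops *flood_ops_get(const struct flood_entry *e) {
	if ((unsigned)e->type >= (unsigned)FLOOD_T_COUNT)
		return NULL;
	return ops_by_type[e->type];
}

// Dispatch layer: route the request to the type-specific callback.
static int flood_add(const struct flood_entry *e) {
	const struct flood_ops *ops = flood_ops_get(e);
	return ops != NULL ? ops->add(e) : -1;
}

static int flood_del(const struct flood_entry *e) {
	const struct flood_ops *ops = flood_ops_get(e);
	return ops != NULL ? ops->del(e) : -1;
}

// VTEP flood type, simplified to one global list.
#define MAX_VTEPS 16
static uint32_t vteps[MAX_VTEPS];
static size_t n_vteps;

static int vtep_add(const struct flood_entry *e) {
	for (size_t i = 0; i < n_vteps; i++) {
		if (vteps[i] == e->vtep.addr)
			return -1; // already present
	}
	if (n_vteps == MAX_VTEPS)
		return -1;
	vteps[n_vteps++] = e->vtep.addr;
	return 0;
}

static int vtep_del(const struct flood_entry *e) {
	for (size_t i = 0; i < n_vteps; i++) {
		if (vteps[i] == e->vtep.addr) {
			vteps[i] = vteps[--n_vteps];
			return 0;
		}
	}
	return -1;
}

static const struct flood_ops vtep_ops = {.add = vtep_add, .del = vtep_del};
```

Adding a new encapsulation then amounts to a new enum value, a new union member and one more `flood_type_register()` call, without touching the request types.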
In a VXLAN overlay, the bridge needs to know which remote VTEP to use when sending unicast frames to a learned MAC address. Add a VTEP IPv4 address field to FDB entries so that known unicast traffic can be sent directly to the correct tunnel endpoint instead of being flooded to all VTEPs. When bridge_input learns a MAC address from a VXLAN member interface, it records the source VTEP from the decapsulated packet's outer IP header. When forwarding to a known destination, the stored VTEP address is passed to the output path via the mbuf private data so that vxlan_output can build the correct outer header. Only set the VTEP field when the source interface is actually a VXLAN type to avoid storing uninitialized data from other packet paths (control plane, local bridge traffic). Signed-off-by: Robin Jarry <rjarry@redhat.com>
Add three datapath nodes for VXLAN packet processing. vxlan_input decapsulates incoming UDP/4789 packets. It strips the outer UDP and VXLAN headers, resolves the inner VNI to a VXLAN interface via the RCU-protected hash table, records the source VTEP from the outer IP header into the mbuf private data, and forwards the inner Ethernet frame to iface_input for bridge processing. vxlan_output encapsulates outgoing frames for a known destination VTEP. It prepends a pre-built IP/UDP/VXLAN header template initialized by the control plane, fills in the per-packet fields (destination VTEP, UDP length, IP length, checksum), and hashes the inner flow to select an ephemeral source port for underlay ECMP (RFC 7348 Section 5). The FIB lookup for the outer IP uses the encapsulation VRF, not the bridge domain. vxlan_flood handles BUM traffic by replicating the frame to every VTEP in the flood list via ingress replication. The original mbuf is sent to the first VTEP and clones are created for the rest. The bridge_flood node is updated to steer VXLAN member traffic through vxlan_flood instead of direct iface_output. Signed-off-by: Robin Jarry <rjarry@redhat.com>
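The source port selection for underlay ECMP can be sketched as follows. RFC 7348 recommends deriving the UDP source port from a hash of inner packet fields and keeping it in the dynamic port range 49152-65535. The FNV-1a hash and the helper names are stand-ins chosen for illustration; which hash function and which inner fields vxlan_output actually uses is not stated in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SRC_PORT_MIN 49152u
#define SRC_PORT_MAX 65535u

// 32-bit FNV-1a, used here only as a simple, well-known stand-in hash.
static uint32_t fnv1a(const uint8_t *data, size_t len) {
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

// Map a hash of the inner flow bytes to a port in the dynamic range, so
// that packets of one flow keep the same outer 5-tuple while different
// flows can spread over several underlay paths.
static uint16_t vxlan_src_port(const uint8_t *inner, size_t len) {
	const uint32_t range = SRC_PORT_MAX - SRC_PORT_MIN + 1;

	assert(inner != NULL || len == 0);
	return (uint16_t)(SRC_PORT_MIN + fnv1a(inner, len) % range);
}
```

The same inner bytes always map to the same port, which keeps a flow on one underlay path and avoids reordering.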
Set up a VXLAN overlay between grout and a Linux netns peer. Grout runs a bridge with a VXLAN member (VNI 100) and the Linux side mirrors the topology with a kernel VXLAN device enslaved to a Linux bridge. Both sides have flood lists configured with each other's VTEP address for BUM traffic replication. The test verifies L3 connectivity over the tunnel by having the Linux side ping the bridge address. This exercises the full path: ARP resolution over VXLAN, FDB learning from decapsulated traffic, and ICMP echo reply via the VXLAN output encapsulation. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Report bridge interfaces to FRR as ZEBRA_IF_BRIDGE with their MAC address. Tag members with ZEBRA_IF_SLAVE_BRIDGE and propagate the bridge ifindex so that FRR can associate them with the correct master. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Report VXLAN interfaces to FRR's zebra as ZEBRA_IF_VXLAN with the associated L2 VNI information. This allows FRR's EVPN control plane to discover which VNIs are locally configured and advertise them via BGP IMET routes to remote PEs. The VXLAN L2 info includes the VNI, the local VTEP address, and the underlay interface index so that zebra can correlate the tunnel with the correct underlay routing context. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Synchronize bridge FDB entries bidirectionally between grout and FRR. This is required for EVPN to advertise locally learned MAC addresses via BGP type-2 routes and to install remotely learned MACs into the bridge forwarding table. Subscribe to FDB add/del/update events from grout and translate them to dplane MAC install/delete operations for zebra. In the reverse direction, handle DPLANE_OP_MAC_INSTALL/DELETE from FRR and convert them to GR_FDB_ADD/DEL API calls. The VTEP address is propagated in both directions so that remote MACs are associated with the correct tunnel endpoint. Self-event suppression is enabled on the FDB subscriptions to prevent feedback loops when FRR installs a MAC that was originally learned by grout. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Handle DPLANE_OP_VTEP_ADD and DPLANE_OP_VTEP_DELETE operations from FRR's EVPN control plane. When BGP learns a remote VTEP via an IMET route (EVPN Route Type 3), zebra pushes the VTEP to the dataplane provider which translates it to a GR_FLOOD_ADD/DEL request with GR_FLOOD_T_VTEP type. This allows BGP EVPN to dynamically manage the per-VNI flood lists used for BUM traffic ingress replication, replacing the need for static flood list configuration via the CLI. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Set up a full EVPN/VXLAN topology between FRR+grout and a standalone FRR+Linux peer. Each side runs a bridge with a VXLAN member (VNI 100) and a host namespace. Both peers run iBGP with the l2vpn evpn address-family and advertise-all-vni. The test verifies that EVPN type-3 (IMET) routes are exchanged so that both sides install each other's VTEP in their flood lists. It then verifies end-to-end L2 connectivity by pinging between the two host namespaces through the VXLAN overlay, which exercises type-2 (MAC/IP) route advertisement and FDB synchronization. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Add VXLAN interface type with encapsulation/decapsulation datapath nodes. Each VXLAN interface maintains a per-VNI flood list of remote VTEPs used for BUM traffic ingress replication.
The flood list API is transport-agnostic, designed to accommodate future SRv6 EVPN support. VXLAN VTEP is the first registered flood type. A dispatch layer routes add/del/list operations to type-specific callbacks.
FRR integration is wired up for bridge interfaces, VXLAN interfaces, FDB entries and flood lists. This enables BGP EVPN type-2 (MAC/IP) and type-3 (IMET) route exchange with remote PEs.
Also fix interface running state not being set on creation. This prevented FRR from seeing logical interfaces as operationally up.