
Add VXLAN/EVPN support with flood list management #504

Draft
rjarry wants to merge 29 commits into DPDK:main from rjarry:vxlan

Conversation


@rjarry rjarry commented Feb 14, 2026

Add VXLAN interface type with encapsulation/decapsulation datapath nodes. Each VXLAN interface maintains a per-VNI flood list of remote VTEPs used for BUM traffic ingress replication.

The flood list API is transport-agnostic, designed to accommodate future SRv6 EVPN support. VXLAN VTEP is the first registered flood type. A dispatch layer routes add/del/list operations to type-specific callbacks.

FRR integration is wired up for bridge interfaces, VXLAN interfaces, FDB entries and flood lists. This enables BGP EVPN type-2 (MAC/IP) and type-3 (IMET) route exchange with remote PEs.

Also fix interface running state not being set on creation. This prevented FRR from seeing logical interfaces as operationally up.

Summary by CodeRabbit

  • New Features

    • Added bridge interface type with member management, MAC learning, and flooding capabilities.
    • Added VXLAN tunnel interface support with VNI configuration and VTEP management.
    • Added Forwarding Database (FDB) management for MAC learning and aging.
    • Added CLI commands for bridge, VXLAN, FDB, and flood management.
  • Bug Fixes

    • Fixed synchronization timing in resource cleanup operations.
    • Improved control queue draining for proper resource deallocation.
  • Chores

    • Expanded build system for new L2 module infrastructure.

grep -q '10.' can match any three-character sequence starting with '10'
(e.g. 100, 10a, 10:, etc.), so it can also match parts of IPv6
link-local addresses.

Use the full address with -F/--fixed-strings so that no special regexp
characters are interpreted: we want a verbatim match.
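The difference can be demonstrated directly (the addresses here are
illustrative, not the ones from the actual test):

```shell
# '.' is a regex metacharacter, so '10.' matches '10' followed by any
# character, including the '10a' inside this IPv6 link-local address:
printf 'fe80::10ab:1/64\n' | grep -q '10.' && echo "false match"

# With -F and the full address, only a verbatim occurrence can match:
printf 'fe80::10ab:1/64\n' | grep -qF '10.0.0.1/24' || echo "no match"
```

The first command prints "false match", the second prints "no match".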

Fixes: 74228b7 ("cli: add address flush command")
Signed-off-by: Robin Jarry <rjarry@redhat.com>
There is no "name" argument available when creating an interface. The
name is the first argument.

Fixes: 9d5152f ("smoke: add VRF configuration tests")
Signed-off-by: Robin Jarry <rjarry@redhat.com>
When removing a port which is the xconnect peer of another one,
iface_from_id(iface->domain_id) will return NULL since the interface was
deleted.

Program terminated with signal SIGSEGV, Segmentation fault.
  xconnect_process at modules/infra/datapath/xconnect.c:36
                      if (peer->type == GR_IFACE_TYPE_PORT) {
  __rte_node_process at subprojects/dpdk/lib/graph/rte_graph_worker_common.h:216
  rte_graph_walk_rtc at subprojects/dpdk/lib/graph/rte_graph_model_rtc.h:42
  rte_graph_walk at subprojects/dpdk/lib/graph/rte_graph_worker.h:38
  gr_datapath_loop at modules/infra/datapath/main_loop.c:252

Check the return value and drop the packet in that case.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
When an interface leaves VRF mode (e.g. reconfigured as cross-connect),
any IPv4 and IPv6 addresses previously configured on it become invalid.
Likewise, when an interface moves to a different VRF, its addresses
belong to the old VRF and need to be removed.

Subscribe to GR_EVENT_IFACE_POST_RECONFIG in both IPv4 and IPv6 address
modules. On reconfiguration, flush all addresses when the interface is
no longer in VRF mode or has moved to a different VRF. For IPv6, also
reinitialize link-local and well-known multicast addresses when entering
VRF mode or changing VRFs.

Extend the IPv6 add/del smoke test to exercise VRF reassignment and
cross-connect mode transitions.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
When an interface is removed, GR_EVENT_IFACE_PRE_REMOVE is handled by
both nexthop_iface_cleanup() in nexthop.c and address-family-specific
handlers in ip/control/address.c and ip6/control/address.c (added in
6a1362c "ip,ip6: flush addresses on interface mode change").

The event handler execution order is not guaranteed. If
nexthop_iface_cleanup() runs first, it destroys local address nexthops
by decrementing their ref_count to zero. When the address-family handler
runs next, it accesses already-freed nexthops via nexthop_info_l3(),
leading to use-after-free.

Skip local address nexthops (NH_LOCAL_ADDR_FLAGS) in
nh_cleanup_interface_cb(), leaving their cleanup to addr4_delete() and
addr6_delete() which properly remove them from the per-interface address
vector and handle associated routes.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
When a bond is destroyed, its member ports are detached but remain
without a VRF assignment. When a port is destroyed, its peer interfaces
(other ports whose domain_id points to this port) lose their domain
reference.

In both cases, reassign the orphaned ports to the default VRF and fire
GR_EVENT_IFACE_POST_RECONFIG so that address-family handlers can flush
stale addresses and reinitialize as needed.

Export vrf_default_get_or_create() so it can be used from bond and port
teardown paths.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
When rte_rcu_qsbr_dq_enqueue() fails in DQ mode, the deleted key slot
is never freed and becomes permanently leaked. Also, when
rte_hash_add_key_data() overwrites an existing key, the old data pointer
is silently lost. With RCU-protected readers still potentially accessing
the old data, there is no safe way to free it.

Add two patches from an upstream series [1]:

- Fall back to synchronous reclamation instead of only logging an error
  when the RCU defer queue enqueue fails on key deletion.
- When RCU is configured with a free_key_data_func callback,
  automatically defer-free the old data pointer on overwrite.

The third patch from that series (adding a new rte_hash_replace API) is
not needed since the free_key_data_func callback is sufficient.

[1] https://patches.dpdk.org/project/dpdk/list/?series=37352

Signed-off-by: Robin Jarry <rjarry@redhat.com>
When outputting on a VLAN interface, the local iface variable is
reassigned to the parent interface after VLAN tag insertion. The
subsequent UP status check and TX stats increment then use this
reassigned pointer, accounting them on the parent instead of the
original VLAN interface.

Use d->iface which still references the original VLAN interface
for the status check and stats increment.

Fixes: 7701685 ("port: add dedicated port_tx functions")
Signed-off-by: Robin Jarry <rjarry@redhat.com>
Bridge members that are not VLAN interfaces (trunk ports) need to
carry the VLAN ID through the output path so that the Ethernet
header can be built with the correct 802.1Q tag. iface_output
unconditionally clears d->vlan_id to zero for non-VLAN interfaces,
discarding the VLAN ID set during input processing.

Only set d->vlan_id when the output interface is actually a VLAN
type. Clear it instead at the points where it is no longer needed:
in eth_output after the Ethernet header has been built, and in the
control plane injection path where no VLAN context exists.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
A future change will require calling control_queue_push() from
gr_event_push() which lives in main/. If control_queue stays in the
infra module, this would create a circular dependency between main and
infra.

Move control_queue.c and gr_control_queue.h to main/ and replace the
event-based drain mechanism with explicit control_queue_drain() calls
from iface_destroy() and nexthop_destroy() after the RCU sync.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Pass rte_lcore_id() to rte_rcu_qsbr_synchronize() instead of
RTE_QSBR_THRID_INVALID to exclude the calling thread from the quiescent
state wait. This is needed to allow creating objects from datapath
workers.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Event notifications must be processed on the control plane thread.
Modify gr_event_push() to detect when it is called from a datapath
worker and use the control queue to defer the notification to the
control plane event loop.

This enables datapath nodes (such as bridge MAC learning) to create
MAC entries on the fly without blocking the control plane.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Introduce a new l2 module with a bridge interface type that allows
grouping multiple member interfaces (ports, VLANs, bonds) into a single
L2 broadcast domain.

The bridge maintains a list of members and supports configurable MAC
learning, BUM traffic flooding, per-bridge ageing timer and a custom
MAC address. Members are switched to GR_IFACE_MODE_BRIDGE when attached
and restored to the default VRF when the bridge is destroyed.

FDB management and datapath nodes for actual packet forwarding will
follow in subsequent commits.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Implement a forwarding database backed by an RCU-protected rte_hash
with a dedicated rte_mempool for entries. The hash is configured with
a free_key_data_func callback so that deleted entries are automatically
returned to the pool after RCU synchronization.

Entries can be added/deleted/flushed via the API and can also be
dynamically learned from the datapath via fdb_learn(). A periodic
ageing timer evicts learned entries that have not been refreshed
within the bridge ageing_time. Static entries configured by the user
are never aged out. FDB entries associated with a member or bridge are
automatically purged on detach or bridge destruction.

The FDB table size defaults to 4096 entries and can be changed at
runtime via the config set/get API, provided the table is empty.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Add bridge_input and bridge_flood datapath nodes.

bridge_input receives packets from member interfaces via
GR_IFACE_MODE_BRIDGE. It learns source MAC addresses into the FDB
(unless GR_BRIDGE_F_NO_LEARN is set), then looks up the destination.

Known unicast destinations are forwarded to the learned output
interface. Unknown unicast, broadcast and multicast are sent to
bridge_flood. Hairpin packets (destination is the source interface)
are dropped. When the destination is the bridge interface itself,
packets are sent to eth_input for local processing.

bridge_flood replicates each packet to all bridge members except the
ingress interface, and to the bridge interface itself. The first
output reuses the original mbuf, subsequent ones are cloned.

When GR_BRIDGE_F_NO_FLOOD is set, the packet is dropped instead.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Create a bridge with three member ports and verify L2 forwarding
between namespaces, L3 reachability to the bridge interface address,
and overwriting a dynamic FDB entry with a static one.

Also check that detaching a member and deleting the bridge properly
clean up FDB entries.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
iface_create copies the requested configuration into the iface
struct early via iface->base = conf->base. When the interface is
created with GR_IFACE_F_UP, the flag is already set by the time
iface_set_up_down runs. The down-to-up transition condition
(!(flags & UP) && up) evaluates to false, so GR_IFACE_S_RUNNING
is never set and GR_EVENT_IFACE_STATUS_UP is never pushed.

This only affects logical interfaces (bridges, VXLAN, VLANs).
Physical ports are not affected because their set_up_down callback
manages the running state independently via the DPDK link status
event.

This prevents FRR from seeing logical interfaces as operationally
up (IFF_RUNNING), which in turn prevents EVPN from advertising
IMET routes for VXLAN interfaces.

Clear the UP flag before calling iface_set_up_down so the
transition fires normally.

Fixes: 9a61e92 ("iface: send status events on admin state changes")
Signed-off-by: Robin Jarry <rjarry@redhat.com>
@rjarry rjarry marked this pull request as draft February 14, 2026 00:06

coderabbitai bot commented Feb 14, 2026

Walkthrough

This pull request introduces comprehensive Layer 2 (L2) support with new bridge and VXLAN interface types, Forwarding Database (FDB) management, and VXLAN flood (VTEP) capabilities. The control plane adds bridge member management, FDB learning and aging, and flood entry tracking. The datapath implements bridge and VXLAN packet processing nodes with learning, flooding, and tunnel encapsulation/decapsulation. FRR integration extends MAC and VTEP support through dplane operations. The event system is refactored to defer notifications via a control queue. New CLI modules enable bridge, VXLAN, FDB, and flood management. Infrastructure is extended with new interface types, VRF handling improvements, and QSBR synchronization updates. Integration tests validate bridge connectivity, VXLAN tunneling, and EVPN/VXLAN interoperability.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@modules/infra/control/group_nexthop.c`:
- Line 152: The call to rte_rcu_qsbr_synchronize(gr_datapath_rcu(),
rte_lcore_id()) is using rte_lcore_id() from a control thread that is not
registered as a QSBR reader; replace the second argument with
RTE_QSBR_THRID_INVALID so the call becomes
rte_rcu_qsbr_synchronize(gr_datapath_rcu(), RTE_QSBR_THRID_INVALID) whenever
invoked from control-plane threads (same change for any other control-plane
calls that pass rte_lcore_id()); ensure only datapath reader threads keep using
their registered thread IDs (registration happens via
rte_rcu_qsbr_thread_register in the datapath main loop).

In `@modules/l2/cli/vxlan.c`:
- Around line 73-77: arg_vrf currently returns 0 when the user omits the
ENCAP_VRF argument, but the code treats 0 as success and unconditionally sets
GR_VXLAN_SET_ENCAP_VRF, causing encap_vrf to be overwritten; fix by storing the
arg_vrf return value (e.g. int ret = arg_vrf(c, p, "ENCAP_VRF",
&vxlan->encap_vrf_id)), returning on ret < 0, and only setting set_attrs |=
GR_VXLAN_SET_ENCAP_VRF when ret > 0 (meaning the user actually supplied
ENCAP_VRF), leaving vxlan->encap_vrf_id untouched when the argument is absent.

In `@modules/l2/control/bridge.c`:
- Around line 60-77: bridge_detach_member currently resets member->mode to
GR_IFACE_MODE_VRF but leaves member->vrf_id as GR_VRF_ID_UNDEF; update
bridge_detach_member to restore the member's VRF by calling
vrf_default_get_or_create() and assigning the returned vrf id to member->vrf_id
and incrementing its refcount via vrf_incref (mirroring bridge_fini behavior),
then set member->mode = GR_IFACE_MODE_VRF so the detached iface has a valid VRF.

In `@modules/l2/control/vxlan.c`:
- Around line 281-287: The vtep_flood_del function mutates the shared
flood_vteps array in-place (swap-and-decrement) without RCU protection, causing
a data-race with datapath readers; change vtep_flood_del to follow the
copy-on-write + RCU pattern used by vtep_flood_add: allocate a new flood_vteps
buffer, copy entries from the old array excluding entry->vtep.addr (preserving
order if add does), set the new pointer and updated n_flood_vteps atomically
(using the same RCU/atomic swap helper used by vtep_flood_add), schedule the old
buffer to be freed after the RCU grace period, and keep the
gr_event_push(GR_EVENT_FLOOD_DEL, entry) call; reference vtep_flood_del,
vtep_flood_add, flood_vteps, n_flood_vteps, and gr_event_push when making the
change.
- Around line 50-83: The delete uses cur->encap_vrf_id after it was overwritten,
so rte_hash_del_key is built with the new encap_vrf_id instead of the old one;
fix by capturing the old encap_vrf_id (and old vni if needed) before mutating
cur (e.g., read old_vrf = cur->encap_vrf_id and build cur_key from old_vrf and
cur->vni) or postpone assigning cur->encap_vrf_id until after the hash
delete/add sequence; update the code around cur->encap_vrf_id, cur_key,
rte_hash_del_key, next_key and rte_hash_add_key_data accordingly so the deletion
targets the original {old_vni, old_vrf}.

In `@modules/l2/datapath/vxlan_output.c`:
- Around line 75-79: vxlan_output currently assigns ip_output_mbuf_data(m)->nh =
fib4_lookup(...) without checking for NULL and always sends packets to
IP_OUTPUT; change vxlan_output to check the result of fib4_lookup (the value
stored in ip_output_mbuf_data(m)->nh) and if it is NULL enqueue the packet to
the BAD_NEXTHOP edge (the declared but unused BAD_NEXTHOP path) instead of
forwarding to IP_OUTPUT, otherwise continue to set edge = IP_OUTPUT and enqueue
as before; update the enqueue logic around rte_node_enqueue_x1(graph, node,
edge, m) so the chosen edge reflects this NULL-check.
🧹 Nitpick comments (3)
frr/if_grout.c (1)

369-378: Variable add shadows outer bool add on line 356.

struct gr_fdb_add_req *add (line 370) shadows the bool add declared at line 356. This works correctly due to block scoping, but it's a latent maintenance trap — a future refactor could easily reference the wrong add.

Proposed fix — rename inner variable
 	if (add) {
-		struct gr_fdb_add_req *add = req;
-		add->exist_ok = true;
-		add->fdb.iface_id = ifindex_frr_to_grout(dplane_ctx_get_ifindex(ctx));
-		add->fdb.bridge_id = ifindex_frr_to_grout(dplane_ctx_mac_get_br_ifindex(ctx));
-		add->fdb.vlan_id = dplane_ctx_mac_get_vlan(ctx);
-		add->fdb.flags = dplane_ctx_mac_get_dp_static(ctx) ? GR_FDB_F_STATIC : 0;
-		memcpy(&add->fdb.mac, dplane_ctx_mac_get_addr(ctx), sizeof(add->fdb.mac));
-		add->fdb.vtep = dplane_ctx_mac_get_vtep_ip(ctx)->s_addr;
+		struct gr_fdb_add_req *add_req = req;
+		add_req->exist_ok = true;
+		add_req->fdb.iface_id = ifindex_frr_to_grout(dplane_ctx_get_ifindex(ctx));
+		add_req->fdb.bridge_id = ifindex_frr_to_grout(dplane_ctx_mac_get_br_ifindex(ctx));
+		add_req->fdb.vlan_id = dplane_ctx_mac_get_vlan(ctx);
+		add_req->fdb.flags = dplane_ctx_mac_get_dp_static(ctx) ? GR_FDB_F_STATIC : 0;
+		memcpy(&add_req->fdb.mac, dplane_ctx_mac_get_addr(ctx), sizeof(add_req->fdb.mac));
+		add_req->fdb.vtep = dplane_ctx_mac_get_vtep_ip(ctx)->s_addr;
 		req_type = GR_FDB_ADD;
modules/l2/api/gr_l2.h (1)

44-49: Bit 36 skipped in VXLAN reconfiguration flags.

GR_VXLAN_SET_LOCAL is bit 35, GR_VXLAN_SET_MAC jumps to bit 37. Bit 36 is unused. If intentional (reserved for a future attribute), no problem. If a typo, it won't cause a bug now but could cause confusion later.

modules/l2/control/fdb.c (1)

329-346: Redundant fdb_max_entries assignment.

Line 342 sets fdb_max_entries = req->max_entries, but fdb_reconfig (line 79) already does the same assignment. Harmless, but the duplicate write could be removed.

Comment on lines +50 to +83
if (set_attrs & GR_VXLAN_SET_ENCAP_VRF) {
uint16_t vrf = next->encap_vrf_id;
uint16_t old = cur->encap_vrf_id;

if (vrf == GR_VRF_ID_UNDEF)
vrf = vrf_default_get_or_create();

if (vrf != old && vrf_incref(vrf) < 0)
return -errno;

if (old != GR_VRF_ID_UNDEF)
vrf_decref(old);

cur->encap_vrf_id = vrf;
}

if (set_attrs & (GR_VXLAN_SET_VNI | GR_VXLAN_SET_ENCAP_VRF)) {
const struct vxlan_key next_key = {rte_cpu_to_be_32(next->vni), next->encap_vrf_id};
const struct vxlan_key cur_key = {rte_cpu_to_be_32(cur->vni), cur->encap_vrf_id};

if (rte_hash_lookup(vxlan_hash, &next_key) >= 0)
return errno_set(EADDRINUSE);

if (next->vni == 0 || next->vni > 0xffffff)
return errno_set(ERANGE);

rte_hash_del_key(vxlan_hash, &cur_key);

ret = rte_hash_add_key_data(vxlan_hash, &next_key, iface);
if (ret < 0)
return errno_log(-ret, "rte_hash_add_key_data");

cur->vni = next->vni;
}

⚠️ Potential issue | 🔴 Critical

Bug: cur_key for hash deletion uses the already-updated encap_vrf_id.

When GR_VXLAN_SET_ENCAP_VRF is set, line 63 updates cur->encap_vrf_id to the new value before the hash key update block at line 66. Consequently, cur_key on line 68 is built with the new encap_vrf_id, not the old one. The rte_hash_del_key on line 76 attempts to delete a key that doesn't exist yet, leaving the old hash entry ({old_vni, old_vrf}) as a stale orphan.

Proposed fix: capture old key before mutating cur
+	// Save old key components before ENCAP_VRF update may overwrite them.
+	const struct vxlan_key old_key = {rte_cpu_to_be_32(cur->vni), cur->encap_vrf_id};
+
 	if (set_attrs & GR_VXLAN_SET_ENCAP_VRF) {
 		uint16_t vrf = next->encap_vrf_id;
 		uint16_t old = cur->encap_vrf_id;
 
 		if (vrf == GR_VRF_ID_UNDEF)
 			vrf = vrf_default_get_or_create();
 
 		if (vrf != old && vrf_incref(vrf) < 0)
 			return -errno;
 
 		if (old != GR_VRF_ID_UNDEF)
 			vrf_decref(old);
 
 		cur->encap_vrf_id = vrf;
 	}
 
 	if (set_attrs & (GR_VXLAN_SET_VNI | GR_VXLAN_SET_ENCAP_VRF)) {
 		const struct vxlan_key next_key = {rte_cpu_to_be_32(next->vni), next->encap_vrf_id};
-		const struct vxlan_key cur_key = {rte_cpu_to_be_32(cur->vni), cur->encap_vrf_id};
 
 		if (rte_hash_lookup(vxlan_hash, &next_key) >= 0)
 			return errno_set(EADDRINUSE);
 
 		if (next->vni == 0 || next->vni > 0xffffff)
 			return errno_set(ERANGE);
 
-		rte_hash_del_key(vxlan_hash, &cur_key);
+		rte_hash_del_key(vxlan_hash, &old_key);
 
 		ret = rte_hash_add_key_data(vxlan_hash, &next_key, iface);
🤖 Prompt for AI Agents
In `@modules/l2/control/vxlan.c` around lines 50 - 83, The delete uses
cur->encap_vrf_id after it was overwritten, so rte_hash_del_key is built with
the new encap_vrf_id instead of the old one; fix by capturing the old
encap_vrf_id (and old vni if needed) before mutating cur (e.g., read old_vrf =
cur->encap_vrf_id and build cur_key from old_vrf and cur->vni) or postpone
assigning cur->encap_vrf_id until after the hash delete/add sequence; update the
code around cur->encap_vrf_id, cur_key, rte_hash_del_key, next_key and
rte_hash_add_key_data accordingly so the deletion targets the original {old_vni,
old_vrf}.

Comment on lines +75 to +79
ip_output_mbuf_data(m)->nh = fib4_lookup(vxlan->encap_vrf_id, d->vtep);

edge = IP_OUTPUT;
next:
rte_node_enqueue_x1(graph, node, edge, m);

⚠️ Potential issue | 🟠 Major


Missing NULL check for fib4_lookup result — error handling deferred to downstream node.

fib4_lookup can return NULL when no route is found, but vxlan_output doesn't check this locally. Instead, the packet proceeds to ip_output, which detects the NULL and routes it via NO_ROUTE to an error handler. While this works, it's inefficient and inconsistent with other modules (dnat44, icmp_output, ip_input). The BAD_NEXTHOP edge is declared but never used.

Proposed fix
 		ip_output_mbuf_data(m)->nh = fib4_lookup(vxlan->encap_vrf_id, d->vtep);
 
-		edge = IP_OUTPUT;
+		if (unlikely(ip_output_mbuf_data(m)->nh == NULL)) {
+			edge = BAD_NEXTHOP;
+		} else {
+			edge = IP_OUTPUT;
+		}
🤖 Prompt for AI Agents
In `@modules/l2/datapath/vxlan_output.c` around lines 75 - 79, vxlan_output
currently assigns ip_output_mbuf_data(m)->nh = fib4_lookup(...) without checking
for NULL and always sends packets to IP_OUTPUT; change vxlan_output to check the
result of fib4_lookup (the value stored in ip_output_mbuf_data(m)->nh) and if it
is NULL enqueue the packet to the BAD_NEXTHOP edge (the declared but unused
BAD_NEXTHOP path) instead of forwarding to IP_OUTPUT, otherwise continue to set
edge = IP_OUTPUT and enqueue as before; update the enqueue logic around
rte_node_enqueue_x1(graph, node, edge, m) so the chosen edge reflects this
NULL-check.

Introduce the VXLAN interface type for the L2 module. A VXLAN
interface carries a VNI (VXLAN Network Identifier), a local VTEP
address used as the outer IP source, an encapsulation VRF for
underlay routing, and a configurable UDP destination port (default
4789).

VXLAN interfaces are keyed by (VNI, encap_vrf_id) in a lockfree
RCU-protected hash table so that the datapath can resolve incoming
tunneled packets to the correct interface without locks.

VXLAN interfaces are intended to be attached to a bridge domain.
All L2 traffic entering the bridge is forwarded transparently over
the VXLAN tunnel. The local VTEP address must already be configured
in the encapsulation VRF.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
VXLAN uses UDP port 4789 by default but allows configuring a custom
destination port per interface. Allow the control plane to register
additional UDP ports at runtime as aliases for an already registered
port, reusing the same datapath edge.

Use reference counting so that multiple interfaces sharing the same
non-default port do not interfere with each other during teardown.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Wire up the VXLAN interface's configurable destination port to the
L4 input node. When a non-default port is configured, register it
as an alias for the standard VXLAN port (4789) so that the datapath
delivers matching UDP packets to the vxlan_input node.

Unregister the alias when the port changes or the interface is
destroyed.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Introduce a transport-agnostic flood list framework for BUM traffic
(Broadcast, Unknown unicast, Multicast). In EVPN, each PE maintains
a flooding list built from IMET routes (RFC 8365, RFC 9572). The
entries in this list differ depending on the overlay encapsulation:
VXLAN uses a remote VTEP IPv4 address and a VNI, while SRv6 would
use a 128-bit SID.

The API defines a gr_flood_entry structure with a type discriminant
and a union, allowing future encapsulation types (e.g. SRv6 SIDs)
to be added without changing the API request types. A dispatch
layer in control/flood.c routes add/del/list operations to
type-specific callbacks registered at init time.

Implement the VXLAN VTEP flood type (GR_FLOOD_T_VTEP). Each VXLAN
interface maintains a per-VNI array of remote VTEP addresses used
by the vxlan_flood datapath node for ingress replication. The array
is replaced atomically with an RCU synchronization barrier so that
the datapath never sees a partially updated list.

CLI commands are exposed under "flood vtep add/del/show".

Signed-off-by: Robin Jarry <rjarry@redhat.com>
In a VXLAN overlay, the bridge needs to know which remote VTEP to
use when sending unicast frames to a learned MAC address. Add a
VTEP IPv4 address field to FDB entries so that known unicast
traffic can be sent directly to the correct tunnel endpoint instead
of being flooded to all VTEPs.

When bridge_input learns a MAC address from a VXLAN member
interface, it records the source VTEP from the decapsulated
packet's outer IP header. When forwarding to a known destination,
the stored VTEP address is passed to the output path via the mbuf
private data so that vxlan_output can build the correct outer
header.

Only set the VTEP field when the source interface is actually a
VXLAN type to avoid storing uninitialized data from other packet
paths (control plane, local bridge traffic).

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Add three datapath nodes for VXLAN packet processing.

vxlan_input decapsulates incoming UDP/4789 packets. It strips the
outer UDP and VXLAN headers, resolves the inner VNI to a VXLAN
interface via the RCU-protected hash table, records the source VTEP
from the outer IP header into the mbuf private data, and forwards
the inner Ethernet frame to iface_input for bridge processing.

vxlan_output encapsulates outgoing frames for a known destination
VTEP. It prepends a pre-built IP/UDP/VXLAN header template
initialized by the control plane, fills in the per-packet fields
(destination VTEP, UDP length, IP length, checksum), and hashes the
inner flow to select an ephemeral source port for underlay ECMP
(RFC 7348 Section 5). The FIB lookup for the outer IP uses the
encapsulation VRF, not the bridge domain.

vxlan_flood handles BUM traffic by replicating the frame to every
VTEP in the flood list via ingress replication. The original mbuf
is sent to the first VTEP and clones are created for the rest.

The bridge_flood node is updated to steer VXLAN member traffic
through vxlan_flood instead of direct iface_output.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Set up a VXLAN overlay between grout and a Linux netns peer. Grout
runs a bridge with a VXLAN member (VNI 100) and the Linux side
mirrors the topology with a kernel VXLAN device enslaved to a Linux
bridge. Both sides have flood lists configured with each other's
VTEP address for BUM traffic replication.

The test verifies L3 connectivity over the tunnel by having the
Linux side ping the bridge address. This exercises the full path:
ARP resolution over VXLAN, FDB learning from decapsulated traffic,
and ICMP echo reply via the VXLAN output encapsulation.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Report bridge interfaces to FRR as ZEBRA_IF_BRIDGE with their MAC
address. Tag members with ZEBRA_IF_SLAVE_BRIDGE and propagate the
bridge ifindex so that FRR can associate them with the correct master.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Report VXLAN interfaces to FRR's zebra as ZEBRA_IF_VXLAN with the
associated L2 VNI information. This allows FRR's EVPN control
plane to discover which VNIs are locally configured and advertise
them via BGP IMET routes to remote PEs.

The VXLAN L2 info includes the VNI, the local VTEP address, and
the underlay interface index so that zebra can correlate the tunnel
with the correct underlay routing context.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Synchronize bridge FDB entries bidirectionally between grout and
FRR. This is required for EVPN to advertise locally learned MAC
addresses via BGP type-2 routes and to install remotely learned
MACs into the bridge forwarding table.

Subscribe to FDB add/del/update events from grout and translate
them to dplane MAC install/delete operations for zebra. In the
reverse direction, handle DPLANE_OP_MAC_INSTALL/DELETE from FRR
and convert them to GR_FDB_ADD/DEL API calls. The VTEP address
is propagated in both directions so that remote MACs are associated
with the correct tunnel endpoint.

Self-event suppression is enabled on the FDB subscriptions to
prevent feedback loops when FRR installs a MAC that was originally
learned by grout.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Handle DPLANE_OP_VTEP_ADD and DPLANE_OP_VTEP_DELETE operations
from FRR's EVPN control plane. When BGP learns a remote VTEP via
an IMET route (EVPN Route Type 3), zebra pushes the VTEP to the
dataplane provider which translates it to a GR_FLOOD_ADD/DEL
request with GR_FLOOD_T_VTEP type.

This allows BGP EVPN to dynamically manage the per-VNI flood lists
used for BUM traffic ingress replication, replacing the need for
static flood list configuration via the CLI.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Set up a full EVPN/VXLAN topology between FRR+grout and a
standalone FRR+Linux peer. Each side runs a bridge with a VXLAN
member (VNI 100) and a host namespace. Both peers run iBGP with
the l2vpn evpn address-family and advertise-all-vni.

The test verifies that EVPN type-3 (IMET) routes are exchanged so
that both sides install each other's VTEP in their flood lists.
It then verifies end-to-end L2 connectivity by pinging between the
two host namespaces through the VXLAN overlay, which exercises
type-2 (MAC/IP) route advertisement and FDB synchronization.

Signed-off-by: Robin Jarry <rjarry@redhat.com>