STUN on Ribbon Edge vs SWeLite and Teams ICE-Lite

Aaah I missed writing articles like this one, deep technical dives into Teams Direct Routing STUN failures that even baffle hardware vendors? Yes please!

Before we get started, here’s a bit of context. A customer of mine had a cluster of Sonus SBC2000’s that Greig set up years ago. These were then updated to perform Teams Direct Routing by a colleague of mine that I trust, so we knew it was all working perfectly before I turned up.

As we do, the customer wanted to move away from physical hardware. So, the customer engaged us to replace the physical SBC’s with virtual ones.

Just come here from a search engine having replaced an SBC1k/2k with a SWeLite and Media negotiation failing?
Calls connect but drop after ~10 seconds?
Bye packets with a Q850 cause code 47 (Resource unavailable)?
No RTP established?
“NpDspTimer[ICE_SIPSG_UPDATE-59] expired” Syslog messages?
Jump to TLDR for your potential answer.

Unfortunately, because you can’t simply upload a backup from an SBC 2000 into a SWe Lite, one of our team members went through the painful process of manually configuring the new SWe Lite’s to use the same config as the SBC 2000’s. Sure, you can move Transformation Table configs, but everything else needs to be done by hand. This was especially hard as the SBC’s were routing calls between 5 different phone systems, each needing their own unique config.

Some MVP should make a migration tool or something…(no. not in a million years!)

To make the migration simpler, the customer had opted to reuse the IP addresses of the SBC2000’s on the SWe Lite’s, ideally meaning no changes to the firewall will be required. Right…. Right?

An image of the Star Wars "For the better... Right" meme referencing the reuse of IP addresses and implying we didn't need to update the firewall.

Oh how wrong I was…

A primer on NAT, STUN, ICE and TURN

Before we get started, I think it’s important to have a firm understanding of NAT, to the point of understanding the difference between SNAT, DNAT, 1:1NAT and Masquerading as well as understanding Connection Tracking.

Network Address Translation (NAT)

If you’re new here, here’s a quick run down. (I’m simplifying, this is not exhaustive, NAT can be pretty complicated, for more info read this for the basics and this if you wanna really deep dive, Wikipedia also has a great page on it)

SNAT = Source Network Address Translation. Replacing the Source IP Address (typically a private IP) in a packet with another (typically public) IP like 203.0.113.0, enabling the destination device to reply to the packet using the public address of the NAT device.
DNAT = Destination Network Address Translation. Replacing the Destination IP Address on a packet with another address. This is typically seen in things like Port Forwards, where we replace the 203.0.113.0 address with the private address of your server.

Masquerading and Connection Tracking

Also known as “Dynamic NAT”, “Port Address Translation” and “Overload NAT”.

Historically these were described separately, but in modern implementations they are often tightly coupled and behave as a single system.

When a packet gets transmitted, it is Source NATed by your router. But instead of just changing the Source IP Address, the router will also change the Source Port, add an entry to the Connections/Masquerade Table and transmit the new packet.

The NAT device does this to keep track of every packet it SNAT’s, enabling it to DNAT and Route the return packet when it arrives.

When the return packet arrives, the NAT device looks in the Connections Table for the Destination IP/Port (its Public IP), the Source IP Address/Port (the Remote Host) and the Protocol (collectively, these are known as a “Tuple”). Using this information, the NAT device can determine the private IP/Port to replace the Destination IP/Port with (DNAT) and route the packet to the internal host.

Should a packet arrive that’s not in the Connections Table, or explicitly configured (Port Forwarding/Static DNAT) the device will drop the packet.

Obviously, there is a limit to how many ports and how much memory are available on the NAT/router, so these entries also have a connection timer, typically 30 seconds for UDP and minutes for TCP. After that, if no outbound traffic is seen, the entry is deleted and any packets that arrive afterwards are simply dropped.
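If it helps to see it in code, here's a rough Python sketch of masquerading plus connection tracking as described above. It's a toy model, not how any particular router implements it: the class and method names are my own, it only handles UDP, the 30 second timer is the "typical" figure from above, and (as in the description) only outbound traffic refreshes the timer.

```python
import itertools

UDP_TIMEOUT = 30  # seconds; the "typical" figure from the text, real devices vary


class MasqueradeNat:
    """Toy masquerading NAT with a connection-tracking table (UDP only)."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self._next_port = itertools.count(first_port)
        # (private_ip, private_port, remote_ip, remote_port) -> public_port
        self._by_flow = {}
        # (public_port, remote_ip, remote_port) -> [private_ip, private_port, expires_at]
        self._by_tuple = {}

    def outbound(self, src_ip, src_port, dst_ip, dst_port, now):
        """SNAT the packet, rewrite the source port and (re)start the timer."""
        flow = (src_ip, src_port, dst_ip, dst_port)
        public_port = self._by_flow.get(flow)
        if public_port is None:
            public_port = next(self._next_port)
            self._by_flow[flow] = public_port
        self._by_tuple[(public_port, dst_ip, dst_port)] = [
            src_ip, src_port, now + UDP_TIMEOUT]
        return self.public_ip, public_port

    def inbound(self, src_ip, src_port, dst_port, now):
        """DNAT a return packet if a live entry matches, otherwise drop (None)."""
        key = (dst_port, src_ip, src_port)
        entry = self._by_tuple.get(key)
        if entry is None:
            return None  # nothing in the table: dropped
        private_ip, private_port, expires_at = entry
        if now >= expires_at:
            # Timer ran out: forget the flow and drop the packet.
            del self._by_tuple[key]
            self._by_flow.pop((private_ip, private_port, src_ip, src_port), None)
            return None
        return private_ip, private_port


nat = MasqueradeNat("203.0.113.0")
public = nat.outbound("192.168.1.20", 50000, "198.51.100.7", 3478, now=0)
reply = nat.inbound("198.51.100.7", 3478, public[1], now=5)       # matches the tuple
stranger = nat.inbound("198.51.100.99", 3478, public[1], now=5)   # different remote host
late = nat.inbound("198.51.100.7", 3478, public[1], now=31)       # timer expired
print(public, reply, stranger, late)
```

The reply from the host we talked to gets through, the packet from a stranger and the late packet both get dropped. Hold that thought, it matters later.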

The issue here for Peer to Peer communication is there’s a lot of ambiguity for the endpoints behind NAT. How are they supposed to know what port they are publicly communicating on so they can set up a connection?

an image of the "Thats the neat part, you dont" meme from Invincible with the first panel asking how an application knows its external IP, and the second panel stating "thats the neat part, you dont"

That’s the crux of the issue, you don’t. Sure, there are tricks like UPnP to allow your app to request rules dynamically be added to your home router, but that doesn’t work in a commercial environment. So what do we do instead?

Use IPv6!
I’m joking, but a fun fact, that’s how Xbox Live does a lot of its connections to avoid NAT issues if it can.

1:1 NAT

Okay, so we don’t want to use IPv6, and yeah, I get it. I dislike administering IPv6 too. So, instead we use 1:1 NAT.

We give a dedicated IP Address (or range) to a host, along with a dedicated port range and then perform a simple 1:1 SNAT/DNAT translation. We might configure a rule like this on our router/firewall.

SBC Private IP Address	SBC Public IP Address	Port Range
10.0.10.10	203.0.113.10	16384 – 17583

So that means whenever the router receives a packet from the SBC it will literally just replace the Source IP Address. No changes to the source port. No Masquerading. Just pure SNAT.

The same is also true for packets sent from the Internet to the Public IP address. If the packet has a destination port within the allocated port range, the router will simply DNAT and change the Destination IP, leaving the Destination Port alone.

The only real downside here is, it doesn’t scale well. Depending on the number of ports you need, you can run out of port ranges really quickly.
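As a quick sketch of how simple the 1:1 rule is compared to masquerading, here it is in Python using the example addresses and port range from the table above. The function names are made up for illustration, and a real firewall obviously does this in its NAT engine, not in Python.

```python
PRIVATE_IP = "10.0.10.10"
PUBLIC_IP = "203.0.113.10"
MEDIA_PORTS = range(16384, 17584)  # 16384-17583 inclusive, as in the table


def one_to_one_outbound(src_ip, src_port):
    """Pure SNAT: swap the source IP, leave the source port alone."""
    if src_ip == PRIVATE_IP:
        return PUBLIC_IP, src_port
    return None  # this rule does not apply to other hosts


def one_to_one_inbound(dst_ip, dst_port):
    """Pure DNAT: swap the destination IP if the port is in the allocated range."""
    if dst_ip == PUBLIC_IP and dst_port in MEDIA_PORTS:
        return PRIVATE_IP, dst_port
    return None  # outside the range: not covered by this rule


print(one_to_one_outbound("10.0.10.10", 16960))   # port is preserved
print(one_to_one_inbound("203.0.113.10", 16960))  # same port arrives at the SBC
print(one_to_one_inbound("203.0.113.10", 5000))   # not in the range
```

Note there is no table and no timer here. The mapping is static, which is exactly why the host behind it can predict its own public port.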

This sounds awfully complicated, why do we do all this?

Well, back in the 90’s the internet was running out of IP addresses. It was very common for organizations to issue public IPv4 addresses to every. single. computer. (Fun fact: Apple, Ford and a few others still hold onto over 16 million IP addresses each.) So, NAT was put in place as a stopgap until IPv6, allowing organizations to have Private address ranges (192.168.0.0/16, 172.16.0.0/12, 10.0.0.0/8) that they could assign internally. NAT would then allow those devices to communicate on the Internet. Without NAT, whenever you messaged a remote server, it would see the source address of “192.168.1.20”, and its reply would get dropped, as Private IP addresses are not routable on the Internet.

*sigh* 30 years later and we still aren’t using IPv6 properly…

Get on with it!

Okay, okay! You understand NAT… But it’s important we understand it properly, as the issue came down to someone not actually understanding how NAT works, as well as a bit of a design oversight. We also need to understand some key technologies Microsoft Teams uses that you might not have seen in traditional SIP Trunking. (These are all part of normal SIP btw, they just aren’t that commonly used)

Note, this article is already long enough, so I am going to assume that you have a solid understanding of SIP.
If not, I have 2 great sessions explaining SIP and Direct Routing, here at Commsverse and here at M365May. Check them out!

I’m not going to quote the Wikipedia/RFC articles here, but long story short, ICE, STUN and TURN are some of the methods many VoIP implementations use to dynamically work around NAT. They matter most when the endpoints that control the signalling and media are on potentially different IP addresses, meaning simple SNAT tunnels cannot open a media path between the two endpoints because the connection tracking is expecting a different IP. This is especially important when it comes to Teams Media Bypass.

Understanding Teams media connectivity and the struggles of NAT with Voice.

As I alluded to before, when we are talking about SIP telephony, there are 2 different types of connectivity.

Signalling – Contains things like what number you would like to call, the number you’re calling from and importantly control messages that help setup a call. (Hold, diversion, codec negotiation etc)

Media – This is the contents of the call, in most cases the Audio both parties hear (Video is supported in SIP too, but less common)

In Teams, Microsoft relays all SIP Signalling through their own call controller, known as PSTNHub. Even when using Media Bypass, all call control information still goes through there to facilitate call routing policies, SBC selection etc.

As for Media, Teams will attempt 3 different ways to establish a connection between 2 endpoints

Direct (Routable IP to Routable IP)
Using 1:1 NAT and STUN to mitigate NAT
Lastly, Teams will fall back to proxying the media through Microsoft’s PSTNHub, using a technique known as TURN. (port 3478 is used for initiating this)

Direct Connection for Media Bypass

When 2 SIP endpoints (Teams client, phones, SBC’s etc) can find a Layer 3 path between each other, via local routing, intraoffice VPN, MPLS etc, they will simply list each other’s IP address as the media candidates, allowing media to take the shortest and, hopefully, most efficient path.

This works because both endpoints can effectively communicate with each other using their internal IP stacks thanks to the routers/vpn/”magic” joining the network spaces.

However, as soon as we take away the routable link between the two endpoints, the devices don’t know how to communicate directly with each other anymore, so this connection would fail.

Remember that by default, the IP stack on a device has no idea what its public IP address is, or how to open an inbound port on the firewall. So, in this case, if we just transmitted our private IP addresses in the SIP Invite, the default router’s NAT engine wouldn’t manipulate these in the SDP. Each network would have no idea where those private IP addresses are, and would simply reject/drop any packets with “No route to host” (or ignorantly forward them out the default route, where they would get eaten by the ISP).

Even if the SDP contained the Public IP Addresses of the remote NAT routers, those are going to reject/drop the inbound connections for RTP media as they have no idea which private host to route the traffic to.

Before someone does a “Well Acktually” in the comments. I know that SBC’s aren’t “normal” endpoints and can be configured with their public IP and have firewall ports opened specifically. But, that’s not going to help traffic headed to Teams clients behind NAT. I could write a whole chapter on this, but I have to draw the line somewhere.

That’s where ICE, STUN and TURN come in, allowing the two endpoints to learn each other’s Public IP address and overcome the NAT cone issue or, failing all of that, to relay through a third party using TURN (Microsoft in this case).

Okay, so what about using ICE for Media Bypass?

ICE, aka Interactive Connectivity Establishment (not Internal Combustion Engine, unfortunately), is a method for allowing two SIP peers to exchange connectivity information through NAT using messages in their SDP’s.

As part of the Teams Direct Routing certification, SBC’s are required to adhere to the ICE-Lite standard. Part of the requirements for that is that the SBC has a Public IP address with a dedicated port range, or a 1:1 NAT. This means the SBC can control the ports for its own connection tracking, instead of relying on the NAT engine in the router. Thus, the SBC can use that to populate the SDP in the initial Invite (SIP signalling) with a list of IP/Port candidates for Media.

Note that in the ICE candidates, the local address is also included to allow for media bypass.
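To make the candidate list a bit more concrete, here's a rough Python sketch that pulls the candidates out of an SDP. The SDP fragment is made up for illustration (it is not from the customer's capture, it just reuses the example 1:1 NAT addresses from earlier), and `Candidate` and `parse_candidates` are my own hypothetical names. The field order follows the standard ICE `a=candidate` attribute.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    foundation: str
    component: int   # 1 = RTP, 2 = RTCP
    transport: str
    priority: int
    address: str
    port: int
    cand_type: str   # host, srflx, prflx or relay


# Illustrative SDP fragment only, not a real capture.
EXAMPLE_SDP = """\
m=audio 16960 RTP/SAVP 0 8 101
c=IN IP4 203.0.113.10
a=ice-lite
a=candidate:1 1 UDP 2130706431 203.0.113.10 16960 typ host
a=candidate:1 2 UDP 2130706430 203.0.113.10 16961 typ host
a=candidate:2 1 UDP 2130706175 10.0.10.10 16960 typ host
"""


def parse_candidates(sdp: str) -> list[Candidate]:
    """Return every a=candidate line in the SDP as a Candidate tuple."""
    found = []
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith("a=candidate:"):
            continue
        fields = line[len("a=candidate:"):].split()
        # fields[6] is the literal keyword "typ", fields[7] is the type itself.
        found.append(Candidate(
            foundation=fields[0],
            component=int(fields[1]),
            transport=fields[2].upper(),
            priority=int(fields[3]),
            address=fields[4],
            port=int(fields[5]),
            cand_type=fields[7],
        ))
    return found


candidates = parse_candidates(EXAMPLE_SDP)
for c in candidates:
    print(c.address, c.port, "RTP" if c.component == 1 else "RTCP")
```

Each of those address/port pairs is a promise: "if you send STUN (and later media) here, I will answer". Whether the firewall in front of the SBC agrees is another story.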

STUN

Microsoft’s media relays being ICE-Full endpoints should then initiate STUN tests on the provided candidates.

So, once the endpoints receive the SDP with the ICE candidates in them, the ICE-Full partner (typically Microsoft) will initiate tests to each of the candidates. It does this by sending a STUN Binding request and looking for a success response in return. This response also contains details about where the remote host saw the connection coming from, so the requesting party can validate its external public IP and Port. Once those tests complete, the parties will select a valid pair and start using them for Media.
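For the curious, here's a bare-bones Python sketch of what a STUN Binding request and the XOR-MAPPED-ADDRESS in the success response look like on the wire. The message types, magic cookie and XOR rules come from the STUN RFC; the function names are my own. It is deliberately a skeleton: real ICE connectivity checks carry extra attributes (username, message integrity, priority and so on) that I've left out.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(txid=None):
    """20-byte STUN header: type, length, magic cookie, 96-bit transaction ID."""
    txid = txid or os.urandom(12)
    return struct.pack("!HHI12s", BINDING_REQUEST, 0, MAGIC_COOKIE, txid)


def build_success_response(txid, seen_ip, seen_port):
    """What the responder sends back: the address it saw, XORed with the cookie."""
    xport = seen_port ^ (MAGIC_COOKIE >> 16)
    xaddr = struct.unpack("!I", socket.inet_aton(seen_ip))[0] ^ MAGIC_COOKIE
    value = struct.pack("!BBHI", 0, 0x01, xport, xaddr)  # 0x01 = IPv4
    attr = struct.pack("!HH", XOR_MAPPED_ADDRESS, len(value)) + value
    header = struct.pack("!HHI12s", BINDING_SUCCESS, len(attr), MAGIC_COOKIE, txid)
    return header + attr


def parse_xor_mapped_address(msg):
    """Return (ip, port) from a Binding success response, or None."""
    msg_type, length, cookie = struct.unpack("!HHI", msg[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        return None
    pos, end = 20, 20 + length
    while pos + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", msg[pos:pos + 4])
        value = msg[pos + 4:pos + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and value[1] == 0x01:
            xport, xaddr = struct.unpack("!HI", value[2:8])
            ip = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
            return ip, xport ^ (MAGIC_COOKIE >> 16)
        pos += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    return None


txid = bytes(range(12))
request = build_binding_request(txid)
response = build_success_response(txid, "203.0.113.10", 16960)
print(len(request), parse_xor_mapped_address(response))
```

The address is XORed with the magic cookie so that helpful NAT devices that rewrite anything that looks like an IP address in the payload leave it alone.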

If we take a look at a packet capture between the SBC and Microsoft, you can see these STUN requests using the same port pairs as what’s listed in the SDP candidates. (You might notice something here too, and if you do, extra points)

This allows both parties to agree that yes, they can connect to the ports listed and that when media arrives, the connection will succeed.

In this packet capture you can see the SBC initiated (hint) STUN binding requests with Microsoft replying with the external IP address that Microsoft sees in the XOR mapped address.

ICE-Full vs ICE-Lite

Okay, I’ve mentioned ICE-Lite and ICE-Full a few times, so perhaps I should explain them. The article has been pretty heavy on explaining NAT and Media Candidates, but to simplify: ICE-Full hosts actively test connectivity using STUN binding requests, while ICE-Lite hosts just sit there and answer those tests.

ICE-Full = Initiate STUN Checks

ICE-Lite = Respond to STUN Checks

Which means in an ICE-Lite scenario, if nothing can reach you… nothing happens!

Finding the issue

First off, I’ll say, I’ve been giving little hints as to what the problem was this entire time. NAT, Firewall, STUN, ICE-Lite, they’re all involved in one way or another.

So what would happen? We would make a call, the SIP signalling would set up the call, and when it was time for media to connect, the call would get torn down.

We checked everything in the config: the IP addressing, the media setup, the signalling groups. Heck, at one point we thought the bandwidth limit on SILK might have been too high, or that Forward Error Correction was causing issues (it wasn’t).

We didn’t see anything terribly exciting in the log files either, no SIP error messages that gave any clues, just a BYE packet with Q850 code of “47” “Resource Unavailable, Unspecified” 10 seconds after media candidates were chosen.

STUN Timeouts

The issue was bad enough that we got Ribbon involved. Same as us, they agreed the issue was Media connectivity related. After enabling debugging and a few packet captures, we noticed that the SBC was complaining that STUN connectivity checks were timing out.
(Note I’ve trimmed the timestamps and cleaned up the logs a bit, highlighting important messages)

DEBUG (evSweLiteApi.c:191) – .ICEAPP1: IceProcStartReqEvent: FSM[154-471] EV_ICE_START_REQ EGRESS WEBRTC localip 192.168.39.5:16960 GCID:0x101f3 rtcp:ON altLocal
DEBUG (evSweLiteApi.c:191) – .ICEAPP1: IceGetNextState: FSM[154-471] GCID[0x101f3] curr:[ST_ICE_START] ev:[EV_ICE_START_REQ] next:[ST_ICE_CHECK]
DEBUG (evSweLiteApi.c:191) – .ICEAPP1: FSM[154-471] IceCheck: EV_ICE_START_REQ wait for STUN Requests
DEBUG (evSweLiteApi.c:191) – .ICEAPP1: InvokeInstance: FSM[154-471] GCID[0x101f3] curr:[ST_ICE_CHECK] EV_ICE_SUSPEND
DEBUG (evSweLiteApi.c:191) – .ICEAPP1: IceStorePotentialCandidate: FSM[154-471] Potential remAddr 52.113.77.6:51951

Ten seconds later the SBC gives up on the STUN check as it doesn’t see anything from Microsoft.

DEBUG (evSweLiteApi.c:191) – .ICEAPP1: IceCheck: FSM[154-471] GCID 0x101f3 ICE ending
DEBUG (evSweLiteApi.c:191) – .ICEAPP1: IceGetNextState: FSM[154-471] GCID[0x101f3] curr:[ST_ICE_CHECK] ev:[EV_ICE_END_NFY] next:[ST_ICE_NULL]
DEBUG (evSweLiteApi.c:191) – .ICEAPP1: InvokeInstance: FSM[154-471] GCID[0x101f3] curr:[ST_ICE_NULL] EV_ICE_SUSPEND
ERROR (mscConf.cpp:7879) – [5460:499:1] NpDspTimer[ICE_SIPSG_UPDATE-59] expired.
DEBUG (mscConf.cpp:7870) – [5460:499:1] NpDspTimer[ICE_SIPSG_UPDATE-59] has been stopped already.

Looking at the packet capture, the SBC was right, it wasn’t receiving STUN requests from Microsoft.
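If you want to spot this pattern in your own syslog quickly, here's a small Python sketch that pulls the ICE state transitions out of the debug lines and flags the timer expiry. The regexes are based purely on the log lines shown above, so treat the format as an assumption; other firmware versions may log differently, and `summarise_ice_log` is just a name I made up.

```python
import re

# Patterns inferred from the log excerpts above, not from any Ribbon spec.
TRANSITION = re.compile(r"curr:\[(\w+)\] ev:\[(\w+)\] next:\[(\w+)\]")
TIMER_EXPIRED = re.compile(r"NpDspTimer\[ICE_SIPSG_UPDATE-\d+\] expired")

SAMPLE_LOG = [
    "DEBUG (evSweLiteApi.c:191) – .ICEAPP1: IceGetNextState: FSM[154-471] GCID[0x101f3] curr:[ST_ICE_START] ev:[EV_ICE_START_REQ] next:[ST_ICE_CHECK]",
    "DEBUG (evSweLiteApi.c:191) – .ICEAPP1: FSM[154-471] IceCheck: EV_ICE_START_REQ wait for STUN Requests",
    "DEBUG (evSweLiteApi.c:191) – .ICEAPP1: IceGetNextState: FSM[154-471] GCID[0x101f3] curr:[ST_ICE_CHECK] ev:[EV_ICE_END_NFY] next:[ST_ICE_NULL]",
    "ERROR (mscConf.cpp:7879) – [5460:499:1] NpDspTimer[ICE_SIPSG_UPDATE-59] expired.",
    "DEBUG (mscConf.cpp:7870) – [5460:499:1] NpDspTimer[ICE_SIPSG_UPDATE-59] has been stopped already.",
]


def summarise_ice_log(lines):
    """Return (list of (current, event, next) transitions, timer_expired flag)."""
    transitions = []
    timer_expired = False
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            transitions.append(match.groups())
        if TIMER_EXPIRED.search(line):
            timer_expired = True
    return transitions, timer_expired


transitions, timer_expired = summarise_ice_log(SAMPLE_LOG)
for current, event, nxt in transitions:
    print(f"{current} --{event}--> {nxt}")
print("ICE update timer expired:", timer_expired)
```

A check state that goes straight to null, followed by the timer expiry, is the signature I'd look for.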

a screenshot of Wireshark with a packet filter for STUN packets, showing no matching packets

For context, here’s the SBC2000 capture again (with a glaring error I and Ribbon initially missed)

I will say, I feel sorry for Ribbon support. Same as me they first looked at the firewall, saw that the SBC2K’s worked and assumed the firewall was not at fault. I mean nothing changed, so why would the firewall be wrong? They were so stumped by the lack of STUN packets I had to send them screenshots proving I know how to run a packet capture!

I get it, c’mon, how many times has someone unknowingly given you bad info that screwed your troubleshooting?

But the only thing that changed, was the SBC, so certainly it’s the config, Right? RIGHT!?

Copilot to the rescue

Wait, what?

Yes. I’m serious.

As part of troubleshooting, I regularly feed sip ladders of calls to Copilot to see if there is anything I have missed. Whilst I was waiting for this customer’s firewall team to do some packet captures, I asked Copilot a question about STUN packets, I didn’t even feed it the sip ladders, and without even asking, it found my tickets with Ribbon support in my email and reasoned over the issue.

Ever seen a rbbon SBC that doesnt send STUN packets despite being configured for Media Bypass, NAT and ICE-Lite?

One little titbit it came back with, was what got me there in the end

The only time an SBC actively generates STUN is:
✔ When configured as ICE‑Full (SBC2K/SBC1K can do this; SWe Lite/Edge generally cannot)

🎯 So if your SWe/SBC2000 is “properly configured” and still never sends STUN…
It’s almost certainly correct behaviour.
In fact:
Ribbon SBC in ICE-Lite mode should never initiate STUN.
It must only respond.
This matches your earlier support cases from Ribbon (I saw in your emails that they specifically called out lack of STUN in your SweLite captures, and asked you to enable MSC TRACE — because STUN responses only show up there).

There was the last big clue I needed, SBC1K/2K’s can be configured as ICE-Full peers, and thus transmit STUN. I’ll just check the SBC2K config and….

It’s set to ICE-Lite… What the…

But if we look at those packet captures, we can clearly see the SBC is the one initiating STUN first… every single time. Another chat with Copilot and it was confident this is expected behaviour.

Assuming this was a hallucination, I asked to see its sources, and sure enough, it couldn’t give me a direct “This does this and that does that” page. But what it did do was compare the documentation between the two units and find the discrepancy. Something I never would have seen looking myself.

SBC 1k/2k use a Hardware DSP Media Engine
From Ribbon SBC 1k/2k architecture documentation (public):

Media flows are handled by the DSP subsystem
The DSP engine performs NAT Keepalives, STUN, RTCP, early RTP behaviour independent of SIP
Media anchoring is performed in a dedicated hardware path

This is as close as they ever get to stating “it sends STUN when it wants to.”
The DSP engine does generate its own keepalives and NAT probes — Ribbon calls them:

“NAT traversal keepalives”
“Media path keepalive packets”
“Early media priming”
“Preemptive keepalives”

These packets can include STUN or RTP/RTCP depending on profile.
This is documented across:

SBC 1k/2k Technical Reference
SBC 1k/2k Troubleshooting Guide
SBC 1k/2k NAT Traversal section

You can just quote:

“The SBC 1k/2k media processor generates NAT traversal keepalives and media path priming packets as part of its media anchoring function.”

That’s your outbound‑STUN puncher.

And then the SWeLite documentation

SWe Lite uses the Ribbon “Media Chain” software architecture
In SWe Lite docs, the Media Plane section explicitly states:

Media is handled by a software media chain, not DSP
ICE‑Lite means passive role only, responding to checks but not initiating
No hardware NAT traversal mechanisms
SWe Lite relies on external NAT/firewall configuration — it cannot “discover” its public IP

Ribbon words it like this:

“SWe Lite implements ICE-Lite and responds to connectivity checks.
It does not participate as a full ICE endpoint.”
“SWe Lite expects the network to provide address and port consistency for media.”

That single sentence alone is enough to prove the discrepancy.

Okay then! It sorta matched up with what I was seeing, so I asked Ribbon directly

Hi team,
I’ve been digging further into the differences between SBC 2000 and SWe Lite behaviour when handling ICE with Microsoft Teams.

From packet captures, I’m consistently seeing the SBC 2000 initiating STUN BINDING REQUESTS (not keepalives). As I understand it and as supported by both Ribbon’s and Microsoft’s documentation an ICE‑Lite agent must not initiate connectivity checks. It should exclusively respond to those sent by the full ICE agent (in this case, Microsoft).
From Ribbon’s documentation:
https://publicdoc.rbbn.com/spaces/UXDOC81/pages/203330838/Working+with+the+SBC+Edge+and+SWe+Lite+as+an+ICE-Lite+Agent

“Acting as an ICE‑Lite agent, the SBC Edge RESPONDS to connectivity checks… The Teams client initiates the media flow.”

From Microsoft:
https://learn.microsoft.com/en-us/microsoftteams/direct-routing-protocols-media

“The ICE Lite implementation must respond to any connectivity checks received. The ICE Lite endpoint must not send any connectivity check requests… sending connectivity checks can lead to unexpected peer-derived candidates and call failures.”

Based on this, the SBC 2000’s behaviour appears inconsistent with the documented ICE‑Lite model, it’s initiating STUN traffic that it should not be sending. Ironically, in this customer scenario, this unexpected behaviour seems to be what allows the 2000 to succeed behind NAT, while the SWe Lite (which adheres to the ICE‑Lite behaviour described above) does not.

To which Ribbon pointed to their media config page (updated link to point to public documentation)

Hi James,

After I further research, I found out the STUN BINDING REQUEST for SBC2K should be trigger by “Send Stun Packets” settings in the media. This settings only available on SBC1K/2K.

I cannot see the actual value configured for this setting. Default value is Always. You may want to check from SBC GUI. I believe it is currently set to Always thats why SBC2K send the STUN BINDING REQUEST.

Go to settings >> media >> media system configuration

Managing Media Profiles – SBC Edge 12.1.x Documentation – Ribbon Public Documentation Center

Sure enough, looking at an SBC2K, by default, it sends STUN

And after checking, the SWe Lite, does not expose this option

AHA! Found it!

So what? It’s just STUN connectivity checks, we shouldn’t be sending them anyway!

At this point, you might be thinking, we’re supposed to be ICE-Lite, outbound STUN isn’t needed, we can ignore that. But I ask, did you forget about NAT and connection tracking?

This is where it all comes together, the STUN primer and all the discussion about SNAT, DNAT and 1:1 NAT

During the initial Teams Direct Routing setup a few years back, the customer unfortunately, didn’t quite get their Firewall rules correct. They were allowing the 16384-17584 port range Outbound sure, but not Inbound.

But that means it wouldn’t have worked at all then? Right?

Not quite. Because the customer also had a default SNAT/Masquerade rule in their firewall, when the SBC2000 transmitted its STUN connectivity check, it was inadvertently “punching a hole” in NAT.

A call would get setup, and in the SDP’s, Media candidates would be advertised.
The SBC 2000 would transmit a STUN packet using the same port pairs as expected for Media
The Firewall would 1:1 SNAT this traffic, maintaining the port pairs, but also adding it to the established connections table
Microsoft would send its own STUN checks and reply to the SBC’s STUN test, and the Firewall would correctly NAT these back to the SBC, refreshing the firewall’s connection timer
Microsoft would send Media using the same port pairs.
The Firewall, still having a valid connection in the connections table, would NAT these new packets to the SBC, as to it, the traffic looked identical!

Now let’s look at the SweLite

A call would get setup, and in the SDP’s, Media candidates would be advertised.
Microsoft would transmit a STUN Connectivity test using the same port pairs as expected for Media
The Firewall would simply drop this traffic, as there is no inbound rule for that range
The connectivity tests time out, and the call gets torn down with a generic “Resource Unavailable” message
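The two walkthroughs above boil down to a few lines of Python. This is a deliberately simplified model under my own assumptions: the firewall allows the media range outbound, has no inbound rule unless you ask for one, ignores timers entirely, and tracks flows by port and remote address only. The class and function names are hypothetical; the IP and ports are borrowed from the debug log earlier.

```python
MEDIA_PORTS = range(16384, 17585)  # the 16384-17584 range from the firewall rules


class Firewall:
    """Outbound media allowed; inbound only via conntrack unless a rule exists."""

    def __init__(self, inbound_rule=False):
        self.inbound_rule = inbound_rule
        self.conntrack = set()  # (sbc_port, remote_ip, remote_port)

    def outbound(self, sbc_port, remote_ip, remote_port):
        if sbc_port not in MEDIA_PORTS:
            return False
        # Outbound traffic is allowed and leaves a connection-tracking entry.
        self.conntrack.add((sbc_port, remote_ip, remote_port))
        return True

    def inbound(self, remote_ip, remote_port, sbc_port):
        if self.inbound_rule and sbc_port in MEDIA_PORTS:
            return True  # explicit inbound allow rule
        # Otherwise only replies to flows the SBC started get through.
        return (sbc_port, remote_ip, remote_port) in self.conntrack


def call_connects(sbc_sends_stun, inbound_rule=False):
    """True if Microsoft's STUN check (and later media) reaches the SBC."""
    firewall = Firewall(inbound_rule)
    sbc_port, ms_ip, ms_port = 16960, "52.113.77.6", 51951
    if sbc_sends_stun:
        # SBC 2000 with "Send Stun Packets" on: punches the hole first.
        firewall.outbound(sbc_port, ms_ip, ms_port)
    return firewall.inbound(ms_ip, ms_port, sbc_port)


print("SBC 2000 (sends STUN):     ", call_connects(sbc_sends_stun=True))
print("SWe Lite (ICE-Lite only):  ", call_connects(sbc_sends_stun=False))
print("SWe Lite + inbound rule:   ", call_connects(False, inbound_rule=True))
```

Same firewall, same rules, same ports. The only variable that changes the outcome is whether the SBC speaks first.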

So the entire reason that SweLites were failing, was due to a minor change in STUN behaviour between SBC2000’s and SweLite, designed to help prime connections, and a firewall misconfiguration.

What looked like a new SBC problem was actually a legacy firewall misconfiguration
The SBC2000 didn’t “fix it”, It just hid it
And SWe Lite, exposed it instantly
Welcome to real-world NAT

UUUUUUGGGGHHHH!!!

TLDR

Okay, that was a lot, so here’s the short version
– SBC2000 masked a firewall misconfiguration by sending outbound STUN packets
– Those STUN packets created temporary Firewall/NAT entries (“pinholes”) via connection tracking
– The new SWe Lites did NOT send STUN requests (correct ICE-Lite behaviour) thus never punched a hole in NAT/Connection tracking
– Without inbound firewall rules or established sessions, Microsoft’s STUN checks never reach the SBC’s

Result: Media fails, calls drop after ~10 seconds with a Q850 code of “47” “Resource Unavailable, Unspecified” and a syslog message of “NpDspTimer[ICE_SIPSG_UPDATE-59] expired.“

Fix: Allow inbound media port range on Firewall (or implement proper 1:1 NAT)

So why doesn’t Ribbon just fix the SBC1K/2K to behave like the SweLite and implement ICE-Lite properly?

Imagine this:

You’ve got a slightly misconfigured firewall. Nothing too bad, just one of those “someone set it up ages ago and now it works” situations. You apply a firmware update. Suddenly… every single call fails.

What are you going to blame?

Not the firewall, not the engineer that misread your firewall table, not the junior that didn’t understand the difference between SNAT and DNAT when actioning your firewall ticket.

You’re going to blame the SBC! Because before the upgrade it worked, and after the upgrade it didn’t.

That’s the problem.

Bringing the SBC1000/2000 into strict standards compliance now would cause real-world outages for customers who have unknowingly built their environments around its current behaviour.

Hello Technical Debt, my old friend.

Not the kind where something is “wrong and should be fixed”, but the kind where fixing it would break production environments everywhere.

So in this case, I think not fixing it is probably the right call.

It just means you end up with weird edge cases like this one that take forever to diagnose.

Thanks everyone, hope you enjoyed the ride.
