“Hi — we built a Realtime API voice bot using the WebSocket transport. Calls work great for the first ~90 seconds, then the WebSocket closes silently. No error event, no close reason. We’re on Node 20 behind an AWS NLB. Reproduces consistently. Help?”
1. Clarifying questions before debugging
- Is the 90s timing exact, or fuzzy? Exact-ish (87s, 89s, 91s) suggests an upstream idle timeout. Variable (40s–180s) suggests something else.
- Are audio frames actively flowing in both directions during that 90s window? If the user goes quiet and the model isn’t speaking, the socket is technically idle from a TCP perspective.
- Any logs at the NLB? AWS NLB has a TCP idle timeout (the `tcp.idle_timeout.seconds` listener attribute) that defaults to 350s, but I've seen it lowered to 60s on hardened configs.
- Is the WebSocket library sending pings? The `ws` Node package needs an explicit ping interval, or the underlying TCP connection stays silent.
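One way to make the first question measurable: log each call's open-to-close duration and check whether the values cluster tightly around one point. A minimal sketch (the helper name and tolerance are mine, not anything in the customer's stack):

```typescript
// Classify a set of observed WebSocket lifetimes (ms).
// A tight cluster (small spread around the mean) is the idle-timeout
// signature; a wide spread points elsewhere.
function looksLikeIdleTimeout(durationsMs: number[], toleranceMs = 5_000): boolean {
  if (durationsMs.length < 3) return false; // not enough samples to call it
  const mean = durationsMs.reduce((a, b) => a + b, 0) / durationsMs.length;
  return durationsMs.every((d) => Math.abs(d - mean) <= toleranceMs);
}

// The customer's reported closes: 87s, 89s, 91s.
console.log(looksLikeIdleTimeout([87_000, 89_000, 91_000])); // true
console.log(looksLikeIdleTimeout([40_000, 120_000, 180_000])); // false
```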
2. Hypotheses, ranked
- NLB or upstream proxy idle-timeout killing the TCP connection while the WS thinks it’s alive. (most likely — 90s is a smell)
- No WS keepalive pings — OpenAI’s Realtime API doesn’t require pings, but intermediate hops may.
- Client-side audio buffer underrun causing the bot to send no frames; combined with #1, that would close the socket silently.
- Server runtime restart (Lambda cold-warm cycles, container OOM); 90s aligns with some platform timeouts but customer is on Node 20 long-running — less likely.
- OpenAI session limits (`max_response_output_tokens` or the built-in max session length) — documented at much higher than 90s, so unlikely as root cause but worth confirming.
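To rule the last hypothesis in or out quickly, dump the session limits from the first server event rather than guessing. A sketch of the parsing side only; the `session.created` event shape here is from memory of the Realtime API docs, so verify field names against the current reference:

```typescript
// Extract limit-related fields from a Realtime API server event, if present.
// Returns null for event types we don't care about.
function sessionLimits(rawEvent: string): { maxTokens: unknown } | null {
  const event = JSON.parse(rawEvent);
  if (event.type !== "session.created" && event.type !== "session.updated") return null;
  return { maxTokens: event.session?.max_response_output_tokens ?? "unset" };
}

// Wire-up would be: ws.on("message", (data) => console.log(sessionLimits(data.toString())));
const sample = JSON.stringify({
  type: "session.created",
  session: { max_response_output_tokens: "inf" },
});
console.log(sessionLimits(sample)); // { maxTokens: 'inf' }
```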
3. Most likely cause
90 seconds is a classic NLB idle-timeout signature. The Realtime WebSocket carries audio frames continuously when the agent is speaking, but during caller silence plus model thinking, the socket can go quiet at the TCP layer for longer than the timeout window. The NLB drops the connection, and the client doesn't get a clean close because the NLB doesn't emit one.
Two-part fix: (a) raise the NLB TCP idle timeout (the default is 350s, so if calls die at 90s, something lowered it), and (b) keep the WebSocket genuinely active with periodic application-level pings.
4. Fix + code
Application-level keepalive on the Node `ws` client:

```ts
import WebSocket from "ws";

const ws = new WebSocket(REALTIME_URL, { headers });

let pingTimer: NodeJS.Timeout;

ws.on("open", () => {
  // 30s ping cadence stays well below any
  // sane NLB / ALB / CloudFront idle window.
  pingTimer = setInterval(() => {
    if (ws.readyState === WebSocket.OPEN) ws.ping();
  }, 30_000);
});

ws.on("pong", () => {
  // Optional: track RTT for observability.
});

ws.on("close", (code, reason) => {
  clearInterval(pingTimer);
  console.warn("realtime ws closed", { code, reason: reason.toString() });
  // Reconnect with exponential backoff if mid-call.
});

ws.on("error", (err) => {
  console.error("realtime ws error", err);
});
```

Plus AWS CLI for the NLB:
```sh
aws elbv2 modify-target-group-attributes \
  --target-group-arn $TG_ARN \
  --attributes \
    Key=deregistration_delay.timeout_seconds,Value=30 \
    Key=stickiness.enabled,Value=false
```

Also confirm the TCP idle timeout at the listener level: the `tcp.idle_timeout.seconds` attribute defaults to 350s and can be raised to 6000s via `aws elbv2 modify-listener-attributes`.
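For the reconnect noted in the close handler, a capped exponential backoff keeps mid-call recovery fast without hammering the endpoint. A sketch of the schedule only (names and constants are mine; a production version should add jitter):

```typescript
// Delay before reconnect attempt N (1-based): base * 2^(N-1), capped.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 15_000): number {
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}

// First five attempts: 500, 1000, 2000, 4000, 8000 ms.
console.log([1, 2, 3, 4, 5].map((n) => backoffDelayMs(n)));
```

Drive it from the `close` handler with `setTimeout(reconnect, backoffDelayMs(attempt))`, resetting the attempt counter once a reconnect succeeds.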
5. Customer-facing reply
Hi —
90 seconds is a strong fingerprint for an idle TCP-timeout on intermediate infrastructure, not the Realtime API itself. The Realtime session has no 90-second cap, and a clean close from our side would send a close frame with a code — the silent drop you’re seeing means the connection was severed below the WebSocket layer.
Two changes that almost always resolve this:
- Raise your AWS NLB connection idle timeout (the `tcp.idle_timeout.seconds` listener attribute) to at least 350s.
- Send WebSocket pings every 30s from the client; code sample below.
If you can grab the close code/reason from a long-form log capture and the NLB CloudWatch metrics for that target group, I can confirm we’re looking at the same root cause. Happy to jump on a 15-min call if it moves faster.
— Chad