Why TCP Retransmissions Stall Industrial Camera Communication, and How to Isolate Them

Tags: TCP, Networking, Bug Investigation, Windows Development, Industrial Camera

In communication with industrial cameras and equipment control, the most troublesome symptom is a link that is fast on average but occasionally stalls for several seconds. It rarely reproduces and nothing happens most of the time, so the UI, threads, the GC, the camera SDK, the NIC, and the switch all start to look slightly suspicious.

This article covers a case in which TCP communication between a host application and the industrial camera it controls occasionally stopped for a few seconds. Investigation showed that the application was not freezing; TCP was waiting to retransmit after packet loss. Enabling the RFC1323-family timestamps option (RFC 7323 in the current standards) kept that wait to a minimum in this system.

Device names, configurations, and numbers have been generalized, but the way of thinking applies directly in practice.

Table of Contents

  1. The Conclusion First (In One Line)
  2. How the Symptom Appears
    • 2.1. The App Is Alive, but Responses Alone Stall for Seconds
    • 2.2. Low Frequency Makes It Hard to See in Logs Alone
  3. What Was Actually Happening (Diagrams)
    • 3.1. Packet Loss Leads into a Retransmission Wait
    • 3.2. The Multi-Second Stalls Matched the Shape of RTO
  4. What We Looked at During the Investigation
    • 4.1. First Rule Out In-App Stall Causes
    • 4.2. Confirm Retransmissions with a Packet Capture
    • 4.3. Look at the Negotiated TCP Options
  5. Why RFC1323 Timestamps Help
    • 5.1. Timestamps Exist for RTTM and PAWS
    • 5.2. They Remove the Ambiguity of RTT Measurement on Retransmission
    • 5.3. Why We Could Tighten the Wait Time in This Case
  6. What We Actually Did
    • 6.1. Enable Timestamps
    • 6.2. Verify TSopt in the SYN / SYN-ACK
    • 6.3. Where to Look When It Still Doesn’t Help
  7. What to Check in Wireshark
  8. A Rough Decision Guide
  9. Summary
  10. References

1. The Conclusion First (In One Line)

  • TCP communication that occasionally stalls for several seconds can be caused not by the app freezing, but by a retransmission wait following packet loss
  • If a packet capture shows Retransmission after a large time gap, and the stall duration matches how RTO waits behave, a retransmission wait is the prime suspect
  • The TCP timestamps option is a mechanism for RTT measurement and PAWS, and it also removes the ambiguity of RTT measurement on retransmission
  • In this case, enabling the RFC1323-family timestamps feature reduced the time the RTO estimate stayed stale and conservative, keeping the multi-second stalls to a minimum
  • However, this is not magic that makes the loss itself disappear. Reviewing the physical layer, NIC, switches, intermediate devices, drivers, and buffer design is still needed separately

In short, if the true identity of “it occasionally stalls for a few seconds” is wait time inside TCP, then working hard only on application-level retries misses the mark. It is faster to look at the wire first and confirm whether you are in a retransmission wait.

2. How the Symptom Appears

2.1. The App Is Alive, but Responses Alone Stall for Seconds

The first confusing part is that the application as a whole does not look frozen.

  • The UI is not completely dead
  • The process has not crashed
  • The CPU is not pegged
  • Yet the responses to camera control commands occasionally drop out for several seconds

Symptoms like this are hard to distinguish from an in-app deadlock or infinite loop. Moreover, in equipment control, a single multi-second stall directly creates the impression of a line stoppage. Even if the averages look clean, the perception on the shop floor is considerably worse.

2.2. Low Frequency Makes It Hard to See in Logs Alone

What makes this type of defect tedious is the low occurrence rate. It shows up once an hour, once every half day, or only when conditions happen to line up.

If you chase it only through logs, it usually goes like this.

  • The application log stops at “sent it” and “no response came back”
  • The receiver-side log looks like “nothing arrived”
  • Some other event happens to occur in the same time window, scattering the suspects

In situations like this, trying to reconstruct causality from application logs alone is a pretty reliable way to get stuck in a swamp. It is faster to step down one layer to the communication level.

3. What Was Actually Happening (Diagrams)

3.1. Packet Loss Leads into a Retransmission Wait

The storyline this time is simple. A packet was lost somewhere along the way, the sender waited for an ACK, none came, so it waited out the RTO and then retransmitted.

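Figure: sequence between the host app, the network, and the camera side.

  1. Host app -> Network: Control command (Seq=N). Lost here.
  2. Host app: No ACK arrives, so it waits. Recovering this request requires waiting out the RTO.
  3. Host app -> Network: Retransmit the control command
  4. Network -> Camera side: Retransmitted packet arrives
  5. Camera side -> Network: ACK
  6. Network -> Host app: ACK. Communication resumes here.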

From the application’s point of view it looks like “it stalled for a few seconds,” but from TCP’s point of view it was simply “no ACK has arrived yet, so I am waiting for the retransmission timer to expire.” It is unglamorous, but this kind of stall happens all the time.

The control traffic in this case consisted mostly of small request/response exchanges, and a single exchange did not have a large amount of unacknowledged data in flight. As a result, this was a configuration where the RTO wait tended to surface before enough duplicate ACKs could accumulate to trigger fast retransmit.
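Why small exchanges fall back to the RTO can be sketched with a deliberately simplified model. It assumes the classic threshold of three duplicate ACKs (RFC 5681) and ignores refinements such as tail loss probes that some stacks add. The function name recovery_path is made up for illustration.

```python
DUP_ACK_THRESHOLD = 3  # classic fast-retransmit trigger (RFC 5681)


def recovery_path(segments_in_flight: int, lost_index: int) -> str:
    """Return which mechanism can recover a single lost segment.

    segments_in_flight: segments sent back-to-back before any ACK returns.
    lost_index: 0-based position of the lost segment within that burst.
    Simplified model: every segment that arrives after the hole
    produces exactly one duplicate ACK.
    """
    dup_acks = segments_in_flight - lost_index - 1
    if dup_acks >= DUP_ACK_THRESHOLD:
        return "fast retransmit"
    return "RTO wait"


print(recovery_path(1, 0))   # one small control command in flight
print(recovery_path(10, 2))  # bulk transfer with plenty of data behind the loss
```

With a single request in flight there is nothing behind the lost segment to generate duplicate ACKs, so in this model the only way out is the retransmission timer.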

3.2. The Multi-Second Stalls Matched the Shape of RTO

TCP’s retransmission timer varies by implementation, but it behaves conservatively. Under RFC 6298, the RTO starts at 1 second before any RTT sample exists, a computed RTO below 1 second is rounded up to 1 second, and each timeout doubles the RTO.

Figure: flow of the retransmission wait.

  Packet loss -> No ACK arrives -> Wait out RTO -> Retransmit -> ACK returned?
    • Yes: Communication resumes
    • No: Double the RTO, then wait out the RTO again

So even in situations where you want things resolved within a few hundred milliseconds, under bad conditions the waits can look like 1 second, 2 seconds, 4 seconds. The “occasionally stalls for a few seconds” in this case lined up with that shape quite naturally.
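The 1, 2, 4 second shape follows directly from the doubling rule. Here is a minimal sketch, assuming an initial RTO of 1 second and an upper bound of 60 seconds (RFC 6298 allows a cap of at least 60 seconds); real stacks use their own minimums and caps, and the function name is illustrative.

```python
def rto_backoff_schedule(initial_rto: float, attempts: int,
                         max_rto: float = 60.0) -> list[float]:
    """Return the wait before each successive retransmission of one segment."""
    waits = []
    rto = initial_rto
    for _ in range(attempts):
        waits.append(rto)
        rto = min(rto * 2, max_rto)  # back off the timer after each timeout
    return waits


waits = rto_backoff_schedule(1.0, 3)
print(waits)       # [1.0, 2.0, 4.0]
print(sum(waits))  # 7.0 seconds of stall if three transmissions in a row are lost
```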

4. What We Looked at During the Investigation

4.1. First Rule Out In-App Stall Causes

Rather than jumping straight to blaming TCP, we first ruled out the typical application-side causes.

What we checked | Why we looked | Conclusion in this case
UI thread / worker threads | Check for hangs or mutual waits | Not the primary cause
CPU usage | Check for processing delays under high load | Not pegged even during the stalls
GC / memory pressure | Check for pauses | The shape of the stall durations did not match
Camera SDK calls | Check for waits inside the SDK | Did not match the delays on the wire
Packet capture | Check for retransmissions at the communication layer | This is where the cause came into view

The important thing here is to not pick the culprit based on application-log timestamps alone. In equipment control applications, a wait at the upper layer is sometimes just a reflection of a wait at the lower layer.

4.2. Confirm Retransmissions with a Packet Capture

When we took a packet capture, we could see TCP Retransmission in the time window of the stall, and furthermore that no ACK had come back just before it.

These are the points to look at.

  • Is the same Seq being retransmitted?
  • Does the time gap until the retransmission match the stall duration?
  • Does it look like an RTO-expiry wait rather than Dup ACK or Fast Retransmission?
  • Does the problematic connection always show up as the same tcp.stream?

When these line up, “TCP is waiting on a retransmission” becomes much more likely than “the app is frozen.”
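The first two checks can also be automated once the capture has been exported. The sketch below assumes the segments of one direction of one tcp.stream have already been reduced to (time, seq, length) records; the Segment type, the function name, and the 0.2 second threshold are all illustrative choices, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    time: float   # capture timestamp in seconds
    seq: int      # TCP sequence number
    length: int   # payload length in bytes


def find_retransmission_gaps(segments, min_gap=0.2):
    """Return (seq, gap_seconds) for data segments whose Seq was already sent.

    Expects the segments of ONE direction of ONE connection, in capture order.
    """
    last_seen = {}
    results = []
    for seg in segments:
        if seg.length == 0:
            continue  # pure ACKs carry no data to retransmit
        if seg.seq in last_seen:
            gap = seg.time - last_seen[seg.seq]
            if gap >= min_gap:
                results.append((seg.seq, round(gap, 3)))
        last_seen[seg.seq] = seg.time
    return results


capture = [
    Segment(10.000, 1000, 64),
    Segment(10.050, 1064, 64),  # original transmission, lost on the way
    Segment(11.052, 1064, 64),  # first retransmission, about 1 s later
    Segment(13.055, 1064, 64),  # second retransmission, about 2 s later
]
gaps = find_retransmission_gaps(capture)
for seq, gap in gaps:
    print(f"Seq={seq} retransmitted after {gap} s")
```

Gaps that cluster around 1 second and 2 seconds, as in this synthetic data, are the RTO-like shape described above.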

4.3. Look at the Negotiated TCP Options

The next thing we looked at was the SYN / SYN-ACK at connection setup. Timestamps are negotiated in the TCP 3-way handshake, so if TSopt does not appear there, it is not used on that connection.

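Figure: TSopt negotiation in the 3-way handshake between the host and the camera side.

  1. Host -> Camera side: SYN + TSopt ?
  2. Camera side -> Host: SYN/ACK + TSopt ?
  3. Host -> Camera side: ACK

  Only after negotiating here can TSopt be used on subsequent segments.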

If you fiddle with OS settings without looking at this, you end up with another unglamorous accident: “I’m sure I enabled it, but it isn’t taking effect.” The facts on the wire are stronger than the configured values.
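If you want to check this outside Wireshark, the raw TCP options field of the SYN can be parsed directly. This is a minimal sketch, assuming the option bytes have already been extracted from the TCP header; the option kinds are the standard ones (Timestamps is kind 8 with length 10), while the function names and sample bytes are made up for illustration.

```python
TIMESTAMPS_KIND = 8  # TSopt: kind 8, length 10 (RFC 7323)


def parse_tcp_options(options: bytes) -> dict[int, bytes]:
    """Parse the raw TCP options field into {kind: value_bytes}."""
    parsed = {}
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:   # End of Option List
            break
        if kind == 1:   # No-Operation, used as padding
            i += 1
            continue
        if i + 1 >= len(options):
            break       # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break       # malformed option
        parsed[kind] = options[i + 2:i + length]
        i += length
    return parsed


def timestamps(options: bytes):
    """Return (TSval, TSecr) if the Timestamps option is present, else None."""
    value = parse_tcp_options(options).get(TIMESTAMPS_KIND)
    if value is None or len(value) != 8:
        return None
    return int.from_bytes(value[:4], "big"), int.from_bytes(value[4:], "big")


# Sample SYN options: MSS, SACK permitted, Timestamps, NOP, window scale
with_ts = bytes.fromhex("020405b40402080a000003e80000000001030307")
# Sample SYN options without Timestamps
without_ts = bytes.fromhex("020405b40103030801010402")

print(timestamps(with_ts))     # (1000, 0)
print(timestamps(without_ts))  # None
```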

5. Why RFC1323 Timestamps Help

In practice people still call this “the RFC1323 timestamps,” but the current standard is RFC 7323. This article follows the customary usage and writes RFC1323, while meaning the TCP timestamps option.

5.1. Timestamps Exist for RTTM and PAWS

TCP’s timestamps option is used mainly for two purposes.

  • RTTM (Round-Trip Time Measurement)
  • PAWS (Protection Against Wrapped Sequences)

What helped in this case was the RTTM side. By having the peer echo the TSval of a transmitted segment back in the ACK’s TSecr, the sender can measure RTT more finely and more accurately.

5.2. They Remove the Ambiguity of RTT Measurement on Retransmission

Once a retransmission happens, without timestamps it becomes ambiguous whether “this ACK is for the original transmission or for the retransmission.” This is the point that Karn’s algorithm is concerned with.

RFC 6298 says you must not take an RTT sample from a retransmitted segment. The reason is that you cannot tell which transmission the ACK is for. With the timestamps option, however, this ambiguity goes away: by looking at the TSecr in the arriving ACK, you can identify which segment, with which TSval, actually got through.

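Figure: sequence between the sender and the receiver.

  1. Sender -> Receiver: Seq=N, TSval=1000. This segment is lost.
  2. Sender: No ACK arrives, so it waits.
  3. Sender -> Receiver: Retransmit Seq=N, TSval=2000
  4. Receiver -> Sender: ACK, TSecr=2000. The sender can tell which transmission this responds to.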

This is the core of the improvement in this case.
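The same exchange can be written as a small sketch. It assumes the sender remembers when it sent each TSval and that the timestamp clock ticks once per millisecond; both the bookkeeping structure and the function name are illustrative, and real stacks derive the RTT from the timestamp values themselves.

```python
def rtt_sample(sent_at_by_tsval: dict[int, float], ack_time: float, tsecr: int):
    """Return the RTT for the transmission that the ACK's TSecr echoes, or None."""
    sent_at = sent_at_by_tsval.get(tsecr)
    return None if sent_at is None else ack_time - sent_at


# Two transmissions of the same Seq=N, keyed by the TSval each one carried.
transmissions = {
    1000: 0.000,  # original transmission (lost)
    2000: 1.000,  # retransmission after the RTO expired
}

# The ACK arrives at t=1.004 s and echoes TSecr=2000.
rtt = rtt_sample(transmissions, ack_time=1.004, tsecr=2000)
print(f"RTT sample: {rtt * 1000:.1f} ms")

# Without TSecr the sender could not choose between these two candidates:
candidates = [1.004 - sent_at for sent_at in transmissions.values()]
print([f"{c * 1000:.1f} ms" for c in candidates])
```

Without the echo, a sender that guessed wrong would feed a sample of about 1 second into an estimator whose true RTT is a few milliseconds, which is exactly why the sample is forbidden in that case.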

5.3. Why We Could Tighten the Wait Time in This Case

In this case, packet loss occurred from time to time, and each occurrence pushed the RTO toward the conservative side. After a timeout the RTO is doubled, and without timestamps the ACK for the retransmitted segment cannot be used as an RTT sample, so the backed-off value stays in place until a later, unambiguous sample arrives. Enabling timestamps lets the sender update the RTT estimate even in exchanges that involve retransmissions, which shortens the period during which the RTO stays inflated on a stale estimate.

Put differently, what we did was not magic that makes TCP faster. It reduced the time TCP spends waiting longer than necessary.
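The difference can be sketched with an RFC 6298-style estimator. This is a simplified model under stated assumptions: the clock granularity term is omitted, the minimum RTO is 1 second, and the class and function names are hypothetical. Real TCP stacks differ in their minimums and in how they restore the timer.

```python
class RtoEstimator:
    """RFC 6298-style SRTT/RTTVAR estimator (clock granularity omitted)."""

    ALPHA, BETA, MIN_RTO = 1 / 8, 1 / 4, 1.0

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any sample exists

    def on_rtt_sample(self, r: float) -> None:
        if self.srtt is None:
            self.srtt, self.rttvar = r, r / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        self.rto = max(self.MIN_RTO, self.srtt + 4 * self.rttvar)

    def on_timeout(self) -> None:
        self.rto *= 2  # exponential backoff


def rto_after_recovery(has_timestamps: bool) -> float:
    est = RtoEstimator()
    est.on_rtt_sample(0.004)  # a normal exchange on a 4 ms path
    est.on_timeout()          # the request is lost: RTO 1 s -> 2 s
    est.on_timeout()          # the retransmission is lost too: 2 s -> 4 s
    # The ACK for the second retransmission finally arrives.
    if has_timestamps:
        est.on_rtt_sample(0.004)  # TSecr identifies the transmission, so the sample is valid
    # Without timestamps, Karn's algorithm forbids the sample and the RTO stays backed off.
    return est.rto


print(rto_after_recovery(has_timestamps=True))   # 1.0
print(rto_after_recovery(has_timestamps=False))  # 4.0
```

In this model, a second loss that hits before the next clean sample starts its wait from 4 seconds instead of 1 second when timestamps are not in use.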

Of course, RFC 7323 does not claim that “more RTT samples cleanly solve everything.” The degree to which it helps RTO optimization is limited in some respects. Still, the fact that it removes the ambiguity on retransmission can help quite naturally in a system like this one.

There are caveats.

  • Parts of this depend on the TCP stack implementation
  • Timestamps alone do not make the packet loss itself disappear
  • If the physical layer or intermediate devices are at fault, the root cause lies elsewhere
  • SACK, NIC drivers, offload settings, and switch-side problems are better examined separately

That said, in a system like this one — where “loss is not zero” but “what really hurts is the multi-second wait” — it can be quite effective.

6. What We Actually Did

6.1. Enable Timestamps

As the countermeasure, we made sure the timestamps option could be negotiated on both ends of the connection. On Windows the setting is often labeled as an RFC 1323 option (for example, netsh interface tcp show global reports it as “RFC 1323 Timestamps”), and whether it takes effect depends on OS and network settings.

In practice, however, what matters is not “it is enabled in the settings screen” but “TSopt is actually present on the SYN / SYN-ACK packets on the wire.”

6.2. Verify TSopt in the SYN / SYN-ACK

After enabling it, we verified three things.

  • Does the SYN of the connection in question carry TSopt?
  • Does the SYN/ACK side return TSopt as well?
  • Do subsequent data segments and ACKs continue to carry TSopt?

Only once these are confirmed can you say “timestamps are actually being used on that connection.”
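The three checks can be expressed as one pass over the packets of a connection. This sketch assumes each packet has already been reduced to a dict with boolean syn, ack, and tsopt fields (and an optional rst field); that record shape and the function name are assumptions for illustration. RST segments are skipped because they are not required to carry the option.

```python
def tsopt_in_use(packets) -> dict[str, bool]:
    """Check TSopt on the SYN, the SYN/ACK, and all later non-RST segments."""
    syn = [p for p in packets if p["syn"] and not p["ack"]]
    syn_ack = [p for p in packets if p["syn"] and p["ack"]]
    later = [p for p in packets if not p["syn"] and not p.get("rst", False)]
    return {
        "syn_has_tsopt": bool(syn) and all(p["tsopt"] for p in syn),
        "syn_ack_has_tsopt": bool(syn_ack) and all(p["tsopt"] for p in syn_ack),
        "later_segments_have_tsopt": bool(later) and all(p["tsopt"] for p in later),
    }


connection = [
    {"syn": True, "ack": False, "tsopt": True},   # SYN
    {"syn": True, "ack": True, "tsopt": False},   # SYN/ACK without TSopt
    {"syn": False, "ack": True, "tsopt": False},  # handshake ACK
]
for check, ok in tsopt_in_use(connection).items():
    print(f"{check}: {'OK' if ok else 'NG'}")
```

In the sample data the peer does not return TSopt in the SYN/ACK, so the option is not in use on that connection even though the host offered it.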

6.3. Where to Look When It Still Doesn’t Help

Even with timestamps enabled, the improvement can be small in cases like these.

  • The loss rate itself is high
  • An intermediate device breaks, drops, or mangles TCP options
  • There is a separate problem around the NIC / driver / offloading
  • The application hangs everything off a single synchronous call, so one wait looks like a total stall
  • The primary cause is actually not TCP, but a processing stall on the camera side or a clogged queue inside the device

So it is clearest to proceed with countermeasures in this order.

  1. First confirm the retransmission wait on the wire
  2. Check whether TSopt is negotiated
  3. Enable timestamps and measure the improvement delta
  4. If problems remain, tackle the loss source and the application design separately

7. What to Check in Wireshark

Here are display filters that are handy for isolation.

tcp.stream eq <target stream>
tcp.analysis.retransmission
tcp.analysis.fast_retransmission
tcp.analysis.lost_segment
tcp.options.timestamp.tsval
tcp.options.timestamp.tsecr

There are a few tricks to reading the results.

  • Narrow down to the target connection with tcp.stream
  • Display the “Time delta from previous displayed packet” value as a column to see the stalled seconds directly
  • Confirm whether Retransmission appears at the problematic moment
  • Confirm whether TSopt is negotiated in the SYN / SYN-ACK at connection setup
  • Check whether TSecr is being returned in the ACKs

When correlating logs with packets, also watch out for the offset between application clock and capture clock. If they are skewed, you tend to pin the blame on an unrelated event.
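One simple way to handle the skew is to estimate a constant offset from a few events that are known to be the same send in both sources, then shift the application timestamps before comparing. The sketch below assumes the offset is roughly constant over the capture and that such matched pairs can be identified by hand; the function names and sample values are illustrative.

```python
from statistics import median


def estimate_offset(pairs) -> float:
    """pairs: (app_log_time, capture_time) for events known to be the same send."""
    return median(capture - app for app, capture in pairs)


def to_capture_time(app_time: float, offset: float) -> float:
    """Map an application-log timestamp onto the capture clock."""
    return app_time + offset


matched = [(100.250, 100.870), (200.100, 200.722), (300.400, 301.019)]
offset = estimate_offset(matched)
print(f"offset: {offset:.3f} s")
print(f"app stall at 512.000 s is near {to_capture_time(512.000, offset):.3f} s in the capture")
```

The median is used here so that one pair with an unusually long delay between the log call and the actual send does not drag the estimate.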

8. A Rough Decision Guide

Symptom | First suspect | First action
Occasionally stalls for several seconds | TCP’s RTO wait | Confirm retransmissions and time gaps in packets
Stalls at almost the same timing every time | In-app waits, device-side processing, fixed timeouts | Look at threads, SDK calls, device logs
Only degrades under high load | CPU, GC, clogged queues | Look at CPU, interrupts, memory, queue lengths
Bad across a wide range of connections at once | Physical layer, switches, intermediate devices | Look at NIC, cables, port statistics, intermediate device logs
Changed settings but nothing changed | TCP option is not being negotiated | Re-check the SYN / SYN-ACK

That last row is genuinely common. The satisfaction of having tweaked a setting and the fact of it being used on the wire are two different things.

9. Summary

Key points this time:

  • “Occasionally stalls for several seconds” can be TCP’s retransmission wait, not the app freezing
  • If the stall duration matches how RTO waits behave and Retransmission is visible, you are on the right track
  • The TCP timestamps option is a mechanism for RTTM and PAWS, and it removes the ambiguity of RTT measurement on retransmission
  • In this case, enabling the RFC1323-family timestamps limited the time the RTO stayed excessively conservative

Approaches to avoid:

  • Picking the culprit for a communication stall from application logs alone
  • Looking only at OS settings without looking at actual packets
  • Assuming that enabling timestamps will also eliminate the cause of the loss

Approaches that work in practice:

  • Look at the wire first
  • Confirm the shape of the retransmissions and wait times
  • Confirm the TSopt negotiation
  • Even after improvement, tackle the loss source and the application design separately

In other words, with this class of defect, “pinpointing where it is waiting” comes before “making it faster.” Just by not missing that, the investigation gets considerably shorter.

10. References

Standards cited in this article:

  • RFC 6298: Computing TCP's Retransmission Timer
  • RFC 7323: TCP Extensions for High Performance (obsoletes RFC 1323)

Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.