Why TCP Retransmissions Stall Industrial Camera Communication, and How to Isolate Them

In communication with industrial cameras and equipment control, the most troublesome symptom is a link that is fast on average but occasionally stalls for several seconds. It rarely reproduces and nothing happens most of the time, so the UI, threads, the GC, the camera SDK, the NIC, and the switch all start to look slightly suspicious.

This article covers a case where TCP communication between a host application controlling an industrial camera and the camera side occasionally stopped for a few seconds. Investigation showed that the application was not freezing; TCP was waiting to retransmit after packet loss. Enabling the TCP timestamps option (originally specified in RFC 1323, now in RFC 7323) kept that wait to a minimum in this system.

Device names, configurations, and numbers have been generalized, but the way of thinking applies directly in practice.

1. The Conclusion First (In One Line)
2. How the Symptom Appears
- 2.1. The App Is Alive, but Responses Alone Stall for Seconds
- 2.2. Low Frequency Makes It Hard to See in Logs Alone
3. What Was Actually Happening (Diagrams)
- 3.1. Packet Loss Leads into a Retransmission Wait
- 3.2. The Multi-Second Stalls Matched the Shape of RTO
4. What We Looked at During the Investigation
- 4.1. First Rule Out In-App Stall Causes
- 4.2. Confirm Retransmissions with a Packet Capture
- 4.3. Look at the Negotiated TCP Options
5. Why RFC1323 Timestamps Help
- 5.1. Timestamps Exist for RTTM and PAWS
- 5.2. They Remove the Ambiguity of RTT Measurement on Retransmission
- 5.3. Why We Could Tighten the Wait Time in This Case
6. What We Actually Did
- 6.1. Enable Timestamps
- 6.2. Verify TSopt in the SYN / SYN-ACK
- 6.3. Where to Look When It Still Doesn’t Help
7. What to Check in Wireshark
8. A Rough Decision Guide
9. Summary
10. References

1. The Conclusion First (In One Line)

- TCP communication that occasionally stalls for several seconds can be caused not by the app freezing, but by a retransmission wait following packet loss.
- If a packet capture shows Retransmission after a large time gap, and the stall duration matches how RTO waits behave, a retransmission wait is the likely cause.
- The TCP timestamps option is a mechanism for RTT measurement and PAWS, and it also removes the ambiguity of RTT measurement on retransmission.
- In this case, enabling the RFC1323-family timestamps feature reduced the time the RTO estimate stayed stale and conservative, keeping the multi-second stalls to a minimum.
- However, this is not magic that makes the loss itself disappear. The physical layer, NIC, switches, intermediate devices, drivers, and buffer design still need to be reviewed separately.

In short, if the true identity of “it occasionally stalls for a few seconds” is wait time inside TCP, then working hard only on application-level retries misses the mark. It is faster to look at the wire first and confirm whether you are in a retransmission wait.

2. How the Symptom Appears

2.1. The App Is Alive, but Responses Alone Stall for Seconds

The first confusing part is that the application as a whole does not look frozen.

- The UI is not completely dead
- The process has not crashed
- The CPU is not pegged
- Yet the responses to camera control commands occasionally drop out for several seconds

Symptoms like this are hard to distinguish from an in-app deadlock or infinite loop. Moreover, in equipment control, a single multi-second stall directly creates the impression of a line stoppage. Even if the averages look clean, the perception on the shop floor is considerably worse.

2.2. Low Frequency Makes It Hard to See in Logs Alone

What makes this type of defect tedious is the low occurrence rate. It behaves like once an hour, once every half day, or only when conditions happen to line up.

If you chase it only through logs, it usually goes like this.

- The application log stops at “sent it” and “no response came back”
- The receiver-side log looks like “nothing arrived”
- Some other event happens to occur in the same time window, scattering the suspects

In situations like this, trying to reconstruct causality from application logs alone is a pretty reliable way to get stuck in a swamp. It is faster to step down one layer to the communication level.

3. What Was Actually Happening (Diagrams)

3.1. Packet Loss Leads into a Retransmission Wait

The storyline this time is simple. A packet was lost somewhere along the way, the sender waited for an ACK, none came, so it waited out the RTO and then retransmitted.

sequenceDiagram
    participant Host as Host app
    participant Net as Network
    participant Cam as Camera side

    Host->>Net: Control command (Seq=N)
    Note over Net: Lost here
    Note over Host: No ACK arrives, so it waits
    Note over Host: Recovering this request requires waiting out the RTO
    Host->>Net: Retransmit the control command
    Net->>Cam: Retransmitted packet arrives
    Cam-->>Net: ACK
    Net-->>Host: ACK
    Note over Host: Communication resumes here

From the application’s point of view it looks like “it stalled for a few seconds,” but from TCP’s point of view it was simply “no ACK has arrived yet, so I am waiting for the retransmission timer to expire.” It is unglamorous, but this kind of stall happens all the time.

The control traffic in this case consisted mostly of small request/response exchanges, so a single exchange rarely had much unacknowledged data in flight. Fast retransmit needs three duplicate ACKs (RFC 5681), and those are only generated when later segments arrive after the lost one. With so little data in flight, the RTO wait tended to surface before fast retransmit could trigger.

3.2. The Multi-Second Stalls Matched the Shape of RTO

TCP’s retransmission timer behaves conservatively, although the details vary by implementation. Under RFC 6298, the RTO starts at 1 second before any RTT has been measured, a computed RTO below 1 second should be rounded up to 1 second, and each timeout doubles the RTO. Some stacks use a smaller floor than the RFC recommends, so check what your platform actually does.

flowchart LR
    A[Packet loss] --> B[No ACK arrives]
    B --> C[Wait out RTO]
    C --> D[Retransmit]
    D --> E{ACK returned?}
    E -- Yes --> F[Communication resumes]
    E -- No --> G[Double the RTO]
    G --> C

So even in situations where you want things resolved within a few hundred milliseconds, under bad conditions the waits can look like 1 second, 2 seconds, 4 seconds. The “occasionally stalls for a few seconds” in this case lined up with that shape quite naturally.
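The shape described above can be reproduced with a small model of the RFC 6298 timer. This is a minimal sketch, not the behavior of any particular TCP stack: the class name `RtoEstimator` is made up for illustration, the clock granularity term from the RFC is omitted, and the 60-second cap is an assumed value (the RFC only requires any cap to be at least 60 seconds).

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8    # RFC 6298 gain for SRTT
BETA = 1 / 4     # RFC 6298 gain for RTTVAR
K = 4            # RTTVAR multiplier
MIN_RTO = 1.0    # RFC 6298 floor in seconds (a SHOULD; real stacks may differ)
MAX_RTO = 60.0   # assumed cap in seconds


@dataclass
class RtoEstimator:
    """Simplified RFC 6298 retransmission timer (hypothetical helper)."""
    srtt: Optional[float] = None
    rttvar: Optional[float] = None
    rto: float = 1.0  # initial RTO before any RTT sample exists

    def on_rtt_sample(self, r: float) -> None:
        """Feed one RTT measurement in seconds and recompute the RTO."""
        if self.srtt is None or self.rttvar is None:
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the SRTT from before this sample.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r
        self.rto = min(MAX_RTO, max(MIN_RTO, self.srtt + K * self.rttvar))

    def on_timeout(self) -> None:
        """Back off: double the RTO after a retransmission timeout."""
        self.rto = min(MAX_RTO, self.rto * 2)


est = RtoEstimator()
est.on_rtt_sample(0.002)      # a 2 ms round trip on a local network
waits = []
for _ in range(3):            # three consecutive timeouts
    waits.append(est.rto)
    est.on_timeout()
print(waits)                  # [1.0, 2.0, 4.0]
```

Even with a 2 ms round trip, the floor keeps the first wait at 1 second in this model, and consecutive losses add up to 7 seconds of stall.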

4. What We Looked at During the Investigation

4.1. First Rule Out In-App Stall Causes

Rather than jumping straight to blaming TCP, we first ruled out the typical application-side causes.

| What we checked | Why we looked | Conclusion in this case |
| --- | --- | --- |
| UI thread / worker threads | Check for hangs or mutual waits | Not the primary cause |
| CPU usage | Check for processing delays under high load | Not pegged even during the stalls |
| GC / memory pressure | Check for pauses | The shape of the stall durations did not match |
| Camera SDK calls | Check for waits inside the SDK | Did not match the delays on the wire |
| Packet capture | Check for retransmissions at the communication layer | This is where the cause came into view |

The important thing here is to not pick the culprit based on application-log timestamps alone. In equipment control applications, a wait at the upper layer is sometimes just a reflection of a wait at the lower layer.

4.2. Confirm Retransmissions with a Packet Capture

When we took a packet capture, we could see TCP Retransmission in the time window of the stall, and furthermore that no ACK had come back just before it.

These are the points to look at.

- Is the same Seq being retransmitted?
- Does the time gap until the retransmission match the stall duration?
- Does it look like an RTO-expiry wait rather than Dup ACK or Fast Retransmission?
- Does the problematic connection always show up as the same tcp.stream?

When these line up, “TCP is waiting on a retransmission” becomes much more likely than “the app is frozen.”
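The first two checks can be sketched as a scan over segments exported from a capture. This is only an illustration under assumptions: the `Segment` record and `find_retransmission_gaps` are hypothetical names, the input is assumed to hold one direction of a single tcp.stream in capture order, and sequence-number wraparound is ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    time: float    # capture timestamp in seconds
    seq: int       # TCP sequence number
    length: int    # payload length in bytes


def find_retransmission_gaps(
    segments: List[Segment], min_gap: float = 0.2
) -> List[Tuple[int, float]]:
    """Return (seq, gap_seconds) for data segments whose seq was sent before.

    min_gap is an assumed threshold that hides quick resends so that
    only RTO-sized waits remain.
    """
    last_seen: Dict[int, float] = {}
    hits: List[Tuple[int, float]] = []
    for seg in segments:
        if seg.length == 0:
            continue  # pure ACKs carry no data to retransmit
        if seg.seq in last_seen:
            gap = seg.time - last_seen[seg.seq]
            if gap >= min_gap:
                hits.append((seg.seq, gap))
        last_seen[seg.seq] = seg.time
    return hits


# Illustrative values only, not taken from the real capture.
capture = [
    Segment(time=10.0, seq=1000, length=64),   # original command
    Segment(time=11.0, seq=1000, length=64),   # same Seq again, 1 s later
    Segment(time=11.002, seq=1064, length=64), # next command, sent once
]
hits = find_retransmission_gaps(capture)
print(hits)  # [(1000, 1.0)]
```

If the reported gaps sit near 1, 2, or 4 seconds and line up with the stalls the application saw, the RTO explanation becomes hard to dismiss.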

4.3. Look at the Negotiated TCP Options

The next thing we looked at was the SYN / SYN-ACK at connection setup. Timestamps are negotiated in the TCP 3-way handshake, so if TSopt does not appear there, it is not used on that connection.

sequenceDiagram
    participant Host as Host
    participant Cam as Camera side

    Host->>Cam: SYN + TSopt ?
    Cam-->>Host: SYN/ACK + TSopt ?
    Host->>Cam: ACK
    Note over Host,Cam: Only after negotiating here can TSopt be used on subsequent segments

If you fiddle with OS settings without looking at this, you end up with another unglamorous accident: “I’m sure I enabled it, but it isn’t taking effect.” The facts on the wire are stronger than the configured values.
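To make "TSopt appears in the SYN" concrete, here is a minimal sketch that walks the raw TCP options bytes of a segment and looks for the timestamps option (kind 8, length 10, per RFC 7323). The function names `find_tsopt` and `tsopt_negotiated` are made up for this example, and the option bytes are hand-built rather than taken from a real capture.

```python
import struct
from typing import Optional, Tuple

TCPOPT_EOL = 0        # end of option list
TCPOPT_NOP = 1        # one-byte padding
TCPOPT_TIMESTAMP = 8  # timestamps option, length 10


def find_tsopt(options: bytes) -> Optional[Tuple[int, int]]:
    """Return (TSval, TSecr) if the options contain a timestamps option."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCPOPT_EOL:
            break
        if kind == TCPOPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed option
        if kind == TCPOPT_TIMESTAMP and length == 10:
            tsval, tsecr = struct.unpack("!II", options[i + 2:i + 10])
            return tsval, tsecr
        i += length
    return None


def tsopt_negotiated(syn_options: bytes, synack_options: bytes) -> bool:
    """Timestamps are only in use if both SYN and SYN-ACK carry the option."""
    return (find_tsopt(syn_options) is not None
            and find_tsopt(synack_options) is not None)


# MSS 1460, two NOPs, then timestamps with TSval=1000 and TSecr=0.
syn = bytes([2, 4, 0x05, 0xB4, 1, 1, 8, 10]) + struct.pack("!II", 1000, 0)
# A peer that answers with MSS and SACK-permitted only.
synack_without_ts = bytes([2, 4, 0x05, 0xB4, 1, 1, 4, 2])

print(find_tsopt(syn))                           # (1000, 0)
print(tsopt_negotiated(syn, synack_without_ts))  # False
```

The second result is the "I enabled it, but it isn’t taking effect" case: one side offers the option, the other does not echo it, and the connection runs without timestamps.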

5. Why RFC1323 Timestamps Help

In practice people still call this “the RFC1323 timestamps,” but the current standard is RFC 7323. This article follows the customary usage and writes RFC1323, while meaning the TCP timestamps option.

5.1. Timestamps Exist for RTTM and PAWS

TCP’s timestamps option is used mainly for two purposes.

- RTTM (Round-Trip Time Measurement)
- PAWS (Protection Against Wrapped Sequences)

What helped in this case was the RTTM side. By having the peer echo the TSval of a transmitted segment back in the ACK’s TSecr, the sender can measure RTT more finely and more accurately.

5.2. They Remove the Ambiguity of RTT Measurement on Retransmission

Once a retransmission happens, without timestamps it becomes ambiguous whether “this ACK is for the original transmission or for the retransmission.” This is the point that Karn’s algorithm is concerned with.

RFC 6298 states, following Karn’s algorithm, that RTT samples must not be taken from retransmitted segments, because you cannot tell which transmission the ACK is for. The same RFC names the timestamps option as the exception: by looking at the TSecr in the arriving ACK, you can identify which segment, with which TSval, actually got through.

sequenceDiagram
    participant Host as Sender
    participant Cam as Receiver

    Host->>Cam: Seq=N, TSval=1000
    Note over Host,Cam: This segment is lost
    Note over Host: No ACK arrives, so it waits
    Host->>Cam: Retransmit Seq=N, TSval=2000
    Cam-->>Host: ACK, TSecr=2000
    Note over Host: Can tell which transmission this responds to

This is the core of the improvement in this case.
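The difference can be sketched with the numbers from the diagram. This is a simplified model under assumptions: `rtt_sample_ms` is a hypothetical function, the timestamp clock is assumed to tick in milliseconds, and TSval is assumed to come from the same clock as `now_ms`. Real stacks differ in tick rate and bookkeeping.

```python
from typing import Optional


def rtt_sample_ms(
    now_ms: int,
    tsecr: Optional[int],
    was_retransmitted: bool,
    first_send_ms: int,
) -> Optional[int]:
    """Return an RTT sample in ms, or None when no valid sample can be taken."""
    if tsecr is not None:
        # The echoed TSval identifies the transmission that was acknowledged.
        # The mask handles wraparound of the 32-bit timestamp clock.
        return (now_ms - tsecr) & 0xFFFFFFFF
    if was_retransmitted:
        # Karn's algorithm: the ACK may belong to either transmission,
        # so the sample is discarded.
        return None
    return now_ms - first_send_ms


# Original sent at 1000 ms, retransmitted at 2000 ms, ACK arrives at 2002 ms.
with_ts = rtt_sample_ms(2002, tsecr=2000, was_retransmitted=True, first_send_ms=1000)
without_ts = rtt_sample_ms(2002, tsecr=None, was_retransmitted=True, first_send_ms=1000)
print(with_ts)     # 2
print(without_ts)  # None
```

Without the option, the sender cannot choose between 2 ms and 1002 ms and has to throw the measurement away. With it, the 2 ms sample can be fed back into the RTO calculation.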

5.3. Why We Could Tighten the Wait Time in This Case

In this case, packet loss occurred from time to time, and each occurrence tended to push the RTT / RTO estimates toward the conservative side. Enabling timestamps lets the stack keep taking RTT samples even from exchanges that involved retransmissions, which limits how long the RTO stays inflated on the basis of stale estimates.

Put differently, what we did is not magic that makes TCP faster. It reduces the time TCP spends watching and waiting longer than necessary.

Of course, RFC 7323 does not claim that “more RTT samples cleanly solve everything.” The degree to which it helps RTO optimization is limited in some respects. Still, the fact that it removes the ambiguity on retransmission can help quite naturally in a system like this one.

There are caveats.

- Parts of this depend on the TCP stack implementation
- Timestamps alone do not make the packet loss itself disappear
- If the physical layer or intermediate devices are at fault, the root cause lies elsewhere
- SACK, NIC drivers, offload settings, and switch-side problems are better examined separately

That said, in a system like this one, where loss is not zero but what really hurts is the multi-second wait, it can be quite effective.

6. What We Actually Did

6.1. Enable Timestamps

As the countermeasure, we made sure the timestamps option could be negotiated on both ends of the connection. Windows documentation often refers to it as an RFC 1323 option (for example, the Tcp1323Opts registry value), and whether it is offered depends on OS and network settings.

In practice, however, what matters is not “it is enabled in the settings screen” but “TSopt is actually present on the SYN / SYN-ACK packets on the wire.”

6.2. Verify TSopt in the SYN / SYN-ACK

After enabling it, we verified three things.

- Does the SYN of the connection in question carry TSopt?
- Does the SYN/ACK side return TSopt as well?
- Do subsequent data segments and ACKs continue to carry TSopt?

Only once these are confirmed can you say “timestamps are actually being used on that connection.”

6.3. Where to Look When It Still Doesn’t Help

Even with timestamps enabled, improvement can be sluggish in cases like these.

- The loss rate itself is high
- An intermediate device breaks, drops, or mangles TCP options
- There is a separate problem around the NIC / driver / offloading
- The application hangs everything off a single synchronous call, so one wait looks like a total stall
- The primary cause is actually not TCP, but a processing stall on the camera side or a clogged queue inside the device

So it is clearest to proceed with countermeasures in this order.

1. First confirm the retransmission wait on the wire
2. Check whether TSopt is negotiated
3. Enable timestamps and measure the improvement delta
4. If problems remain, tackle the loss source and the application design separately
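For the "measure the improvement delta" step, it helps to decide beforehand what will be compared. One possible approach, sketched here with a hypothetical `stall_summary` helper and an assumed 1-second threshold, is to record command response times over the same kind of run before and after the change and compare the number of stalls and the worst case. The sample values are synthetic and only show the mechanics; they are not measurements from this case.

```python
from typing import Dict, List


def stall_summary(latencies_s: List[float], threshold_s: float = 1.0) -> Dict[str, float]:
    """Summarize response times: sample count, stalls at or above the threshold, worst case."""
    stalls = [x for x in latencies_s if x >= threshold_s]
    return {
        "samples": len(latencies_s),
        "stalls": len(stalls),
        "worst_s": max(latencies_s, default=0.0),
    }


# Synthetic response times in seconds, for illustration only.
example_run = [0.004, 0.005, 1.010, 0.004, 3.020]
summary = stall_summary(example_run)
print(summary)  # {'samples': 5, 'stalls': 2, 'worst_s': 3.02}
```

Averages hide this class of problem, so the stall count and the worst case are the numbers worth tracking across runs of comparable length.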

7. What to Check in Wireshark

Here are display filters that are handy for isolation.

tcp.stream eq <target stream>
tcp.analysis.retransmission
tcp.analysis.fast_retransmission
tcp.analysis.lost_segment
tcp.options.timestamp.tsval
tcp.options.timestamp.tsecr

There are a few tricks to reading the results.

- Narrow down to the target connection with tcp.stream
- Add frame.time_delta_displayed (“Time delta from previous displayed frame”) as a column to see the stalled seconds directly
- Confirm whether Retransmission appears at the problematic moment
- Confirm whether TSopt is negotiated in the SYN / SYN-ACK at connection setup
- Check whether TSecr is being returned in the ACKs

When correlating logs with packets, also watch out for the offset between application clock and capture clock. If they are skewed, you tend to pin the blame on an unrelated event.
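One way to handle the skew is to estimate a fixed offset from a few events that are visible in both places, then shift log times before comparing. The sketch below assumes such anchor events exist (for example, a connection open logged by the application and the matching SYN in the capture). The function names are hypothetical, and clock drift during the capture is ignored.

```python
import statistics
from typing import List, Tuple


def estimate_offset(anchor_pairs: List[Tuple[float, float]]) -> float:
    """Median of (capture_time - log_time) over events seen in both sources."""
    return statistics.median(cap - log for log, cap in anchor_pairs)


def to_capture_time(log_time: float, offset: float) -> float:
    """Convert an application-log timestamp to the capture's clock."""
    return log_time + offset


# (log_time, capture_time) in seconds for the same events; illustrative values.
anchors = [(100.0, 100.5), (200.0, 200.5), (300.0, 300.5)]
offset = estimate_offset(anchors)
print(offset)                          # 0.5
print(to_capture_time(250.0, offset))  # 250.5
```

The median keeps one badly matched anchor from shifting the whole timeline. If the offset changes noticeably between the start and end of a long capture, a single constant is not enough and the correlation should be done per time window.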

8. A Rough Decision Guide

| Symptom | First suspect | First action |
| --- | --- | --- |
| Occasionally stalls for several seconds | TCP’s RTO wait | Confirm retransmissions and time gaps in packets |
| Stalls at almost the same timing every time | In-app waits, device-side processing, fixed timeouts | Look at threads, SDK calls, device logs |
| Only degrades under high load | CPU, GC, clogged queues | Look at CPU, interrupts, memory, queue lengths |
| Bad across a wide range of connections at once | Physical layer, switches, intermediate devices | Look at NIC, cables, port statistics, intermediate device logs |
| Changed settings but nothing changed | TCP option is not being negotiated | Re-check the SYN / SYN-ACK |

That last row is genuinely common. The satisfaction of having tweaked a setting and the fact of it being used on the wire are two different things.

9. Summary

Key points this time:

- “Occasionally stalls for several seconds” can be TCP’s retransmission wait, not the app freezing
- If the stall duration matches how RTO waits behave and Retransmission is visible, you are on a good track
- The TCP timestamps option is a mechanism for RTTM and PAWS, and it removes the ambiguity of RTT measurement on retransmission
- In this case, enabling the RFC1323-family timestamps limited the time the RTO stayed excessively conservative

Approaches to avoid:

- Picking the culprit for a communication stall from application logs alone
- Looking only at OS settings without looking at actual packets
- Assuming that enabling timestamps will also eliminate the cause of the loss

Approaches that work in practice:

- Look at the wire first
- Confirm the shape of the retransmissions and wait times
- Confirm the TSopt negotiation
- Even after improvement, tackle the loss source and the application design separately

In other words, with this class of defect, “pinpointing where it is waiting” comes before “making it faster.” Just by not missing that, the investigation gets considerably shorter.

10. References

- RFC 1323 - TCP Extensions for High Performance (obsoleted by RFC 7323)
- RFC 7323 - TCP Extensions for High Performance
- RFC 5681 - TCP Congestion Control
- RFC 6298 - Computing TCP’s Retransmission Timer
- [Description of Windows TCP features - Windows Server | Microsoft Learn](https://learn.microsoft.com/en-us/troubleshoot/windows-server/networking/description-tcp-features)

Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
