Why TCP Retransmissions Stall Industrial Camera Communication, and How to Isolate Them
· Go Komura · TCP, Networking, Bug Investigation, Windows Development, Industrial Camera
In communication with industrial cameras and equipment control, the most troublesome symptom is a link that is fast on average but occasionally stalls for several seconds. It rarely reproduces and nothing happens most of the time, so the UI, threads, the GC, the camera SDK, the NIC, and the switch all start to look slightly suspicious.
This article deals with a case where TCP communication between a host and an application controlling an industrial camera occasionally stopped for a few seconds. When we investigated, the culprit was not the application freezing, but TCP waiting on a retransmission caused by packet loss. Furthermore, enabling the RFC1323-family timestamps feature (in the current standards landscape, RFC 7323) allowed us to keep the wait time to a minimum in this system.
Device names, configurations, and numbers have been generalized, but the way of thinking applies directly in practice.
Table of Contents
- The Conclusion First (In One Line)
- How the Symptom Appears
- 2.1. The App Is Alive, but Responses Alone Stall for Seconds
- 2.2. Low Frequency Makes It Hard to See in Logs Alone
- What Was Actually Happening (Diagrams)
- 3.1. Packet Loss Leads into a Retransmission Wait
- 3.2. The Multi-Second Stalls Matched the Shape of RTO
- What We Looked at During the Investigation
- 4.1. First Rule Out In-App Stall Causes
- 4.2. Confirm Retransmissions with a Packet Capture
- 4.3. Look at the Negotiated TCP Options
- Why RFC1323 Timestamps Help
- 5.1. Timestamps Exist for RTTM and PAWS
- 5.2. They Remove the Ambiguity of RTT Measurement on Retransmission
- 5.3. Why We Could Tighten the Wait Time in This Case
- What We Actually Did
- 6.1. Enable Timestamps
- 6.2. Verify TSopt in the SYN / SYN-ACK
- 6.3. Where to Look When It Still Doesn’t Help
- What to Check in Wireshark
- A Rough Decision Guide
- Summary
- References
1. The Conclusion First (In One Line)
- TCP communication that occasionally stalls for several seconds can be caused not by the app freezing, but by a retransmission wait following packet loss
- If a packet capture shows `Retransmission` with a large time gap, and the stall duration matches how RTO waits behave, it is quite suspicious
- The TCP timestamps option is a mechanism for RTT measurement and PAWS, and it also removes the ambiguity of RTT measurement on retransmission
- In this case, enabling the RFC1323-family timestamps feature reduced the time the RTO estimate stayed stale and conservative, keeping the multi-second stalls to a minimum
- However, this is not magic that makes the loss itself disappear. Reviewing the physical layer, NIC, switches, intermediate devices, drivers, and buffer design is still needed separately
In short, if the true identity of “it occasionally stalls for a few seconds” is wait time inside TCP, then working hard only on application-level retries misses the mark. It is faster to look at the wire first and confirm whether you are in a retransmission wait.
2. How the Symptom Appears
2.1. The App Is Alive, but Responses Alone Stall for Seconds
The first confusing part is that the application as a whole does not look frozen.
- The UI is not completely dead
- The process has not crashed
- The CPU is not pegged
- Yet the responses to camera control commands occasionally drop out for several seconds
Symptoms like this are hard to distinguish from an in-app deadlock or infinite loop. Moreover, in equipment control, a single multi-second stall directly creates the impression of a line stoppage. Even if the averages look clean, the perception on the shop floor is considerably worse.
2.2. Low Frequency Makes It Hard to See in Logs Alone
What makes this type of defect tedious is the low occurrence rate. It shows up roughly once an hour, once every half day, or only when conditions happen to line up.
If you chase it only through logs, it usually goes like this.
- The application log stops at “sent it” and “no response came back”
- The receiver-side log looks like “nothing arrived”
- Some other event happens to occur in the same time window, scattering the suspects
In situations like this, trying to reconstruct causality from application logs alone is a pretty reliable way to get stuck in a swamp. It is faster to step down one layer to the communication level.
3. What Was Actually Happening (Diagrams)
3.1. Packet Loss Leads into a Retransmission Wait
The storyline this time is simple. A packet was lost somewhere along the way, the sender waited for an ACK, none came, so it waited out the RTO and then retransmitted.
```mermaid
sequenceDiagram
    participant Host as Host app
    participant Net as Network
    participant Cam as Camera side
    Host->>Net: Control command (Seq=N)
    Note over Net: Lost here
    Note over Host: No ACK arrives, so it waits
    Note over Host: Recovering this request requires waiting out the RTO
    Host->>Net: Retransmit the control command
    Net->>Cam: Retransmitted packet arrives
    Cam-->>Net: ACK
    Net-->>Host: ACK
    Note over Host: Communication resumes here
```
From the application’s point of view it looks like “it stalled for a few seconds,” but from TCP’s point of view it was simply “no ACK has arrived yet, so I am waiting for the retransmission timer to expire.” It is unglamorous, but this kind of stall happens all the time.
The control traffic in this case consisted mostly of small request/response exchanges, and a single exchange did not have a large amount of unacknowledged data in flight. Fast retransmit needs three duplicate ACKs, which only arrive when later segments get through after the lost one. With so little data in flight, the RTO wait tended to surface before enough duplicate ACKs could accumulate.
3.2. The Multi-Second Stalls Matched the Shape of RTO
TCP’s retransmission wait, while it varies by implementation, behaves conservatively. Under RFC 6298, the initial RTO has a baseline of 1 second; if the computed value is smaller, it is rounded up to 1 second, and when a timeout occurs, the RTO doubles.
```mermaid
flowchart LR
    A[Packet loss] --> B[No ACK arrives]
    B --> C[Wait out RTO]
    C --> D[Retransmit]
    D --> E{ACK returned?}
    E -- Yes --> F[Communication resumes]
    E -- No --> G[Double the RTO]
    G --> C
```
So even in situations where you want things resolved within a few hundred milliseconds, under bad conditions the waits can look like 1 second, 2 seconds, 4 seconds. The “occasionally stalls for a few seconds” in this case lined up with that shape quite naturally.
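The 1, 2, 4 second shape can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming the RFC 6298 initial RTO of 1 second and a 60 second upper bound; the function name `rto_backoff_waits` and the cap are illustrative choices, and real stacks use their own minimums and limits.

```python
def rto_backoff_waits(initial_rto: float, retransmissions: int,
                      max_rto: float = 60.0) -> list:
    """Return the wait before each retransmission when every attempt times out."""
    waits = []
    rto = initial_rto
    for _ in range(retransmissions):
        waits.append(rto)
        # Back off the timer after a timeout: RTO <- RTO * 2 (capped, assumed 60 s)
        rto = min(rto * 2, max_rto)
    return waits


waits = rto_backoff_waits(1.0, 3)
print(waits)       # [1.0, 2.0, 4.0]
print(sum(waits))  # 7.0 seconds of total stall if three attempts in a row are lost
```

A single lost segment costs one RTO; the longer stalls appear only when the retransmission is lost as well.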
4. What We Looked at During the Investigation
4.1. First Rule Out In-App Stall Causes
Rather than jumping straight to blaming TCP, we first ruled out the typical application-side causes.
| What we checked | Why we looked | Conclusion in this case |
|---|---|---|
| UI thread / worker threads | Check for hangs or mutual waits | Not the primary cause |
| CPU usage | Check for processing delays under high load | Not pegged even during the stalls |
| GC / memory pressure | Check for pauses | The shape of the stall durations did not match |
| Camera SDK calls | Check for waits inside the SDK | Did not match the delays on the wire |
| Packet capture | Check for retransmissions at the communication layer | This is where the cause came into view |
The important thing here is to not pick the culprit based on application-log timestamps alone. In equipment control applications, a wait at the upper layer is sometimes just a reflection of a wait at the lower layer.
4.2. Confirm Retransmissions with a Packet Capture
When we took a packet capture, we could see TCP Retransmission in the time window of the stall, and furthermore that no ACK had come back just before it.
These are the points to look at.
- Is the same `Seq` being retransmitted?
- Does the time gap until the retransmission match the stall duration?
- Does it look like an RTO-expiry wait rather than `Dup ACK` or `Fast Retransmission`?
- Does the problematic connection always show up as the same `tcp.stream`?
When these line up, “TCP is waiting on a retransmission” becomes much more likely than “the app is frozen.”
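The first two checks can also be done mechanically on exported capture data. The following is a rough sketch, assuming you have already extracted the data segments for one direction of one stream into simple records; the `Segment` type, the `find_retransmissions` name, and the 0.2 second threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    time: float   # capture timestamp in seconds
    seq: int      # TCP sequence number
    length: int   # payload length in bytes


def find_retransmissions(segments, min_gap=0.2):
    """Return (seq, first_time, resend_time, gap) for payload resent with the same seq."""
    last_seen = {}
    hits = []
    for seg in sorted(segments, key=lambda s: s.time):
        if seg.length == 0:
            continue  # pure ACKs carry no payload to retransmit
        if seg.seq in last_seen:
            gap = seg.time - last_seen[seg.seq]
            if gap >= min_gap:  # assumed threshold for "looks like a timer wait"
                hits.append((seg.seq, last_seen[seg.seq], seg.time, gap))
        last_seen[seg.seq] = seg.time
    return hits


capture = [
    Segment(10.0, 1000, 64),   # original control command
    Segment(11.0, 1000, 64),   # same seq again, one second later
    Segment(11.01, 1064, 64),  # traffic resumes
]
for seq, first, resend, gap in find_retransmissions(capture):
    print(f"seq={seq} first={first} resend={resend} gap={gap:.3f}s")
```

If the reported gaps line up with the stall durations seen by the application, the retransmission wait is a strong candidate.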
4.3. Look at the Negotiated TCP Options
The next thing we looked at was the SYN / SYN-ACK at connection setup. Timestamps are negotiated in the TCP 3-way handshake, so if TSopt does not appear there, it is not used on that connection.
```mermaid
sequenceDiagram
    participant Host as Host
    participant Cam as Camera side
    Host->>Cam: SYN + TSopt ?
    Cam-->>Host: SYN/ACK + TSopt ?
    Host->>Cam: ACK
    Note over Host,Cam: Only after negotiating here can TSopt be used on subsequent segments
```
If you fiddle with OS settings without looking at this, you end up with another unglamorous accident: “I’m sure I enabled it, but it isn’t taking effect.” The facts on the wire are stronger than the configured values.
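As a rough illustration of what "TSopt is present" means at the byte level, the sketch below walks the options area of a TCP header and looks for the Timestamps option (kind 8, length 10). The function name and the sample bytes are hypothetical; in practice Wireshark does this decoding for you.

```python
TCP_OPT_EOL = 0          # End of Option List
TCP_OPT_NOP = 1          # No-Operation (padding)
TCP_OPT_TIMESTAMPS = 8   # Timestamps option (TSopt)


def parse_timestamps_option(options: bytes):
    """Return (TSval, TSecr) if the Timestamps option is present, else None."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCP_OPT_EOL:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed option
        if kind == TCP_OPT_TIMESTAMPS and length == 10:
            tsval = int.from_bytes(options[i + 2:i + 6], "big")
            tsecr = int.from_bytes(options[i + 6:i + 10], "big")
            return tsval, tsecr
        i += length
    return None


# Sample SYN options: MSS=1460, NOP, NOP, Timestamps(TSval=1000, TSecr=0)
syn_options = bytes([2, 4, 0x05, 0xB4, 1, 1, 8, 10]) \
    + (1000).to_bytes(4, "big") + (0).to_bytes(4, "big")
print(parse_timestamps_option(syn_options))  # (1000, 0)
```

Both the SYN and the SYN/ACK have to carry the option; if either side omits it, it is not used on that connection.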
5. Why RFC1323 Timestamps Help
In practice people still call this “the RFC1323 timestamps,” but the current standard is RFC 7323. This article follows the customary usage and writes RFC1323, while meaning the TCP timestamps option.
5.1. Timestamps Exist for RTTM and PAWS
TCP’s timestamps option is used mainly for two purposes.
- RTTM (Round-Trip Time Measurement)
- PAWS (Protection Against Wrapped Sequences)
What helped in this case was the RTTM side. By having the peer echo the TSval of a transmitted segment back in the ACK’s TSecr, the sender can measure RTT more finely and more accurately.
5.2. They Remove the Ambiguity of RTT Measurement on Retransmission
Once a retransmission happens, without timestamps it becomes ambiguous whether “this ACK is for the original transmission or for the retransmission.” This is the point that Karn’s algorithm is concerned with.
RFC 6298 says you must not take an RTT sample from a retransmitted segment. The reason is that you cannot tell which transmission the ACK is for. With the timestamps option, however, this ambiguity goes away: by looking at the TSecr in the arriving ACK, you can identify which segment, with which TSval, actually got through.
```mermaid
sequenceDiagram
    participant Host as Sender
    participant Cam as Receiver
    Host->>Cam: Seq=N, TSval=1000
    Note over Host,Cam: This segment is lost
    Note over Host: No ACK arrives, so it waits
    Host->>Cam: Retransmit Seq=N, TSval=2000
    Cam-->>Host: ACK, TSecr=2000
    Note over Host: Can tell which transmission this responds to
```
This is the core of the improvement in this case.
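The numbers in the diagram can be turned into a small worked example. This is a sketch under two assumptions: the timestamp clock ticks once per millisecond (the real tick rate is implementation-dependent), and the ACK arrives when the sender's clock reads 2030. The function name `rtt_from_tsecr` is illustrative.

```python
TS_MODULUS = 2 ** 32  # TSval and TSecr are 32-bit values that wrap around


def rtt_from_tsecr(now_ticks: int, tsecr: int) -> int:
    """RTT in timestamp ticks: sender's current clock minus the echoed TSval."""
    return (now_ticks - tsecr) % TS_MODULUS


original_tsval = 1000    # first transmission (lost)
retransmit_tsval = 2000  # retransmission after the timer expired
ack_arrival = 2030       # assumed sender clock when the ACK arrives
ack_tsecr = 2000         # the ACK echoes the retransmission's TSval

# Without timestamps the sender cannot tell which of these is the real RTT.
candidates = sorted({ack_arrival - original_tsval, ack_arrival - retransmit_tsval})
print(candidates)                              # [30, 1030]

# With timestamps the echoed value settles it.
print(rtt_from_tsecr(ack_arrival, ack_tsecr))  # 30
```

Taking 1030 as an RTT sample would inflate the estimate badly, which is why samples from retransmitted segments are discarded when there is no timestamp to disambiguate them.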
5.3. Why We Could Tighten the Wait Time in This Case
In this case, packet loss occurred from time to time, and each occurrence pushed the RTO toward the conservative side. After a timeout the RTO is backed off, and without timestamps it stays backed off until an ACK arrives for a segment that was never retransmitted. With timestamps, the ACK for the retransmission itself yields a valid RTT sample, so the RTO can be recomputed sooner, which limits the time it stays inflated and stale.
Put differently, what we did is not magic that makes TCP faster. It reduces the time TCP keeps watching and waiting longer than necessary.
Of course, RFC 7323 does not claim that “more RTT samples cleanly solve everything.” The degree to which it helps RTO optimization is limited in some respects. Still, the fact that it removes the ambiguity on retransmission can help quite naturally in a system like this one.
There are caveats.
- Parts of this depend on the TCP stack implementation
- Timestamps alone do not make the packet loss itself disappear
- If the physical layer or intermediate devices are at fault, the root cause lies elsewhere
- SACK, NIC drivers, offload settings, and switch-side problems are better examined separately
That said, in a system like this one — where “loss is not zero” but “what really hurts is the multi-second wait” — it can be quite effective.
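The effect described in this section can be sketched with the RFC 6298 estimator. This is a simplified model, not the behavior of any particular TCP stack: the class and function names are hypothetical, the 10 ms RTT is an assumed value, and the 1 second minimum RTO follows the RFC rather than what a given OS actually uses.

```python
class RtoEstimator:
    """Simplified RFC 6298 retransmission timer (illustrative only)."""
    ALPHA, BETA, K = 1 / 8, 1 / 4, 4

    def __init__(self, min_rto: float = 1.0, granularity: float = 0.001):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any measurement
        self.min_rto = min_rto
        self.granularity = granularity

    def on_rtt_sample(self, r: float) -> None:
        if self.srtt is None:  # first measurement
            self.srtt = r
            self.rttvar = r / 2
        else:                  # subsequent measurements
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        computed = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = max(self.min_rto, computed)

    def on_timeout(self) -> None:
        self.rto *= 2  # exponential backoff


def rto_after_recovery(use_timestamps: bool) -> float:
    est = RtoEstimator()
    est.on_rtt_sample(0.010)  # steady state: assumed 10 ms RTT
    est.on_timeout()          # a segment is lost and the timer expires
    # The ACK for the retransmission arrives. Karn's algorithm discards it
    # as an RTT sample unless timestamps identify which transmission it answers.
    if use_timestamps:
        est.on_rtt_sample(0.010)
    return est.rto


print(rto_after_recovery(use_timestamps=False))  # 2.0 (still backed off)
print(rto_after_recovery(use_timestamps=True))   # 1.0 (recomputed from the new sample)
```

In this model the next loss right after a recovery costs 2 seconds without timestamps and 1 second with them; stacks with a lower minimum RTO would show a larger difference.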
6. What We Actually Did
6.1. Enable Timestamps
As the countermeasure, we made sure the timestamps option could be negotiated on both ends of the connection. On Windows systems this is sometimes treated as the RFC 1323 option, and it is affected by OS and network settings.
In practice, however, what matters is not "it is enabled in the settings screen" but "TSopt is actually present on the SYN / SYN-ACK packets on the wire."
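The article does not show the exact setting that was changed, so the following is only one possible way to do it on a Windows host, assuming the `netsh int tcp set global timestamps=...` setting is available on your OS version. The helper names are hypothetical, and the commands are only printed unless you explicitly pass `apply=True` from an elevated prompt.

```python
import platform
import subprocess


def timestamps_command(enabled: bool = True) -> list:
    """Build the netsh command that toggles the TCP timestamps global setting."""
    state = "enabled" if enabled else "disabled"
    return ["netsh", "int", "tcp", "set", "global", f"timestamps={state}"]


def show_global_command() -> list:
    """Build the netsh command that prints the current TCP global settings."""
    return ["netsh", "int", "tcp", "show", "global"]


def configure(apply: bool = False) -> None:
    for cmd in (timestamps_command(True), show_global_command()):
        print(" ".join(cmd))
        # Only touch the system when explicitly asked to, and only on Windows.
        if apply and platform.system() == "Windows":
            subprocess.run(cmd, check=True)


configure(apply=False)  # dry run: prints the commands without executing them
```

Whatever the setting mechanism, the camera side also has to offer or accept the option, which is why the check on the wire comes next.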
6.2. Verify TSopt in the SYN / SYN-ACK
After enabling it, we verified three things.
- Does the SYN of the connection in question carry TSopt?
- Does the SYN/ACK side return TSopt as well?
- Do subsequent data segments and ACKs continue to carry TSopt?
Only once these are confirmed can you say “timestamps are actually being used on that connection.”
6.3. Where to Look When It Still Doesn’t Help
Even with timestamps enabled, the improvement can be limited in cases like these.
- The loss rate itself is high
- An intermediate device breaks, drops, or mangles TCP options
- There is a separate problem around the NIC / driver / offloading
- The application hangs everything off a single synchronous call, so one wait looks like a total stall
- The primary cause is actually not TCP, but a processing stall on the camera side or a clogged queue inside the device
So it is clearest to proceed with countermeasures in this order.
- First confirm the retransmission wait on the wire
- Check whether TSopt is negotiated
- Enable timestamps and measure the improvement delta
- If problems remain, tackle the loss source and the application design separately
7. What to Check in Wireshark
Here are display filters that are handy for isolation.
```
tcp.stream eq <target stream>
tcp.analysis.retransmission
tcp.analysis.fast_retransmission
tcp.analysis.lost_segment
tcp.options.timestamp.tsval
tcp.options.timestamp.tsecr
```
There are a few tricks to reading the results.
- Narrow down to the target connection with `tcp.stream`
- Display `Time delta from previous displayed frame` (`frame.time_delta_displayed`) to see the stalled seconds directly
- Confirm whether `Retransmission` appears at the problematic moment
- Confirm whether TSopt is negotiated in the SYN / SYN-ACK at connection setup
- Check whether `TSecr` is being returned in the ACKs
When correlating logs with packets, also watch out for the offset between application clock and capture clock. If they are skewed, you tend to pin the blame on an unrelated event.
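One simple way to handle the skew is to estimate a constant offset from events that appear in both sources, such as the moment a known command was sent. The sketch below assumes you can pair such events by hand; the function names and sample values are hypothetical.

```python
from statistics import median


def estimate_clock_offset(app_times, capture_times) -> float:
    """Median of (capture time - app log time) over paired events, in seconds."""
    return median(c - a for a, c in zip(app_times, capture_times))


def to_capture_time(app_time: float, offset: float) -> float:
    """Map an application log timestamp onto the capture clock."""
    return app_time + offset


# The same three sends as seen by the application log and by the capture.
app_log = [100.0, 200.0, 300.0]
capture = [102.5, 202.5, 302.5]

offset = estimate_clock_offset(app_log, capture)
print(offset)                          # 2.5
print(to_capture_time(250.0, offset))  # 252.5
```

Using the median keeps one badly paired event from distorting the offset. This treats the skew as constant, which is a reasonable first approximation for a short capture.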
8. A Rough Decision Guide
| Symptom | First suspect | First action |
|---|---|---|
| Occasionally stalls for several seconds | TCP’s RTO wait | Confirm retransmissions and time gaps in packets |
| Stalls at almost the same timing every time | In-app waits, device-side processing, fixed timeouts | Look at threads, SDK calls, device logs |
| Only degrades under high load | CPU, GC, clogged queues | Look at CPU, interrupts, memory, queue lengths |
| Bad across a wide range of connections at once | Physical layer, switches, intermediate devices | Look at NIC, cables, port statistics, intermediate device logs |
| Changed settings but nothing changed | TCP option is not being negotiated | Re-check the SYN / SYN-ACK |
That last row is genuinely common. The satisfaction of having tweaked a setting and the fact of it being used on the wire are two different things.
9. Summary
Key points this time:
- “Occasionally stalls for several seconds” can be TCP’s retransmission wait, not the app freezing
- If the stall duration matches how RTO waits behave and `Retransmission` is visible, you are on the right track
- The TCP timestamps option is a mechanism for RTTM and PAWS, and it removes the ambiguity of RTT measurement on retransmission
- In this case, enabling the RFC1323-family timestamps limited the time the RTO stayed excessively conservative
Approaches to avoid:
- Picking the culprit for a communication stall from application logs alone
- Looking only at OS settings without looking at actual packets
- Assuming that enabling timestamps will also eliminate the cause of the loss
Approaches that work in practice:
- Look at the wire first
- Confirm the shape of the retransmissions and wait times
- Confirm the TSopt negotiation
- Even after improvement, tackle the loss source and the application design separately
In other words, with this class of defect, “pinpointing where it is waiting” comes before “making it faster.” Just by not missing that, the investigation gets considerably shorter.
10. References
- RFC 1323 - TCP Extensions for High Performance
- RFC 7323 - TCP Extensions for High Performance
- RFC 5681 - TCP Congestion Control
- RFC 6298 - Computing TCP’s Retransmission Timer
- [Description of Windows TCP features - Windows Server Microsoft Learn](https://learn.microsoft.com/en-us/troubleshoot/windows-server/networking/description-tcp-features)
Author Profile
Go Komura
Representative of KomuraSoft LLC
Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.