Investigating Long-Run Crashes of an Industrial Camera App - The Handle Leak (Part 1)
· Go Komura · Windows Development, Bug Investigation, Industrial Camera, Handle Leak, Logging Design
When a Windows app suddenly crashes after running for a long time, the first instinct is very often to suspect a memory leak. In reality, however, it is not uncommon for a handle leak to be the main culprit, finally surfacing weeks later as a secondary failure.
This article presents a case where we investigated a Windows app controlling an industrial camera that suddenly crashed after roughly one month of continuous operation. As we narrowed things down, the cause turned out to be a handle leak occurring on the failure path around camera reconnection.
In this first part, we cover what a handle leak is, how we isolated this incident, and what logs you should keep to prevent recurrence. In the second part, When an Industrial Camera Control App Suddenly Crashes After One Month (Part 2) - What Application Verifier Is and How to Build a Failure-Path Test Foundation, we discuss building a failure-path test foundation.
Proper names and some log fields have been redacted, but the way of thinking is broadly shared across Windows equipment control apps in general.
Table of Contents
- The Conclusion First (In One Line)
- What Is a Handle Leak?
- 2.1. What “Handle” Means Here
- 2.2. Why It Tends to Surface Only After Long-Running Operation
- 2.3. How It Differs from a Memory Leak
- Case Study: An Industrial Camera Control App That Suddenly Crashes After One Month
- 3.1. The Symptoms
- 3.2. The Metrics We Looked at First
- 3.3. The Leak That Was the Root Cause
- How We Isolated It
- 4.1. Compress Time Instead of Waiting for a Month-Scale Repro
- 4.2. Read the Slope of Handle Count
- 4.3. Check the Pairing of create/open and close/dispose
- 4.4. For Handle Leaks, Find Where It Leaked, Not Where It Crashed
- The Logs You Need to Prevent Recurrence
- 5.1. The Minimum Set to Keep First
- 5.2. The Logs We Actually Strengthened
- 5.3. At What Granularity to Collect
- A Rough Decision Guide
- Summary
- References
1. The Conclusion First (In One Line)
- In a control app that only crashes after long-running operation, always look at Handle Count, not just Private Bytes
- Handle leaks tend to hide not in the normal path but in the timeout / reconnect / partial-failure / early-return paths
- The line that actually crashes is often the place that could no longer create a new handle later, not the place that leaked it
- The logs you need first are: the operation / session context, the process’s handle count, the open / close pairing of resources, and Win32 / HRESULT / SDK errors
- Rather than waiting for a month-scale repro, it is faster to run the connect-disconnect-reconnect-failure paths thousands of times in a short loop
- Application Verifier, covered in Part 2, is quite effective, but the foundation is being able to trace lifetime breakdowns with your own logs first
In short, the first thing to do on a case like this is not to stare at the fact that “it crashed after a long period,” but to get the growth of resources and the failure paths into an observable form.
By the time a handle leak is found, it usually already wears the face of a secondary failure. So if you only look at the exception at the moment of the crash, you tend to walk off in quite the wrong direction.
2. What Is a Handle Leak?
2.1. What “Handle” Means Here
A handle here is the identifier through which a Windows process references OS resources. Examples of what falls under this include:
| Category | Examples |
|---|---|
| Kernel objects | event, mutex, semaphore, thread, process, waitable timer |
| I/O | opens of files, pipes, sockets, devices |
| Common in equipment control | the camera SDK’s internal events, wait objects tied to callback registrations, acquisition-thread-related handles |
What tends to become a problem in control apps in particular is the pattern of “forgetting to close a resource that was opened temporarily for some operation, on a partial-failure path”.
The typical flow looks like this.
- Create one event on every reconnect
- Callback registration or acquisition start fails partway through
- The success path closes it, but the failure path does not
- Routine short tests only exercise the success path, so it goes unnoticed
This type slips through quite routinely, both in code review and in production.
2.2. Why It Tends to Surface Only After Long-Running Operation
A handle leak does not necessarily break things spectacularly in one shot. What is actually nastier is a small-slope leak, where one failure leaks just one handle.
flowchart LR
A[Normal operation] --> B[Occasional timeout / reconnect]
B --> C[Failure path creates an event handle]
C --> D[CloseHandle is never called]
D --> E[Handle Count creeps up slightly]
E --> F[Repeats hundreds of times]
F --> G[CreateEvent / SDK open fails]
G --> H[Crash / stall somewhere else]
If one reconnect leaks just one handle, nothing happens within minutes. But in an equipment control app running 24/7, boundary conditions like timeouts, re-initializations, and disconnect recovery occur over and over. The result is the odd presentation of a problem that only surfaces weeks later.
What matters here is that the handle leak itself is not necessarily the crashing line. The common modes of breakage are these.
- An API that creates a new event / file / thread fails
- The SDK cannot create a resource it needs internally and returns only a generic failure code
- Error handling after the failure is thin, and the app dereferences a null / invalid handle and crashes
- Timeouts increase, and as a result a watchdog or upstream controller kills the process
In other words, the crash site is the “last victim,” not necessarily the “original culprit.”
2.3. How It Differs from a Memory Leak
For defects after long-running operation, the first suspicion is a memory leak. That instinct is natural, of course, but handle leaks are sometimes faster to find when viewed along a different axis.
| Aspect | Memory leak | Handle leak |
|---|---|---|
| Metrics to check first | Private Bytes, Commit, Working Set | Handle Count |
| Typical symptoms | Memory pressure, paging, slowdowns, OOM | Create* / Open* / SDK internal init failures, secondary failures |
| Where it tends to hide | Caches, retained references, forgotten frees | Asymmetry between create/open and close/dispose |
| How it presents | Memory creeps up | Handle count creeps up and never comes back down |
So when isolating long-run issues, looking only at memory is like driving with one eye closed.
At minimum, watching Handle Count and Thread Count together makes things considerably easier to sort out.
3. Case Study: An Industrial Camera Control App That Suddenly Crashes After One Month
3.1. The Symptoms
The incident was simple.
- A Windows app controlling an industrial camera runs 24/7
- It runs fine normally
- After roughly one month, one day the app suddenly crashes
- After a restart, it runs fine again for a while
The first difficulty is that it takes a long time to crash. Waiting one month per reproduction attempt is brutal as an investigation.
What made it even nastier was that the crash site was not exactly the same each time. Sometimes it was right after a reconnect started, sometimes at acquisition start, sometimes after a failed SDK call.
With that presentation, at first you can suspect any of the following.
- Instability on the camera SDK side
- Transient failures caused by communication or device disconnects
- A memory leak
- A race around threading
- An initialization failure not showing up in the logs
In other words, we had too many “vaguely suspicious” candidates and nothing to rank them by.
3.2. The Metrics We Looked at First
So the first thing we did was look at how the process’s resources as a whole were growing. In this case, the observed trends were roughly as follows.
| Metric | Observed trend | Reading |
|---|---|---|
| Handle Count | Creeps up after reconnects and timeouts, never comes back down | Suspect a handle leak |
| Private Bytes | Fluctuates, but the monotonic-increase slope is weak | The main culprit is not necessarily the heap |
| Thread Count | Essentially flat | A thread leak is unlikely |
| Crash site | Slightly different every time | A secondary failure is likely |
At this point, our focus had narrowed considerably. It was more natural to read the situation not as “it crashes after one month,” but as “something is leaking a little at a time along the way, and as a result it crashes after one month.”
3.3. The Leak That Was the Root Cause
The ultimate cause was a missed close of an event handle created on the initialization-failure path during camera reconnection.
Simplified, the flow looks like this.
sequenceDiagram
participant App as Control app
participant OS as Windows
participant SDK as Camera SDK
App->>OS: CreateEvent
App->>SDK: Register callback
SDK-->>App: Partial failure / timeout
Note over App: Returns on the failure path
Note over App: CloseHandle is never called
loop Repeated reconnects
App->>OS: Handle Count creeps up
end
App->>OS: Next CreateEvent / Open
OS-->>App: Failure
App-->>App: Crashes as a secondary failure
As a code sketch, the leak looks like this.
HANDLE handle = CreateEvent(...);
if (!RegisterCallback(handle))
{
    return Error; // leak: CloseHandle(handle) is missing
}
if (!StartAcquisition())
{
    return Error; // leak: CloseHandle(handle) is missing here too
}
...
CloseHandle(handle); // only reached on the success path
The reason this slips past short tests is also quite easy to see.
- A normal startup -> normal shutdown does close it
- Failures only happen partway through a reconnect
- There is no test that hammers that failure path
- In production, it accumulates a little at a time over weeks
In other words, the structure was: “invisible if you only watch the normal path, but it leaks routinely on the failure paths.”
The fix is not flashy.
- Bring the responsibilities of create/open and close/dispose closer together
- Move release into finally / destructors / a session object so it always happens even on partial failure
- Make ownership explicit around callback registration and acquisition start
- Express “who closes it” through the code’s responsibilities, not comments
This is not so much a special technique as housekeeping that embeds resource lifetimes into the code.
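The list above can be sketched as follows. This is a minimal model rather than the actual fix: `FakeHandleTable` is a hypothetical stand-in for `CreateEvent` / `CloseHandle`, and `register_callback` / `start_acquisition` are placeholder callables for the SDK steps. The point it illustrates is that the release lives next to the creation, so every early return passes through it.

```python
from contextlib import contextmanager


class FakeHandleTable:
    """Hypothetical stand-in for the OS handle table (not a real Win32 API)."""

    def __init__(self):
        self._next = 0x100
        self.open_handles = set()

    def create_event(self):
        handle = self._next
        self._next += 4
        self.open_handles.add(handle)
        return handle

    def close_handle(self, handle):
        self.open_handles.remove(handle)


@contextmanager
def owned_event(table):
    """Create an event and guarantee it is closed on every exit path."""
    handle = table.create_event()
    try:
        yield handle
    finally:
        table.close_handle(handle)


def reconnect(table, register_callback, start_acquisition):
    """One reconnect attempt: True on success, False on partial failure."""
    with owned_event(table) as handle:
        if not register_callback(handle):
            return False  # early return: the finally block still closes the event
        if not start_acquisition():
            return False  # same here, no separate close call to forget
        return True


table = FakeHandleTable()
reconnect(table, lambda h: False, lambda: True)   # registration fails
reconnect(table, lambda h: True, lambda: False)   # acquisition start fails
print(len(table.open_handles))  # nothing left open after either failure
```

In a native codebase the same role would typically be played by a destructor-based wrapper or a `finally` block; the structure, not the language feature, is what matters.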
4. How We Isolated It
4.1. Compress Time Instead of Waiting for a Month-Scale Repro
In this kind of investigation, waiting a month per attempt is a bad approach. What you should do is drive the suspicious paths over and over in a short time.
In this case, we compressed the repro by running a loop like this.
flowchart LR
A[Start] --> B[Open camera]
B --> C[Start acquisition]
C --> D[Simulated timeout / disconnect]
D --> E[Reconnect]
E --> F[Resume acquisition]
F --> G{Repeat N times}
G -- Yes --> D
G -- No --> H[Check the deltas at the end]
The point is to spend your time on the lifetime operations at the boundaries, not on the routine “frames are coming in” periods.
Concretely effective scenarios look like these.
- Run open -> start -> stop -> close in large volumes
- Deliberately trigger timeouts and cycle through reconnects
- Force a failure right after callback registration
- Inject disconnect aborts, reconnect aborts, and shutdown races
You do not need to perfectly reproduce a month of real operation. On the contrary, stepping on the suspected lifetime edge thousands of times gets you much closer to the cause.
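A compressed repro loop of this kind might look like the sketch below. Everything here is illustrative: `hammer` is a hypothetical harness name, and `LeakyCamera` is a toy model that leaks one handle whenever a simulated registration failure occurs (every 5th attempt, chosen arbitrarily). In a real harness, `cycle` would drive the actual reconnect path and `read_handle_count` would read the process's handle count.

```python
def hammer(cycle, read_handle_count, cycles, warmup=10):
    """Run `cycle` repeatedly and report handle growth after a warm-up baseline."""
    for _ in range(warmup):
        cycle()
    baseline = read_handle_count()
    for _ in range(cycles):
        cycle()
    delta = read_handle_count() - baseline
    return {"cycles": cycles, "delta": delta, "per_cycle": delta / cycles}


class LeakyCamera:
    """Toy system under test (an assumption, not the real app)."""

    def __init__(self):
        self.open_handles = 0
        self.attempts = 0

    def reconnect(self):
        self.attempts += 1
        self.open_handles += 1        # models CreateEvent
        if self.attempts % 5 == 0:    # simulated callback registration failure
            return False              # failure path forgets to close
        self.open_handles -= 1        # success path closes
        return True


camera = LeakyCamera()
report = hammer(camera.reconnect, lambda: camera.open_handles, cycles=1000)
print(report)
```

With this toy model, 1000 cycles leave 200 handles behind, which is the kind of signal that would otherwise take weeks of production time to accumulate.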
4.2. Read the Slope of Handle Count
In a handle leak investigation, looking only at absolute values can be confusing. What matters is whether the count comes back down after operations that should release handles, and how many handles are gained per operation.
Roughly the following order works well.
- Establish a baseline after warm-up
- Record Handle Count after each reconnect / start-stop / close
- Look at the delta per cycle
- Also look at the slope aggregated over several cycles
For example, a view like this.
leakSlope =
(currentHandleCount - baselineHandleCount)
/ reconnectCount
Whether an absolute value of 2000 is high or low varies by app. But if it is +1 per reconnect and never comes back, that is quite suspicious.
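As a small worked example of that formula, the sketch below computes both the per-cycle deltas and the aggregated slope from a list of samples. The function names and the sample values are made up for illustration; each sample is assumed to be a `(reconnect_count, handle_count)` pair, with the first entry taken after warm-up as the baseline.

```python
def per_cycle_deltas(samples):
    """Handle-count change between consecutive samples."""
    return [later[1] - earlier[1] for earlier, later in zip(samples, samples[1:])]


def leak_slope(samples):
    """(currentHandleCount - baselineHandleCount) / reconnects since the baseline."""
    base_reconnects, base_handles = samples[0]
    cur_reconnects, cur_handles = samples[-1]
    reconnects = cur_reconnects - base_reconnects
    if reconnects == 0:
        return 0.0
    return (cur_handles - base_handles) / reconnects


# Illustrative samples: (reconnect_count, handle_count)
samples = [(0, 1824), (1, 1825), (2, 1826), (3, 1826), (4, 1828)]
print(per_cycle_deltas(samples))  # [1, 1, 0, 2]
print(leak_slope(samples))        # 1.0
```

The per-cycle view shows which individual operations failed to return handles, while the aggregated slope smooths over cycles where nothing leaked.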
The trick here is to not watch Handle Count alone, but to record at least the following alongside it.
- Handle Count
- Private Bytes
- Thread Count
- ReconnectCount
- Which phase you are currently in
With this, you can tell quite quickly whether “memory is growing,” “threads are growing,” or “resources are not coming back on every reconnect.”
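For reference, one way to sample the process-wide handle count from inside the process is the Win32 function `GetProcessHandleCount`. The sketch below calls it through `ctypes`; the wrapper name `process_handle_count` is my own, and returning `None` on non-Windows platforms is an assumption made so the snippet stays runnable anywhere. Private Bytes and Thread Count would need separate APIs or performance counters and are not covered here.

```python
import ctypes
import sys


def process_handle_count():
    """Return the current process's open handle count, or None off Windows."""
    if sys.platform != "win32":
        return None
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = wintypes.HANDLE
    kernel32.GetProcessHandleCount.argtypes = [
        wintypes.HANDLE,
        ctypes.POINTER(wintypes.DWORD),
    ]
    kernel32.GetProcessHandleCount.restype = wintypes.BOOL

    count = wintypes.DWORD(0)
    ok = kernel32.GetProcessHandleCount(
        kernel32.GetCurrentProcess(), ctypes.byref(count)
    )
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return count.value


print(process_handle_count())
```

Calling this at each operation boundary and logging the result next to the reconnect counter and phase gives the per-cycle series discussed above.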
4.3. Check the Pairing of create/open and close/dispose
Even once you know the process-wide Handle Count is suspicious, that alone does not get you to the leak site.
What you need next is logs that show resource lifecycles as pairs.
For illustration, structured logs along these lines:
CameraSession session=421 cameraId=CAM01 phase=ReconnectStart reason=FrameTimeout handleCount=1824 privateBytesMB=418
CameraResource session=421 resourceId=evt-884 kind=Event name=FrameReady action=Create osHandle=0x00000ABC handleCount=1825
CameraResource session=421 resourceId=evt-884 kind=Event name=FrameReady action=Close osHandle=0x00000ABC handleCount=1824
What matters here is to not rely on osHandle alone.
Windows handle values can be reused later, so in the logs it is easier to trace if you carry at least the following.
- sessionId
- resourceId
- kind
- action (Create / Open / Register / Close / Dispose / Unregister)
- osHandle
- phase
With this in place, it becomes much easier to spot the lopsided flow where a Create exists but no matching Close.
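A pairing check over logs of this shape can be sketched as below. It assumes the `key=value` format shown earlier with no spaces inside values, and it keys on `(session, resourceId)` rather than `osHandle` precisely because handle values can be reused. The function names are hypothetical, and the last log line is an invented example of an unpaired Create, not data from the incident.

```python
OPEN_ACTIONS = {"Create", "Open", "Register"}
CLOSE_ACTIONS = {"Close", "Dispose", "Unregister"}


def parse_fields(line):
    """Split 'Category key=value key=value ...' into a dict of fields."""
    pairs = line.split()[1:]
    return dict(pair.split("=", 1) for pair in pairs if "=" in pair)


def find_unclosed(lines):
    """Return {(session, resourceId): fields} for opens that never saw a close."""
    live = {}
    for line in lines:
        fields = parse_fields(line)
        action = fields.get("action")
        key = (fields.get("session"), fields.get("resourceId"))
        if action in OPEN_ACTIONS:
            live[key] = fields
        elif action in CLOSE_ACTIONS:
            live.pop(key, None)
    return live


log = [
    "CameraSession session=421 cameraId=CAM01 phase=ReconnectStart reason=FrameTimeout handleCount=1824",
    "CameraResource session=421 resourceId=evt-884 kind=Event action=Create osHandle=0x00000ABC handleCount=1825",
    "CameraResource session=421 resourceId=evt-884 kind=Event action=Close osHandle=0x00000ABC handleCount=1824",
    # Invented line for illustration: same osHandle value reused, never closed.
    "CameraResource session=422 resourceId=evt-885 kind=Event action=Create osHandle=0x00000ABC handleCount=1825",
]
leaked = find_unclosed(log)
print(sorted(leaked))  # [('422', 'evt-885')]
```

Lines without an `action` field, such as session boundary logs, are simply skipped, so the same check can run over a mixed log file.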
4.4. For Handle Leaks, Find Where It Leaked, Not Where It Crashed
This point is quite important.
A handle leak often presents like this.
- The crashing line: CreateEvent fails
- The real leak: CloseHandle had been missing on a failure path since days earlier
In other words, the API that finally fell over is the exit of the damage, not necessarily the entrance of the cause.
So the investigation order should be:
- Look at which resource keeps growing
- Look at which operation boundary it fails to come back at
- Find where the pairing of create/open and close/dispose is broken
- Read the crash site last
In this order, you are far less likely to get lost.
5. The Logs You Need to Prevent Recurrence
5.1. The Minimum Set to Keep First
What worked in this investigation was not simply increasing log volume. It was methodically adding “information that lets you reach the cause later.”
At minimum, you want to keep the following.
| Category | Minimum fields | Reason |
|---|---|---|
| Operation context | cameraId, sessionId, operationId, reconnectCount, phase | To tie each event to a specific operation and iteration |
| Process resources | handleCount, privateBytes, workingSet, threadCount | To first isolate what is growing |
| Resource lifecycle | action, resourceId, kind, osHandle, owner | To trace the pairs of create/open and close/dispose |
| External call results | win32Error, HRESULT, sdkError, timeoutMs | To compare failure types later |
| State transitions | OpenStart, OpenDone, ReconnectStart, ReconnectDone, ShutdownStart, etc. | To know during which phase things broke down |
| Execution environment | pid, tid, buildVersion, machineName | To correlate with dumps / symbols / deployed artifacts |
We are not claiming this is sufficient. But without at least this, you easily end up with logs that record nothing more than the fact that “it crashed.”
5.2. The Logs We Actually Strengthened
In this case, we strengthened the logs in the following directions.
- Periodic heartbeat
  - Emit Handle Count / Private Bytes / Thread Count / ReconnectCount every 1-5 minutes
- Boundary logs per camera session
  - OpenStart, CallbackRegistered, AcquisitionStart, TimeoutDetected, ReconnectStart, ReconnectDone, CloseStart, CloseDone
- Resource lifecycle logs
  - Create / Open / Register and Close / Dispose / Unregister for events / threads / files / timers / SDK registration tokens
- Error normalization
  - Do not stop at the exception message; emit win32Error, HRESULT, sdkError, and phase together
What is important is to not change the shape of the logs between success and failure. If failures get a different format, aggregation later becomes painful.
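One way to keep the shape identical is to render every event through a single formatter with a fixed field order, as in the sketch below. The field list, the `format_event` name, the `result` field, and the use of `-` as a placeholder for missing values are all assumptions for illustration; the error values in the example are made up (`0x80004005` is the generic `E_FAIL` HRESULT, and the `sdkError` value is invented).

```python
FIELDS = (
    "session", "cameraId", "phase", "result",
    "win32Error", "hresult", "sdkError", "handleCount",
)


def format_event(category, **values):
    """Render one log line with a fixed field order; missing fields become '-'."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    parts = [f"{name}={values.get(name, '-')}" for name in FIELDS]
    return " ".join([category, *parts])


ok = format_event(
    "CameraSession", session=421, cameraId="CAM01",
    phase="ReconnectDone", result="Ok", handleCount=1824,
)
fail = format_event(
    "CameraSession", session=422, cameraId="CAM01",
    phase="CallbackRegistered", result="Fail",
    win32Error=0, hresult="0x80004005", sdkError=-12, handleCount=1825,
)
print(ok)
print(fail)
```

Because success and failure lines carry the same keys in the same order, later aggregation can treat them as one table and filter on `result` instead of parsing two formats.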
5.3. At What Granularity to Collect
A common trap here is “just dump everything at INFO.” But if you do that, you end up facing a wall of logs when you read them later. That is quite painful.
In terms of granularity, roughly the following split is realistic.
- Periodic monitoring
  - Handle Count, Private Bytes, Thread Count, ReconnectCount
- Operation boundaries
  - Session start / done / fail
- Resource boundaries
  - create/open/register and close/dispose/unregister
- Failure details
  - Error codes, stacks, dump capture triggers
Detailed per-frame logging is usually unnecessary. For long-run defects, logs that let you read “which responsibility opened it, and which responsibility closed it” are far more effective.
6. A Rough Decision Guide
- Crashes only after days to weeks
  - First add a heartbeat for Handle Count / Private Bytes / Thread Count
- There are retries / reconnects / shutdowns
  - Build a harness first that hammers just those boundaries in volume
- Heavy use of native SDKs / P/Invoke / Win32
  - Applying Application Verifier (Part 2) is well worth it
- A GUI lives in the same process
  - In addition to Handle Count, also watch GDI Objects / USER Objects
- The exception at the moment of the crash tells you nothing
  - It is faster to first put operation / session / resource lifecycle structured logs in order
That last item is quite important. In bug investigation, what decides the outcome is often not the analysis technique itself, but whether things are in an observable form.
7. Summary
For an app that only crashes after long-running operation, look at Handle Count, not just memory. Handle leaks tend to hide in the failure paths of abnormal flows rather than the normal path, and the crash site is usually the exit of a secondary failure, not the place that leaked. When it comes to reading the symptoms, it ultimately comes down to these three points.
For prevention, bring the responsibilities of create/open and close/dispose closer together, keep logs that carry context per session / operation, and record both process resources and resource lifecycles. In testing, instead of waiting for a month-scale repro, run timeout / reconnect / shutdown in short loops, and make “traceable when it breaks” — not just “doesn’t break” — the acceptance criterion. What worked in this case was this combination. In Part 2, we use Application Verifier to surface hard-to-trigger failure modes such as memory exhaustion and handle anomalies ahead of time.
In control apps, the normal path working matters, but being able to tell “what happened” when things break counts for a lot in long-term operation.
Handle leaks are exactly the type of defect where that difference pays off. If you look at them through growth rates, boundaries, and responsibility pairs — rather than only at the moment they occur — they become considerably easier to chase.
8. References
Author Profile
Go Komura
Representative of KomuraSoft LLC
Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.