Investigating Long-Run Crashes of an Industrial Camera App - The Handle Leak (Part 1)

When a Windows app suddenly crashes after running for a long time, the first instinct is very often to suspect a memory leak. In reality, however, it is not uncommon for a handle leak to be the main culprit, finally surfacing weeks later as a secondary failure.

This article presents a case where we investigated a Windows app controlling an industrial camera that suddenly crashed after roughly one month of continuous operation. As we narrowed things down, the cause turned out to be a handle leak occurring on the failure path around camera reconnection.

In this first part, we cover what a handle leak is, how we isolated this incident, and what logs you should keep to prevent recurrence. The second part, “When an Industrial Camera Control App Suddenly Crashes After One Month (Part 2) - What Application Verifier Is and How to Build a Failure-Path Test Foundation,” covers building a failure-path test foundation.

Proper names and some log fields have been redacted, but the way of thinking is broadly shared across Windows equipment control apps in general.

1. The Conclusion First (In One Line)
2. What Is a Handle Leak?
- 2.1. What “Handle” Means Here
- 2.2. Why It Tends to Surface Only After Long-Running Operation
- 2.3. How It Differs from a Memory Leak
3. Case Study: An Industrial Camera Control App That Suddenly Crashes After One Month
- 3.1. The Symptoms
- 3.2. The Metrics We Looked at First
- 3.3. The Leak That Was the Root Cause
4. How We Isolated It
- 4.1. Compress Time Instead of Waiting for a Month-Scale Repro
- 4.2. Read the Slope of Handle Count
- 4.3. Check the Pairing of create/open and close/dispose
- 4.4. For Handle Leaks, Find Where It Leaked, Not Where It Crashed
5. The Logs You Need to Prevent Recurrence
- 5.1. The Minimum Set to Keep First
- 5.2. The Logs We Actually Strengthened
- 5.3. At What Granularity to Collect
6. A Rough Decision Guide
7. Summary
8. References

1. The Conclusion First (In One Line)

In a control app that only crashes after long-running operation, always look at Handle Count, not just Private Bytes
Handle leaks tend to hide not in the normal path but in the timeout / reconnect / partial-failure / early-return paths
The line that actually crashes is often the place that could no longer create a new handle later, not the place that leaked it
The logs you need first are: the operation/session context, the process’s handle count, the open/close pairing of resources, and Win32 / HRESULT / SDK errors
Rather than waiting for a month-scale repro, it is faster to run the connect-disconnect-reconnect-failure paths thousands of times in a short loop
Application Verifier, covered in Part 2, is quite effective, but the foundation is being able to trace lifetime breakdowns with your own logs first

In short, the first thing to do on a case like this is not to stare at the fact that “it crashed after a long period,” but to get the growth of resources and the failure paths into an observable form.

By the time a handle leak is found, it usually already wears the face of a secondary failure. So if you only look at the exception at the moment of the crash, you tend to walk off in quite the wrong direction.

2. What Is a Handle Leak?

2.1. What “Handle” Means Here

A handle here is the identifier through which a Windows process references OS resources. Examples of what falls under this include:

| Category | Examples |
| --- | --- |
| Kernel objects | event, mutex, semaphore, thread, process, waitable timer |
| I/O | opens of files, pipes, sockets, devices |
| Common in equipment control | the camera SDK’s internal events, wait objects tied to callback registrations, acquisition-thread-related handles |

What tends to become a problem in control apps in particular is the pattern of “forgetting to close a resource that was opened temporarily for some operation, on a partial-failure path”.

The typical flow looks like this.

Create one event on every reconnect
Callback registration or acquisition start fails partway through
The success path closes it, but the failure path does not
Routine short tests only exercise the success path, so it goes unnoticed

This kind of leak routinely slips through code review and makes it into production.

2.2. Why It Tends to Surface Only After Long-Running Operation

A handle leak does not necessarily break things spectacularly in one shot. What is actually nastier is a small-slope leak, where one failure leaks just one handle.

```mermaid
flowchart LR
    A[Normal operation] --> B[Occasional timeout / reconnect]
    B --> C[Failure path creates an event handle]
    C --> D[CloseHandle is never called]
    D --> E[Handle Count creeps up slightly]
    E --> F[Repeats hundreds of times]
    F --> G[CreateEvent / SDK open fails]
    G --> H[Crash / stall somewhere else]
```

If one reconnect leaks just one handle, nothing happens within minutes. But in an equipment control app running 24/7, boundary conditions like timeouts, re-initializations, and disconnect recovery occur over and over. The result is the odd presentation of a problem that only surfaces weeks later.

What matters here is that the handle leak itself is not necessarily the crashing line. The common modes of breakage are these.

An API that creates a new event / file / thread fails
The SDK cannot create a resource it needs internally and returns only a generic failure code
Error handling after the failure is thin, and the app dereferences a null / invalid handle and crashes
Timeouts increase, and as a result a watchdog or upstream controller kills the process

In other words, the crash site is the “last victim,” not necessarily the “original culprit.”
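The “last victim” effect can be reproduced with a toy model. The following is a minimal Python simulation, not Windows code: `FakeHandleTable`, its `capacity`, and the `reconnect` function are hypothetical names invented for illustration, and the capacity is an arbitrary number rather than a real Windows limit.

```python
class HandleExhausted(Exception):
    """Raised when the toy handle table has no free slots left."""


class FakeHandleTable:
    """Toy stand-in for a per-process handle table with a fixed budget."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.open_handles = set()
        self._next_id = 1

    def create(self, site):
        if len(self.open_handles) >= self.capacity:
            # The failure is reported at whichever call site asked last.
            raise HandleExhausted(site)
        handle = self._next_id
        self._next_id += 1
        self.open_handles.add(handle)
        return handle

    def close(self, handle):
        self.open_handles.discard(handle)


def reconnect(table, callback_ok):
    """Leaky reconnect: the failure path returns without closing the event."""
    event = table.create("reconnect")
    if not callback_ok:
        return False          # leak: the event is never closed
    table.close(event)
    return True


def run_until_failure(capacity=50, fail_every=3):
    """Drive reconnects plus unrelated, balanced work until a create() fails."""
    table = FakeHandleTable(capacity)
    cycle = 0
    try:
        while True:
            cycle += 1
            reconnect(table, callback_ok=(cycle % fail_every != 0))
            # Unrelated code that opens and closes its handle correctly.
            log_file = table.create("open_log_file")
            table.close(log_file)
    except HandleExhausted as exc:
        return cycle, str(exc), len(table.open_handles)
```

With these toy defaults, `run_until_failure()` reports the failure at `open_log_file`, a call site that never leaked anything, while every one of the 50 leaked handles came from `reconnect`.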

2.3. How It Differs from a Memory Leak

For defects after long-running operation, the first suspicion is a memory leak. That instinct is natural, of course, but handle leaks are sometimes faster to find when viewed along a different axis.

| Aspect | Memory leak | Handle leak |
| --- | --- | --- |
| Metrics to check first | `Private Bytes`, `Commit`, `Working Set` | `Handle Count` |
| Typical symptoms | Memory pressure, paging, slowdowns, OOM | `Create` / `Open` / SDK internal init failures, secondary failures |
| Where it tends to hide | Caches, retained references, forgotten frees | Asymmetry between `create/open` and `close/dispose` |
| How it presents | Memory creeps up | Handle count creeps up and never comes back down |

So when isolating long-run issues, looking only at memory is like driving with one eye closed. At minimum, watching Handle Count and Thread Count together makes things considerably easier to sort out.

3. Case Study: An Industrial Camera Control App That Suddenly Crashes After One Month

3.1. The Symptoms

The incident was simple.

A Windows app controlling an industrial camera runs 24/7
It runs fine normally
After roughly one month, one day the app suddenly crashes
After a restart, it runs fine again for a while

The first difficulty is that it takes a long time to crash. Waiting one month per reproduction attempt makes for a brutally slow investigation.

What made it even nastier was that the crash site was not exactly the same each time. Sometimes it was right after a reconnect started, sometimes at acquisition start, sometimes after a failed SDK call.

With that presentation, at first you can suspect any of the following.

Instability on the camera SDK side
Transient failures caused by communication or device disconnects
A memory leak
A race around threading
An initialization failure not showing up in the logs

In other words, we had too many “vaguely suspicious” candidates.

3.2. The Metrics We Looked at First

So the first thing we did was look at how the process’s resources as a whole were growing. In this case, the observed trends were roughly as follows.

| Metric | Observed trend | Reading |
| --- | --- | --- |
| `Handle Count` | Creeps up after reconnects and timeouts, never comes back down | Suspect a handle leak |
| `Private Bytes` | Fluctuates, but the monotonic-increase slope is weak | The main culprit is not necessarily the heap |
| `Thread Count` | Essentially flat | A thread leak is unlikely |
| Crash site | Slightly different every time | A secondary failure is likely |

At this point, our focus had narrowed considerably. It was more natural to read the situation not as “it crashes after one month,” but “something is leaking a little at a time along the way, and as a result it crashes after one month.”

3.3. The Leak That Was the Root Cause

The ultimate cause was a missed close of an event handle created on the initialization-failure path during camera reconnection.

Simplified, the flow looks like this.

```mermaid
sequenceDiagram
    participant App as Control app
    participant OS as Windows
    participant SDK as Camera SDK

    App->>OS: CreateEvent
    App->>SDK: Register callback
    SDK-->>App: Partial failure / timeout
    Note over App: Returns on the failure path
    Note over App: CloseHandle is never called

    loop Repeated reconnects
        App->>OS: Handle Count creeps up
    end

    App->>OS: Next CreateEvent / Open
    OS-->>App: Failure
    App-->>App: Crashes as a secondary failure
```

As a code sketch, the leak looks like this.

```
handle = CreateEvent(...)

if (!RegisterCallback(handle))
{
    return Error;   // leak: CloseHandle(handle) is missing
}

if (!StartAcquisition())
{
    return Error;   // leak: close is missing here too
}

...
CloseHandle(handle)
```

The reason this slips past short tests is also quite easy to see.

A normal startup -> normal shutdown does close it
Failures only happen partway through a reconnect
There is no test that hammers that failure path
In production, it accumulates a little at a time over weeks

In other words, the structure was: “invisible if you only watch the normal path, but it leaks routinely on the failure paths.”

The fix is not flashy.

Bring the responsibilities of create/open and close/dispose closer together
Move release into finally / destructors / a session object so it always happens even on partial failure
Make ownership explicit around callback registration and acquisition start
Express “who closes it” through the code’s responsibilities, not comments

This is not so much a special technique as housekeeping that embeds resource lifetimes into the code.
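As one possible shape for “release always happens, even on partial failure,” here is a minimal Python sketch. It is an illustration under assumptions, not the code from the actual app: `create_event`, `close_handle`, `register_callback`, `start_acquisition`, and `SdkError` are hypothetical stand-ins, and the `open_handles` set merely imitates handles held by the process.

```python
from contextlib import contextmanager

open_handles = set()      # toy stand-in for handles the process holds
_handle_ids = iter(range(1, 1_000_000))


class SdkError(Exception):
    """Hypothetical error raised by the stand-in SDK calls below."""


def create_event():
    handle = next(_handle_ids)
    open_handles.add(handle)
    return handle


def close_handle(handle):
    open_handles.discard(handle)


@contextmanager
def owned_event():
    """Own the event for the duration of the with-block, on every exit path."""
    handle = create_event()
    try:
        yield handle
    finally:
        close_handle(handle)


def register_callback(handle, ok):
    if not ok:
        raise SdkError("callback registration failed")


def start_acquisition(ok):
    if not ok:
        raise SdkError("acquisition start failed")


def reconnect(callback_ok=True, acquisition_ok=True):
    """Return True on success, False on failure; the event is closed either way."""
    try:
        with owned_event() as event:
            register_callback(event, callback_ok)
            start_acquisition(acquisition_ok)
            # Frame acquisition would run here while the event is owned.
            return True
    except SdkError:
        return False
```

The design choice is that creation and release live in one place (`owned_event`), so adding a new early return or a new failure branch to `reconnect` cannot reintroduce the leak. In C++ the same role would typically be played by a destructor, and in C# by `using` or `finally`.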

4. How We Isolated It

4.1. Compress Time Instead of Waiting for a Month-Scale Repro

In this kind of investigation, waiting a month per attempt is a bad approach. What you should do is drive the suspicious paths over and over in a short time.

In this case, we compressed the repro by running a loop like this.

```mermaid
flowchart LR
    A[Start] --> B[Open camera]
    B --> C[Start acquisition]
    C --> D[Simulated timeout / disconnect]
    D --> E[Reconnect]
    E --> F[Resume acquisition]
    F --> G{Repeat N times}
    G -- Yes --> D
    G -- No --> H[Check the deltas at the end]
```

The point is to spend your time on the lifetime operations at the boundaries, not on the routine “frames are coming in” periods.

Concretely effective scenarios look like these.

Run open -> start -> stop -> close in large volumes
Deliberately trigger timeouts and cycle through reconnects
Force a failure right after callback registration
Inject disconnect aborts, reconnect aborts, and shutdown races

You do not need to perfectly reproduce a month of real operation. On the contrary, stepping on the suspected lifetime edge thousands of times gets you much closer to the cause.
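A compressed-repro harness along these lines can be very small. The sketch below is a Python toy under stated assumptions: `FakeCamera` is a hypothetical stand-in for the real camera wrapper, its `open_handles` counter imitates the process handle count, and `hammer` with its `fail_every` parameter is an invented name for the loop that injects a failure every N cycles.

```python
class FakeCamera:
    """Hypothetical camera wrapper; open_handles imitates the handle count."""

    def __init__(self, leak_on_failure):
        self.leak_on_failure = leak_on_failure
        self.open_handles = 0

    def reconnect(self, inject_failure):
        self.open_handles += 1              # event created for this attempt
        if inject_failure:
            if not self.leak_on_failure:
                self.open_handles -= 1      # failure path cleans up
            return False
        self.open_handles -= 1              # success path cleans up
        return True


def hammer(camera, cycles, fail_every, warmup=10):
    """Run many reconnect cycles with injected failures and report the delta."""
    for _ in range(warmup):
        camera.reconnect(inject_failure=False)
    baseline = camera.open_handles          # baseline taken after warm-up

    failures = 0
    for i in range(1, cycles + 1):
        inject = (i % fail_every == 0)
        failures += inject
        camera.reconnect(inject_failure=inject)

    delta = camera.open_handles - baseline
    return {
        "cycles": cycles,
        "failures": failures,
        "delta": delta,
        "per_failure": delta / failures if failures else 0.0,
    }
```

Running `hammer(FakeCamera(leak_on_failure=True), cycles=5000, fail_every=5)` yields a delta of 1000 handles, exactly one per injected failure, whereas the non-leaking variant returns to its baseline. Against a real app, the same loop would read the actual process handle count instead of a counter.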

4.2. Read the Slope of `Handle Count`

In a handle leak investigation, looking only at absolute values can be confusing. What matters is whether the count comes back down after operations that should return it, and how many handles you gain per how many operations.

Roughly the following order works well.

Establish a baseline after warm-up
Record Handle Count after each reconnect / start-stop / close
Look at the delta per cycle
Also look at the slope aggregated over several cycles

For example, compute something like this.

```
leakSlope =
    (currentHandleCount - baselineHandleCount)
    / reconnectCount
```

Whether an absolute value of 2000 is high or low varies by app. But if it is +1 per reconnect and never comes back, that is quite suspicious.
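The same calculation as a small Python sketch. The sample values are made up for illustration; in a real app they would come from whatever counter you record after each cycle (for example, a value read via the Win32 `GetProcessHandleCount` function). The helper names `per_cycle_deltas` and `leak_slope` are assumptions of this sketch.

```python
def per_cycle_deltas(samples):
    """Difference between consecutive handle-count samples, one per cycle."""
    return [after - before for before, after in zip(samples, samples[1:])]


def leak_slope(baseline_handle_count, current_handle_count, reconnect_count):
    """Average number of handles gained per reconnect since the baseline."""
    if reconnect_count <= 0:
        raise ValueError("reconnect_count must be positive")
    return (current_handle_count - baseline_handle_count) / reconnect_count


# Made-up samples: baseline after warm-up, then one sample per reconnect.
samples = [1824, 1825, 1825, 1826, 1827]

deltas = per_cycle_deltas(samples)                               # [1, 0, 1, 1]
slope = leak_slope(samples[0], samples[-1], len(samples) - 1)    # 0.75
```

The per-cycle deltas show which cycles gained a handle, while the aggregated slope smooths over cycles that happened to succeed. A slope that stays positive across many cycles is the signal to chase, regardless of the absolute count.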

The trick here is to not watch Handle Count alone, but to record at least the following alongside it.

Handle Count
Private Bytes
Thread Count
ReconnectCount
Which phase you are currently in

With this, you can tell quite quickly whether “memory is growing,” “threads are growing,” or “resources are not coming back on every reconnect.”

4.3. Check the Pairing of `create/open` and `close/dispose`

Even once you know the process-wide Handle Count is suspicious, that alone does not get you to the leak site. What you need next is logs that show resource lifecycles as pairs.

For illustration, structured logs along these lines.

```
CameraSession session=421 cameraId=CAM01 phase=ReconnectStart reason=FrameTimeout handleCount=1824 privateBytesMB=418

CameraResource session=421 resourceId=evt-884 kind=Event name=FrameReady action=Create osHandle=0x00000ABC handleCount=1825

CameraResource session=421 resourceId=evt-884 kind=Event name=FrameReady action=Close osHandle=0x00000ABC handleCount=1824
```

What matters here is to not rely on osHandle alone. Windows handle values can be reused later, so in the logs it is easier to trace if you carry at least the following.

sessionId
resourceId
kind
action (Create/Open/Register/Close/Dispose/Unregister)
osHandle
phase

With this in place, it becomes much easier to spot the lopsided flow where a Create exists but no matching Close.
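Once logs carry these fields, finding unpaired resources can be automated. The following Python sketch assumes the `key=value` line format shown in the sample logs above and assumes values contain no spaces. The session 422 lines in `SAMPLE_LOG` are invented for illustration, and `parse_line` and `find_unclosed` are hypothetical helper names.

```python
OPEN_ACTIONS = {"Create", "Open", "Register"}
CLOSE_ACTIONS = {"Close", "Dispose", "Unregister"}

# Session 421 is a balanced pair; session 422 is a made-up leaked resource
# that happens to reuse the same osHandle value.
SAMPLE_LOG = """\
CameraResource session=421 resourceId=evt-884 kind=Event name=FrameReady action=Create osHandle=0x00000ABC handleCount=1825
CameraResource session=421 resourceId=evt-884 kind=Event name=FrameReady action=Close osHandle=0x00000ABC handleCount=1824
CameraSession session=422 cameraId=CAM01 phase=ReconnectStart reason=FrameTimeout handleCount=1824 privateBytesMB=418
CameraResource session=422 resourceId=evt-885 kind=Event name=FrameReady action=Create osHandle=0x00000ABC handleCount=1825
"""


def parse_line(line):
    """Split 'RecordType key=value ...' into a dict of fields."""
    record_type, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    fields["record"] = record_type
    return fields


def find_unclosed(lines):
    """Return resources that were opened but never closed, keyed by identity."""
    open_resources = {}
    for line in lines:
        if not line.strip():
            continue
        fields = parse_line(line)
        if fields["record"] != "CameraResource":
            continue
        # Key on session + resourceId, not osHandle, because the OS may
        # hand out the same handle value again after a close.
        key = (fields["session"], fields["resourceId"])
        if fields["action"] in OPEN_ACTIONS:
            open_resources[key] = fields
        elif fields["action"] in CLOSE_ACTIONS:
            open_resources.pop(key, None)
    return open_resources


unclosed = find_unclosed(SAMPLE_LOG.splitlines())
```

Here `unclosed` contains only `("422", "evt-885")`. Had the pairing been keyed on `osHandle`, the reused value `0x00000ABC` would have made the two resources indistinguishable.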

4.4. For Handle Leaks, Find Where It Leaked, Not Where It Crashed

This point is quite important.

A handle leak often presents like this.

The crashing line: CreateEvent fails
The real leak: CloseHandle had been missing on a failure path since days earlier

In other words, the API that finally fell over is the exit of the damage, not necessarily the entrance of the cause.

So the investigation order should be:

Look at which resource keeps growing
Look at which operation boundary it fails to come back at
Find where the pairing of create/open and close/dispose is broken
Read the crash site last

In this order, you are far less likely to get lost.

5. The Logs You Need to Prevent Recurrence

5.1. The Minimum Set to Keep First

What worked in this investigation was not simply increasing log volume. It was methodically adding “information that lets you reach the cause later.”

At minimum, you want to keep the following.

| Category | Minimum fields wanted | Reason |
| --- | --- | --- |
| Operation context | `cameraId`, `sessionId`, `operationId`, `reconnectCount`, `phase` | To tie the event to which operation, on which iteration |
| Process resources | `handleCount`, `privateBytes`, `workingSet`, `threadCount` | To first isolate what is growing |
| Resource lifecycle | `action`, `resourceId`, `kind`, `osHandle`, `owner` | To trace the pairs of `create/open` and `close/dispose` |
| External call results | `win32Error`, `HRESULT`, `sdkError`, `timeoutMs` | To compare failure types later |
| State transitions | `OpenStart`, `OpenDone`, `ReconnectStart`, `ReconnectDone`, `ShutdownStart`, etc. | To know mid-which-phase things broke down |
| Execution environment | `pid`, `tid`, `buildVersion`, `machineName` | To correlate with dumps / symbols / deployed artifacts |

We are not claiming this is sufficient. But without at least this, you easily end up with logs that record nothing more than the fact that “it crashed.”

5.2. The Logs We Actually Strengthened

In this case, we strengthened the logs in the following directions.

Periodic heartbeat
- Emit Handle Count / Private Bytes / Thread Count / ReconnectCount every 1-5 minutes
Boundary logs per camera session
- OpenStart
- CallbackRegistered
- AcquisitionStart
- TimeoutDetected
- ReconnectStart
- ReconnectDone
- CloseStart
- CloseDone
Resource lifecycle logs
- Create/Open/Register and Close/Dispose/Unregister for events / threads / files / timers / SDK registration tokens
Error normalization
- Do not stop at the exception message; emit win32Error, HRESULT, sdkError, and phase together

What is important is to not change the shape of the logs between success and failure. If failures get a different format, aggregation later becomes painful.
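One way to keep the shape identical is to route success and failure through the same formatter, with error fields defaulting to zero. This is a minimal Python sketch, not the logging code from the actual app: `format_boundary`, `field_names`, and the `result` field are assumptions of this example.

```python
def format_boundary(phase, session, camera_id, handle_count,
                    win32_error=0, hresult=0, sdk_error=0):
    """Format one boundary log line; the field set never changes."""
    failed = any((win32_error, hresult, sdk_error))
    fields = {
        "session": session,
        "cameraId": camera_id,
        "phase": phase,
        "result": "fail" if failed else "ok",
        "handleCount": handle_count,
        "win32Error": win32_error,
        "hresult": f"0x{hresult & 0xFFFFFFFF:08X}",
        "sdkError": sdk_error,
    }
    return "CameraSession " + " ".join(f"{k}={v}" for k, v in fields.items())


def field_names(line):
    """Names of the key=value fields in a formatted line, in order."""
    return [pair.split("=", 1)[0] for pair in line.split()[1:]]


ok_line = format_boundary("ReconnectDone", 421, "CAM01", 1824)
# 1450 is ERROR_NO_SYSTEM_RESOURCES; 0x800705AA is the matching HRESULT.
fail_line = format_boundary("ReconnectDone", 421, "CAM01", 1825,
                            win32_error=1450, hresult=0x800705AA)
```

Because both lines carry the same fields in the same order, a later aggregation can filter on `result=fail` and group by `phase` or `win32Error` without a second parser for the failure case.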

5.3. At What Granularity to Collect

A common trap here is “just dump everything at INFO.” But if you do that, you end up facing a wall of logs when you read them later. That is quite painful.

In terms of granularity, roughly the following split is realistic.

Periodic monitoring
- Handle Count, Private Bytes, Thread Count, ReconnectCount
Operation boundaries
- Session start / done / fail
Resource boundaries
- create/open/register and close/dispose/unregister
Failure details
- Error codes, stacks, dump capture triggers

Detailed per-frame logging is usually unnecessary. For long-run defects, logs that let you read “which responsibility opened it, and which responsibility closed it” are far more effective.

6. A Rough Decision Guide

Crashes only after days to weeks
- First add a heartbeat for Handle Count / Private Bytes / Thread Count
There are retries / reconnects / shutdowns
- Build a harness first that hammers just those boundaries in volume
Heavy use of native SDKs / P/Invoke / Win32
- Applying Application Verifier (Part 2) is well worth it
A GUI lives in the same process
- In addition to Handle Count, also watch GDI Objects / USER Objects
The exception at the moment of the crash tells you nothing
- It is faster to first put operation / session / resource lifecycle structured logs in order

That last item is quite important. In bug investigation, what decides the outcome is often not the analysis technique itself, but whether things are in an observable form.

7. Summary

For an app that only crashes after long-running operation, look at Handle Count, not just memory. Handle leaks tend to hide in the failure paths of abnormal flows rather than the normal path, and the crash site is usually the exit of a secondary failure, not the place that leaked. When it comes to reading the symptoms, it ultimately comes down to these three points.

For prevention, bring the responsibilities of create/open and close/dispose closer together, keep logs that carry context per session / operation, and record both process resources and resource lifecycles. In testing, instead of waiting for a month-scale repro, run timeout / reconnect / shutdown in short loops, and make “traceable when it breaks,” not just “doesn’t break,” the acceptance criterion. What worked in this case was this combination. In Part 2, we use Application Verifier to surface hard-to-trigger failure modes such as memory exhaustion and handle anomalies ahead of time.

In control apps, the normal path working matters, but being able to tell “what happened” when things break counts for a lot in long-term operation.

Handle leaks are exactly the type of defect where that difference pays off. If you look at them through growth rates, boundaries, and responsibility pairs, rather than only at the moment they occur, they become considerably easier to chase.

Part 2: When an Industrial Camera Control App Suddenly Crashes After One Month (Part 2) - What Application Verifier Is and How to Build a Failure-Path Test Foundation

8. References

Building a Windows Failure-Path Test Foundation with Application Verifier
Why TCP Retransmissions Stall Industrial Camera Communication, and How to Isolate Them
Windows App Outsourcing and Contract Development: What to Sort Out Before You Ask
Designing Windows Apps to Leave Logs and Dumps When They Crash
An Introduction to Collecting Windows Crash Dumps - WER/ProcDump/WinDbg

Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
