Investigating Long-Run Crashes of an Industrial Camera App - The Handle Leak (Part 1)

Tags: Windows Development, Bug Investigation, Industrial Camera, Handle Leak, Logging Design

When a Windows app suddenly crashes after running for a long time, the first instinct is very often to suspect a memory leak. In reality, however, it is not uncommon for a handle leak to be the main culprit, finally surfacing weeks later as a secondary failure.

This article presents a case where we investigated a Windows app controlling an industrial camera that suddenly crashed after roughly one month of continuous operation. As we narrowed things down, the cause turned out to be a handle leak occurring on the failure path around camera reconnection.

In this first part, we cover what a handle leak is, how we isolated this incident, and what logs you should keep to prevent recurrence. The second part, “When an Industrial Camera Control App Suddenly Crashes After One Month (Part 2) - What Application Verifier Is and How to Build a Failure-Path Test Foundation,” covers building a failure-path test foundation.

Proper names and some log fields have been redacted, but the way of thinking is broadly shared across Windows equipment control apps in general.

Table of Contents

  1. The Conclusion First (In One Line)
  2. What Is a Handle Leak?
    • 2.1. What “Handle” Means Here
    • 2.2. Why It Tends to Surface Only After Long-Running Operation
    • 2.3. How It Differs from a Memory Leak
  3. Case Study: An Industrial Camera Control App That Suddenly Crashes After One Month
    • 3.1. The Symptoms
    • 3.2. The Metrics We Looked at First
    • 3.3. The Leak That Was the Root Cause
  4. How We Isolated It
    • 4.1. Compress Time Instead of Waiting for a Month-Scale Repro
    • 4.2. Read the Slope of Handle Count
    • 4.3. Check the Pairing of create/open and close/dispose
    • 4.4. For Handle Leaks, Find Where It Leaked, Not Where It Crashed
  5. The Logs You Need to Prevent Recurrence
    • 5.1. The Minimum Set to Keep First
    • 5.2. The Logs We Actually Strengthened
    • 5.3. At What Granularity to Collect
  6. A Rough Decision Guide
  7. Summary

1. The Conclusion First (In One Line)

  • In a control app that only crashes after long-running operation, always look at Handle Count, not just Private Bytes
  • Handle leaks tend to hide not in the normal path but in the timeout / reconnect / partial-failure / early-return paths
  • The line that actually crashes is often the place that could no longer create a new handle later, not the place that leaked it
  • The logs you need first are: the operation/session context, the process’s handle count, the open/close pairing of resources, and Win32 / HRESULT / SDK errors
  • Rather than waiting for a month-scale repro, it is faster to run the connect-disconnect-reconnect-failure paths thousands of times in a short loop
  • Application Verifier, covered in Part 2, is quite effective, but the foundation is being able to trace lifetime breakdowns with your own logs first

In short, the first thing to do on a case like this is not to stare at the fact that “it crashed after a long period,” but to get the growth of resources and the failure paths into an observable form.

By the time a handle leak is found, it usually already wears the face of a secondary failure. So if you only look at the exception at the moment of the crash, you tend to walk off in quite the wrong direction.

2. What Is a Handle Leak?

2.1. What “Handle” Means Here

A handle here is the identifier through which a Windows process references OS resources. Examples of what falls under this include:

| Category | Examples |
|---|---|
| Kernel objects | event, mutex, semaphore, thread, process, waitable timer |
| I/O | opens of files, pipes, sockets, devices |
| Common in equipment control | the camera SDK’s internal events, wait objects tied to callback registrations, acquisition-thread-related handles |

What tends to become a problem in control apps in particular is the pattern of “forgetting to close a resource that was opened temporarily for some operation, on a partial-failure path”.

The typical flow looks like this.

  • Create one event on every reconnect
  • Callback registration or acquisition start fails partway through
  • The success path closes it, but the failure path does not
  • Routine short tests only exercise the success path, so it goes unnoticed

This type slips through quite routinely, both in code review and in production.

2.2. Why It Tends to Surface Only After Long-Running Operation

A handle leak does not necessarily break things spectacularly in one shot. What is actually nastier is a small-slope leak, where one failure leaks just one handle.

Normal operation -> Occasional timeout / reconnect -> Failure path creates an event handle -> CloseHandle is never called -> Handle Count creeps up slightly -> Repeats hundreds of times -> CreateEvent / SDK open fails -> Crash / stall somewhere else

If one reconnect leaks just one handle, nothing happens within minutes. But in an equipment control app running 24/7, boundary conditions like timeouts, re-initializations, and disconnect recovery occur over and over. The result is the odd presentation of a problem that only surfaces weeks later.

What matters here is that the handle leak itself is not necessarily the crashing line. The common modes of breakage are these.

  • An API that creates a new event / file / thread fails
  • The SDK cannot create a resource it needs internally and returns only a generic failure code
  • Error handling after the failure is thin, and the app dereferences a null / invalid handle and crashes
  • Timeouts increase, and as a result a watchdog or upstream controller kills the process

In other words, the crash site is the “last victim,” not necessarily the “original culprit.”

2.3. How It Differs from a Memory Leak

For defects after long-running operation, the first suspicion is a memory leak. That instinct is natural, of course, but handle leaks are sometimes faster to find when viewed along a different axis.

| Aspect | Memory leak | Handle leak |
|---|---|---|
| Metrics to check first | Private Bytes, Commit, Working Set | Handle Count |
| Typical symptoms | Memory pressure, paging, slowdowns, OOM | Create* / Open* / SDK internal init failures, secondary failures |
| Where it tends to hide | Caches, retained references, forgotten frees | Asymmetry between create/open and close/dispose |
| How it presents | Memory creeps up | Handle count creeps up and never comes back down |

So when isolating long-run issues, looking only at memory is like driving with one eye closed. At minimum, watching Handle Count and Thread Count together makes things considerably easier to sort out.

3. Case Study: An Industrial Camera Control App That Suddenly Crashes After One Month

3.1. The Symptoms

The incident was simple.

  • A Windows app controlling an industrial camera runs 24/7
  • It runs fine normally
  • After roughly one month, the app suddenly crashes without warning
  • After a restart, it runs fine again for a while

The first difficulty is that it takes a long time to crash. Waiting one month per reproduction attempt is brutal as an investigation.

What made it even nastier was that the crash site was not exactly the same each time. Sometimes it was right after a reconnect started, sometimes at acquisition start, sometimes after a failed SDK call.

With that presentation, at first you can suspect any of the following.

  • Instability on the camera SDK side
  • Transient failures caused by communication or device disconnects
  • A memory leak
  • A race around threading
  • An initialization failure not showing up in the logs

In other words, we were in a state of too many “vaguely suspicious” candidates.

3.2. The Metrics We Looked at First

So the first thing we did was look at how the process’s resources as a whole were growing. In this case, the observed trends were roughly as follows.

| Metric | Observed trend | Reading |
|---|---|---|
| Handle Count | Creeps up after reconnects and timeouts, never comes back down | Suspect a handle leak |
| Private Bytes | Fluctuates, but the monotonic-increase slope is weak | The main culprit is not necessarily the heap |
| Thread Count | Essentially flat | A thread leak is unlikely |
| Crash site | Slightly different every time | A secondary failure is likely |

At this point, our focus had narrowed considerably. It was more natural to read the situation not as “it crashes after one month,” but as “something is leaking a little at a time along the way, and as a result it crashes after one month.”

3.3. The Leak That Was the Root Cause

The ultimate cause was a missed close of an event handle created on the initialization-failure path during camera reconnection.

Simplified, the flow looks like this.

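Participants: Control app, Windows, Camera SDK

  loop [Repeated reconnects]
    Control app -> Windows: CreateEvent
    Control app -> Camera SDK: Register callback
    Camera SDK -> Control app: Partial failure / timeout
    Control app: Returns on the failure path; CloseHandle is never called
    Windows: Handle Count creeps up
  end loop
  Control app -> Windows: Next CreateEvent / Open
  Windows -> Control app: Failure
  Control app: Crashes as a secondary failure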

As a code sketch, the leak looks like this.

handle = CreateEvent(...);

if (!RegisterCallback(handle))
{
    return Error;   // leak: CloseHandle(handle) is missing
}

if (!StartAcquisition())
{
    return Error;   // leak: close is missing here too
}

...
CloseHandle(handle);   // only the success path reaches this line

The reason this slips past short tests is also quite easy to see.

  • A normal startup -> normal shutdown does close it
  • Failures only happen partway through a reconnect
  • There is no test that hammers that failure path
  • In production, it accumulates a little at a time over weeks

In other words, the structure was: “invisible if you only watch the normal path, but it leaks routinely on the failure paths.”

The fix is not flashy.

  • Bring the responsibilities of create/open and close/dispose closer together
  • Move release into finally / destructors / a session object so it always happens even on partial failure
  • Make ownership explicit around callback registration and acquisition start
  • Express “who closes it” through the code’s responsibilities, not comments

This is not so much a special technique as housekeeping that embeds resource lifetimes into the code.
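As a rough illustration of "release always happens, even on partial failure," here is a minimal Python sketch. It is not the code from this case: `FakeHandleTable`, `reconnect_leaky`, and `reconnect_fixed` are hypothetical names, and the handle table is simulated so the sketch runs on any machine. The point is only that the release is registered right next to the creation, so every exit path passes through it.

```python
from contextlib import ExitStack


class FakeHandleTable:
    """Hypothetical stand-in for the OS handle table (not a real Win32 API)."""

    def __init__(self):
        self.open_handles = set()
        self._next = 1

    def create_event(self):
        handle = self._next
        self._next += 1
        self.open_handles.add(handle)
        return handle

    def close_handle(self, handle):
        self.open_handles.discard(handle)


def reconnect_leaky(table, register_callback, start_acquisition):
    """Mirrors the leaky flow: the early returns skip the close."""
    handle = table.create_event()
    if not register_callback(handle):
        return False              # leak: handle stays open
    if not start_acquisition():
        return False              # leak: handle stays open
    table.close_handle(handle)
    return True


def reconnect_fixed(table, register_callback, start_acquisition):
    """Same steps, but the close is registered next to the create."""
    with ExitStack() as cleanup:
        handle = table.create_event()
        # Runs on every exit path: normal return, early return, exception.
        cleanup.callback(table.close_handle, handle)
        if not register_callback(handle):
            return False
        if not start_acquisition():
            return False
        return True
```

In a native or .NET codebase the same idea would map to a destructor-based wrapper or a `finally` / disposable session object; the structure, not the language, is what matters here.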

4. How We Isolated It

4.1. Compress Time Instead of Waiting for a Month-Scale Repro

In this kind of investigation, waiting a month per attempt is a bad approach. What you should do is drive the suspicious paths over and over in a short time.

In this case, we compressed the repro by running a loop like this.

Start -> Open camera -> Start acquisition -> Simulated timeout / disconnect -> Reconnect -> Resume acquisition -> Repeat N times (loop) -> Check the deltas at the end

The point is to spend your time on the lifetime operations at the boundaries, not on the routine “frames are coming in” periods.

Concretely effective scenarios look like these.

  • Run open -> start -> stop -> close in large volumes
  • Deliberately trigger timeouts and cycle through reconnects
  • Force a failure right after callback registration
  • Inject disconnect aborts, reconnect aborts, and shutdown races

You do not need to perfectly reproduce a month of real operation. On the contrary, stepping on the suspected lifetime edge thousands of times gets you much closer to the cause.
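The shape of such a harness can be sketched as follows. This is an assumption-laden toy, not the harness we actually ran: `FakeCamera` is a hypothetical wrapper with a deliberate leak on its failure path, and `run_reconnect_stress` and its parameters are illustrative names. A real harness would inject failures into the actual SDK wrapper and read the real process handle count.

```python
import random


class FakeCamera:
    """Hypothetical camera wrapper with a deliberate failure-path leak."""

    def __init__(self):
        self.open_handles = 0

    def reconnect(self, fail):
        self.open_handles += 1        # one event per reconnect
        if fail:
            return False              # bug under test: no close on failure
        self.open_handles -= 1
        return True


def run_reconnect_stress(camera, cycles, failure_rate, seed=0):
    """Drive reconnects in a tight loop and report the handle delta."""
    rng = random.Random(seed)         # fixed seed keeps runs comparable
    baseline = camera.open_handles
    failures = 0
    for _ in range(cycles):
        inject_failure = rng.random() < failure_rate
        if not camera.reconnect(inject_failure):
            failures += 1
    delta = camera.open_handles - baseline
    return {"cycles": cycles, "failures": failures, "delta": delta}


if __name__ == "__main__":
    print(run_reconnect_stress(FakeCamera(), cycles=5000, failure_rate=0.2))
```

If the reported delta tracks the number of injected failures rather than staying at zero, the failure path is the place to look.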

4.2. Read the Slope of Handle Count

In a handle leak investigation, looking only at absolute values can be confusing. What matters is whether the count comes back down after operations that should return it, and how many handles you gain per how many operations.

Roughly the following order works well.

  1. Establish a baseline after warm-up
  2. Record Handle Count after each reconnect / start-stop / close
  3. Look at the delta per cycle
  4. Also look at the slope aggregated over several cycles

For example, you can reduce it to a single number like this:

leakSlope =
    (currentHandleCount - baselineHandleCount)
    / reconnectCount

Whether an absolute value of 2000 is high or low varies by app. But if it is +1 per reconnect and never comes back, that is quite suspicious.
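A small sketch of that calculation, with hypothetical function names and made-up sample numbers, might look like this. The second helper looks at the delta between consecutive samples, which is what tells you whether the count ever comes back down.

```python
def leak_slope(baseline_handle_count, current_handle_count, reconnect_count):
    """Average number of handles gained per reconnect since the baseline."""
    if reconnect_count <= 0:
        raise ValueError("need at least one reconnect to compute a slope")
    return (current_handle_count - baseline_handle_count) / reconnect_count


def per_cycle_deltas(samples):
    """Change in handle count between consecutive post-cycle samples."""
    return [after - before for before, after in zip(samples, samples[1:])]


if __name__ == "__main__":
    # Illustrative numbers only: 200 reconnects after a baseline of 1800.
    print(leak_slope(1800, 2000, 200))
    print(per_cycle_deltas([1824, 1825, 1826, 1826]))
```

A slope close to a whole number per reconnect, combined with deltas that are never negative, is a stronger signal than any single absolute value.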

The trick here is to not watch Handle Count alone, but to record at least the following alongside it.

  • Handle Count
  • Private Bytes
  • Thread Count
  • ReconnectCount
  • Which phase you are currently in

With this, you can tell quite quickly whether “memory is growing,” “threads are growing,” or “resources are not coming back on every reconnect.”
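As one possible way to take such a sample, the sketch below reads the current process's handle count through the Win32 `GetProcessHandleCount` function via `ctypes`. The helper names and the record layout are assumptions for illustration; Private Bytes and Thread Count are left out to keep it short and would be added as further fields. On non-Windows systems it falls back to counting open file descriptors under `/proc/self/fd`, purely so the sketch can run there; that number is not a Windows handle count.

```python
import ctypes
import os
import sys
import time


def current_handle_count():
    """Handle count of this process, or None if it cannot be read here."""
    if sys.platform == "win32":
        from ctypes import wintypes
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.GetProcessHandleCount.argtypes = [
            wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
        kernel32.GetProcessHandleCount.restype = wintypes.BOOL
        count = wintypes.DWORD(0)
        ok = kernel32.GetProcessHandleCount(
            kernel32.GetCurrentProcess(), ctypes.byref(count))
        return count.value if ok else None
    # Fallback for other platforms: open file descriptors (approximation).
    fd_dir = "/proc/self/fd"
    return len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None


def take_sample(phase, reconnect_count, read_handle_count=current_handle_count):
    """One record that ties the handle count to the current phase."""
    return {
        "ts": time.time(),
        "phase": phase,
        "reconnectCount": reconnect_count,
        "handleCount": read_handle_count(),
    }
```

Passing the reader in as a parameter keeps the sampling logic testable without a real device or a specific OS.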

4.3. Check the Pairing of create/open and close/dispose

Even once you know the process-wide Handle Count is suspicious, that alone does not get you to the leak site. What you need next is logs that show resource lifecycles as pairs.

For illustration, structured logs like these:

CameraSession session=421 cameraId=CAM01 phase=ReconnectStart reason=FrameTimeout handleCount=1824 privateBytesMB=418

CameraResource session=421 resourceId=evt-884 kind=Event name=FrameReady action=Create osHandle=0x00000ABC handleCount=1825

CameraResource session=421 resourceId=evt-884 kind=Event name=FrameReady action=Close osHandle=0x00000ABC handleCount=1824

What matters here is to not rely on osHandle alone. Windows handle values can be reused later, so in the logs it is easier to trace if you carry at least the following.

  • sessionId
  • resourceId
  • kind
  • action (Create / Open / Register / Close / Dispose / Unregister)
  • osHandle
  • phase

With this in place, it becomes much easier to spot the lopsided flow where a Create exists but no matching Close.
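Once the logs have this shape, finding unmatched pairs can be automated. The following is a minimal sketch that assumes the `key=value` line format shown in the sample log lines; `parse_line` and `find_unclosed` are hypothetical helper names. It keys on session and resourceId rather than osHandle, because handle values can be reused.

```python
OPEN_ACTIONS = {"Create", "Open", "Register"}
CLOSE_ACTIONS = {"Close", "Dispose", "Unregister"}


def parse_line(line):
    """Split 'Record key=value key=value ...' into a dict of fields."""
    parts = line.split()
    if not parts:
        return None
    fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    fields["record"] = parts[0]
    return fields


def find_unclosed(lines):
    """Return resources that were opened but never closed, keyed by
    (session, resourceId)."""
    still_open = {}
    for line in lines:
        rec = parse_line(line)
        if not rec or rec["record"] != "CameraResource":
            continue
        key = (rec.get("session"), rec.get("resourceId"))
        action = rec.get("action")
        if action in OPEN_ACTIONS:
            still_open[key] = rec
        elif action in CLOSE_ACTIONS:
            still_open.pop(key, None)
    return still_open


if __name__ == "__main__":
    sample = [
        "CameraResource session=421 resourceId=evt-884 kind=Event action=Create osHandle=0x00000ABC",
        "CameraResource session=421 resourceId=evt-884 kind=Event action=Close osHandle=0x00000ABC",
        "CameraResource session=422 resourceId=evt-885 kind=Event action=Create osHandle=0x00000ABC",
    ]
    for key, rec in find_unclosed(sample).items():
        print(key, rec["kind"], rec["action"])
```

In the sample, both events happen to share the same osHandle value, yet only the second one is reported, which is exactly why the key does not include osHandle.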

4.4. For Handle Leaks, Find Where It Leaked, Not Where It Crashed

This point is quite important.

A handle leak often presents like this.

  • The crashing line: CreateEvent fails
  • The real leak: CloseHandle had been missing on a failure path since days earlier

In other words, the API that finally fell over is the exit of the damage, not necessarily the entrance of the cause.

So the investigation order should be:

  1. Look at which resource keeps growing
  2. Look at which operation boundary it fails to come back at
  3. Find where the pairing of create/open and close/dispose is broken
  4. Read the crash site last

In this order, you are far less likely to get lost.

5. The Logs You Need to Prevent Recurrence

5.1. The Minimum Set to Keep First

What worked in this investigation was not simply increasing log volume. It was methodically adding “information that lets you reach the cause later.”

At minimum, you want to keep the following.

| Category | Minimum fields wanted | Reason |
|---|---|---|
| Operation context | cameraId, sessionId, operationId, reconnectCount, phase | To tie the event to a specific operation and iteration |
| Process resources | handleCount, privateBytes, workingSet, threadCount | To first isolate what is growing |
| Resource lifecycle | action, resourceId, kind, osHandle, owner | To trace the pairs of create/open and close/dispose |
| External call results | win32Error, HRESULT, sdkError, timeoutMs | To compare failure types later |
| State transitions | OpenStart, OpenDone, ReconnectStart, ReconnectDone, ShutdownStart, etc. | To know during which phase things broke down |
| Execution environment | pid, tid, buildVersion, machineName | To correlate with dumps / symbols / deployed artifacts |

We are not claiming this is sufficient. But without at least this, you easily end up with logs that record nothing more than the fact that “it crashed.”

5.2. The Logs We Actually Strengthened

In this case, we strengthened the logs in the following directions.

  1. Periodic heartbeat
    • Emit Handle Count / Private Bytes / Thread Count / ReconnectCount every 1-5 minutes
  2. Boundary logs per camera session
    • OpenStart
    • CallbackRegistered
    • AcquisitionStart
    • TimeoutDetected
    • ReconnectStart
    • ReconnectDone
    • CloseStart
    • CloseDone
  3. Resource lifecycle logs
    • Create/Open/Register and Close/Dispose/Unregister for events / threads / files / timers / SDK registration tokens
  4. Error normalization
    • Do not stop at the exception message; emit win32Error, HRESULT, sdkError, and phase together

What is important is to not change the shape of the logs between success and failure. If failures get a different format, aggregation later becomes painful.
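One way to keep the shape identical is to emit every record through a single formatter with a fixed field order, filling absent values with a placeholder. The sketch below is only an illustration: `FIELDS`, `format_event`, the `result` field, and the `-` placeholder are assumptions, and the error values are examples (Win32 error 6 is ERROR_INVALID_HANDLE, and 0x80070006 is the corresponding HRESULT).

```python
# Fixed field order shared by success and failure records (hypothetical set).
FIELDS = ("cameraId", "sessionId", "phase", "result",
          "win32Error", "hresult", "sdkError")


def format_event(record, **values):
    """Render one log line; missing fields become '-' so the shape is stable."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    parts = [record]
    for name in FIELDS:
        value = values.get(name)
        parts.append(f"{name}={'-' if value is None else value}")
    return " ".join(parts)


if __name__ == "__main__":
    print(format_event("CameraSession", cameraId="CAM01", sessionId=421,
                       phase="ReconnectDone", result="ok"))
    print(format_event("CameraSession", cameraId="CAM01", sessionId=421,
                       phase="ReconnectStart", result="fail",
                       win32Error=6, hresult="0x80070006"))
```

Because both lines carry the same keys in the same order, a later aggregation can group by phase and error code without special-casing failures.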

5.3. At What Granularity to Collect

A common trap here is “just dump everything at INFO.” But if you do that, you end up facing a wall of logs when you read them later. That is quite painful.

In terms of granularity, roughly the following split is realistic.

  • Periodic monitoring
    • Handle Count, Private Bytes, Thread Count, ReconnectCount
  • Operation boundaries
    • Session start / done / fail
  • Resource boundaries
    • create/open/register and close/dispose/unregister
  • Failure details
    • Error codes, stacks, dump capture triggers

Detailed per-frame logging is usually unnecessary. For long-run defects, logs that let you read “which responsibility opened it, and which responsibility closed it” are far more effective.

6. A Rough Decision Guide

  • Crashes only after days to weeks
    • First add a heartbeat for Handle Count / Private Bytes / Thread Count
  • There are retries / reconnects / shutdowns
    • Build a harness first that hammers just those boundaries in volume
  • Heavy use of native SDKs / P/Invoke / Win32
    • Applying Application Verifier (Part 2) is well worth it
  • A GUI lives in the same process
    • In addition to Handle Count, also watch GDI Objects / USER Objects
  • The exception at the moment of the crash tells you nothing
    • It is faster to first put operation / session / resource lifecycle structured logs in order

That last item is quite important. In bug investigation, what decides the outcome is often not the analysis technique itself, but whether things are in an observable form.

7. Summary

For an app that only crashes after long-running operation, look at Handle Count, not just memory. Handle leaks tend to hide in the failure paths of abnormal flows rather than the normal path, and the crash site is usually the exit of a secondary failure, not the place that leaked. Reading the symptoms ultimately comes down to these three points.

For prevention, bring the responsibilities of create/open and close/dispose closer together, keep logs that carry context per session / operation, and record both process resources and resource lifecycles. In testing, instead of waiting for a month-scale repro, run timeout / reconnect / shutdown in short loops, and make “traceable when it breaks” — not just “doesn’t break” — the acceptance criterion. What worked in this case was this combination. In Part 2, we use Application Verifier to surface hard-to-trigger failure modes such as memory exhaustion and handle anomalies ahead of time.

In control apps, the normal path working matters, but being able to tell “what happened” when things break counts for a lot in long-term operation.

Handle leaks are exactly the type of defect where that difference pays off. If you look at them through growth rates, boundaries, and responsibility pairs — rather than only at the moment they occur — they become considerably easier to chase.

Part 2: When an Industrial Camera Control App Suddenly Crashes After One Month (Part 2) - What Application Verifier Is and How to Build a Failure-Path Test Foundation

Author Profile


Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
