How to Correctly Compare the Speed of Different Program Versions on Windows

Windows, Benchmark, Performance, Profiling, Power Management

You want to compare version A and version B of a program on Windows. The single worst thing you can do is run each once on the same machine and declare “B seems about 8% faster.”

That 8% might genuinely be the code difference. But just as often it turns out to be power mode, power plan, thermals, background updates, search indexing, virus scans, affinity, execution order, or cache state: the classic Windows benchmarking story. It is quite a muddy world.

This article summarizes how to compare the execution speed of different versions of a program on Windows in a form as close to the code difference as possible. The main target is Windows 11, but most of it — powercfg, start, and so on — works the same on Windows 10.

The Conclusion First

The tricks for improving reproducibility boil down to these six.

  1. Decide first “what you want to compare.” Whether you want to see the code difference or the real user experience changes which environment factors you should align.

  2. Record power mode and power plan as separate things. Handle this sloppily on Windows, and your comparison tends to become a comparison of the OS’s power-saving policies.

  3. Separate the cold first run from the warmed-up steady state. “Only the first run is fast” or “only the later runs are slow” is not unusual.

  4. Alternate runs, A→B→A→B. Run all of A first and then all of B, and you eat the skew of thermals and background state.

  5. Look at the median and the spread, not just the mean. One outlier wrecks the whole picture. The mean is more fragile than you think.

  6. If the difference is small, dig down to the cause with ETW / WPR. Argue from gut feel, and you mostly end up brawling in the fog.

Decide First What You Want to Compare

“Speed comparison” sounds like one thing, but there are actually two kinds.

1. A comparison to see the code difference

You want to know whether the implementation itself got faster due to an algorithm change, data structure change, compiler optimization, runtime update, and so on.

In this case, cut environmental noise as much as possible: use a dedicated benchmarking session, fix the power mode, turn notifications off, suppress search indexing and sync, and if necessary go as far as a clean boot.

2. A comparison to see the real user experience

You want to know the speed users will actually feel on their everyday Windows after release.

In this case, you must not erase all the noise that exists in reality. Comparing in a “plausible everyday environment” — including OneDrive sync, Defender, notifications, and normal power settings — gives results closer to reality.

Mix these two, and your conclusions get twisted. Things like “12% faster in the lab but within noise in the real world” or “faster in the real world but unchanged in CPU time” happen routinely.

The Main Causes of Variance on Windows

First, a rough inventory of what makes results wobble.

Layer | Variance factor | Typical example
Hardware | CPU / GPU, memory, SSD, cooling | Thinness of a laptop, presence of a cooling pad
Firmware | BIOS / UEFI, OEM controls | Power-saving policies, fan control
OS | Windows build, drivers, update state | The same PC behaves differently after an update
Power | AC / DC, power mode, power plan | On battery, it is a different world
Thermals | Room temperature, fans, prior load | Turbo on the first run only, fading later
Background | Update, Defender, sync, notifications | A scan or sync runs mid-execution
Scheduling | Priority, affinity, NUMA | CPU placement varies by machine
Data / cache | OS cache, app cache | Slow on the first run, fast from the second run onward
Build conditions | Debug / Release, PGO, logging on/off | You are comparing different things to begin with

In short: even “the same Windows machine” is a different experiment if the conditions are not aligned.

Treat Power Mode and Power Plan Separately

This part matters a lot.

Windows has the Power mode in the Settings app and the traditional Power plan (the power schemes visible via powercfg). They look similar and tend to get lumped together, but handle them sloppily and the comparison turns to mush.

In the Windows Settings app, you can choose the Power mode from Settings > System > Power & battery. Microsoft’s documentation states you can switch between Best power efficiency, Balanced, and Best performance separately for Plugged in / On Battery. Furthermore, changing the Power mode also affects the underlying power-related settings and PPM (Processor Power Management) behavior. In other words, this alone can change core parking and performance scaling policy.

The Power plan, on the other hand, is the traditional power scheme: Balanced, High performance, and so on. You can check it with powercfg /list and powercfg /getactivescheme.

The confusing part is that Windows has both the power mode overlay and the power plan. So record at least the following with your benchmark results:

  • AC or battery
  • Which power mode
  • Which active power plan

Benchmark results missing these three are quite painful to look at later.

Power conditions to pin down first

  1. Always compare laptops on AC power. Battery operation easily introduces unintended limits.

  2. Pin the power mode. For benchmarking, try Best performance first.

  3. Record the active power plan. Save the current value with powercfg.

powercfg /list
powercfg /getactivescheme

  4. Switch to High performance if needed.

REM Balanced
powercfg /setactive 381b4222-f694-41f0-9685-ff5bb260df2e

REM High performance
powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c
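To store the active plan next to each result, the output of powercfg /getactivescheme can be parsed by a script. The following is a minimal Python sketch, assuming the usual single-line output of the form "Power Scheme GUID: <guid>  (<name>)". The function names are hypothetical, and because the label text differs on localized Windows, only the GUID and the parenthesized name are matched.

```python
import re
import subprocess

# A GUID, optionally followed by "(Plan name)". The leading label is ignored
# on purpose because it is localized.
_SCHEME_RE = re.compile(
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\s*(?:\((.*)\))?"
)


def parse_active_scheme(text):
    """Return (guid, name) parsed from powercfg output, or None if no GUID is found."""
    match = _SCHEME_RE.search(text)
    if match is None:
        return None
    guid = match.group(1).lower()
    name = match.group(2)
    return guid, (name.strip() if name else None)


def read_active_scheme():
    """Run `powercfg /getactivescheme` (Windows only) and parse its output."""
    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        capture_output=True, text=True, check=True,
    )
    return parse_active_scheme(result.stdout)
```

Calling read_active_scheme() once at the start of a benchmark session and writing the pair into the results file is usually enough. The power mode chosen in the Settings app is a separate value and is not covered by this sketch.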

“High performance does not show up” is completely normal

This is another stumbling point. Microsoft’s documentation states that on devices supporting Modern Standby, only Balanced, or plans derived from Balanced, are allowed. So instead of asking “High performance is missing, is it broken?”, the more likely answer is that the machine is simply designed that way.

Microsoft also advises that if the Power mode cannot be changed, a custom power plan may be selected, so try selecting Balanced first. When the Power mode UI is unresponsive, this is the quickest thing to suspect.

Kill the Background Noise

Windows is a hard worker. Even when you want a quiet benchmark, it does all sorts of things in the background for you.

First, reboot and wait for things to settle

After changing settings, reboot once, and do not run immediately after login — wait a few minutes. Right after startup, updates, indexing, sync, Defender, and assorted residents are still thrashing around.

For serious comparisons, use a clean boot

Microsoft documents a procedure for reducing to a minimal startup configuration via clean boot: stop non-Microsoft services in msconfig and disable Startup apps in Task Manager.

This is powerful for reducing noise. However, it diverges from the everyday environment, so it is suited to “lab comparisons aimed at seeing the code difference.”

Silence notifications

Windows notification banners look light but are surprisingly disruptive. Beyond the visual nuisance, they can change execution timing, focus, and background app activity.

Enable Do not disturb manually, or at minimum turn notifications off during the benchmark.

Suppress search indexing and sync

If the benchmark target reads lots of files, writes lots of artifacts, or rebuilds source trees repeatedly, search indexing and cloud sync quietly sting.

  • Exclude the benchmark directory from search indexing
  • Pause OneDrive / Dropbox / Google Drive sync
  • Close browsers, Teams, Discord, Slack

Nothing flashy here, but when it matters, it matters a lot.

A Comparison That Does Not Align Thermals Is Mostly Comparing Thermals

A CPU or GPU is a different creature when cold versus warmed up. Laptops, thin mini PCs, and small desktops show this most clearly.

Rules to follow

  • Keep room temperature as consistent as possible
  • Fix how the laptop is positioned
  • Fix the AC adapter, dock, and external display configuration
  • Do no heavy work right before the benchmark
  • Measure the first run and the steady state separately

Alternate the execution order

Avoid running A 10 times and then B 10 times. The skew of thermals, caches, and background activity piles on.

Recommended patterns:

  • A B A B A B ...
  • A B B A A B B A ...
  • Pre-generate a random order and run in that order
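The three patterns above can be generated ahead of time so the run order is decided before any measurement starts. This is a small Python sketch with hypothetical function names; the seed is passed explicitly so that a random order can be recorded and replayed.

```python
import random


def alternating(versions, rounds):
    """A B A B ...: every round runs each version once, in the same order."""
    return [v for _ in range(rounds) for v in versions]


def mirrored(versions, rounds):
    """A B B A A B B A ...: the order is reversed on every other round."""
    order = []
    for i in range(rounds):
        order.extend(versions if i % 2 == 0 else reversed(versions))
    return order


def shuffled(versions, rounds, seed):
    """A pre-generated random order with the same number of runs per version."""
    order = alternating(versions, rounds)
    random.Random(seed).shuffle(order)
    return order


if __name__ == "__main__":
    print(alternating(["A", "B"], 3))
    print(mirrored(["A", "B"], 4))
    print(shuffled(["A", "B"], 3, seed=1))
```

Saving the generated order together with the raw results makes it possible to check later whether a slow run came right after a long streak of the other version.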

What You Measure Changes What “Fast” Means

Squash “fast” into a single number and you mostly have an accident. The three representative metrics to look at on Windows:

1. Wall-clock time

The time the user waits. It is closest to the end-to-end experience, so this is the first value to look at.

On Windows, QueryPerformanceCounter (QPC) is available for high-resolution timing. In managed code, the Stopwatch family is the standard. Eyeballing milliseconds with DateTime.Now is, frankly, a bit defenseless.
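If the benchmark is driven from a script rather than from inside the program, the same idea applies. The sketch below uses Python's time.perf_counter, which is a monotonic high-resolution clock and on Windows is generally backed by QPC. The helper name time_command is hypothetical, and the measured time includes process startup, which is the point of an end-to-end number.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command to completion and return its wall-clock time in milliseconds."""
    start = time.perf_counter()
    subprocess.run(
        argv, check=True,
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    # Stand-in workload: the Python interpreter itself.
    # Replace it with the real benchmark command line.
    elapsed_ms = time_command([sys.executable, "-c", "pass"])
    print(f"{elapsed_ms:.1f} ms")
```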

2. CPU time (user + kernel time)

The time the process actually used the CPU, obtainable via GetProcessTimes.

This is useful for looking at computational efficiency. For example, if wall-clock improved but CPU time did not change, caches, I/O, wait time, or scheduling may be the active ingredient.
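The gap between the two metrics is easy to see in a toy case. The following Python sketch measures only the current process, using time.process_time for user plus kernel CPU time; measuring a child process would need GetProcessTimes on that process's handle instead. The two workloads are stand-ins chosen for illustration.

```python
import time


def measure(work):
    """Return (wall_ms, cpu_ms) for one call of work() in the current process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu_ms = (time.process_time() - cpu_start) * 1000.0
    wall_ms = (time.perf_counter() - wall_start) * 1000.0
    return wall_ms, cpu_ms


def waiting():
    """Blocks for a while without using the CPU (stands in for I/O or lock waits)."""
    time.sleep(0.2)


def computing():
    """Keeps the CPU busy (stands in for real computation)."""
    sum(i * i for i in range(300_000))


if __name__ == "__main__":
    for name, work in (("waiting", waiting), ("computing", computing)):
        wall_ms, cpu_ms = measure(work)
        print(f"{name}: wall={wall_ms:.1f} ms cpu={cpu_ms:.1f} ms")
```

For the waiting workload the wall-clock time is dominated by the sleep while CPU time stays near zero. That is the pattern to look for when a change improves wall-clock time but not CPU time.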

3. Cycle count (CPU cycles)

QueryProcessCycleTime gives you the CPU cycle count for the whole process.

This is also a CPU-work metric, but it shows a different face than wall-clock. It is particularly useful for asking “the wait time is the same, but did the computation itself get lighter?”
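For a harness written in Python, the API can be reached through ctypes. Treat the following as a sketch rather than a vetted binding: the function name is hypothetical, it reads only the current process, and it returns None on anything other than Windows.

```python
import ctypes
import sys


def current_process_cycles():
    """Return the CPU cycle count charged to this process, or None off Windows."""
    if sys.platform != "win32":
        return None
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = wintypes.HANDLE
    kernel32.QueryProcessCycleTime.argtypes = [
        wintypes.HANDLE,
        ctypes.POINTER(ctypes.c_ulonglong),
    ]
    kernel32.QueryProcessCycleTime.restype = wintypes.BOOL

    cycles = ctypes.c_ulonglong(0)
    handle = kernel32.GetCurrentProcess()  # pseudo-handle, no need to close it
    if not kernel32.QueryProcessCycleTime(handle, ctypes.byref(cycles)):
        raise ctypes.WinError(ctypes.get_last_error())
    return cycles.value


if __name__ == "__main__":
    before = current_process_cycles()
    sum(i * i for i in range(200_000))  # stand-in workload
    after = current_process_cycles()
    if before is not None:
        print("cycles used:", after - before)
```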

Priority, Affinity, and NUMA Are Last Resorts

These can have an effect. But reaching for them from the start, just because they have an effect, easily turns the measurement into a different phenomenon.

First, measure normally

If a difference shows up in the default state, that difference itself has value. Throwing in /high or /affinity from the start imports “conditions that do not occur on real Windows.”

If you use them, be clear about the purpose

  • /high: you want fewer disturbances from other processes
  • /affinity: you want to pin CPU placement for the comparison
  • NUMA control: you want to align memory locality on large machines

The Windows start command can launch with a priority class and affinity mask.

start "" /high /wait myapp.exe --bench case1.json
start "" /affinity F /high /wait myapp.exe --bench case1.json
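The value after /affinity is a hexadecimal bitmask in which bit n selects logical processor n, so F means processors 0 to 3. A small helper can remove the mental arithmetic. This is a Python sketch with a hypothetical function name, and it assumes all selected processors sit in a single processor group.

```python
def affinity_mask(cpus):
    """Return the hexadecimal mask string that selects the given logical CPU indices."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    return format(mask, "X")


if __name__ == "__main__":
    print(affinity_mask([0, 1, 2, 3]))  # the mask used in the example above
    print(affinity_mask([2, 3]))
```

Recording the mask as a string in the results file keeps it directly comparable with the command line that was actually run.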

But skip /realtime

/realtime is available, but you should not use it. It tends to work less as noise removal and more as a generator of new accidents.

Putting it all together, here is a procedure that is easy to run in practice.

Lab-leaning comparison procedure

  1. Fix the comparison targets
    • commit hash / build number
    • compiler / runtime version
    • Debug / Release
    • logging, asserts, tracing on/off
  2. Fix the machine conditions
    • Windows build
    • BIOS / UEFI version
    • driver version
    • AC power
    • room temperature, physical placement
  3. Fix the power conditions
    • Decide the power mode
    • Record the active power plan
  4. Reboot
  5. Wait a few minutes before benchmarking
  6. Clean boot if necessary
  7. Include a warm-up
  8. Alternate A / B runs
  9. Get enough repetitions
  10. Keep median, min, max, p95
  11. Save the raw data
  12. If the difference is small, capture ETW / WPR
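Steps 7, 8, and 11 of the procedure can be sketched as a small loop. This is an illustrative Python harness, not a complete tool: the function name, the row layout, and the idea of passing each version as a callable are all assumptions made for the example. Warm-up rounds are kept in the raw data but tagged, so cold and warm runs can be analyzed separately later.

```python
import time


def run_comparison(workloads, rounds, warmups=1, clock=time.perf_counter):
    """Alternate the workloads round by round and return one raw row per run."""
    rows = []
    for round_index in range(warmups + rounds):
        phase = "warmup" if round_index < warmups else "measured"
        # Dict order is preserved, so each round runs A, then B, and so on.
        for version, work in workloads.items():
            start = clock()
            work()
            elapsed_ms = (clock() - start) * 1000.0
            rows.append({
                "round": round_index,
                "phase": phase,
                "version": version,
                "elapsed_ms": elapsed_ms,
            })
    return rows


if __name__ == "__main__":
    # Stand-in workloads; in practice each would launch one build of the program.
    demo = {
        "A": lambda: sum(range(200_000)),
        "B": lambda: sum(range(150_000)),
    }
    for row in run_comparison(demo, rounds=3):
        print(row)
```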

Items Worth Recording That Save You Later

In the benchmark CSV or JSON, keeping at least the following pays off.

timestamp,version,scenario,elapsed_ms,user_ms,kernel_ms,cycles,power_mode,power_plan,ac_or_dc,room_temp_c,notes

If possible, these are handy as well.

cpu_package_temp_start_c,cpu_package_temp_end_c,affinity_mask,priority_class,windows_build,driver_version

With benchmarks, being interpretable later often matters more than the measuring itself.
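Writing the core columns from a script keeps the header and the rows from drifting apart. The following Python sketch uses the standard csv module; the helper name is hypothetical and the values in the demo row are placeholders, not measurements.

```python
import csv
import io

CORE_FIELDS = [
    "timestamp", "version", "scenario", "elapsed_ms", "user_ms", "kernel_ms",
    "cycles", "power_mode", "power_plan", "ac_or_dc", "room_temp_c", "notes",
]


def write_rows(stream, rows):
    """Write rows as CSV. Missing keys become empty cells; unknown keys raise ValueError."""
    writer = csv.DictWriter(stream, fieldnames=CORE_FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_rows(buffer, [{
        # Placeholder values for illustration only.
        "version": "A",
        "scenario": "case1",
        "elapsed_ms": 0.0,
        "power_mode": "Best performance",
        "power_plan": "Balanced",
        "ac_or_dc": "AC",
    }])
    print(buffer.getvalue())
```

Letting unknown keys raise is a deliberate choice in this sketch: a misspelled column name fails loudly instead of silently dropping a condition that would have been needed later.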

Look at the Median and the Distribution, Not Just the Mean

The mean is convenient, but it breaks easily in Windows benchmarks. Defender kicking in just once, a notification popping, another process hammering the SSD — any of these can drag the mean away.

The recommended combination:

  • Median: look at this first
  • p95 / p99: check whether the tail has gotten worse
  • min / max: see how things stray
  • Box plots or scatter plots: useful when the difference is small
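A short Python sketch shows how little it takes for the mean to move. The percentile here uses the nearest-rank definition, which is one of several common conventions, and the sample values in the demo are made up for illustration.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return the summary values worth keeping for one version and scenario."""
    return {
        "n": len(samples),
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p95": percentile(samples, 95),
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Made-up timings in ms: 19 quiet runs and one run hit by a background scan.
    samples = [100.0] * 19 + [900.0]
    print(summarize(samples))
```

In this made-up case the single disturbed run lifts the mean to 140 ms while the median stays at 100 ms, and the outlier itself is still visible in max.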

How to Read a Difference When You See One

Interpreting results is easiest when you look at combinations.

Only wall-clock is faster

Possibly improvements in I/O, wait time, caches, or scheduling.

CPU time and cycles both dropped

There is a good chance the implementation itself got lighter.

Only the first run is slow / fast

That is the cold / warm difference. Suspect startup, initialization, cache generation, JIT.

Gets slower the more runs you do

Suspect thermals, throttling, memory pressure, background activity.
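The first two readings above can be turned into a rough first-pass check. This Python sketch is a heuristic only: the function name, the 2% noise band, and the wording of the returned hints are assumptions for illustration, and the real noise band should come from the spread actually observed on the machine.

```python
def interpret(wall_ratio, cpu_ratio, noise=0.02):
    """Give a first hint from median ratios B / A (a value below 1.0 means B is lower)."""
    wall_faster = wall_ratio < 1.0 - noise
    cpu_lower = cpu_ratio < 1.0 - noise
    if wall_faster and cpu_lower:
        return "implementation itself likely got lighter"
    if wall_faster:
        return "look at I/O, wait time, caches, or scheduling"
    return "no clear wall-clock gain at this noise level"


if __name__ == "__main__":
    print(interpret(wall_ratio=0.90, cpu_ratio=0.91))
    print(interpret(wall_ratio=0.90, cpu_ratio=1.00))
    print(interpret(wall_ratio=0.99, cpu_ratio=0.99))
```

A hint like this only says where to look next. The cold / warm and slowdown-over-time cases need the per-run data rather than a single ratio.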

Dig Down to “Why It Is Faster” with ETW / WPR

When the difference is small, or the reason is unreadable, moving on to Windows’s ETW (Event Tracing for Windows) tooling is the classic route.

Microsoft’s Windows Performance Recorder (WPR) is an ETW-based recording tool included in the Windows ADK. It can capture CPU, I/O, context switches, page faults, and more in one go.

At a minimum, it looks like this.

wpr -start CPU -filemode

REM Run the benchmark here

wpr -stop trace.etl

Once you reach this stage, instead of “B is 3% faster,” you can speak with reasons: “B has less lock contention and lower ready time.” “A opens more files and has a slower cold start.”
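When the benchmark is driven from a script, the same two wpr commands can be wrapped around each run so the trace is stopped even if the benchmark fails. The following is a Python sketch with hypothetical helper names. It assumes wpr is on the PATH and that the script runs from an elevated prompt, which WPR normally requires.

```python
import contextlib
import subprocess


def wpr_commands(etl_path, profile="CPU"):
    """Return the (start, stop) command lines used in the example above."""
    start = ["wpr", "-start", profile, "-filemode"]
    stop = ["wpr", "-stop", etl_path]
    return start, stop


@contextlib.contextmanager
def wpr_trace(etl_path, profile="CPU", run=subprocess.run):
    """Record a WPR trace around the body of a with-block."""
    start, stop = wpr_commands(etl_path, profile)
    run(start, check=True)
    try:
        yield
    finally:
        # Always stop the recording, even if the benchmark raised.
        run(stop, check=True)
```

Usage would look like `with wpr_trace("trace.etl"): run_benchmark()`, where run_benchmark stands for whatever launches the program under test. The run parameter exists so the wrapper can be exercised without WPR installed.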

Summary

When comparing different versions of a program on Windows, what really works is not flashy tricks. What matters is the unglamorous discipline that pays off in reproducibility:

  • Pin and record AC / power mode / power plan
  • Separate cold and warm
  • Alternate A / B runs
  • Look at the median and the distribution
  • Clean boot if necessary
  • If the difference is small, dig to the reason with ETW / WPR

And most important of all: write down, alongside the results, what you pinned and what you did not. A benchmark is a comparison of speed, and at the same time a record of experimental conditions.

A speedup report without conditions is about as entertaining as fortune-telling that occasionally hits — but in terms of reproducibility, it is quite unreliable. Conversely, if the conditions are properly written down, the result has real value even when the difference is small.

Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
