
NetMediate Benchmark Results

This document describes the performance characteristics of NetMediate under the current implementation, which uses explicit handler registration only (no assembly scanning) and closed-type pipeline executors registered at startup.


Reference benchmark environment

The table below is updated automatically by CI on every PR benchmark run. System info comes from the BenchmarkDotNet host environment.

| Key | Value |
| --- | --- |
| OS | Linux Ubuntu 25.10 (Questing Quokka) |
| CPU | Intel Core Ultra 7 165U 2.69GHz, 1 CPU, 6 logical and 3 physical cores |
| .NET SDK | 10.0.107 |
| Runtime | .NET 10.0.7 (10.0.7, 10.0.726.21808), X64 RyuJIT x86-64-v3 |
| Last CI run | 2026-05-06 12:26 UTC |
| Branch | origin/main |
| Commit | e426e39 |

Core dispatch throughput

Measured with BenchmarkDotNet (CoreDispatchBenchmarks) — no behaviors, no resilience, no adapters registered. Mean is the BenchmarkDotNet Throughput-job mean (ns/op). Throughput is the derived ops/s. Alloc Δ compares per-call allocation bytes against the baseline run — allocations are deterministic and unaffected by CPU load, making this the most reliable regression signal. The vs timing column compares dispatch time against the same-run base-branch measurement when available, or against stored target-branch values otherwise (±10% = no change on shared CI hardware; ✅ = improved, ⚠️ = degraded).

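| Benchmark | Mean | Error | Gen0 | Allocated | Alloc Δ | Throughput | vs timing |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Command Send | 68.05 ns | ±1.699 ns | 0.0076 | 48 B | ✅ same | ~14.7M msg/s | ≈ (+0.0%) |
| Notification Notify | 178.47 ns | ±4.195 ns | 0.0688 | 432 B | ✅ same | ~5.6M msg/s | ≈ (+0.0%) |
| Request Request | 75.04 ns | ±1.943 ns | 0.0178 | 112 B | ✅ same | ~13.3M msg/s | ≈ (+0.0%) |
| Stream RequestStream ¹ | 164.37 ns | ±4.520 ns | 0.0344 | 216 B | ✅ same | ~6.1M msg/s | ≈ (+0.0%) |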

¹ Stream measures complete stream invocations (3 items each). Higher throughput = better.
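The Throughput column can be reproduced from the Mean column, since one operation taking `mean` nanoseconds corresponds to `1e9 / mean` operations per second. The following is a minimal sketch of that arithmetic, written as a C# top-level program; the helper name `OpsPerSecond` is illustrative and not part of NetMediate or BenchmarkDotNet.

```csharp
using System;
using System.Globalization;

// Convert a mean duration (nanoseconds per operation) into operations per second.
double OpsPerSecond(double meanNs) => 1_000_000_000.0 / meanNs;

// Mean values copied from the table above.
var means = new (string Name, double Ns)[]
{
    ("Command Send", 68.05),
    ("Notification Notify", 178.47),
    ("Request Request", 75.04),
    ("Stream RequestStream", 164.37),
};

foreach (var (name, ns) in means)
{
    // Format in millions with one decimal, independent of the machine's culture.
    var millions = (OpsPerSecond(ns) / 1e6).ToString("F1", CultureInfo.InvariantCulture);
    Console.WriteLine($"{name}: ~{millions}M msg/s");
}
```

Running it prints 14.7, 5.6, 13.3 and 6.1 million msg/s, matching the rounded values in the table.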

Note on stream vs other types: Stream invocations are inherently more expensive because each call allocates a new IAsyncEnumerator<T> and drives it through multiple MoveNextAsync cycles with Task.Yield() inside the handler. The per-invocation cost is higher by design.
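To make the extra per-invocation work concrete, here is a small sketch of what one stream invocation involves: creating an async enumerator and draining it. `ProduceItems` and `DrainAsync` are hypothetical stand-ins, not NetMediate types, and the placement of `Task.Yield()` before each item is an assumption about the benchmark handler.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for a stream handler: yields `count` items,
// calling Task.Yield() before each one (assumed placement).
async IAsyncEnumerable<int> ProduceItems(int count)
{
    for (var i = 0; i < count; i++)
    {
        await Task.Yield();
        yield return i;
    }
}

// One "invocation": obtain a new enumerator and drain it completely.
// Each item costs one MoveNextAsync cycle, plus a final one that ends the stream.
async Task<int> DrainAsync(IAsyncEnumerable<int> stream)
{
    var drained = 0;
    await foreach (var item in stream)
    {
        drained++;
    }
    return drained;
}

var itemsPerCall = await DrainAsync(ProduceItems(3));
Console.WriteLine($"items drained per invocation: {itemsPerCall}"); // items drained per invocation: 3
```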


BenchmarkDotNet project

For artifact-reproducible, statistically rigorous benchmarks including allocation data and GC gen0/1/2 counts, use the dedicated NetMediate.Benchmarks project:

# Standard JIT run (produces BenchmarkDotNet HTML/CSV artifacts in BenchmarkDotNet.Artifacts/)
dotnet run -c Release --project tests/NetMediate.Benchmarks/

# Quick dry-run to verify benchmark classes compile and can execute (no statistical warming)
dotnet run -c Release --project tests/NetMediate.Benchmarks/ -- --job Dry

# NativeAOT comparison — publish a native binary then run it
dotnet publish tests/NetMediate.Benchmarks/ -c Release -p:AotBenchmark=true -o /tmp/bench-aot
/tmp/bench-aot/NetMediate.Benchmarks

CoreDispatchBenchmarks covers the four core message types:

| Benchmark | Description |
| --- | --- |
| Command Send | IMediator.Send<BenchCommand>(), no pipeline behaviors |
| Notification Notify | IMediator.Notify<BenchNotification>(), no pipeline behaviors |
| Request Request | IMediator.Request<BenchRequest, BenchResponse>(), no pipeline behaviors |
| Stream RequestStream (3 items/call) | IMediator.RequestStream<BenchStreamRequest, BenchStreamItem>(), drains 3 items per invocation |

BenchmarkDotNet output columns: Method, Mean, Error, StdDev, Gen0, Allocated. The --job Short flag adds a short statistical run (3 warmup + 3 measured iterations) alongside the default full job.


Hot-path throughput

Once warm, JIT and NativeAOT produce identical throughput. The handler cache (ConcurrentDictionary<Type, Lazy<T[]>>) and behavior cache eliminate DI resolution on the hot path. NativeAOT has no advantage or disadvantage in per-message throughput.

| Aspect | JIT (CoreCLR) | NativeAOT |
| --- | --- | --- |
| Warm throughput | Baseline | Same ¹ |
| Cold-start (first dispatch) | JIT compiles on first call | Pre-compiled binary; no JIT overhead |
| Startup overhead | None (explicit registration only) | None |
| Binary size | Standard | Larger (trimmed single-file) |
| Compatible registration | All | Explicit registration + source generator only |

¹ Identical because the hot path uses no reflection, no MakeGenericType calls, and no dynamic IL; all resolved types are closed generics fixed at compile time.

How to run the comparison

JIT (standard dotnet test):

NETMEDIATE_RUN_PERFORMANCE_TESTS=true \
dotnet test tests/NetMediate.Tests/ --configuration Release \
--filter "FullyQualifiedName~CoreDispatchThroughput OR FullyQualifiedName~BenchmarkSystemInfo" \
--logger "console;verbosity=detailed"

NativeAOT (publish then run the native binary):

# 1. Publish NativeAOT test host
dotnet publish tests/NetMediate.Tests/ \
--configuration Release \
-p:PublishAot=true \
-p:TrimmerRootAssembly=NetMediate.Tests \
--output /tmp/nativeaot-bench

# 2. Run the native binary with the performance flag
NETMEDIATE_RUN_PERFORMANCE_TESTS=true \
/tmp/nativeaot-bench/NetMediate.Tests \
--filter "CoreDispatchThroughput|BenchmarkSystemInfo"

Look for execution_mode=jit vs execution_mode=nativeaot in the output to confirm which runtime produced each result line.

Trimming without NativeAOT

Publishing with --self-contained -p:PublishTrimmed=true reduces binary size but does not change dispatch throughput. The caches and closed-type registration model are trimmer-safe by design.


Implementation model

All handlers are registered explicitly via IMediatorServiceBuilder methods or the source generator:

builder.Services.UseNetMediate(configure =>
{
    configure.RegisterCommandHandler<MyCommandHandler, MyCommand>();
    configure.RegisterRequestHandler<MyRequestHandler, MyRequest, MyResponse>();
    configure.RegisterNotificationHandler<MyNotificationHandler, MyNotification>();
    configure.RegisterStreamHandler<MyStreamHandler, MyStream, MyItem>();
});

// Or via source generator (identical registrations, generated at compile time)
builder.Services.AddNetMediate();

At startup each Register*Handler<> call performs two TryAddSingleton<> / TryAddTransient<> registrations:

| Handler kind | Executor registered |
| --- | --- |
| RegisterCommandHandler<THandler, TMsg> | PipelineExecutor<TMsg, Task, ICommandHandler<TMsg>> |
| RegisterNotificationHandler<THandler, TMsg> | NotificationPipelineExecutor<TMsg> |
| RegisterRequestHandler<THandler, TMsg, TResp> | RequestPipelineExecutor<TMsg, TResp> |
| RegisterStreamHandler<THandler, TMsg, TResp> | StreamPipelineExecutor<TMsg, TResp> |

No MakeGenericType, no typeof(TResult) switch, no assembly scanning — fully NativeAOT-compatible.


Dispatch semantics

| Operation | Method | Semantics |
| --- | --- | --- |
| Send | IMediator.Send<TMsg> | All ICommandHandler<TMsg> instances iterated sequentially |
| Request | IMediator.Request<TMsg, TResp> | Single IRequestHandler<TMsg, TResp> (first registered) |
| Notify | IMediator.Notify<TMsg> | Fire-and-forget per handler; all INotificationHandler<TMsg> instances started individually; exceptions logged |
| RequestStream | IMediator.RequestStream<TMsg, TResp> | Single IStreamHandler<TMsg, TResp>; yields items lazily |
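The difference between the Send and Notify rows can be illustrated without any NetMediate types. The sketch below models handlers as plain delegates; `SendSequentially` and `NotifyFireAndForget` are hypothetical names, and the use of `Task.Run` plus an in-memory log list is an assumption made for illustration rather than a description of the library's internals.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var log = new List<string>();

// Send-style dispatch: await each handler in turn, in registration order.
async Task SendSequentially(string message, IEnumerable<Func<string, Task>> handlers)
{
    foreach (var handler in handlers)
    {
        await handler(message);
    }
}

// Notify-style dispatch: start every handler individually without awaiting it;
// a failing handler is logged instead of propagating to the caller.
List<Task> NotifyFireAndForget(string message, IEnumerable<Func<string, Task>> handlers)
{
    var tasks = new List<Task>();
    foreach (var handler in handlers)
    {
        tasks.Add(Task.Run(async () =>
        {
            try { await handler(message); }
            catch (Exception ex) { lock (log) { log.Add($"logged: {ex.Message}"); } }
        }));
    }
    return tasks; // returned only so this sketch can wait for completion
}

var order = new List<string>();
await SendSequentially("cmd", new Func<string, Task>[]
{
    async m => { await Task.Delay(10); order.Add("first"); },
    m => { order.Add("second"); return Task.CompletedTask; },
});
Console.WriteLine(string.Join(",", order)); // first,second

var started = NotifyFireAndForget("evt", new Func<string, Task>[]
{
    m => throw new InvalidOperationException("boom"),
    m => Task.CompletedTask,
});
await Task.WhenAll(started);
Console.WriteLine(log.Count); // 1
```

The second handler of the Send call only runs after the first has completed, while the failing Notify handler leaves a log entry and does not affect its sibling.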

Pipeline behavior resolution

Behaviors are registered via RegisterBehavior<TBehavior, TMessage, TResult>(), closed types only. The resolved behavior arrays are cached per message-result type in a ConcurrentDictionary<Type, Lazy<T[]>> of the same shape as the handler cache (held per service provider; see Handler and behavior caches), so no DI enumeration occurs on the hot path after the first dispatch of a given message type.

Command pipeline (PipelineExecutor<TMsg, Task, ICommandHandler<TMsg>>)

Resolves IPipelineBehavior<TMsg, Task> — two-parameter closed-type lookup, cached.

Notification pipeline (NotificationPipelineExecutor<TMsg>)

Resolves both, then concatenates:

  1. IPipelineBehavior<TMsg, Task> — two-parameter closed-type lookup, cached
  2. IPipelineBehavior<TMsg> — one-parameter closed-type lookup, cached (notification-specific behaviors)

No runtime type switches — the two-lookup pattern is fixed at compile time inside the executor.

Request pipeline (RequestPipelineExecutor<TMsg, TResp>)

Resolves both, then concatenates:

  1. IPipelineBehavior<TMsg, Task<TResp>> — two-parameter closed-type lookup, cached
  2. IPipelineRequestBehavior<TMsg, TResp> — closed-type shorthand lookup, cached

Stream pipeline (StreamPipelineExecutor<TMsg, TResp>)

Resolves both, then concatenates:

  1. IPipelineBehavior<TMsg, IAsyncEnumerable<TResp>> — two-parameter closed-type lookup, cached
  2. IPipelineStreamBehavior<TMsg, TResp> — closed-type shorthand lookup, cached
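The "resolve both, then concatenate" pattern used by the notification, request and stream executors can be sketched with delegates. Everything here is hypothetical: the behavior shape (message plus a `next` continuation), the helper names `Named` and `Compose`, and the assumption that the first behavior in the concatenated array runs outermost.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

var trace = new List<string>();

// Hypothetical behavior shape: receives the message and a `next` continuation.
Func<string, Func<Task<string>>, Task<string>> Named(string name) =>
    async (message, next) =>
    {
        trace.Add($"{name}:before");
        var result = await next();
        trace.Add($"{name}:after");
        return result;
    };

// Stand-ins for the two closed-type lookups of a request pipeline.
var general = new[] { Named("general") };     // e.g. IPipelineBehavior<TMsg, Task<TResp>>
var shorthand = new[] { Named("shorthand") }; // e.g. IPipelineRequestBehavior<TMsg, TResp>

// Concatenate once; this combined array is what a cache would hold.
var behaviors = general.Concat(shorthand).ToArray();

// Wrap the handler from the last behavior to the first,
// so the first behavior in the array ends up outermost.
Func<Task<string>> Compose(string message, Func<string, Task<string>> handler)
{
    Func<Task<string>> next = () => handler(message);
    for (var i = behaviors.Length - 1; i >= 0; i--)
    {
        var behavior = behaviors[i];
        var inner = next;
        next = () => behavior(message, inner);
    }
    return next;
}

var response = await Compose("ping", m => Task.FromResult(m.ToUpperInvariant()))();
Console.WriteLine(response);                  // PING
Console.WriteLine(string.Join(" > ", trace)); // general:before > shorthand:before > shorthand:after > general:after
```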

Handler and behavior caches

Resolved handler arrays are cached permanently per service type using a global ConcurrentDictionary<Type, Lazy<T[]>> (s_handlerCache). Handlers are registered as Singletons, so their resolved arrays never change for the lifetime of the application — a single global cache is correct.
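A minimal sketch of such a global cache follows. It uses `object[]` in place of `T[]` to stay self-contained, and `GetHandlers` and `fakeContainer` are illustrative names rather than NetMediate members. Wrapping the value in `Lazy` matters because `ConcurrentDictionary.GetOrAdd` may invoke its factory more than once under contention; the `Lazy` ensures the expensive resolution itself runs at most once per key.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Global cache keyed by service type.
var handlerCache = new ConcurrentDictionary<Type, Lazy<object[]>>();
var resolutions = 0;

// Returns the cached array, resolving from the container only on the first call per type.
object[] GetHandlers(Type serviceType, Func<Type, object[]> resolveFromContainer) =>
    handlerCache.GetOrAdd(
        serviceType,
        key => new Lazy<object[]>(
            () => { Interlocked.Increment(ref resolutions); return resolveFromContainer(key); },
            LazyThreadSafetyMode.ExecutionAndPublication)).Value;

// Hypothetical stand-in for DI resolution.
Func<Type, object[]> fakeContainer = _ => new object[] { "handlerA", "handlerB" };

var first = GetHandlers(typeof(string), fakeContainer);
var second = GetHandlers(typeof(string), fakeContainer);

Console.WriteLine(ReferenceEquals(first, second)); // True
Console.WriteLine(resolutions);                    // 1
```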

Resolved behavior arrays use a per-service-provider cache: a ConditionalWeakTable<IServiceProvider, ConcurrentDictionary<Type, Lazy<T[]>>> (s_behaviorCacheByProvider). Each DI container gets its own isolated behavior dictionary, preventing cache contamination between containers (e.g., different test suites or multi-tenant hosts). When the provider is garbage-collected its cache entry is automatically released — no memory leak.

First call for TMsg in a given provider → DI resolution + cache fill → O(n) one-time cost
All subsequent calls → cache read → O(1)
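The per-provider isolation described above can be sketched with `ConditionalWeakTable`, which holds its keys weakly so an entry disappears once the key is collected. Plain `object` instances stand in for `IServiceProvider` here, and `GetBehaviors` is a hypothetical helper, not the library's API.

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.CompilerServices;

// One behavior dictionary per provider; entries do not keep the provider alive.
var cacheByProvider =
    new ConditionalWeakTable<object, ConcurrentDictionary<Type, Lazy<object[]>>>();

object[] GetBehaviors(object provider, Type key, Func<object[]> resolve)
{
    // First call for a provider creates its dictionary; later calls reuse it.
    var perProvider = cacheByProvider.GetValue(
        provider, _ => new ConcurrentDictionary<Type, Lazy<object[]>>());

    // First call for a key resolves and caches; later calls are dictionary reads.
    return perProvider.GetOrAdd(key, _ => new Lazy<object[]>(resolve)).Value;
}

var providerA = new object(); // stand-ins for two separate DI containers
var providerB = new object();

var a1 = GetBehaviors(providerA, typeof(string), () => new object[] { "A-behavior" });
var a2 = GetBehaviors(providerA, typeof(string), () => new object[] { "never used" });
var b1 = GetBehaviors(providerB, typeof(string), () => new object[] { "B-behavior" });

Console.WriteLine(ReferenceEquals(a1, a2)); // True
Console.WriteLine(ReferenceEquals(a1, b1)); // False
Console.WriteLine(b1[0]);                   // B-behavior
```

The second call for `providerA` never runs its factory, and `providerB` gets its own array, which is the isolation property the paragraph describes.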

How to reproduce benchmarks

Core dispatch throughput (per message type)

NETMEDIATE_RUN_PERFORMANCE_TESTS=true \
dotnet test tests/NetMediate.Tests/ --configuration Release \
--filter "FullyQualifiedName~CoreDispatchThroughput OR FullyQualifiedName~BenchmarkSystemInfo" \
--logger "console;verbosity=detailed"

Output lines of interest:

SYSTEM_INFO execution_mode=<jit|nativeaot>
SYSTEM_INFO logical_cpus=<n>
SYSTEM_INFO total_ram_mb=<mb>
CORE_THROUGHPUT <type> tfm=<tfm> execution_mode=<mode> ops=<n> elapsed_ms=<ms> msgs_per_second=<n>
LOAD_RESULT <scenario> tfm=<tfm> execution_mode=<mode> ops=<n> elapsed_ms=<ms> throughput_ops_s=<n>
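For scripting around these lines, the key=value fields can be pulled out with a few lines of code. This is a sketch only: `ParseFields` is an illustrative helper, and the sample line is made up to match the format shown above, so its values are not real measurements.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Split "KIND name key=value key=value ..." into its key=value fields.
Dictionary<string, string> ParseFields(string line)
{
    var result = new Dictionary<string, string>();
    foreach (var token in line.Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        var eq = token.IndexOf('=');
        if (eq > 0)
        {
            result[token[..eq]] = token[(eq + 1)..];
        }
    }
    return result;
}

// Hypothetical sample in the documented format; the numbers are invented.
var sample = "CORE_THROUGHPUT core_command tfm=net10.0 execution_mode=jit ops=100000 elapsed_ms=20 msgs_per_second=5000000";

var parsed = ParseFields(sample);
var mode = parsed["execution_mode"];
var rate = double.Parse(parsed["msgs_per_second"], CultureInfo.InvariantCulture);

Console.WriteLine($"mode={mode} rate={rate} aboveFloor={rate > 500}");
```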

Full benchmark suite

NETMEDIATE_RUN_PERFORMANCE_TESTS=true \
dotnet test tests/NetMediate.Tests/ --configuration Release \
--filter "FullyQualifiedName~LoadPerformance OR FullyQualifiedName~PipelineVariants OR FullyQualifiedName~ExplicitRegistration OR FullyQualifiedName~CoreDispatchThroughput OR FullyQualifiedName~BenchmarkSystemInfo" \
--logger "console;verbosity=detailed"

Minimum CI assertions

| Test class | Scenario | Threshold |
| --- | --- | --- |
| CoreDispatchThroughputTests | core_command | > 500 msgs/s |
| CoreDispatchThroughputTests | core_notification | > 500 msgs/s |
| CoreDispatchThroughputTests | core_request | > 500 msgs/s |
| CoreDispatchThroughputTests | core_stream | > 500 msgs/s |
| LoadPerformanceTests | all | > 500 ops/s |
| CoreExplicitRegistrationLoadTests | all | > 500 ops/s |
| ResilienceLoadPerformanceTests | resilience_request_parallel | ≥ 30,000 ops/s |
| FullStackLoadPerformanceTests | fullstack_request_parallel | ≥ 20,000 ops/s |
| PipelineVariantsLoadTests | all | > 500 ops/s |

Thresholds are deliberately lenient to remain green on any CI hardware. Local developer machines and production servers typically produce 10–100× higher throughput than the minimum assertion.



Latest CI Benchmark Run

Run: 2026-05-06 12:26 UTC | Branch: origin/main | Commit: e426e39

✅ Base branch benchmarked in the same CI job (same machine — direct comparison).

System specification

Linux Ubuntu 25.10 (Questing Quokka)
Intel Core Ultra 7 165U 2.69GHz, 1 CPU, 6 logical and 3 physical cores
.NET SDK 10.0.107
Runtime: .NET 10.0.7 (10.0.7, 10.0.726.21808), X64 RyuJIT x86-64-v3

Performance summary (BenchmarkDotNet — Throughput job)

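| Benchmark | Mean | Error | Gen0 | Allocated | Alloc Δ | Throughput | vs timing |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Command Send | 68.05 ns | ±1.699 ns | 0.0076 | 48 B | ✅ same | ~14.7M msg/s | ≈ (+0.0%) |
| Notification Notify | 178.47 ns | ±4.195 ns | 0.0688 | 432 B | ✅ same | ~5.6M msg/s | ≈ (+0.0%) |
| Request Request | 75.04 ns | ±1.943 ns | 0.0178 | 112 B | ✅ same | ~13.3M msg/s | ≈ (+0.0%) |
| Stream RequestStream | 164.37 ns | ±4.520 ns | 0.0344 | 216 B | ✅ same | ~6.1M msg/s | ≈ (+0.0%) |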

Comparison vs baseline (main, average of ≤3 runs)

Timing: ✅ improved (>10% faster) | ≈ no change (±10%) | ⚠️ degraded (>10% slower)

Alloc Δ: ✅ same / ✅ −N B (less) / ⚠️ +N B (more)

| Benchmark | Baseline (main, average of ≤3 runs) | Current | Δ timing | Alloc Δ |
| --- | --- | --- | --- | --- |
| Command Send | 68.05 ns | 68.05 ns | ≈ +0.0% | ✅ same |
| Notification Notify | 178.47 ns | 178.47 ns | ≈ +0.0% | ✅ same |
| Request Request | 75.04 ns | 75.04 ns | ≈ +0.0% | ✅ same |
| Stream RequestStream | 164.37 ns | 164.37 ns | ≈ +0.0% | ✅ same |
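The timing legend can be read as a simple rule on the relative change in mean time. The sketch below assumes that "faster" and "slower" refer to the percentage change of the current mean against the baseline mean (negative is faster); `DeltaPercent` and `Classify` are illustrative names, not part of the CI tooling.

```csharp
using System;
using System.Globalization;

// Percentage change of the current mean relative to the baseline mean.
double DeltaPercent(double baselineNs, double currentNs) =>
    (currentNs - baselineNs) / baselineNs * 100.0;

// Apply the ±10% band from the legend: lower time is better.
string Classify(double deltaPercent) =>
    deltaPercent < -10.0 ? "improved"
    : deltaPercent > 10.0 ? "degraded"
    : "no change";

// Command Send row from the comparison table: baseline and current are equal.
var delta = DeltaPercent(baselineNs: 68.05, currentNs: 68.05);
var formatted = delta.ToString("+0.0;-0.0", CultureInfo.InvariantCulture);
Console.WriteLine($"{formatted}% -> {Classify(delta)}"); // +0.0% -> no change
```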