Vendor-Run vs Independent FHIR Benchmarks: What to Watch For

A vendor publishing a benchmark that includes their own product is one of the oldest tensions in software procurement, and FHIR is no exception. Health Samurai, the company behind Aidbox, just released an open-source performance benchmark covering Aidbox, HAPI FHIR, Medplum, and the Microsoft FHIR Server, with a dashboard that reruns daily. The framing is honest about who built it, the methodology is open, and the harness can be forked. That makes it a useful case study in how to read vendor-run benchmarks generally. For more side-by-side breakdowns, the FHIR comparison library covers the broader review shelf.

What Vendor-Run Benchmarks Get Right

A vendor publishing a benchmark has one advantage that independent benchmarks rarely match. The vendor knows their own product cold. They know how to tune it, what its sharp edges are, and which workloads are the meaningful ones for the use cases their customers actually run. A benchmark written by a team with deep operational knowledge of one of the systems under test is usually a more carefully built benchmark than one assembled by a generalist.

The new release from Health Samurai, authored by Marat Surmashev, VP of Engineering, is a clean example. The workloads are specific (CRUD against nine resource types, bundle import, search across six families). The hardware is pinned. The data is generated with Synthea: about 2 million resources from 1,000 patient records. The harness is published in the open repo, so anyone can read it and re-run it themselves.

What to Hold in Mind

The obvious counterweight is that the publisher has a position. Aidbox tops the CRUD throughput row at about 5,200 requests per second, with Microsoft at about 440 and the other two in between. Numbers like that need to be read with the publisher in mind, not because they are wrong, but because the choice of which workloads to emphasize and which to leave for a later post is the vendor's to make.
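To put the size of that gap in perspective, the two CRUD figures quoted above can be turned into a ratio. This is only a sketch of the arithmetic: the dictionary and the function name `throughput_ratio` are illustrative and do not come from the benchmark harness, and the two middle systems are left out because no figures are quoted for them here.

```python
# Approximate CRUD throughput figures quoted in the article (requests per second).
# HAPI FHIR and Medplum are omitted because the article gives no numbers for them.
crud_rps = {"Aidbox": 5200, "Microsoft FHIR Server": 440}


def throughput_ratio(fast: float, slow: float) -> float:
    """Return how many times higher `fast` is than `slow`."""
    if slow <= 0:
        raise ValueError("slow must be positive")
    return fast / slow


ratio = throughput_ratio(crud_rps["Aidbox"], crud_rps["Microsoft FHIR Server"])
print(f"Top-to-bottom CRUD gap: about {ratio:.1f}x")  # about 11.8x
```

A gap of roughly an order of magnitude on one row is exactly the kind of headline number worth re-checking against a workload mix that matches your own traffic.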

A few practical checks to keep in mind when reading any vendor-run benchmark:

  • Is the harness open and runnable, or is the only artifact a slide with numbers?
  • Does the test pin the hardware, the dataset, and the workload identically across all systems under test?
  • Are there caveats in the report itself, or is the framing purely promotional?

By those criteria the new public release does well. The complete guide to FHIR validators covers similar evaluative framing in the validator market.

The Dataset-Size Question

The report names its biggest limitation out loud. The 1,000-patient dataset, at about 2 million resources, fits comfortably in memory, so the benchmark never exercises the disk-bound behavior that real production workloads run into. That is the point at which the ranking in the table could shift. The published note says the next post in the series will run at scale, which is the right follow-up.
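A rough back-of-envelope check shows why a dataset of this size tends to stay in memory. The resource and patient counts below come from the report as described here; the average resource size, the storage overhead factor, and the server RAM are placeholder assumptions chosen for illustration, not figures from the benchmark, so substitute your own measurements before drawing conclusions.

```python
# Figures taken from the article.
TOTAL_RESOURCES = 2_000_000
PATIENTS = 1_000

# ASSUMPTIONS (hypothetical, not from the report): replace with measured values.
AVG_RESOURCE_BYTES = 2_048   # assumed average stored size of one resource
STORAGE_OVERHEAD = 3.0       # assumed multiplier for indexes and row overhead
SERVER_RAM_GIB = 32.0        # assumed RAM on the database host


def resources_per_patient(total_resources: int, patients: int) -> float:
    """Average number of resources generated per patient record."""
    return total_resources / patients


def estimated_size_gib(total_resources: int, avg_resource_bytes: int,
                       overhead: float = 1.0) -> float:
    """Estimated on-disk footprint in GiB, including an overhead multiplier."""
    return total_resources * avg_resource_bytes * overhead / 2**30


def fits_in_memory(size_gib: float, ram_gib: float, headroom: float = 0.5) -> bool:
    """True if the dataset fits within the given fraction of RAM."""
    return size_gib <= ram_gib * headroom


size = estimated_size_gib(TOTAL_RESOURCES, AVG_RESOURCE_BYTES, STORAGE_OVERHEAD)
print(f"Resources per patient: {resources_per_patient(TOTAL_RESOURCES, PATIENTS):.0f}")
print(f"Estimated footprint: {size:.1f} GiB")
print(f"Fits in half of RAM: {fits_in_memory(size, SERVER_RAM_GIB)}")
```

Scaling `PATIENTS` and `TOTAL_RESOURCES` up until `fits_in_memory` returns `False` gives a rough idea of where disk-bound behavior would start to matter under these assumptions.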

For a procurement team, the practical takeaway is to treat the daily snapshot as a sanity check, not a final ranking. Because the repository is open, an independent shop can fork the harness and substitute its own workload mix.

How This Compares to Validator Benchmarks

The validator market has long had a similar dynamic, where vendors publish performance numbers tied to their own products. The honest path is the same one this release takes: open the harness, pin the inputs, and let the broader community check the work. The commercial vs open-source FHIR validators comparison covers the parallel question on the validation side.