MIT LICENSED · MODEL EVALUATION

Model Selection by Behavioral Contract

How to evaluate AI models for enterprise deployment using system cards as behavioral specifications — not benchmark scores. Worked example included.

Part of the charleskjohnson.com governance framework

Benchmarks measure how the model performs on someone else's task. System cards document how the model tends to behave, which is what carries over to yours. For enterprise deployment, the behavioral profile is the determining factor, not the leaderboard.

System cards (or model cards) published by responsible model vendors describe observed behavioral tendencies: how the model handles uncertainty, when it shows initiative beyond instructions, where it tends to over- or under-claim, what failure modes were observed and characterized during testing. These are the inputs to an enterprise model-selection decision.

Treating system cards as behavioral specifications — rather than as marketing material — lets the operator design hard constraints that compensate for documented tendencies. A model whose system card documents "over-eager initiative on agentic coding tasks" can still be deployed safely, but the deployment must include constraints that close that initiative surface.

FOUR LENSES FOR READING A SYSTEM CARD

LENS 01: Observed failure modes
What did the vendor measure the model doing wrong? "Deleted failing tests rather than fixing the underlying issue." "Fabricated commit hashes." "Over-eager initiative on adjacent code." Each observed failure is a constraint you need to write.

LENS 02: Honesty / fabrication characterization
How does the model handle uncertainty? Does it volunteer "I don't know" or does it confabulate? Is there a documented gap between stated confidence and actual accuracy? Vendor reporting on honesty is the single most important signal for governance work.

LENS 03: Initiative profile
Does the model stay scoped, or does it expand into adjacent work? Does it modify tests when implementation fails? Does it refactor surrounding code? Initiative is not inherently bad — but it must be constrained to match the deployment risk profile.

LENS 04: Documented behavioral defenses
What does the vendor recommend as the primary mitigation? "Verification thoroughness." "Explicit pre-flight gates." "Source attribution requirements." The vendor's own recommended defenses are the starting point for your hard constraints.
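
One way to keep a review honest is to record notes under each lens and check that none was skipped. The sketch below is illustrative only: `SystemCardReview`, its field names, and `empty_lenses` are hypothetical, and no vendor publishes system cards in this shape.

```python
from dataclasses import dataclass, field


@dataclass
class SystemCardReview:
    """Notes from reading one system card through the four lenses.

    Hypothetical structure for illustration; field names are assumptions.
    """
    model: str
    failure_modes: list[str] = field(default_factory=list)     # Lens 01
    honesty_notes: list[str] = field(default_factory=list)     # Lens 02
    initiative_notes: list[str] = field(default_factory=list)  # Lens 03
    vendor_defenses: list[str] = field(default_factory=list)   # Lens 04

    def empty_lenses(self) -> list[str]:
        """Return the names of lenses that have no notes recorded yet."""
        lenses = {
            "failure_modes": self.failure_modes,
            "honesty": self.honesty_notes,
            "initiative": self.initiative_notes,
            "vendor_defenses": self.vendor_defenses,
        }
        return [name for name, notes in lenses.items() if not notes]


# Example entries reuse the quotations from the lens descriptions.
review = SystemCardReview(
    model="candidate-model",  # placeholder name
    failure_modes=[
        "Deleted failing tests rather than fixing the underlying issue",
        "Fabricated commit hashes",
    ],
    vendor_defenses=["Verification thoroughness"],
)

print(review.empty_lenses())  # ['honesty', 'initiative']
```

A review with any empty lens is incomplete, so the card has not yet been read as a behavioral specification.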

Migration: Opus 4.7 to Sonnet 4.6

An operator running a governance-critical project on Opus 4.7 considers migrating to Sonnet 4.6. The decision is framed not as "which model is better" but as "which model's documented behavior best matches the deployment's risk profile, given the constraints we can write to compensate."

Observed signals (from system card review)

The decision

The migration was authorized with revised hard constraints calibrated to Sonnet 4.6's documented profile. The governance file was refactored, not the model usage pattern. The constraints did the compensation work; the model produced predictable behavior because the constraint set matched the documented behavioral tendencies.

The model is the application. The behavioral contract is the operating system. Migration is a contract revision, not a re-architecture.
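
The idea that migration revises the contract rather than the architecture can be sketched as an immutable constraint set tied to a model version. This is a minimal sketch under stated assumptions: `BehavioralContract`, `applies_to`, and `revise` are hypothetical names, and the constraint strings are placeholders, not text drawn from either model's system card.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BehavioralContract:
    """A locked constraint set, valid only for the model it was calibrated to."""
    model_version: str
    constraints: tuple[str, ...]
    revision: int = 1

    def applies_to(self, deployed_model: str) -> bool:
        """True only if the deployed model is the one this contract targets."""
        return deployed_model == self.model_version

    def revise(self, new_model: str,
               new_constraints: tuple[str, ...]) -> "BehavioralContract":
        """Return a new contract for a new model; the old one stays unchanged."""
        return replace(
            self,
            model_version=new_model,
            constraints=new_constraints,
            revision=self.revision + 1,
        )


# Placeholder constraints, invented for illustration.
old = BehavioralContract("Opus 4.7", ("placeholder constraint A",))
new = old.revise("Sonnet 4.6", ("placeholder constraint B",))

print(old.applies_to("Sonnet 4.6"))  # False: the old contract must be re-evaluated
print(new.applies_to("Sonnet 4.6"))  # True
print(new.revision)                  # 2
```

Freezing the dataclass is a design choice: a locked decision should be replaced by a new revision, never edited in place.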

For any model-selection decision:

  1. Read the system card. All four lenses. Document what was observed.
  2. Map each documented failure mode to a constraint. Either you can write the constraint, or the deployment risk is unacceptable.
  3. Verify the vendor's primary defense matches your governance posture. If the vendor says "verification thoroughness is the primary defense" and your project has weak verification discipline, fix the discipline before choosing this model.
  4. Test the constraint set against representative tasks. Not benchmarks — representative tasks. Does the constraint set produce predictable behavior on the work you actually do?
  5. Lock the constraint set as a decision. Future operators inherit the constraints calibrated to this model. When the model is replaced, re-evaluate the constraints.
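
Step 2 can be sketched as a coverage check: every documented failure mode needs at least one compensating constraint. The function name `uncovered_failure_modes` and the constraint wording below are hypothetical and written only for illustration.

```python
def uncovered_failure_modes(
    failure_modes: list[str],
    constraints: dict[str, list[str]],
) -> list[str]:
    """Return the failure modes that have no compensating constraint."""
    return [mode for mode in failure_modes if not constraints.get(mode)]


# Failure modes quoted from the Lens 01 description.
failure_modes = [
    "Deleted failing tests rather than fixing the underlying issue",
    "Fabricated commit hashes",
    "Over-eager initiative on adjacent code",
]

# Hypothetical constraint wording, not taken from any vendor document.
constraints = {
    "Deleted failing tests rather than fixing the underlying issue": [
        "Never delete or skip a failing test; report the failure and stop.",
    ],
    "Fabricated commit hashes": [
        "Cite a commit hash only after reading it from the repository log.",
    ],
}

gaps = uncovered_failure_modes(failure_modes, constraints)
print(gaps)  # ['Over-eager initiative on adjacent code']
```

A non-empty result means either a constraint still has to be written or, per step 2, the deployment risk is unacceptable.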

Pre-deployment checklist:

  1. Have you read the candidate model's system card cover-to-cover?
  2. Have you enumerated each observed failure mode?
  3. For each failure mode, have you written or identified a compensating hard constraint?
  4. Does the vendor's primary recommended defense align with your existing governance discipline?
  5. Have you tested the constraint set on representative tasks (not benchmarks)?
  6. Have you locked the constraint set as a project decision with reference to the model version?

If all six answers are yes, the model can be deployed. If any answer is no, close that gap first.
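
The six questions can be sketched as a simple gate that blocks deployment and names what is still open. The identifiers `CHECKLIST` and `deployment_gate` and the item keys are hypothetical shorthand for the questions, not part of any published tool.

```python
# Hypothetical short keys, one per checklist question, in order.
CHECKLIST = (
    "read_system_card",
    "enumerated_failure_modes",
    "constraints_written",
    "vendor_defense_aligned",
    "tested_on_representative_tasks",
    "constraints_locked_to_version",
)


def deployment_gate(answers: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (approved, open_items); an unanswered question counts as 'no'."""
    open_items = [item for item in CHECKLIST if not answers.get(item, False)]
    return (not open_items, open_items)


answers = {item: True for item in CHECKLIST}
answers["tested_on_representative_tasks"] = False

approved, open_items = deployment_gate(answers)
print(approved)    # False
print(open_items)  # ['tested_on_representative_tasks']
```

Treating a missing answer as "no" keeps the gate conservative: a question nobody answered cannot approve a deployment.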