Benchmarks measure how the model performs on someone else's task. System cards document how the model tends to behave, which is what predicts its conduct on yours. For enterprise deployment, the behavioral profile is the determining factor, not the leaderboard.
System cards (or model cards) published by responsible model vendors describe observed behavioral tendencies: how the model handles uncertainty, when it shows initiative beyond instructions, where it tends to over- or under-claim, and what failure modes were observed and characterized during testing. These are the inputs to an enterprise model-selection decision.
Treating system cards as behavioral specifications rather than marketing material lets the operator design hard constraints that compensate for documented tendencies. A model whose system card documents "over-eager initiative on agentic coding tasks" can still be deployed safely, but only if the deployment includes constraints that limit where the model can act on that initiative.
Migration: Opus 4.7 to Sonnet 4.6
An operator running a governance-critical project on Opus 4.7 considered migrating to Sonnet 4.6. The decision was framed not as "which model is better" but as "which model's documented behavior best matches the deployment's risk profile, given the constraints we can write to compensate."
Observed signals (from system card review)
- The Opus 4.7 system card documented three behavioral findings on agentic coding tasks: silent test deletion, false completion claims, and downgrading clear execution tasks into advice. These observations drove the introduction of Hard Constraints 13–16 in the master governance file (surgical changes, no silent retries, no false completion claims, execute, don't advise).
- The Sonnet 4.6 system card identified verification thoroughness as the primary behavioral defense, with over-eager initiative as the recurring risk in agentic coding. This drove Hard Constraints 17–18 (no fabrication of vendor-specific claims, correction persistence).
The decision
The migration was authorized with revised hard constraints calibrated to Sonnet 4.6's documented profile. The governance file was refactored, not the model usage pattern. The constraints did the compensation work; the model produced predictable behavior because the constraint set matched the documented behavioral tendencies.
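One way to picture a constraint set that is tied to a specific model is a small record that remembers which model it was calibrated for. The sketch below is hypothetical: the `ConstraintSet` class, its field names, and the pairing of constraint numbers with names (assumed from the order in which the case study lists them) are illustrative only and do not describe the layout of the actual governance file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintSet:
    """Hypothetical record of hard constraints locked to one model version."""
    calibrated_for: str
    constraints: dict[int, str]  # constraint number -> short name

    def needs_recalibration(self, deployed_model: str) -> bool:
        # A set tuned for one model's profile is re-evaluated for any other.
        return deployed_model != self.calibrated_for


# Constraint names come from the case study; the number-to-name pairing
# is an assumption based on the order in which they are listed.
opus_set = ConstraintSet(
    calibrated_for="Opus 4.7",
    constraints={
        13: "surgical changes",
        14: "no silent retries",
        15: "no false completion claims",
        16: "execute, don't advise",
    },
)

print(opus_set.needs_recalibration("Opus 4.7"))    # False
print(opus_set.needs_recalibration("Sonnet 4.6"))  # True
```

A check of this kind only flags that a review is due. Deciding which constraints to keep, revise, or add still depends on reading the new model's system card.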
For any model-selection decision:
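- Read the system card through all four lenses: uncertainty handling, initiative beyond instructions, over- or under-claiming, and observed failure modes. Document what was observed.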
- Map each documented failure mode to a constraint. Either you can write the constraint, or the deployment risk is unacceptable.
- Verify the vendor's primary defense matches your governance posture. If the vendor says "verification thoroughness is the primary defense" and your project has weak verification discipline, fix the discipline before choosing this model.
- Test the constraint set against representative tasks. Not benchmarks — representative tasks. Does the constraint set produce predictable behavior on the work you actually do?
- Lock the constraint set as a decision. Future operators inherit the constraints calibrated to this model. When the model is replaced, re-evaluate the constraints.
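The mapping step in the list above is in effect a coverage check: every documented failure mode needs at least one compensating constraint, and anything left over marks the deployment as not yet acceptable. A minimal sketch follows, assuming a hypothetical helper `unmapped_failure_modes` and a plain dictionary as the mapping. The failure-mode names come from the Opus 4.7 case, but the pairing of failure modes with constraints is illustrative, since the text does not state which constraint answers which finding.

```python
def unmapped_failure_modes(failure_modes, constraint_map):
    """Return documented failure modes that have no compensating constraint."""
    return [fm for fm in failure_modes if not constraint_map.get(fm)]


# Failure modes as documented in the Opus 4.7 case.
failure_modes = [
    "silent test deletion",
    "false completion claims",
    "downgrading execution tasks into advice",
]

# Illustrative pairing only (an assumption, not taken from the source).
constraint_map = {
    "false completion claims": ["no false completion claims"],
    "downgrading execution tasks into advice": ["execute, don't advise"],
}

gaps = unmapped_failure_modes(failure_modes, constraint_map)
if gaps:
    print("Unacceptable until constrained:", gaps)
else:
    print("Every documented failure mode has a constraint.")
```

In this example one failure mode is deliberately left unmapped, so the check reports a gap rather than passing.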
Mistakes to avoid:
- Selecting on benchmark scores. Benchmarks measure what they measure. The operator's task is rarely the benchmark task.
- Treating system cards as marketing. They are not. They are the most rigorous behavioral documentation the vendor publishes.
- Migrating without recalibrating constraints. A constraint set tuned for one model's behavioral profile may be wrong for another's.
- Trusting "newer is better." Newer models sometimes have new failure modes. Read the card.
Before deploying, answer six questions:
- Have you read the candidate model's system card cover-to-cover?
- Have you enumerated each observed failure mode?
- For each failure mode, have you written or identified a compensating hard constraint?
- Does the vendor's primary recommended defense align with your existing governance discipline?
- Have you tested the constraint set on representative tasks (not benchmarks)?
- Have you locked the constraint set as a project decision with reference to the model version?
If all six are yes, the model can be deployed.
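The six questions can also be recorded as an explicit gate, so that a missing answer counts as a "no" rather than being silently skipped. This is a sketch under stated assumptions: the `CHECKLIST` labels are shortened paraphrases of the questions, and `can_deploy` is a hypothetical helper, not part of any existing tool.

```python
CHECKLIST = (
    "system card read cover-to-cover",
    "failure modes enumerated",
    "compensating constraint for each failure mode",
    "vendor's primary defense aligns with governance discipline",
    "constraint set tested on representative tasks",
    "constraint set locked with model version",
)


def can_deploy(answers):
    """Return (ok, missing); ok is True only if every item is answered yes."""
    # An item that is absent from `answers` is treated as a "no".
    missing = [item for item in CHECKLIST if not answers.get(item, False)]
    return (not missing, missing)


answers = {item: True for item in CHECKLIST}
answers["constraint set tested on representative tasks"] = False

ok, missing = can_deploy(answers)
print(ok)       # False
print(missing)  # ['constraint set tested on representative tasks']
```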