
AI Vendor Evaluation Scorecard: Rate Every Vendor Across 6 Governance Dimensions
A complete weighted scorecard for evaluating AI vendors on governance, not just capability. Covers version control, audit logging, strategic alignment, data governance, compliance support, and exit strategy.
Most AI vendor evaluations focus on benchmark scores and demo performance. That's the wrong frame for enterprise procurement decisions. The model that scores best on MMLU in February may be different from the model you're running in November — and your vendor may not have told you the change was happening.
Capability matters. But governance is what determines whether a vendor relationship is sustainable, auditable, and defensible to regulators. This scorecard evaluates both.
Why Capability Benchmarks Aren't Enough
AI model performance is not static. Vendors update models continuously — to reduce costs, improve performance on their target use cases, respond to safety concerns, or comply with regulation. Unless you've negotiated version-pinned endpoints with contractual stability commitments, the model you evaluated is not necessarily the model you're running in production.
For mission-critical workflows — loan decisioning, medical triage support, legal document review, fraud detection — this matters. A silent model update can change your system's behavior, void your validation, and create a compliance gap without triggering any alert in your monitoring stack.
Beyond version stability, enterprise AI procurement needs to assess:
- Whether your data is used to train their next model
- Whether you can export your work if you need to leave
- Whether the vendor can produce documentation your regulators will accept
- Whether the vendor will still exist and serve your use case in three years
This scorecard addresses all four questions.
How to Use This Scorecard
Rate each criterion from 1 to 5 using the guidance provided. Calculate the dimension score as the average of its criteria scores. Multiply each dimension score by its weight. Sum the weighted scores for a total out of 5.0.
Score one scorecard per vendor. When evaluating multiple vendors, keep the sessions consistent: have the same person score all vendors on one criterion before moving to the next, which reduces anchoring bias.
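The arithmetic is easy to script when you're comparing several vendors. Here is a minimal sketch in Python (the dimension keys and weights mirror this scorecard; the criterion ratings are placeholders to replace with your own):

```python
# Dimension weights as defined in this scorecard (they sum to 1.0).
WEIGHTS = {
    "version_control": 0.20,
    "audit_logging": 0.20,
    "strategic_alignment": 0.15,
    "data_governance": 0.20,
    "compliance_support": 0.15,
    "exit_strategy": 0.10,
}

def total_score(criterion_scores: dict[str, list[int]]) -> float:
    """Average each dimension's 1-5 criterion ratings, then weight and sum."""
    assert set(criterion_scores) == set(WEIGHTS), "score every dimension"
    return sum(
        WEIGHTS[dim] * (sum(ratings) / len(ratings))
        for dim, ratings in criterion_scores.items()
    )

# Placeholder ratings for one vendor (four criteria per dimension).
vendor = {
    "version_control": [2, 3, 3, 2],
    "audit_logging": [4, 4, 3, 3],
    "strategic_alignment": [3, 3, 2, 4],
    "data_governance": [3, 4, 3, 3],
    "compliance_support": [4, 3, 3, 5],
    "exit_strategy": [2, 3, 2, 4],
}
print(f"Total weighted score: {total_score(vendor):.2f}")  # 3.14
```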
Dimension 1: Version Control and Change Management — Weight: 20%
| Criterion | Score 1 | Score 3 | Score 5 |
|---|---|---|---|
| Version-pinned endpoints available? | No pinning; "latest" only | Pinning available but limited retention | Yes, with multi-year stability commitments |
| Advance notice before model changes? | No notice | Some notice, no defined window | 30+ days' notice with a testing window |
| Explicit behavior change documentation? | None | Release notes, minimal detail | Full changelog with before/after examples |
| Rollback capability if update breaks your use case? | None | Manual rollback possible; no SLA | Contractual rollback right with defined SLA |
Scoring guidance: A vendor that doesn't offer version pinning scores 1 on the first criterion, whatever its other strengths. No amount of benchmark improvement compensates for not knowing which model is running in your production system.
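To make the first criterion concrete: against an OpenAI-compatible API, pinning means requesting a dated model snapshot rather than a floating alias. A sketch, assuming the vendor publishes dated snapshot identifiers (the names below are illustrative; confirm which identifiers your vendor contractually guarantees):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Floating alias: the vendor can repoint this to a new model at any time.
#   model="gpt-4o"

# Pinned snapshot: behavior is stable until the vendor retires the snapshot.
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # illustrative dated snapshot identifier
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```

Note the retention caveat in the table: a pinned snapshot only protects you for as long as the vendor keeps serving it, which is why the top score pairs pinning with multi-year stability commitments.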
Dimension 2: Audit and Logging — Weight: 20%
| Criterion | Score 1 | Score 3 | Score 5 |
|---|---|---|---|
| Detailed input/output logging available? | No logging | Basic logging, limited detail | Immutable, timestamped, full I/O logs |
| Logs exportable for compliance reporting? | No export | Manual export possible | Structured export via API |
| Retention period meets regulatory requirements? | Less than 1 year | 1-5 years | More than 10 years (or configurable) |
| Audit-grade log format (tamper-evident)? | No | Some integrity controls | Hash chain or equivalent; tamper-evident |
Scoring guidance: The EU AI Act mandates automatic logging for high-risk AI systems, and SR 11-7 expects firms to document the inputs and outputs behind consequential model decisions. If a vendor can't provide tamper-evident logs with sufficient retention, you'll have to build that infrastructure yourself — and there's no guarantee the vendor is recording what you need.
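The hash chain named in the last row is a simple construction: each entry's hash covers the previous entry's hash, so editing any past record invalidates every entry after it. A minimal sketch for illustration, not a substitute for an audited logging product:

```python
import hashlib
import json
import time

GENESIS = "0" * 64  # placeholder hash for the first entry

def append_entry(log: list[dict], record: dict) -> None:
    """Append an entry whose hash covers the previous entry's hash."""
    body = {
        "ts": time.time(),
        "record": record,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any tampered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("ts", "record", "prev_hash")}
        recomputed = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != recomputed:
            return False
        prev_hash = entry["hash"]
    return True

log: list[dict] = []
append_entry(log, {"input": "loan application #123", "output": "refer to analyst"})
append_entry(log, {"input": "loan application #124", "output": "approve"})
print(verify_chain(log))  # True
log[0]["record"]["output"] = "deny"
print(verify_chain(log))  # False: the edit breaks the chain
```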
Dimension 3: Strategic Alignment — Weight: 15%
| Criterion | Score 1 | Score 3 | Score 5 |
|---|---|---|---|
| Mission and customer segment alignment? | Vendor serves opposite use cases | Mixed alignment | Clearly aligned with your use case and sector |
| Major customer types disclosed? | Opaque | Partially disclosed | Fully disclosed with case studies |
| Public commitments about uses they won't serve? | None | Informal statements | Clear policy, published and contractually bindable |
| Financial stability / governance structure? | High risk (pre-revenue, unknown backers) | Some stability signals | Audited financials, stable governance, long runway |
Scoring guidance: A vendor building for consumers is not necessarily building for enterprise compliance requirements. Strategic misalignment means governance features will always be deprioritized. Check the vendor's published customer list, job postings, and product roadmap — these reveal actual priorities more reliably than sales decks do.
Dimension 4: Data Governance — Weight: 20%
| Criterion | Score 1 | Score 3 | Score 5 |
|---|---|---|---|
| Your data used for model training? | Used by default, no opt-out | Opt-out available | Never used; confirmed in contract with audit rights |
| Data residency options? | No regional control | Some options | Full regional control, documented and contractual |
| Data deletion on account termination? | Unclear | Documented process, no SLA | Documented with defined SLA and confirmation |
| Sub-processor list disclosed? | No | Partial disclosure | Full list with change notification requirements |
Scoring guidance: Data governance criteria carry the most legal weight. A vendor that uses your inputs as training data without opt-out is incompatible with most enterprise data handling policies and many regulatory frameworks (GDPR, HIPAA, attorney-client privilege contexts). Get this in writing — a vendor's privacy policy is not a contractual commitment.
Dimension 5: Regulatory Compliance Support — Weight: 15%
| Criterion | Score 1 | Score 3 | Score 5 |
|---|---|---|---|
| BAA available (HIPAA)? | No | Available but non-standard | Pre-approved form, straightforward process |
| EU AI Act compliance documentation? | None | Partial documentation | Yes, in Annex IV format |
| SR 11-7 / model risk documentation support? | None | Some documentation | Dedicated materials, responsive to validator questions |
| Independent security audits (SOC 2, ISO 27001)? | None | Outdated or partial | Current certifications, available for review |
Scoring guidance: Vendors that can't produce compliance documentation will cost you significant internal resources to work around. Before scoring a 5, verify the documentation is current — a SOC 2 report from 18 months ago may not satisfy your auditors.
Dimension 6: Exit Strategy — Weight: 10%
| Criterion | Score 1 | Score 3 | Score 5 |
|---|---|---|---|
| Fine-tuning work exportable? | No export | Partial export | Full export in open format (GGUF, SafeTensors, etc.) |
| Migration support documented? | None | Basic guidance | Full migration documentation with SLAs |
| Contract exit clauses for material behavior changes? | None | Informal commitments | Defined contractual triggers for exit rights |
| Portable API format? | Proprietary only | Partial compatibility | OpenAI-compatible or equivalent open standard |
Scoring guidance: Exit criteria are routinely underweighted in vendor evaluations because switching feels distant. Model the switching cost honestly: if this vendor changes terms, gets acquired, or materially degrades in quality, what does migration actually cost? That number should directly influence how much weight you put on exit criteria.
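The portable-API criterion is straightforward to test before you sign: if your integration targets an OpenAI-compatible endpoint, migration is mostly a configuration change. A sketch, assuming both the incumbent and the candidate expose that interface (the internal URL is hypothetical):

```python
from openai import OpenAI

# Incumbent vendor.
primary = OpenAI(api_key="...", base_url="https://api.openai.com/v1")

# Candidate replacement: any OpenAI-compatible endpoint, including a
# self-hosted server (e.g., vLLM) in front of a model you own.
fallback = OpenAI(api_key="...", base_url="https://llm.internal.example/v1")  # hypothetical

def complete(client: OpenAI, model: str, prompt: str) -> str:
    """Identical call shape against either provider."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

If migrating means rewriting every call site rather than swapping a base URL and a model name, that difference belongs in your switching-cost estimate.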
Scoring Interpretation
Calculate your total weighted score for each vendor:
Total Score = (D1 × 0.20) + (D2 × 0.20) + (D3 × 0.15) + (D4 × 0.20) + (D5 × 0.15) + (D6 × 0.10)
| Score Range | Interpretation |
|---|---|
| 4.0 – 5.0 | Proceed. Governance posture is strong. |
| 3.0 – 3.9 | Proceed with mitigation plan. Document compensating controls for gaps. |
| 2.0 – 2.9 | Significant risk. Do not deploy for regulated or high-risk use cases without substantial compensating controls. |
| Below 2.0 | Do not depend on this vendor for mission-critical workloads. |
Any individual criterion score of 1 in Dimensions 2 or 4 should be treated as a potential blocker regardless of total score — these are the areas where gaps are hardest to compensate for internally.
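Combining the formula, the interpretation table, and the blocker rule, this final step can be made mechanical. A sketch (the thresholds mirror the table above; pass the criterion ratings for at least the two blocker dimensions):

```python
BLOCKER_DIMENSIONS = {"audit_logging", "data_governance"}  # Dimensions 2 and 4

def interpret(total: float, criterion_scores: dict[str, list[int]]) -> str:
    """Map a weighted total to the interpretation table, honoring the blocker rule."""
    blocked = any(
        1 in ratings
        for dim, ratings in criterion_scores.items()
        if dim in BLOCKER_DIMENSIONS
    )
    if blocked:
        return "Potential blocker: criterion score of 1 in Dimension 2 or 4"
    if total >= 4.0:
        return "Proceed: governance posture is strong"
    if total >= 3.0:
        return "Proceed with mitigation plan; document compensating controls"
    if total >= 2.0:
        return "Significant risk: not for regulated or high-risk use without controls"
    return "Do not depend on this vendor for mission-critical workloads"

print(interpret(3.2, {"audit_logging": [3, 3, 4, 3], "data_governance": [3, 4, 3, 3]}))
# -> Proceed with mitigation plan; document compensating controls
```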
Applying the Scorecard: A Worked Example
Consider three options for a loan eligibility support tool:
Vendor A (major commercial LLM API): Strong on capability and compliance documentation (SOC 2, HIPAA BAA available). Weak on version pinning (aliases deprecated with 30-day notice only, no rollback SLA). Data governance is opt-out only. Scores approximately 3.2 overall — proceed with mitigation: implement your own input/output logging, negotiate version pinning, get the data processing addendum in writing.
Vendor B (a smaller AI startup): Excellent benchmark scores, compelling demo. No BAA, no Annex IV documentation, no audit logs, no data residency options. Scores approximately 1.8 — not viable for a regulated use case regardless of capability.
Owned model (fine-tuned, self-hosted): Scores 5.0 on Dimensions 1 and 4 by construction, and Dimension 2 is entirely within your control: you decide when the version changes, you own the logging stack, and your data never leaves your infrastructure. Regulatory compliance support (Dimension 5) depends on your internal processes, not a vendor's. Exit risk (Dimension 6) is effectively zero — you hold the weights.
The Owned Model Baseline
The scoring exercise makes explicit something that's easy to miss in capability comparisons: a model you own eliminates the most critical governance risks by construction.
- Version stability: your model doesn't update unless you update it.
- Audit logging: you control the logging stack.
- Data governance: your training data never leaves your environment.
- Exit strategy: you hold the weights in an open format.
This doesn't mean owned models are always the right answer — they require investment in fine-tuning infrastructure and expertise. But for regulated use cases where vendor governance scores consistently fall in the 2.5-3.5 range, the total cost of compensating controls often exceeds the cost of owning the model.
Running This Process
- Identify all AI vendors your organization uses or is evaluating (include embedded AI in SaaS products)
- Score each vendor using this scorecard — use the same evaluator for each dimension across vendors
- Document your scoring rationale, not just the numbers — auditors will want to see it
- For vendors scoring 2.0-3.9, document the compensating controls before deployment
- Re-evaluate annually, or immediately following material changes (acquisition, policy change, major model update)
The scorecard is a decision-support tool, not a veto mechanism. A 3.2 with a solid mitigation plan is a defensible procurement decision. A 1.8 with no mitigation plan is not — and when something goes wrong, the absence of this analysis will be the first thing an auditor or regulator looks for.