
    AI Vendor Evaluation Scorecard: Rate Every Vendor Across 6 Governance Dimensions

    A complete weighted scorecard for evaluating AI vendors on governance, not just capability. Covers version control, audit logging, strategic alignment, data governance, compliance support, and exit strategy.

Ertas Team

    Most AI vendor evaluations focus on benchmark scores and demo performance. That's the wrong frame for enterprise procurement decisions. The model that scores best on MMLU in February may be different from the model you're running in November — and your vendor may not have told you the change was happening.

    Capability matters. But governance is what determines whether a vendor relationship is sustainable, auditable, and defensible to regulators. This scorecard evaluates both.

    Why Capability Benchmarks Aren't Enough

    AI model performance is not static. Vendors update models continuously — to reduce costs, improve performance on their target use cases, respond to safety concerns, or comply with regulation. Unless you've negotiated version-pinned endpoints with contractual stability commitments, the model you evaluated is not necessarily the model you're running in production.

    For mission-critical workflows — loan decisioning, medical triage support, legal document review, fraud detection — this matters. A silent model update can change your system's behavior, void your validation, and create a compliance gap without triggering any alert in your monitoring stack.

    Beyond version stability, enterprise AI procurement needs to assess:

    • Whether your data is used to train their next model
    • Whether you can export your work if you need to leave
    • Whether the vendor can produce documentation your regulators will accept
    • Whether the vendor will still exist and serve your use case in three years

This scorecard addresses all of these questions.

    How to Use This Scorecard

    Rate each criterion from 1 to 5 using the guidance provided. Calculate the dimension score as the average of its criteria scores. Multiply each dimension score by its weight. Sum the weighted scores for a total out of 5.0.

    Score one scorecard per vendor. When evaluating multiple vendors, use identical scoring sessions — have the same person score all vendors on the same criterion before moving to the next, to reduce anchoring bias.
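
To make the arithmetic concrete, here is a minimal sketch in Python. The weights are the ones used in this scorecard; the dimension keys, function names, and criterion scores are illustrative.

```python
# Weights from this scorecard; criterion scores below are illustrative.
WEIGHTS = {
    "version_control": 0.20,
    "audit_logging": 0.20,
    "strategic_alignment": 0.15,
    "data_governance": 0.20,
    "compliance_support": 0.15,
    "exit_strategy": 0.10,
}

def dimension_score(criterion_scores):
    """Average of the 1-5 criterion scores within one dimension."""
    return sum(criterion_scores) / len(criterion_scores)

def total_score(criteria_by_dimension):
    """Weighted sum of dimension scores, out of 5.0."""
    return sum(
        WEIGHTS[dim] * dimension_score(scores)
        for dim, scores in criteria_by_dimension.items()
    )

# Illustrative vendor: one list of criterion scores per dimension.
vendor = {
    "version_control": [2, 3, 3, 1],
    "audit_logging": [3, 3, 3, 2],
    "strategic_alignment": [4, 3, 3, 4],
    "data_governance": [3, 3, 2, 3],
    "compliance_support": [4, 3, 3, 5],
    "exit_strategy": [2, 2, 1, 3],
}

print(f"Weighted total: {total_score(vendor):.2f}")  # out of 5.0
```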


    Dimension 1: Version Control and Change Management — Weight: 20%

Criterion | Score 1 | Score 3 | Score 5
Version-pinned endpoints available? | No pinning; "latest" only | Pinning available but limited retention | Yes, with multi-year stability commitments
Advance notice before model changes? | No notice | Some notice, no defined window | 30+ days notice with a testing window
Explicit behavior change documentation? | None | Release notes, minimal detail | Full changelog with before/after examples
Rollback capability if update breaks your use case? | None | Manual rollback possible; no SLA | Contractual rollback right with defined SLA

    Scoring guidance: A vendor that doesn't offer version pinning scores 1 on the first criterion regardless of any other quality. No amount of benchmark improvement compensates for the inability to know what model is running in your production system.
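
For API-delivered models, version pinning usually means requesting a dated snapshot rather than a floating alias. A minimal sketch, assuming an OpenAI-compatible chat API; the snapshot identifier is illustrative and naming conventions vary by vendor:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

# Floating alias: the vendor can swap the underlying model at any time.
#   model="gpt-4o"
# Dated snapshot: behavior stays fixed until the snapshot itself is retired,
# which is what the notice and rollback criteria above are about.
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # illustrative snapshot ID; check your vendor's naming
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```

Note that pinning covers the model identifier only; changes to the surrounding stack (safety filters, routing, rate limits) can still shift behavior, which is why the notice and rollback criteria matter even with pinning in place.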


    Dimension 2: Audit and Logging — Weight: 20%

Criterion | Score 1 | Score 3 | Score 5
Detailed input/output logging available? | No logging | Basic logging, limited detail | Immutable, timestamped, full I/O logs
Logs exportable for compliance reporting? | No export | Manual export possible | Structured export via API
Retention period meets regulatory requirements? | Less than 1 year | 1-5 years | More than 10 years (or configurable)
Audit-grade log format (tamper-evident)? | No | Some integrity controls | Hash chain or equivalent; tamper-evident

    Scoring guidance: SR 11-7 and the EU AI Act both require logs of model inputs and outputs for consequential decisions. If a vendor can't provide tamper-evident logs with sufficient retention, you'll have to build that infrastructure yourself — and there's no guarantee the vendor is recording what you need.
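
To make "hash chain or equivalent" concrete, here is a minimal sketch of a tamper-evident log in Python: each entry's hash covers its own content plus the previous entry's hash, so altering any earlier entry breaks the chain. The record fields and function names are illustrative, not a vendor's actual log format.

```python
import hashlib
import json
import time

def append_entry(log, record):
    """Append a log record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {
        "timestamp": time.time(),
        "record": record,  # e.g. {"input": ..., "output": ..., "model": ...}
        "prev_hash": prev_hash,
    }
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})
    return log

def verify_chain(log):
    """Recompute every hash; tampering with any earlier entry is detected."""
    prev_hash = "0" * 64
    for entry in log:
        body = {k: entry[k] for k in ("timestamp", "record", "prev_hash")}
        if entry["prev_hash"] != prev_hash:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"input": "loan application 123", "output": "refer to manual review", "model": "v1.3"})
append_entry(log, {"input": "loan application 124", "output": "approve", "model": "v1.3"})
print(verify_chain(log))  # True; editing any earlier field makes this False
```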


    Dimension 3: Strategic Alignment — Weight: 15%

Criterion | Score 1 | Score 3 | Score 5
Mission and customer segment alignment? | Vendor serves opposite use cases | Mixed alignment | Clearly aligned with your use case and sector
Major customer types disclosed? | Opaque | Partially disclosed | Fully disclosed with case studies
Public commitments about uses they won't serve? | None | Informal statements | Clear policy, published and contractually bindable
Financial stability / governance structure? | High risk (pre-revenue, unknown backers) | Some stability signals | Audited financials, stable governance, long runway

    Scoring guidance: A vendor building for consumers is not necessarily building for enterprise compliance requirements. Strategic misalignment means governance features will always be deprioritized. Check the vendor's published customer list, job postings, and product roadmap — these reveal actual priorities more than sales decks do.


    Dimension 4: Data Governance — Weight: 20%

Criterion | Score 1 | Score 3 | Score 5
Your data used for model training? | Used by default, no opt-out | Opt-out available | Never used; confirmed in contract with audit rights
Data residency options? | No regional control | Some options | Full regional control, documented and contractual
Data deletion on account termination? | Unclear | Documented process, no SLA | Documented with defined SLA and confirmation
Sub-processor list disclosed? | No | Partial disclosure | Full list with change notification requirements

    Scoring guidance: Data governance criteria carry the most legal weight. A vendor that uses your inputs as training data without opt-out is incompatible with most enterprise data handling policies and many regulatory frameworks (GDPR, HIPAA, attorney-client privilege contexts). Get this in writing — a vendor's privacy policy is not a contractual commitment.


    Dimension 5: Regulatory Compliance Support — Weight: 15%

Criterion | Score 1 | Score 3 | Score 5
BAA available (HIPAA)? | No | Available but non-standard | Pre-approved form, straightforward process
EU AI Act compliance documentation? | None | Partial documentation | Yes, in Annex IV format
SR 11-7 / model risk documentation support? | None | Some documentation | Dedicated materials, responsive to validator questions
Independent security audits (SOC 2, ISO 27001)? | None | Outdated or partial | Current certifications, available for review

    Scoring guidance: Vendors that can't produce compliance documentation will cost you significant internal resources to work around. Before scoring a 5, verify the documentation is current — a SOC 2 report from 18 months ago may not satisfy your auditors.


    Dimension 6: Exit Strategy — Weight: 10%

Criterion | Score 1 | Score 3 | Score 5
Fine-tuning work exportable? | No export | Partial export | Full export in open format (GGUF, SafeTensors, etc.)
Migration support documented? | None | Basic guidance | Full migration documentation with SLAs
Contract exit clauses for material behavior changes? | None | Informal commitments | Defined contractual triggers for exit rights
Portable API format? | Proprietary only | Partial compatibility | OpenAI-compatible or equivalent open standard

    Scoring guidance: Exit criteria are routinely underweighted in vendor evaluations because switching feels distant. Model the switching cost honestly: if this vendor changes terms, gets acquired, or materially degrades in quality, what does migration actually cost? That number should directly influence how much weight you put on exit criteria.
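
One way to pressure-test the "portable API format" criterion during evaluation: if the vendor exposes an OpenAI-compatible endpoint, switching providers (or moving to a self-hosted model) should only require changing the base URL and model name, not rewriting application code. A minimal sketch using the OpenAI Python client; the URLs, keys, and model names are placeholders:

```python
from openai import OpenAI

# Current vendor (placeholder URL, key, and model name).
vendor_a = OpenAI(base_url="https://api.vendor-a.example/v1", api_key="VENDOR_A_KEY")

# Candidate replacement, e.g. an OpenAI-compatible inference server
# running in front of your own exported weights.
in_house = OpenAI(base_url="https://inference.internal.example/v1", api_key="INTERNAL_KEY")

def summarize(client, model, text):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

# The application code is identical; only the client and model name differ.
print(summarize(vendor_a, "vendor-a-model", "Summarize this clause."))
print(summarize(in_house, "in-house-model", "Summarize this clause."))
```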


    Scoring Interpretation

    Calculate your total weighted score for each vendor:

    Total Score = (D1 × 0.20) + (D2 × 0.20) + (D3 × 0.15) + (D4 × 0.20) + (D5 × 0.15) + (D6 × 0.10)

Score Range | Interpretation
4.0 – 5.0 | Proceed. Governance posture is strong.
3.0 – 3.9 | Proceed with mitigation plan. Document compensating controls for gaps.
2.0 – 2.9 | Significant risk. Do not deploy for regulated or high-risk use cases without substantial compensating controls.
Below 2.0 | Do not depend on this vendor for mission-critical workloads.

    Any individual criterion score of 1 in Dimensions 2 or 4 should be treated as a potential blocker regardless of total score — these are the areas where gaps are hardest to compensate for internally.
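
The interpretation bands and the blocker rule translate directly into code. A small sketch; the thresholds mirror the table above, and the dimension keys match the scoring sketch earlier:

```python
BLOCKER_DIMENSIONS = {"audit_logging", "data_governance"}  # Dimensions 2 and 4

def interpret(total):
    """Map a weighted total (out of 5.0) to the bands in the table above."""
    if total >= 4.0:
        return "Proceed. Governance posture is strong."
    if total >= 3.0:
        return "Proceed with mitigation plan and documented compensating controls."
    if total >= 2.0:
        return "Significant risk. Not for regulated or high-risk use cases without substantial compensating controls."
    return "Do not depend on this vendor for mission-critical workloads."

def blockers(criteria_by_dimension):
    """Any criterion scored 1 in Dimensions 2 or 4 is a potential blocker."""
    return [dim for dim in BLOCKER_DIMENSIONS if 1 in criteria_by_dimension.get(dim, [])]

# A vendor can land in the 3.0-3.9 band and still carry a blocker,
# e.g. a 1 on data deletion, which needs its own review regardless of the total.
```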


    Applying the Scorecard: A Worked Example

    Consider three options for a loan eligibility support tool:

    Vendor A (major commercial LLM API): Strong on capability and compliance documentation (SOC 2, HIPAA BAA available). Weak on version pinning (aliases deprecated with 30-day notice only, no rollback SLA). Data governance is opt-out only. Scores approximately 3.2 overall — proceed with mitigation: implement your own input/output logging, negotiate version pinning, get the data processing addendum in writing.

    Vendor B (a smaller AI startup): Excellent benchmark scores, compelling demo. No BAA, no Annex IV documentation, no audit logs, no data residency options. Scores approximately 1.8 — not viable for a regulated use case regardless of capability.

    Owned model (fine-tuned, self-hosted): By definition, scores 5.0 on Dimensions 1, 2, and 4. You control the version, you own the logs, your data never leaves your infrastructure. Regulatory compliance support (Dimension 5) depends on your internal processes, not a vendor's. Exit risk (Dimension 6) is zero — you own the weights.

    The Owned Model Baseline

    The scoring exercise makes explicit something that's easy to miss in capability comparisons: a model you own eliminates the most critical governance risks by construction.

    Version stability: your model doesn't update unless you update it. Audit logging: you control the logging stack. Data governance: your training data never leaves your environment. Exit strategy: you hold the weights in an open format.

    This doesn't mean owned models are always the right answer — they require investment in fine-tuning infrastructure and expertise. But for regulated use cases where vendor governance scores consistently fall in the 2.5-3.5 range, the total cost of compensating controls often exceeds the cost of owning the model.


    Running This Process

    1. Identify all AI vendors your organization uses or is evaluating (include embedded AI in SaaS products)
    2. Score each vendor using this scorecard — use the same evaluator for each dimension across vendors
    3. Document your scoring rationale, not just the numbers — auditors will want to see it
    4. For vendors scoring 2.0-3.9, document the compensating controls before deployment
    5. Re-evaluate annually, or immediately following material changes (acquisition, policy change, major model update)

    The scorecard is a decision-support tool, not a veto mechanism. A 3.2 with a solid mitigation plan is a defensible procurement decision. A 1.8 with no mitigation plan is not — and when something goes wrong, the absence of this analysis will be the first thing an auditor or regulator looks for.
