    How to Evaluate AI Vendors on Governance, Not Just Capability
    vendor-evaluation · ai-governance · enterprise-ai · procurement · responsible-ai


    Capability benchmarks tell you what a model can do. Governance evaluation tells you whether you can safely depend on it for production AI. Here's the framework most teams skip.

    Ertas Team

    Most enterprise AI vendor evaluations follow the same process: run the model through benchmark tasks, evaluate output quality against your specific use case, compare pricing, check the security certifications page. If the model performs well and the price fits the budget, you move forward.

    That process is necessary. It's also incomplete. Capability benchmarks tell you what a model can do on the day you evaluated it. Governance evaluation tells you whether you can safely depend on the model in production over the 24-36 months that enterprise software deployments actually last.

    Here's the framework most teams skip — six dimensions of governance evaluation with specific questions for each.

    The Benchmark Trap

    Capability evaluation has become sophisticated. MMLU scores, GPQA benchmarks, coding evals, context length tests, multimodal capability assessment — the industry has developed good tools for measuring what models can do at a point in time.

    What benchmarks don't tell you: whether the vendor will notify you when behavior changes, how long you can pin to a specific version, what your legal position is if the vendor mishandles your data, whether you'll have access to the audit trail your compliance team needs, or how difficult exit will be if your requirements change.

    These questions don't show up in benchmark rankings. But they're what determines whether your production deployment remains viable over time.

    Dimension 1: Version Control and Change Management

    Models change. Safety recalibrations, performance improvements, fine-tuning updates — all of these affect model behavior, and most of them happen without the announcement fanfare of a new model release.

    The questions to ask:

    How are model updates communicated to customers? Look for specific answers: email notification to a designated contact, update to a changelog, API version header changes. "We post on our blog" is not an acceptable answer for enterprise production dependencies.

    How much notice do you provide before behavior changes affect production deployments? Anything less than 2 weeks is too short for a production enterprise system. 4-8 weeks is reasonable. Some vendors offer longer for enterprise contract customers.

    Can we pin to a specific model version? Most vendors support this. The follow-up question matters more: for how long? A 3-month pinning window is very different from a 12-month window. You need time to evaluate new versions, adapt your prompts, and migrate deliberately.

    What is your model deprecation timeline? How much notice do you receive when a model version is being discontinued, and what migration support do you get? This affects how you should budget for migration engineering work.

    Do you provide staging or preview access to new model versions before they're deployed to production? This is increasingly a differentiator for enterprise-focused vendors. Preview access lets you validate new versions against your eval set before they affect your production system.
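    Pinning is easy to ask for and easy to forget to enforce. Here is a minimal sketch of how a client layer might enforce a pinning window on your side, assuming entirely hypothetical model names, dates, and a local pin registry (nothing here reflects any specific vendor's API):

```python
from datetime import date

# Hypothetical pin registry -- model names and dates are illustrative only.
PINS = {
    "chat-model": {
        "version": "chat-model-2025-06-01",   # granular version id, not a floating alias
        "pinned_until": date(2026, 6, 1),     # end of the vendor's pinning window
    },
}

def resolve_model(alias: str, today: date) -> str:
    """Return the pinned version id for an alias, failing loudly once the
    pinning window lapses -- forcing a deliberate migration decision rather
    than a silent fallback to whatever the alias points at."""
    pin = PINS[alias]
    if today > pin["pinned_until"]:
        raise RuntimeError(
            f"Pin for {alias!r} expired on {pin['pinned_until']}; "
            "re-run evals against the new version before unpinning."
        )
    return pin["version"]
```

    The point of the hard failure is cultural as much as technical: the pin expiring should be a planned migration event in your calendar, not something discovered in production behavior.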

    Dimension 2: Audit and Logging Capabilities

    Enterprise AI deployments need audit trails. Regulators want them. Legal teams need them. Risk managers require them. The question is whether the vendor can provide them — or whether you'll be building audit infrastructure on top of a vendor's basic request logs.

    What do you log, for how long, and with what access controls? Minimum requirement for regulated industries: timestamped request/response logs, retained for your audit retention period, with controlled access. Many vendors provide 30-90 day retention by default. Regulated industries may need 7 years.

    Can we export logs for our own compliance reporting? Logs that exist only in the vendor's system aren't fully useful for compliance. You need the ability to export structured logs to your own compliance infrastructure.

    Do you offer audit-grade logging? This means immutable (logs can't be modified after creation), timestamped with verifiable timestamps, structured for programmatic analysis, and with chain-of-custody documentation. This is a higher bar than standard request logging, and not all vendors meet it.

    What model version information is included in logs? For audit purposes, you need to be able to reconstruct which specific model version produced a given output. If logs don't include model version identifiers at a granular level, your audit trail has a gap.
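    If the vendor's logging falls short of the audit-grade bar, one way to approximate it on your side of the boundary is a hash-chained log: each record commits to the previous one, so any later modification breaks the chain. A sketch, with illustrative field names rather than any vendor's schema:

```python
import hashlib
import json
from datetime import datetime, timezone

def append_audit_entry(log: list, request: str, response: str,
                       model_version: str) -> dict:
    """Append a tamper-evident audit record. Each entry stores the hash of
    the previous entry, so modifying any past record invalidates the chain."""
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,  # granular id, not just a family alias
        "request": request,
        "response": response,
        "prev_hash": prev_hash,
    }
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)
    return entry

def verify_chain(log: list) -> bool:
    """Recompute every hash; any tampering or reordering returns False."""
    prev = "0" * 64
    for e in log:
        if e["prev_hash"] != prev:
            return False
        body = {k: v for k, v in e.items() if k != "entry_hash"}
        if hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest() != e["entry_hash"]:
            return False
        prev = e["entry_hash"]
    return True
```

    This is a sketch of the tamper-evidence property, not a full chain-of-custody system; production deployments would also need write-once storage and trusted timestamping.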

    Dimension 3: Strategic Alignment and Mission Stability

    This dimension became more visible in 2026 but has always mattered. Who your AI vendor is, who they serve, and where they're going affects what their model gets trained and optimized for.

    Who are your major customers by industry? A vendor whose largest customers are in healthcare and financial services has different training priorities than one whose largest customers are in defense and government. Neither is wrong — but the priority alignment matters for your use case.

    Have you made public commitments about use cases you won't support? Vendors who have clear, public use case restrictions have thought through their mission and made it actionable. Vague statements about "responsible AI" without specific commitments are less useful for your risk assessment.

    What is your funding structure, and what investor obligations affect your product direction? Venture-backed companies face pressure to grow into new markets. That pressure can affect who they sell to and what capabilities they develop. Understanding who funds the vendor and what they expect matters for 3-year dependency planning.

    What significant contracts or partnerships have you entered in the past 12 months? OpenAI's DoD contract is the current prominent example, but vendor partnerships and customer relationships that affect training priorities can be more subtle. This question surfaces them.

    The answer to this dimension informs how you weight the vendor's strategic alignment with your organization's risk tolerance — not whether any specific decision is right or wrong.

    Dimension 4: Data Governance

    This dimension is often evaluated through the lens of security certifications, but certifications tell you about controls at a moment in time. The governance questions go deeper.

    Is our data used for model training? Many vendors have moved to opt-out-by-default models where enterprise data is excluded from training unless you opt in. Verify this explicitly and get it in writing — not just in the privacy policy, but in your contract.

    Where is our data processed, and what are the data residency options? For organizations with data localization requirements — EU customers under GDPR, organizations in countries with strict data sovereignty laws, regulated industries with explicit processing location requirements — this is a hard constraint, not a preference.

    What happens to our data on account termination? You want a specific answer: your data is deleted within X days of termination, with written confirmation. "We handle it per our privacy policy" is not specific enough.

    Who has access to our data within your organization? Support engineers? Data science teams? Model training pipelines? Know the access surface.

    Have you had any data incidents affecting enterprise customer data in the past 24 months? Ask directly. Check news sources independently. The answer and the vendor's disclosure posture are both informative.

    Dimension 5: Regulatory Compliance Support

    A vendor being compliant itself is different from the vendor providing documentation that helps your organization satisfy its own regulatory obligations. Both matter.

    Do you have an EU AI Act compliance framework? For organizations with EU operations, the AI Act creates obligations on both the AI provider and the deployer. Understand what the vendor provides and what remains your obligation.

    Can you support our specific regulatory requirements? For healthcare: HIPAA, BAA availability, and any specific clinical system requirements. For financial services: SR 11-7 model risk management guidance, and your specific regulator's AI guidance. For legal: bar association AI use guidance. Ask about your specific requirements, not generic "compliance" support.

    Do you provide compliance documentation we can use in our own audits? Audit-ready documentation — vendor risk questionnaire responses, security assessment reports, control attestations — saves significant time in your own compliance processes. Some vendors provide this proactively for enterprise customers. Others require you to generate it yourself from their raw policy documents.

    What is your incident disclosure timeline for security events? Regulatory requirements often mandate customer notification within specific timeframes (72 hours under GDPR, for example). Know what you'll get from the vendor and whether it satisfies your own notification obligations.

    Dimension 6: Exit Strategy

    No vendor relationship lasts forever. Regulatory requirements change, better options emerge, vendors make strategic decisions that change the relationship. Your procurement framework should evaluate exit before you need it.

    What does model migration look like if we need to switch? Concretely: what's the migration path, what documentation exists, what support does the vendor provide, and what is the typical timeline for an enterprise migration?

    Can we export our fine-tuning work? If you've invested in customizing a model through the vendor's fine-tuning API, do you get the weights or just the performance benefit? Some vendors give you the fine-tuned weights. Others don't.

    What's the portability of customizations we've made? System prompts, few-shot examples, and retrieval configurations are generally portable. Fine-tuned model weights, custom function call definitions, and vendor-specific features may not be.

    What happens to our integrations if you're acquired? Acquisitions change vendor behavior more than almost anything else. Ask explicitly what acquisition protections exist in your contract.
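    One design choice that keeps exits cheaper is to hold portable assets (system prompts, few-shot examples, retrieval configuration) in a vendor-neutral form and confine vendor-specific translation to a thin adapter. A sketch, with a hypothetical `VendorAClient` standing in for any provider:

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class PromptAsset:
    """Vendor-neutral assets: these survive a migration unchanged."""
    system_prompt: str
    few_shot_examples: list = field(default_factory=list)

class ChatProvider(Protocol):
    """The interface the rest of your codebase depends on."""
    def complete(self, asset: PromptAsset, user_message: str) -> str: ...

class VendorAClient:
    """Hypothetical adapter -- only this class knows vendor-specific request
    formats, auth, and feature quirks. Swapping vendors means writing one
    new adapter, not touching call sites."""
    def complete(self, asset: PromptAsset, user_message: str) -> str:
        # Translate the neutral asset into this vendor's request format here.
        # (Stubbed response; a real adapter would call the vendor API.)
        return f"[vendor-a response to: {user_message}]"
```

    The adapter pattern doesn't make fine-tuned weights portable — nothing does, short of owning them — but it keeps the migration surface for everything else to a single class.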

    The Model Ownership Escape Valve

    If you find yourself uncertain about a vendor across several of these governance dimensions, the practical answer often isn't to keep evaluating vendors. It's to use APIs for development and experimentation — where governance risk is manageable — and build toward owned models for production workloads where governance certainty matters.

    When you own the model weights, most of these governance questions become irrelevant for your production system. Version control is your version control. Audit logging is your audit logging infrastructure. Strategic alignment is your organization's alignment, not a vendor's. Data governance is your data on your infrastructure.

    The path to model ownership is detailed in What AI Model Ownership Actually Means. The Enterprise AI Vendor Risk Guide covers where governance risk fits in the overall risk framework.

    Making the Evaluation Operational

    The framework is only useful if it's operationalized. Practical implementation:

    Add governance questions to your standard vendor questionnaire. Build scoring rubrics for each dimension. Require responses in writing, not just verbal reassurances in sales calls. Include governance evaluation results in the documentation trail for AI procurement decisions.
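    A scoring rubric can be as simple as a weighted average over the six dimensions. The weights below are illustrative defaults to adapt to your risk profile, not a recommendation:

```python
# Illustrative weights over the six governance dimensions (must sum to 1.0).
WEIGHTS = {
    "version_control": 0.20,
    "audit_logging": 0.20,
    "strategic_alignment": 0.15,
    "data_governance": 0.20,
    "regulatory_support": 0.15,
    "exit_strategy": 0.10,
}

def governance_score(scores: dict) -> float:
    """Weighted average of per-dimension scores on a 1-5 scale.
    Requires every dimension to be scored -- no silent gaps."""
    assert set(scores) == set(WEIGHTS), "score every dimension"
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 2)
```

    The number itself matters less than the discipline: writing down per-dimension scores, with the vendor's written answers attached, gives procurement decisions an auditable trail.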

    Review your vendor governance assessments annually, or when the vendor makes material announcements. Vendor strategic decisions like major government contracts, acquisitions, or significant funding rounds warrant an unscheduled review.

    And for high-stakes environments — clinical, legal, financial, regulated operations — treat governance evaluation with the same rigor you'd apply to evaluating a critical infrastructure vendor. Because that's what production AI is.
