    From Shadow AI to Sanctioned AI: The Enterprise Migration Playbook

    The complete journey from 'employees are using ChatGPT with company data' to 'we have sanctioned, auditable, on-premise AI tools.' A phased playbook with timelines, resource estimates, and ROI calculations.

    Ertas Team

    Your employees are using ChatGPT with company data. You know it. They know you know it. And yet the problem persists because knowing about shadow AI and fixing it are two different things.

    This is the migration playbook. It covers the complete journey from uncontrolled external AI usage to sanctioned, auditable, on-premise AI tools — broken into five phases with specific timelines, resource requirements, and decision points. It's designed for organizations that have moved past the "should we do something?" stage and into the "how do we actually do this?" stage.

    The total timeline is 24–36 weeks. The total cost ranges from $50,000 to $200,000 depending on scale. The alternative — doing nothing — costs an average of $19.5 million per year in shadow AI-related insider risk losses.

    The Migration Timeline at a Glance

    | Phase | Timeline | Focus | Key Deliverable |
    | --- | --- | --- | --- |
    | Phase 1: Discovery | Weeks 1–4 | Audit, quantify, prioritize | Shadow AI assessment report |
    | Phase 2: Quick Wins | Weeks 5–12 | Deploy basic internal alternative | Internal AI chatbot live |
    | Phase 3: Data Foundation | Weeks 9–24 | Build data preparation pipeline | Enterprise data ready for AI |
    | Phase 4: Custom Models | Weeks 17–32 | Fine-tune domain-specific models | Production custom models |
    | Phase 5: Governance | Week 12+ (ongoing) | Monitoring, policy, audits | Mature AI governance program |

    Note: Phases 3 and 4 overlap with earlier phases by design. You don't wait for governance to start data preparation, and you don't wait for perfect data to start fine-tuning.


    Phase 1: Discovery (Weeks 1–4)

    Objective: Understand what shadow AI exists in your organization, quantify the risk, and identify the highest-value use cases that employees are solving with external tools.

    Resources: 1 security analyst, 1 IT operations lead, executive sponsor. Budget: $5,000–$15,000 (monitoring tools and analysis time).

    Step 1.1: Audit Current Usage

    Deploy network monitoring to identify traffic to known AI tool domains. The major targets:

    • LLM providers: openai.com, anthropic.com, gemini.google.com, chat.mistral.ai
    • Code assistants: copilot.github.com, cursor.sh, codeium.com
    • Embedded AI features: notion.so/ai, docs.google.com (Gemini features), bing.com/chat
    • AI aggregators: poe.com, huggingface.co, together.ai

    Most enterprise firewalls and proxy servers can generate domain-level traffic reports without new tooling. You're looking for:

    • Number of unique employees accessing AI tools
    • Volume of data transmitted (outbound request sizes)
    • Frequency of usage (daily, weekly, one-time)
    • Departments with highest usage
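    Most of this summary can be scripted from a domain-level proxy export. A minimal sketch, assuming CSV-style rows with `user`, `domain`, `bytes_out`, and `dept` fields (those field names are illustrative; match them to whatever your proxy actually exports):

```python
from collections import defaultdict

# Known AI tool domains from the audit list above (extend as needed).
AI_DOMAINS = {
    "openai.com", "anthropic.com", "gemini.google.com", "chat.mistral.ai",
    "copilot.github.com", "cursor.sh", "codeium.com",
    "poe.com", "huggingface.co", "together.ai",
}

def is_ai_domain(domain):
    """True if the domain or any subdomain matches a known AI tool."""
    return any(domain == d or domain.endswith("." + d) for d in AI_DOMAINS)

def summarize_ai_traffic(rows):
    """Aggregate proxy log rows (dicts with user, domain, bytes_out, dept)
    into the Phase 1 metrics: unique users, outbound volume, usage by dept."""
    users, total_bytes, by_dept = set(), 0, defaultdict(int)
    for row in rows:
        if is_ai_domain(row["domain"]):
            users.add(row["user"])
            total_bytes += int(row["bytes_out"])
            by_dept[row["dept"]] += 1
    return {"unique_users": len(users),
            "bytes_out": total_bytes,
            "requests_by_dept": dict(by_dept)}
```

    Run it weekly and keep the outputs; the same numbers become your Phase 2 adoption baseline.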

    Step 1.2: Quantify Risk

    Translate usage data into risk metrics:

    | Risk Factor | How to Measure | Benchmark |
    | --- | --- | --- |
    | Exposure breadth | % of employees using external AI | Industry average: 77% |
    | Data sensitivity | Sample outbound prompts for PII/PHI/IP | Industry average: 1.6% contain policy violations |
    | Account visibility | % using corporate vs. personal accounts | Industry average: 82% personal accounts |
    | Tool diversity | Number of distinct AI tools in use | Typical enterprise: 15–40 distinct tools |
    | Volume | Average prompts per user per day | Typical: 8–12 per active user |

    Calculate your estimated annual violation count: (active users) × (prompts/day) × (1.6%) × (220 working days). For a 1,000-person company with 60% active AI users: 600 × 10 × 0.016 × 220 = 21,120 estimated annual violations.
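    The same formula as a reusable function, so you can rerun it with your own numbers:

```python
def estimated_annual_violations(active_users, prompts_per_day,
                                violation_rate=0.016, working_days=220):
    """Estimated policy-violating prompts per year: the Step 1.2 formula,
    with the 1.6% industry-average violation rate as the default."""
    return active_users * prompts_per_day * violation_rate * working_days

# 1,000-person company, 60% active AI users, 10 prompts/day:
print(estimated_annual_violations(600, 10))  # 21120.0
```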

    Step 1.3: Identify High-Value Use Cases

    This is the most important discovery step and the one most organizations skip. Survey employees (anonymously) to understand what they're using AI for:

    • What tasks do you use AI tools for?
    • How much time does it save you per week?
    • What data do you typically provide to the AI tool?
    • If we provided an internal AI tool, what capabilities would it need?

    Common findings:

    | Use Case | Typical Departments | Time Saved | Data Sensitivity |
    | --- | --- | --- | --- |
    | Writing and editing | All | 3–5 hrs/week | Low to Medium |
    | Code generation and debugging | Engineering | 5–10 hrs/week | High |
    | Document summarization | Legal, Finance, Ops | 2–4 hrs/week | High |
    | Data analysis | Finance, Marketing, Ops | 3–6 hrs/week | Medium to High |
    | Research | All | 2–3 hrs/week | Low |
    | Email drafting | All | 1–2 hrs/week | Low to Medium |

    The discovery phase output is a Shadow AI Assessment Report that includes: usage metrics, risk quantification, top use cases by department, and a prioritized list of capabilities the internal platform must support.


    Phase 2: Quick Wins (Weeks 5–12)

    Objective: Deploy a basic internal AI chatbot that gives employees an immediate alternative to external tools. This reduces data leakage while you build the full solution.

    Resources: 1 ML/DevOps engineer, 1 system administrator. Budget: $10,000–$30,000 (hardware + setup).

    Step 2.1: Deploy Ollama + Open WebUI

    The fastest path to a functional internal AI chatbot:

    Hardware requirements (minimum):

    • 1 server with an NVIDIA GPU (RTX 4090 for small teams, A100 for 100+ users)
    • 32GB+ RAM, 500GB+ SSD storage
    • Internal network connectivity (no internet access required for inference)

    Software stack:

    • Ollama for model serving
    • Open WebUI for the user interface
    • NGINX for load balancing (if multiple GPUs)
    • LDAP/SSO integration for authentication

    Model selection for Phase 2:

    | Model | Size | Good For | Limitations |
    | --- | --- | --- | --- |
    | Llama 3.3 70B | 40GB VRAM | General tasks, writing, analysis | Slower on consumer GPUs |
    | Qwen 2.5 32B | 20GB VRAM | Code, multilingual, analysis | Less conversational polish |
    | Mistral Small 24B | 14GB VRAM | Fast general usage | Less capable on complex reasoning |
    | DeepSeek-R1 Distill 32B | 20GB VRAM | Reasoning, math, analysis | Slower (chain-of-thought) |

    Start with one general-purpose model (Llama 3.3 or Qwen 2.5) and expand based on employee feedback.
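    A quick way to sanity-check the VRAM figures in the table: quantized weights occupy roughly (parameters × bits per weight) / 8 bytes, plus runtime overhead for the KV cache and serving stack. A back-of-the-envelope estimator (the 4.5-bit and 2GB constants are rules of thumb, not vendor specs):

```python
def inference_vram_gb(params_billion, bits_per_weight=4.5, overhead_gb=2.0):
    """Rule-of-thumb VRAM estimate for a quantized model. ~4.5 bits/weight
    approximates Q4 quantization plus its metadata; overhead_gb covers the
    KV cache and runtime. Treat the result as a planning estimate only."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# Llama 3.3 70B at ~4-bit lands near the 40GB figure in the table above.
print(round(inference_vram_gb(70)))  # 41
```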

    Step 2.2: Announce and Migrate

    The announcement matters as much as the technology. Frame it as:

    • "We're providing a better tool" — not "we're blocking the tools you like"
    • Demonstrate that the internal tool handles the top use cases identified in Phase 1
    • Provide migration guides: "If you were using ChatGPT for X, here's how to do X with the internal tool"
    • Offer drop-in training sessions by department

    Step 2.3: Measure Adoption

    Track weekly:

    • Internal platform: unique users, prompts per day, satisfaction scores
    • External AI: traffic volume (should be declining)
    • Support tickets: what's missing, what's not working

    Expected result by end of Phase 2: 40–60% reduction in external AI tool usage. The remaining usage will be for capabilities the basic platform doesn't yet provide (code assistance, document upload, specialized tasks). That's what Phases 3 and 4 address.


    Phase 3: Data Foundation (Weeks 9–24)

    Objective: Build the data preparation pipeline that transforms your enterprise knowledge into training data for custom AI models. This is the foundation for Phase 4.

    Resources: 1–2 data engineers, 1 domain expert (part-time per department). Budget: $15,000–$50,000 (tooling + engineering time).

    Why This Phase Exists

    Generic open-source models (deployed in Phase 2) are good at general tasks. They know nothing about your specific products, processes, terminology, customers, or domain. For an internal AI platform to outperform ChatGPT for your employees, it needs to know your business.

    That knowledge comes from your enterprise data. But enterprise data is messy: scattered across file shares, wikis, Slack channels, email archives, databases, and document management systems. Before you can fine-tune a model, you need to extract, clean, structure, and validate that data.

    Step 3.1: Identify Data Sources

    Map the knowledge repositories across your organization:

    | Source Type | Examples | Typical Volume | Extraction Complexity |
    | --- | --- | --- | --- |
    | Documents | PDFs, Word files, presentations | 10K–1M files | Medium (OCR, layout parsing) |
    | Knowledge bases | Confluence, Notion, SharePoint | 1K–100K pages | Low (API extraction) |
    | Communications | Slack, Teams, email archives | High volume | High (noise filtering, privacy) |
    | Databases | CRM, ERP, ticketing systems | Structured data | Low (SQL queries) |
    | Code repositories | Git repos, documentation | Varies | Low (file system access) |
    | Specialized systems | EMR (healthcare), case management (legal) | Varies | High (proprietary formats) |

    Step 3.2: Extract and Process

    Build an extraction pipeline that handles your specific data sources. The typical pipeline:

    1. Extract: Pull raw content from source systems. For documents, this means OCR and layout parsing (tools like Docling, Unstructured.io, or Apache Tika). For structured data, this means SQL queries and API calls.

    2. Clean: Remove duplicates, boilerplate, headers/footers, navigation elements, and other noise. For communications data, filter out small talk, social messages, and non-work content.

    3. Chunk: Break documents into semantically meaningful chunks (paragraphs, sections, Q&A pairs). Chunk size depends on the intended use: RAG retrieval works best with 200–500 token chunks; fine-tuning works best with complete conversation or document examples.

    4. Structure: Convert cleaned content into training format. For fine-tuning: instruction/response pairs. For RAG: indexed document chunks with metadata.

    5. Validate: Human review of a sample (5–10%) to verify quality, accuracy, and absence of sensitive data that shouldn't be in the training set.
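    Step 3 of the pipeline can be sketched as a greedy paragraph-based chunker targeting the 200–500 token range. This version approximates token counts from word counts; swap in a real tokenizer (e.g. tiktoken or your model's tokenizer) for production use:

```python
def chunk_text(text, max_tokens=400, words_per_token=0.75):
    """Greedy chunker: pack whole paragraphs into chunks of at most
    ~max_tokens, never splitting a paragraph. Token counts are
    approximated as words / words_per_token (a rough heuristic)."""
    max_words = int(max_tokens * words_per_token)
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

    Keeping paragraphs intact matters for RAG quality: a chunk that ends mid-sentence retrieves poorly and reads worse when cited.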

    Step 3.3: Build the Data Quality Pipeline

    Data quality is the single biggest determinant of fine-tuned model performance. Bad data in, bad model out. Establish quality checks:

    • Accuracy: Is the information in the training data correct and current?
    • Relevance: Does the data represent the knowledge employees actually need?
    • Completeness: Are there gaps in topic coverage?
    • Consistency: Does the data contain contradictions?
    • Privacy: Has all PII/PHI been removed or appropriately handled?

    Budget 40–60% of Phase 3 time on data quality. Teams that rush through data preparation and move to fine-tuning quickly consistently get worse model performance than teams that spend more time on data quality and less on model tuning.
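    The privacy check is a good candidate for automation as a first pass, with humans reviewing whatever gets flagged. A sketch with illustrative regex patterns only; real PII detection needs a dedicated tool (e.g. Microsoft Presidio) and locale-specific rules:

```python
import re

# Illustrative patterns only; production PII detection needs a dedicated
# tool and locale-aware rules (these miss many real-world formats).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def flag_pii(text):
    """Return the list of PII types detected in a training record."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

def filter_dataset(records):
    """Split records into (clean, flagged) for the human review queue."""
    clean, flagged = [], []
    for r in records:
        (flagged if flag_pii(r) else clean).append(r)
    return clean, flagged
```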

    Step 3.4: Establish an Ongoing Data Pipeline

    Enterprise knowledge changes constantly. New products launch, procedures update, regulations change. The data pipeline must be continuous, not one-time:

    • Scheduled extraction from source systems (weekly or monthly)
    • Automated quality checks
    • Human review queue for flagged content
    • Version control for training datasets
    • Documentation of data lineage (where each piece of training data came from)

    Phase 4: Custom Models (Weeks 17–32)

    Objective: Fine-tune domain-specific models on your enterprise data that outperform generic models for your specific use cases.

    Resources: 1 ML engineer, domain experts (part-time). Budget: $15,000–$75,000 (compute + tooling).

    Why Fine-Tune?

    A common question: "Why not just use RAG (retrieval-augmented generation) with the base model and skip fine-tuning?"

    RAG and fine-tuning solve different problems:

    | Capability | RAG | Fine-Tuning | Both |
    | --- | --- | --- | --- |
    | Access to current information | Yes | No (static at training time) | Yes |
    | Domain-specific terminology and style | Partially | Yes | Yes |
    | Following organizational processes | No | Yes | Yes |
    | Reducing hallucination on domain topics | Partially | Yes | Yes |
    | Handling novel questions | Yes | Limited to training distribution | Yes |

    The best enterprise AI systems use both: fine-tuned models for domain expertise and style, plus RAG for current information and source citations. Phase 4 covers fine-tuning; RAG can be layered on during or after this phase.

    Step 4.1: Select Base Models for Fine-Tuning

    Choose base models based on your primary use cases:

    | Use Case | Recommended Base Model | Fine-Tuning Method | Typical Training Time |
    | --- | --- | --- | --- |
    | General enterprise assistant | Llama 3.3 70B | QLoRA | 4–8 hours on 1× A100 |
    | Code assistance | Qwen 2.5 Coder 32B | QLoRA | 2–4 hours on 1× A100 |
    | Document analysis | Llama 3.1 8B | Full fine-tune or LoRA | 1–2 hours on 1× A100 |
    | Specialized domain (legal, medical) | Llama 3.3 70B | QLoRA | 4–8 hours on 1× A100 |

    QLoRA (Quantized Low-Rank Adaptation) is the standard method for enterprise fine-tuning: it requires less GPU memory than full fine-tuning while achieving comparable results for most use cases.

    Step 4.2: Fine-Tune and Evaluate

    The fine-tuning cycle:

    1. Prepare training data: Convert Phase 3 outputs into the model's expected format (typically instruction/response pairs in JSONL)
    2. Configure training: Set hyperparameters (learning rate, epochs, LoRA rank). Start with established defaults and adjust based on evaluation results.
    3. Train: Run the fine-tuning job. Monitor loss curves for convergence.
    4. Evaluate: Test the fine-tuned model against a held-out evaluation set. Measure:
      • Accuracy: Does the model give correct, domain-appropriate answers?
      • Style: Does it match your organization's tone and terminology?
      • Safety: Does it refuse to provide information it shouldn't?
      • Comparison: Side-by-side evaluation against the base model and against ChatGPT/Claude for the same prompts
    5. Iterate: If evaluation results are below target, diagnose the issue (usually data quality) and retrain.
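    Step 1's JSONL conversion, sketched for the chat-message schema that most open-source fine-tuning stacks (Axolotl, TRL, and similar) accept. The exact schema varies by framework, so check yours before training; the system prompt here is a placeholder:

```python
import json

def to_jsonl(pairs, path, system="You are the internal enterprise assistant."):
    """Write (instruction, response) pairs as chat-format JSONL, one
    training example per line. The system prompt is an illustrative
    placeholder; set it to match your deployment."""
    with open(path, "w", encoding="utf-8") as f:
        for instruction, response in pairs:
            record = {"messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": instruction},
                {"role": "assistant", "content": response},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```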

    Step 4.3: Deploy to Production

    Replace or augment the Phase 2 base models with fine-tuned models:

    • Deploy fine-tuned models alongside base models (let employees choose)
    • A/B test: route 50% of requests to the fine-tuned model and measure satisfaction
    • Collect feedback: thumbs up/down on responses, with optional written feedback
    • Plan for regular retraining (quarterly or when significant new data is available)
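    For the A/B test, hash-based bucketing keeps each employee's assignment stable across sessions, which makes satisfaction comparisons cleaner than random per-request routing. A minimal sketch:

```python
import hashlib

def route_model(user_id, treatment_share=0.5):
    """Deterministically assign a user to the fine-tuned or base model.
    Hashing the user ID (rather than random choice per request) means a
    given user always lands in the same arm of the A/B test."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "fine-tuned" if bucket < treatment_share * 100 else "base"
```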

    Expected result by end of Phase 4: Internal AI platform handles 80–90% of the use cases employees were previously solving with external tools, with equal or better quality for domain-specific tasks. External AI tool usage drops to under 10% — mostly edge cases and personal preference for general research.


    Phase 5: Governance (Week 12+, Ongoing)

    Objective: Establish the monitoring, policy, and audit infrastructure for long-term AI governance.

    Resources: 1 security analyst (part-time), AI Governance Committee (quarterly meetings). Budget: $5,000–$30,000/year (tooling + committee time).

    Phase 5 starts during Phase 2 (not after Phase 4) because governance can't wait for the full platform to be ready.

    Step 5.1: Policy Framework

    Deploy an AI acceptable use policy that covers:

    • Approved tools and the process for requesting new ones
    • Data classification for AI usage (what data can go where)
    • Acceptable use guidelines per department
    • Monitoring and enforcement provisions
    • Incident response procedures
    • Training requirements

    See our Shadow AI Policy Template for Regulated Industries for a complete, adaptable policy document.

    Step 5.2: Monitoring Infrastructure

    Deploy monitoring that provides:

    • Usage dashboards: Who's using the internal platform, for what, and how often
    • External AI detection: Continued monitoring of traffic to external AI tools
    • Data leakage detection: Automated scanning for PII, PHI, and classified data in prompts
    • Anomaly detection: Unusual usage patterns (volume spikes, off-hours access, bulk data submission)
    • Audit logs: Complete record of all AI interactions for compliance and incident investigation
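    Volume-spike detection can start as simple as a z-score over daily per-user prompt counts; a naive baseline you can later replace with something seasonality-aware:

```python
from statistics import mean, stdev

def volume_anomalies(daily_counts, threshold=3.0):
    """Flag indices of days whose prompt volume exceeds
    mean + threshold * sample-stdev. A deliberately simple baseline:
    it ignores weekday patterns and trends."""
    mu, sigma = mean(daily_counts), stdev(daily_counts)
    return [i for i, c in enumerate(daily_counts)
            if sigma > 0 and (c - mu) > threshold * sigma]
```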

    Step 5.3: Regular Audits

    | Audit | Frequency | Focus |
    | --- | --- | --- |
    | Usage compliance | Monthly | Are employees using approved tools? Is external usage declining? |
    | Data classification adherence | Quarterly | Are prompts consistent with data tier policies? |
    | Model performance | Quarterly | Are fine-tuned models meeting accuracy and quality targets? |
    | Policy effectiveness | Semi-annually | Is the policy being followed? What needs updating? |
    | Regulatory alignment | Semi-annually | Have regulations changed? Does the policy need updating? |

    Step 5.4: Continuous Improvement

    The internal AI platform must keep pace with external tools. If ChatGPT releases a capability your employees need and your internal platform doesn't have it, external usage will increase. Budget for:

    • Monthly model updates (new base model releases, retraining on fresh data)
    • Quarterly capability additions (new features, new use cases, new departments)
    • Annual infrastructure scaling (more GPUs, more storage, better performance)

    The ROI Case

    The numbers that justify this investment:

    Cost of Doing Nothing

    | Risk Category | Estimated Annual Cost |
    | --- | --- |
    | Shadow AI insider risk losses (industry average) | $19.5M |
    | Regulatory fines (per incident, GDPR) | $100K–$20M |
    | Data breach investigation and response | $500K–$5M per incident |
    | IP theft and competitive exposure | Unquantifiable but significant |
    | Conservative estimate (one moderate incident/year) | $2M–$10M |

    Even if you discount the $19.5M industry average as inflated for your organization and assume just one moderate data leakage incident per year, the exposure is $2–10 million annually.

    Cost of the Migration

    | Phase | Budget Range | Typical |
    | --- | --- | --- |
    | Phase 1: Discovery | $5K–$15K | $10K |
    | Phase 2: Quick Wins | $10K–$30K | $20K |
    | Phase 3: Data Foundation | $15K–$50K | $30K |
    | Phase 4: Custom Models | $15K–$75K | $40K |
    | Phase 5: Governance (annual) | $5K–$30K/yr | $15K/yr |
    | Total (Year 1) | $50K–$200K | $115K |
    | Ongoing (Year 2+) | $20K–$80K/yr | $45K/yr |

    The Math

    At the conservative end: $2M in annual risk exposure versus $115K in migration cost. That's a 17:1 ratio.

    At the industry average: $19.5M in risk exposure versus $115K. That's a 170:1 ratio.

    Even accounting for the fact that migration doesn't eliminate 100% of risk — it reduces it by an estimated 80–95% — the ROI is overwhelming at any reasonable assumption.

    And this doesn't count the productivity gains. If your internal AI platform saves each knowledge worker 3 hours per week (a conservative estimate based on Phase 1 discovery data), and your average fully-loaded cost per knowledge worker is $80/hour, that's:

    • 500 knowledge workers × 3 hours × $80 × 48 weeks = $5.76 million in annual productivity gains
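    The combined first-year return, as a function you can rerun with your own inputs (every default here restates an assumption from the text above, not a measured result):

```python
def migration_roi(annual_risk_exposure, risk_reduction, year1_cost,
                  workers, hours_saved_per_week, loaded_rate, weeks=48):
    """First-year ROI ratio combining avoided risk and productivity gains.
    All inputs are planning assumptions; plug in your own estimates."""
    avoided_risk = annual_risk_exposure * risk_reduction
    productivity = workers * hours_saved_per_week * loaded_rate * weeks
    return (avoided_risk + productivity) / year1_cost

# Conservative case: $2M exposure, 80% risk reduction, $115K cost,
# 500 workers saving 3 hrs/week at $80/hr fully loaded.
print(round(migration_roi(2e6, 0.8, 115e3, 500, 3, 80)))  # 64
```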

    The internal AI platform doesn't just reduce risk. It generates measurable productivity value that external shadow AI was already partially delivering — just without the audit trail, the data protection, or the organizational control.

    Common Objections and Responses

    "We can't afford it." You can't afford not to. One data leakage incident involving customer PII costs more than the entire migration. And the productivity gains alone typically offset the investment within 6 months.

    "Our employees won't use an internal tool." They will if it's good enough. The key is Phase 2 — deploy quickly, get feedback, iterate. Employees prefer sanctioned tools when those tools meet their needs.

    "Open-source models aren't as good as GPT-4." For general knowledge, that's partially true. For your specific domain, a fine-tuned 70B model can outperform GPT-4 because it knows your business. This is the whole point of Phase 4.

    "We don't have ML expertise in-house." Phases 1 and 2 require IT operations skills, not ML expertise. Phases 3 and 4 require some ML knowledge, which can be provided by a platform vendor, a consultant, or one new hire. The skills needed are increasingly common.

    "This timeline is too long." Phase 2 delivers value in 8 weeks. You don't need to wait for Phase 4 to see results. The phased approach means you're reducing risk from week 5 onward.

    "What about just getting ChatGPT Enterprise?" ChatGPT Enterprise addresses some concerns (data not used for training, SOC 2 compliance, SSO). It doesn't address data residency, custom model training, offline availability, or full audit control. For lightly regulated industries, it may be sufficient. For healthcare, financial services, legal, defense, and other regulated environments, on-premise deployment remains necessary.

    Getting Started

    The first step is Phase 1: Discovery. You need to know the size and shape of the problem before you can solve it. Most organizations are surprised by what they find — both the scale of shadow AI usage and the genuine productivity value employees are getting from it.

    Don't approach this as a crackdown. Approach it as a migration. Your employees have already demonstrated that AI tools make them more productive. Your job is to give them better tools — tools that are faster, more capable for your domain, and don't leak company data to third parties.

    The playbook works because it aligns incentives. Employees get better AI tools. Security gets visibility and control. Leadership gets reduced risk and measurable productivity gains. Nobody has to lose for this to work.

    Start with discovery. Deploy a quick win. Build the foundation. Train the models. Govern the system. Within 24–36 weeks, the shadow AI problem isn't a problem anymore — it's a competitive advantage.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
