
From Shadow AI to Sanctioned AI: The Enterprise Migration Playbook
The complete journey from 'employees are using ChatGPT with company data' to 'we have sanctioned, auditable, on-premise AI tools.' A phased playbook with timelines, resource estimates, and ROI calculations.
Your employees are using ChatGPT with company data. You know it. They know you know it. And yet the problem persists because knowing about shadow AI and fixing it are two different things.
This is the migration playbook. It covers the complete journey from uncontrolled external AI usage to sanctioned, auditable, on-premise AI tools — broken into five phases with specific timelines, resource requirements, and decision points. It's designed for organizations that have moved past the "should we do something?" stage and into the "how do we actually do this?" stage.
The total timeline is 24–36 weeks. The total cost ranges from $50,000 to $200,000 depending on scale. The alternative — doing nothing — costs an average of $19.5 million per year in shadow AI-related insider risk losses.
The Migration Timeline at a Glance
| Phase | Timeline | Focus | Key Deliverable |
|---|---|---|---|
| Phase 1: Discovery | Weeks 1–4 | Audit, quantify, prioritize | Shadow AI assessment report |
| Phase 2: Quick Wins | Weeks 5–12 | Deploy basic internal alternative | Internal AI chatbot live |
| Phase 3: Data Foundation | Weeks 9–24 | Build data preparation pipeline | Enterprise data ready for AI |
| Phase 4: Custom Models | Weeks 17–32 | Fine-tune domain-specific models | Production custom models |
| Phase 5: Governance | Week 12+ (ongoing) | Monitoring, policy, audits | Mature AI governance program |
Note: the phases overlap by design. Data preparation starts while the Phase 2 chatbot is still rolling out, governance starts long before the custom models ship, and you don't wait for perfect data to begin fine-tuning.
Phase 1: Discovery (Weeks 1–4)
Objective: Understand what shadow AI exists in your organization, quantify the risk, and identify the highest-value use cases that employees are solving with external tools.
Resources: 1 security analyst, 1 IT operations lead, executive sponsor. Budget: $5,000–$15,000 (monitoring tools and analysis time).
Step 1.1: Audit Current Usage
Deploy network monitoring to identify traffic to known AI tool domains. The major targets:
- LLM providers: openai.com, anthropic.com, gemini.google.com, chat.mistral.ai
- Code assistants: copilot.github.com, cursor.sh, codeium.com
- Embedded AI features: notion.so/ai, docs.google.com (Gemini features), bing.com/chat
- AI aggregators: poe.com, huggingface.co, together.ai
Most enterprise firewalls and proxy servers can generate domain-level traffic reports without new tooling. You're looking for four signals (a log-parsing sketch follows the list):
- Number of unique employees accessing AI tools
- Volume of data transmitted (outbound request sizes)
- Frequency of usage (daily, weekly, one-time)
- Departments with highest usage
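If your proxy only produces raw logs, a few lines of scripting get you these numbers. A minimal sketch, assuming a CSV export with user, domain, bytes_out, and date columns (adjust the field names to whatever your firewall or secure web gateway actually emits):

```python
# Aggregate AI-tool traffic from a hypothetical proxy log export (CSV).
# Column names are assumptions about your export format -- adjust as needed.
import csv
from collections import Counter, defaultdict

AI_DOMAINS = {"openai.com", "anthropic.com", "gemini.google.com",
              "chat.mistral.ai", "poe.com", "huggingface.co"}

users = set()
bytes_out = Counter()           # outbound volume per AI domain
days_active = defaultdict(set)  # usage frequency per user

with open("proxy_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        domain = row["domain"].lower()
        if any(domain.endswith(d) for d in AI_DOMAINS):
            users.add(row["user"])
            bytes_out[domain] += int(row["bytes_out"])
            days_active[row["user"]].add(row["date"])

print(f"Unique employees using AI tools: {len(users)}")
print(f"Top domains by outbound volume: {bytes_out.most_common(5)}")
if days_active:
    avg_days = sum(len(d) for d in days_active.values()) / len(days_active)
    print(f"Average active days per user: {avg_days:.1f}")
```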
Step 1.2: Quantify Risk
Translate usage data into risk metrics:
| Risk Factor | How to Measure | Benchmark |
|---|---|---|
| Exposure breadth | % of employees using external AI | Industry average: 77% |
| Data sensitivity | Sample outbound prompts for PII/PHI/IP | Industry average: 1.6% contain policy violations |
| Account visibility | % using corporate vs. personal accounts | Industry average: 82% personal accounts |
| Tool diversity | Number of distinct AI tools in use | Typical enterprise: 15–40 distinct tools |
| Volume | Average prompts per user per day | Typical: 8–12 per active user |
Calculate your estimated annual violation count: (active users) × (prompts/day) × (1.6%) × (220 working days). For a 1,000-person company with 60% active AI users: 600 × 10 × 0.016 × 220 = 21,120 estimated annual violations.
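The same formula is easy to script for per-department sensitivity analysis. A minimal sketch (the function name and defaults are ours, not a standard tool):

```python
# Reproduces the worked violation estimate above; vary the inputs per department.
def estimated_annual_violations(active_users: int,
                                prompts_per_day: float,
                                violation_rate: float = 0.016,
                                working_days: int = 220) -> float:
    """(active users) x (prompts/day) x (violation rate) x (working days)."""
    return active_users * prompts_per_day * violation_rate * working_days

print(estimated_annual_violations(600, 10))  # -> 21120.0, matching the example above
```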
Step 1.3: Identify High-Value Use Cases
This is the most important discovery step and the one most organizations skip. Survey employees (anonymously) to understand what they're using AI for:
- What tasks do you use AI tools for?
- How much time does it save you per week?
- What data do you typically provide to the AI tool?
- If we provided an internal AI tool, what capabilities would it need?
Common findings:
| Use Case | Typical Departments | Time Saved | Data Sensitivity |
|---|---|---|---|
| Writing and editing | All | 3–5 hrs/week | Low to Medium |
| Code generation and debugging | Engineering | 5–10 hrs/week | High |
| Document summarization | Legal, Finance, Ops | 2–4 hrs/week | High |
| Data analysis | Finance, Marketing, Ops | 3–6 hrs/week | Medium to High |
| Research | All | 2–3 hrs/week | Low |
| Email drafting | All | 1–2 hrs/week | Low to Medium |
The discovery phase output is a Shadow AI Assessment Report that includes: usage metrics, risk quantification, top use cases by department, and a prioritized list of capabilities the internal platform must support.
Phase 2: Quick Wins (Weeks 5–12)
Objective: Deploy a basic internal AI chatbot that gives employees an immediate alternative to external tools. This reduces data leakage while you build the full solution.
Resources: 1 ML/DevOps engineer, 1 system administrator. Budget: $10,000–$30,000 (hardware + setup).
Step 2.1: Deploy Ollama + Open WebUI
The fastest path to a functional internal AI chatbot:
Hardware requirements (minimum):
- 1 server with an NVIDIA GPU (RTX 4090 for small teams, A100 for 100+ users)
- 32GB+ RAM, 500GB+ SSD storage
- Internal network connectivity (no internet access required for inference)
Software stack:
- Ollama for model serving
- Open WebUI for the user interface
- NGINX for load balancing (if multiple GPUs)
- LDAP/SSO integration for authentication
Model selection for Phase 2:
| Model | Size | Good For | Limitations |
|---|---|---|---|
| Llama 3.3 70B | 40GB VRAM | General tasks, writing, analysis | Slower on consumer GPUs |
| Qwen 2.5 32B | 20GB VRAM | Code, multilingual, analysis | Less conversational polish |
| Mistral Small 24B | 14GB VRAM | Fast general usage | Less capable on complex reasoning |
| DeepSeek-R1 Distill 32B | 20GB VRAM | Reasoning, math, analysis | Slower (chain-of-thought) |
Start with one general-purpose model (Llama 3.3 or Qwen 2.5) and expand based on employee feedback.
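Once a model is pulled, a one-prompt smoke test confirms the stack is serving before you announce anything. A sketch against Ollama's local REST API, assuming the default port and that llama3.3 has already been pulled (verify the endpoint schema against your installed Ollama version):

```python
# One-prompt smoke test against Ollama's documented /api/chat endpoint.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.3",
        "messages": [{"role": "user",
                      "content": "Summarize the benefits of an internal AI platform in one sentence."}],
        "stream": False,  # return a single JSON response instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```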
Step 2.2: Announce and Migrate
The announcement matters as much as the technology. Frame it as:
- "We're providing a better tool" — not "we're blocking the tools you like"
- Demonstrate that the internal tool handles the top use cases identified in Phase 1
- Provide migration guides: "If you were using ChatGPT for X, here's how to do X with the internal tool"
- Offer drop-in training sessions by department
Step 2.3: Measure Adoption
Track weekly:
- Internal platform: unique users, prompts per day, satisfaction scores
- External AI: traffic volume (should be declining)
- Support tickets: what's missing, what's not working
Expected result by end of Phase 2: 40–60% reduction in external AI tool usage. The remaining usage will be for capabilities the basic platform doesn't yet provide (code assistance, document upload, specialized tasks). That's what Phases 3 and 4 address.
Phase 3: Data Foundation (Weeks 9–24)
Objective: Build the data preparation pipeline that transforms your enterprise knowledge into training data for custom AI models. This is the foundation for Phase 4.
Resources: 1–2 data engineers, 1 domain expert (part-time per department). Budget: $15,000–$50,000 (tooling + engineering time).
Why This Phase Exists
Generic open-source models (deployed in Phase 2) are good at general tasks. They know nothing about your specific products, processes, terminology, customers, or domain. For an internal AI platform to outperform ChatGPT for your employees, it needs to know your business.
That knowledge comes from your enterprise data. But enterprise data is messy: scattered across file shares, wikis, Slack channels, email archives, databases, and document management systems. Before you can fine-tune a model, you need to extract, clean, structure, and validate that data.
Step 3.1: Identify Data Sources
Map the knowledge repositories across your organization:
| Source Type | Examples | Typical Volume | Extraction Complexity |
|---|---|---|---|
| Documents | PDFs, Word files, presentations | 10K–1M files | Medium (OCR, layout parsing) |
| Knowledge bases | Confluence, Notion, SharePoint | 1K–100K pages | Low (API extraction) |
| Communications | Slack, Teams, email archives | High volume | High (noise filtering, privacy) |
| Databases | CRM, ERP, ticketing systems | Structured data | Low (SQL queries) |
| Code repositories | Git repos, documentation | Varies | Low (file system access) |
| Specialized systems | EMR (healthcare), case management (legal) | Varies | High (proprietary formats) |
Step 3.2: Extract and Process
Build an extraction pipeline that handles your specific data sources. The typical pipeline:
- Extract: Pull raw content from source systems. For documents, this means OCR and layout parsing (tools like Docling, Unstructured.io, or Apache Tika). For structured data, this means SQL queries and API calls.
- Clean: Remove duplicates, boilerplate, headers/footers, navigation elements, and other noise. For communications data, filter out small talk, social messages, and non-work content.
- Chunk: Break documents into semantically meaningful chunks (paragraphs, sections, Q&A pairs). Chunk size depends on the intended use: RAG retrieval works best with 200–500 token chunks; fine-tuning works best with complete conversation or document examples. (A chunking sketch follows this list.)
- Structure: Convert cleaned content into training format. For fine-tuning: instruction/response pairs. For RAG: indexed document chunks with metadata.
- Validate: Human review of a sample (5–10%) to verify quality, accuracy, and absence of sensitive data that shouldn't be in the training set.
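As referenced in the Chunk step, here is a deliberately naive chunker that approximates token counts with word counts. It is illustrative only; production pipelines should use a real tokenizer and the structure-aware output of tools like Docling or Unstructured:

```python
# Naive overlapping chunker for RAG-sized chunks (~200-500 tokens).
# Word count stands in for token count; swap in a real tokenizer for production.
def chunk_text(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    words = text.split()
    step = max_words - overlap  # slide forward, keeping some context overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + max_words]
        if chunk:
            chunks.append(" ".join(chunk))
    return chunks

# "extracted_policy.txt" is a placeholder for output of the Extract/Clean steps.
doc = open("extracted_policy.txt").read()
for i, c in enumerate(chunk_text(doc)):
    print(i, len(c.split()), c[:60])
```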
Step 3.3: Build the Data Quality Pipeline
Data quality is the single biggest determinant of fine-tuned model performance. Bad data in, bad model out. Establish quality checks:
- Accuracy: Is the information in the training data correct and current?
- Relevance: Does the data represent the knowledge employees actually need?
- Completeness: Are there gaps in topic coverage?
- Consistency: Does the data contain contradictions?
- Privacy: Has all PII/PHI been removed or appropriately handled?
Budget 40–60% of Phase 3 time for data quality. Teams that rush data preparation to get to fine-tuning consistently end up with worse models than teams that invest more time in data quality and less in model tuning.
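For the privacy check in particular, even a crude automated gate catches obvious leaks before human review. An illustrative sketch (real deployments should use a dedicated detector such as Microsoft Presidio; these regexes are intentionally simple and will miss plenty):

```python
# Toy privacy gate: flag training examples containing obvious PII patterns.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def pii_flags(text: str) -> list[str]:
    """Return the names of all PII patterns found in a training example."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

sample = "Contact jane.doe@example.com, SSN 123-45-6789."
print(pii_flags(sample))  # ['email', 'us_ssn']
```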
Step 3.4: Establish an Ongoing Data Pipeline
Enterprise knowledge changes constantly. New products launch, procedures update, regulations change. The data pipeline must be continuous, not one-time:
- Scheduled extraction from source systems (weekly or monthly)
- Automated quality checks
- Human review queue for flagged content
- Version control for training datasets
- Documentation of data lineage (where each piece of training data came from)
Phase 4: Custom Models (Weeks 17–32)
Objective: Fine-tune domain-specific models on your enterprise data that outperform generic models for your specific use cases.
Resources: 1 ML engineer, domain experts (part-time). Budget: $15,000–$75,000 (compute + tooling).
Why Fine-Tune?
A common question: "Why not just use RAG (retrieval-augmented generation) with the base model and skip fine-tuning?"
RAG and fine-tuning solve different problems:
| Capability | RAG Alone | Fine-Tuning Alone | Combined |
|---|---|---|---|
| Access to current information | Yes | No (static at training time) | Yes |
| Domain-specific terminology and style | Partially | Yes | Yes |
| Following organizational processes | No | Yes | Yes |
| Reducing hallucination on domain topics | Partially | Yes | Yes |
| Handling novel questions | Yes | Limited to training distribution | Yes |
The best enterprise AI systems use both: fine-tuned models for domain expertise and style, plus RAG for current information and source citations. Phase 4 covers fine-tuning; RAG can be layered on during or after this phase.
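To make the combination concrete, here is a minimal RAG sketch layered on the Phase 2 Ollama deployment: embed your Phase 3 chunks, retrieve the best match, and prepend it to the prompt. Endpoint names follow Ollama's documented API (verify against your installed version), and an embedding model such as nomic-embed-text must be pulled first:

```python
# Minimal RAG loop over a fine-tuned or base model served by Ollama.
import math
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text}, timeout=60)
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Placeholder chunks standing in for the indexed Phase 3 output.
chunks = ["Travel expenses over $500 require VP approval.",
          "Quarterly reviews happen in the first week of each quarter."]
index = [(c, embed(c)) for c in chunks]  # embed once, reuse per query

question = "Who approves a $900 flight?"
q_vec = embed(question)
best = max(index, key=lambda item: cosine(q_vec, item[1]))[0]  # top-1 retrieval

r = requests.post(f"{OLLAMA}/api/chat", json={
    "model": "llama3.3",
    "messages": [{"role": "user",
                  "content": f"Answer using this context:\n{best}\n\nQuestion: {question}"}],
    "stream": False,
}, timeout=120)
print(r.json()["message"]["content"])
```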
Step 4.1: Select Base Models for Fine-Tuning
Choose base models based on your primary use cases:
| Use Case | Recommended Base Model | Fine-Tuning Method | Typical Training Time |
|---|---|---|---|
| General enterprise assistant | Llama 3.3 70B | QLoRA | 4–8 hours on 1× A100 |
| Code assistance | Qwen 2.5 Coder 32B | QLoRA | 2–4 hours on 1× A100 |
| Document analysis | Llama 3.1 8B | Full fine-tune or LoRA | 1–2 hours on 1× A100 |
| Specialized domain (legal, medical) | Llama 3.3 70B | QLoRA | 4–8 hours on 1× A100 |
QLoRA (Quantized Low-Rank Adaptation) is the standard method for enterprise fine-tuning: it requires less GPU memory than full fine-tuning while achieving comparable results for most use cases.
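A minimal QLoRA sketch using the Hugging Face stack (transformers, peft, trl). Argument names shift between library versions, and the dataset path and hyperparameters here are placeholders, so treat this as a starting shape rather than a pinned recipe:

```python
# Minimal QLoRA sketch with Hugging Face transformers + peft + trl.
# "train.jsonl" is a placeholder for the Phase 3 output, one example per line
# in a format trl accepts (e.g. a "text" or "messages" field).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # gated: requires license acceptance

bnb = BitsAndBytesConfig(              # 4-bit quantization: the "Q" in QLoRA
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

lora = LoraConfig(                     # low-rank adapters: the "LoRA"
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora,
    args=SFTConfig(output_dir="ft-out", num_train_epochs=2,
                   per_device_train_batch_size=1, gradient_accumulation_steps=8),
)
trainer.train()
trainer.save_model("ft-out/final")     # adapter weights to merge or serve
```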
Step 4.2: Fine-Tune and Evaluate
The fine-tuning cycle:
- Prepare training data: Convert Phase 3 outputs into the model's expected format (typically instruction/response pairs in JSONL)
- Configure training: Set hyperparameters (learning rate, epochs, LoRA rank). Start with established defaults and adjust based on evaluation results.
- Train: Run the fine-tuning job. Monitor loss curves for convergence.
- Evaluate: Test the fine-tuned model against a held-out evaluation set. Measure:
  - Accuracy: Does the model give correct, domain-appropriate answers?
  - Style: Does it match your organization's tone and terminology?
  - Safety: Does it refuse to provide information it shouldn't?
  - Comparison: Side-by-side evaluation against the base model and against ChatGPT/Claude for the same prompts
- Iterate: If evaluation results are below target, diagnose the issue (usually data quality) and retrain.
Step 4.3: Deploy to Production
Replace or augment the Phase 2 base models with fine-tuned models:
- Deploy fine-tuned models alongside base models (let employees choose)
- A/B test: route 50% of requests to the fine-tuned model and measure satisfaction (a routing sketch follows this list)
- Collect feedback: thumbs up/down on responses, with optional written feedback
- Plan for regular retraining (quarterly or when significant new data is available)
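For the A/B split, deterministic hashing keeps each user in a stable bucket without storing assignments anywhere. A sketch (the experiment name and bucket labels are placeholders):

```python
# Deterministic 50/50 assignment: the same user always lands in the same
# bucket, so satisfaction scores can be compared cleanly across arms.
import hashlib

def ab_bucket(user_id: str, experiment: str = "ft-vs-base") -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "fine-tuned" if int(digest, 16) % 2 == 0 else "base"

print(ab_bucket("jdoe"))  # stable across calls and across servers
```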
Expected result by end of Phase 4: Internal AI platform handles 80–90% of the use cases employees were previously solving with external tools, with equal or better quality for domain-specific tasks. External AI tool usage drops to under 10% — mostly edge cases and personal preference for general research.
Phase 5: Governance (Week 12+, Ongoing)
Objective: Establish the monitoring, policy, and audit infrastructure for long-term AI governance.
Resources: 1 security analyst (part-time), AI Governance Committee (quarterly meetings). Budget: $5,000–$30,000/year (tooling + committee time).
Phase 5 starts during Phase 2 (not after Phase 4) because governance can't wait for the full platform to be ready.
Step 5.1: Policy Framework
Deploy an AI acceptable use policy that covers:
- Approved tools and the process for requesting new ones
- Data classification for AI usage (what data can go where)
- Acceptable use guidelines per department
- Monitoring and enforcement provisions
- Incident response procedures
- Training requirements
See our Shadow AI Policy Template for Regulated Industries for a complete, adaptable policy document.
Step 5.2: Monitoring Infrastructure
Deploy monitoring that provides the following (a toy anomaly-detection sketch comes after the list):
- Usage dashboards: Who's using the internal platform, for what, and how often
- External AI detection: Continued monitoring of traffic to external AI tools
- Data leakage detection: Automated scanning for PII, PHI, and classified data in prompts
- Anomaly detection: Unusual usage patterns (volume spikes, off-hours access, bulk data submission)
- Audit logs: Complete record of all AI interactions for compliance and incident investigation
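Anomaly detection doesn't have to start with heavyweight UEBA tooling. A toy sketch that flags users whose daily prompt volume spikes against their own baseline (the data shape and z-score threshold are assumptions, not a product recommendation):

```python
# Flag users whose latest daily prompt count is a volume-spike outlier
# relative to their own recent history.
import statistics

def volume_anomalies(daily_counts: dict[str, list[int]],
                     z_threshold: float = 3.0) -> list[str]:
    """daily_counts maps user -> prompts per day; last entry is 'today'."""
    flagged = []
    for user, counts in daily_counts.items():
        if len(counts) < 8:
            continue  # need a baseline week before judging anyone
        history, today = counts[:-1], counts[-1]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1.0  # avoid divide-by-zero
        if (today - mean) / stdev > z_threshold:
            flagged.append(user)
    return flagged

print(volume_anomalies({"alice": [12, 9, 11, 10, 13, 9, 10, 95]}))  # ['alice']
```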
Step 5.3: Regular Audits
| Audit | Frequency | Focus |
|---|---|---|
| Usage compliance | Monthly | Are employees using approved tools? Is external usage declining? |
| Data classification adherence | Quarterly | Are prompts consistent with data tier policies? |
| Model performance | Quarterly | Are fine-tuned models meeting accuracy and quality targets? |
| Policy effectiveness | Semi-annually | Is the policy being followed? What needs updating? |
| Regulatory alignment | Semi-annually | Have regulations changed? Does the policy need updating? |
Step 5.4: Continuous Improvement
The internal AI platform must keep pace with external tools. If ChatGPT releases a capability your employees need and your internal platform doesn't have it, external usage will increase. Budget for:
- Monthly model updates (new base model releases, retraining on fresh data)
- Quarterly capability additions (new features, new use cases, new departments)
- Annual infrastructure scaling (more GPUs, more storage, better performance)
The ROI Case
The numbers that justify this investment:
Cost of Doing Nothing
| Risk Category | Estimated Annual Cost |
|---|---|
| Shadow AI insider risk losses (industry average) | $19.5M |
| Regulatory fines (per incident, GDPR) | $100K–$20M |
| Data breach investigation and response | $500K–$5M per incident |
| IP theft and competitive exposure | Unquantifiable but significant |
| Conservative estimate (one moderate incident/year) | $2M–$10M |
Even if you discount the $19.5M industry average as inflated for your organization and assume just one moderate data leakage incident per year, the exposure is $2–10 million annually.
Cost of the Migration
| Phase | Budget Range | Typical |
|---|---|---|
| Phase 1: Discovery | $5K–$15K | $10K |
| Phase 2: Quick Wins | $10K–$30K | $20K |
| Phase 3: Data Foundation | $15K–$50K | $30K |
| Phase 4: Custom Models | $15K–$75K | $40K |
| Phase 5: Governance (annual) | $5K–$30K/yr | $15K/yr |
| Total (Year 1) | $50K–$200K | $115K |
| Ongoing (Year 2+) | $20K–$80K/yr | $45K/yr |
The Math
At the conservative end: $2M in annual risk exposure versus $115K in migration cost. That's a 17:1 ratio.
At the industry average: $19.5M in risk exposure versus $115K. That's a 170:1 ratio.
Even accounting for the fact that migration doesn't eliminate 100% of risk — it reduces it by an estimated 80–95% — the ROI is overwhelming at any reasonable assumption.
And this doesn't count the productivity gains. If your internal AI platform saves each knowledge worker 3 hours per week (a conservative estimate based on Phase 1 discovery data), and your average fully-loaded cost per knowledge worker is $80/hour, that's:
- 500 knowledge workers × 3 hours × $80 × 48 weeks = $5.76 million in annual productivity gains
The internal AI platform doesn't just reduce risk. It generates measurable productivity value that external shadow AI was already partially delivering — just without the audit trail, the data protection, or the organizational control.
Common Objections and Responses
"We can't afford it." You can't afford not to. One data leakage incident involving customer PII costs more than the entire migration. And the productivity gains alone typically offset the investment within 6 months.
"Our employees won't use an internal tool." They will if it's good enough. The key is Phase 2 — deploy quickly, get feedback, iterate. Employees prefer sanctioned tools when those tools meet their needs.
"Open-source models aren't as good as GPT-4." For general knowledge, that's partially true. For your specific domain, a fine-tuned 70B model outperforms GPT-4 because it knows your business. This is the whole point of Phase 4.
"We don't have ML expertise in-house." Phases 1 and 2 require IT operations skills, not ML expertise. Phases 3 and 4 require some ML knowledge, which can be provided by a platform vendor, a consultant, or one new hire. The skills needed are increasingly common.
"This timeline is too long." Phase 2 delivers value in 8 weeks. You don't need to wait for Phase 4 to see results. The phased approach means you're reducing risk from week 5 onward.
"What about just getting ChatGPT Enterprise?" ChatGPT Enterprise addresses some concerns (data not used for training, SOC 2 compliance, SSO). It doesn't address data residency, custom model training, offline availability, or full audit control. For lightly regulated industries, it may be sufficient. For healthcare, financial services, legal, defense, and other regulated environments, on-premise deployment remains necessary.
Getting Started
The first step is Phase 1: Discovery. You need to know the size and shape of the problem before you can solve it. Most organizations are surprised by what they find — both the scale of shadow AI usage and the genuine productivity value employees are getting from it.
Don't approach this as a crackdown. Approach it as a migration. Your employees have already demonstrated that AI tools make them more productive. Your job is to give them better tools — tools that are faster, more capable for your domain, and don't leak company data to third parties.
The playbook works because it aligns incentives. Employees get better AI tools. Security gets visibility and control. Leadership gets reduced risk and measurable productivity gains. Nobody has to lose for this to work.
Start with discovery. Deploy a quick win. Build the foundation. Train the models. Govern the system. Twenty-four to thirty-six weeks from now, the shadow AI problem isn't a problem anymore. It's a competitive advantage.
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.