Author: Law Wen Feng | Published on: wenfeng.my Pillar: Data-LLM | Reading time: 10 minutes

---

Introduction

Every enterprise AI architect I speak with in Malaysia is asking the same question: "Should we pay for Azure OpenAI or run open-source models ourselves on Azure GPU VMs?"

It sounds straightforward, but the answer is anything but. The decision doesn't just live in a spreadsheet — it ripples through your data governance posture, your latency profile, your team's headcount requirements, and ultimately your ability to move fast when the market shifts.

In this article, I'll walk through a structured decision framework I've developed over the past 18 months working with Malaysian enterprises deploying LLMs on Azure. You'll get concrete cost estimates (in MYR, using approximate mid-2025 exchange rates because regional pricing matters), architecture patterns you can adapt, and a decision matrix that cuts through the FUD.

---

The Problem: Two Very Different Paths to the Same Destination

Let's start by defining the two options clearly.

Azure OpenAI Service is Microsoft's managed offering. You get API access to GPT-4o, GPT-4o-mini, the o-series reasoning models, and embeddings models. Microsoft handles hardware provisioning, patching, load balancing, and model updates. You consume it as a managed API with pay-per-token pricing.

Self-hosted open-source LLMs on Azure GPU VMs means you provision your own GPU instances (NC-series, ND-series), deploy an inference server (vLLM, TGI, llama.cpp), and run models like Llama 3, Mistral, Qwen, or Phi. You pay for compute by the hour regardless of usage, and you are responsible for the full operations stack.

These two paths produce the same output — text from a large language model — but the economics, operational burden, and architectural implications could not be more different.

---

Azure OpenAI Pricing (2025 Realities)

As of mid-2025, here are the approximate pricing tiers for Azure OpenAI in the Southeast Asia (Singapore) region, where Azure OpenAI is typically deployed for Malaysian customers:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input |
|---|---|---|---|
| GPT-4o | $2.50 / ~RM11 | $10.00 / ~RM44 | $1.25 / ~RM5.50 |
| GPT-4o-mini | $0.15 / ~RM0.66 | $0.60 / ~RM2.64 | $0.075 / ~RM0.33 |
| o3-mini | $1.10 / ~RM4.85 | $4.40 / ~RM19.40 | $0.55 / ~RM2.42 |
| text-embedding-3-small | $0.02 / ~RM0.09 | N/A | N/A |

At first glance, this looks reasonable. GPT-4o-mini at RM0.66 per million input tokens is cheap enough for almost any classification or extraction task. But here's the trap: GPT-4o itself is expensive at scale. If your application processes 10 million tokens per month in each direction on GPT-4o, you're looking at roughly RM550 per month (10M × RM11 input + 10M × RM44 output) just in API costs.

Scale that to production — 500 million tokens per month per direction — and you're at RM27,500 per month. At that point, you should be asking: "What else could I buy for RM27,500 a month?"

---

Self-Hosted Cost Reality

Let's look at the self-hosted path. I'll use A100-class GPU VMs (the workhorse for LLM inference) as the baseline, with a smaller T4 SKU and a larger 8x V100 SKU for comparison.

| VM SKU | GPU | vCPUs | RAM | Monthly Cost (pay-as-you-go, SG) | Monthly Cost (1-year reserved) |
|---|---|---|---|---|---|
| Standard_NC16as_T4_v3 | 1x T4 16GB | 16 | 110 GiB | ~RM7,800 | ~RM3,200 |
| Standard_NC48ads_A100_v4 | 2x A100 80GB | 48 | 440 GiB | ~RM30,700 | ~RM20,000 |
| Standard_ND40rs_v2 | 8x V100 32GB | 40 | 672 GiB | ~RM98,000 | ~RM48,000 |
Note: Pricing is approximate (Azure Retail Prices, SE Asia, Linux, mid-2025 list rates) and varies by region, OS, and offer type. The ND96amsr_A100_v4 (8x A100) is not available in the SE Asia region. Verify with the Azure Pricing Calculator.

Now, the key insight: a multi-GPU VM like the ND40rs_v2 (8x V100) or a multi-GPU A100 configuration can serve a 70B-parameter model like Llama 3.1-70B with vLLM and handle hundreds of thousands of requests per day. Say you run 10 million tokens per day in each direction through that VM (300 million per month each way). The equivalent Azure OpenAI GPT-4o bill would be about RM16,500 per month (300M × RM11 input + 300M × RM44 output). Your cost on a reserved ND40rs_v2 would be about RM48,000 per month.

Wait — that looks worse for self-hosted, right?

Not so fast. That same GPU VM can serve multiple models simultaneously. You could run Llama 3.1-70B for complex reasoning, a fine-tuned Qwen 2.5-32B for your domain-specific tasks, and a small embeddings model — all on the same hardware using vLLM's multi-LoRA or model multiplexing. The economics improve when you amortise compute across workloads.

The break-even analysis looks roughly like this (actual thresholds vary by GPU SKU, reservation term, and workload mix); a small calculator sketch follows the list:

- Below 200M tokens/month: Azure OpenAI (GPT-4o-mini) is cheaper. Don't bother self-hosting.

- 200M–500M tokens/month on small models (< 8B): Azure OpenAI still wins. The T4 VM is borderline but operational overhead argues against it.

- 500M–1B tokens/month on large models (70B+): Self-hosting starts to break even, especially with reserved instances and multiple workloads sharing the GPU.

- Above 1B tokens/month on any model size: Self-hosting wins decisively, assuming you have the team to run it.
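
To make these thresholds tangible, here is a back-of-envelope calculator. The RM rates and the reserved-GPU figure come from the tables above; the equal input/output split and the volume grid are my illustrative assumptions.

```python
# Back-of-envelope break-even sketch. Rates are the approximate RM figures
# from the tables above; the equal in/out split is an assumption.

GPT4O_IN_RM = 11.0        # ~RM per 1M input tokens (GPT-4o)
GPT4O_OUT_RM = 44.0       # ~RM per 1M output tokens (GPT-4o)
GPU_RESERVED_RM = 48_000  # ~RM/month, ND40rs_v2 on a 1-year reservation

def gpt4o_bill(tokens_m_per_direction: float) -> float:
    """Monthly GPT-4o API cost in RM, assuming equal input and output volume."""
    return tokens_m_per_direction * (GPT4O_IN_RM + GPT4O_OUT_RM)

for volume in (100, 300, 500, 1_000, 2_000):  # million tokens/month per direction
    api = gpt4o_bill(volume)
    winner = "self-hosted" if api > GPU_RESERVED_RM else "Azure OpenAI"
    print(f"{volume:>5}M/direction: API ~RM{api:>9,.0f} vs GPU ~RM{GPU_RESERVED_RM:,} -> {winner}")
```

Note the single-workload caveat: at 500M tokens per direction the reserved GPU still looks more expensive in isolation. It only wins once the same hardware also carries your fine-tuned and embeddings workloads.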

---

The Decision Matrix

Here is the structured framework I use with clients. It scores each option across nine criteria:

| Criterion | Weight | Azure OpenAI | Self-Hosted (open-source) |
|---|---|---|---|
| Unit economics at low volume | High | ✅ Wins | ❌ Paying for idle GPU |
| Unit economics at high volume | High | ❌ Expensive per token | ✅ Wins, amortised compute |
| Data residency & sovereignty | Critical | ✅ Microsoft compliance, but data leaves your network | ✅ Full control, data never leaves your Azure VNet |
| Model choice & customisation | Medium | ❌ Limited to Microsoft's catalog | ✅ Any HuggingFace model, fine-tuned variants |
| Latency & throughput control | Medium | ❌ Shared infra, rate limits, variable latency | ✅ Full control, predictable P50/P99 |
| Operational overhead | High | ✅ Zero ops, managed SLA | ❌ Need MLOps, monitoring, patching, GPU infra skills |
| Compliance & auditability | Critical | ✅ SOC 2, ISO, contractual (Microsoft liability) | ✅ Your infra, your audit (but your responsibility) |
| Scaling agility | Medium | ✅ Auto-scale via provisioned throughput | ❌ Must pre-provision, scale-out is slower |
| Fine-tuning capability | Low | ⚠️ Limited to Azure OpenAI fine-tuning (few models) | ✅ Full LoRA/QLoRA/full fine-tuning freedom |

How to use this matrix

Score each criterion from 0–3 for each option. Multiply by the weight factor (Critical=3, High=2, Medium=1, Low=0.5). Sum both sides. The higher total is your starting recommendation.

A real example from a Malaysian fintech client:

- Data sovereignty = Critical → Self-hosted +3

- Operational overhead = High → Azure OpenAI +2

- Volume (100M tokens/month) → Azure OpenAI +2

- Model customisation = Medium → Self-hosted +1

Result: Split deployment. Sensitive data workflows (customer PII processing) go self-hosted. General chat and summarisation go through Azure OpenAI. This "hybrid" pattern is where most enterprises land.
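
Here is a minimal sketch of that weighted scoring. The criteria mirror the bullets above, but the 0–3 scores are illustrative placeholders, not the client's full scorecard.

```python
# Weighted scoring sketch for the decision matrix. Scores (0-3) below are
# illustrative assumptions; weights follow the mapping described above.

WEIGHTS = {"Critical": 3, "High": 2, "Medium": 1, "Low": 0.5}

# (criterion, weight class, Azure OpenAI score, self-hosted score)
SCORECARD = [
    ("Data sovereignty",          "Critical", 1, 3),
    ("Operational overhead",      "High",     3, 1),
    ("Unit economics at 100M/mo", "High",     3, 1),
    ("Model customisation",       "Medium",   1, 3),
]

aoai = sum(WEIGHTS[w] * a for _, w, a, _ in SCORECARD)
self_hosted = sum(WEIGHTS[w] * s for _, w, _, s in SCORECARD)
print(f"Azure OpenAI: {aoai}  |  Self-hosted: {self_hosted}")
# A near-tie like this is itself the signal for a split deployment.
```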

---

Architecture Patterns for Production

Pattern 1: Pure Azure OpenAI (The "No-Ops" Pattern)

[Client App] → [API Gateway / APIM] → [Azure OpenAI] → [Azure AI Search (RAG)]

Best for: Teams without MLOps capability, POCs, low-to-medium volume, general-purpose chat.

Key considerations (a minimal client sketch follows the list):

- Use Azure API Management in front for throttling, caching, and key rotation

- Enable content filtering and abuse detection (may cause false positives — tune carefully)

- Provisioned Throughput Units (PTUs) give you reserved capacity at a discount (reserve minimum 1 month)

- Data processing stays in Azure boundary but passes through Microsoft's inference infrastructure
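
To make Pattern 1 concrete, here is a minimal client using the official openai Python SDK pointed at an APIM front door. The gateway URL, key, and deployment name are placeholders, and your APIM policy must forward the subscription key to the Azure OpenAI backend.

```python
# Minimal Pattern 1 client (sketch). Endpoint, key, and deployment name are
# placeholders; APIM must be configured to forward auth to Azure OpenAI.

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-apim-gateway.azure-api.net",  # APIM front door
    api_key="<apim-subscription-key>",
    api_version="2024-06-01",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # your Azure OpenAI *deployment* name, not the model ID
    messages=[{"role": "user", "content": "Summarise this policy clause: ..."}],
    max_tokens=300,
)
print(resp.choices[0].message.content)
```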

Pattern 2: Pure Self-Hosted (The "Full Control" Pattern)

[Client App] → [Azure Load Balancer] → [AKS Node Pool (GPU)] → [vLLM / TGI] → [Vector DB]
                                               ↑
                                    [Model Registry (ACR)]

Best for: High-volume workloads, custom fine-tuned models, strict data sovereignty, predictable latency requirements.

Key considerations (a client sketch against vLLM follows the list):

- Deploy on AKS with GPU node pools and cluster-autoscaler (cold start is real — budget 3-5 min for node provisioning)

- Use vLLM for inference serving — it supports PagedAttention, continuous batching, and tensor parallelism across GPUs

- Implement GPU sharing with KServe or Knative for multi-model serving

- Monitor GPU utilisation with Prometheus + Grafana dashboards

- Plan for spot instance fallback (Azure Spot VMs can cut costs 60-80% but have eviction risk)
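
A minimal client sketch for Pattern 2, assuming vLLM's OpenAI-compatible server is running behind the load balancer (for example via `vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8` on the GPU node pool). The internal hostname is a placeholder.

```python
# Pattern 2 client sketch against a self-hosted vLLM server. The service
# hostname is a placeholder for your AKS internal endpoint.

from openai import OpenAI

client = OpenAI(
    base_url="http://vllm.internal.example:8000/v1",  # placeholder service DNS
    api_key="unused",  # vLLM ignores the key unless you pass --api-key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Test prompt"}],
    max_tokens=50,
)
print(resp.choices[0].message.content)
```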

Pattern 3: Hybrid (The "Best of Both" Pattern)

[Client App] → [Router / Orchestrator]
                    ├── [Azure OpenAI] (GPT-4o for complex queries, GPT-4o-mini as managed fallback)
                    └── [Self-Hosted vLLM] (fine-tuned open model for simple, high-volume, and sensitive traffic)

Best for: Most production deployments. Route simple/high-volume tasks to self-hosted, complex/occasional tasks to GPT-4o, sensitive data to self-hosted.

Implementation strategy (a toy router sketch follows the list):

- Use a lightweight router (a simple Python service or an LLM-based router like Martian or Portkey)

- Self-host your small embedding model (BGE-M3 or Snowflake Arctic Embed) for RAG — this is almost always cheaper than Azure OpenAI embeddings at scale

- Reserve Azure OpenAI PTU only for your GPT-4o complex-task throughput

- Fine-tune a 7B–14B parameter model (Qwen 2.5, Llama 3.1, or Phi-3) on your domain data for the 80% case; route the remaining 20% to GPT-4o
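
A toy router tying the strategy together. The PII markers, complexity heuristic, and model names are all placeholder assumptions; production routers typically use a trained classifier or an LLM judge (Martian and Portkey, mentioned above, are managed options).

```python
# Toy hybrid router (sketch). Heuristics, endpoints, and model names are
# placeholder assumptions, not a production routing policy.

from openai import AzureOpenAI, OpenAI

azure = AzureOpenAI(azure_endpoint="https://your-aoai.openai.azure.com",
                    api_key="<key>", api_version="2024-06-01")
local = OpenAI(base_url="http://vllm.internal.example:8000/v1", api_key="unused")

PII_MARKERS = ("nric", "ic number", "policy no", "account number")

def is_complex(query: str) -> bool:
    # Crude stand-in for a real classifier: long or multi-question prompts.
    return len(query) > 500 or query.count("?") > 2

def answer(query: str) -> str:
    q = query.lower()
    if any(m in q for m in PII_MARKERS):
        client, model = local, "qwen2.5-14b-ft"   # sensitive: stays in the VNet
    elif is_complex(query):
        client, model = azure, "gpt-4o"           # occasional complex tier
    else:
        client, model = local, "qwen2.5-14b-ft"   # the fine-tuned 80% case
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        max_tokens=400,
    )
    return resp.choices[0].message.content
```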

---

Real-World Cost Comparison: Enterprise Chatbot in Malaysia

Let me walk through a concrete scenario. A Malaysian insurance company wants a customer-facing chatbot handling policy queries, claim status checks, and product recommendations.

Workload profile (the raw token arithmetic is sketched after the list):

- 50,000 conversations per day (~1.5M/month)

- Average 500 input + 150 output tokens per conversation

- 80% simple (policy lookups) + 20% complex (claim escalation, ambiguous queries)

- Data sovereignty: Must remain in Malaysia/Southeast Asia

- Latency target: P99 < 3 seconds
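
Before costing the options, the raw token volumes implied by this profile. Real bills also include RAG context, system prompts, and retries, which is why the estimates in the tables below run higher than these bare numbers.

```python
# Raw token volumes for the workload profile above. Real bills also include
# RAG context, system prompts, and retries, so treat these as lower bounds.

conversations = 50_000 * 30            # ~1.5M conversations/month
input_tokens = conversations * 500     # 750M input tokens/month
output_tokens = conversations * 150    # 225M output tokens/month
print(f"{input_tokens/1e6:.0f}M in / {output_tokens/1e6:.0f}M out per month")
```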

Option A: Pure Azure OpenAI

| Item | Monthly Cost (RM) |
|---|---|
| GPT-4o-mini (95% of tokens) | ~RM 2,400 |
| GPT-4o (5% of tokens) | ~RM 3,800 |
| Embeddings (text-embedding-3-small) | ~RM 500 |
| Azure AI Search (S2 tier) | ~RM 1,600 |
| API Management | ~RM 800 |
| Total | ~RM 9,100/month |

Option B: Self-Hosted (Hybrid Strategy)

| Item | Monthly Cost (RM) |
|---|---|
| NC48ads_A100_v4 (2x A100, 1-yr reserved, 80% of inference) | ~RM 20,000 |
| NC16as_T4_v3 (1x T4 spot, embeddings & re-ranking) | ~RM 1,100 |
| Azure Kubernetes Service | ~RM 800 |
| Azure Container Registry + Storage | ~RM 300 |
| MLOps tooling (MLflow, custom monitoring) | ~RM 600 |
| Azure OpenAI GPT-4o (only 5% complex queries) | ~RM 1,900 |
| Total | ~RM 24,700/month |

The turning point

At this volume, the pure Azure OpenAI path is significantly cheaper on raw compute. The hybrid path costs more but gives the insurer:

1. Full data residency for policy documents (no data sent to external inference endpoints)

2. A fine-tuned model that understands Malaysian insurance vernacular (instead of generic English)

3. Predictable latency with no rate-limit surprises during peak hours

4. The ability to swap models without vendor negotiation

If this company scales to 5x the volume (250K conversations/day), the Azure OpenAI-only option balloons to roughly RM38,000/month, while the hybrid architecture scales more gracefully: the reserved GPU cost stays flat until the hardware saturates, after which you add capacity in discrete increments. The economics increasingly favour self-hosting at very high volumes, but the crossover point depends on actual GPU pricing and workload mix.

---

Pitfalls to Avoid

1. Underestimating the MLOps Tax

Self-hosting requires someone who understands GPU operators, CUDA driver compatibility, vLLM configuration, HPA tuning, and model storage optimisation. If your team doesn't have this expertise, budget 3-6 months to build it — or hire for it. The "free" open-source model can cost you RM200,000+ a year in MLOps engineering salary before it even answers a single query reliably in production.

2. Ignoring Token Caching Economics

Azure OpenAI charges less for cached input tokens (50% discount). Self-hosted solutions like vLLM also support KV-cache-based prefix caching. Many workloads (conversations with a shared system prompt or RAG context) see 40-60% cache hit rates. Factor caching into your cost model — it changes the break-even point significantly.
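
The blended input rate is easy to model. A sketch, using the GPT-4o rates from the pricing table; the hit rate is an assumption you should measure on your own traffic:

```python
# Blended input cost under prefix caching (sketch). Rates from the pricing
# table above; the hit rate is an assumption to be measured, not a given.

def blended_input_rate(base_rm: float, cached_rm: float, hit_rate: float) -> float:
    """RM per 1M input tokens, given a prefix-cache hit rate in [0, 1]."""
    return hit_rate * cached_rm + (1.0 - hit_rate) * base_rm

# GPT-4o: RM11 base, RM5.50 cached. A 50% hit rate yields RM8.25 per 1M
# input tokens, i.e. a 25% cut on the input side of the bill.
print(blended_input_rate(11.0, 5.50, 0.5))  # -> 8.25
```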

3. Over-Provisioning for Peak Load

One mistake I see repeatedly: teams provision GPU VMs for peak traffic and pay for idle GPUs 60-70% of the time. Use Azure's cluster-autoscaler with GPU node pools, configure pod-level HPA based on request queue depth, and always set a minimum replica of 1 with a generous cooldown. Consider Azure Spot GPU VMs for stateless inference workloads.

4. Treating All Models as Interchangeable

GPT-4o is genuinely better than open models at complex reasoning, instruction following, and multilingual nuance (especially Bahasa Malaysia and Mandarin mixed contexts which we deal with daily in Malaysia). Don't sacrifice accuracy on the altar of cost savings. Route with intelligence — let the difficult queries hit GPT-4o and serve the rest from your fine-tuned open model.

5. Skipping the Caching Layer

A Redis-backed response cache (with semantic similarity matching or exact match) can eliminate 20-40% of your inference calls on frequently-asked questions. This is the single highest-ROI optimisation you can make, and it works for both Azure OpenAI and self-hosted paths. Implement it from day one.
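
A minimal exact-match cache sketch with redis-py. The key scheme and TTL are assumptions, and a semantic cache would key on an embedding rather than a hash of the query text.

```python
# Exact-match response cache sketch (redis-py). Key scheme and TTL are
# assumptions; a semantic cache would key on an embedding instead.

import hashlib

import redis

r = redis.Redis(host="localhost", port=6379)

def cached_completion(query: str, generate, ttl_s: int = 3600) -> str:
    key = "llm:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit.decode()       # cache hit: no inference call at all
    answer = generate(query)      # any backend: Azure OpenAI or self-hosted
    r.setex(key, ttl_s, answer)   # expire after TTL to avoid stale answers
    return answer
```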

---

Key Takeaways

1. There is no universal winner. The right choice depends on your volume, data sensitivity, team capability, and latency requirements. Use the decision matrix in this article — don't guess.

2. The hybrid pattern wins in production. Route simple and high-volume tasks through self-hosted open models; reserve managed Azure OpenAI for complex reasoning, fallback, and multilingual edge cases. This gives you cost control without sacrificing capability.

3. Self-hosting pays off at scale, but only if you have the team. Below ~500M tokens per month, the operational overhead rarely justifies the savings. Above that, the economics swing in favour of self-hosted — especially with reserved instances and multi-model GPU sharing.

4. Data sovereignty is a legitimate driver, not an excuse. For Malaysian financial institutions, insurers, and government-linked companies subject to Bank Negara or PIDM regulations, self-hosting may be mandatory. Build the compliance case into your framework from the start.

5. Optimise the caching layer before optimising the model. A response cache with 30% hit rate effectively gives you a 30% discount on inference costs — and it takes a weekend to implement, not a month.

---

The LLM deployment landscape on Azure is rich with options, and the gap between managed and self-hosted paths is narrowing every quarter. Microsoft continues adding open models to Azure AI Foundry (Llama, Mistral, Phi, Cohere), while the open-source ecosystem improves inference efficiency through techniques like speculative decoding and 4-bit quantisation.

The decision framework above will serve you well today. But revisit it every six months — because this market moves faster than any architecture decision you'll ever make.

Have questions about your specific Azure LLM deployment? I work with Malaysian enterprises on data and AI architecture — from model selection and GPU sizing to hybrid inference architectures. Connect on LinkedIn or drop me a message.

---

Law Wen Feng is a Principal Solution Architect based in Malaysia, specialising in cloud-native data platforms, AI/ML infrastructure on Azure, and enterprise-scale LLM deployment strategies.