Introduction: Cloud-Native Done Right
Building microservices is easy. Building production-grade microservices that are resilient, observable, secure, and cost-efficient is hard. This deep dive assembles a complete cloud-native platform on AKS, using Dapr to abstract away infrastructure complexity, KEDA for intelligent autoscaling, Azure Service Bus for reliable async messaging, and Flux for GitOps-driven deployments. The result is a platform that development teams can build on without worrying about the underlying distributed systems challenges.
Prerequisites
- Azure subscription with Contributor access
- Azure CLI 2.50+ with aks-preview extension
- kubectl and Helm 3.x installed
- Familiarity with Kubernetes, Docker, and microservice patterns
- Azure DevOps or GitHub for CI/CD pipelines
Phase 1: AKS Cluster Provisioning
Production-Grade Cluster with Bicep
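param location string = resourceGroup().location
param clusterName string = 'aks-cloudnative-prod'

// Assumption: a Log Analytics workspace already exists in this resource group.
param logAnalyticsWorkspaceName string

// Assumption: resource ID of a pre-created subnet whose route table handles egress.
// 'userDefinedRouting' only works with a bring-your-own virtual network.
param nodeSubnetId string

resource logAnalytics 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: logAnalyticsWorkspaceName
}

resource aksCluster 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: clusterName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: '1.29'
    dnsPrefix: '${clusterName}-dns'
    enableRBAC: true
    aadProfile: {
      managed: true
      enableAzureRBAC: true
      // Placeholder: replace with the object ID (GUID) of your admin group.
      adminGroupObjectIDs: ['admin-group-id']
    }
    networkProfile: {
      networkPlugin: 'azure'
      networkPolicy: 'calico'
      serviceCidr: '10.0.0.0/16' // must not overlap with the node subnet
      dnsServiceIP: '10.0.0.10'
      loadBalancerSku: 'standard'
      outboundType: 'userDefinedRouting'
    }
    agentPoolProfiles: [
      {
        // System pool: tainted so only critical add-ons are scheduled here.
        name: 'system'
        count: 3
        vmSize: 'Standard_D4s_v5'
        osType: 'Linux'
        mode: 'System'
        vnetSubnetID: nodeSubnetId
        availabilityZones: ['1', '2', '3']
        enableAutoScaling: true
        minCount: 3
        maxCount: 5
        nodeTaints: ['CriticalAddonsOnly=true:NoSchedule']
      }
      {
        // General application pool, selected via the 'workload' label.
        name: 'apppool'
        count: 3
        vmSize: 'Standard_D8s_v5'
        osType: 'Linux'
        mode: 'User'
        vnetSubnetID: nodeSubnetId
        availabilityZones: ['1', '2', '3']
        enableAutoScaling: true
        minCount: 3
        maxCount: 20
        nodeLabels: {
          workload: 'application'
        }
      }
      {
        // GPU pool: starts at zero nodes and scales up on demand.
        name: 'gpupool'
        count: 0
        vmSize: 'Standard_NC6s_v3'
        osType: 'Linux'
        mode: 'User'
        vnetSubnetID: nodeSubnetId
        enableAutoScaling: true
        minCount: 0
        maxCount: 4
        nodeLabels: {
          workload: 'gpu'
        }
        nodeTaints: ['nvidia.com/gpu=true:NoSchedule']
      }
    ]
    autoUpgradeProfile: {
      upgradeChannel: 'stable'
      nodeOSUpgradeChannel: 'NodeImage'
    }
    oidcIssuerProfile: { enabled: true }
    securityProfile: {
      workloadIdentity: { enabled: true }
      defender: {
        // Defender needs a workspace to send its security data to.
        logAnalyticsWorkspaceResourceId: logAnalytics.id
        securityMonitoring: { enabled: true }
      }
      imageCleaner: { enabled: true, intervalHours: 48 }
    }
    addonProfiles: {
      omsagent: {
        enabled: true
        config: { logAnalyticsWorkspaceResourceID: logAnalytics.id }
      }
      azurepolicy: { enabled: true }
      azureKeyvaultSecretsProvider: {
        enabled: true
        config: { enableSecretRotation: 'true', rotationPollInterval: '2m' }
      }
    }
  }
}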
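The labels and taints on the node pools above only matter once workloads opt in to them. As a minimal sketch, a GPU workload would need both a `nodeSelector` matching the `workload: gpu` label and a toleration for the `nvidia.com/gpu=true:NoSchedule` taint. The pod name and image below are hypothetical, and the `nvidia.com/gpu` resource limit assumes the NVIDIA device plugin is running on the GPU nodes.

```yaml
# Hypothetical GPU pod; name and image are placeholders, not part of the platform.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
  namespace: production
spec:
  nodeSelector:
    workload: gpu                 # matches nodeLabels on 'gpupool'
  tolerations:
    - key: nvidia.com/gpu         # matches the taint declared on 'gpupool'
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: inference
      image: example.azurecr.io/inference:1.0.0
      resources:
        limits:
          nvidia.com/gpu: 1       # assumes the NVIDIA device plugin is installed
```

Because `gpupool` has `minCount: 0`, a pending pod like this is what prompts the cluster autoscaler to add the first GPU node.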
Phase 2: Dapr Runtime for Microservice Abstractions
Installing and Configuring Dapr
# Install Dapr on AKS in HA mode
# (PowerShell line continuations; in Bash, replace each trailing ` with \)
helm repo add dapr https://dapr.github.io/helm-charts/
helm repo update
helm install dapr dapr/dapr `
  --namespace dapr-system `
  --create-namespace `
  --set global.ha.enabled=true `
  --set dapr_placement.replicaCount=3 `
  --set dapr_sentry.replicaCount=3 `
  --set dapr_operator.replicaCount=3 `
  --set global.mtls.enabled=true `
  --set global.logAsJson=true `
  --version 1.13
Dapr Components for Azure Service Bus
# dapr-components/servicebus-pubsub.yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: order-pubsub
  namespace: production
spec:
  type: pubsub.azure.servicebus.topics
  version: v1
  metadata:
    - name: connectionString
      secretKeyRef:
        name: servicebus-secret
        key: connectionString
    - name: maxActiveMessages
      value: "100"
    - name: maxConcurrentHandlers
      value: "10"
    - name: lockRenewalInSec
      value: "60"
    - name: maxConnectionRecoveryInSec
      value: "300"
    - name: publishMaxRetries
      value: "5"
    - name: publishInitialRetryIntervalInMs
      value: "500"
auth:
  secretStore: azure-keyvault
---
# dapr-components/statestore.yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: order-state
  namespace: production
spec:
  type: state.azure.cosmosdb
  version: v1
  metadata:
    - name: url
      value: "https://cloudnative-cosmos.documents.azure.com:443/"
    - name: masterKey
      secretKeyRef:
        name: cosmos-secret
        key: masterKey
    - name: database
      value: "orders"
    - name: collection
      value: "state"
    - name: actorStateStore
      value: "true"
auth:
  secretStore: azure-keyvault
---
# dapr-components/bindings-blob.yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: document-storage
  namespace: production
spec:
  type: bindings.azure.blobstorage
  version: v1
  metadata:
    - name: accountName
      value: "cloudnativedocs"
    - name: accountKey
      secretKeyRef:
        name: storage-secret
        key: accountKey
    - name: containerName
      value: "documents"
auth:
  secretStore: azure-keyvault
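All three components resolve their secrets through a secret store named `azure-keyvault`, which therefore has to exist as a Dapr component in the same namespace. A minimal sketch of that component might look like the following. The vault name and client ID are placeholders, and the example assumes the pod authenticates with a managed identity rather than a client secret.

```yaml
# dapr-components/secretstore.yaml (hypothetical file name)
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: azure-keyvault            # must match auth.secretStore in the components above
  namespace: production
spec:
  type: secretstores.azure.keyvault
  version: v1
  metadata:
    - name: vaultName
      value: "cloudnative-kv"     # placeholder vault name
    - name: azureClientId
      value: "00000000-0000-0000-0000-000000000000"   # placeholder managed identity client ID
```

With this in place, a reference such as `secretKeyRef.name: servicebus-secret` is looked up as a Key Vault secret of that name, so the identity needs read access to secrets in the vault.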
Microservice with Dapr SDK
// OrderService/Program.cs - .NET 8 microservice using Dapr
using Dapr.Client;
using Dapr.AspNetCore;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddDaprClient();
builder.Services.AddControllers().AddDapr();
builder.Services.AddHealthChecks();

var app = builder.Build();
app.UseCloudEvents();        // unwrap CloudEvents envelopes on incoming pub/sub requests
app.MapSubscribeHandler();   // expose /dapr/subscribe so the sidecar discovers [Topic] handlers
app.MapControllers();
app.MapHealthChecks("/healthz");
app.Run();

// Controllers/OrderController.cs
// Order, CreateOrderRequest, and the event types are application models not shown here.
using Dapr;
using Dapr.Client;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/orders")]
public class OrderController : ControllerBase
{
    private readonly DaprClient _daprClient;
    private readonly ILogger<OrderController> _logger;

    public OrderController(DaprClient daprClient, ILogger<OrderController> logger)
    {
        _daprClient = daprClient;
        _logger = logger;
    }

    [HttpPost]
    public async Task<IActionResult> CreateOrder([FromBody] CreateOrderRequest request)
    {
        var order = new Order
        {
            Id = Guid.NewGuid().ToString(),
            CustomerId = request.CustomerId,
            Items = request.Items,
            Status = OrderStatus.Created,
            CreatedAt = DateTime.UtcNow,
            TotalAmount = request.Items.Sum(i => i.Price * i.Quantity)
        };

        // Save state to Cosmos DB via Dapr state store
        await _daprClient.SaveStateAsync("order-state", order.Id, order);

        // Publish order created event via Dapr pub/sub (Service Bus)
        await _daprClient.PublishEventAsync("order-pubsub", "orders", new OrderCreatedEvent
        {
            OrderId = order.Id,
            CustomerId = order.CustomerId,
            TotalAmount = order.TotalAmount,
            CreatedAt = order.CreatedAt
        });

        _logger.LogInformation("Order {OrderId} created for customer {CustomerId}", order.Id, order.CustomerId);
        return CreatedAtAction(nameof(GetOrder), new { id = order.Id }, order);
    }

    // Read endpoint referenced by CreatedAtAction above
    [HttpGet("{id}")]
    public async Task<IActionResult> GetOrder(string id)
    {
        var order = await _daprClient.GetStateAsync<Order>("order-state", id);
        return order is null ? NotFound() : Ok(order);
    }

    [Topic("order-pubsub", "payments")]
    [HttpPost("payment-completed")]
    public async Task<IActionResult> HandlePaymentCompleted([FromBody] PaymentCompletedEvent evt)
    {
        var order = await _daprClient.GetStateAsync<Order>("order-state", evt.OrderId);
        if (order == null) return NotFound();

        order.Status = OrderStatus.Paid;
        order.PaidAt = evt.PaidAt;
        await _daprClient.SaveStateAsync("order-state", order.Id, order);

        // Invoke shipping service via Dapr service invocation
        await _daprClient.InvokeMethodAsync(HttpMethod.Post, "shipping-service", "api/shipments", new
        {
            OrderId = order.Id,
            Address = order.ShippingAddress,
            Items = order.Items
        });

        return Ok();
    }
}
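The service above only gets a Dapr sidecar if its pod template carries the Dapr annotations, which is also the first thing to check when the sidecar does not inject. The following Deployment is a sketch rather than the platform's actual manifest: the app ID `order-service`, the image, and port 8080 are assumptions. The `dapr.io/metrics-port` annotation is included because the Prometheus scrape rule in the observability phase keeps only pods that carry it.

```yaml
# Hypothetical Deployment for the order service; image and port are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
      annotations:
        dapr.io/enabled: "true"           # request sidecar injection
        dapr.io/app-id: "order-service"   # ID other services use for invocation
        dapr.io/app-port: "8080"          # port the app listens on (assumed)
        dapr.io/enable-metrics: "true"
        dapr.io/metrics-port: "9090"
    spec:
      nodeSelector:
        workload: application             # schedule onto 'apppool'
      containers:
        - name: order-service
          image: example.azurecr.io/order-service:1.0.0
          ports:
            - containerPort: 8080
```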
Phase 3: KEDA Event-Driven Autoscaling
Scaling Based on Azure Service Bus Queue Depth
# keda/order-processor-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 60
  minReplicaCount: 1
  maxReplicaCount: 30
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: order-processing
        namespace: cloudnative-bus
        messageCount: "10"
        activationMessageCount: "1"
        connectionFromEnv: SERVICEBUS_CONNECTION
    - type: cron
      metadata:
        timezone: "America/New_York"
        start: "0 8 * * 1-5"
        end: "0 20 * * 1-5"
        desiredReplicas: "5"
---
# Scale based on HTTP traffic using Prometheus metrics
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-gateway-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-gateway
  pollingInterval: 10
  cooldownPeriod: 120
  minReplicaCount: 3
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring:9090
        metricName: http_requests_per_second
        query: |
          sum(rate(http_requests_total{app="api-gateway"}[2m]))
        threshold: "200"
        activationThreshold: "50"
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"
    - type: memory
      metricType: Utilization
      metadata:
        value: "80"
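The Service Bus trigger above reads its connection string from an environment variable on the scaled workload (`connectionFromEnv`). An alternative that keeps the credential out of the application pod is a `TriggerAuthentication` object referenced from the trigger. The sketch below assumes a Kubernetes Secret named `servicebus-secret` with a `connectionString` key exists in the `production` namespace; that Secret and the object names are illustrative only.

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: servicebus-trigger-auth       # hypothetical name
  namespace: production
spec:
  secretTargetRef:
    - parameter: connection           # parameter the azure-servicebus scaler expects
      name: servicebus-secret         # assumed Kubernetes Secret
      key: connectionString
---
# Variant of the order-processor ScaledObject that uses the TriggerAuthentication.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: order-processing
        messageCount: "10"
      authenticationRef:
        name: servicebus-trigger-auth
```

With `messageCount: "10"`, KEDA targets roughly one replica per ten queued messages, bounded by `minReplicaCount` and `maxReplicaCount`.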
Phase 4: Azure Service Bus Patterns
Dead Letter Processing and Retry Logic
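using Azure.Messaging.ServiceBus;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class ServiceBusDeadLetterProcessor : BackgroundService
{
    private const string QueueName = "order-processing";
    private const int MaxRetries = 5;

    private readonly ServiceBusClient _client;
    private readonly ILogger<ServiceBusDeadLetterProcessor> _logger;

    public ServiceBusDeadLetterProcessor(ServiceBusClient client, ILogger<ServiceBusDeadLetterProcessor> logger)
    {
        _client = client;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await using var dlqReceiver = _client.CreateReceiver(
            QueueName,
            new ServiceBusReceiverOptions
            {
                SubQueue = SubQueue.DeadLetter,
                ReceiveMode = ServiceBusReceiveMode.PeekLock
            });
        // Create the sender once and reuse it instead of creating one per message.
        await using var sender = _client.CreateSender(QueueName);

        while (!stoppingToken.IsCancellationRequested)
        {
            var messages = await dlqReceiver.ReceiveMessagesAsync(
                maxMessages: 10,
                maxWaitTime: TimeSpan.FromSeconds(30),
                cancellationToken: stoppingToken);

            foreach (var message in messages)
            {
                var deadLetterReason = message.DeadLetterReason;
                var errorDescription = message.DeadLetterErrorDescription;
                _logger.LogWarning(
                    "Dead letter: {Reason} - {Description} for message {Id}",
                    deadLetterReason, errorDescription, message.MessageId);

                // Analyze and route dead letters
                var retryCount = message.ApplicationProperties.TryGetValue("RetryCount", out var previous)
                    ? Convert.ToInt32(previous) + 1 : 1;

                if (IsRetryable(deadLetterReason) && retryCount <= MaxRetries)
                {
                    // Re-enqueue with exponential backoff: 20 s for retry 1, up to 320 s for retry 5.
                    // Note: if duplicate detection is enabled on the queue, reusing MessageId
                    // can cause the retry to be discarded as a duplicate.
                    var retryMessage = new ServiceBusMessage(message.Body)
                    {
                        MessageId = message.MessageId,
                        ScheduledEnqueueTime = DateTimeOffset.UtcNow.AddSeconds(Math.Pow(2, retryCount) * 10)
                    };
                    retryMessage.ApplicationProperties["RetryCount"] = retryCount;
                    await sender.SendMessageAsync(retryMessage, stoppingToken);
                }
                else
                {
                    // Not retryable, or retries exhausted
                    await ParkMessageAsync(message);
                }

                // Remove the message from the dead letter queue once it has been routed.
                await dlqReceiver.CompleteAsync(message, stoppingToken);
            }
        }
    }

    // Placeholder policy: retry only messages that exhausted their delivery count.
    private static bool IsRetryable(string? reason) => reason == "MaxDeliveryCountExceeded";

    // Placeholder: persist the message somewhere durable for manual inspection.
    private Task ParkMessageAsync(ServiceBusReceivedMessage message)
    {
        _logger.LogError("Parking message {Id} for manual review", message.MessageId);
        return Task.CompletedTask;
    }
}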
Phase 5: GitOps with Flux CD
Repository Structure and Flux Configuration
# flux-system/gotk-sync.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://dev.azure.com/org/project/_git/platform-config
  ref:
    branch: main
  secretRef:
    name: git-credentials
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure
  prune: true
  sourceRef:
    kind: GitRepository
    name: platform-config
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: dapr-operator
      namespace: dapr-system
  dependsOn: []
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: applications
  namespace: flux-system
spec:
  interval: 5m
  path: ./applications/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: platform-config
  dependsOn:
    - name: infrastructure
  postBuild:
    substitute:
      ENVIRONMENT: production
      CLUSTER_NAME: aks-cloudnative-prod
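The `postBuild.substitute` block on the `applications` Kustomization makes those two variables available to every manifest under `./applications/production`. As an illustration, a manifest in that path could reference them with `${...}` placeholders, which Flux replaces before applying. The file name and ConfigMap below are hypothetical.

```yaml
# applications/production/platform-info.yaml (hypothetical file)
apiVersion: v1
kind: ConfigMap
metadata:
  name: platform-info
  namespace: production
data:
  environment: ${ENVIRONMENT}     # becomes "production" after substitution
  clusterName: ${CLUSTER_NAME}    # becomes "aks-cloudnative-prod" after substitution
```

Keeping environment-specific values in the Kustomization rather than in each manifest lets the same application manifests be reused for another cluster by pointing a second Kustomization at them with different substitutions.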
Phase 6: Observability Stack
OpenTelemetry Configuration
# otel-collector/config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'dapr'
              scrape_interval: 15s
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                # Keep only pods that declare a Dapr metrics port annotation
                - source_labels: [__meta_kubernetes_pod_annotation_dapr_io_metrics_port]
                  action: keep
                  regex: '.+'
    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      memory_limiter:
        limit_mib: 512
        spike_limit_mib: 128
        check_interval: 5s
      resource:
        attributes:
          - key: environment
            value: production
            action: upsert
          - key: cloud.provider
            value: azure
            action: upsert
    exporters:
      azuremonitor:
        # Read from the collector's environment
        connection_string: ${env:APPLICATIONINSIGHTS_CONNECTION_STRING}
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [azuremonitor]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [azuremonitor, prometheus]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [azuremonitor]
Platform Decision Matrix
| Concern | Technology | Why |
|---|---|---|
| Container Orchestration | AKS | Managed Kubernetes, Azure-native |
| Service Communication | Dapr | Infrastructure abstraction, mTLS, pub/sub |
| Autoscaling | KEDA | Event-driven, scales to zero, multi-trigger |
| Messaging | Azure Service Bus | Enterprise reliability, dead letter, sessions |
| Deployment | Flux CD | GitOps, drift detection, multi-tenancy |
| Observability | OpenTelemetry + Azure Monitor | Vendor-neutral, distributed tracing |
| Secrets | Key Vault + CSI Driver | Auto-rotation, managed identity |
| Networking | Calico | Network policies, microsegmentation |
Best Practices
- Use Dapr for service-to-service calls: Automatic retries, mTLS, and observability without code changes
- KEDA over HPA for event-driven workloads: Scale based on queue depth, not just CPU/memory
- GitOps is non-negotiable: Never kubectl apply in production — always go through Git
- Separate system and application node pools: Prevent application workloads from starving system components
- Implement circuit breakers: Use Dapr's built-in retry and circuit breaker policies
- Dead letter queues from day one: Every Service Bus subscription should have DLQ processing
- Use workload identity over service principals: Pod-level managed identity is more secure
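The retry and circuit breaker policies mentioned in the list above are declared in a Dapr `Resiliency` resource rather than in application code. The following is a minimal sketch for the order service's calls to `shipping-service`. The policy names, the thresholds, and the `order-service` app ID in `scopes` are illustrative assumptions, not tuned values.

```yaml
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: order-resiliency            # hypothetical name
  namespace: production
scopes:
  - order-service                   # assumed Dapr app ID of the calling service
spec:
  policies:
    timeouts:
      general: 5s
    retries:
      shippingRetry:
        policy: exponential
        maxInterval: 15s
        maxRetries: 5
    circuitBreakers:
      shippingCB:
        maxRequests: 1              # probe requests allowed while half-open
        interval: 8s
        timeout: 45s                # how long the breaker stays open before probing
        trip: consecutiveFailures > 5
  targets:
    apps:
      shipping-service:
        timeout: general
        retry: shippingRetry
        circuitBreaker: shippingCB
```

Because the policy targets the app ID, the `InvokeMethodAsync` call to `shipping-service` picks it up without any change to the controller.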
Troubleshooting
| Issue | Root Cause | Resolution |
|---|---|---|
| Dapr sidecar not injecting | Missing annotation or namespace label | Add dapr.io/enabled: "true" annotation |
| KEDA not scaling | Trigger authentication failure | Verify TriggerAuthentication secret references |
| Apparent Service Bus message loss | Lock expires before the message is completed, so it is redelivered and eventually dead-lettered | Increase lock duration, add renewal logic, and check the dead letter queue |
| Flux drift detected | Manual change outside GitOps | Revert manual change, update Git source |
| High latency between services | Missing Dapr service invocation | Use Dapr service invocation, not direct HTTP |
Architecture Decision and Tradeoffs
When designing an integrated platform like this one on Azure, consider these key architectural trade-offs:
| Approach | Best For | Tradeoff |
|---|---|---|
| Managed / platform service | Rapid delivery, reduced ops burden | Less customisation, potential vendor lock-in |
| Custom / self-hosted | Full control, advanced tuning | Higher operational overhead and cost |
Recommendation: Start with the managed approach for most workloads and move to custom only when specific requirements demand it.
Validation and Versioning
- Last validated: April 2026
- Validate examples against your tenant, region, and SKU constraints before production rollout.
- Keep module, CLI, and SDK versions pinned in automation pipelines and review quarterly.
Security and Governance Considerations
- Apply least-privilege access using RBAC roles and just-in-time elevation for admin tasks.
- Store secrets in managed secret stores and avoid embedding credentials in scripts or source files.
- Enable audit logging, data protection policies, and periodic access reviews for regulated workloads.
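For the credential guidance above, the cluster already has the OIDC issuer and workload identity enabled, so pods can obtain Azure tokens without any stored secret. A sketch of the Kubernetes side is shown below. The service account name, client ID, pod name, and image are placeholders, and the example assumes a federated identity credential linking the managed identity to this service account has been created separately in Azure.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-service-sa            # hypothetical name
  namespace: production
  annotations:
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"  # placeholder
---
# Pod shown standalone for brevity; in practice these fields go in the Deployment's pod template.
apiVersion: v1
kind: Pod
metadata:
  name: order-service-example
  namespace: production
  labels:
    azure.workload.identity/use: "true"   # opt in to token injection by the webhook
spec:
  serviceAccountName: order-service-sa
  containers:
    - name: order-service
      image: example.azurecr.io/order-service:1.0.0   # placeholder image
```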
Cost and Performance Notes
- Define budgets and alerts, then monitor usage and cost trends continuously after go-live.
- Baseline performance with synthetic and real-user checks before and after major changes.
- Scale resources with measured thresholds and revisit sizing after usage pattern changes.
Public Examples from Official Sources
- These examples are sourced from official public Microsoft documentation and sample repositories.
- Documentation examples: https://learn.microsoft.com/azure/well-architected/
- Sample repositories: https://github.com/Azure/ArchitectureCenter
- Prefer adapting these examples to your tenant, subscriptions, and governance requirements before production use.
Key Takeaways
- A cloud-native platform is more than Kubernetes — it requires service mesh, event-driven scaling, reliable messaging, and GitOps
- Dapr eliminates boilerplate distributed systems code while remaining runtime-agnostic
- KEDA enables true event-driven architectures that scale to zero during idle periods
- Azure Service Bus provides enterprise-grade messaging with dead letter handling and sessions
- GitOps with Flux ensures every deployment is auditable, reversible, and repeatable