Azure Monitor and Application Insights: Complete Observability
You cannot fix what you cannot see. Azure Monitor and Application Insights give you a unified observability platform — application traces, infrastructure metrics, log analytics, and intelligent alerting — all queryable with KQL.
Observability Architecture
flowchart TB
subgraph Apps["Instrumented Applications"]
DOTNET[.NET / ASP.NET Core\nAuto + custom telemetry]
NODE[Node.js\napplicationinsights SDK]
JAVA[Java\nAI Java agent]
FUNC[Azure Functions\nBuilt-in integration]
end
subgraph Infra["Infrastructure Sources"]
VM[Virtual Machines\nAzure Monitor Agent]
AKS[AKS\nContainer Insights]
SQL[Azure SQL\nIntelligent Insights]
COSM[Cosmos DB\nDiagnostic logs]
end
subgraph AI["Application Insights"]
REQ[Requests]
DEP[Dependencies]
EXC[Exceptions]
TRACE[Traces / Custom Events]
METRIC[Custom Metrics]
end
subgraph LA["Log Analytics Workspace"]
KQL_Q[KQL Queries]
DASH[Dashboards & Workbooks]
ALERT[Alert Rules]
end
Apps --> AI
Infra --> LA
AI --> LA
LA --> KQL_Q
LA --> DASH
LA --> ALERT
style Apps fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
style Infra fill:#d1fae5,stroke:#059669,color:#065f46
style AI fill:#ede9fe,stroke:#8b5cf6,color:#4c1d95
style LA fill:#fef3c7,stroke:#f59e0b,color:#78350f
Three Pillars of Observability
| Pillar | What It Captures | Azure Service | Key KQL Table |
|---|---|---|---|
| Traces | Request flows, dependencies, correlation IDs | Application Insights | requests, dependencies |
| Metrics | Numeric time-series (CPU, RU, duration) | Azure Monitor Metrics | performanceCounters, customMetrics |
| Logs | Structured events, exceptions, custom events | Log Analytics | exceptions, traces, customEvents |
Step 1: Instrument Your Application
.NET (ASP.NET Core)
dotnet add package Microsoft.ApplicationInsights.AspNetCore
// Program.cs
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
    options.EnableAdaptiveSampling = true;
    options.EnableQuickPulseMetricStream = true;
});
Custom telemetry (Controller):
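using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.AspNetCore.Mvc;

public class OrderController : ControllerBase
{
    private readonly TelemetryClient _telemetry;
    // IOrderRepository is a hypothetical abstraction that exposes SaveOrderAsync.
    private readonly IOrderRepository _repository;

    public OrderController(TelemetryClient telemetry, IOrderRepository repository)
    {
        _telemetry = telemetry;
        _repository = repository;
    }

    [HttpPost]
    public async Task<IActionResult> CreateOrder(Order order)
    {
        // The SDK already tracks the incoming HTTP request; this adds a named nested operation.
        using var operation = _telemetry.StartOperation<RequestTelemetry>("CreateOrder");
        try
        {
            await _repository.SaveOrderAsync(order);

            // Emit business telemetry only after the order has been persisted.
            _telemetry.TrackEvent("OrderCreated", new Dictionary<string, string>
            {
                ["OrderId"] = order.Id,
                ["CustomerId"] = order.CustomerId,
                ["Amount"] = order.Total.ToString()
            });
            _telemetry.TrackMetric("OrderValue", (double)order.Total);
            return Ok(order);
        }
        catch (Exception ex)
        {
            _telemetry.TrackException(ex);
            operation.Telemetry.Success = false;
            throw;
        }
    }
}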
Node.js
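const appInsights = require('applicationinsights');
// Call setup() before requiring other modules so auto-collection can patch them.
appInsights.setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
    .setAutoDependencyCorrelation(true)
    .setAutoCollectRequests(true)
    .setAutoCollectPerformance(true)
    .start();

// Assumes an Express app; the route below uses Express's API.
const express = require('express');
const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated

const client = appInsights.defaultClient;

app.post('/orders', async (req, res) => {
    try {
        await saveOrder(req.body); // saveOrder is the application's own persistence function
        // Track business telemetry only after the order has been saved.
        client.trackEvent({ name: 'OrderCreated', properties: { orderId: req.body.id } });
        client.trackMetric({ name: 'OrderValue', value: req.body.total });
        res.json({ success: true });
    } catch (error) {
        client.trackException({ exception: error });
        res.status(500).json({ error: error.message });
    }
});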
Step 2: KQL Queries for Insights
Failed Requests (Last 24 h)
requests
| where success == false
| where timestamp > ago(24h)
| summarize FailureCount = count() by operation_Name, resultCode
| order by FailureCount desc
| take 10
Slow Requests
requests
| where timestamp > ago(1h)
| where duration > 5000 // ms
| project timestamp, operation_Name, duration, url
| order by duration desc
Dependency Performance (p95 + Failure Rate)
dependencies
| where timestamp > ago(24h)
| summarize
    AvgDuration = avg(duration),
    P95Duration = percentile(duration, 95),
    FailureRate = countif(success == false) * 100.0 / count()
    by target, type
| order by P95Duration desc
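To make the three aggregates in this query concrete, the sketch below computes them for a small in-memory sample. It is an illustration only: the `{ duration, success }` record shape and the sample values are assumptions, and it uses a nearest-rank percentile, whereas KQL's `percentile()` returns an estimate, so results on real data may differ slightly.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Mirrors the summarize clause: average, p95, and failure percentage.
function summarizeDependency(calls) {
  const durations = calls.map((c) => c.duration);
  const failures = calls.filter((c) => c.success === false).length;
  return {
    avgDuration: durations.reduce((sum, d) => sum + d, 0) / calls.length,
    p95Duration: percentile(durations, 95),
    failureRate: (failures * 100.0) / calls.length,
  };
}

// Hypothetical sample: 19 fast successful calls and one slow failure.
const sample = [
  ...Array.from({ length: 19 }, () => ({ duration: 20, success: true })),
  { duration: 2000, success: false },
];
const summary = summarizeDependency(sample);
console.log(summary);
```

In this sample the single 2000 ms outlier pulls the average up to 119 ms while the p95 stays at 20 ms, which is why the query reports both alongside the failure rate.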
Funnel Analysis
customEvents
| where timestamp > ago(30d)
| where name in ("ProductViewed", "AddedToCart", "CheckoutStarted", "OrderCompleted")
| summarize Users = dcount(user_Id) by name
| order by Users desc
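The query returns a user count per event, but a funnel is usually read as conversion between consecutive steps. A minimal sketch of that follow-up calculation is shown here; the counts are hypothetical, the step order is assumed to match the event list in the query, and `dcount` is itself an estimate, so treat the percentages as approximate.

```javascript
// Funnel steps in the order users are expected to move through them.
const FUNNEL_STEPS = ['ProductViewed', 'AddedToCart', 'CheckoutStarted', 'OrderCompleted'];

// usersByStep maps event name -> distinct users, e.g. built from the query result rows.
function funnelConversion(usersByStep, steps = FUNNEL_STEPS) {
  const first = usersByStep[steps[0]] ?? 0;
  return steps.map((name, i) => {
    const users = usersByStep[name] ?? 0;
    const previous = i === 0 ? users : (usersByStep[steps[i - 1]] ?? 0);
    return {
      name,
      users,
      // Share of users from the previous step who reached this one.
      stepConversionPct: previous > 0 ? (users * 100) / previous : 0,
      // Share of users from the first step who reached this one.
      overallConversionPct: first > 0 ? (users * 100) / first : 0,
    };
  });
}

// Hypothetical counts, not real data.
const report = funnelConversion({
  ProductViewed: 1000,
  AddedToCart: 400,
  CheckoutStarted: 200,
  OrderCompleted: 150,
});
console.table(report);
```

Ordering by the fixed step list rather than by `Users desc` keeps the funnel readable even if a later step happens to record more users than an earlier one.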
Anomaly Detection
requests
| where timestamp > ago(7d)
| make-series RequestCount = count() default = 0 on timestamp step 1h
| extend anomalies = series_decompose_anomalies(RequestCount, 1.5)
| mv-expand timestamp to typeof(datetime), RequestCount to typeof(long), anomalies to typeof(double)
| where anomalies != 0
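The `1.5` passed to `series_decompose_anomalies` is the threshold of a Tukey-style fence test. The sketch below shows the fence idea on a raw series so the meaning of that number is easier to see. It is deliberately simplified: it skips the seasonal and trend decomposition that the KQL function performs first, it uses plain quartiles, and the hourly counts are made up, so its flags will not match Kusto's output on real data.

```javascript
// Linear-interpolated quantile of an ascending-sorted array, q in [0, 1].
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Returns +1 for a spike, -1 for a dip, and 0 for a normal point.
function tukeyAnomalyFlags(series, k = 1.5) {
  const sorted = [...series].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const lower = q1 - k * iqr; // values below this fence are dips
  const upper = q3 + k * iqr; // values above this fence are spikes
  return series.map((v) => (v > upper ? 1 : v < lower ? -1 : 0));
}

// Hypothetical hourly request counts with one spike and one dip.
const hourlyCounts = [100, 104, 98, 101, 99, 400, 102, 97, 3, 100];
const flags = tukeyAnomalyFlags(hourlyCounts);
console.log(flags);
```

A larger `k` widens the fences and flags fewer points, which is the same trade-off as raising the threshold argument in the query.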
Step 3: Distributed Tracing
sequenceDiagram
participant Browser as Browser
participant API as API Gateway\n(App Insights)
participant Orders as Orders Service\n(App Insights)
participant DB as Cosmos DB\n(dependency track)
participant Queue as Service Bus\n(dependency track)
Browser->>API: POST /checkout\noperation_Id: abc123
API->>Orders: POST /orders\noperation_Id: abc123
Orders->>DB: CreateItem\noperation_Id: abc123
DB-->>Orders: 201 Created (12ms)
Orders->>Queue: Send message\noperation_Id: abc123
Queue-->>Orders: Sent (3ms)
Orders-->>API: 201 Created (85ms)
API-->>Browser: 200 OK (120ms)
Note over Browser,Queue: All spans share operation_Id abc123<br/>Visible in Application Map + E2E Transaction view
Query the full trace by operation ID:
union requests, dependencies, exceptions
| where operation_Id == "abc123..."
| project timestamp, itemType, name, duration, success
| order by timestamp asc
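The shared `operation_Id` only exists because each service forwards the trace context it received. With W3C Trace Context, that value travels as the trace-id field of the `traceparent` header. The sketch below shows, under simplifying assumptions, how a service could parse an incoming header and build the header for an outgoing call. The function names are hypothetical, only version `00` headers are handled, and in practice the Application Insights SDK does this for you when auto-correlation is enabled.

```javascript
const crypto = require('node:crypto');

// version - trace-id (32 hex) - parent-id (16 hex) - flags, all lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(String(header ?? '').trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid under the W3C specification.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, flags };
}

// Builds the header for an outgoing call: same trace-id, fresh span id.
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  // Start a new trace when the incoming header is missing or invalid.
  const traceId = parent ? parent.traceId : crypto.randomBytes(16).toString('hex');
  const flags = parent ? parent.flags : '01';
  const spanId = crypto.randomBytes(8).toString('hex');
  return `00-${traceId}-${spanId}-${flags}`;
}

// Example header taken from the W3C Trace Context specification.
const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const outgoing = childTraceparent(incoming);
console.log(outgoing);
```

If any hop drops or regenerates the trace-id, its spans end up under a different `operation_Id` and disappear from the query above.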
Step 4: Infrastructure Monitoring
VM CPU Metrics
az monitor metrics list \
--resource /subscriptions/.../providers/Microsoft.Compute/virtualMachines/vm-web \
--metric "Percentage CPU" \
--start-time 2025-08-04T00:00:00Z \
--end-time 2025-08-04T23:59:59Z \
--interval PT1H
AKS Container Logs
ContainerLogV2
| where TimeGenerated > ago(1h)
| where ContainerName == "api-orders"
| where tostring(LogMessage) contains "error"
| project TimeGenerated, ContainerName, LogMessage
VM Memory (Insights Metrics)
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Name == "AvailableMB"
| summarize AvgMemoryMB = avg(Val) by Computer
| order by AvgMemoryMB asc
Step 5: Alerting Strategies
flowchart LR
subgraph Signals["Alert Signals"]
M[Metric Alerts\nReal-time thresholds]
L[Log Alerts\nKQL-based conditions]
A[Activity Log\nResource changes]
SM[Smart Detection\nAI anomalies]
end
subgraph Actions["Action Groups"]
EMAIL[Email / SMS]
WEBHOOK[Webhook / ITSM]
RUNBOOK[Automation Runbook]
FUNC[Azure Function\nauto-remediation]
end
Signals --> Actions
style Signals fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
style Actions fill:#d1fae5,stroke:#059669,color:#065f46
Metric alert — high CPU:
az monitor metrics alert create \
--name "High CPU Alert" \
--resource-group rg-monitoring \
--scopes /subscriptions/.../providers/Microsoft.Compute/virtualMachines/vm-web \
--condition "avg Percentage CPU > 80" \
--window-size 5m \
--evaluation-frequency 1m \
--action /subscriptions/.../actionGroups/ops-team
Log alert — error rate > 5%:
az monitor scheduled-query create \
--name "High Error Rate" \
--resource-group rg-monitoring \
--scopes /subscriptions/.../components/myapi-insights \
--condition "count 'ErrorRateQuery' > 0" \
--condition-query ErrorRateQuery="requests | summarize ErrorRate = 100.0 * countif(success == false) / count() | where ErrorRate > 5" \
--window-size 5m \
--evaluation-frequency 5m \
--severity 2 \
--action-groups /subscriptions/.../actionGroups/ops-team
Recommended alert thresholds:
| Signal | Warning | Critical |
|---|---|---|
| Request error rate | > 1% | > 5% |
| Response time p95 | > 2 s | > 5 s |
| CPU utilisation | > 70% | > 90% |
| Memory available | < 500 MB | < 200 MB |
| Availability test | < 99% | < 95% |
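The table mixes signals where high values are bad (error rate, latency, CPU) with signals where low values are bad (available memory, availability). A small sketch of how those thresholds could be encoded and evaluated, for example in a custom health endpoint or a test of alert logic, is shown here. The signal keys and the `classify` function are hypothetical names; the numbers are copied from the table.

```javascript
// Thresholds from the table above; "direction" says which side of the limit is bad.
const THRESHOLDS = {
  errorRatePct: { warning: 1, critical: 5, direction: 'above' },
  p95ResponseSec: { warning: 2, critical: 5, direction: 'above' },
  cpuPct: { warning: 70, critical: 90, direction: 'above' },
  memoryAvailableMb: { warning: 500, critical: 200, direction: 'below' },
  availabilityPct: { warning: 99, critical: 95, direction: 'below' },
};

// Returns 'critical', 'warning', or 'ok' for one measured value.
function classify(signal, value) {
  const t = THRESHOLDS[signal];
  if (!t) throw new Error(`Unknown signal: ${signal}`);
  const breached = (limit) => (t.direction === 'above' ? value > limit : value < limit);
  if (breached(t.critical)) return 'critical'; // check the stricter limit first
  if (breached(t.warning)) return 'warning';
  return 'ok';
}

console.log(classify('errorRatePct', 3));        // between the 1% and 5% limits
console.log(classify('memoryAvailableMb', 150)); // below the 200 MB limit
```

Checking the critical limit before the warning limit matters because any value that breaches the critical threshold also breaches the warning one.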
Step 6: Cost Optimisation
Adaptive sampling (reduces ingestion volume):
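// When configuring sampling manually like this, set options.EnableAdaptiveSampling = false
// in AddApplicationInsightsTelemetry so that two sampling processors are not stacked.
builder.Services.Configure<TelemetryConfiguration>(config =>
{
    config.DefaultTelemetrySink.TelemetryProcessorChainBuilder
        .UseAdaptiveSampling(maxTelemetryItemsPerSecond: 5)
        .Build();
});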
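Sampling stays useful for diagnostics because the keep-or-drop decision is made per operation rather than per item, so a retained trace keeps its request, dependencies, and exceptions together. The sketch below illustrates that idea with a deterministic hash of the operation ID at a fixed percentage. It is an assumption-laden illustration, not the SDK's algorithm: the FNV-1a hash stands in for whatever hash the SDK uses, and adaptive sampling additionally adjusts the percentage over time to hit the configured items-per-second target.

```javascript
// FNV-1a 32-bit hash; a stand-in for the SDK's own hash, which may differ.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 unsigned bits
  }
  return hash;
}

// Keep an item when its operation ID hashes into the lowest `percentage` of the range.
function shouldSample(operationId, percentage) {
  const score = (fnv1a(operationId) / 0x100000000) * 100; // maps the hash into [0, 100)
  return score < percentage;
}

// Every telemetry item carrying the same operation ID gets the same decision.
const keepRequest = shouldSample('abc123', 25);
const keepDependency = shouldSample('abc123', 25);
console.log(keepRequest === keepDependency); // true
```

Because the decision depends only on the operation ID and the percentage, independently deployed services that use the same scheme would keep or drop the same traces.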
Set data retention period:
az monitor log-analytics workspace update \
--resource-group rg-monitoring \
--workspace-name logs-workspace \
--retention-time 90 # days; workspace default is 30, maximum 730; retention beyond the included period is billed
Cap daily ingestion:
az monitor app-insights component billing update \
--app myapi-insights \
--resource-group rg-monitoring \
--cap 5 # GB per day — prevents runaway ingestion costs
Cost reduction tactics:
| Tactic | Typical Saving | Notes |
|---|---|---|
| Adaptive sampling | 50–90% ingestion reduction | Auto-adjusts based on traffic |
| Exclude noisy telemetry | 20–40% | Filter health checks, static assets |
| Reduce retention | Up to 60% storage cost | Use archive tier for compliance data |
| Cap daily volume | Prevents cost spikes | Alert before cap is reached |
| Separate dev/prod workspaces | Avoid mixing high-volume dev logs | Dev uses short retention |
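For the "exclude noisy telemetry" tactic, the Node.js SDK lets you register a telemetry processor that returns `false` to drop an item. A minimal sketch of such a filter is shown below. The path patterns are hypothetical and should be adapted to your routes, and the envelope shape (`data.baseType`, `data.baseData.url`, `data.baseData.success`) is assumed to be the one used by the 2.x `applicationinsights` package.

```javascript
// Hypothetical patterns for health probes and static assets.
const NOISY_PATHS = [/^\/health(z)?$/i, /^\/favicon\.ico$/i, /\.(css|js|png|jpg|svg|woff2?)$/i];

// Returns false to drop the item and true to keep it.
function dropNoisyRequests(envelope) {
  const data = envelope && envelope.data;
  if (!data || data.baseType !== 'RequestData') return true; // only filter requests
  const baseData = data.baseData || {};
  if (baseData.success === false) return true; // always keep failures
  let path;
  try {
    // The base URL is a placeholder so that relative URLs can be parsed too.
    path = new URL(baseData.url || '', 'http://placeholder.invalid').pathname;
  } catch {
    return true; // unparseable URL: keep the item rather than lose data
  }
  return !NOISY_PATHS.some((pattern) => pattern.test(path));
}

// Registration with the SDK client would look like:
// appInsights.defaultClient.addTelemetryProcessor(dropNoisyRequests);
```

Keeping failed requests even on noisy paths is a design choice: a failing health probe is usually the telemetry you most want to see.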
Best Practices
Always correlate with operation_Id. Pass the correlation header through every service boundary; it makes distributed trace reconstruction possible.
Use connection strings, not instrumentation keys. Connection strings include the endpoint and support regional isolation.
Sample aggressively in high-volume services. Adaptive sampling at 5 items/sec typically cuts ingestion by 50–90% while keeping the items of each sampled operation together, so the traces you retain stay complete.
Alert on symptoms, not causes. Alert on error rate and latency visible to users — not internal CPU spikes that may never surface as user impact.
Use Workbooks for executive dashboards. Live KQL-powered Workbooks re-run their queries when opened and can be shared by link; viewers still need read access to the underlying data sources to see results.
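To illustrate why a connection string carries more than an instrumentation key, the sketch below splits one into its fields. It is a simplified parser for illustration only: the sample values are placeholders, not a real resource, and the SDKs parse the string for you, so this is mainly useful for validating configuration at startup.

```javascript
// Splits "Key=Value;Key=Value" pairs; key matching is case-insensitive.
function parseConnectionString(connectionString) {
  const fields = {};
  for (const part of String(connectionString).split(';')) {
    const idx = part.indexOf('=');
    if (idx <= 0) continue; // skip empty or malformed segments
    fields[part.slice(0, idx).trim().toLowerCase()] = part.slice(idx + 1).trim();
  }
  return {
    instrumentationKey: fields.instrumentationkey ?? null,
    ingestionEndpoint: fields.ingestionendpoint ?? null,
    liveEndpoint: fields.liveendpoint ?? null,
  };
}

// Placeholder values, not a real resource.
const parsed = parseConnectionString(
  'InstrumentationKey=00000000-0000-0000-0000-000000000000;' +
    'IngestionEndpoint=https://ingest.example.invalid/;' +
    'LiveEndpoint=https://live.example.invalid/'
);
console.log(parsed.ingestionEndpoint);
```

A missing `ingestionEndpoint` in the parsed result is a quick signal that a bare instrumentation key was configured where a full connection string was expected.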
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| No telemetry in portal | Wrong connection string | Verify APPLICATIONINSIGHTS_CONNECTION_STRING env var |
| Missing dependencies | SDK version too old | Update to latest Microsoft.ApplicationInsights.AspNetCore |
| Sampled telemetry gaps | Adaptive sampling too aggressive | Raise maxTelemetryItemsPerSecond |
| Log alert never fires | Alert query returns no rows in the evaluation window | Run the KQL in Logs and confirm it returns rows when the condition is met |
| High ingestion costs | Noisy health check / static asset logs | Add telemetry processors to exclude |
| Correlated trace missing spans | Service not instrumented or missing header propagation | Add SDK to all services; verify W3C trace context |
Key Takeaways
- ✅ Application Insights auto-captures requests, dependencies, and exceptions with zero code for most stacks
- ✅ All telemetry flows to Log Analytics — a single KQL query can span application + infrastructure
- ✅ Distributed tracing with operation_Id gives you the full call chain across microservices
- ✅ Adaptive sampling dramatically reduces ingestion cost with minimal loss of diagnostic value
- ✅ Alert on user-visible symptoms (error rate, latency) — not internal resource utilisation alone
Additional Resources
- Azure Monitor overview
- Application Insights documentation
- KQL quick reference
- Workbooks guide
- Sampling in Application Insights