Azure Monitor and Application Insights: Complete Observability

You cannot fix what you cannot see. Azure Monitor and Application Insights give you a unified observability platform — application traces, infrastructure metrics, log analytics, and intelligent alerting — all queryable with KQL.

Observability Architecture

flowchart TB
    subgraph Apps["Instrumented Applications"]
        DOTNET[.NET / ASP.NET Core\nAuto + custom telemetry]
        NODE[Node.js\napplicationinsights SDK]
        JAVA[Java\nAI Java agent]
        FUNC[Azure Functions\nBuilt-in integration]
    end

    subgraph Infra["Infrastructure Sources"]
        VM[Virtual Machines\nAzure Monitor Agent]
        AKS[AKS\nContainer Insights]
        SQL[Azure SQL\nIntelligent Insights]
        COSM[Cosmos DB\nDiagnostic logs]
    end

    subgraph AI["Application Insights"]
        REQ[Requests]
        DEP[Dependencies]
        EXC[Exceptions]
        TRACE[Traces / Custom Events]
        METRIC[Custom Metrics]
    end

    subgraph LA["Log Analytics Workspace"]
        KQL_Q[KQL Queries]
        DASH[Dashboards & Workbooks]
        ALERT[Alert Rules]
    end

    Apps --> AI
    Infra --> LA
    AI --> LA
    LA --> KQL_Q
    LA --> DASH
    LA --> ALERT

    style Apps fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
    style Infra fill:#d1fae5,stroke:#059669,color:#065f46
    style AI fill:#ede9fe,stroke:#8b5cf6,color:#4c1d95
    style LA fill:#fef3c7,stroke:#f59e0b,color:#78350f

Three Pillars of Observability

Pillar	What It Captures	Azure Service	Key KQL Table
Traces	Request flows, dependencies, correlation IDs	Application Insights	`requests`, `dependencies`
Metrics	Numeric time-series (CPU, RU, duration)	Azure Monitor Metrics	`performanceCounters`, `customMetrics`
Logs	Structured events, exceptions, custom events	Log Analytics	`exceptions`, `traces`, `customEvents`

Step 1: Instrument Your Application

.NET (ASP.NET Core)

dotnet add package Microsoft.ApplicationInsights.AspNetCore

// Program.cs
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
    options.EnableAdaptiveSampling = true;
    options.EnableQuickPulseMetricStream = true;
});

Custom telemetry (Controller):

using System.Globalization;
using Microsoft.ApplicationInsights;
using Microsoft.AspNetCore.Mvc;

public class OrderController : ControllerBase
{
    private readonly TelemetryClient _telemetry;
    // IOrderRepository stands in for your own persistence abstraction.
    private readonly IOrderRepository _repository;

    public OrderController(TelemetryClient telemetry, IOrderRepository repository)
    {
        _telemetry = telemetry;
        _repository = repository;
    }

    [HttpPost]
    public async Task<IActionResult> CreateOrder(Order order)
    {
        // The SDK already records this HTTP request, so no manual RequestTelemetry is started here.
        try
        {
            await _repository.SaveOrderAsync(order);

            // Emit business telemetry only after the order has actually been saved.
            _telemetry.TrackEvent("OrderCreated", new Dictionary<string, string>
            {
                ["OrderId"] = order.Id,
                ["CustomerId"] = order.CustomerId,
                ["Amount"] = order.Total.ToString(CultureInfo.InvariantCulture)
            });
            _telemetry.TrackMetric("OrderValue", (double)order.Total);

            return Ok(order);
        }
        catch (Exception ex)
        {
            _telemetry.TrackException(ex);
            throw;
        }
    }
}

Node.js

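// Load and start the SDK before other modules so auto-collection can hook into them.
const appInsights = require('applicationinsights');
appInsights.setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
  .setAutoDependencyCorrelation(true)
  .setAutoCollectRequests(true)
  .setAutoCollectPerformance(true)
  .start();

const express = require('express');

const client = appInsights.defaultClient;
const app = express();
app.use(express.json()); // populates req.body for JSON payloads

app.post('/orders', async (req, res) => {
  try {
    await saveOrder(req.body); // saveOrder is your own persistence function
    // Emit business telemetry only after the order has actually been saved.
    client.trackEvent({ name: 'OrderCreated', properties: { orderId: req.body.id } });
    client.trackMetric({ name: 'OrderValue', value: req.body.total });
    res.json({ success: true });
  } catch (error) {
    client.trackException({ exception: error });
    res.status(500).json({ error: error.message });
  }
});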

Step 2: KQL Queries for Insights

Failed Requests (Last 24 h)

requests
| where success == false
| where timestamp > ago(24h)
| summarize FailureCount = count() by operation_Name, resultCode
| order by FailureCount desc
| take 10

Slow Requests

requests
| where timestamp > ago(1h)
| where duration > 5000  // ms
| project timestamp, operation_Name, duration, url
| order by duration desc

Dependency Performance (p95 + Failure Rate)

dependencies
| where timestamp > ago(24h)
| summarize
    AvgDuration   = avg(duration),
    P95Duration   = percentile(duration, 95),
    FailureRate   = countif(success == false) * 100.0 / count()
  by target, type
| order by P95Duration desc
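
To make explicit what the query above computes for each `target`, here is a minimal JavaScript sketch over an in-memory array. The record shape (`duration`, `success`) mirrors the `dependencies` columns, and the function names are illustrative. It uses a nearest-rank percentile as a simplification; KQL's `percentile()` is an approximation, so results may differ slightly.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarise the calls to one dependency target, as the summarize clause does per group.
function summarizeDependency(calls) {
  const durations = calls.map((c) => c.duration);
  const failures = calls.filter((c) => c.success === false).length;
  return {
    avgDuration: durations.reduce((sum, d) => sum + d, 0) / calls.length,
    p95Duration: percentile(durations, 95),
    failureRate: (failures * 100.0) / calls.length,
  };
}
```

A high p95 alongside a modest average usually points at a slow tail rather than a uniformly slow dependency, which is why the query sorts by p95.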

Funnel Analysis

customEvents
| where timestamp > ago(30d)
| where name in ("ProductViewed", "AddedToCart", "CheckoutStarted", "OrderCompleted")
| summarize Users = dcount(user_Id) by name
| order by Users desc
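
The query returns distinct users per event, but not the drop-off between steps. A minimal sketch of that second step, assuming the query rows have been collected into an object keyed by event name (the function and field names here are illustrative). Note that per-step distinct counts do not prove the same users progressed through every step, and `dcount()` is itself an estimate.

```javascript
// Funnel steps in the order users are expected to move through them.
const FUNNEL_STEPS = ['ProductViewed', 'AddedToCart', 'CheckoutStarted', 'OrderCompleted'];

// usersByStep: { eventName: distinctUserCount }, e.g. built from the query result rows.
function funnelConversion(usersByStep, steps = FUNNEL_STEPS) {
  const firstStepUsers = usersByStep[steps[0]] ?? 0;
  return steps.map((step, i) => {
    const users = usersByStep[step] ?? 0;
    const previous = i === 0 ? users : (usersByStep[steps[i - 1]] ?? 0);
    return {
      step,
      users,
      // Share of users from the previous step who reached this one.
      stepConversionPct: previous > 0 ? (users * 100) / previous : 0,
      // Share of users from the first step who reached this one.
      overallConversionPct: firstStepUsers > 0 ? (users * 100) / firstStepUsers : 0,
    };
  });
}
```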

Anomaly Detection

requests
| where timestamp > ago(7d)
| make-series RequestCount = count() default = 0 on timestamp step 1h
| extend anomalies = series_decompose_anomalies(RequestCount, 1.5)
| mv-expand timestamp to typeof(datetime), RequestCount to typeof(long), anomalies to typeof(double)
| where anomalies != 0

Step 3: Distributed Tracing

sequenceDiagram
    participant Browser as Browser
    participant API as API Gateway\n(App Insights)
    participant Orders as Orders Service\n(App Insights)
    participant DB as Cosmos DB\n(dependency track)
    participant Queue as Service Bus\n(dependency track)

    Browser->>API: POST /checkout\noperation_Id: abc123
    API->>Orders: POST /orders\noperation_Id: abc123
    Orders->>DB: CreateItem\noperation_Id: abc123
    DB-->>Orders: 201 Created (12ms)
    Orders->>Queue: Send message\noperation_Id: abc123
    Queue-->>Orders: Sent (3ms)
    Orders-->>API: 201 Created (85ms)
    API-->>Browser: 200 OK (120ms)

    Note over Browser,Queue: All spans share operation_Id abc123<br/>Visible in Application Map + E2E Transaction view

Query the full trace by operation ID:

union requests, dependencies, exceptions
| where operation_Id == "abc123..."
| project timestamp, itemType, name, duration, success
| order by timestamp asc
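
When W3C Trace Context is in use, the shared `operation_Id` corresponds to the trace ID carried in the `traceparent` HTTP header. The SDKs propagate this header automatically; the sketch below only illustrates the mechanism, which can help when debugging a service that drops the header. The function names are hypothetical.

```javascript
const { randomBytes } = require('crypto');

// traceparent layout (W3C Trace Context): version-traceId-parentId-flags, lowercase hex.
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed fields, or null when the header is missing or malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT_PATTERN.exec(header ?? '');
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, flags };
}

// Builds the header for an outgoing call: same trace ID, fresh span ID.
// Without a valid incoming header, a new trace is started.
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  const traceId = parent ? parent.traceId : randomBytes(16).toString('hex');
  const spanId = randomBytes(8).toString('hex');
  const flags = parent ? parent.flags : '01';
  return `00-${traceId}-${spanId}-${flags}`;
}
```

A service that starts a new trace instead of reusing the incoming trace ID is the typical reason a span is missing from the query above.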

Step 4: Infrastructure Monitoring

VM CPU Metrics

az monitor metrics list \
  --resource /subscriptions/.../providers/Microsoft.Compute/virtualMachines/vm-web \
  --metric "Percentage CPU" \
  --start-time 2025-08-04T00:00:00Z \
  --end-time 2025-08-04T23:59:59Z \
  --interval PT1H

AKS Container Logs

ContainerLogV2
| where TimeGenerated > ago(1h)
| where ContainerName == "api-orders"
| where tostring(LogMessage) contains "error"
| project TimeGenerated, ContainerName, LogMessage

VM Memory (Insights Metrics)

InsightsMetrics
| where TimeGenerated > ago(1h)
| where Name == "AvailableMB"
| summarize AvgMemoryMB = avg(Val) by Computer
| order by AvgMemoryMB asc

Step 5: Alerting Strategies

flowchart LR
    subgraph Signals["Alert Signals"]
        M[Metric Alerts\nReal-time thresholds]
        L[Log Alerts\nKQL-based conditions]
        A[Activity Log\nResource changes]
        SM[Smart Detection\nAI anomalies]
    end

    subgraph Actions["Action Groups"]
        EMAIL[Email / SMS]
        WEBHOOK[Webhook / ITSM]
        RUNBOOK[Automation Runbook]
        FUNC[Azure Function\nauto-remediation]
    end

    Signals --> Actions

    style Signals fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
    style Actions fill:#d1fae5,stroke:#059669,color:#065f46

Metric alert — high CPU:

az monitor metrics alert create \
  --name "High CPU Alert" \
  --resource-group rg-monitoring \
  --scopes /subscriptions/.../providers/Microsoft.Compute/virtualMachines/vm-web \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action /subscriptions/.../actionGroups/ops-team

Log alert — error rate > 5%:

az monitor scheduled-query create \
  --name "High Error Rate" \
  --resource-group rg-monitoring \
  --scopes /subscriptions/.../components/myapi-insights \
  --condition "count 'ErrorRateQuery' > 0" \
  --condition-query ErrorRateQuery="requests | summarize ErrorRate = 100.0 * countif(success == false) / count() | where ErrorRate > 5" \
  --window-size 5m \
  --evaluation-frequency 5m \
  --severity 2 \
  --action-groups /subscriptions/.../actionGroups/ops-team

Recommended alert thresholds:

Signal	Warning	Critical
Request error rate	> 1%	> 5%
Response time p95	> 2 s	> 5 s
CPU utilisation	> 70%	> 90%
Memory available	< 500 MB	< 200 MB
Availability test	< 99%	< 95%
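
The table can be encoded directly, for example to classify values in a dashboard or a pre-alert check. This is a minimal sketch; the signal keys are invented for illustration, and the `direction` field records whether higher or lower values are worse.

```javascript
// Thresholds copied from the table above; direction says which way is worse.
const THRESHOLDS = {
  errorRatePct:      { warning: 1,    critical: 5,    direction: 'above' },
  responseTimeP95Ms: { warning: 2000, critical: 5000, direction: 'above' },
  cpuPct:            { warning: 70,   critical: 90,   direction: 'above' },
  memoryAvailableMb: { warning: 500,  critical: 200,  direction: 'below' },
  availabilityPct:   { warning: 99,   critical: 95,   direction: 'below' },
};

// Returns 'critical', 'warning', or 'ok' for a single measurement.
function classify(signal, value) {
  const t = THRESHOLDS[signal];
  if (!t) throw new Error(`Unknown signal: ${signal}`);
  const breaches = (limit) => (t.direction === 'above' ? value > limit : value < limit);
  if (breaches(t.critical)) return 'critical';
  if (breaches(t.warning)) return 'warning';
  return 'ok';
}
```

Checking the critical limit first matters: a value that breaches the critical threshold also breaches the warning one.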

Step 6: Cost Optimisation

Adaptive sampling (reduces ingestion volume):

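builder.Services.Configure<TelemetryConfiguration>(config =>
{
    // Custom sampling settings: set options.EnableAdaptiveSampling = false in
    // AddApplicationInsightsTelemetry so the default sampler is not applied as well.
    config.DefaultTelemetrySink.TelemetryProcessorChainBuilder
        .UseAdaptiveSampling(maxTelemetryItemsPerSecond: 5)
        .Build();
});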
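
Sampling keeps a subset of items and stamps each retained item with an `itemCount` so totals can still be estimated, which is why queries over sampled data should prefer `sum(itemCount)` over `count()`. The sketch below shows the idea with fixed-rate sampling keyed on the operation ID. It is a simplification: adaptive sampling adjusts the rate continuously, and the hash function here is illustrative, not the one the SDK uses.

```javascript
// Deterministic hash of the operation ID mapped to [0, 100).
// Illustrative only; the SDK uses its own hashing scheme.
function samplingScore(operationId) {
  let hash = 5381;
  for (const ch of operationId) {
    hash = ((hash * 33) ^ ch.charCodeAt(0)) >>> 0; // keep an unsigned 32-bit integer
  }
  return (hash / 4294967296) * 100;
}

// Keep an item when its operation's score falls under the sampling percentage.
// All items of one operation share a score, so a trace is kept or dropped as a whole.
function sample(items, samplingPercentage) {
  return items
    .filter((item) => samplingScore(item.operationId) < samplingPercentage)
    .map((item) => ({ ...item, itemCount: 100 / samplingPercentage }));
}

// Estimate the original number of items from the retained ones.
function estimateTotal(sampledItems) {
  return sampledItems.reduce((sum, item) => sum + item.itemCount, 0);
}
```

At a 25% rate each retained item counts for four, so the estimate converges on the true total as volume grows, while low-volume operations can be over- or under-represented.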

Set data retention period:

az monitor log-analytics workspace update \
  --resource-group rg-monitoring \
  --workspace-name logs-workspace \
  --retention-time 90     # days; default 30, max 730 (extra cost after 90)

Cap daily ingestion:

az monitor app-insights component billing update \
  --app myapi-insights \
  --resource-group rg-monitoring \
  --cap 5    # GB per day — prevents runaway ingestion costs

Cost reduction tactics:

Tactic	Typical Saving	Notes
Adaptive sampling	50–90% ingestion reduction	Auto-adjusts based on traffic
Exclude noisy telemetry	20–40%	Filter health checks, static assets
Reduce retention	Up to 60% storage cost	Use archive tier for compliance data
Cap daily volume	Prevents cost spikes	Alert before cap is reached
Separate dev/prod workspaces	Avoid mixing high-volume dev logs	Dev uses short retention
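
For the "exclude noisy telemetry" tactic in a Node.js service, a telemetry processor can drop items before they are sent. The sketch below is a plain predicate that returns `false` to drop an item. The path prefixes are hypothetical, and the envelope shape (`data.baseType`, `data.baseData.url`) is an assumption based on the 2.x `applicationinsights` SDK, so verify it against the version you run.

```javascript
// Hypothetical examples of noisy endpoints; adjust to your own routes.
const NOISY_PATH_PREFIXES = ['/health', '/favicon.ico', '/static/'];

// Telemetry processor: return false to drop the item, true to send it.
function dropNoisyRequests(envelope) {
  const data = envelope && envelope.data;
  if (!data || data.baseType !== 'RequestData') return true; // only filter requests
  const baseData = data.baseData || {};
  if (baseData.success === false) return true; // never drop failed requests
  let path;
  try {
    path = new URL(baseData.url, 'http://localhost').pathname;
  } catch {
    return true; // unparseable URL: keep the item
  }
  return !NOISY_PATH_PREFIXES.some((prefix) => path.startsWith(prefix));
}

// Registration with the Node.js SDK would look like:
//   appInsights.defaultClient.addTelemetryProcessor(dropNoisyRequests);
```

Keeping failed requests even on noisy paths means a broken health probe still shows up in the failure queries.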

Best Practices

Always correlate with operation_Id. Pass the correlation header through every service boundary — it makes distributed trace reconstruction possible.

Use connection strings, not instrumentation keys. Connection strings include the endpoint and support regional isolation.
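
A connection string is a semicolon-separated list of `Key=Value` pairs such as `InstrumentationKey` and `IngestionEndpoint`. A minimal sketch of a startup check that parses one and fails fast when the key is missing; the function name is illustrative, and lower-casing the keys is a convenience chosen for this sketch.

```javascript
// Splits "Key1=Value1;Key2=Value2" into an object with lower-cased keys.
function parseConnectionString(connectionString) {
  const fields = {};
  for (const part of connectionString.split(';')) {
    const separator = part.indexOf('=');
    if (separator === -1) continue; // skip empty or malformed segments
    const key = part.slice(0, separator).trim().toLowerCase();
    fields[key] = part.slice(separator + 1).trim();
  }
  if (!fields.instrumentationkey) {
    throw new Error('Connection string has no InstrumentationKey field');
  }
  return fields;
}
```

Running a check like this at startup turns a silent "no telemetry in portal" problem into an immediate, visible failure.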

Sample aggressively in high-volume services. Adaptive sampling at 5 items/sec keeps the items of a sampled operation together, so most diagnostic value is preserved, while ingestion typically drops by 50–90% depending on traffic volume.

Alert on symptoms, not causes. Alert on error rate and latency visible to users — not internal CPU spikes that may never surface as user impact.

Use Workbooks for executive dashboards. Live KQL-powered Workbooks update automatically and can be shared with other users, who need read access to the underlying workspace to see the data.

Troubleshooting

Symptom	Cause	Fix
No telemetry in portal	Wrong connection string	Verify `APPLICATIONINSIGHTS_CONNECTION_STRING` env var
Missing dependencies	SDK version too old	Update to latest `Microsoft.ApplicationInsights.AspNetCore`
Sampled telemetry gaps	Adaptive sampling too aggressive	Raise `maxTelemetryItemsPerSecond`
Log alert never fires	Query returns no rows in the evaluation window	Run the KQL manually and confirm it returns rows when the condition is met
High ingestion costs	Noisy health check / static asset logs	Add telemetry processors to exclude
Correlated trace missing spans	Service not instrumented or missing header propagation	Add SDK to all services; verify W3C trace context

Key Takeaways

✅ Application Insights auto-captures requests, dependencies, and exceptions with zero code for most stacks
✅ All telemetry flows to Log Analytics — a single KQL query can span application + infrastructure
✅ Distributed tracing with operation_Id gives you the full call chain across microservices
✅ Adaptive sampling dramatically reduces ingestion cost with minimal loss of diagnostic value
✅ Alert on user-visible symptoms (error rate, latency) — not internal resource utilisation alone

Discussion

What KQL queries have saved you in production incidents? Share your observability patterns below.
