
Azure Monitor and Application Insights: Complete Observability

Implement end-to-end observability with Azure Monitor — application telemetry, infrastructure metrics, KQL analytics, and intelligent alerting for production workloads.

You cannot fix what you cannot see. Azure Monitor and Application Insights give you a unified observability platform — application traces, infrastructure metrics, log analytics, and intelligent alerting — all queryable with KQL.


Observability Architecture

flowchart TB
    subgraph Apps["Instrumented Applications"]
        DOTNET[.NET / ASP.NET Core\nAuto + custom telemetry]
        NODE[Node.js\napplicationinsights SDK]
        JAVA[Java\nAI Java agent]
        FUNC[Azure Functions\nBuilt-in integration]
    end

    subgraph Infra["Infrastructure Sources"]
        VM[Virtual Machines\nAzure Monitor Agent]
        AKS[AKS\nContainer Insights]
        SQL[Azure SQL\nIntelligent Insights]
        COSM[Cosmos DB\nDiagnostic logs]
    end

    subgraph AI["Application Insights"]
        REQ[Requests]
        DEP[Dependencies]
        EXC[Exceptions]
        TRACE[Traces / Custom Events]
        METRIC[Custom Metrics]
    end

    subgraph LA["Log Analytics Workspace"]
        KQL_Q[KQL Queries]
        DASH[Dashboards & Workbooks]
        ALERT[Alert Rules]
    end

    Apps --> AI
    Infra --> LA
    AI --> LA
    LA --> KQL_Q
    LA --> DASH
    LA --> ALERT

    style Apps fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
    style Infra fill:#d1fae5,stroke:#059669,color:#065f46
    style AI fill:#ede9fe,stroke:#8b5cf6,color:#4c1d95
    style LA fill:#fef3c7,stroke:#f59e0b,color:#78350f

Three Pillars of Observability

Pillar | What It Captures | Azure Service | Key KQL Tables
Traces | Request flows, dependencies, correlation IDs | Application Insights | requests, dependencies
Metrics | Numeric time-series (CPU, RU, duration) | Azure Monitor Metrics | performanceCounters, customMetrics
Logs | Structured events, exceptions, custom events | Log Analytics | exceptions, traces, customEvents

Step 1: Instrument Your Application

.NET (ASP.NET Core)

dotnet add package Microsoft.ApplicationInsights.AspNetCore

// Program.cs
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
    options.EnableAdaptiveSampling = true;
    options.EnableQuickPulseMetricStream = true;
});

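Custom telemetry (controller):

using System.Globalization;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.AspNetCore.Mvc;

public class OrderController : ControllerBase
{
    private readonly TelemetryClient _telemetry;
    // Repository used below; the interface name IOrderRepository is assumed.
    private readonly IOrderRepository _repository;

    public OrderController(TelemetryClient telemetry, IOrderRepository repository)
    {
        _telemetry = telemetry;
        _repository = repository;
    }

    [HttpPost]
    public async Task<IActionResult> CreateOrder(Order order)
    {
        // The SDK already records the incoming HTTP request, so wrap only the
        // save in a custom dependency operation rather than a second request.
        using var operation = _telemetry.StartOperation<DependencyTelemetry>("SaveOrder");
        try
        {
            await _repository.SaveOrderAsync(order);

            // Track the business event only after the order has been saved.
            _telemetry.TrackEvent("OrderCreated", new Dictionary<string, string>
            {
                ["OrderId"] = order.Id,
                ["CustomerId"] = order.CustomerId,
                ["Amount"] = order.Total.ToString(CultureInfo.InvariantCulture)
            });
            _telemetry.TrackMetric("OrderValue", (double)order.Total);

            return Ok(order);
        }
        catch (Exception ex)
        {
            _telemetry.TrackException(ex);
            operation.Telemetry.Success = false;
            throw;
        }
    }
}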

Node.js

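const appInsights = require('applicationinsights');

// Start the SDK before loading other modules so auto-collection can hook into them.
appInsights.setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
  .setAutoDependencyCorrelation(true)
  .setAutoCollectRequests(true)
  .setAutoCollectPerformance(true)
  .start();

// Assumes an Express app; the route handler below uses its req/res API.
const express = require('express');
const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated

const client = appInsights.defaultClient;

app.post('/orders', async (req, res) => {
  try {
    await saveOrder(req.body); // saveOrder is the application's own persistence function
    // Record the business event only after the order has been saved.
    client.trackEvent({ name: 'OrderCreated', properties: { orderId: req.body.id } });
    client.trackMetric({ name: 'OrderValue', value: req.body.total });
    res.json({ success: true });
  } catch (error) {
    client.trackException({ exception: error });
    res.status(500).json({ error: error.message });
  }
});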

Step 2: KQL Queries for Insights

Failed Requests (Last 24 h)

requests
| where success == false
| where timestamp > ago(24h)
| summarize FailureCount = count() by operation_Name, resultCode
| order by FailureCount desc
| take 10

Slow Requests

requests
| where timestamp > ago(1h)
| where duration > 5000  // ms
| project timestamp, operation_Name, duration, url
| order by duration desc

Dependency Performance (p95 + Failure Rate)

dependencies
| where timestamp > ago(24h)
| summarize
    AvgDuration   = avg(duration),
    P95Duration   = percentile(duration, 95),
    FailureRate   = countif(success == false) * 100.0 / count()
  by target, type
| order by P95Duration desc
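
To make the aggregation concrete, the sketch below reproduces what the query computes over a handful of in-memory rows. The row data, `percentile`, and `summarizeDependencies` are hypothetical names for illustration; the nearest-rank percentile used here is exact, whereas KQL's `percentile()` returns an estimate, so results on real data may differ slightly.

```javascript
// Hypothetical rows shaped like the `dependencies` table (durations in ms).
const dependencyRows = [
  { target: 'orders-db', type: 'SQL', duration: 10, success: true },
  { target: 'orders-db', type: 'SQL', duration: 20, success: true },
  { target: 'orders-db', type: 'SQL', duration: 30, success: true },
  { target: 'orders-db', type: 'SQL', duration: 400, success: false },
  { target: 'payments-api', type: 'HTTP', duration: 50, success: true },
  { target: 'payments-api', type: 'HTTP', duration: 70, success: true },
];

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Group by (target, type), then compute the same three aggregates as the query.
function summarizeDependencies(rows) {
  const groups = new Map();
  for (const row of rows) {
    const key = `${row.target}|${row.type}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row);
  }
  return [...groups.values()]
    .map((items) => {
      const durations = items.map((r) => r.duration);
      const failures = items.filter((r) => r.success === false).length;
      return {
        target: items[0].target,
        type: items[0].type,
        avgDuration: durations.reduce((a, b) => a + b, 0) / durations.length,
        p95Duration: percentile(durations, 95),
        failureRate: (failures * 100) / items.length,
      };
    })
    .sort((a, b) => b.p95Duration - a.p95Duration); // order by P95Duration desc
}

const dependencySummary = summarizeDependencies(dependencyRows);
console.log(dependencySummary);
```

Note how a single slow, failed call dominates the p95 of a small group while barely moving the average, which is why the query sorts by p95 rather than by mean duration.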

Funnel Analysis

customEvents
| where timestamp > ago(30d)
| where name in ("ProductViewed", "AddedToCart", "CheckoutStarted", "OrderCompleted")
| summarize Users = dcount(user_Id) by name
| order by Users desc
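
The query returns one distinct-user count per event; turning those counts into conversion rates is a small post-processing step. The sketch below assumes hypothetical query output (`funnelRows`) and a hypothetical helper `funnelConversion`. It also inherits the query's limitation: the counts do not prove that the same users moved through the steps in order.

```javascript
// Funnel steps in the order a user is expected to pass through them.
const funnelSteps = ['ProductViewed', 'AddedToCart', 'CheckoutStarted', 'OrderCompleted'];

// Hypothetical result rows from the funnel query (illustrative numbers only).
const funnelRows = [
  { name: 'ProductViewed', Users: 1000 },
  { name: 'AddedToCart', Users: 400 },
  { name: 'CheckoutStarted', Users: 200 },
  { name: 'OrderCompleted', Users: 150 },
];

// For each step, compute conversion from the previous step and from the first step.
function funnelConversion(steps, rows) {
  const usersByName = new Map(rows.map((r) => [r.name, r.Users]));
  const usersAt = (name) => usersByName.get(name) ?? 0;
  const first = usersAt(steps[0]);
  return steps.map((name, i) => {
    const users = usersAt(name);
    const previous = i === 0 ? users : usersAt(steps[i - 1]);
    return {
      name,
      users,
      stepConversionPct: previous === 0 ? 0 : (users * 100) / previous,
      overallConversionPct: first === 0 ? 0 : (users * 100) / first,
    };
  });
}

const funnel = funnelConversion(funnelSteps, funnelRows);
console.log(funnel);
```

The step with the lowest `stepConversionPct` is usually the first place to look for a usability or performance problem.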

Anomaly Detection

requests
| where timestamp > ago(7d)
| make-series RequestCount = count() default = 0 on timestamp step 1h
| extend anomalies = series_decompose_anomalies(RequestCount, 1.5)
| mv-expand timestamp to typeof(datetime), RequestCount to typeof(long), anomalies to typeof(double)
| where anomalies != 0
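
The second argument to `series_decompose_anomalies` (1.5 above) is an outlier threshold in the style of Tukey's fences. The sketch below is a deliberately simplified illustration of that idea, not the function's actual algorithm: it skips the seasonal and trend decomposition and applies quartile-based fences directly to the raw series. `quantile`, `tukeyAnomalies`, and the sample counts are hypothetical.

```javascript
// Linear-interpolated quantile on an already sorted array (q between 0 and 1).
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (pos - lower);
}

// Flag each point as +1 (spike), -1 (dip), or 0 (normal), the same sign
// convention as the anomalies column in the query above.
function tukeyAnomalies(series, k = 1.5) {
  const sorted = [...series].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const lowFence = q1 - k * iqr;
  const highFence = q3 + k * iqr;
  return series.map((value) => (value > highFence ? 1 : value < lowFence ? -1 : 0));
}

// Hypothetical hourly request counts with one spike (350) and one dip (5).
const hourlyRequestCounts = [100, 104, 98, 101, 97, 350, 102, 99, 5, 103];
const anomalyFlags = tukeyAnomalies(hourlyRequestCounts);
console.log(anomalyFlags);
```

Raising `k` widens the fences and flags fewer points; the KQL function additionally removes daily or weekly seasonality first, which matters for traffic with a regular rhythm.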

Step 3: Distributed Tracing

sequenceDiagram
    participant Browser as Browser
    participant API as API Gateway\n(App Insights)
    participant Orders as Orders Service\n(App Insights)
    participant DB as Cosmos DB\n(dependency track)
    participant Queue as Service Bus\n(dependency track)

    Browser->>API: POST /checkout\noperation_Id: abc123
    API->>Orders: POST /orders\noperation_Id: abc123
    Orders->>DB: CreateItem\noperation_Id: abc123
    DB-->>Orders: 201 Created (12ms)
    Orders->>Queue: Send message\noperation_Id: abc123
    Queue-->>Orders: Sent (3ms)
    Orders-->>API: 201 Created (85ms)
    API-->>Browser: 200 OK (120ms)

    Note over Browser,Queue: All spans share operation_Id abc123<br/>Visible in Application Map + E2E Transaction view

Query the full trace by operation ID:

union requests, dependencies, exceptions
| where operation_Id == "abc123..."
| project timestamp, itemType, name, duration, success
| order by timestamp asc
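
The shared operation_Id travels between services in the W3C Trace Context `traceparent` header, whose trace-id field corresponds to operation_Id. The Application Insights SDKs handle this automatically; the sketch below only illustrates the header format for services that must propagate it by hand. `parseTraceparent` and `childTraceparent` are hypothetical helper names.

```javascript
const crypto = require('crypto');

// traceparent = version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse an incoming header; returns null when it is missing or malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT_PATTERN.exec(header || '');
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid according to the specification.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build the header for an outgoing call: keep the caller's trace-id so every
// span shares one operation, and generate a fresh span ID for this hop.
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  const traceId = parent ? parent.traceId : crypto.randomBytes(16).toString('hex');
  const spanId = crypto.randomBytes(8).toString('hex');
  const flags = parent && !parent.sampled ? '00' : '01';
  return `00-${traceId}-${spanId}-${flags}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
console.log(childTraceparent(incoming));
```

If a service drops or regenerates this header, its telemetry lands under a different operation_Id and disappears from the end-to-end transaction view.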

Step 4: Infrastructure Monitoring

VM CPU Metrics

az monitor metrics list \
  --resource /subscriptions/.../providers/Microsoft.Compute/virtualMachines/vm-web \
  --metric "Percentage CPU" \
  --start-time 2025-08-04T00:00:00Z \
  --end-time 2025-08-04T23:59:59Z \
  --interval PT1H

AKS Container Logs

// Uses the ContainerLogV2 schema (ContainerName, LogMessage)
ContainerLogV2
| where TimeGenerated > ago(1h)
| where ContainerName == "api-orders"
| where tostring(LogMessage) contains "error"
| project TimeGenerated, ContainerName, LogMessage

VM Memory (Insights Metrics)

InsightsMetrics
| where TimeGenerated > ago(1h)
| where Name == "AvailableMB"
| summarize AvgMemoryMB = avg(Val) by Computer
| order by AvgMemoryMB asc

Step 5: Alerting Strategies

flowchart LR
    subgraph Signals["Alert Signals"]
        M[Metric Alerts\nReal-time thresholds]
        L[Log Alerts\nKQL-based conditions]
        A[Activity Log\nResource changes]
        SM[Smart Detection\nAI anomalies]
    end

    subgraph Actions["Action Groups"]
        EMAIL[Email / SMS]
        WEBHOOK[Webhook / ITSM]
        RUNBOOK[Automation Runbook]
        FUNC[Azure Function\nauto-remediation]
    end

    Signals --> Actions

    style Signals fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
    style Actions fill:#d1fae5,stroke:#059669,color:#065f46

Metric alert — high CPU:

az monitor metrics alert create \
  --name "High CPU Alert" \
  --resource-group rg-monitoring \
  --scopes /subscriptions/.../providers/Microsoft.Compute/virtualMachines/vm-web \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action /subscriptions/.../actionGroups/ops-team

Log alert — error rate > 5%:

az monitor scheduled-query create \
  --name "High Error Rate" \
  --resource-group rg-monitoring \
  --scopes /subscriptions/.../components/myapi-insights \
  --condition "avg ErrorRate from 'ErrorRateQuery' > 5" \
  --condition-query ErrorRateQuery="requests | summarize ErrorRate = 100.0 * countif(success == false) / count()" \
  --window-size 5m \
  --evaluation-frequency 5m \
  --severity 2 \
  --action-groups /subscriptions/.../actionGroups/ops-team

Recommended alert thresholds:

Signal | Warning | Critical
Request error rate | > 1% | > 5%
Response time p95 | > 2 s | > 5 s
CPU utilisation | > 70% | > 90%
Memory available | < 500 MB | < 200 MB
Availability test | < 99% | < 95%
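
One way to keep these thresholds consistent across dashboards and alert rules is to hold them in a single structure. The sketch below copies the values from the table; the signal keys and the `classify` helper are hypothetical names. The `direction` field records that memory and availability breach when they fall below the limit, while the other signals breach when they rise above it.

```javascript
// Thresholds copied from the table above; `direction` says which way is bad.
const alertThresholds = {
  errorRatePct: { warning: 1, critical: 5, direction: 'above' },
  p95ResponseSeconds: { warning: 2, critical: 5, direction: 'above' },
  cpuPct: { warning: 70, critical: 90, direction: 'above' },
  memoryAvailableMb: { warning: 500, critical: 200, direction: 'below' },
  availabilityPct: { warning: 99, critical: 95, direction: 'below' },
};

// Return 'critical', 'warning', or 'ok' for a measured value.
function classify(signal, value) {
  const threshold = alertThresholds[signal];
  if (!threshold) throw new Error(`Unknown signal: ${signal}`);
  const breached = (limit) =>
    threshold.direction === 'above' ? value > limit : value < limit;
  if (breached(threshold.critical)) return 'critical';
  if (breached(threshold.warning)) return 'warning';
  return 'ok';
}

console.log(classify('errorRatePct', 3));        // between the 1% and 5% limits
console.log(classify('memoryAvailableMb', 150)); // below the 200 MB limit
```

Checking the critical limit first matters: any value that breaches the critical limit also breaches the warning limit.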

Step 6: Cost Optimisation

Adaptive sampling (reduces ingestion volume):

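// When configuring sampling this way, set options.EnableAdaptiveSampling = false
// in AddApplicationInsightsTelemetry; otherwise two sampling processors end up
// in the chain.
builder.Services.Configure<TelemetryConfiguration>(config =>
{
    config.DefaultTelemetrySink.TelemetryProcessorChainBuilder
        .UseAdaptiveSampling(maxTelemetryItemsPerSecond: 5)
        .Build();
});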
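
Sampling keeps traces intact because the keep-or-drop decision is derived from the operation ID, so every item in one operation shares the same fate. The sketch below illustrates that principle with a fixed sampling percentage; it is not the SDK's actual hashing or rate-adaptation logic, and `samplingScore`, `sample`, and `estimateTotal` are hypothetical names. Kept items carry an `itemCount` weight, which is why KQL counts over sampled data should use `sum(itemCount)` rather than `count()`.

```javascript
const crypto = require('crypto');

// Map an operation ID to a stable score in [0, 100).
function samplingScore(operationId) {
  const digest = crypto.createHash('sha256').update(operationId).digest();
  return (digest.readUInt32BE(0) / 0x100000000) * 100;
}

// Keep the item when its operation's score falls under the sampling percentage.
// Each kept item stands in for (100 / percentage) original items.
function sample(item, samplingPercentage) {
  if (samplingScore(item.operationId) >= samplingPercentage) return null;
  return { ...item, itemCount: 100 / samplingPercentage };
}

// Estimate the original item count from the kept items.
function estimateTotal(keptItems) {
  return keptItems.reduce((total, item) => total + item.itemCount, 0);
}

// Hypothetical telemetry: 1000 operations, one request item each.
const telemetry = Array.from({ length: 1000 }, (_, i) => ({
  name: 'POST /orders',
  operationId: `op-${i}`,
}));
const kept = telemetry.map((item) => sample(item, 25)).filter(Boolean);
console.log(`kept ${kept.length} items, estimated total ${estimateTotal(kept)}`);
```

Adaptive sampling adjusts the percentage at runtime to hit a target such as 5 items per second, but the per-operation consistency and the `itemCount` weighting work the same way.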

Set data retention period:

az monitor log-analytics workspace update \
  --resource-group rg-monitoring \
  --workspace-name logs-workspace \
  --retention-time 90     # days; workspace default is 30, interactive maximum is 730; retention beyond the included period is billed

Cap daily ingestion:

az monitor app-insights component billing update \
  --app myapi-insights \
  --resource-group rg-monitoring \
  --cap 5    # GB per day; prevents runaway ingestion costs

Cost reduction tactics:

Tactic | Typical Saving | Notes
Adaptive sampling | 50–90% ingestion reduction | Auto-adjusts based on traffic
Exclude noisy telemetry | 20–40% | Filter health checks, static assets
Reduce retention | Up to 60% storage cost | Use archive tier for compliance data
Cap daily volume | Prevents cost spikes | Alert before cap is reached
Separate dev/prod workspaces | Avoid mixing high-volume dev logs | Dev uses short retention
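
The "exclude noisy telemetry" tactic can be sketched as a filter function for the Node.js SDK. The paths in `NOISY_PATHS` are hypothetical and should be replaced with your own health-check and static-asset routes. The envelope shape (`data.baseType`, `data.baseData`) and the `addTelemetryProcessor` registration shown in the comment are assumptions based on the 2.x `applicationinsights` package; verify them against the SDK version you run.

```javascript
// Hypothetical noisy routes: health probes and static assets.
const NOISY_PATHS = [/^\/health(z)?$/, /^\/favicon\.ico$/, /\.(css|js|png|jpg|svg|woff2?)$/];

// Return false to drop a telemetry item, true to send it.
function keepTelemetry(envelope) {
  const data = envelope && envelope.data;
  // Only filter request telemetry; pass everything else through.
  if (!data || data.baseType !== 'RequestData') return true;
  const request = data.baseData || {};
  if (!request.url) return true;
  // Always keep failed requests, even on noisy routes.
  if (request.success === false) return true;
  let path;
  try {
    // The base URL only lets relative URLs parse; it is never contacted.
    path = new URL(request.url, 'http://placeholder.invalid').pathname;
  } catch {
    return true; // unparseable URL: keep it rather than guess
  }
  return !NOISY_PATHS.some((pattern) => pattern.test(path));
}

// Registration (assumed 2.x SDK API), after appInsights.setup(...).start():
//   appInsights.defaultClient.addTelemetryProcessor(keepTelemetry);

const healthProbe = {
  data: { baseType: 'RequestData', baseData: { url: 'http://localhost/health', success: true } },
};
console.log(keepTelemetry(healthProbe)); // false: successful health probes are dropped
```

Keeping failed requests on otherwise noisy routes preserves the one signal from a health endpoint that is worth paying for.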

Best Practices

Always correlate with operation_Id. Pass the correlation header through every service boundary — it makes distributed trace reconstruction possible.

Use connection strings, not instrumentation keys. Connection strings include the endpoint and support regional isolation.

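Sample aggressively in high-volume services. Adaptive sampling at 5 items/sec typically cuts ingestion by 50–90% and keeps the items of a sampled operation together, so little diagnostic value is lost. Use sum(itemCount) instead of count() in KQL so that counts stay accurate over sampled data.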

Alert on symptoms, not causes. Alert on error rate and latency visible to users — not internal CPU spikes that may never surface as user impact.

Use Workbooks for executive dashboards. Live KQL-powered Workbooks update automatically and can be shared with a link; viewers still need read access to the resources and workspaces the workbook queries.


Troubleshooting

Symptom | Cause | Fix
No telemetry in portal | Wrong connection string | Verify APPLICATIONINSIGHTS_CONNECTION_STRING env var
Missing dependencies | SDK version too old | Update to latest Microsoft.ApplicationInsights.AspNetCore
Sampled telemetry gaps | Adaptive sampling too aggressive | Raise maxTelemetryItemsPerSecond
Alerts firing with no data | Log alert has no matching results | Check KQL returns rows when condition is met
High ingestion costs | Noisy health check / static asset logs | Add telemetry processors to exclude
Correlated trace missing spans | Service not instrumented or missing header propagation | Add SDK to all services; verify W3C trace context

Key Takeaways

  • ✅ Application Insights auto-collects requests, dependencies, and exceptions with little or no code on supported stacks
  • ✅ All telemetry flows to Log Analytics — a single KQL query can span application + infrastructure
  • ✅ Distributed tracing with operation_Id gives you the full call chain across microservices
  • ✅ Adaptive sampling dramatically reduces ingestion cost with minimal loss of diagnostic value
  • ✅ Alert on user-visible symptoms (error rate, latency) — not internal resource utilisation alone
