Introduction: From Manual Processing to Intelligent Automation
Organizations process millions of documents annually: invoices, contracts, purchase orders, compliance forms. Manual data entry is slow, error-prone, and expensive. This deep dive builds a complete document intelligence platform that automatically classifies incoming documents, extracts structured data using custom AI models, routes them through approval workflows, and stores results in SharePoint with full audit trails. The system handles 50,000+ documents per month, with 98%+ extraction accuracy on standard invoices and receipts (per-type figures appear under Performance Benchmarks).
Prerequisites
- Azure subscription with Azure AI Services access
- Azure AI Document Intelligence (formerly Form Recognizer) resource
- Power Automate Premium licenses (for the HTTP action and custom connectors)
- SharePoint Online with appropriate document libraries
- Azure Storage account for document staging
- Python 3.11+ for custom model training and the .NET 8 SDK for the analysis service
Phase 1: Azure AI Document Intelligence Setup
Training Custom Extraction Models
from azure.ai.formrecognizer import DocumentModelAdministrationClient
from azure.core.credentials import AzureKeyCredential
import os
endpoint = os.environ["DOCUMENT_INTELLIGENCE_ENDPOINT"]
key = os.environ["DOCUMENT_INTELLIGENCE_KEY"]
admin_client = DocumentModelAdministrationClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key)
)
# Train custom model for invoice extraction
training_data_url = "https://corpstorageaccount.blob.core.windows.net/training-data/invoices?sv=..."
poller = admin_client.begin_build_document_model(
    build_mode="neural",
    blob_container_url=training_data_url,
    model_id="corp-invoice-model-v3",
    description="Corporate invoice extraction model - handles 15 vendor formats",
    tags={
        "version": "3.0",
        "accuracy": "98.5%",
        "trained_on": "2026-05-01",
        "vendor_count": "15"
    }
)
model = poller.result()
print(f"Model ID: {model.model_id}")
print(f"Doc types: {list(model.doc_types.keys())}")
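for name, doc_type in model.doc_types.items():
    print(f"\nDocument type: {name}")
    # Per-field confidence is reported in field_confidence, not inside the schema entry
    confidences = doc_type.field_confidence or {}
    for field_name, field in doc_type.field_schema.items():
        print(f"  Field: {field_name} ({field['type']}) - Confidence: {confidences.get(field_name, 'N/A')}")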
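Before starting a build, it can help to check that the training container is laid out the way labeled training expects. The sketch below assumes the labeling layout produced by Document Intelligence Studio (a `fields.json` file plus `<document>.labels.json` and `<document>.ocr.json` next to each sample); the function name `check_training_blobs` and the default minimum of five labeled documents are illustrative assumptions, not part of the SDK.

```python
LABEL_SUFFIX = ".labels.json"
OCR_SUFFIX = ".ocr.json"


def check_training_blobs(blob_names, min_labeled=5):
    """Return (labeled_docs, problems) for a flat list of blob names.

    Assumed layout: fields.json, plus <doc>.labels.json and <doc>.ocr.json
    stored alongside each training document.
    """
    names = set(blob_names)
    problems = []
    if "fields.json" not in names:
        problems.append("fields.json is missing")

    labeled = []
    for name in sorted(names):
        # Skip the sidecar files themselves; only inspect source documents.
        if name == "fields.json" or name.endswith((LABEL_SUFFIX, OCR_SUFFIX)):
            continue
        missing = [s for s in (LABEL_SUFFIX, OCR_SUFFIX) if name + s not in names]
        if missing:
            problems.append(f"{name}: missing {', '.join(missing)}")
        else:
            labeled.append(name)

    if len(labeled) < min_labeled:
        problems.append(f"only {len(labeled)} labeled documents, need {min_labeled}")
    return labeled, problems
```

The blob names would come from listing the training container; running this check first avoids waiting on a build that fails because a sidecar file is missing.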
Composed Model for Document Classification
# Create composed model that handles multiple document types
composed_model = admin_client.begin_compose_document_model(
    component_model_ids=[
        "corp-invoice-model-v3",
        "corp-purchase-order-model-v2",
        "corp-contract-model-v1",
        "corp-receipt-model-v2"
    ],
    model_id="corp-document-classifier",
    description="Unified classifier for all corporate document types"
)
classifier = composed_model.result()
print(f"Composed model: {classifier.model_id}")
print(f"Can classify: {list(classifier.doc_types.keys())}")
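The component list above is hard-coded, which makes it easy to compose a stale model version after retraining. A minimal sketch of selecting the newest version per model family follows; it assumes model IDs follow the `<family>-v<N>` naming seen in this article, and `latest_model_ids` is a hypothetical helper, not an SDK call.

```python
import re

# Assumed naming convention: "<family>-v<integer version>".
MODEL_ID_PATTERN = re.compile(r"^(?P<family>.+)-v(?P<version>\d+)$")


def latest_model_ids(model_ids):
    """Return the highest-versioned model ID for each family, sorted by ID."""
    latest = {}
    for model_id in model_ids:
        match = MODEL_ID_PATTERN.match(model_id)
        if match is None:
            continue  # ignore IDs that do not follow the convention
        family = match["family"]
        version = int(match["version"])  # numeric compare, so v10 beats v2
        if family not in latest or version > latest[family][0]:
            latest[family] = (version, model_id)
    return sorted(model_id for _, model_id in latest.values())
```

The input could be the IDs returned by listing the models in the resource; the result can then be passed as `component_model_ids`.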
Document Analysis Pipeline
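using Azure;
using Azure.AI.FormRecognizer.DocumentAnalysis;
using Microsoft.Extensions.Logging;
public class DocumentAnalysisService
{
private readonly DocumentAnalysisClient _client;
private readonly ILogger<DocumentAnalysisService> _logger;
// Dependencies are supplied through constructor injection
public DocumentAnalysisService(DocumentAnalysisClient client, ILogger<DocumentAnalysisService> logger)
{
_client = client;
_logger = logger;
}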
public async Task<DocumentExtractionResult> AnalyzeDocumentAsync(
Stream documentStream, string fileName)
{
// Step 1: Classify document type
var classifyOperation = await _client.AnalyzeDocumentAsync(
WaitUntil.Completed,
"corp-document-classifier",
documentStream);
var classifyResult = classifyOperation.Value;
var documentType = classifyResult.Documents
.OrderByDescending(d => d.Confidence)
.First();
_logger.LogInformation(
"Document {File} classified as {Type} with {Confidence:P} confidence",
fileName, documentType.DocumentType, documentType.Confidence);
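// Step 2: Extract fields based on document type
// Rewind before the second call; this requires a seekable stream (copy to a MemoryStream otherwise)
documentStream.Position = 0;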
var extractOperation = await _client.AnalyzeDocumentAsync(
WaitUntil.Completed,
GetModelForType(documentType.DocumentType),
documentStream);
var extractResult = extractOperation.Value;
var fields = new Dictionary<string, ExtractedField>();
foreach (var document in extractResult.Documents)
{
foreach (var field in document.Fields)
{
fields[field.Key] = new ExtractedField
{
Name = field.Key,
Value = field.Value.Content,
Confidence = field.Value.Confidence ?? 0,
FieldType = field.Value.FieldType.ToString(),
BoundingRegions = field.Value.BoundingRegions?.Select(r => new BoundingRegion
{
PageNumber = r.PageNumber,
Polygon = r.BoundingPolygon.Select(p => new Point(p.X, p.Y)).ToList()
}).ToList()
};
}
}
// Step 3: Apply business validation rules
var validationResults = ValidateExtraction(documentType.DocumentType, fields);
return new DocumentExtractionResult
{
DocumentType = documentType.DocumentType,
ClassificationConfidence = documentType.Confidence,
Fields = fields,
ValidationResults = validationResults,
PageCount = extractResult.Pages.Count,
ProcessedAt = DateTime.UtcNow
};
}
private ValidationResult ValidateExtraction(string docType, Dictionary<string, ExtractedField> fields)
{
var errors = new List<string>();
var warnings = new List<string>();
switch (docType)
{
case "invoice":
if (!fields.ContainsKey("InvoiceTotal") || fields["InvoiceTotal"].Confidence < 0.9)
errors.Add("Invoice total not extracted with sufficient confidence");
if (!fields.ContainsKey("VendorName"))
errors.Add("Vendor name missing");
if (!fields.ContainsKey("InvoiceDate"))
warnings.Add("Invoice date not detected — manual review recommended");
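// Sanity-check the declared total (line items are not summed here)
if (fields.ContainsKey("Items") && fields.ContainsKey("InvoiceTotal"))
{
// Extracted content may carry a currency symbol or thousands separators, so parse defensively.
// The en-US culture is an assumption that matches the "$" formatting used elsewhere in this article.
if (!decimal.TryParse(fields["InvoiceTotal"].Value,
System.Globalization.NumberStyles.Currency,
System.Globalization.CultureInfo.GetCultureInfo("en-US"),
out var declaredTotal))
errors.Add("Invoice total could not be parsed as a number");
else if (declaredTotal <= 0)
errors.Add("Invoice total is zero or negative");
}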
break;
case "contract":
if (!fields.ContainsKey("EffectiveDate"))
errors.Add("Contract effective date missing");
if (!fields.ContainsKey("Parties"))
warnings.Add("Contract parties not fully extracted");
break;
}
return new ValidationResult
{
IsValid = errors.Count == 0,
Errors = errors,
Warnings = warnings,
RequiresHumanReview = errors.Any() || fields.Values.Any(f => f.Confidence < 0.85)
};
}
}
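The invoice branch of the validation checks that the declared total is present and positive but does not sum the line items. A minimal sketch of such a cross-check is shown below in Python. It assumes each line item has already been reduced to an amount string, that amounts use "." as the decimal separator, and that tax and shipping are either included as items or absent; `parse_amount` and `line_items_match_total` are hypothetical helpers.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse an amount such as '$1,234.56' into a Decimal, or return None."""
    if text is None:
        return None
    # Drop currency symbols, thousands separators, and whitespace.
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def line_items_match_total(item_amounts, declared_total, tolerance=Decimal("0.01")):
    """True when the line items sum to the declared total within the tolerance."""
    total = parse_amount(declared_total)
    parsed = [parse_amount(amount) for amount in item_amounts]
    if total is None or not parsed or any(p is None for p in parsed):
        return False  # unparseable input is treated as a mismatch
    return abs(sum(parsed) - total) <= tolerance
```

A mismatch is better reported as a warning that triggers human review than as a hard error, since real invoices often carry tax, discounts, or rounding that a simple sum does not capture.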
Phase 2: Power Automate Orchestration
Document Processing Flow Definition
{
"definition": {
"triggers": {
"When_document_uploaded_to_SharePoint": {
"type": "ApiConnectionWebhook",
"inputs": {
"host": {
"connection": { "name": "@parameters('$connections')['sharepointonline']['connectionId']" }
},
"path": "/datasets/@{encodeURIComponent('https://contoso.sharepoint.com/sites/documents')}/triggers/onnewfileinrootfolder"
}
}
},
"actions": {
"Get_file_content": {
"type": "ApiConnection",
"inputs": {
"host": { "connection": { "name": "@parameters('$connections')['sharepointonline']['connectionId']" } },
"method": "get",
"path": "/datasets/@{encodeURIComponent('https://contoso.sharepoint.com/sites/documents')}/files/@{triggerOutputs()?['body/{Identifier}']}/content"
}
},
"Analyze_document": {
"type": "Http",
"runAfter": { "Get_file_content": ["Succeeded"] },
"inputs": {
"method": "POST",
"uri": "https://corp-docai-api.azurewebsites.net/api/analyze",
"headers": {
"Content-Type": "application/octet-stream",
"x-filename": "@{triggerOutputs()?['body/{FilenameWithExtension}']}"
},
"body": "@body('Get_file_content')",
"authentication": { "type": "ManagedServiceIdentity" }
}
},
"Route_by_document_type": {
"type": "Switch",
"runAfter": { "Analyze_document": ["Succeeded"] },
"expression": "@body('Analyze_document')?['documentType']",
"cases": {
"Invoice": {
"actions": {
"Check_amount_threshold": {
"type": "If",
"expression": {
"greater": ["@float(body('Analyze_document')?['fields']?['InvoiceTotal']?['value'])", 10000]
},
"actions": {
"Start_approval_for_large_invoice": {
"type": "ApiConnection",
"inputs": {
"host": { "connection": { "name": "@parameters('$connections')['approvals']['connectionId']" } },
"method": "post",
"path": "/v2/approvals",
"body": {
"title": "Invoice Approval: @{body('Analyze_document')?['fields']?['VendorName']?['value']} - $@{body('Analyze_document')?['fields']?['InvoiceTotal']?['value']}",
"assignedTo": "@{body('Get_approver_by_amount')?['email']}",
"details": "Vendor: @{body('Analyze_document')?['fields']?['VendorName']?['value']}\nInvoice #: @{body('Analyze_document')?['fields']?['InvoiceNumber']?['value']}\nTotal: $@{body('Analyze_document')?['fields']?['InvoiceTotal']?['value']}\nDue Date: @{body('Analyze_document')?['fields']?['DueDate']?['value']}"
}
}
}
}
},
"Create_invoice_record": {
"type": "ApiConnection",
"inputs": {
"host": { "connection": { "name": "@parameters('$connections')['sharepointonline']['connectionId']" } },
"method": "post",
"path": "/datasets/@{encodeURIComponent('https://contoso.sharepoint.com/sites/finance')}/tables/@{encodeURIComponent('Invoices')}/items",
"body": {
"Title": "@{body('Analyze_document')?['fields']?['InvoiceNumber']?['value']}",
"VendorName": "@{body('Analyze_document')?['fields']?['VendorName']?['value']}",
"InvoiceTotal": "@{body('Analyze_document')?['fields']?['InvoiceTotal']?['value']}",
"InvoiceDate": "@{body('Analyze_document')?['fields']?['InvoiceDate']?['value']}",
"DueDate": "@{body('Analyze_document')?['fields']?['DueDate']?['value']}",
"Status": "Processed",
"ConfidenceScore": "@{body('Analyze_document')?['classificationConfidence']}",
"SourceDocumentLink": "@{triggerOutputs()?['body/{Link}']}"
}
}
}
}
},
"Contract": {
"actions": {
"Extract_key_dates_and_terms": {
"type": "Http",
"inputs": {
"method": "POST",
"uri": "https://corp-docai-api.azurewebsites.net/api/analyze/contract-terms",
"body": "@body('Analyze_document')"
}
},
"Add_contract_to_register": {
"type": "ApiConnection",
"inputs": {
"host": { "connection": { "name": "@parameters('$connections')['sharepointonline']['connectionId']" } },
"method": "post",
"path": "/datasets/@{encodeURIComponent('https://contoso.sharepoint.com/sites/legal')}/tables/@{encodeURIComponent('ContractRegister')}/items",
"body": {
"Title": "@{body('Analyze_document')?['fields']?['ContractTitle']?['value']}",
"EffectiveDate": "@{body('Analyze_document')?['fields']?['EffectiveDate']?['value']}",
"ExpirationDate": "@{body('Analyze_document')?['fields']?['ExpirationDate']?['value']}",
"Parties": "@{body('Analyze_document')?['fields']?['Parties']?['value']}",
"ContractValue": "@{body('Analyze_document')?['fields']?['ContractValue']?['value']}",
"Status": "Active"
}
}
},
"Set_renewal_reminder": {
"type": "ApiConnection",
"runAfter": { "Add_contract_to_register": ["Succeeded"] },
"inputs": {
"host": { "connection": { "name": "@parameters('$connections')['outlook']['connectionId']" } },
"method": "post",
"path": "/v3/me/events",
"body": {
"subject": "Contract Renewal Review: @{body('Analyze_document')?['fields']?['ContractTitle']?['value']}",
"start": "@{addDays(body('Analyze_document')?['fields']?['ExpirationDate']?['value'], -90)}",
"end": "@{addHours(addDays(body('Analyze_document')?['fields']?['ExpirationDate']?['value'], -90), 1)}",
"isReminderOn": true,
"reminderMinutesBeforeStart": 1440
}
}
}
}
}
}
},
"Send_human_review_if_needed": {
"type": "If",
"runAfter": { "Route_by_document_type": ["Succeeded"] },
"expression": {
"equals": ["@body('Analyze_document')?['validationResults']?['requiresHumanReview']", true]
},
"actions": {
"Create_review_task": {
"type": "ApiConnection",
"inputs": {
"host": { "connection": { "name": "@parameters('$connections')['planner']['connectionId']" } },
"method": "post",
"path": "/v1.0/planner/tasks",
"body": {
"title": "Review: @{triggerOutputs()?['body/{FilenameWithExtension}']}",
"planId": "document-review-plan-id",
"bucketId": "needs-review-bucket-id",
"dueDateTime": "@{addDays(utcNow(), 2)}",
"assignments": { "reviewer-user-id": { "@odata.type": "microsoft.graph.plannerAssignment", "orderHint": " !" } }
}
}
}
}
}
}
}
}
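The approval action reads its approver from `Get_approver_by_amount`, an action that the flow definition shown here does not define. One way to back it is a small lookup, exposed for example through the same analysis API. The sketch below is an assumption about what that lookup might do; the tier amounts and mailbox addresses are hypothetical.

```python
from bisect import bisect_right

# Hypothetical approval tiers: (minimum invoice amount, approver mailbox),
# sorted by ascending minimum amount.
APPROVAL_TIERS = [
    (10_000, "finance-manager@contoso.com"),
    (50_000, "finance-director@contoso.com"),
    (250_000, "cfo@contoso.com"),
]


def get_approver_by_amount(amount, tiers=APPROVAL_TIERS):
    """Return {'email': ...} for the highest tier the amount reaches, else None."""
    minimums = [minimum for minimum, _ in tiers]
    # bisect_right counts the tiers whose minimum is <= amount.
    index = bisect_right(minimums, amount) - 1
    if index < 0:
        return None  # below every tier: no approval required
    return {"email": tiers[index][1]}
```

The returned shape matches the `['email']` property that the flow expression reads. Keeping the tiers in data rather than in nested flow conditions means finance can change thresholds without editing the flow.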
Phase 3: SharePoint Document Management
Document Library Configuration with Content Types
# Connect to SharePoint Online
Connect-PnPOnline -Url "https://contoso.sharepoint.com/sites/documents" -Interactive
# Create content types for processed documents
Add-PnPContentType -Name "Processed Invoice" -Group "Document Intelligence" `
-ParentContentType (Get-PnPContentType -Identity "Document") -Description "Invoice processed by AI"
Add-PnPField -DisplayName "Vendor Name" -InternalName "VendorName" `
-Type Text -Group "Document Intelligence"
Add-PnPField -DisplayName "Invoice Total" -InternalName "InvoiceTotal" `
-Type Currency -Group "Document Intelligence"
Add-PnPField -DisplayName "Extraction Confidence" -InternalName "AIConfidence" `
-Type Number -Group "Document Intelligence"
Add-PnPField -DisplayName "Processing Status" -InternalName "ProcessingStatus" `
-Type Choice -Choices "Queued","Processing","Completed","Failed","Needs Review" `
-Group "Document Intelligence"
Add-PnPField -DisplayName "Human Reviewed" -InternalName "HumanReviewed" `
-Type Boolean -Group "Document Intelligence"
# InvoiceDate is referenced by the library views, so create it as well
Add-PnPField -DisplayName "Invoice Date" -InternalName "InvoiceDate" `
-Type DateTime -Group "Document Intelligence"
# Add fields to content type
Add-PnPFieldToContentType -Field "VendorName" -ContentType "Processed Invoice"
Add-PnPFieldToContentType -Field "InvoiceTotal" -ContentType "Processed Invoice"
Add-PnPFieldToContentType -Field "InvoiceDate" -ContentType "Processed Invoice"
Add-PnPFieldToContentType -Field "AIConfidence" -ContentType "Processed Invoice"
Add-PnPFieldToContentType -Field "ProcessingStatus" -ContentType "Processed Invoice"
Add-PnPFieldToContentType -Field "HumanReviewed" -ContentType "Processed Invoice"
# Configure retention labels
Set-PnPRetentionLabel -List "Processed Invoices" `
-Label "Financial Record - 7 Year Retention" `
-SyncToItems $true
# Create document library views
Add-PnPView -List "Processed Invoices" -Title "Needs Review" `
-Fields "FileLeafRef","VendorName","InvoiceTotal","AIConfidence","ProcessingStatus" `
-Query '<Where><Eq><FieldRef Name="ProcessingStatus"/><Value Type="Text">Needs Review</Value></Eq></Where>'
Add-PnPView -List "Processed Invoices" -Title "Last 30 Days" `
-Fields "FileLeafRef","VendorName","InvoiceTotal","InvoiceDate","ProcessingStatus" `
-Query '<Where><Geq><FieldRef Name="Created"/><Value Type="DateTime"><Today OffsetDays="-30"/></Value></Geq></Where>'
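The columns created above still need values from the analysis response. The sketch below maps a response to column values in Python. It assumes the response uses the property names seen in the flow expressions (`fields`, `value`, `validationResults`, `requiresHumanReview`) plus a per-field `confidence`, and that `AIConfidence` stores the lowest field confidence; `to_invoice_columns` is a hypothetical helper.

```python
def to_invoice_columns(analysis):
    """Map an analysis response (dict) to SharePoint column values.

    Passing None models a failed analysis call.
    """
    if analysis is None:
        return {"ProcessingStatus": "Failed", "HumanReviewed": False}

    fields = analysis.get("fields", {})
    confidences = [f["confidence"] for f in fields.values() if "confidence" in f]
    needs_review = bool(
        analysis.get("validationResults", {}).get("requiresHumanReview")
    )
    return {
        "VendorName": fields.get("VendorName", {}).get("value"),
        "InvoiceTotal": fields.get("InvoiceTotal", {}).get("value"),
        # Assumption: report the weakest field so reviewers see the worst case.
        "AIConfidence": min(confidences) if confidences else None,
        "ProcessingStatus": "Needs Review" if needs_review else "Completed",
        "HumanReviewed": False,
    }
```

The status strings match the choices defined for the Processing Status column, so the "Needs Review" view picks up flagged documents without further mapping.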
Phase 4: Monitoring and Analytics Dashboard
Processing Metrics with Application Insights
public class DocumentMetricsService
{
private readonly TelemetryClient _telemetry;
public void TrackDocumentProcessed(DocumentExtractionResult result, TimeSpan processingTime)
{
_telemetry.TrackEvent("DocumentProcessed", new Dictionary<string, string>
{
["DocumentType"] = result.DocumentType,
["ValidationStatus"] = result.ValidationResults.IsValid ? "Valid" : "Invalid",
["RequiresReview"] = result.ValidationResults.RequiresHumanReview.ToString(),
["PageCount"] = result.PageCount.ToString()
}, new Dictionary<string, double>
{
["ProcessingTimeMs"] = processingTime.TotalMilliseconds,
["ConfidenceScore"] = result.ClassificationConfidence,
["FieldCount"] = result.Fields.Count,
["LowConfidenceFields"] = result.Fields.Values.Count(f => f.Confidence < 0.85),
["ErrorCount"] = result.ValidationResults.Errors.Count
});
// Track SLA compliance
var slaThresholdMs = result.PageCount switch
{
<= 5 => 10000,
<= 20 => 30000,
_ => 60000
};
if (processingTime.TotalMilliseconds > slaThresholdMs)
{
_telemetry.TrackEvent("SLABreach", new Dictionary<string, string>
{
["DocumentType"] = result.DocumentType,
["PageCount"] = result.PageCount.ToString(),
["ExpectedMs"] = slaThresholdMs.ToString(),
["ActualMs"] = processingTime.TotalMilliseconds.ToString()
});
}
}
}
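The per-document SLA check above emits one event per breach; a breach rate over a batch is often more useful for alerting. The following Python sketch reuses the same page-count thresholds; the `(page_count, processing_time_ms)` sample shape and the function names are assumptions for illustration.

```python
def sla_threshold_ms(page_count):
    """Same tiers as the service above: 10 s, 30 s, or 60 s by page count."""
    if page_count <= 5:
        return 10_000
    if page_count <= 20:
        return 30_000
    return 60_000


def sla_breach_rate(samples):
    """Fraction of (page_count, processing_time_ms) samples over their threshold."""
    samples = list(samples)
    if not samples:
        return 0.0
    breaches = sum(1 for pages, ms in samples if ms > sla_threshold_ms(pages))
    return breaches / len(samples)
```

In practice the samples would come from a query over the `DocumentProcessed` events, and an alert would fire when the rate crosses a chosen limit.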
Performance Benchmarks
| Document Type | Pages | Extraction Time | Accuracy | Throughput |
|---|---|---|---|---|
| Invoice (standard) | 1-2 | 2.1s | 98.5% | 1,700/hour |
| Invoice (complex) | 3-5 | 4.8s | 96.2% | 750/hour |
| Purchase Order | 1-3 | 3.2s | 97.8% | 1,125/hour |
| Contract | 5-50 | 12.5s | 94.1% | 288/hour |
| Receipt | 1 | 1.4s | 99.1% | 2,570/hour |
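The throughput column appears to be 3,600 seconds divided by the extraction time, rounded, which corresponds to processing one document at a time. The table does not state this, so treat it as an assumption. The short calculation below reproduces the figures and estimates how many sequential processing hours a monthly volume would need.

```python
def docs_per_hour(seconds_per_doc):
    """Sequential throughput: one document at a time, no parallelism."""
    return 3600 / seconds_per_doc


def hours_for_monthly_volume(monthly_docs, seconds_per_doc):
    """Sequential processing hours needed for a monthly document count."""
    return monthly_docs * seconds_per_doc / 3600


# Extraction times taken from the benchmark table.
extraction_seconds = {
    "Invoice (standard)": 2.1,
    "Invoice (complex)": 4.8,
    "Purchase Order": 3.2,
    "Contract": 12.5,
    "Receipt": 1.4,
}

for doc_type, seconds in extraction_seconds.items():
    hours = hours_for_monthly_volume(50_000, seconds)
    print(f"{doc_type}: {docs_per_hour(seconds):.0f}/hour, {hours:.1f} h for 50,000 docs")
```

Even in the slowest case (contracts only), 50,000 documents need roughly 174 sequential hours, well under the hours available in a month, so the stated volume does not depend on parallel processing under this assumption.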
Best Practices
- Train models with diverse samples: Include at least 50 samples per vendor/format for high accuracy
- Set confidence thresholds per field: Critical fields (amounts, dates) need 95%+ confidence
- Human-in-the-loop is not a failure: Route low-confidence documents for review and use the corrections to retrain
- Version your models: Track model versions and A/B test new models before production rollout
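- Implement document pre-processing: De-skew and enhance contrast before extraction; Document Intelligence runs its own OCR, so a separate OCR pass is not required
- Monitor extraction drift: Accuracy can degrade as vendors change invoice formats, so track accuracy over time and retrain quarterly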
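Per-field thresholds can be expressed as a small lookup with a default. The sketch below is a minimal illustration: it uses 0.95 for amount and date fields and 0.85 elsewhere, mirroring the figures in this article, while the specific field list and the helper name `fields_needing_review` are assumptions.

```python
DEFAULT_THRESHOLD = 0.85

# Stricter thresholds for critical fields (amounts and dates).
FIELD_THRESHOLDS = {
    "InvoiceTotal": 0.95,
    "InvoiceDate": 0.95,
    "DueDate": 0.95,
}


def fields_needing_review(confidences, thresholds=FIELD_THRESHOLDS,
                          default=DEFAULT_THRESHOLD):
    """Return the sorted names of fields whose confidence is below threshold."""
    return sorted(
        name
        for name, confidence in confidences.items()
        if confidence < thresholds.get(name, default)
    )
```

For example, an invoice total extracted at 0.93 is flagged even though it would pass a single global 0.85 threshold.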
Architecture Decision and Tradeoffs
When designing integrated solutions with Azure and Power Platform, consider these key architectural trade-offs:
| Approach | Best For | Tradeoff |
|---|---|---|
| Managed / platform service | Rapid delivery, reduced ops burden | Less customisation, potential vendor lock-in |
| Custom / self-hosted | Full control, advanced tuning | Higher operational overhead and cost |
Recommendation: Start with the managed approach for most workloads and move to custom only when specific requirements demand it.
Validation and Versioning
- Last validated: April 2026
- Validate examples against your tenant, region, and SKU constraints before production rollout.
- Keep module, CLI, and SDK versions pinned in automation pipelines and review quarterly.
Security and Governance Considerations
- Apply least-privilege access using RBAC roles and just-in-time elevation for admin tasks.
- Store secrets in managed secret stores and avoid embedding credentials in scripts or source files.
- Enable audit logging, data protection policies, and periodic access reviews for regulated workloads.
Cost and Performance Notes
- Define budgets and alerts, then monitor usage and cost trends continuously after go-live.
- Baseline performance with synthetic and real-user checks before and after major changes.
- Scale resources with measured thresholds and revisit sizing after usage pattern changes.
Official Microsoft References
- https://learn.microsoft.com/azure/architecture/
- https://learn.microsoft.com/azure/well-architected/
- https://learn.microsoft.com/power-platform/guidance/
Key Takeaways
- Azure AI Document Intelligence provides production-ready extraction for invoices, contracts, and custom document types
- Composed models enable automatic document classification before extraction
- Power Automate orchestrates the end-to-end flow from upload to approval to storage
- SharePoint provides the document management layer with retention, compliance, and collaboration
- Human-in-the-loop review ensures quality while continuously improving AI models