Monitoring and Alerting for Salesforce

Overview

Monitoring and alerting patterns enable proactive detection of issues, performance problems, and system health degradation. This guide covers Platform Events monitoring, API health monitoring, async job failure detection, and log aggregation patterns.

Core Principle: Monitor system health proactively, alert on issues before they impact users, and aggregate logs for centralized analysis. Visibility enables rapid issue detection and resolution.

Prerequisites

Required Knowledge: Apex (triggers, SOQL, asynchronous Apex), Platform Events, custom objects, and Salesforce reports and dashboards.

Recommended Reading: Salesforce documentation on Platform Events, asynchronous Apex (Batch, Queueable, Scheduled), and Transaction Finalizers.

When to Use Monitoring and Alerting

Use Monitoring and Alerting When

- The org runs integrations, Platform Events, or async jobs whose failures would otherwise go unnoticed
- Issues must be detected and triaged before users report them
- You need dashboards and historical metrics for system health and performance

Avoid Monitoring and Alerting When

- Standard tooling (Apex exception emails, debug logs, Event Monitoring) already covers the need
- The storage and maintenance cost of custom metric objects outweighs the risk in a simple, low-volume org

Platform Events Monitoring

Pattern 1: Event Publication Monitoring

Purpose: Monitor Platform Event publication success and failure rates.

Implementation:

Example:

public class EventPublicationLogger {
    public static void logEventPublication(String eventType, Boolean success, String errorMessage) {
        Event_Publication_Log__c log = new Event_Publication_Log__c(
            Event_Type__c = eventType,
            Success__c = success,
            Error_Message__c = errorMessage,
            Timestamp__c = Datetime.now()
        );
        // Partial-success insert so a logging failure never breaks the caller
        Database.insert(log, false);
    }
}
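
A caller can feed this logger from the Database.SaveResult returned by EventBus.publish. A minimal sketch, assuming a hypothetical Order_Event__e Platform Event:

Database.SaveResult sr = EventBus.publish(new Order_Event__e(Order_Id__c = 'ORD-001'));
if (sr.isSuccess()) {
    EventPublicationLogger.logEventPublication('Order_Event__e', true, null);
} else {
    String errors = '';
    for (Database.Error err : sr.getErrors()) {
        errors += err.getMessage() + '; ';
    }
    EventPublicationLogger.logEventPublication('Order_Event__e', false, errors);
}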

Best Practices:

- Log both successes and failures so rates can be compared over time
- Alert on a rising failure rate, not on single failures
- Summarize publication metrics on a dashboard for trend visibility

Pattern 2: Event Processing Monitoring

Purpose: Monitor event processing by subscribers.

Implementation:

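One sketch: log each batch a subscriber trigger receives, and set a resume checkpoint so a retry does not reprocess already-logged events. Order_Event__e and Event_Processing_Log__c are hypothetical names:

trigger OrderEventSubscriber on Order_Event__e (after insert) {
    List<Event_Processing_Log__c> logs = new List<Event_Processing_Log__c>();
    for (Order_Event__e evt : Trigger.new) {
        logs.add(new Event_Processing_Log__c(
            Event_Type__c = 'Order_Event__e',
            Replay_Id__c = evt.ReplayId,
            Processed_At__c = Datetime.now()
        ));
        // On retry, the platform resumes after this event
        EventBus.TriggerContext.currentContext().setResumeCheckpoint(evt.ReplayId);
    }
    insert logs;
}
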
Best Practices:

- Record the ReplayId of processed events so gaps and duplicates can be detected
- Set resume checkpoints so retries do not reprocess logged events
- Alert when processing errors or lag grow

API Health Monitoring

Pattern 1: Callout Health Monitoring

Purpose: Monitor API callout health and performance.

Implementation:

Example:

public class CalloutHealthMonitor {
    public static void trackCallout(String integrationName, String endpoint, 
                                    Integer statusCode, Long duration) {
        Callout_Metric__c metric = new Callout_Metric__c(
            Integration_Name__c = integrationName,
            Endpoint__c = endpoint,
            Status_Code__c = statusCode,
            Duration_ms__c = duration,
            Timestamp__c = Datetime.now()
        );
        insert metric;
    }
    
    public static void checkHealth(String integrationName) {
        Integer failureCount = [
            SELECT COUNT() 
            FROM Callout_Metric__c 
            WHERE Integration_Name__c = :integrationName
            AND Status_Code__c >= 500
            AND Timestamp__c >= :Datetime.now().addHours(-1)
        ];
        
        if (failureCount > 10) {
            sendAlert('High failure rate: ' + integrationName);
        }
    }

    // Placeholder: route the alert through your notification channel
    // (for example, a Platform Event or Messaging.SingleEmailMessage)
    private static void sendAlert(String message) {
        System.debug(LoggingLevel.ERROR, message);
    }
}

Best Practices:

- Track duration, status code, and failure rate per integration
- Pair health checks with a circuit breaker so failing endpoints are not hammered
- Alert on sustained high failure rates rather than single errors

Pattern 2: API Rate Limit Monitoring

Purpose: Monitor API rate limit usage and approaching limits.

Implementation:

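A sketch using the System.OrgLimits class to check daily API request consumption. The 80% threshold is an assumption to tune for your org:

public class ApiLimitMonitor {
    public static void checkApiUsage() {
        System.OrgLimit apiLimit = System.OrgLimits.getMap().get('DailyApiRequests');
        Decimal used = apiLimit.getValue();
        Decimal maxAllowed = apiLimit.getLimit();
        Decimal pct = (used / maxAllowed) * 100;
        if (pct >= 80) {
            System.debug(LoggingLevel.WARN, 'API usage at ' + pct.setScale(1) + '%');
            // Route to your notification channel here
        }
    }
}
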
Best Practices:

- Check consumption on a schedule, not only on demand
- Alert well before the limit is reached (for example, at 80%) to leave time to react
- Trend usage over time to spot growth before it becomes an incident

Async Job Failure Detection

Pattern 1: Batch Job Monitoring

Purpose: Monitor Batch Apex job failures and performance.

Implementation:

Example:

public class BatchJobMonitor {
    public static void monitorBatchJobs() {
        List<AsyncApexJob> failedJobs = [
            SELECT Id, ApexClass.Name, Status, NumberOfErrors, 
                   JobItemsProcessed, TotalJobItems
            FROM AsyncApexJob
            WHERE JobType = 'BatchApex'
            AND Status = 'Failed'
            AND CreatedDate >= :Datetime.now().addHours(-24)
        ];
        
        if (!failedJobs.isEmpty()) {
            sendAlert('Failed batch jobs detected: ' + failedJobs.size());
        }
    }

    // Placeholder: route the alert to email, Chatter, or a Platform Event
    private static void sendAlert(String message) {
        System.debug(LoggingLevel.ERROR, message);
    }
}

Best Practices:

- Also review jobs that completed with NumberOfErrors > 0, not only failed jobs
- Run the monitor on a schedule so failures surface within hours
- Include class name and error counts in alerts for faster triage

Pattern 2: Queueable Job Monitoring

Purpose: Monitor Queueable Apex job failures.

Implementation:

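A sketch using a Transaction Finalizer, which runs even when the Queueable fails with an unhandled exception and so can log what the job itself cannot. Async_Job_Log__c is a hypothetical custom object:

public class MonitoredJob implements Queueable {
    public void execute(QueueableContext ctx) {
        System.attachFinalizer(new JobFailureLogger());
        // ... job work ...
    }
}

public class JobFailureLogger implements Finalizer {
    public void execute(FinalizerContext ctx) {
        if (ctx.getResult() == ParentJobResult.UNHANDLED_EXCEPTION) {
            insert new Async_Job_Log__c(
                Job_Id__c = ctx.getAsyncApexJobId(),
                Error_Message__c = ctx.getException().getMessage(),
                Timestamp__c = Datetime.now()
            );
        }
    }
}
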
Best Practices:

- Use Transaction Finalizers to capture unhandled exceptions, since a failed Queueable cannot log in its own rolled-back transaction
- Log the job Id so failures can be correlated with AsyncApexJob records

Pattern 3: Scheduled Job Monitoring

Purpose: Monitor Scheduled Apex job execution and failures.

Implementation:

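A sketch that queries CronTrigger for scheduled Apex jobs in an unhealthy state. Which states are worth flagging is an assumption to adjust for your org's jobs:

public class ScheduledJobMonitor {
    public static void checkScheduledJobs() {
        List<CronTrigger> unhealthy = [
            SELECT Id, CronJobDetail.Name, State, PreviousFireTime, NextFireTime
            FROM CronTrigger
            WHERE CronJobDetail.JobType = '7'   // Scheduled Apex
            AND State IN ('ERROR', 'BLOCKED')
        ];
        for (CronTrigger ct : unhealthy) {
            System.debug(LoggingLevel.ERROR,
                'Scheduled job unhealthy: ' + ct.CronJobDetail.Name + ' (' + ct.State + ')');
        }
    }
}
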
Best Practices:

- Compare PreviousFireTime and NextFireTime to detect jobs that have silently stopped firing
- Alert on jobs stuck in an error or blocked state

Log Aggregation Patterns

Pattern 1: Centralized Logging

Purpose: Aggregate logs from multiple sources for centralized analysis.

Implementation:

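A minimal sketch of a central logger. Publishing through a Platform Event decouples logging from the caller's transaction, so log entries survive a rollback; Log_Event__e is a hypothetical event whose subscriber writes entries to a custom log object:

public class Logger {
    public static void log(String source, String level, String message) {
        // Event-based logging survives caller rollback; a subscriber
        // trigger persists these to the central log object
        EventBus.publish(new Log_Event__e(
            Source__c = source,
            Level__c = level,
            Message__c = message
        ));
    }
}
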
Best Practices:

- Collect logs from all components into one store with a consistent schema
- Include source, level, timestamp, and correlation context on every entry
- Purge or archive old entries to control storage

Pattern 2: Error Log Aggregation

Purpose: Aggregate error logs for analysis and alerting.

Implementation:

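A sketch that groups the last 24 hours of errors by type with aggregate SOQL. Error_Log__c and its fields are hypothetical names:

public class ErrorLogAnalyzer {
    public static Map<String, Integer> errorCountsByType() {
        Map<String, Integer> counts = new Map<String, Integer>();
        for (AggregateResult ar : [
            SELECT Error_Type__c errType, COUNT(Id) cnt
            FROM Error_Log__c
            WHERE CreatedDate >= :Datetime.now().addHours(-24)
            GROUP BY Error_Type__c
        ]) {
            counts.put((String) ar.get('errType'), (Integer) ar.get('cnt'));
        }
        return counts;
    }
}
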
Best Practices:

- Group errors by type to surface recurring patterns
- Alert on spikes in error volume, not on individual entries

Pattern 3: Performance Log Aggregation

Purpose: Aggregate performance logs for analysis.

Implementation:

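A sketch that surfaces slow operations by aggregating a week of performance logs. Performance_Log__c and the 2-second threshold are assumptions:

public class PerformanceAnalyzer {
    public static void reportSlowOperations() {
        for (AggregateResult ar : [
            SELECT Operation__c op, AVG(Duration_ms__c) avgMs, MAX(Duration_ms__c) maxMs
            FROM Performance_Log__c
            WHERE CreatedDate = LAST_N_DAYS:7
            GROUP BY Operation__c
            HAVING AVG(Duration_ms__c) > 2000
        ]) {
            System.debug('Slow operation: ' + ar.get('op') +
                ', avg ' + ar.get('avgMs') + ' ms, max ' + ar.get('maxMs') + ' ms');
        }
    }
}
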
Best Practices:

- Aggregate by operation and track both averages and maximums
- Watch trends over time; gradual degradation matters as much as spikes

Alerting Patterns

Pattern 1: Threshold-Based Alerting

Purpose: Alert when metrics exceed thresholds.

Implementation:

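A sketch that reads thresholds from a hypothetical Alert_Threshold__mdt custom metadata type, so thresholds can be tuned without deploying code:

public class ThresholdAlerter {
    public static void evaluate(String metricName, Decimal observedValue) {
        Alert_Threshold__mdt t = Alert_Threshold__mdt.getInstance(metricName);
        if (t != null && observedValue > t.Threshold_Value__c) {
            System.debug(LoggingLevel.ERROR,
                metricName + ' = ' + observedValue + ' exceeds ' + t.Threshold_Value__c);
            // Route to your notification channel here
        }
    }
}
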
Best Practices:

- Tune thresholds so alerts are neither too sensitive nor too lenient
- Store thresholds in configuration (for example, Custom Metadata) so they can be adjusted without code changes
- Review thresholds regularly against real incident history

Pattern 2: Anomaly Detection

Purpose: Detect anomalies in metrics (unusual patterns).

Implementation:

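A simple baseline sketch: compare the last hour's event volume to the trailing 7-day hourly average and flag large deviations. This is a rough heuristic, not a full anomaly detector; the 3x bounds are assumptions:

public class VolumeAnomalyDetector {
    public static Boolean isAnomalous() {
        Integer lastHour = [
            SELECT COUNT() FROM Event_Publication_Log__c
            WHERE Timestamp__c >= :Datetime.now().addHours(-1)
        ];
        Integer lastWeek = [
            SELECT COUNT() FROM Event_Publication_Log__c
            WHERE Timestamp__c >= :Datetime.now().addDays(-7)
        ];
        Decimal hourlyBaseline = lastWeek / (7.0 * 24);
        // Flag volume more than 3x, or less than a third of, the baseline
        return hourlyBaseline > 0
            && (lastHour > hourlyBaseline * 3 || lastHour < hourlyBaseline / 3);
    }
}
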
Best Practices:

- Establish a baseline before flagging deviations
- Account for expected variation, such as business-hours peaks

Pattern 3: Composite Alerting

Purpose: Alert based on multiple conditions.

Implementation:

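A sketch that alerts only when multiple signals fire together, reusing the metric objects from the earlier patterns. The specific condition counts are assumptions:

public class CompositeAlerter {
    public static void evaluate() {
        Integer calloutFailures = [
            SELECT COUNT() FROM Callout_Metric__c
            WHERE Status_Code__c >= 500
            AND Timestamp__c >= :Datetime.now().addHours(-1)
        ];
        Integer failedJobs = [
            SELECT COUNT() FROM AsyncApexJob
            WHERE Status = 'Failed'
            AND CreatedDate >= :Datetime.now().addHours(-1)
        ];
        // Alert only when both signals fire together, which usually
        // indicates a systemic problem rather than isolated noise
        if (calloutFailures > 5 && failedJobs > 0) {
            System.debug(LoggingLevel.ERROR, 'Systemic failure suspected');
        }
    }
}
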
Best Practices:

- Combine related signals to reduce noise from isolated blips
- Document which condition combinations indicate which failure modes

Q&A

Q: What are monitoring and alerting patterns for Salesforce?

A: Monitoring and alerting patterns enable: (1) Proactive issue detection (detect issues before user impact), (2) System health visibility (dashboards showing system health), (3) Performance monitoring (track performance metrics), (4) Error tracking (track and analyze errors), (5) Automated alerting (alert on issues automatically). Monitoring and alerting provide visibility into system health and enable rapid issue resolution.

Q: How do I monitor Platform Events?

A: Monitor by: (1) Log event publications (log all event publication attempts), (2) Track success/failure rates (monitor publication success), (3) Monitor event volume (track event counts), (4) Alert on failures (alert when failures detected), (5) Create dashboards (visualize event metrics). Platform Events monitoring ensures event-driven integrations are working correctly.

Q: How do I monitor API callout health?

A: Monitor by: (1) Track callout metrics (duration, status codes, response sizes), (2) Monitor failure rates (track callout failures), (3) Circuit breaker monitoring (monitor circuit breaker state), (4) Alert on failures (alert on high failure rates), (5) Create health dashboards (visualize API health). API health monitoring ensures integrations are functioning correctly.

Q: How do I detect async job failures?

A: Detect by: (1) Query AsyncApexJob (query job status), (2) Monitor job status (track job execution), (3) Detect failures (identify failed jobs), (4) Log errors (log job errors), (5) Alert on failures (alert when jobs fail). Async job failure detection ensures background processing is working correctly.

Q: How do I aggregate logs for analysis?

A: Aggregate by: (1) Centralized logging (log to custom object), (2) Collect from all sources (collect logs from all components), (3) Aggregate by type (group logs by type), (4) Analyze patterns (identify patterns in logs), (5) Create dashboards (visualize log data). Log aggregation enables centralized analysis and pattern detection.

Q: What are best practices for alerting?

A: Best practices: (1) Set appropriate thresholds (not too sensitive, not too lenient), (2) Monitor continuously (real-time or near-real-time), (3) Alert promptly (alert quickly on issues), (4) Include context (provide context in alerts), (5) Implement escalation (escalate if not acknowledged), (6) Review regularly (review and adjust thresholds). Effective alerting enables rapid issue response.

Q: How do I create monitoring dashboards?

A: Create by: (1) Identify key metrics (determine what to monitor), (2) Create custom objects (store metrics in custom objects), (3) Build reports (create reports on metrics), (4) Create dashboards (build dashboards from reports), (5) Share dashboards (share with stakeholders). Monitoring dashboards provide visibility into system health.