Salesforce → LLM Data Pipelines

Overview

This document covers high-level pipeline patterns for extracting, transforming, and loading Salesforce data and metadata into LLM-powered systems (RAG, tools, agents). It addresses common architectural variants and their tradeoffs, extraction strategies, transformation and chunking approaches, and how such pipelines interact with manifest-style descriptions.

Prerequisites

Required Knowledge:

Recommended Reading:

Conceptual Architecture

Core Stages: Extract → Transform → Index → Retrieve

The fundamental pipeline consists of four stages:

  1. Extract: Pull data and metadata from Salesforce via APIs
  2. Transform: Normalize, enrich, and structure data for LLM consumption
  3. Index: Generate embeddings and store in vector database for RAG
  4. Retrieve: Query vector store to retrieve relevant context for LLM
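The four stages above can be sketched as plain functions. This is a minimal, in-memory illustration only: the record shapes, the dictionary "vector store", and the keyword-match "retrieval" are all stand-ins for real API calls, embeddings, and similarity search.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str


def extract(records):
    """Extract stage: in a real pipeline this would call Salesforce APIs."""
    return records  # pre-fetched records stand in for an API call here


def transform(records):
    """Transform stage: normalize records into text chunks."""
    return [Chunk(doc_id=r["Id"], text=f"{r['Name']}: {r.get('Description', '')}")
            for r in records]


def index(chunks):
    """Index stage: stand-in for embedding generation + vector store upsert."""
    return {c.doc_id: c for c in chunks}


def retrieve(store, query):
    """Retrieve stage: stand-in for a vector similarity search."""
    return [c for c in store.values() if query.lower() in c.text.lower()]


# Toy end-to-end run over pre-fetched records
records = [{"Id": "001A", "Name": "Acme Corp", "Description": "Technology account"}]
store = index(transform(extract(records)))
hits = retrieve(store, "technology")
```

Each later section of this document refines one of these stages (which API backs `extract`, how `transform` chunks records, and so on).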

Common Architectural Variants

Batch Export (Nightly/Periodic)

Pattern: Scheduled full or incremental data extraction on a fixed schedule.

Characteristics:

Pros:

Cons:

Use Cases:

Event-Driven (Change Data Capture or Events)

Pattern: Real-time or near-real-time updates triggered by Salesforce change events.

Characteristics:

Related: Change Data Capture Patterns - Complete CDC patterns guide

Pros:

Cons:

Use Cases:

On-Demand Query (Tool-Style Calls)

Pattern: LLM agent queries Salesforce directly during conversation via API calls.

Characteristics:

Pros:

Cons:

Use Cases:

Salesforce Data Sources for LLMs

Types of Data

Structured Objects/Fields

What: Standard and custom object records with field values.

When Useful:

Typical Use in RAG:

Metadata (Schema and Labels)

What: Object definitions, field metadata, validation rules, relationship definitions.

When Useful:

Typical Use in RAG:

Documents/Content (Files, Articles, Attachments)

What: Files, Knowledge articles, ContentVersion records, attachments.

When Useful:

Typical Use in RAG:

Logs/Events (If Relevant)

What: Platform Events, Change Data Capture events, audit logs.

When Useful:

Typical Use in RAG:

Typical Combinations Used in Real RAG Systems

Pattern 1: Schema + Records

Pattern 2: Records + Relationships

Pattern 3: Schema + Records + Documents

Extraction Patterns

Overview of Extraction APIs

REST/Composite APIs

Characteristics:

Use Cases:

Limitations:
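As a sketch of a targeted REST extraction, the following builds the request for the standard `/query` endpoint. The instance URL and token are placeholders; in practice both come from your OAuth flow, and the request itself would be sent with an HTTP client.

```python
from urllib.parse import urlencode


def build_soql_request(instance_url, api_version, soql, access_token):
    """Build URL and headers for a Salesforce REST API query call.

    instance_url and access_token are placeholders here; in a real
    pipeline they are obtained from the OAuth authentication flow.
    """
    url = f"{instance_url}/services/data/v{api_version}/query?{urlencode({'q': soql})}"
    headers = {"Authorization": f"Bearer {access_token}"}
    return url, headers


url, headers = build_soql_request(
    "https://example.my.salesforce.com", "59.0",
    "SELECT Id, Name FROM Account WHERE LastModifiedDate = TODAY",
    "<token>",
)
```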

Bulk API

Characteristics:

Use Cases:

Limitations:
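For high-volume extraction, Bulk API 2.0 uses an asynchronous job model: create a query job, poll until it completes, then download CSV results. A sketch of the job-creation request (instance URL is a placeholder; submitting and polling are left to the caller):

```python
import json


def build_bulk_query_job(instance_url, api_version, soql):
    """Build the POST request for a Bulk API 2.0 query job.

    The caller submits this request, then polls the returned job until
    its state is complete and downloads the CSV result pages.
    """
    url = f"{instance_url}/services/data/v{api_version}/jobs/query"
    body = json.dumps({"operation": "query", "query": soql})
    return url, body


url, body = build_bulk_query_job(
    "https://example.my.salesforce.com", "59.0",
    "SELECT Id, Name, Description FROM Account",
)
```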

Metadata/Tooling APIs (For Schema)

Characteristics:

Use Cases:

Limitations:
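Schema extracted via describe calls is often rendered as text so it can be indexed alongside records. A sketch, assuming a simplified describe-result shape (name, label, and a list of fields):

```python
def describe_to_text(describe):
    """Render a (simplified) object describe result as indexable text.

    `describe` mimics the shape of an sObject describe response:
    object name/label plus a list of fields with name, label, and type.
    """
    lines = [f"Object: {describe['name']} ({describe['label']})"]
    for f in describe["fields"]:
        lines.append(f"  Field: {f['name']} | Label: {f['label']} | Type: {f['type']}")
    return "\n".join(lines)


schema_chunk = describe_to_text({
    "name": "Account",
    "label": "Account",
    "fields": [
        {"name": "Name", "label": "Account Name", "type": "string"},
        {"name": "AnnualRevenue", "label": "Annual Revenue", "type": "currency"},
    ],
})
```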

Change Data Capture / Event-Based Updates

Characteristics:

Use Cases:

Limitations:
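A CDC consumer typically folds change events into a local record cache (or directly into the index). The sketch below assumes a simplified event body: changed field values plus a `ChangeEventHeader` carrying `changeType` and `recordIds`, which mirrors the structure of real CDC messages.

```python
def apply_change_event(store, event):
    """Apply a (simplified) CDC event payload to a local record cache.

    `event` mimics a CDC message body: changed field values plus a
    ChangeEventHeader with changeType and recordIds.
    """
    header = event["ChangeEventHeader"]
    change_type = header["changeType"]
    for record_id in header["recordIds"]:
        if change_type == "DELETE":
            store.pop(record_id, None)
        else:  # CREATE / UPDATE / UNDELETE: merge the changed fields
            fields = {k: v for k, v in event.items() if k != "ChangeEventHeader"}
            store.setdefault(record_id, {}).update(fields)
    return store


cache = {}
apply_change_event(cache, {
    "Name": "Acme Corp",
    "ChangeEventHeader": {"changeType": "CREATE", "recordIds": ["001A"]},
})
apply_change_event(cache, {
    "ChangeEventHeader": {"changeType": "DELETE", "recordIds": ["001A"]},
})
```

In a real subscriber you would also persist the replay ID after each event, so the pipeline can resume within the retention window after a restart.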

Patterns to Feed RAG

Full Loads vs Incremental Loads

Full Loads:

Incremental Loads:
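A common timestamp-based incremental pattern queries on `SystemModstamp` against a watermark persisted between runs (e.g. in a state table). A sketch (note SOQL datetime literals are unquoted):

```python
def incremental_soql(sobject, fields, watermark_iso):
    """Build a SOQL query for records changed since the last run.

    SystemModstamp is a reliable change timestamp on most objects; the
    watermark is the max SystemModstamp seen in the previous run,
    stored between runs by the pipeline.
    """
    field_list = ", ".join(fields)
    return (f"SELECT {field_list} FROM {sobject} "
            f"WHERE SystemModstamp > {watermark_iso} "
            f"ORDER BY SystemModstamp ASC")


q = incremental_soql("Account", ["Id", "Name"], "2024-01-01T00:00:00Z")
```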

Snapshots vs CDC-Based Approaches

Snapshots:

CDC-Based Approaches:

Tradeoffs

Performance and Limits:

Complexity:

Data Freshness:

Operational Risk:

Transformation & Chunking for RAG

Strategies for Modeling Salesforce Data as RAG “Documents”

Per-Record Chunks vs Aggregated “Logical Documents”

Per-Record Chunks:

Aggregated Logical Documents:

Per-Object vs Cross-Object Chunks

Per-Object Chunks:

Cross-Object Chunks:

Chunking Strategies

Field Selection and Redaction

What to Include:

What to Exclude:

Flattening Relationships into Text

Pattern: Include related record information as text in chunks.

Example:

Account: Acme Corp
Industry: Technology
Related Contacts: John Doe (Primary), Jane Smith (Billing)
Related Cases: Case #12345 (Open), Case #12346 (Closed)
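The flattened text above can be produced by a small helper like the following sketch. The input shapes are illustrative (in particular, `Role` is not a standard Contact field; a real pipeline would derive it from your own data model).

```python
def flatten_account_chunk(account, contacts, cases):
    """Flatten an Account and its related records into one text chunk.

    Input dicts are illustrative stand-ins for queried records;
    the Role field in particular is hypothetical.
    """
    lines = [
        f"Account: {account['Name']}",
        f"Industry: {account['Industry']}",
        "Related Contacts: " + ", ".join(
            f"{c['Name']} ({c['Role']})" for c in contacts),
        "Related Cases: " + ", ".join(
            f"Case #{c['CaseNumber']} ({c['Status']})" for c in cases),
    ]
    return "\n".join(lines)


chunk = flatten_account_chunk(
    {"Name": "Acme Corp", "Industry": "Technology"},
    [{"Name": "John Doe", "Role": "Primary"},
     {"Name": "Jane Smith", "Role": "Billing"}],
    [{"CaseNumber": "12345", "Status": "Open"},
     {"CaseNumber": "12346", "Status": "Closed"}],
)
```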

Pros: Preserves relationship context in text
Cons: May create large chunks; relationship structure is lost

Using Natural-Language Labels and Metadata

Pattern: Include field labels, help text, and metadata for better retrieval.

Example:

Object: Account
Field: AnnualRevenue
Label: Annual Revenue
Help Text: The company's annual revenue in USD
Value: $1,000,000
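The example above can be assembled from describe metadata plus the record value. A sketch, using `inlineHelpText` (the describe key for field help text) on a simplified metadata dict:

```python
def field_with_metadata(obj_name, field_meta, value):
    """Render one field value together with its label and help text.

    `field_meta` mimics a field describe entry (name, label,
    inlineHelpText); `value` is the record's formatted field value.
    """
    return "\n".join([
        f"Object: {obj_name}",
        f"Field: {field_meta['name']}",
        f"Label: {field_meta['label']}",
        f"Help Text: {field_meta['inlineHelpText']}",
        f"Value: {value}",
    ])


text = field_with_metadata(
    "Account",
    {"name": "AnnualRevenue", "label": "Annual Revenue",
     "inlineHelpText": "The company's annual revenue in USD"},
    "$1,000,000",
)
```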

Pros: Better semantic understanding, improved retrieval
Cons: Increases chunk size, more metadata to manage

Example Patterns

“Case-Centric Document” Pattern

Structure: Case record with related Account, Contact, and Case Comments.

Chunk Content:

Use Case: Support agent RAG system for case understanding

“Student Lifecycle” Pattern (Conceptual, Anonymized)

Structure: Student record with related Program Enrollment, Course Enrollment, and Application records.

Chunk Content:

Use Case: Education RAG system for student information

Interaction With Manifests

How Manifest-Style Descriptions Work

Manifest-style descriptions (e.g., tool manifests, connector manifests) typically describe:

Why That Is Not Sufficient

Manifest-style descriptions are not sufficient to define:

Full Data Pipelines

Gap: Manifests describe individual tools/endpoints, not end-to-end pipelines.

What’s Missing:

Chunking Strategies

Gap: Manifests don’t define how to chunk Salesforce data for RAG.

What’s Missing:

Security and Governance Requirements

Gap: Manifests don’t capture security model evaluation and governance.

What’s Missing:

Sources Used

To Evaluate

Q&A

Q: What is a Salesforce to LLM data pipeline?

A: A Salesforce to LLM data pipeline extracts, transforms, and loads Salesforce data and metadata into LLM-powered systems (RAG, tools, agents). The pipeline consists of four stages: (1) Extract - pull data/metadata from Salesforce via APIs, (2) Transform - normalize, enrich, structure data for LLM consumption, (3) Index - generate embeddings and store in vector database for RAG, (4) Retrieve - query vector store to retrieve relevant context for LLM.

Q: What extraction APIs should I use for Salesforce to LLM pipelines?

A: Use extraction APIs based on requirements: (1) REST/Composite APIs - real-time extraction, targeted queries, (2) Bulk API - high-volume extraction (millions of records), initial index population, (3) Metadata/Tooling APIs - schema extraction (object definitions, field metadata), (4) Change Data Capture (CDC) - real-time incremental updates, event-driven refresh. Choose based on volume, freshness requirements, and use case.

Q: What is the difference between full loads and incremental loads?

A: Full loads extract all records from selected objects (initial index population, periodic full refresh). Incremental loads extract only changed records since last extraction (timestamp-based or event-based). Full loads: complete data, simpler logic, but resource-intensive. Incremental loads: efficient, faster, but more complex logic requiring change tracking.

Q: How do I chunk Salesforce data for RAG systems?

A: Chunk Salesforce data by: (1) Per-record chunks - one chunk per record (simple, clear boundaries), (2) Aggregated logical documents - multiple related records in one chunk (Account + Contacts, preserves relationships), (3) Field selection - include descriptive fields (Name, Description, Status), exclude system/technical fields, (4) Flattening relationships - include related record info as text, (5) Natural-language labels - include field labels and help text for better retrieval.

Q: What should I include in RAG chunks from Salesforce?

A: Include in chunks: (1) Descriptive fields (Name, Description, Notes, Comments), (2) Status fields (Status, Stage, State), (3) Relationship context (related record names), (4) Temporal fields (Created Date, Last Modified Date), (5) Field labels and help text (better semantic understanding). Exclude: system fields, audit fields, technical fields, large binary data (unless needed).

Q: What are the tradeoffs between batch and event-driven extraction?

A: Batch extraction (nightly/periodic): predictable resource usage, efficient for large volumes, can run off-peak, simpler error recovery, but data may be stale, requires job scheduling. Event-driven extraction (CDC-based): real-time freshness, efficient incremental updates, but complex event handling, finite event retention windows (typically up to 72 hours), and event subscription infrastructure requirements.

Q: How do I handle security and governance in LLM data pipelines?

A: Handle security by: (1) Evaluating Field-Level Security (FLS) - respect FLS when extracting data, (2) Evaluating Object-Level Security (OLS) - respect object access, (3) Understanding sharing rules - consider sharing model, (4) Data retention policies - comply with retention requirements, (5) Audit trail requirements - track data extraction, (6) Redacting sensitive data - exclude PII/PHI if not needed. Security evaluation is critical for compliance.
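One concrete piece of the above is FLS enforcement at extraction time: drop any field the requesting user cannot read before a record enters the index. A sketch, assuming the readable-field set has already been resolved from a permissions check (e.g. FieldPermissions queries or a describe call run in the user's context):

```python
def apply_fls(record, readable_fields):
    """Drop fields the requesting user cannot read.

    `readable_fields` is assumed to come from an upstream permissions
    check; this helper only applies the result to one record.
    """
    return {k: v for k, v in record.items() if k in readable_fields}


safe = apply_fls(
    {"Id": "001A", "Name": "Acme Corp", "SSN__c": "secret"},
    readable_fields={"Id", "Name"},
)
```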

Q: What is the difference between per-record and aggregated chunking?

A: Per-record chunking creates one chunk per Salesforce record (simple, clear boundaries, but may lose relationship context). Aggregated chunking includes multiple related records in one chunk (Account + Contacts, preserves relationships, richer context, but larger chunks, more complex logic). Choose based on use case - per-record for simple retrieval, aggregated for relationship-aware retrieval.

Q: How do I choose between REST API, Bulk API, and CDC for extraction?

A: Choose based on: (1) REST API - real-time, targeted extraction, flexible queries, but rate limits, (2) Bulk API - high-volume (millions of records), efficient for full loads, but not real-time, requires job polling, (3) CDC - real-time incremental updates, event-driven, but event retention limits, complex event handling. Use REST for on-demand, Bulk for initial/full loads, CDC for real-time incremental.

Q: What are best practices for Salesforce to LLM pipelines?

A: Best practices include: (1) Choose appropriate extraction API (REST, Bulk, CDC based on requirements), (2) Implement chunking strategy (per-record or aggregated based on use case), (3) Include relevant fields (descriptive, status, relationships), (4) Respect security model (FLS, OLS, sharing rules), (5) Handle errors gracefully (retry logic, error recovery), (6) Monitor pipeline health (extraction metrics, indexing status), (7) Optimize for retrieval (chunking, metadata, embeddings).

Edge Cases and Limitations

Edge Case 1: Large Object Records with Many Relationships

Scenario: Records with extensive relationship data (Account with 100+ Contacts, Cases, Opportunities) creating very large chunks.

Consideration:

Edge Case 2: Real-Time Data Freshness Requirements

Scenario: LLM system requires real-time data updates, but event replay windows are finite (typically up to 72 hours for CDC/high-volume Platform Events).

Consideration:

Edge Case 3: Field-Level Security (FLS) Evaluation Complexity

Scenario: Extracting data while respecting FLS requires per-user security evaluation, complicating extraction.

Consideration:

Edge Case 4: Chunking Strategy Selection for Complex Data Models

Scenario: Complex data models with many relationships make chunking strategy selection difficult.

Consideration:

Edge Case 5: Embedding Model Token Limits

Scenario: Chunks exceed embedding model token limits (e.g., 8,192 tokens for some models).

Consideration:
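One mitigation is splitting oversized chunks on line boundaries under an approximate token budget. The 4-characters-per-token heuristic below is a rough assumption; a production pipeline should count tokens with the embedding model's actual tokenizer.

```python
def split_chunk(text, max_tokens, est_chars_per_token=4):
    """Split a chunk so each piece stays under an approximate token
    budget, breaking on line boundaries.

    The chars-per-token ratio is a heuristic assumption; use the
    embedding model's tokenizer for exact counts.
    """
    max_chars = max_tokens * est_chars_per_token
    pieces, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            pieces.append(current)
            current = ""
        current += line
    if current:
        pieces.append(current)
    return pieces


pieces = split_chunk("line one\n" * 100, max_tokens=50)
```

A single line longer than the budget still yields an oversized piece; such lines need a secondary split (e.g. on sentences) in practice.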

Edge Case 6: Incremental Update Complexity

Scenario: Updating RAG index incrementally requires complex change tracking and partial index updates.

Consideration:

Limitations