Salesforce Data Scope & Security for LLMs

Overview

This document covers how to choose what data to expose from Salesforce to LLMs and how to do that safely. It addresses scoping strategy (which objects, fields, and records to include), security considerations (how to reflect Salesforce access controls in LLM data extraction), data masking and redaction strategies, and governance and lifecycle management.

Prerequisites

Required Knowledge:

Recommended Reading:

Data Scoping Principles

Criteria for Including/Excluding Data

Relevance to RAG/LLM Use Cases

Include When:

Exclude When:

Sensitivity and PII/PHI

Include When:

Exclude When:

Data Volume and Performance Constraints

Include When:

Exclude When:

Approaches to Selecting Data

Core Business Data

What: Primary business entities and their key attributes.

Examples:

Rationale: Core business data provides essential context for most LLM use cases.

Supporting Context

What: Related records and metadata that provide additional context.

Examples:

Rationale: Supporting context enriches understanding but may not be essential for all use cases.

Excluded/Sanitized Fields

What: Fields that are excluded entirely or sanitized before inclusion.

Examples:

Rationale: Sensitive data must be excluded or sanitized to meet security and compliance requirements.

Mapping Salesforce Security to LLM Access

How to Interpret Salesforce Security Controls

Field-Level Security (FLS)

What It Means: Users may not have read access to certain fields.

For LLM Extraction:

Implementation: Query FLS metadata and filter fields based on accessibility.
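A minimal sketch of this step in Python, assuming the describe payload was fetched (e.g. via GET /services/data/vXX.X/sobjects/{SObject}/describe) with the extraction user's own access token; Salesforce's sObject describe only returns fields that user can read, so an allow-list built from it inherits FLS. The field names and object are illustrative:

```python
def readable_fields(describe_payload: dict) -> list[str]:
    """Field names visible to the authenticated user.

    Salesforce omits fields the calling user has no FLS read access
    to from the describe response, so this list respects FLS as long
    as the payload was fetched under the extraction user's token.
    """
    return [f["name"] for f in describe_payload["fields"]]

def fls_safe_soql(sobject: str, desired: list[str],
                  describe_payload: dict) -> str:
    """Build a SOQL SELECT that names only FLS-readable fields."""
    allowed = set(readable_fields(describe_payload))
    cols = [f for f in desired if f in allowed]
    return f"SELECT {', '.join(cols)} FROM {sobject}"
```

A request for a field the user cannot read (e.g. a custom SSN field) is silently dropped from the query rather than failing the extraction.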

Object Permissions

What It Means: Users may not have read access to certain objects.

For LLM Extraction:

Implementation: Query object permissions and skip inaccessible objects.
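As a sketch, the describe-global resource (GET /services/data/vXX.X/sobjects, fetched with the extraction user's token) reports a queryable flag per object, which can drive the skip decision; the payload shape below mirrors that response, with object names chosen for illustration:

```python
def extractable_objects(describe_global: dict) -> list[str]:
    """Keep only objects the authenticated user can query.

    Objects the user cannot access are either absent from the
    describe-global response or carry queryable=False, so filtering
    on that flag skips inaccessible objects.
    """
    return [
        o["name"]
        for o in describe_global["sobjects"]
        if o.get("queryable", False)
    ]
```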

Sharing Rules and Org-Wide Defaults

What It Means: Record visibility is determined by sharing rules, not just object permissions.

For LLM Extraction:

Implementation: Use user context (not system context) to respect sharing rules automatically.
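The key implementation detail is *whose* token executes the query: the same SOQL returns different records depending on the credentials. A small sketch (the API version is an assumption; the request itself would be sent with the end user's access token, not a system/service token, so Salesforce evaluates sharing rules and org-wide defaults for that user automatically):

```python
import urllib.parse

API_VERSION = "v60.0"  # assumed API version

def query_url(instance_url: str, soql: str) -> str:
    """Build the REST query URL for a SOQL statement.

    Executing this URL with the *end user's* OAuth access token
    (Authorization: Bearer <user token>) makes Salesforce apply that
    user's sharing rules; a service-account token would apply the
    service account's visibility instead.
    """
    return (
        f"{instance_url}/services/data/{API_VERSION}/query"
        f"?q={urllib.parse.quote(soql)}"
    )

# e.g. requests.get(query_url(url, "SELECT Id FROM Case"),
#                   headers={"Authorization": f"Bearer {user_token}"})
```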

Extraction Strategies

Extracting Data Under a Constrained Service Account

Pattern: Use dedicated service account with minimal required permissions.

Benefits:

Considerations:

Use When: Extracting data for a general-purpose RAG system rather than a user-specific one.

User-Context Extraction

Pattern: Extract data in context of specific user to respect their permissions.

Benefits:

Considerations:

Use When: Extracting data for user-specific RAG systems or personalized experiences.

Maintaining Consistent Access Rules in the RAG/LLM Layer

Challenge: RAG/LLM layer doesn’t automatically inherit Salesforce security.

Strategies:

Tradeoffs:

Approaches to Enforcing Security in RAG

Index Design

Separate Indexes Per Audience/Role

Pattern: Create separate vector indexes for different user roles or audiences.

Implementation:

Pros:

Cons:

Use When: Clear role-based access patterns, sufficient storage, performance critical.
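The pattern can be illustrated with an in-memory stand-in for per-role vector indexes (the class and role names are hypothetical; a real system would use one vector store collection per role in place of the lists):

```python
from collections import defaultdict

class RoleScopedIndexes:
    """One index (here, a plain list) per role; a chunk is written to
    every role allowed to see it, and a query only touches the
    caller's own index, so scope is enforced by construction."""

    def __init__(self):
        self._indexes = defaultdict(list)

    def add_chunk(self, chunk: str, roles: list[str]) -> None:
        # Duplicated storage is the cost of this approach.
        for role in roles:
            self._indexes[role].append(chunk)

    def search(self, role: str, term: str) -> list[str]:
        # Stand-in for a vector similarity search.
        return [c for c in self._indexes[role] if term.lower() in c.lower()]
```

The storage duplication in add_chunk is exactly the tradeoff this design accepts in exchange for never filtering at query time.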

Attribute-Based Filtering at Query Time

Pattern: Include access attributes in chunks and filter at query time.

Implementation:

Pros:

Cons:

Use When: Flexible access patterns, limited storage, access rules change frequently.
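A minimal sketch of query-time filtering, assuming each chunk carries access metadata stamped at ingestion time (the allowed_roles/allowed_regions keys are illustrative names, not a fixed schema):

```python
def filter_by_attributes(chunks: list[dict], user_attrs: dict) -> list[dict]:
    """Drop retrieved chunks the user's attributes do not match.

    Applied to the candidate set *after* vector retrieval and
    *before* anything is passed to the LLM, so the model never sees
    out-of-scope content.
    """
    return [
        c for c in chunks
        if user_attrs["role"] in c["meta"]["allowed_roles"]
        and user_attrs["region"] in c["meta"]["allowed_regions"]
    ]
```

Because the filter runs per query, changing an access rule only requires re-stamping metadata, not rebuilding indexes.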

Pros/Cons of Each Approach

Separate Indexes:

Attribute-Based Filtering:

Potential Pitfalls

Over-Sharing Via a Single Global Index

Risk: All users query the same index and may retrieve chunks they shouldn't see.

Mitigation:

Treating LLM as If It “Inherits” Salesforce Security Automatically

Risk: Assuming LLM automatically respects Salesforce security without implementation.

Reality: LLM/RAG layer doesn’t automatically inherit Salesforce security. Must be implemented explicitly.

Mitigation:

Governance and Lifecycle

Refresh Cadence

Strategies:

Decision Factors:

Handling Updates, Corrections, and Deletions

Updates:

Corrections:

Deletions:
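Propagating deletions can be sketched as a purge pass over the index, assuming each chunk's metadata records the Salesforce record Id it was built from; the deleted Ids would come from a change feed such as Salesforce's "Get Deleted" REST resource for the sync window:

```python
def purge_deleted(index: dict, deleted_ids: list[str]) -> int:
    """Remove every chunk derived from a deleted Salesforce record.

    `index` maps chunk_id -> metadata that includes the source
    record Id (an assumption of this sketch); returns the number
    of chunks removed, which is useful for the audit trail.
    """
    doomed_ids = set(deleted_ids)
    doomed = [cid for cid, meta in index.items()
              if meta["record_id"] in doomed_ids]
    for cid in doomed:
        del index[cid]
    return len(doomed)
```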

Retention and Compliance Considerations (e.g., Data Minimization)

Data Minimization:

Compliance:

Retention Policies:

Audit Trails

What Was Exported

Track:

Purpose: Understand what data is in LLM system, enable compliance reporting.

When It Was Exported

Track:

Purpose: Understand data freshness, enable temporal analysis, support compliance.

For Which Purpose

Track:

Purpose: Enable purpose limitation compliance, support governance.

Q&A

Q: What is Salesforce LLM data governance?

A: Salesforce LLM data governance is the process of choosing what data to expose from Salesforce to LLMs and doing so safely. It includes: (1) Data scoping (which objects, fields, records to include), (2) Security considerations (reflecting Salesforce access controls in LLM data extraction), (3) Data masking and redaction (protecting sensitive data), (4) Governance and lifecycle management (policies, procedures, monitoring).

Q: How do I scope data for LLM systems?

A: Scope data by: (1) Including relevant data (provides context for answering questions, enables relationship understanding), (2) Excluding irrelevant data (doesn’t contribute to LLM understanding, redundant, operational/technical only), (3) Considering sensitivity (PII/PHI, classification), (4) Evaluating use case (what data is needed for LLM use case), (5) Balancing utility and risk (enough data for utility, not too much risk).
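One way to make these scoping decisions enforceable is a declarative allow-list per object, with anything unlisted excluded by default. A sketch (the objects, fields, and config name are hypothetical examples, not a prescribed schema):

```python
# Hypothetical extraction scope: explicit allow-lists per object;
# objects not listed here are never extracted.
EXTRACTION_SCOPE = {
    "Account": {"include": ["Name", "Industry", "Description"],
                "exclude": ["TaxId__c"]},
    "Case":    {"include": ["Subject", "Description", "Status"],
                "exclude": []},
}

def scoped_fields(sobject: str) -> list[str]:
    """Fields approved for extraction; exclusions win over inclusions."""
    spec = EXTRACTION_SCOPE.get(sobject)
    if spec is None:
        return []  # deny by default
    return [f for f in spec["include"] if f not in spec["exclude"]]
```

Keeping the scope in one reviewable structure also gives governance a single artifact to audit.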

Q: How do I reflect Salesforce security in LLM data extraction?

A: Reflect security by: (1) Evaluating Field-Level Security (FLS) (respect FLS when extracting data), (2) Evaluating Object-Level Security (OLS) (respect object access), (3) Understanding sharing rules (consider sharing model), (4) Respecting user context (extract data based on user permissions), (5) Using attribute-based filtering (filter by user attributes, roles), (6) Implementing role-based access (different data per role).

Q: What data masking strategies should I use?

A: Use masking strategies: (1) Field-level masking (mask specific fields - SSN, email, phone), (2) Record-level masking (mask entire records if sensitive), (3) Tokenization (replace sensitive data with tokens), (4) Redaction (remove sensitive data completely), (5) Pseudonymization (replace with pseudonyms), (6) Aggregation (aggregate sensitive data). Choose strategy based on sensitivity and use case requirements.
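Field-level masking of free-text values can be sketched with pattern-based redaction applied before text enters the pipeline (the patterns below are illustrative and US-centric; production masking would use a proper PII detection service and cover far more formats):

```python
import re

# Illustrative patterns only: US-style SSNs and email addresses.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def mask_text(text: str) -> str:
    """Replace common PII patterns with placeholder tokens so the
    surrounding text keeps its utility for the LLM."""
    text = SSN_RE.sub("[SSN]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return text
```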

Q: How do I handle PII/PHI in LLM data pipelines?

A: Handle PII/PHI by: (1) Identifying PII/PHI (inventory sensitive data), (2) Classifying by sensitivity (high, medium, low), (3) Masking or redacting sensitive fields, (4) Excluding if not needed (don’t include if not required for use case), (5) Using encryption (Shield Encryption for sensitive fields), (6) Complying with regulations (GDPR, HIPAA, FERPA), (7) Documenting handling (policies, procedures).

Q: What is attribute-based filtering for LLM data?

A: Attribute-based filtering filters data based on user attributes (roles, departments, regions). It enables: (1) Role-based data access (different data per role), (2) Department-based filtering (filter by department), (3) Region-based filtering (filter by region), (4) Dynamic filtering (filter based on user context). Attribute-based filtering provides fine-grained data access control.

Q: How do I implement data governance for LLM systems?

A: Implement governance by: (1) Defining data scoping policies (what data to include/exclude), (2) Establishing security policies (how to reflect Salesforce security), (3) Creating masking policies (what to mask, how), (4) Setting up monitoring (track data extraction, access), (5) Documenting procedures (policies, processes), (6) Regular reviews (audit data usage, compliance), (7) Training teams (governance requirements).

Q: What are the risks of exposing too much data to LLMs?

A: Risks include: (1) Privacy violations (exposing PII/PHI), (2) Compliance violations (GDPR, HIPAA, FERPA), (3) Security breaches (sensitive data in LLM systems), (4) Data leakage (data exposed to unauthorized users), (5) Increased token costs (more data = more tokens), (6) Reduced LLM performance (too much irrelevant data). Balance data utility with risk.

Q: How do I balance data minimization with LLM utility?

A: Balance by: (1) Including necessary data (data needed for LLM use case), (2) Excluding unnecessary data (data that doesn’t contribute), (3) Using selective extraction (extract only relevant records), (4) Masking sensitive data (protect while maintaining utility), (5) Testing LLM effectiveness (verify LLM works with minimized data), (6) Iterating based on results (adjust based on LLM performance). Too little data reduces utility, too much increases risk.

Q: What are best practices for LLM data governance?

A: Best practices include: (1) Scope data carefully (include only necessary data), (2) Respect Salesforce security (FLS, OLS, sharing rules), (3) Mask sensitive data (PII/PHI, sensitive fields), (4) Monitor data extraction (track what data is extracted), (5) Document policies (clear governance policies), (6) Regular audits (review data usage, compliance), (7) Train teams (governance requirements), (8) Comply with regulations (GDPR, HIPAA, FERPA).