Sanitize backups and test data

Keep production data safe and keep non-prod clean. Minimize what you back up, encrypt everything, and generate privacy-safe test data that preserves schema and behavior without exposing personal data.

Sanitize Backups and Test Data

Strategy and scope

Default rule: no production PII in dev, test, or staging.
Classify datasets and mark fields as direct_id, quasi_id, sensitive, or public.
Decide per dataset: exclude from backup, backup encrypted only, or include but crypto-shred capable.
Document restore procedures that immediately sanitize any copy used outside prod.

Production backups: minimize and protect

Reduce blast radius: exclude high-risk tables or columns from routine backups if they can be rebuilt.
Encrypt at rest and in transit: use KMS-managed keys, separate accounts/projects for backup stores, and object lock or WORM where supported.
Crypto-shred plan: use envelope encryption per tenant or dataset so destroying a KEK renders old backups unreadable.
Access control: least privilege, break-glass, dual control on decrypt. Log every restore and decrypt.
Retention: keep only as long as business and legal require. Apply lifecycle rules and prove deletions with audit entries.
Test restores: monthly at minimum. Run a PII scan on restored samples and verify decrypt, rotate, and shred paths.

Non-prod data: generate, subset, and mask

Prefer synthetic or fuzzed data over masked prod data.
If you must start from prod, use pseudonymization and redaction with deterministic functions to keep referential integrity.
Subset by time or tenant to reduce volume. Maintain foreign key completeness.

Patterns and examples

Postgres: sanitized schema for exports

Create a dedicated schema of sanitized views or materialized tables. Dump only that schema for dev.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- App-wide secret for deterministic pseudonyms (do not store in repo)
-- SET app.pseudo_salt = 'rotate_me_in_secrets_manager';

CREATE SCHEMA devmask;

-- Helper functions
CREATE OR REPLACE FUNCTION devmask.pseudo_text(t text) RETURNS text AS $$
  SELECT encode(hmac(t, current_setting('app.pseudo_salt', true), 'sha256'),'hex')
$$ LANGUAGE sql IMMUTABLE;

CREATE OR REPLACE FUNCTION devmask.null_if_sensitive(b bool, v text) RETURNS text AS $$
  SELECT CASE WHEN b THEN NULL ELSE v END
$$ LANGUAGE sql IMMUTABLE;

-- Users view: keep structure, remove direct identifiers
CREATE OR REPLACE VIEW devmask.users AS
SELECT
  id,
  devmask.pseudo_text(email)          AS email_pseudo,
  split_part(devmask.pseudo_text(coalesce(phone,'')), '', 1) AS phone_pseudo,
  left(digest(id::text || 'name','sha256')::text, 12) AS display_name,
  false                               AS email_verified,
  created_at::timestamptz             AS created_at,
  NULL::inet                          AS last_ip,
  NULL::text                          AS address_line1,
  NULL::text                          AS address_line2,
  NULL::text                          AS city,
  NULL::text                          AS postal_code,
  country
FROM public.users;

-- Orders view: keep relationships, strip payload
CREATE OR REPLACE VIEW devmask.orders AS
SELECT
  o.id, o.user_id, o.status, o.total_cents, o.currency, o.created_at,
  NULL::jsonb AS payment_details,
  NULL::text  AS shipping_name
FROM public.orders o;

Dump only sanitized objects:

# Materialize if you prefer tables, then:
# pg_dump --schema=devmask --no-owner --file dev_sanitized.sql "$DATABASE_URL"

Block non-sanitized dumps in automation:

REVOKE USAGE ON SCHEMA public FROM dev_exporter;
GRANT USAGE ON SCHEMA devmask TO dev_exporter;

Postgres: batch masking after restore

If you restore prod into a quarantine VPC, run a one-time sanitizer job before any developer access.

-- Example redactions in place (use in isolated restore env only)
UPDATE users
SET email = substr(encode(digest(email || current_setting('app.pseudo_salt'),'sha256'),'hex'),1,16) || '@example.test',
    phone = NULL,
    address_line1 = NULL,
    address_line2 = NULL,
    postal_code   = NULL;

-- Verify no direct IDs remain
-- SELECT * FROM users WHERE email ~* '(@gmail\.com|@yahoo\.com)';

MongoDB: projection and export

Create a projection pipeline that drops or hashes sensitive fields and export only the result.

db.users.aggregate([
  { $match: { /* selection or sampling */ } },
  { $addFields: {
      email_pseudo: { $toHexString: { $sha256: { $concat: ["$email", SECRET_SALT] } } }
  }},
  { $project: {
      email: 0, phone: 0, address: 0, ip: 0
  }},
  { $out: "devmask_users" }
]);

Object storage and media

Do not copy raw uploads to dev. Generate derivatives only.
Strip EXIF and GPS from images. Replace PDFs containing PII with placeholder pages that preserve count and size envelope.
Lifecycle rules: keep dev artifacts short-lived (for example 7 to 30 days).

# Image EXIF strip example (server-side pipeline)
magick in.jpg -strip -resize 1600x in_sanitized.jpg

Subsetting with integrity

Keep relational integrity while sampling.

-- Sample 5% of tenants, then include dependent rows
WITH sampled_tenants AS (
  SELECT id FROM tenants WHERE random() < 0.05
),
users_in_scope AS (
  SELECT u.* FROM users u JOIN sampled_tenants t ON t.id = u.tenant_id
)
SELECT count(*) FROM users_in_scope;
-- Repeat pattern for orders, messages, etc. using tenant_id

Test data generators

Use deterministic seeds to make tests reproducible. Keep generators in the repo and run them in CI and local setup.

// Node sketch with faker-like patterns
import crypto from "crypto";
function seeded(n){ return crypto.createHash("sha256").update(String(n)).digest("hex"); }
function user(n){
  const id = n;
  return {
    id,
    email: `user${n}@example.test`,
    name: `User ${n}`,
    phone: null,
    address: null
  };
}

Guardrails and automation

PII scanning gates

Run regex and checksum detectors on sanitized dumps before release. Block if any email, phone, SSN, or card PAN patterns are found.
Maintain an allowlist of test domains (example.test, invalid, localhost).

Quarantine and approvals

Restores land in a quarantine project. Sanitizer runs. Scanner passes. Only then copy to dev.
Every restore and export emits an audit event with dataset, approver, and hashes of delivered files.

Secrets and salts

Keep pseudonymization salts in a secrets manager per environment. Rotate on schedule. Never commit salts.

Vendor systems

Never mirror vendor-held raw datasets into dev. Use contract-approved sandbox exports that already exclude PII or come pseudonymized.

Quick sanitize checklist

No prod PII in non-prod by default, with explicit exceptions documented
Backups minimized, encrypted with KMS, object lock enabled, and access tightly controlled
Crypto-shred capability via per-dataset or per-tenant keys
Sanitized schema or post-restore sanitizer before any developer access
Deterministic pseudonymization to preserve joins and behavior
Subsetting that maintains referential integrity
Media and documents stripped of EXIF and sensitive content
Automated PII scanners and audit logs gating every export or restore
Short retention for dev artifacts and lifecycle rules in object storage
Separate accounts or projects for backup stores and quarantine restores

Conclusion

Sanitizing is a pipeline, not a manual step. By encrypting and minimizing backups, generating privacy-safe datasets for non-prod, and enforcing automated gates with audit trails, you cut risk, speed up testing, and align with GDPR and CPRA expectations for data minimization and security.

Monitor systems and automate breach alerts

Automate retention jobs and TTLs