Developer Implementation Guide

Table of Content

Table of Content

Table of Content

Sanitize backups and test data

Keep production data safe and keep non-prod clean. Minimize what you back up, encrypt everything, and generate privacy-safe test data that preserves schema and behavior without exposing personal data.

Sanitize Backups and Test Data

Keep production data safe and keep non-prod clean. Minimize what you back up, encrypt everything, and generate privacy-safe test data that preserves schema and behavior without exposing personal data.

Strategy and scope

  • Default rule: no production PII in dev, test, or staging.

  • Classify datasets and mark fields as direct_idquasi_idsensitive, or public.

  • Decide per dataset: exclude from backup, backup encrypted only, or include but crypto-shred capable.

  • Document restore procedures that immediately sanitize any copy used outside prod.

Production backups: minimize and protect

  • Reduce blast radius: exclude high-risk tables or columns from routine backups if they can be rebuilt.

  • Encrypt at rest and in transit: use KMS-managed keys, separate accounts/projects for backup stores, and object lock or WORM where supported.

  • Crypto-shred plan: use envelope encryption per tenant or dataset so destroying a KEK renders old backups unreadable.

  • Access control: least privilege, break-glass, dual control on decrypt. Log every restore and decrypt.

  • Retention: keep only as long as business and legal require. Apply lifecycle rules and prove deletions with audit entries.

  • Test restores: monthly at minimum. Run a PII scan on restored samples and verify decrypt, rotate, and shred paths.

Non-prod data: generate, subset, and mask

  • Prefer synthetic or fuzzed data over masked prod data.

  • If you must start from prod, use pseudonymization and redaction with deterministic functions to keep referential integrity.

  • Subset by time or tenant to reduce volume. Maintain foreign key completeness.

Patterns and examples

Postgres: sanitized schema for exports

Create a dedicated schema of sanitized views or materialized tables. Dump only that schema for dev.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- App-wide secret for deterministic pseudonyms (do not store in repo)
-- SET app.pseudo_salt = 'rotate_me_in_secrets_manager';

CREATE SCHEMA devmask;

-- Helper functions
CREATE OR REPLACE FUNCTION devmask.pseudo_text(t text) RETURNS text AS $$
  SELECT encode(hmac(t, current_setting('app.pseudo_salt', true), 'sha256'),'hex')
$$ LANGUAGE sql IMMUTABLE;

CREATE OR REPLACE FUNCTION devmask.null_if_sensitive(b bool, v text) RETURNS text AS $$
  SELECT CASE WHEN b THEN NULL ELSE v END
$$ LANGUAGE sql IMMUTABLE;

-- Users view: keep structure, remove direct identifiers
CREATE OR REPLACE VIEW devmask.users AS
SELECT
  id,
  devmask.pseudo_text(email)          AS email_pseudo,
  split_part(devmask.pseudo_text(coalesce(phone,'')), '', 1) AS phone_pseudo,
  left(digest(id::text || 'name','sha256')::text, 12) AS display_name,
  false                               AS email_verified,
  created_at::timestamptz             AS created_at,
  NULL::inet                          AS last_ip,
  NULL::text                          AS address_line1,
  NULL::text                          AS address_line2,
  NULL::text                          AS city,
  NULL::text                          AS postal_code,
  country
FROM public.users;

-- Orders view: keep relationships, strip payload
CREATE OR REPLACE VIEW devmask.orders AS
SELECT
  o.id, o.user_id, o.status, o.total_cents, o.currency, o.created_at,
  NULL::jsonb AS payment_details,
  NULL::text  AS shipping_name
FROM public.orders o;

Dump only sanitized objects:

# Materialize if you prefer tables, then:
# pg_dump --schema=devmask --no-owner --file dev_sanitized.sql "$DATABASE_URL"

Block non-sanitized dumps in automation:

REVOKE USAGE ON SCHEMA public FROM dev_exporter;
GRANT USAGE ON SCHEMA devmask TO dev_exporter;

Postgres: batch masking after restore

If you restore prod into a quarantine VPC, run a one-time sanitizer job before any developer access.

-- Example redactions in place (use in isolated restore env only)
UPDATE users
SET email = substr(encode(digest(email || current_setting('app.pseudo_salt'),'sha256'),'hex'),1,16) || '@example.test',
    phone = NULL,
    address_line1 = NULL,
    address_line2 = NULL,
    postal_code   = NULL;

-- Verify no direct IDs remain
-- SELECT * FROM users WHERE email ~* '(@gmail\.com|@yahoo\.com)';

MongoDB: projection and export

Create a projection pipeline that drops or hashes sensitive fields and export only the result.

db.users.aggregate([
  { $match: { /* selection or sampling */ } },
  { $addFields: {
      email_pseudo: { $toHexString: { $sha256: { $concat: ["$email", SECRET_SALT] } } }
  }},
  { $project: {
      email: 0, phone: 0, address: 0, ip: 0
  }},
  { $out: "devmask_users" }
]);

Object storage and media

  • Do not copy raw uploads to dev. Generate derivatives only.

  • Strip EXIF and GPS from images. Replace PDFs containing PII with placeholder pages that preserve count and size envelope.

  • Lifecycle rules: keep dev artifacts short-lived (for example 7 to 30 days).

# Image EXIF strip example (server-side pipeline)
magick in.jpg -strip -resize 1600x in_sanitized.jpg

Subsetting with integrity

Keep relational integrity while sampling.

-- Sample 5% of tenants, then include dependent rows
WITH sampled_tenants AS (
  SELECT id FROM tenants WHERE random() < 0.05
),
users_in_scope AS (
  SELECT u.* FROM users u JOIN sampled_tenants t ON t.id = u.tenant_id
)
SELECT count(*) FROM users_in_scope;
-- Repeat pattern for orders, messages, etc. using tenant_id

Test data generators

Use deterministic seeds to make tests reproducible. Keep generators in the repo and run them in CI and local setup.

// Node sketch with faker-like patterns
import crypto from "crypto";
function seeded(n){ return crypto.createHash("sha256").update(String(n)).digest("hex"); }
function user(n){
  const id = n;
  return {
    id,
    email: `user${n}@example.test`,
    name: `User ${n}`,
    phone: null,
    address: null
  };
}

Guardrails and automation

PII scanning gates

  • Run regex and checksum detectors on sanitized dumps before release. Block if any email, phone, SSN, or card PAN patterns are found.

  • Maintain an allowlist of test domains (example.test, invalid, localhost).

Quarantine and approvals

  • Restores land in a quarantine project. Sanitizer runs. Scanner passes. Only then copy to dev.

  • Every restore and export emits an audit event with dataset, approver, and hashes of delivered files.

Secrets and salts

  • Keep pseudonymization salts in a secrets manager per environment. Rotate on schedule. Never commit salts.

Vendor systems

  • Never mirror vendor-held raw datasets into dev. Use contract-approved sandbox exports that already exclude PII or come pseudonymized.

Quick sanitize checklist

  • No prod PII in non-prod by default, with explicit exceptions documented

  • Backups minimized, encrypted with KMS, object lock enabled, and access tightly controlled

  • Crypto-shred capability via per-dataset or per-tenant keys

  • Sanitized schema or post-restore sanitizer before any developer access

  • Deterministic pseudonymization to preserve joins and behavior

  • Subsetting that maintains referential integrity

  • Media and documents stripped of EXIF and sensitive content

  • Automated PII scanners and audit logs gating every export or restore

  • Short retention for dev artifacts and lifecycle rules in object storage

  • Separate accounts or projects for backup stores and quarantine restores

Conclusion

Sanitizing is a pipeline, not a manual step. By encrypting and minimizing backups, generating privacy-safe datasets for non-prod, and enforcing automated gates with audit trails, you cut risk, speed up testing, and align with GDPR and CPRA expectations for data minimization and security.