DataOps Suite — Enterprise Data Management
Conversational data catalog, automated metadata generation, and format-preserving obfuscation across 12 industries
Problem Statement
Amazon's PX Central Science team managed hundreds of datasets across 12 industry verticals. Metadata was manually written, quality checks were ad-hoc, and PII obfuscation required custom scripts per dataset — creating bottlenecks that slowed data sharing, introduced compliance risks, and consumed 60% of the team's bandwidth on routine data management tasks.
DataOps Suite automates the entire data lifecycle. A 4-step Ingest Wizard handles upload, quality validation, metadata generation, and catalog publication. The Data Catalog provides searchable, browsable access with full schema documentation, quality reports, and lineage tracking. The Obfuscation Service applies format-preserving HMAC-SHA256 transformations to PII columns. Results are deterministic, and re-identification is possible only for holders of the seed, which enables safe data sharing while maintaining referential integrity.
System Architecture
The Ingest Wizard accepts CSV/JSON uploads (capped at 20 rows for LLM cost control), runs a hybrid quality pipeline (deterministic + LLM semantic checks), and generates column-level metadata via Anthropic Claude. Published datasets are persisted to the catalog with full provenance tracking.
The Obfuscation Service uses HMAC-SHA256 via the Web Crypto API (crypto.subtle.sign) with a deterministic seed (DEMO_SEED_2024). In production, the seed would be a KMS Customer Master Key (CMK) with IAM-controlled access — the seed IS the re-identification key, so access control is the security boundary.
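As a rough illustration of the HMAC step, the sketch below signs a value with a seed through the Web Crypto API. The helper name `hmacHex` and the hex encoding of the result are assumptions made for the example, not details from the service itself. It also assumes a runtime that exposes a global `crypto.subtle` (browsers, recent Node.js).

```typescript
// Hypothetical helper: HMAC-SHA256 of `value`, keyed by `seed`, returned as hex.
// In production the seed would come from KMS rather than a string literal.
async function hmacHex(seed: string, value: string): Promise<string> {
  const enc = new TextEncoder();
  // Import the raw seed bytes as a non-extractable HMAC key.
  const key = await crypto.subtle.importKey(
    "raw",
    enc.encode(seed),
    { name: "HMAC", hash: "SHA-256" },
    false,
    ["sign"],
  );
  // sign() resolves to an ArrayBuffer holding the 32-byte MAC.
  const mac = await crypto.subtle.sign("HMAC", key, enc.encode(value));
  // Hex-encode each byte with zero padding.
  return Array.from(new Uint8Array(mac))
    .map((b) => ("0" + b.toString(16)).slice(-2))
    .join("");
}
```

Calling `hmacHex("DEMO_SEED_2024", "alice@example.com")` twice yields the same 64-character hex string, which is the property that makes joins on obfuscated keys possible.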
Format preservation ensures obfuscated data maintains the same structure as the original: emails remain valid email formats, phone numbers keep their digit patterns, SSNs preserve the XXX-XX-XXXX format, and names remain pronounceable strings. Running the same input twice always produces the same output (deterministic), enabling cross-dataset joins on obfuscated keys.
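One way to picture format preservation for digit-based fields (phone numbers, SSNs) is the sketch below. It assumes the 32-byte HMAC-SHA256 output for the original value is already available as `digest`; the function name `preserveDigitFormat` and the byte-modulo-10 mapping are illustrative choices, not the service's actual scheme. Emails and names would need their own mappings, which are not shown.

```typescript
// Replace every digit in `original` with a digit derived from `digest`,
// leaving separators (dashes, spaces, parentheses) exactly where they were.
// Note: byte % 10 is slightly biased toward low digits; acceptable for a demo.
function preserveDigitFormat(original: string, digest: Uint8Array): string {
  if (digest.length === 0) {
    throw new Error("digest must contain at least one byte");
  }
  let i = 0;
  return original.replace(/\d/g, () => {
    const byte = digest[i % digest.length] ?? 0;
    i += 1;
    return String(byte % 10);
  });
}

// Example with a made-up digest (a real one would be the HMAC output):
const demoDigest = Uint8Array.from([12, 255, 7, 30, 41, 58, 69, 93, 104]);
const maskedSsn = preserveDigitFormat("123-45-6789", demoDigest); // "257-01-8934"
```

Because the digest is a deterministic function of the input and the seed, the same SSN always maps to the same masked SSN, and the XXX-XX-XXXX shape is untouched.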
Data Model
Catalog Entries
| Column | Type | Notes |
|---|---|---|
| datasetId | VARCHAR | PK (ds-001 through ds-012) |
| fileName | VARCHAR | Source JSON file name |
| displayName | VARCHAR | Human-readable dataset name |
| description | TEXT | Auto-generated by LLM |
| industryTag | ENUM | 12 industries (FINTECH, HEALTHCARE, etc.) |
| classification | ENUM | PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED |
| regulatoryFlags | ARRAY | GDPR, HIPAA, PCI-DSS, SOX, FERPA |
| columns | ARRAY | Column metadata with PII type and obfuscation rules |
| rowCount | INT | 500 rows per dataset |
| qualityScore | INT | 0-100, deducted by issue severity |
| owner | VARCHAR | Dataset steward |
| publishedAt | DATETIME | Catalog publication timestamp |
Quality Reports
| Column | Type | Notes |
|---|---|---|
| datasetId | VARCHAR | FK → catalog |
| overallScore | INT | 0-100 (CRITICAL=-8, WARNING=-3, INFO=-1) |
| issues | ARRAY | Typed issues: NULL_RATE, DUPLICATE, OUTLIER, etc. |
| columnHealth | ARRAY | Per-column healthScore, nullRate, uniqueRate |
| issueTypes | ENUM[] | 11 types including SEMANTIC and BUSINESS_LOGIC |
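The source does not define how the per-column `nullRate` and `uniqueRate` values are computed, so the sketch below is only one plausible reading. It assumes `nullRate` is the share of rows whose value is null, undefined, or an empty string, and `uniqueRate` is the number of distinct non-null values divided by the number of non-null values. `columnStats` and `Row` are hypothetical names, and `healthScore` is left out because its formula is not given.

```typescript
type Row = Record<string, unknown>;

interface ColumnStats {
  nullRate: number;   // assumed: missing values / total rows
  uniqueRate: number; // assumed: distinct present values / present values
}

function columnStats(rows: Row[], column: string): ColumnStats {
  if (rows.length === 0) return { nullRate: 0, uniqueRate: 0 };
  // Keep only values that count as "present" under the assumption above.
  const present = rows
    .map((r) => r[column])
    .filter((v) => v !== null && v !== undefined && v !== "");
  // JSON.stringify gives a stable key so 1 and "1" stay distinct.
  const distinct = new Set(present.map((v) => JSON.stringify(v)));
  return {
    nullRate: (rows.length - present.length) / rows.length,
    uniqueRate: present.length === 0 ? 0 : distinct.size / present.length,
  };
}
```

For four rows where `email` is `"a@x.com"`, `"a@x.com"`, `null`, and `"b@x.com"`, this gives a `nullRate` of 0.25 and a `uniqueRate` of 2/3.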
Obfuscation Jobs
| Column | Type | Notes |
|---|---|---|
| jobId | VARCHAR | PK (obfuscation job identifier) |
| datasetId | VARCHAR | FK → catalog |
| status | ENUM | COMPLETED, FAILED, RUNNING |
| piiColumnsObfuscated | INT | Number of PII columns processed |
| rowsProcessed | INT | Always 500 for demo datasets |
| processingTimeMs | INT | 800-3000 ms range |
| seedVersion | VARCHAR | HMAC seed used (re-id key) |
| reidentificationRequests | INT | Approved re-id request count |
Quality Scoring Model
Score = 100 - (CRITICAL x 8) - (WARNING x 3) - (INFO x 1), clamped [0, 100]
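The formula above translates directly into a few lines of code. The sketch below is a minimal version that takes a list of issue severities; the names `qualityScore` and `PENALTY` are illustrative.

```typescript
type Severity = "CRITICAL" | "WARNING" | "INFO";

// Points deducted per issue, as stated in the scoring formula.
const PENALTY: Record<Severity, number> = { CRITICAL: 8, WARNING: 3, INFO: 1 };

function qualityScore(severities: Severity[]): number {
  const deducted = severities.reduce((sum, s) => sum + PENALTY[s], 0);
  // Clamp to [0, 100] so heavily flagged datasets bottom out at 0.
  return Math.max(0, Math.min(100, 100 - deducted));
}
```

For example, one CRITICAL, two WARNING, and one INFO issue give 100 - 8 - 6 - 1 = 85, while thirteen CRITICAL issues would compute to -4 and clamp to 0.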
CRITICAL (-8 pts)
WARNING (-3 pts)
INFO (-1 pt)
LLM Enhancements
Automated Quality Check ✦
Combines 10 deterministic checks (null rates, duplicates, format validation) with 5 business logic rules and LLM-powered semantic analysis. Gracefully falls back to deterministic-only if LLM is unavailable.
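The fallback behavior can be sketched as follows, assuming the deterministic checks are plain synchronous functions and the LLM semantic check is an async function that rejects when the model is unavailable. All names here (`runQualityChecks`, `Issue`, `semanticCheck`, `llmUsed`) are hypothetical and only illustrate the control flow.

```typescript
type Severity = "CRITICAL" | "WARNING" | "INFO";
interface Issue { type: string; severity: Severity; column?: string; }
type Row = Record<string, unknown>;

async function runQualityChecks(
  rows: Row[],
  deterministicChecks: Array<(rows: Row[]) => Issue[]>,
  semanticCheck: (rows: Row[]) => Promise<Issue[]>,
): Promise<{ issues: Issue[]; llmUsed: boolean }> {
  const issues: Issue[] = [];
  // Deterministic checks always run and never depend on the LLM.
  for (const check of deterministicChecks) {
    issues.push(...check(rows));
  }
  try {
    // Semantic issues are appended only if the LLM call succeeds.
    issues.push(...(await semanticCheck(rows)));
    return { issues, llmUsed: true };
  } catch {
    // LLM unavailable: return the deterministic-only result.
    return { issues, llmUsed: false };
  }
}
```

Returning an `llmUsed` flag is a design choice in this sketch: it lets the quality report show whether the score reflects semantic analysis or deterministic checks alone.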
Represents how I would build this system today with LLM capabilities
Metadata Generation ✦
LLM generates comprehensive column-level metadata — descriptions, PII classification, sensitivity levels, obfuscation rules, and example values. Uses max_tokens 8000 for full JSON coverage across all columns.
Ask DataVault ✦
Conversational catalog search powered by LLM. Ask questions like "Which datasets contain financial PII?" and get inline dataset cards with direct links. Understands industry context and regulatory implications.
Obfuscation Rule Suggestion ✦
LLM analyzes column metadata and PII types to suggest optimal obfuscation strategies — format-preserving HMAC for IDs, consistent hashing for emails, redaction for SSNs. JSON-only response for direct application.
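Applying a JSON-only reply directly is safer with a validation step in between. The sketch below assumes the reply is a JSON array of `{ column, strategy }` objects. That shape and the strategy identifiers are hypothetical, since the source names the strategies only in prose. Anything malformed or unrecognized is dropped rather than applied.

```typescript
// Hypothetical identifiers for the three strategies mentioned in the text.
const STRATEGIES = ["FORMAT_PRESERVING_HMAC", "CONSISTENT_HASH", "REDACT"] as const;
type Strategy = (typeof STRATEGIES)[number];

interface RuleSuggestion { column: string; strategy: Strategy; }

function parseRuleSuggestions(raw: string): RuleSuggestion[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // reply was not valid JSON: apply nothing
  }
  if (!Array.isArray(parsed)) return [];
  const accepted: RuleSuggestion[] = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) continue;
    const { column, strategy } = item as Record<string, unknown>;
    // Keep only entries with a string column and a known strategy.
    if (
      typeof column === "string" &&
      typeof strategy === "string" &&
      (STRATEGIES as readonly string[]).indexOf(strategy) !== -1
    ) {
      accepted.push({ column, strategy: strategy as Strategy });
    }
  }
  return accepted;
}
```

Returning an empty list on bad input means a malformed model reply leaves the dataset's existing obfuscation rules untouched.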
Key Metrics
$1.2M
Annual Savings
99%
Metadata Accuracy
12
Industries Covered
99.9%
Service Availability