DataOps Suite — Enterprise Data Management

Conversational data catalog, automated metadata generation, and format-preserving obfuscation across 12 industries
Python
ReactJS
AWS Lambda
DynamoDB
KMS
S3
Anthropic Claude
HMAC-SHA256

Problem Statement

Amazon's PX Central Science team managed hundreds of datasets across 12 industry verticals. Metadata was manually written, quality checks were ad-hoc, and PII obfuscation required custom scripts per dataset — creating bottlenecks that slowed data sharing, introduced compliance risks, and consumed 60% of the team's bandwidth on routine data management tasks.

DataOps Suite automates the entire data lifecycle. A 4-step Ingest Wizard handles upload, quality validation, metadata generation, and catalog publication. The Data Catalog provides searchable, browsable access with full schema documentation, quality reports, and lineage tracking. The Obfuscation Service applies format-preserving HMAC-SHA256 transformations to PII columns. Results are deterministic and can be re-identified only by someone with access to the seed, which enables safe data sharing while maintaining referential integrity.

System Architecture

The Ingest Wizard accepts CSV/JSON uploads (capped at 20 rows for LLM cost control), runs a hybrid quality pipeline (deterministic + LLM semantic checks), and generates column-level metadata via Anthropic Claude. Published datasets are persisted to the catalog with full provenance tracking.
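
The 20-row cap can be illustrated with a minimal Python sketch. It assumes the cap is applied by taking the first 20 parsed rows of a CSV upload; the source does not say how rows are selected, and the names MAX_LLM_ROWS and sample_rows_for_llm are hypothetical.

```python
import csv
import io
from itertools import islice

MAX_LLM_ROWS = 20  # cap stated in the text; the constant name is hypothetical


def sample_rows_for_llm(csv_text: str, limit: int = MAX_LLM_ROWS) -> list[dict]:
    """Parse CSV text and return at most `limit` rows as plain dicts.

    Assumption: the first `limit` rows are used. A real pipeline might
    sample randomly or stratify instead.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in islice(reader, limit)]
```

Capping the rows before any prompt is built keeps the LLM cost bounded regardless of how large the uploaded file is.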

The Obfuscation Service computes HMAC-SHA256 via the Web Crypto API (crypto.subtle.sign) with a fixed demo seed (DEMO_SEED_2024). In production, the seed would be an AWS KMS key (a customer managed key, formerly called a Customer Master Key or CMK) with IAM-controlled access. The seed is the re-identification key, so access control on the seed is the security boundary.

Format preservation ensures obfuscated data maintains the same structure as the original: emails remain valid email formats, phone numbers keep their digit patterns, SSNs preserve the XXX-XX-XXXX format, and names remain pronounceable strings. Running the same input twice always produces the same output (deterministic), enabling cross-dataset joins on obfuscated keys.
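
The idea of deterministic, format-preserving obfuscation can be sketched in Python with the standard hmac module. This is a simplified illustration, not the service's implementation (which the text says runs on the Web Crypto API): it swaps each digit for a digit and each letter for a letter of the same case, and leaves separators such as "@", "." and "-" untouched. The function names are hypothetical, it does not attempt pronounceable names, and it covers only the forward transformation.

```python
import hashlib
import hmac
import string

DEMO_SEED = b"DEMO_SEED_2024"  # demo seed named in the text; not for production use


def _keystream(seed: bytes, value: str):
    """Yield an endless stream of bytes derived from HMAC-SHA256(seed, counter:value)."""
    counter = 0
    while True:
        message = f"{counter}:{value}".encode("utf-8")
        yield from hmac.new(seed, message, hashlib.sha256).digest()
        counter += 1


def obfuscate_preserving_format(value: str, seed: bytes = DEMO_SEED) -> str:
    """Replace digits with digits and ASCII letters with same-case letters.

    All other characters are kept, so the overall shape of the value
    (for example XXX-XX-XXXX) survives. The modulo step has a slight
    bias, which is acceptable for a sketch but not for production.
    """
    stream = _keystream(seed, value)
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(string.digits[next(stream) % 10])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[next(stream) % 26])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[next(stream) % 26])
        else:
            out.append(ch)
    return "".join(out)
```

Because the output depends only on the seed and the input value, the same customer identifier maps to the same token in every dataset, which is what makes joins on obfuscated keys possible.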

Data Model

Catalog Entries
Column | Type | Notes
datasetId | VARCHAR | PK; ds-001 through ds-012
fileName | VARCHAR | Source JSON file name
displayName | VARCHAR | Human-readable dataset name
description | TEXT | Auto-generated by LLM
industryTag | ENUM | 12 industries (FINTECH, HEALTHCARE, etc.)
classification | ENUM | PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED
regulatoryFlags | ARRAY | GDPR, HIPAA, PCI-DSS, SOX, FERPA
columns | ARRAY | Column metadata with PII type and obfuscation rules
rowCount | INT | 500 rows per dataset
qualityScore | INT | 0-100, deducted by issue severity
owner | VARCHAR | Dataset steward
publishedAt | DATETIME | Catalog publication timestamp

Quality Reports
Column | Type | Notes
datasetId | VARCHAR | FK → catalog
overallScore | INT | 0-100 (CRITICAL=-8, WARNING=-3, INFO=-1)
issues | ARRAY | Typed issues: NULL_RATE, DUPLICATE, OUTLIER, etc.
columnHealth | ARRAY | Per-column healthScore, nullRate, uniqueRate
issueTypes | ENUM[] | 11 types including SEMANTIC and BUSINESS_LOGIC

Obfuscation Jobs
Column | Type | Notes
jobId | VARCHAR | PK; obfuscation job identifier
datasetId | VARCHAR | FK → catalog
status | ENUM | COMPLETED, FAILED, RUNNING
piiColumnsObfuscated | INT | Number of PII columns processed
rowsProcessed | INT | Always 500 for demo datasets
processingTimeMs | INT | 800-3000 ms range
seedVersion | VARCHAR | HMAC seed used (re-id key)
reidentificationRequests | INT | Approved re-id request count

Quality Scoring Model

Score = 100 - (CRITICAL x 8) - (WARNING x 3) - (INFO x 1), clamped [0, 100]

CRITICAL (-8 pts): Duplicate keys, severe null rates, schema violations
WARNING (-3 pts): Format inconsistencies, computed drift, outliers
INFO (-1 pt): Minor completeness gaps, style inconsistencies
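
The scoring formula translates directly into a few lines of Python. The sketch below assumes issues arrive as a list of severity strings; the names SEVERITY_PENALTY and quality_score are hypothetical.

```python
SEVERITY_PENALTY = {"CRITICAL": 8, "WARNING": 3, "INFO": 1}  # points deducted per issue


def quality_score(severities: list[str]) -> int:
    """Return 100 minus the summed penalties, clamped to the range [0, 100]."""
    penalty = sum(SEVERITY_PENALTY[severity] for severity in severities)
    return max(0, min(100, 100 - penalty))
```

For example, a dataset with 2 CRITICAL, 3 WARNING, and 4 INFO issues scores 100 - 16 - 9 - 4 = 71, while a dataset with 20 CRITICAL issues is clamped to 0 rather than going negative.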

LLM Enhancements

Automated Quality Check

Combines 10 deterministic checks (null rates, duplicates, format validation) with 5 business logic rules and LLM-powered semantic analysis. Gracefully falls back to deterministic-only if LLM is unavailable.
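
The fallback behaviour can be sketched as follows. This is a minimal illustration with only two of the deterministic checks (duplicate keys and null rate). The 20% null-rate threshold, the severity assigned to each issue, and the injected semantic_check callable standing in for the LLM call are all assumptions, not details from the system.

```python
from typing import Callable, Optional


def deterministic_checks(rows: list[dict], key_column: str,
                         max_null_rate: float = 0.2) -> list[dict]:
    """Run two illustrative rule-based checks and return typed issues."""
    issues: list[dict] = []
    if not rows:
        return issues
    # Duplicate keys: any repeated value in the key column.
    keys = [row.get(key_column) for row in rows]
    if len(set(keys)) < len(keys):
        issues.append({"type": "DUPLICATE", "severity": "CRITICAL", "column": key_column})
    # Null rate: share of empty or missing values per column.
    for column in rows[0]:
        nulls = sum(1 for row in rows if row.get(column) in (None, ""))
        if nulls / len(rows) > max_null_rate:
            issues.append({"type": "NULL_RATE", "severity": "WARNING", "column": column})
    return issues


def run_quality_checks(rows: list[dict], key_column: str,
                       semantic_check: Optional[Callable[[list[dict]], list[dict]]] = None) -> dict:
    """Combine deterministic issues with LLM issues, degrading gracefully."""
    issues = deterministic_checks(rows, key_column)
    llm_used = False
    if semantic_check is not None:
        try:
            issues.extend(semantic_check(rows))
            llm_used = True
        except Exception:
            # LLM unavailable or failed: keep the deterministic results only.
            pass
    return {"issues": issues, "llm_used": llm_used}
```

Running the deterministic checks first means a report is always produced; the LLM step can only add issues, never block the pipeline.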

Each enhancement in this section represents how I would build this system today with LLM capabilities.

Metadata Generation

LLM generates comprehensive column-level metadata — descriptions, PII classification, sensitivity levels, obfuscation rules, and example values. Uses max_tokens 8000 for full JSON coverage across all columns.
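
One way to structure this step is to keep the model call behind a small function and validate the JSON before it reaches the catalog. In the sketch below, complete is a hypothetical callable that wraps the actual Claude request (not shown) and returns the response text. The field names in REQUIRED_FIELDS are also assumptions based on the metadata listed above.

```python
import json
from typing import Callable

MAX_TOKENS = 8000  # value stated in the text

# Hypothetical field names for the per-column metadata.
REQUIRED_FIELDS = ("description", "piiType", "sensitivity", "obfuscationRule", "examples")


def build_metadata_prompt(columns: list[str], sample_rows: list[dict]) -> str:
    """Build a prompt asking for a JSON object keyed by column name."""
    fields = ", ".join(REQUIRED_FIELDS)
    return (
        "Return only a JSON object keyed by column name. "
        f"For each column include: {fields}.\n"
        f"Columns: {json.dumps(columns)}\n"
        f"Sample rows: {json.dumps(sample_rows)}"
    )


def generate_column_metadata(columns: list[str], sample_rows: list[dict],
                             complete: Callable[[str, int], str]) -> dict:
    """Call the model via `complete` and check every column is fully described."""
    raw = complete(build_metadata_prompt(columns, sample_rows), MAX_TOKENS)
    metadata = json.loads(raw)
    for column in columns:
        entry = metadata.get(column)
        if not isinstance(entry, dict):
            raise ValueError(f"LLM response is missing column: {column}")
        absent = [field for field in REQUIRED_FIELDS if field not in entry]
        if absent:
            raise ValueError(f"Column {column} is missing fields: {absent}")
    return metadata
```

Validating coverage per column catches the case where a response is truncated before all columns are described, which is the failure a generous token budget is meant to avoid.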

Ask DataVault

Conversational catalog search powered by LLM. Ask questions like "Which datasets contain financial PII?" and get inline dataset cards with direct links. Understands industry context and regulatory implications.

Obfuscation Rule Suggestion

LLM analyzes column metadata and PII types to suggest optimal obfuscation strategies — format-preserving HMAC for IDs, consistent hashing for emails, redaction for SSNs. JSON-only response for direct application.
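
A JSON-only response is easy to apply directly, but it still needs a guard against malformed or unexpected output. The sketch below accepts a suggestion only if it names a known strategy and otherwise falls back to a default per PII type. The strategy names, the PII type labels, and the fallback behaviour are all assumptions for illustration; the defaults mirror the pairings mentioned above.

```python
import json

# Hypothetical strategy identifiers.
ALLOWED_STRATEGIES = {"FORMAT_PRESERVING_HMAC", "CONSISTENT_HASH", "REDACT", "NONE"}

# Defaults mirroring the pairings in the text: IDs, emails, SSNs.
DEFAULT_STRATEGY_BY_PII_TYPE = {
    "ID": "FORMAT_PRESERVING_HMAC",
    "EMAIL": "CONSISTENT_HASH",
    "SSN": "REDACT",
}


def parse_rule_suggestions(raw_response: str, pii_types: dict[str, str]) -> dict[str, str]:
    """Return {column: strategy}, preferring valid LLM suggestions over defaults."""
    try:
        suggested = json.loads(raw_response)
    except json.JSONDecodeError:
        suggested = {}
    if not isinstance(suggested, dict):
        suggested = {}
    rules = {}
    for column, pii_type in pii_types.items():
        candidate = suggested.get(column)
        if isinstance(candidate, str) and candidate in ALLOWED_STRATEGIES:
            rules[column] = candidate
        else:
            rules[column] = DEFAULT_STRATEGY_BY_PII_TYPE.get(pii_type, "NONE")
    return rules
```

With this guard, a response containing prose instead of JSON, or an unknown strategy name, degrades to the per-type defaults rather than failing the job.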

Key Metrics

Annual Savings: $1.2M
Metadata Accuracy: 99%
Industries Covered: 12
Service Availability: 99.9%