Misbahuddin Mohammed — Engineering Portfolio

DataOps Suite — Enterprise Data Management

Conversational data catalog, automated metadata generation, and format-preserving obfuscation across 12 industries

Python, ReactJS, AWS Lambda, DynamoDB, KMS, Anthropic Claude, HMAC-SHA256

Problem Statement

Amazon's PX Central Science team managed hundreds of datasets across 12 industry verticals. Metadata was manually written, quality checks were ad-hoc, and PII obfuscation required custom scripts per dataset — creating bottlenecks that slowed data sharing, introduced compliance risks, and consumed 60% of the team's bandwidth on routine data management tasks.

DataOps Suite automates the entire data lifecycle. A 4-step Ingest Wizard handles upload, quality validation, metadata generation, and catalog publication. The Data Catalog provides searchable, browsable access with full schema documentation, quality reports, and lineage tracking. The Obfuscation Service applies format-preserving, HMAC-SHA256-based transformations to PII columns. Results are deterministic and can be re-identified only by holders of the seed, which enables safe data sharing while maintaining referential integrity.

System Architecture

The Ingest Wizard accepts CSV/JSON uploads (capped at 20 rows for LLM cost control), runs a hybrid quality pipeline (deterministic + LLM semantic checks), and generates column-level metadata via Anthropic Claude. Published datasets are persisted to the catalog with full provenance tracking.

The Obfuscation Service uses HMAC-SHA256 via the Web Crypto API (crypto.subtle.sign) with a fixed demo seed (DEMO_SEED_2024). In production, the seed would be a KMS key (called a Customer Master Key, CMK, in older AWS documentation) with IAM-controlled access. Because the seed is the re-identification key, access control on it is the security boundary.
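HMAC is a one-way function, so "re-identification" with the seed cannot mean decrypting a token. One plausible reading, sketched below, is that a seed holder recomputes tokens for a set of known original values and matches them against the obfuscated column. The helper names `token_for` and `build_reid_index` are hypothetical, and the sketch uses Python's standard `hmac` module rather than the Web Crypto API named above.

```python
import hashlib
import hmac


def token_for(value: str, seed: bytes) -> str:
    """Hex HMAC-SHA256 token for a value under the given seed."""
    return hmac.new(seed, value.encode("utf-8"), hashlib.sha256).hexdigest()


def build_reid_index(originals: list[str], seed: bytes) -> dict[str, str]:
    """Map token -> original value; only possible for someone holding the seed."""
    return {token_for(value, seed): value for value in originals}


if __name__ == "__main__":
    seed = b"DEMO_SEED_2024"  # demo seed from the text, not a production secret
    index = build_reid_index(["alice@example.com", "bob@example.com"], seed)
    print(index[token_for("bob@example.com", seed)])
```

Without the seed, the same candidate list produces different tokens, so the index is useless to anyone else. That is why access to the seed is the control that matters.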

Format preservation ensures obfuscated data maintains the same structure as the original: emails remain valid email formats, phone numbers keep their digit patterns, SSNs preserve the XXX-XX-XXXX format, and names remain pronounceable strings. The transformation is deterministic: the same input under the same seed always produces the same output, which enables cross-dataset joins on obfuscated keys.
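As a minimal sketch of the idea, the following Python derives replacement characters from an HMAC-SHA256 digest while leaving separators in place. It is an illustration under assumptions, not the service's implementation: the function names are hypothetical, the choice to keep the email domain is an assumption, and Python's `hmac` module stands in for crypto.subtle.sign.

```python
import hashlib
import hmac

SEED = b"DEMO_SEED_2024"  # demo seed named in the text


def _digest(value: str, seed: bytes = SEED) -> bytes:
    """HMAC-SHA256 of the value under the seed (32 bytes)."""
    return hmac.new(seed, value.encode("utf-8"), hashlib.sha256).digest()


def obfuscate_digits(value: str, seed: bytes = SEED) -> str:
    """Replace each ASCII digit with an HMAC-derived digit; keep other characters."""
    stream = _digest(value, seed)
    out = []
    used = 0
    for ch in value:
        if ch.isascii() and ch.isdigit():
            # byte % 10 is slightly biased; acceptable for a sketch
            out.append(str(stream[used % len(stream)] % 10))
            used += 1
        else:
            out.append(ch)
    return "".join(out)


def obfuscate_email(value: str, seed: bytes = SEED) -> str:
    """Replace the local part with a token; keep the domain (an assumption)."""
    local, _, domain = value.partition("@")
    token = _digest(local.lower(), seed).hex()[:10]
    return f"user_{token}@{domain}"


if __name__ == "__main__":
    print(obfuscate_digits("123-45-6789"))      # same XXX-XX-XXXX shape
    print(obfuscate_email("jane.doe@example.com"))
```

Because the output depends only on the input and the seed, the same SSN or email maps to the same token in every dataset, which is what makes joins on obfuscated keys work.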

Data Model

Catalog Entries

Column	Type	Notes
`datasetId`	VARCHAR	PK — ds-001 through ds-012
`fileName`	VARCHAR	Source JSON file name
`displayName`	VARCHAR	Human-readable dataset name
`description`	TEXT	Auto-generated by LLM
`industryTag`	ENUM	12 industries (FINTECH, HEALTHCARE, etc.)
`classification`	ENUM	PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED
`regulatoryFlags`	ARRAY	GDPR, HIPAA, PCI-DSS, SOX, FERPA
`columns`	ARRAY	Column metadata with PII type and obfuscation rules
`rowCount`	INT	500 rows per dataset
`qualityScore`	INT	0-100, deducted by issue severity
`owner`	VARCHAR	Dataset steward
`publishedAt`	DATETIME	Catalog publication timestamp

Quality Reports

Column	Type	Notes
`datasetId`	VARCHAR	FK → catalog
`overallScore`	INT	0-100 (CRITICAL=-8, WARNING=-3, INFO=-1)
`issues`	ARRAY	Typed issues: NULL_RATE, DUPLICATE, OUTLIER, etc.
`columnHealth`	ARRAY	Per-column healthScore, nullRate, uniqueRate
`issueTypes`	ENUM[]	11 types including SEMANTIC and BUSINESS_LOGIC
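The per-column nullRate and uniqueRate fields could be computed along these lines. This is a sketch under assumptions: empty strings are treated as null, uniqueRate is taken to mean distinct non-null values divided by non-null values, and the `column_health` name is hypothetical. The healthScore formula is not given in the text, so it is not computed here.

```python
def column_health(rows: list[dict], column: str) -> dict:
    """Compute nullRate and uniqueRate for one column of a row-oriented dataset."""
    values = [row.get(column) for row in rows]
    total = len(values)
    if total == 0:
        return {"column": column, "nullRate": 0.0, "uniqueRate": 0.0}
    # Assumption: None and empty string both count as null.
    non_null = [v for v in values if v is not None and v != ""]
    null_rate = (total - len(non_null)) / total
    unique_rate = len(set(non_null)) / len(non_null) if non_null else 0.0
    return {"column": column, "nullRate": null_rate, "uniqueRate": unique_rate}


if __name__ == "__main__":
    sample = [
        {"id": 1, "email": "a@x.com"},
        {"id": 2, "email": None},
        {"id": 3, "email": "a@x.com"},
        {"id": 4, "email": "b@x.com"},
    ]
    print(column_health(sample, "email"))
```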

Obfuscation Jobs

Column	Type	Notes
`jobId`	VARCHAR	PK — obfuscation job identifier
`datasetId`	VARCHAR	FK → catalog
`status`	ENUM	COMPLETED, FAILED, RUNNING
`piiColumnsObfuscated`	INT	Number of PII columns processed
`rowsProcessed`	INT	Always 500 for demo datasets
`processingTimeMs`	INT	800-3000ms range
`seedVersion`	VARCHAR	HMAC seed used (re-id key)
`reidentificationRequests`	INT	Approved re-id request count

Quality Scoring Model

Score = 100 - (CRITICAL × 8) - (WARNING × 3) - (INFO × 1), clamped to [0, 100]
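The formula translates directly into a few lines of Python. The function name `quality_score` and the input shape (a list of severity strings) are assumptions for illustration.

```python
PENALTIES = {"CRITICAL": 8, "WARNING": 3, "INFO": 1}


def quality_score(severities: list[str]) -> int:
    """Start at 100, subtract the penalty for each issue, clamp to [0, 100]."""
    deduction = sum(PENALTIES[severity] for severity in severities)
    return max(0, min(100, 100 - deduction))


if __name__ == "__main__":
    # One CRITICAL, two WARNING, one INFO: 100 - 8 - 6 - 1
    print(quality_score(["CRITICAL", "WARNING", "WARNING", "INFO"]))  # 85
```

The clamp matters for badly damaged datasets: thirteen CRITICAL issues would otherwise yield a score of -4.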

CRITICAL (-8 pts)

Duplicate keys, severe null rates, schema violations

WARNING (-3 pts)

Format inconsistencies, computed drift, outliers

INFO (-1 pt)

Minor completeness gaps, style inconsistencies

LLM Enhancements

Automated Quality Check ✦

Combines 10 deterministic checks (null rates, duplicates, format validation) with 5 business logic rules and LLM-powered semantic analysis. Falls back gracefully to the deterministic checks alone if the LLM is unavailable.
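The fallback behavior can be sketched as follows. This is an assumed structure, not the actual pipeline: the semantic check is passed in as a plain callable so that any failure (timeout, API error) simply leaves the deterministic results in place. `duplicate_key_check`, `run_quality_checks`, and the `llmUsed` field are hypothetical names.

```python
from typing import Callable, Iterable, Optional

Rows = list[dict]
Check = Callable[[Rows], Iterable[dict]]


def duplicate_key_check(rows: Rows, key: str = "id") -> list[dict]:
    """Deterministic example check: flag repeated values in a key column."""
    seen, issues = set(), []
    for row in rows:
        value = row.get(key)
        if value in seen:
            issues.append({"type": "DUPLICATE", "severity": "CRITICAL", "column": key})
        seen.add(value)
    return issues


def run_quality_checks(
    rows: Rows,
    deterministic_checks: list[Check],
    semantic_check: Optional[Check] = None,
) -> dict:
    """Run deterministic checks always; add LLM issues only if the call succeeds."""
    issues: list[dict] = []
    for check in deterministic_checks:
        issues.extend(check(rows))
    llm_used = False
    if semantic_check is not None:
        try:
            semantic_issues = list(semantic_check(rows))
        except Exception:
            semantic_issues = []  # LLM unavailable: keep deterministic results
        else:
            llm_used = True
        issues.extend(semantic_issues)
    return {"issues": issues, "llmUsed": llm_used}
```

Recording whether the LLM ran lets a report distinguish "no semantic issues found" from "semantic analysis skipped".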

Represents how I would build this system today with LLM capabilities

Metadata Generation ✦

The LLM generates comprehensive column-level metadata: descriptions, PII classification, sensitivity levels, obfuscation rules, and example values. Requests set max_tokens to 8000 so the JSON response can cover every column.

Represents how I would build this system today with LLM capabilities

Ask DataVault ✦

Conversational catalog search powered by an LLM. Ask questions like "Which datasets contain financial PII?" and get inline dataset cards with direct links. The assistant understands industry context and regulatory implications.

Represents how I would build this system today with LLM capabilities

Obfuscation Rule Suggestion ✦

The LLM analyzes column metadata and PII types to suggest obfuscation strategies: format-preserving HMAC for IDs, consistent hashing for emails, redaction for SSNs. The response is JSON-only so that suggested rules can be applied directly.
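Applying a JSON-only response directly is safest when it is validated first. The sketch below assumes a response shaped as a flat object mapping column names to strategy names. That shape, the strategy identifiers, and the `parse_rule_suggestions` helper are all hypothetical and not taken from the actual system.

```python
import json

# Hypothetical identifiers for the three strategies mentioned in the text.
ALLOWED_STRATEGIES = {"FORMAT_PRESERVING_HMAC", "CONSISTENT_HASH", "REDACT"}


def parse_rule_suggestions(raw: str, known_columns: set[str]) -> dict[str, str]:
    """Parse an LLM response and reject anything outside the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping column -> strategy")
    rules: dict[str, str] = {}
    for column, strategy in data.items():
        if column not in known_columns:
            raise ValueError(f"unknown column: {column}")
        if strategy not in ALLOWED_STRATEGIES:
            raise ValueError(f"unsupported strategy for {column}: {strategy}")
        rules[column] = strategy
    return rules


if __name__ == "__main__":
    response = '{"customer_id": "FORMAT_PRESERVING_HMAC", "ssn": "REDACT"}'
    print(parse_rule_suggestions(response, {"customer_id", "email", "ssn"}))
```

Rejecting unknown columns and strategies keeps a malformed or hallucinated suggestion from reaching the obfuscation step.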

Represents how I would build this system today with LLM capabilities

Key Metrics

Annual Savings: $1.2M

Metadata Accuracy: 99%

Industries Covered: 12

Service Availability: 99.9%