Capability 05 — Walkthrough

Synthetic Detection Pipeline

A live, end-to-end run of the anomaly detection pipeline against synthetic federal financial system records. Each invocation generates a fresh batch, normalizes it to OCSF, scores every record with IsolationForest, and writes the result to S3 for inspection.

What this pipeline actually does

The pipeline runs three stages in sequence. Each stage is implemented in the inference harness Lambda and writes its output to a versioned S3 location that downstream stages read from. No real data is generated, normalized, or stored at any point. Synthetic records are tagged _synthetic: true on creation, and that flag follows the record through normalization and scoring.

Stage 01

Generate

Builds synthetic federal financial system records mirroring PBIS, STARS-FL, FFMS, GFEBS, and EBS log structures. The configurable anomaly rate seeds a known proportion of suspicious patterns (permission escalation, bulk export, off-hours config changes, external-IP origin) for downstream scoring to recover.
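
The generation step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the harness code: the function name generate_batch, the record fields, the action names, and the seeded_pattern label are all hypothetical. Only the source system names, the four suspicious patterns, and the _synthetic: true tag come from the description above.

```python
import random
from datetime import datetime, timedelta, timezone

SOURCE_SYSTEMS = ["PBIS", "STARS-FL", "FFMS", "GFEBS", "EBS"]
PATTERNS = ["permission_escalation", "bulk_export",
            "off_hours_config_change", "external_ip_origin"]
# Hypothetical action name emitted for each seeded pattern.
PATTERN_ACTIONS = {
    "permission_escalation": "grant_role",
    "bulk_export": "export",
    "off_hours_config_change": "update_config",
    "external_ip_origin": "read",
}


def generate_batch(count, anomaly_rate, source_system=None, seed=None):
    """Build `count` fictitious records, seeding round(count * anomaly_rate) anomalies."""
    rng = random.Random(seed)
    seeded = set(rng.sample(range(count), round(count * anomaly_rate)))
    day = datetime(2024, 1, 8, tzinfo=timezone.utc)  # arbitrary fictitious date
    batch = []
    for i in range(count):
        pattern = rng.choice(PATTERNS) if i in seeded else None
        # Off-hours changes land at 00:00-04:59 UTC, everything else at 09:00-17:59.
        if pattern == "off_hours_config_change":
            hour = rng.randint(0, 4)
        else:
            hour = rng.randint(9, 17)
        if pattern == "external_ip_origin":
            ip = f"203.0.113.{rng.randint(1, 254)}"  # RFC 5737 documentation range
        else:
            ip = f"10.0.{rng.randint(0, 255)}.{rng.randint(1, 254)}"
        if pattern == "bulk_export":
            size = rng.randint(500_000, 5_000_000)
        else:
            size = rng.randint(200, 4_000)
        stamp = day + timedelta(hours=hour, minutes=rng.randint(0, 59))
        batch.append({
            "record_id": f"rec-{i:05d}",
            "source_system": source_system or rng.choice(SOURCE_SYSTEMS),
            "timestamp": stamp.isoformat(),
            "user": f"user{rng.randint(1, 40):03d}",
            "source_ip": ip,
            "action": PATTERN_ACTIONS[pattern] if pattern else rng.choice(["read", "update"]),
            "size_bytes": size,
            "seeded_pattern": pattern,  # ground-truth label, None for normal records
            "_synthetic": True,         # tag applied on creation
        })
    return batch
```

Keeping the seeded pattern on each record is what makes the batch a labeled dataset: the scorer's flags can later be compared against it.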

Stage 02

Normalize

Maps each source-system record into the OCSF API Activity (6003) schema. Actor, source endpoint, activity ID, time, and metadata fields are populated consistently across source systems so the scorer can run on a uniform feature space.
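
As a rough sketch of that mapping, the function below turns one raw record into an OCSF API Activity event. The input field names (action, timestamp, user, source_ip, source_system) and the action-to-activity table are assumptions for illustration, and the result is not validated against the full schema. The class and category identifiers, the activity ID values, and the type_uid formula follow the published OCSF definitions. Carrying _synthetic as a top-level key is also an assumption about where the harness keeps the flag.

```python
from datetime import datetime, timezone

OCSF_CLASS_UID = 6003     # API Activity
OCSF_CATEGORY_UID = 6     # Application Activity
# OCSF API Activity activity_id values; anything else maps to 99 (Other).
ACTIVITY_IDS = {"create": 1, "read": 2, "update": 3, "delete": 4}


def to_ocsf(record):
    """Map a raw source-system record (hypothetical shape) to an OCSF-style dict."""
    activity_id = ACTIVITY_IDS.get(record["action"], 99)
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "class_uid": OCSF_CLASS_UID,
        "category_uid": OCSF_CATEGORY_UID,
        "activity_id": activity_id,
        "type_uid": OCSF_CLASS_UID * 100 + activity_id,
        "time": int(ts.timestamp() * 1000),  # OCSF timestamps are epoch milliseconds
        "actor": {"user": {"name": record["user"]}},
        "src_endpoint": {"ip": record["source_ip"]},
        "api": {"operation": record["action"]},
        "metadata": {
            "product": {"name": record["source_system"]},
            "version": "1.1.0",  # assumed OCSF schema version
        },
        "_synthetic": record["_synthetic"],  # flag follows the record
    }
```

Because every source system passes through the same function, the scorer only ever sees one field layout.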

Stage 03

Score

Runs IsolationForest with configurable contamination over a feature vector (cyclical time, IP last octet, activity ID, record size). Scores are normalized to [0, 1], and any record above the threshold is flagged with is_anomaly=true and a CVSS-style severity. Output is written to S3 as Parquet and JSON.
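
The feature vector and the post-processing around the model can be sketched without the model itself. The helpers below are illustrative assumptions, not the harness code: the record_size field name, min-max normalization, deriving the threshold from the contamination fraction, and reusing the CVSS v3 qualitative bands on score × 10 are all guesses at how the stage could work. The raw scores are assumed to be oriented so that higher means more anomalous; scikit-learn's IsolationForest.score_samples returns the opposite orientation, so its output would need to be negated first.

```python
import math


def feature_vector(event):
    """Cyclical hour-of-day, IP last octet, activity ID, record size."""
    hour = (event["time"] // 3_600_000) % 24  # UTC hour from epoch milliseconds
    angle = 2 * math.pi * hour / 24
    last_octet = int(event["src_endpoint"]["ip"].rsplit(".", 1)[1])
    return [
        math.sin(angle),       # sin/cos pair keeps 23:00 and 00:00 adjacent
        math.cos(angle),
        last_octet,
        event["activity_id"],
        event["record_size"],  # hypothetical field name
    ]


def normalize_scores(raw):
    """Min-max scale raw anomaly scores to [0, 1] (higher = more anomalous)."""
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0] * len(raw)
    return [(s - lo) / (hi - lo) for s in raw]


def threshold_for(scores, contamination):
    """Cutoff that roughly the top `contamination` fraction of scores exceeds."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(len(ranked) * contamination))
    return ranked[min(k, len(ranked) - 1)]


def severity(score):
    """Map a [0, 1] score onto the CVSS v3 qualitative bands (score * 10)."""
    s = score * 10
    if s >= 9.0:
        return "Critical"
    if s >= 7.0:
        return "High"
    if s >= 4.0:
        return "Medium"
    if s > 0.0:
        return "Low"
    return "None"


def flag(scores, threshold):
    """Attach is_anomaly and severity to each normalized score."""
    return [
        {"anomaly_score": s, "is_anomaly": s > threshold, "severity": severity(s)}
        for s in scores
    ]
```

Under these assumptions, setting contamination equal to the seeded anomaly rate means the number of flagged records matches the number of seeded ones, so any mismatch between the two sets reflects the model rather than the cutoff.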

Synthetic data only

Every invocation creates new fictitious records. The harness rejects any attempt to feed it real agency data. This page is for capability demonstration before formal accreditation; production use against client data requires the post-ATO promotion path.
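
How the harness enforces that rejection is not described here. One plausible guard, sketched below with hypothetical names, refuses any batch containing a record that does not carry the _synthetic: true tag.

```python
class RealDataRejected(ValueError):
    """Raised when a record lacks the synthetic tag (hypothetical exception)."""


def require_synthetic(records):
    """Return the batch unchanged, or raise if any record is not tagged synthetic."""
    for index, record in enumerate(records):
        # Strict identity check: a truthy string such as "true" does not pass.
        if record.get("_synthetic") is not True:
            raise RealDataRejected(f"record {index} is not tagged _synthetic: true")
    return records
```

Running a check like this at the entry of each stage would stop an untagged record before it is normalized or scored.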

Try it live

Choose your parameters and run the pipeline against the live sandbox Lambda. Results return inline with the run ID and S3 keys for follow-up inspection in CloudWatch or SageMaker Studio.

Live Pipeline Run

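The run form takes four parameters:

Number of records. Range 10–500. The cap keeps a run within the API Gateway timeout.
Anomaly rate. Proportion of records seeded as suspicious; 0.05 means 5%.
Source system. Restrict the batch to one source schema, or leave it mixed.
Contamination. The IsolationForest parameter that sets the anomaly threshold. Match it to the seeded anomaly rate for honest scoring.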

Configure parameters and click Run Pipeline.

A 200-record run typically completes in 6–12 seconds.
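
For callers who prefer a script to the form, a request could be assembled along these lines. Everything about the wire format is an assumption: the URL is a placeholder, and the JSON field names (record_count, anomaly_rate, contamination, source_system) are hypothetical. The 10–500 range comes from the form, the bearer token reflects the JWT-protected endpoint, and the contamination bound of (0, 0.5] is the range scikit-learn's IsolationForest accepts, assuming the harness uses that implementation.

```python
import json
import urllib.request

SANDBOX_URL = "https://sandbox.example.invalid/pipeline/run"  # placeholder, not the real endpoint


def build_run_request(token, record_count=200, anomaly_rate=0.05,
                      contamination=0.05, source_system=None):
    """Validate parameters and build (but do not send) a POST request."""
    if not 10 <= record_count <= 500:
        raise ValueError("record_count must be between 10 and 500")
    if not 0.0 <= anomaly_rate <= 1.0:
        raise ValueError("anomaly_rate must be between 0 and 1")
    if not 0.0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    payload = {
        "record_count": record_count,
        "anomaly_rate": anomaly_rate,
        "contamination": contamination,
        "source_system": source_system,  # None leaves the batch mixed
    }
    return urllib.request.Request(
        SANDBOX_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen with a timeout comfortably above the typical run time would send it; validating the record count locally avoids a round trip for a request the form would refuse anyway.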

Where the output lives

After a successful run the response includes the S3 keys for both the raw generated batch and the scored OCSF output. The raw batch goes to the synthetic-data bucket; the scored output goes to the model-artifacts bucket as both Parquet (for Athena and SageMaker) and JSON (for human inspection). Both buckets are KMS-encrypted with the sandbox CMK and accessible from the provisioned developer SageMaker Studio profiles.

What this is for

The pipeline exists to validate detection approaches on a known, labeled dataset before any contact with real client systems. Internal reviewers can inspect any run end-to-end: the inputs (raw synthetic records), the intermediate (normalized OCSF), and the output (scored records with model attribution). This is the path detection capability takes from experimentation through internal review and into formal accreditation.
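
Because the seeded anomalies are known, a reviewer can measure how many of them the scorer recovers. The sketch below computes precision and recall over scored records; the seeded_anomaly field is a hypothetical name for the ground-truth label, while is_anomaly is the flag the scoring stage writes.

```python
def recovery_metrics(records):
    """Compare is_anomaly flags against the seeded ground-truth labels."""
    tp = sum(1 for r in records if r["is_anomaly"] and r["seeded_anomaly"])
    fp = sum(1 for r in records if r["is_anomaly"] and not r["seeded_anomaly"])
    fn = sum(1 for r in records if not r["is_anomaly"] and r["seeded_anomaly"])
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        # Share of flagged records that were actually seeded.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of seeded records that were flagged.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Tracking these two numbers across runs with different parameters gives a simple, repeatable basis for comparing detection approaches before any real data is involved.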

Sign in to run the live pipeline

The live demo calls a JWT-protected endpoint. Sign in with your Kearney sandbox account to continue.