Hacking · Apr 10, 2026 · 11 min read


MedScribe-R-Us: Building an AppSec Program from Scratch — P2: The CI/CD Security Pipeline


Series: Securing MedScribe-R-Us | Part 3 of 5

Security tooling in CI/CD has a failure mode that I've seen at every client that tries it without a prior threat model: it produces findings that nobody acts on.

The pipeline runs. Semgrep (or whichever tool they chose) flags 400 things. Developers learn to ignore the pipeline. The security team starts treating the report as a checkbox. The tool is "integrated" in the sense that it runs — but not in the sense that it changes what gets shipped.

The root cause is almost always the same: the tooling wasn't configured against the actual threat model. Default rulesets catch generic vulnerabilities, not the specific risks of your specific system. Without that specificity, the signal-to-noise ratio is bad enough that real findings get buried alongside hundreds of irrelevant ones.

Phase 1 (P1) gave us the threat model. Phase 2 (P2) is where it gets enforced. Follow along in the GitHub repo: https://github.com/LeSpookyHacker/medscribe-r-us-appsec


Five Tools, One Pipeline

MedScribe-R-Us's CI/CD pipeline runs five security workflows in GitHub Actions. Each one is a separate file. Each has a different trigger strategy, a different failure policy, and a direct line back to specific findings in the P1 STRIDE register.

```
.github/workflows/
├── sast.yml       ← Semgrep (custom rules + community packs)
├── sca.yml        ← pip-audit, npm audit, OWASP Dependency-Check
├── secrets.yml    ← Gitleaks (full history scan)
├── container.yml  ← Trivy (CVEs + misconfigs + delta report)
└── dast.yml       ← OWASP ZAP (3 authenticated scan contexts)

semgrep-rules/
├── phi-in-logs.yml          ← T-006: PHI in application logs
├── auth-missing.yml         ← T-011: missing auth, tenant scope bypass
└── llm-output-handling.yml  ← LLM02: insecure LLM output consumption

.gitleaks.toml  ← Custom patterns: GCP SA keys, MongoDB URIs, FHIR secrets
```


SAST: Rules That Address Actual Threats

The Semgrep workflow runs two passes. The first uses the custom ruleset in `semgrep-rules/` — three files, each mapped to specific STRIDE findings. The second runs community packs (`p/python`, `p/owasp-top-ten`, `p/jwt`).

The gate logic is different for each. Custom rules: any finding at any severity blocks the PR. Community rules: ERROR severity blocks merge to `main`, WARNING creates an issue. The asymmetry is intentional — custom rules are tuned to this codebase with zero false positive tolerance, community rules are generic and need some noise tolerance.
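The two-tier policy is simple enough to express in a few lines of Python. This is a sketch of the gate logic, not the actual workflow script in the repo; it assumes Semgrep's `--json` report shape (`results[].extra.severity`):

```python
def semgrep_gate(report: dict, ruleset: str) -> list[dict]:
    """Return the findings that should block the PR under the two-tier
    policy: custom rules block on any finding at any severity, community
    rules block only on ERROR severity."""
    results = report.get("results", [])
    if ruleset == "custom":
        return list(results)  # zero false-positive tolerance: anything blocks
    return [r for r in results
            if r.get("extra", {}).get("severity") == "ERROR"]
```

In CI this would consume the report produced by `semgrep --json` and fail the job whenever the returned list is non-empty.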

phi-in-logs.yml — addressing T-006

T-006 in the STRIDE register is "PHI in Application Logs" — rated Critical because application logs are not subject to the same access controls as the clinical data stores, are aggregated in Datadog, and are accessible to a broader set of engineers. It's one of the most common HIPAA violations in SaaS products.

The rule targets four patterns:

```yaml
- id: phi-variable-in-logger
  severity: ERROR
  patterns:
    - pattern: $LOGGER.$METHOD(..., $VAR, ...)
    - metavariable-regex:
        metavariable: $LOGGER
        regex: '^(logger|log|logging|LOGGER|LOG)$'
    - metavariable-regex:
        metavariable: $VAR
        regex: '.*(transcript|patient|phi|note|audio|deid|soap|clinical|encounter|mrn).*'
```

The other three patterns cover f-string interpolation in log calls, exception handlers that log the full request body, and print() calls with PHI-pattern variable names — the classic debug statement that makes it to production.

A false positive suppression looks like this:

```python
# nosemgrep: phi-variable-in-logger
# Justification: `session_id` is a UUID generated by MedScribe —
# no patient data. Confirmed in code review 2024-01-15.
logger.info("Session completed", session_id=session_id)
```

No justification comment means the suppression gets removed at the next quarterly review. The gate enforces this — `# nosemgrep` alone without a comment still triggers in CI.
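Checking that every suppression carries a justification is mechanical. A hypothetical audit helper (illustrative, not the repo's actual CI implementation) might look like:

```python
import re

SUPPRESSION = re.compile(r"#\s*nosemgrep")
JUSTIFIED = re.compile(r"#\s*Justification:", re.IGNORECASE)

def unjustified_suppressions(source: str) -> list[int]:
    """Return 1-based line numbers of `# nosemgrep` markers that have no
    `# Justification:` comment on the same line or the two lines below."""
    lines = source.splitlines()
    flagged = []
    for i, line in enumerate(lines):
        if SUPPRESSION.search(line):
            window = lines[i:i + 3]  # the marker line plus two lines below
            if not any(JUSTIFIED.search(l) for l in window):
                flagged.append(i + 1)
    return flagged
```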

auth-missing.yml — addressing T-011

T-011 is "Admin Portal Horizontal Privilege Escalation" — an admin endpoint that validates "is this user an admin?" rather than "is this admin scoped to this tenant?" means a Clinic Admin can traverse into another health system's data.

Two rules here. The first detects FastAPI route handlers that lack an authentication dependency:

```yaml
- id: fastapi-route-missing-auth-dependency
  patterns:
    - pattern: |
        @$ROUTER.$METHOD("...")
        async def $FUNC($PARAMS):
            ...
    - pattern-not: |
        @$ROUTER.$METHOD("...")
        async def $FUNC(..., $DEP: $TYPE = Depends(...), ...):
            ...
```

The second catches admin routes that pull `tenant_id` from path parameters rather than from the authenticated user's JWT claims — the exact pattern that enables horizontal privilege escalation:

```yaml
- id: admin-route-missing-tenant-scope
  severity: ERROR
  patterns:
    - pattern: |
        @$ROUTER.$METHOD("/admin/...")
        async def $FUNC(..., tenant_id: $TYPE, ...):
            ...
    - pattern-not: |
        @$ROUTER.$METHOD("/admin/...")
        async def $FUNC(..., $USER = Depends(...), ...):
            ...
            $USER.tenant_id
```
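For contrast, the compliant shape the second rule pushes toward can be sketched framework-free. Everything here (names, types, the error class) is illustrative rather than lifted from the MedScribe codebase:

```python
from dataclasses import dataclass

@dataclass
class AuthenticatedUser:
    user_id: str
    tenant_id: str  # taken from verified JWT claims, never from the URL
    role: str

class TenantScopeError(Exception):
    pass

def resolve_tenant(user: AuthenticatedUser, path_tenant_id: str) -> str:
    """Enforce the invariant the rule checks for: the tenant an admin
    operates on comes from their JWT claims. A path parameter that
    disagrees is rejected, never trusted."""
    if path_tenant_id != user.tenant_id:
        raise TenantScopeError(
            f"user {user.user_id} is scoped to {user.tenant_id}, "
            f"requested {path_tenant_id}")
    return user.tenant_id
```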

llm-output-handling.yml — addressing LLM02

This ruleset catches the pattern where raw LLM output gets consumed before it clears the Output Validation Service. Three variants: direct field access on the raw response object, writing raw LLM content directly to MongoDB, and json.loads() on LLM output outside a try/except block.

```yaml
- id: llm-output-used-before-validation
  severity: ERROR
  patterns:
    - pattern: $VAR.$FIELD
    - metavariable-regex:
        metavariable: $VAR
        regex: '^(raw_response|llm_output|vertex_response|gemini_response)$'
    - metavariable-regex:
        metavariable: $FIELD
        regex: '^(text|content|candidates|assessment|plan|subjective|objective)$'
```

The fourth rule catches the PHI scrubbing bypass pattern — a raw transcript variable passed directly to prompt construction functions without going through the scrubbing layer first. This is the automated enforcement of the most critical control in the L2 DFD.
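The consuming side these rules enforce can be sketched as a defensive parse plus an allow-list. This is a simplified stand-in for the real Output Validation Service, with illustrative names:

```python
import json

ALLOWED_FIELDS = {"subjective", "objective", "assessment", "plan"}  # SOAP

class LLMOutputError(Exception):
    pass

def validate_llm_output(raw_text: str) -> dict:
    """Treat LLM output as untrusted input: parse it defensively and
    keep only allow-listed SOAP fields, dropping anything else the
    model may have emitted."""
    try:
        data = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError) as exc:
        raise LLMOutputError(f"unparseable LLM output: {exc}") from exc
    if not isinstance(data, dict):
        raise LLMOutputError("expected a JSON object")
    return {k: v for k, v in data.items() if k in ALLOWED_FIELDS}
```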


SCA: The Supply Chain Surface

Two separate SCA workflows — pip-audit for Python, npm audit for Node.js — unified into the same SARIF upload path feeding GitHub Advanced Security.

The gate threshold is CVSS ≥ 9.0 with an available fix. This keeps the gate's signal-to-noise ratio high — it only blocks on Critical CVEs that are both severe and actionable. A CVSS 8.x CVE in a PHI-touching package (MongoDB driver, FHIR client, JWT library) gets escalated manually during triage because the contextual risk is higher than the base score implies.
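As a sketch, the gate plus the manual-triage carve-out might look like the following. Note the loud assumption: pip-audit's JSON output doesn't carry CVSS scores, so these records presume an upstream enrichment step, and the package names are illustrative:

```python
# Illustrative PHI-touching package names, not an actual MedScribe list.
PHI_TOUCHING = {"pymongo", "fhirclient", "pyjwt"}

def gate_blocks(findings: list[dict], threshold: float = 9.0) -> list[dict]:
    """The automated gate: block only findings that are both severe
    (CVSS >= threshold) and actionable (a fixed version exists)."""
    return [f for f in findings
            if f["cvss"] >= threshold and f.get("fix_versions")]

def needs_manual_triage(findings: list[dict],
                        floor: float = 8.0,
                        threshold: float = 9.0) -> list[dict]:
    """High-but-not-Critical CVEs in PHI-touching packages get a human
    look, since contextual risk exceeds the base score."""
    return [f for f in findings
            if floor <= f["cvss"] < threshold
            and f["package"] in PHI_TOUCHING]
```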

The SCA workflow also runs daily at 06:00 UTC on a schedule — not just on PRs. This catches newly published CVEs against pinned dependencies in the window between code changes. A CVE published against cryptography 41.0.3 on a Tuesday afternoon fires on the Wednesday morning scan even if no code changed overnight.

One finding that surfaced during pipeline testing: a transitive dependency in the FHIR client library pinned to a version of requests with a known SSRF vulnerability. The FHIR client itself doesn't use the vulnerable code path, but MedScribe's Audio Ingestion Service uses requests for outbound webhook delivery — and if webhook URLs are configurable by Clinic Admins, the SSRF surface becomes real. The threat model didn't flag this specifically. The SCA tool did. That's the right division of labor between the two.


Secrets Detection: The Pre-commit Hook Is the Real Gate

The Gitleaks workflow scans the full commit history on every push — not just the diff. A secret committed three months ago and "deleted" in a later commit is still in git history and still requires rotation regardless of the current file state.

But the CI gate is the wrong place to catch secrets. By the time a secret reaches CI, it's in the repository, it's been pushed to GitHub's servers, and it needs to be rotated whether or not the build fails. The pre-commit hook catches it before the commit exists.

```bash
# Setup from repo root
pip install pre-commit
pre-commit install
```

The `.gitleaks.toml` adds MedScribe-specific patterns on top of the default ruleset: GCP service account key files (the JSON `"type": "service_account"` pattern), MongoDB Atlas connection strings (`mongodb+srv://user:pass@cluster.mongodb.net`), Vertex AI API keys (`AIza...`), and SMART on FHIR OAuth client secrets.
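To make the custom patterns concrete, here are rough Python approximations of a few of them. These regexes are illustrative reconstructions; the `.gitleaks.toml` in the repo is the source of truth:

```python
import re

# Approximations of the custom Gitleaks patterns described above.
PATTERNS = {
    "gcp-service-account-key": re.compile(r'"type"\s*:\s*"service_account"'),
    "mongodb-atlas-uri": re.compile(r"mongodb\+srv://[^\s:]+:[^\s@]+@[^\s/]+"),
    "google-api-key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
}

def scan_line(line: str) -> list[str]:
    """Return the names of any secret patterns that match a line."""
    return [name for name, rx in PATTERNS.items() if rx.search(line)]
```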

There is no warn-only mode for secrets. A detected secret means the credential is assumed compromised, rotation is required immediately, and the security team is notified before any merge discussion happens. The PR comment that fires on detection makes this explicit — it calls it a security incident, not a build failure.


Container Scanning: The Surface SAST Can't See

Trivy scans three surfaces per image: OS-level CVEs in the base image, library CVEs in installed packages, and Dockerfile misconfigurations.

The OS-level surface is what SAST misses entirely — vulnerabilities in packages installed via apt that the Python and Node.js dependency scanners don't know about. It also catches secrets baked into image layers: a COPY .env that was removed in a later RUN layer is still present in the image history and extractable by anyone with pull access to Artifact Registry.

Both MedScribe service images use distroless base images (gcr.io/distroless/python3, gcr.io/distroless/nodejs20). Distroless removes the shell, package manager, and most OS-level packages — eliminating large portions of the CVE surface and removing the shell that an attacker needs to do anything useful after container breakout.

The delta report is the most useful feature for day-to-day developer experience. Instead of showing all CVEs in the image, it shows only the CVEs introduced by the current change compared to the previously deployed image. PR review stays focused on new risk:

```
## 🛡️ Container CVE Delta Report

⚠️ 1 new CVE(s) introduced:

| CVE           | Severity | Package         | Fixed In |
|---------------|----------|-----------------|----------|
| CVE-2024-1234 | HIGH     | urllib3 1.26.18 | 2.0.7    |
```
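The delta itself is just a set difference between two scans. A sketch, assuming Trivy's JSON report schema (`Results[].Vulnerabilities[].VulnerabilityID`):

```python
def cve_ids(trivy_report: dict) -> set[str]:
    """Collect the set of CVE IDs from a Trivy JSON report."""
    ids = set()
    for result in trivy_report.get("Results", []):
        # Trivy emits null instead of [] when a target has no findings.
        for vuln in result.get("Vulnerabilities") or []:
            ids.add(vuln["VulnerabilityID"])
    return ids

def delta(previous: dict, current: dict) -> tuple[set[str], set[str]]:
    """Return (CVEs introduced by this change, CVEs it fixed)."""
    before, after = cve_ids(previous), cve_ids(current)
    return after - before, before - after
```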


DAST: Testing the Running Application

DAST is qualitatively different from the other four tools — it's the only one that tests what the application actually does at runtime, not what the code says it should do. OWASP ZAP runs post-deploy to staging, not on PR, because it needs a running application.

Three scan contexts:

Unauthenticated baseline covers security headers (CSP, HSTS, X-Frame-Options), information disclosure in error responses, open redirects, and common injection patterns — the surface any internet scanner would see.

Authenticated — clinician role covers the highest-risk authenticated surface: PHI access controls on the note retrieval endpoints, the approval gate enforcement, the audio upload endpoint, and the SOAP note editor (stored XSS surface). Uses a dedicated test clinician account in staging.

Authenticated — admin role specifically targets T-011 — horizontal privilege escalation. ZAP's active scanner manipulates path parameters and query strings on tenant-scoped admin endpoints, attempting to access data outside the test account's tenant. This is the automated equivalent of what a penetration tester would try on the Admin Portal.

Gate logic: High severity ZAP findings block the next environment promotion. Medium severity findings auto-create GitHub Issues with severity:medium, source:dast, status:open labels and a 30-day SLA.

The ZAP rules file (.zap/rules.tsv) documents every intentional IGNORE decision — findings suppressed because they're accepted risks, not because they're false positives. Every IGNORE entry has a comment. Undocumented ignores don't exist in this configuration.
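That invariant is easy to check mechanically. A hypothetical audit over the TSV (assuming the common zap-baseline layout of tab-separated rule ID, action, and comment):

```python
def undocumented_ignores(rules_tsv: str) -> list[str]:
    """Return rule IDs marked IGNORE that carry no comment."""
    flagged = []
    for line in rules_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and comment-only lines
        parts = line.split("\t")
        if len(parts) >= 2 and parts[1].strip() == "IGNORE":
            comment = parts[2].strip() if len(parts) > 2 else ""
            if not comment:
                flagged.append(parts[0].strip())
    return flagged
```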


The Developer Guide

All of this tooling only works if engineers understand it. The docs/sdlc/developer-guide.md covers how to read each tool's output, how to write a valid suppression, and when to escalate. The suppression policy is specific: a justification comment is required, and suppressions without one get removed at the quarterly review.

The guide closes with five golden rules short enough to memorize:

1. PHI never goes in logs. Ever.

2. Every FastAPI route that touches PHI requires an auth dependency.

3. LLM output is untrusted until it clears the Output Validation Service.

4. Secrets go in Secret Manager. Never hardcoded, never in `.env` files.

5. If you're unsure, ask.


The Full Gate Summary

| Tool | Trigger | Blocks On |
|------|---------|-----------|
| Semgrep custom rules | Every PR | Any finding |
| Semgrep community | Every PR | ERROR severity on `main` |
| pip-audit / npm audit | Every PR | CVSS ≥ 9.0 with fix |
| Gitleaks | Every push | Any secret detected |
| Trivy | Push to `main`/`staging` | Critical or High with fix |
| ZAP | Post-deploy to staging | High severity alert |

That's the pipeline. Not a checkbox. A loop with teeth.


What P3 Builds on This Foundation

P2 generates findings. P3 defines what happens to them — the vulnerability management program that assigns SLAs, tracks remediation, and produces the security metrics that make the program legible to leadership. The gate policy already references SLA tiers that don't formally exist until P3 defines them. P3 closes that loop.

*All workflows, custom Semgrep rules, Gitleaks config, and policy documents are in the repo under .github/workflows/, semgrep-rules/, and docs/sdlc/. All companies, scenarios, and clinical details are fictional.*

The full repo is on GitHub at https://github.com/LeSpookyHacker/medscribe-r-us-appsec/


— LeSpookyHacker


Small Glossary of Acronyms

Since you are reading this, I am going to assume you know most general AppSec acronyms, so I will only be defining some medical specific ones, or new acronyms that some people may not know yet in the security field.

  • ATLAS: Adversarial Threat Landscape for AI Systems (MITRE framework)
  • BA: Business Associate (under HIPAA)
  • CSF: Cybersecurity Framework (NIST) / Common Security Framework (HITRUST)
  • EMR: Electronic Medical Record
  • FHIR: Fast Healthcare Interoperability Resources (R4 refers to Release 4)
  • HIPAA: Health Insurance Portability and Accountability Act
  • HITECH: Health Information Technology for Economic and Clinical Health Act
  • HITRUST: Health Information Trust Alliance
  • MRN: Medical Record Number
  • PHI: Protected Health Information
  • PII: Personally Identifiable Information
  • SOAP: Subjective, Objective, Assessment, and Plan (Medical clinical note format)
  • SOC: System and Organization Controls
  • STRIDE: Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege (Threat modeling framework)

Tags: AppSec, Medical, Security