38 secret patterns we hunt (and why we stopped there)

Posted 2026-05-04 · 8 min read · secrets, recon, attack surface

Run a credential-detection tool over a public GitHub repo today and you'll get one of two experiences. Trufflehog ships somewhere north of 700 detectors. Gitleaks has roughly 150 rules. Both will return a wall of findings — most of them noise — and you'll spend more time triaging false positives than rotating real secrets.

UnveilScan curates 38 patterns. They live in internal/secretpatterns/patterns.go, hand-written, regularly tested against real corpus samples. This article walks through the curation philosophy, why fewer is more honest, the severity bands, and the redaction discipline that keeps us — and your secrets — out of the legal grey zone.

The 38 patterns, by family

Roughly grouped:

Cloud providers (5): AWS access key (AKIA…), AWS temporary key (ASIA…), AWS secret key (heuristic on aws_secret_access_key), Azure storage account key, Azure SAS token.
Source-code platforms (7): GitHub PAT (ghp_), fine-grained PAT (github_pat_), OAuth token (gho_), user-to-server token (ghu_), server-to-server token (ghs_), refresh token (ghr_), GitLab PAT (glpat-).
Payments (2): Stripe live secret (sk_live_), Stripe restricted (rk_live_).
Communications (5): Slack bot/app token, Slack incoming webhook URL, Twilio account SID, Twilio auth token (heuristic), SendGrid API key.
AI / ML APIs (3): OpenAI key (sk- and sk-proj-), Anthropic key (sk-ant-api03- and sk-ant-admin01-), Google OAuth access token (ya29.).
Other SaaS (8): Google API key (AIza), Google service-account JSON (header match), Heroku API key, Mailgun, Mailchimp, Datadog (heuristic), Cloudflare API token (heuristic), Firebase database secret.
Cryptographic material (3): JWT (3-segment base64url), PEM private key (RSA/EC/OPENSSH/DSA/PGP), PKCS#8 private key.
Database connection URLs with embedded passwords (3): Postgres, MySQL, MongoDB.
Generic catch-all (1): assignment-style match for (api_key|secret|password|token) = "long_string" in code.
Package registries (1): npm token (npm_).

That's 38. No Twitter API keys (the platform pivoted, the keys are mostly dead). No LinkedIn, Discord, Reddit. No DigitalOcean (the format isn't fingerprinted enough to match without false positives). No JIRA, Confluence, Bitbucket — heuristic-only, too noisy.

The curation philosophy: precision over recall

Two ways to write a secret detector. The recall-maximalist approach (Trufflehog) tries to detect every secret, accepts that some matches will be false, and pushes the triage burden onto the user. The precision-maximalist approach (us) only flags when the format is distinctive enough that a match is almost certainly a real secret.

Concretely: we'll match AKIA[0-9A-Z]{16} because Amazon designed access keys with a fixed prefix, fixed length, and a restricted alphabet. A match is an AWS key with high probability. We won't match a generic 40-character base64-looking string because half the world's CI/CD job IDs look like that.
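To make the contrast concrete, here is a minimal Go sketch of the two kinds of rule. It is not the contents of patterns.go: the variable names, the word-boundary anchors, and the generic 40-character rule are assumptions added for illustration. The AWS sample is the placeholder key from Amazon's own documentation.

```go
package main

import (
	"fmt"
	"regexp"
)

// awsAccessKey follows the format described above: a fixed "AKIA" prefix,
// then exactly 16 characters from [0-9A-Z]. The \b anchors are an assumption
// added here so that longer identifiers do not match by substring.
var awsAccessKey = regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)

// generic40 is the kind of rule the article argues against: any
// 40-character base64-looking run. Shown only to illustrate the noise.
var generic40 = regexp.MustCompile(`\b[A-Za-z0-9+/]{40}\b`)

func main() {
	lines := []string{
		`aws_access_key_id = AKIAIOSFODNN7EXAMPLE`,
		`ci_job_id: 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b`,
	}
	for _, l := range lines {
		fmt.Printf("aws=%-5v generic=%-5v %s\n",
			awsAccessKey.MatchString(l), generic40.MatchString(l), l)
	}
}
```

The prefixed rule fires only on the first line. The generic rule fires on the CI job ID, which is exactly the false positive described above.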

The cost of recall maximalism in production:

Every developer who's ever used Trufflehog has stories about flag fatigue. After the third "this is a generated CI ID, not a credential" the team stops looking.
The interesting secret — the AWS key on line 287 of config/staging.tf — gets buried among 47 false positives in the same scan.
Customers ask "is this signal or noise?" and want human triage. That doesn't scale.

The cost of precision maximalism:

We miss some secrets. Specifically: any secret format that's not distinctive enough to fingerprint without false positives. Most pre-2020 API keys (32 random chars, no prefix) fall in this category.
This is a deliberate tradeoff. Better to flag 100% of fingerprintable secrets and miss the 5% of the universe that can't be fingerprinted than to chase that remainder and bury everything in noise.

Industry trend supports us: over the last several years, major API providers have moved to prefixed, fixed-length key formats, in part because they make detection trivial. Stripe (sk_live_), GitHub (ghp_), OpenAI (sk-), Anthropic (sk-ant-) and Slack (xox) all converged on this pattern. As the legacy "32 random chars" formats are phased out, the precision-maximalist approach gets stronger every year.

Severity bands: what fires CRITICAL vs HIGH vs MEDIUM

A leaked PEM private key is not the same as a leaked Twilio account SID. Severity bands reflect the operational consequence of the credential being public.

CRITICAL (~21 patterns) — credentials that grant code execution, money movement, or persistent access. AWS keys, Stripe live keys, all GitHub token variants, OpenAI/Anthropic keys, NPM tokens, PEM private keys, Google service-account JSON, Azure storage keys. A single leak here = active incident.
HIGH (~12 patterns) — credentials that grant read access to infrastructure or data, but require additional steps to monetise. Slack webhook (spam, not full takeover), Postgres/MySQL/MongoDB URLs (database access if network-reachable), Heroku, Mailgun, SendGrid, Datadog, Cloudflare API token, Firebase. A leak here = rotate within 24h, audit usage, monitor for activity.
MEDIUM (~5 patterns) — JWT (often expired), Twilio account SID (public-facing, only useful with the auth token), generic password = "..." matches. A leak here = rotate when convenient, low immediate risk.
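One way to picture the bands is as an ordered type with a response attached to each level. The sketch below is illustrative only: the pattern slugs are hypothetical, and the band assignments are taken from the descriptions above, not from the real pattern table.

```go
package main

import "fmt"

// Severity is ordered so that a higher value means a more urgent finding.
type Severity int

const (
	Medium Severity = iota
	High
	Critical
)

func (s Severity) String() string {
	switch s {
	case Critical:
		return "CRITICAL"
	case High:
		return "HIGH"
	default:
		return "MEDIUM"
	}
}

// Response paraphrases the action the article attaches to each band.
func (s Severity) Response() string {
	switch s {
	case Critical:
		return "active incident: rotate now"
	case High:
		return "rotate within 24h, audit usage, monitor for activity"
	default:
		return "rotate when convenient"
	}
}

// severityByPattern uses hypothetical slugs; only a few patterns are listed.
var severityByPattern = map[string]Severity{
	"aws-access-key":     Critical,
	"stripe-live-secret": Critical,
	"pem-private-key":    Critical,
	"slack-webhook-url":  High,
	"postgres-url":       High,
	"jwt":                Medium,
	"twilio-account-sid": Medium,
}

func main() {
	for _, slug := range []string{"aws-access-key", "postgres-url", "jwt"} {
		s := severityByPattern[slug]
		fmt.Printf("%-16s %-8s %s\n", slug, s, s.Response())
	}
}
```

Keeping the type ordered means alert routing can compare bands directly (for example, page only when a finding is at least High).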

Counts are approximate because some patterns arguably belong in different bands depending on context, and we don't currently differentiate. v1.5+ will add context-aware severity (a CRITICAL secret in a 5-year-old commit on a fork of a public dataset is a different alert than the same secret in last week's main-branch commit).

Redaction is not optional

We never store, log, or transmit the raw secret value. Ever. The discipline is implemented in secretpatterns.Redact() and applied at the earliest possible point in the pipeline — before the match leaves the package, before it hits the database, before it touches a log line.

The redaction format:

Strings ≥ 12 characters: first 4 + **** + last 4 (e.g. AKIA****MPLE).
Strings < 12 characters: ****, full mask. A 6-char secret with 4-and-4 visible exposes the entire string.
PEM blocks: only the header line (-----BEGIN RSA PRIVATE KEY-----), none of the body.
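The three rules above fit in a few lines of Go. This is a sketch written from the description, not the actual body of secretpatterns.Redact(). Counting by rune, trimming whitespace, and falling back to a full mask for a malformed PEM header are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	mask      = "****"
	pemPrefix = "-----BEGIN "
)

// Redact returns a display-safe form of a matched secret.
func Redact(secret string) string {
	s := strings.TrimSpace(secret)

	// PEM blocks: keep only the header line, never the body.
	if strings.HasPrefix(s, pemPrefix) {
		if end := strings.Index(s[len(pemPrefix):], "-----"); end >= 0 {
			return s[:len(pemPrefix)+end+len("-----")]
		}
		return mask // malformed header: mask everything
	}

	r := []rune(s)
	// Short strings: 4-and-4 would expose most or all of the value.
	if len(r) < 12 {
		return mask
	}
	return string(r[:4]) + mask + string(r[len(r)-4:])
}

func main() {
	fmt.Println(Redact("AKIAIOSFODNN7EXAMPLE"))
	fmt.Println(Redact("hunter"))
	fmt.Println(Redact("-----BEGIN RSA PRIVATE KEY-----\nMIIEow...\n-----END RSA PRIVATE KEY-----"))
}
```

With the AWS documentation placeholder key as input, the first call yields AKIA****MPLE, matching the example in the list above.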

Why this matters operationally:

The user's downstream tools never see the raw secret. Our API responses, alerting emails, webhook payloads, audit logs — all consume the redacted form. A user accidentally pasting an UnveilScan finding into a chat or ticket tracker doesn't propagate the leak.
Database breach scenario. If our discovered_assets table leaks tomorrow, the attacker gets metadata but no usable credentials. The blast radius of a compromise of UnveilScan is bounded by what we don't store.
Legal positioning. Storing leaked third-party credentials is at best a grey area, at worst CFAA / unauthorized-access territory in some jurisdictions. Storing only the redacted fingerprint + the file URL where it appeared keeps us strictly in the "we noticed and notified" lane, never the "we possess your credentials" lane.

This is non-negotiable. If you hear "we'll show you the raw secret in your dashboard", you're talking to a vendor that's not thinking clearly about the legal surface.

What we deliberately don't do

The line between "credential leak detection" and "active probing of someone else's account" is well-defined and we stay on the right side of it. Specifically:

We never validate a found secret against the provider's API. Tempting as it is to call aws sts get-caller-identity with an AKIA we found, or hit GitHub's /user with a ghp_ we matched, that's active probing of a third-party account. We do not have authorisation. The provider's audit log records the call. Legal exposure for zero added value: we already know the format matches — calling the API doesn't add useful signal.
We don't fork, clone, or download repos in bulk. The fetch path goes through GitHub's official content API (/repos/{owner}/{repo}/contents/{path} with Accept: application/vnd.github.raw+json), one file at a time, only files surfaced by code search, capped at 60 fetches per scan and 1 MiB per file. That keeps us within GitHub's Acceptable Use Policies.
We don't scrape raw.githubusercontent.com directly. Same content, different legal posture: scraping a CDN endpoint without API authentication is a different conversation with GitHub's lawyers than authenticated content API calls. We use the boring, contractual path.
We don't index secrets across users. A leak found in tenant A's scan is not visible to tenant B, even if tenant B happens to share infrastructure. Each user's discovered_assets rows are scoped by user_id and never leak across tenants.
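The fetch constraints described in this list (one file at a time through the contents endpoint, at most 60 fetches per scan, at most 1 MiB per file) can be sketched as follows. The function names, the bearer-token header, and the error handling are assumptions; the demo talks to a local stub server, not to GitHub.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

const (
	maxFetchesPerScan = 60
	maxFileBytes      = 1 << 20 // 1 MiB
)

// contentsURL builds /repos/{owner}/{repo}/contents/{path} under base,
// escaping each path segment separately so "/" separators survive.
func contentsURL(base, owner, repo, path string) string {
	segs := strings.Split(path, "/")
	for i, s := range segs {
		segs[i] = url.PathEscape(s)
	}
	return fmt.Sprintf("%s/repos/%s/%s/contents/%s",
		base, url.PathEscape(owner), url.PathEscape(repo), strings.Join(segs, "/"))
}

// fetchRaw downloads one file and rejects bodies larger than maxFileBytes.
func fetchRaw(c *http.Client, rawURL, token string) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.github.raw+json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	// Read one byte past the cap so oversized files can be detected.
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxFileBytes+1))
	if err != nil {
		return nil, err
	}
	if len(body) > maxFileBytes {
		return nil, fmt.Errorf("file exceeds %d bytes", maxFileBytes)
	}
	return body, nil
}

// fetchAll enforces the per-scan cap and skips files that fail to fetch.
func fetchAll(c *http.Client, urls []string, token string) map[string][]byte {
	if len(urls) > maxFetchesPerScan {
		urls = urls[:maxFetchesPerScan]
	}
	out := make(map[string][]byte, len(urls))
	for _, u := range urls {
		if b, err := fetchRaw(c, u, token); err == nil {
			out[u] = b
		}
	}
	return out
}

func main() {
	// A local stub stands in for the GitHub API so the sketch runs offline.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "accept=%s path=%s", r.Header.Get("Accept"), r.URL.Path)
	}))
	defer srv.Close()

	u := contentsURL(srv.URL, "octo-org", "demo", "config/staging.tf")
	got := fetchAll(srv.Client(), []string{u}, "")
	fmt.Println(string(got[u]))
}
```

Reading through io.LimitReader means an oversized file costs at most 1 MiB plus one byte of transfer before it is rejected.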

Why curated wins over crowdsourced

Trufflehog and Gitleaks both accept community PRs for new detectors. The pattern space grows organically. The honest reading: most community PRs are well-intentioned but under-tested. The new "Discord webhook" detector ships, fires on 12% of all scanned files (Discord URLs are very common in code as comments / examples), and the false-positive rate degrades the entire tool's utility.

Our curation discipline:

Add a pattern only when the format is fingerprintable to > 99% precision.
Test the new pattern against ~1000 known-clean files (the Linux kernel, freeCodeCamp, a sample of well-managed enterprise repos) — false positives must be 0.
Test against ~50 deliberately-leaked samples in our test corpus — true positives must be 100%.
If either bar fails, the pattern doesn't ship. We'd rather miss a class of secret than degrade the tool's overall trustworthiness.

This is slow. We add roughly one pattern per quarter. It's also why an UnveilScan finding is one you can act on immediately: if we say it's a Stripe live key, it's a Stripe live key.

The roadmap: context-aware severity

What we'd like to ship next, in order:

Commit age as a severity modifier. A CRITICAL secret in a commit from 2019 in a repo with no recent activity is probably long since rotated; the same finding in this morning's main push is an active incident. Same slug, different alert urgency.
File path heuristics. A secret in tests/fixtures/ or example.env is more often a deliberate placeholder than a real leak. config/production.yml is the opposite.
Cross-repo deduplication. The same AKIA appearing in 12 forks of a tutorial doesn't deserve 12 alerts.
A few more high-precision patterns. Cloudflare Workers AI key, Vercel deploy hooks, Notion integration tokens: all on the watchlist for when their adoption hits a threshold and the format stabilises.

What we won't ship: a "submit your own pattern" UI. The curation discipline is the product.

Where to look at the actual list

The 38 patterns, with their regexes and severity, live in internal/secretpatterns/patterns.go. The file is closed source, but the list above is the entire inventory. If you want to verify a specific pattern's behaviour, run a Recon scan on a domain you control where you've intentionally committed a test credential. The finding will tell you exactly which pattern matched, with the redacted sample for sanity-checking.

Find the secrets you forgot you committed

One Recon scan ingests every public GitHub repo that mentions your domain, runs the 38 patterns against the file content, and emails you within a minute if anything matches. We never store the raw value.

Run a Recon scan

UnveilScan Blog