semgrep integration

Semgrep is a multi-language static analysis engine that supports Python, Go, Java, JavaScript/TypeScript, Ruby, C#, PHP, Scala, Kotlin, Swift, Terraform, Dockerfile, generic, and more. It ships with curated rulesets (p/security-audit, p/owasp-top-ten, language-specific packs) and accepts user-authored YAML rules.

Cilock doesn't replace Semgrep. It runs the same semgrep ... --sarif --output ... command you already use and turns the SARIF report into a signed v0.3 in-toto attestation that records the exact argv, the materials Semgrep read, the SARIF file Semgrep produced, and the structured findings — all in one envelope that policy can evaluate later.

Upstream: Semgrep · Semgrep Inc. · LGPL-2.1-only
Category: vulnerability-scan (primary)
Catalog source: catalog-only (detected; output captured via a format attestor)
Emits format: sarif
Recommended trace: off (no syscall tracing needed)
Detected when
  • preargv_prefix: semgrep

Confirm cilock detects it:

cilock plan --format=json -- semgrep [...]

The facts in this box are generated from the cilock binary's own catalog (cilock tools list). Do not hand-edit — run npm run gen:catalog.

Validated invocation

cilock run --step semgrep-scan \
  --signer-file-key-path _validation/key.pem \
  --outfile attestation.json \
  --attestations sarif,environment,git \
  --enable-archivista=false \
  -- semgrep --config p/security-audit --sarif --output semgrep.sarif fixture/

--config p/security-audit names a Semgrep Registry ruleset and runs without a Semgrep account or API key; Semgrep fetches and caches the pack itself. --config auto works too, but it requires network access to the Semgrep Registry so that Semgrep can resolve the project-appropriate ruleset.

The trailing fixture/ is the path Semgrep scans — substitute the source directory you care about.

What gets captured

A successful run emits a DSSE envelope with six predicate entries:

Predicate | What it records
https://aflock.ai/attestations/environment/v0.1 | OS, arch, env vars (filtered), CI hints
https://aflock.ai/attestations/git/v0.1 | repo state: head SHA, branch, dirty bit
https://aflock.ai/attestations/material/v0.3 | Merkle tree of files Semgrep read
https://aflock.ai/attestations/command-run/v0.1 | the literal semgrep argv + exit code + stdout/stderr digests
https://aflock.ai/attestations/product/v0.3 | Merkle tree of files Semgrep produced (the SARIF report)
https://aflock.ai/attestations/sarif/v0.1 | parsed SARIF: driver name, version, ruleset, structured findings

The sarif predicate is what your verify-time rego gate reads. The command-run predicate records the exact argv, including the --config value, so the ruleset selection cannot be changed after the fact without invalidating the signed envelope.
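
For pipelines that would rather script this than eyeball it, the six predicate types can be checked for presence. The following is a minimal Python sketch, not cilock's own tooling. It assumes the envelope layout implied by the jq commands on this page (a base64 payload whose decoded JSON has predicate.attestations[].type). It does not verify the signature, and make_envelope is a hypothetical helper that exists only to exercise the check.

```python
import base64
import json

# Predicate type URIs copied from the table on this page.
EXPECTED_TYPES = {
    "https://aflock.ai/attestations/environment/v0.1",
    "https://aflock.ai/attestations/git/v0.1",
    "https://aflock.ai/attestations/material/v0.3",
    "https://aflock.ai/attestations/command-run/v0.1",
    "https://aflock.ai/attestations/product/v0.3",
    "https://aflock.ai/attestations/sarif/v0.1",
}


def decode_statement(envelope: dict) -> dict:
    """Decode the base64 DSSE payload into the in-toto statement."""
    return json.loads(base64.b64decode(envelope["payload"]))


def missing_predicates(envelope: dict) -> set:
    """Return the expected predicate types that the envelope lacks."""
    statement = decode_statement(envelope)
    present = {entry["type"] for entry in statement["predicate"]["attestations"]}
    return EXPECTED_TYPES - present


def make_envelope(types) -> dict:
    """Hypothetical helper: build a synthetic envelope for demonstration."""
    statement = {"predicate": {"attestations": [{"type": t} for t in types]}}
    payload = base64.b64encode(json.dumps(statement).encode()).decode()
    return {"payload": payload}
```

Against a real file, the call would be missing_predicates(json.load(open("attestation.json"))); an empty set means all six entries are present.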

Why this shape

cilock run -- <tool> <args> invokes the tool directly. Earlier examples wrapped scanners in bash -c "cp ..."; that broke causality:

  • command-run recorded bash plus a -c string, not the actual semgrep argv — so consumers couldn't see which ruleset ran.
  • The spy / ptrace-based attestors traced cp, not semgrep — so material→product causality was wrong.
  • sarif still had to scrape a file that cilock never observed being produced inside the traced process tree.

With the direct invocation:

  • command-run records ["semgrep", "--config", "p/security-audit", "--sarif", "--output", "semgrep.sarif", "fixture/"] verbatim.
  • product captures semgrep.sarif as a real output of the traced process.
  • sarif parses that same file — and the digest matches what product recorded.
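
The difference between the two shapes is easy to test mechanically once the command-run argv has been extracted. The sketch below is illustrative only: it assumes the argv arrives as a list of strings (as in the array shown above), and both the SHELL_WRAPPERS list and the check_argv name are hypothetical choices rather than anything cilock defines.

```python
import os

# Illustrative list of shells that would hide the real scanner argv.
SHELL_WRAPPERS = {"sh", "bash", "dash", "zsh"}


def check_argv(cmd, tool="semgrep"):
    """Return None when argv[0] is the tool itself, else a short reason."""
    if not cmd:
        return "empty argv"
    # Accept both a bare name and an absolute path to the tool.
    program = os.path.basename(cmd[0])
    if program in SHELL_WRAPPERS:
        return f"wrapped in {program}"
    if program != tool:
        return f"unexpected program {program}"
    return None
```

A policy could reject any attestation for which check_argv returns a reason, since a wrapped command no longer shows which ruleset ran.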

Validate it locally

# Confirm all six predicates are present
jq -r '.payload' attestation.json | base64 -d \
  | jq -r '.predicate.attestations[].type'

# Confirm command-run captured the real semgrep argv (not bash -c)
jq -r '.payload' attestation.json | base64 -d \
  | jq '.predicate.attestations[]
        | select(.type=="https://aflock.ai/attestations/command-run/v0.1")
        | .attestation.cmd'

# Inspect SARIF findings count + tool driver
jq -r '.payload' attestation.json | base64 -d \
  | jq '.predicate.attestations[]
        | select(.type=="https://aflock.ai/attestations/sarif/v0.1")
        | .attestation
        | {tool: .report.runs[0].tool.driver.name,
           findings: ([.report.runs[].results[]] | length),
           report: .reportFileName}'

Validated against cilock dev (v0.3 line) + Semgrep OSS 1.157.0 + ruleset p/security-audit — produces one finding on the bundled fixture/vuln.py.
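
The findings count can also drive a simple pass/fail gate. This is a rough Python sketch of that idea, not a replacement for a rego policy. It takes the decoded statement (the JSON that jq -r '.payload' attestation.json | base64 -d prints) and relies on the same field paths as the jq filter above; the function names and the max_findings threshold are assumptions made for illustration.

```python
SARIF_TYPE = "https://aflock.ai/attestations/sarif/v0.1"


def count_findings(statement: dict) -> int:
    """Count SARIF results across all runs in the sarif attestation."""
    for entry in statement["predicate"]["attestations"]:
        if entry["type"] == SARIF_TYPE:
            runs = entry["attestation"]["report"]["runs"]
            # A run may omit "results", so treat a missing key as empty.
            return sum(len(run.get("results") or []) for run in runs)
    raise LookupError("no sarif attestation in statement")


def passes_gate(statement: dict, max_findings: int = 0) -> bool:
    """True when the number of findings does not exceed the threshold."""
    return count_findings(statement) <= max_findings
```

With the single finding reported for the fixture, passes_gate(statement) would return False under the default threshold of zero.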

Semgrep rulesets

The --config flag accepts:

  • Bundled registry packs: p/security-audit, p/owasp-top-ten, p/cwe-top-25, p/r2c-security-audit, language packs like p/golang, p/python, p/javascript, p/typescript, p/java, p/ruby, p/dockerfile, p/terraform, p/kubernetes. Most run network-free once Semgrep has bootstrapped them.
  • --config auto: Semgrep inspects the repo and picks the relevant packs. Requires network access to the Registry; no account needed.
  • Custom rule files: --config ./my-rules.yml or --config ./rules/ for repo-local YAML. The rule file itself shows up under the material predicate, so the attestation records both the code that was scanned and the rules it was scanned against.
  • Multiple configs: pass --config more than once; Semgrep unions the rules.

For policy purposes, prefer pinning to a specific pack (or a vendored rule file) so the command-run predicate is reproducible across runs.
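
One way to enforce that preference is to inspect the recorded argv for its --config values. The sketch below is a heuristic written for illustration: it handles only the --config VALUE and --config=VALUE spellings, treats anything starting with p/ as a registry pack, and labels everything else "local-or-other". The function names and category labels are hypothetical.

```python
def config_values(argv):
    """Collect every value passed as --config VALUE or --config=VALUE."""
    values = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--config" and i + 1 < len(argv):
            values.append(argv[i + 1])
            i += 2
            continue
        if arg.startswith("--config="):
            values.append(arg.split("=", 1)[1])
        i += 1
    return values


def classify(value):
    """Heuristically label a --config value."""
    if value == "auto":
        return "auto"
    if value.startswith("p/"):
        return "registry-pack"
    return "local-or-other"


def is_pinned(argv):
    """True when at least one --config is given and none of them is auto."""
    values = config_values(argv)
    return bool(values) and all(classify(v) != "auto" for v in values)
```

Note that a "registry-pack" result only shows the pack was named explicitly; the pack contents can still change upstream unless the rules are vendored.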

FAQ

Does cilock support Semgrep? Yes. Semgrep emits SARIF and the sarif attestor parses it; the validated invocation above is in attestor-compliance-examples/tool-semgrep-sarif.

Which Semgrep rulesets does the attestor work with? All of them. The attestor reads the SARIF emitted by Semgrep — it doesn't care which --config produced it. Bundled packs (p/security-audit, p/owasp-top-ten, language packs), --config auto, and custom YAML rules all flow through the same predicate.

Do I need a Semgrep account or API key? No. Neither --config p/<pack> nor --config ./rules.yml needs one. --config auto needs network access to the Semgrep Registry but still doesn't require auth. Semgrep AppSec Platform / Pro features (Pro engine, supply chain, secrets) are separate products and are not required by cilock.

Does cilock pin the ruleset version? Not by itself. Cilock records the exact argv in the command-run predicate and the digests of every file Semgrep read in the material predicate. If you pass a local rule file to --config, that file's hash is captured. If you reference a remote pack like p/security-audit, the ruleset name is recorded, but Semgrep itself fetches and caches the pack. Pin Semgrep's version, and vendor the rules if you need byte-level reproducibility.

This page is generated from the cilock tool catalog. Don't edit it here — the source is attestation/detection/docs/semgrep.doc.md in aflock-ai/rookery. The same catalog powers cilock tools show semgrep in the CLI, so a single edit keeps the terminal and the website in sync.

