Running Scrutineer
The Linux Foundation’s Alpha Omega project has a tool called Scrutineer. The TL;DR is that it uses Claude Code to run a variety of security-related tests. One of the targets for this project is no doubt Anthropic’s Mythos model, the model behind Project Glasswing, which Alpha Omega is a part of.
Ollama recently gained the ability to act as the backend for Claude Code, which means we should be able to run Scrutineer against our own Ollama backend.
I want to start out by saying I’m doing this in a VM that only runs Scrutineer. We have to disable a bunch of security features, which probably don’t really work anyway, so the VM helps keep things segmented.
Scrutineer has pretty reasonable getting started directions. All we really have to do is make a few changes to the configuration file. Here are the changes I made:
# Where scan workspaces and the sqlite DB live.
data: ./data
# Claude effort level for model-backed skills.
effort: high
These are probably pretty self-explanatory. I’m not sure if effort: high does anything in my environment, but meh.
# Clone strategy. "shallow" (default) clones with --depth 1; "full"
# clones the entire history. Switching from shallow to full unshallows
# existing clones on next scan. Full clones use significantly more disk
# and are slower for large repos — only flip if a skill needs history.
clone: shallow
# Local directories to load SKILL.md files from. Repeatable via -skills
# on the command line; here you can list them directly.
skills:
- ./skills
# Disable the docker runner even if docker is available on the host.
no_docker: true
The only strange thing here is turning off Docker. I couldn’t get Scrutineer to work with Docker. I don’t think this was my fault. I’ll file a bug eventually.
# How many scans run in parallel. Default is 4; raise on bigger hosts,
# lower if you want model calls serialized.
concurrency: 1
# Wall-clock limit for one scan. Scans that exceed this are killed and
# marked failed. Go duration syntax ("30m", "1h30m", "2h").
scan_timeout: 12h
# Global --max-turns passed to claude-code. 0 (default) defers to the
# per-skill value (scrutineer.max_turns in SKILL.md metadata, default 30).
# When set, acts as a fallback for skills that don't declare their own cap.
max_turns: 100
Here is the stuff that matters for us. First I only want one thing talking to Ollama at a time, so the concurrency of 1 makes sense. I also set the timeout incredibly high. The longest scan I’ve seen so far is 5 hours. I also increased the max turns. The local models were running out of turns when it was 30. 100 seems to be working better.
# Custom Anthropic API base URL. When set, the hostname is automatically
# added to the egress allowlist. Falls back to the ANTHROPIC_BASE_URL
# environment variable when empty.
anthropic_base_url: http://framework:11434
# The model picker in the UI. Replacing this list replaces the built-in
# set (Sonnet, Opus). Leave it out to keep the built-in list.
models:
- name: Qwen
  id: qwen3.6:35b-a3b-bf16
# - name: Opus
#   id: claude-opus-4-7
# Pin the default model id. When set, this wins over the "first entry
# in models wins" rule. A skill can declare its own preferred model
# (scrutineer.model in SKILL.md metadata); per-skill wins over this
# default, and an explicit per-scan model wins over both.
default_model: qwen3.6:35b-a3b-bf16
This is where we point Claude Code at our own infrastructure. It’s pretty simple once you have the right config.
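Before kicking off a scan that may run for hours, it can be worth confirming that the Ollama host actually has the model loaded. The following is a minimal sketch, not part of Scrutineer: it assumes the server exposes Ollama's standard `/api/tags` model-listing endpoint, and the helper names (`tags_url`, `model_names`, `model_is_available`) are made up for this example. The base URL and model id are the ones from the config above.

```python
import json
import urllib.request


def tags_url(base_url: str) -> str:
    """Build the URL of Ollama's model-listing endpoint from a base URL."""
    return base_url.rstrip("/") + "/api/tags"


def model_names(tags_response: dict) -> list[str]:
    """Pull the model names out of a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def model_is_available(base_url: str, model_id: str, timeout: float = 10.0) -> bool:
    """Return True if model_id is listed by the Ollama server at base_url.

    Raises urllib.error.URLError if the host cannot be reached.
    """
    with urllib.request.urlopen(tags_url(base_url), timeout=timeout) as resp:
        return model_id in model_names(json.load(resp))


# Example call, using the values from the Scrutineer config above:
#   model_is_available("http://framework:11434", "qwen3.6:35b-a3b-bf16")
```

If the call returns False, pulling the model on the Ollama host first saves finding out from a failed scan.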
Configuring Claude Code
We also have to disable a bunch of the security in Claude Code. Our local models screw this up pretty badly, so I just turned it all off. All of this belongs in the Claude settings.json file.
$ cat ~/.claude/settings.json
{
  "sandbox": {
    "enabled": false
  },
  "skipDangerousModePermissionPrompt": true,
  "permissions": {
    "defaultMode": "bypassPermissions",
    "allow": [
      "Bash(*)",
      "Read(*)",
      "Write(*)",
      "Edit(*)",
      "Glob(*)",
      "Grep(*)",
      "WebFetch(*)",
      "WebSearch(*)",
      "NotebookEdit(*)",
      "Task(*)"
    ]
  }
}
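Since this file turns off nearly every guardrail, it seems worth double-checking which machine it lives on before starting anything. Here is a small sketch that summarizes a settings file so you can confirm the wide-open config is on the VM and not on your workstation. It only reads the key names that appear in the file above; how Claude Code interprets them is not something this script knows about, and `summarize_settings` and `load_settings` are hypothetical helpers for illustration.

```python
import json
from pathlib import Path


def summarize_settings(settings: dict) -> dict:
    """Report the sandbox flag, default mode, and wildcard tool rules."""
    permissions = settings.get("permissions", {})
    return {
        # None means the key is absent from the file.
        "sandbox_enabled": settings.get("sandbox", {}).get("enabled"),
        "default_mode": permissions.get("defaultMode"),
        # Tool names whose allow rule is the catch-all "Tool(*)" form.
        "wildcard_tools": sorted(
            rule.removesuffix("(*)")
            for rule in permissions.get("allow", [])
            if rule.endswith("(*)")
        ),
    }


def load_settings(path: Path = Path.home() / ".claude" / "settings.json") -> dict:
    """Read and parse the settings file at the path used in this post."""
    return json.loads(path.read_text())


# Example call on the machine you are about to scan from:
#   print(summarize_settings(load_settings()))
```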
What about the scans?
So what have I scanned? How long does it take? And what did I find?
The first two things I scanned were Syft and Grype. Since I am the person responsible for the vulnerability disclosures in those, I gave myself permission :)
I didn’t find anything real, which didn’t make me too sad. I think I can make this better with some slightly different prompts. But that’s a project for another day.
I also scanned Curl to answer some questions about the price of scanning a project like Curl. That conversation came from Daniel’s blog post “Mythos finds a curl vulnerability”. Thankfully I didn’t find anything real, so nothing to report to Curl.
So let’s talk about wall clock time first. Syft and Curl took about 6 hours, Grype took 7. Remember this is a single worker running at a time. This is fine since I can start a scan and do other things. I don’t need this running fast.
The cost is a bit more interesting. Scrutineer has a cost estimator based on Anthropic’s pricing for input and output tokens. Each of these scans would cost around $50 in Anthropic tokens. That’s just the initial discovery. Scrutineer can also do a deeper analysis of any findings, which would rack up more time and cost.
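I haven't looked at how Scrutineer computes its estimate, but a token-based estimate generally comes down to the arithmetic below. This is a sketch under stated assumptions: `estimate_cost` is a made-up helper, and the prices in the example are placeholders, not Anthropic's actual rates. Plug in the current per-million-token prices for whichever model you are comparing against.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the dollar cost of a scan from its token counts.

    Prices are in dollars per million tokens, billed separately for
    input and output.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder numbers only: 2M input tokens and 0.5M output tokens at
# hypothetical rates of $3 and $15 per million tokens.
example = estimate_cost(2_000_000, 500_000, 3.0, 15.0)  # 13.5
```

Input tokens usually dominate the count for this kind of work, since the model keeps re-reading source files and tool output on every turn.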
In Conclusion
So what does this all mean? It means I can now easily scan things for security vulnerabilities without having to worry about going broke. The quality is pretty bad at the moment, but if history repeats itself, in 6-12 months we will start seeing pretty capable public models that can do this work well enough.
In the meantime I’ll keep seeing what I can accomplish in my basement.