Attackers can inject indirect prompts in normal-looking repositories to trick Claude Code into spawning a reverse shell.
Reasoner Text, vision Text World understanding, grounding, physical reasoning, task planning, action forecasting, embodied agent reasoning, and autonomous system decision making Generator Text, vision ...
Skill Eval Harness is a Python CLI for testing whether an Agent Skill changes observable output. It reads evals/shared-benchmark.json, emits answer-key-safe task rows, grades files under eval-runs/, ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results