CI Infrastructure¶

Purpose: Documentation for CI/CD infrastructure, self-hosted runners, health checks, and optimization. Version: 2.15.0 (February 2026)

Overview¶

Assay uses a hybrid CI infrastructure: - GitHub-hosted runners: Standard CI jobs (build, test, lint) - Self-hosted runner: eBPF/LSM kernel tests (requires privileged access)

Self-Hosted Runner: `assay-bpf-runner`¶

Architecture¶

flowchart TB
    subgraph GitHub["GitHub Actions"]
        Workflow[Kernel Matrix CI]
        Queue[Job Queue]
    end

    subgraph Local["Local Machine"]
        HealthCheck[health_check.sh]
        Cron[Cron Job]
    end

    subgraph Multipass["Multipass VM"]
        Runner[actions-runner]
        Service[systemd service]
        Cache[Actions Cache]
    end

    Workflow --> Queue
    Queue --> Runner
    Cron --> HealthCheck
    HealthCheck --> Runner
    HealthCheck --> Cache
    Runner --> Service

Location¶

Component	Path
Health check script	`infra/bpf-runner/health_check.sh`
Setup script	`infra/bpf-runner/setup_local_multipass.sh`
Register script	`infra/bpf-runner/register_local.sh`
VM name	`assay-bpf-runner`
Runner directory	`/opt/actions-runner`
Runner user	`github-runner`

Health Check Features¶

The health check script (health_check.sh) provides self-healing capabilities:

Feature	Trigger	Action
Stale Job Cleanup	Jobs >4 hours in queue	Auto-cancel
Superseded Run Cancel	Multiple runs for same branch	Cancel older runs
PR Priority	Many PRs waiting, push runs queued	Cancel push runs
Cache Corruption	"Can't find action.yml" errors	Clear cache, rerun jobs
Runner Offline	Status != online with queued jobs	Full recovery
Clock Drift	Time difference >60s	NTP sync

Health Check Commands¶

# Full health check (runs all maintenance)
./health_check.sh

# Show status
./health_check.sh --status

# Manual recovery
./health_check.sh --recover

# Queue management
./health_check.sh --cancel-stale        # Cancel jobs >4h old
./health_check.sh --cancel-superseded   # Cancel duplicate runs
./health_check.sh --prioritize-prs      # Cancel push runs for PRs
./health_check.sh --optimize-queue      # All queue optimizations

# Cache management
./health_check.sh --clean-cache         # Remove stale entries
./health_check.sh --clear-cache         # Force clear all cache
./health_check.sh --heal-cache          # Detect + fix + rerun

Environment Variables¶

Variable	Default	Description
`STALE_JOB_HOURS`	4	Hours before job considered stale
`ACTIONS_CACHE_MAX_AGE_DAYS`	7	Days before cache entry is stale
`VM_NAME`	assay-bpf-runner	Multipass VM name
`RUNNER_NAME`	assay-bpf-runner	GitHub runner name
`RUNNER_DIR`	/opt/actions-runner	Runner installation path

Cron Setup¶

The health check runs every 5 minutes:

# Install cron job
./health_check.sh --install-cron

This adds:

*/5 * * * * /path/to/health_check.sh >> /tmp/assay-runner-health.log 2>&1

CI Optimization: Kernel Matrix Skip¶

Problem¶

Dependabot PRs that only update Cargo.toml/Cargo.lock (pure dependency bumps) don't need kernel matrix tests. These tests take 5-10 minutes each on the self-hosted runner.

Solution¶

The kernel-matrix.yml workflow now: 1. Checks if eBPF-related files changed 2. Skips heavy self-hosted runner tests for pure dep bumps 3. Still runs lint and build jobs (on GitHub-hosted runners)

flowchart TD
    PR[Pull Request] --> Check{eBPF files<br/>changed?}
    Check -->|Yes| Full[Full Matrix Test]
    Check -->|No| Skip[Skip Matrix]

    Full --> Lint[Lint]
    Full --> Build[Build ARM]
    Full --> Matrix[Kernel 5.15 + 6.6]

    Skip --> Lint2[Lint]
    Skip --> Build2[Build ARM]
    Skip --> Summary[Summary: Skipped]

Files Triggering Full Matrix¶

The following patterns trigger full kernel matrix tests: - crates/assay-ebpf/** - crates/assay-monitor/** - crates/assay-evidence/** - scripts/ci/** - .github/workflows/kernel-matrix.yml

NOT triggering full matrix: - Cargo.toml / Cargo.lock only changes - Documentation changes - Other crate changes

Implementation¶

# .github/workflows/kernel-matrix.yml
check-ebpf-changes:
  name: Check eBPF Changes
  runs-on: ubuntu-latest
  outputs:
    ebpf_changed: ${{ steps.check.outputs.ebpf_changed }}
  steps:
    - name: Check for eBPF-related changes
      id: check
      run: |
        EBPF_PATTERNS="crates/assay-ebpf|crates/assay-monitor|..."
        if git diff --name-only | grep -qE "$EBPF_PATTERNS"; then
          echo "ebpf_changed=true" >> "$GITHUB_OUTPUT"
        else
          echo "ebpf_changed=false" >> "$GITHUB_OUTPUT"
        fi

matrix-test:
  if: needs.check-ebpf-changes.outputs.ebpf_changed == 'true'
  # ... matrix test configuration

Queue Management¶

Automatic Cleanup¶

The health check performs these queue optimizations:

Stale Jobs: Jobs waiting >4 hours are cancelled (configurable)
Superseded Runs: When multiple runs exist for the same branch, older ones are cancelled
PR Priority: When >5 PR runs are waiting, push runs are cancelled

Manual Queue Operations¶

# View queue status
./health_check.sh --status

# Cancel all stale jobs
./health_check.sh --cancel-stale

# Cancel superseded (duplicate) runs
./health_check.sh --cancel-superseded

# Prioritize PR runs
./health_check.sh --prioritize-prs

# All optimizations at once
./health_check.sh --optimize-queue

Using GitHub CLI¶

# List queued runs
gh run list --repo Rul1an/assay --status queued

# Cancel specific run
gh run cancel <run_id> --repo Rul1an/assay

# Rerun failed jobs
gh run rerun <run_id> --repo Rul1an/assay --failed

Cache Management¶

Actions Cache Issues¶

The _actions cache can become corrupted, causing "Can't find action.yml" errors.

Symptoms:

Error: Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under
'/opt/actions-runner/_work/_actions/actions/checkout/<SHA>'

Auto-Heal Process: 1. Health check detects failures with this error pattern 2. Clears the entire actions cache 3. Reruns affected jobs

Manual Fix:

# Force clear cache
./health_check.sh --clear-cache

# Or detect + fix + rerun
./health_check.sh --heal-cache

Cache Locations¶

Cache	Path	Cleanup
Actions cache	`/opt/actions-runner/_work/_actions/`	Auto (7 days)
Rust cache	In workspace	By workflow
Docker cache	System	`docker system prune`

Runner Recovery¶

When Recovery Triggers¶

Runner status is "offline" on GitHub
Queued jobs are waiting
Runner service is not active

Recovery Steps¶

Time Sync: NTP synchronization (fixes token expiry from clock drift)
Token Generation: New registration token from GitHub API
Config Cleanup: Stop service, remove old config
Reconfigure: Register runner with new token
Start Service: Install and start systemd service
Verification: Check runner is online

Manual Recovery¶

# Full recovery
./health_check.sh --recover

# Or step by step in VM:
multipass shell assay-bpf-runner
cd /opt/actions-runner
sudo ./svc.sh stop
sudo ./svc.sh uninstall
./config.sh remove
# Get new token from GitHub
./config.sh --url https://github.com/Rul1an/assay --token <TOKEN> --labels self-hosted,linux,x64,bpf-lsm,assay-bpf-runner --unattended --replace
sudo ./svc.sh install
sudo ./svc.sh start

Workflows Overview¶

`ci.yml` - Main CI¶

Runs on: GitHub-hosted (ubuntu, macos, windows)
Jobs: Build, Test, Clippy, Criterion benches
Trigger: Push to main, PRs

`kernel-matrix.yml` - Kernel Tests¶

Runs on: Self-hosted (assay-bpf-runner)
Jobs: Lint, Build ARM, Kernel 5.15/6.6 tests
Trigger: PRs touching eBPF code, push to main
Optimization: Skips matrix for pure dep bumps

`perf_pr.yml` - Performance¶

Runs on: GitHub-hosted
Jobs: Criterion benchmarks with Bencher
Note: --err flag removed (alerts are warnings, not failures)

`assay-security.yml` - Security Validation¶

Runs on: GitHub-hosted
Jobs: Policy validation on the repo itself

Monitoring¶

Status Dashboard¶

# Quick status
./health_check.sh --status

# Output includes:
# - VM state (Running/Stopped)
# - GitHub runner status (online/offline)
# - Service status (active/inactive)
# - Queue stats (queued, in_progress, stale)
# - Cache health

Logs¶

Log	Location	Purpose
Health check	`/tmp/assay-runner-health.log`	Cron job output
Runner	`/opt/actions-runner/_diag/`	Runner diagnostics
Service	`journalctl -u actions.runner.*`	Systemd logs

Alerts¶

The health check automatically: - Recovers offline runners - Clears corrupted caches - Cancels stale jobs - Reruns failed jobs due to cache issues

Troubleshooting¶

Runner Not Picking Up Jobs¶

Check status: ./health_check.sh --status
Verify labels match workflow: self-hosted,linux,x64,bpf-lsm,assay-bpf-runner
Check runner is online in GitHub Settings > Actions > Runners
Run recovery: ./health_check.sh --recover

"Can't find action.yml" Errors¶

Clear cache: ./health_check.sh --clear-cache
Rerun failed jobs: ./health_check.sh --heal-cache

Jobs Stuck in Queue¶

Check runner status: ./health_check.sh --status
Cancel stale jobs: ./health_check.sh --cancel-stale
Check if runner is busy: Only 1 job at a time on self-hosted

Clock Drift Issues¶

Symptoms: "Token expired" errors despite recent token

Fix: Health check auto-syncs time, or manually:

multipass exec assay-bpf-runner -- sudo timedatectl set-ntp true
multipass exec assay-bpf-runner -- sudo systemctl restart systemd-timesyncd

Architecture Diagrams - CI/CD flow diagrams
User Flows - Flow 2: CI/CD Regression Gate
Quick Reference - GitHub Action usage