Onboarding Architecture & Friction Mapping
Map and eliminate developer onboarding friction. Visualize dependency trees, enforce runtime parity, fix common local failures, and track time-to-first-PR.
A new contributor should go from git clone to a running stack and a first merged change without a single message asking "what am I missing?" Most teams fall far short of that because onboarding friction is invisible until it costs a day: a container that exits on boot, a service that cannot resolve its database, a runtime that drifts a patch version from staging. This guide treats local environment provisioning as a measurable engineering discipline. It covers the failure points that block first boot, the dependency maps that explain what a service actually needs, the parity checks that keep local execution honest against staging, the metrics that quantify the cost of friction, and the README-driven automation that collapses setup into one command. It pairs closely with containerized local environment patterns and with environment sync, secrets, and CI parity.
Diagnosing First-Boot Failures
The first hour of onboarding is dominated by a handful of recurring breakages: containers that exit immediately, services that cannot resolve each other over the Compose network, ports already in use, and bind-mounted files the container user cannot write. Each has a deterministic signature and a deterministic fix, so the right move is to script the triage rather than answer it in Slack every week. Catalogue these against common local failure points so a new engineer can self-serve the diagnosis before asking for help.
#!/usr/bin/env bash
set -euo pipefail
# First-boot triage: surface exited/unhealthy containers and the reason
docker compose ps -a --format '{{.Name}} {{.State}} {{.Status}}'
docker compose ps -a --filter status=exited --format '{{.Name}}' \
| while read -r svc; do
[ -n "$svc" ] || continue
echo "=== last logs: $svc ==="
docker compose logs --tail 20 "$svc"
done
Mapping Service Dependencies
A service rarely fails alone. It fails because something it depends on never came up, came up in the wrong order, or formed a circular wait with another build target. Making that graph explicit turns "the stack is broken" into "the auth service is missing a depends_on: db." Build the adjacency list from your Compose file and lockfiles, then keep it current as the canonical answer to "what does this service need?" The full extraction and rendering workflow lives under dependency tree visualization.
#!/usr/bin/env bash
set -euo pipefail
# Emit a service -> dependency adjacency list straight from Compose
docker compose config --format json \
| jq -r '
.services
| to_entries[]
| .key as $svc
| (.value.depends_on // {} | keys[]? ) as $dep
| "\($svc) -> \($dep)"
'
Enforcing Runtime Parity
"Works on my machine" is not a personality trait; it is an unpinned base image, a runtime that drifted a patch version, or an environment variable that exists in staging and nowhere else. Parity means local execution shares an explicit contract with staging: identical image digests, identical runtime versions, identical declared environment. Pin digests, capture both runtimes, and diff them before the difference becomes a production incident. The script templates and thresholds belong to runtime parity frameworks.
#!/usr/bin/env bash
set -euo pipefail
# Compare local and staging runtime fingerprints
LOCAL=$(node -p 'JSON.stringify({v:process.version,arch:process.arch})')
STAGING=$(ssh staging "node -p 'JSON.stringify({v:process.version,arch:process.arch})'")
if [ "$LOCAL" != "$STAGING" ]; then
echo "DRIFT: local=$LOCAL staging=$STAGING" >&2
exit 1
fi
echo "Runtime parity OK: $LOCAL"
Measuring Time-to-First-PR
If you cannot measure onboarding, you cannot tell whether a change helped or hurt. Time-to-first-PR (TTFPR) is the durable proxy: the wall-clock interval from a contributor's first clone to their first merged, human-authored pull request. Instrument the boundaries deterministically, exclude bots and CI commits, and report the median and p90 per cohort so outliers surface. The boundary definitions and dashboards are detailed in time-to-first-PR metrics.
#!/usr/bin/env bash
set -euo pipefail
# Record an onboarding milestone with a UTC timestamp
EVENT_TYPE="${1:?usage: record-event <clone|first-pr>}"
TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
REPO_URL=$(git remote get-url origin 2>/dev/null || echo unknown)
curl -fsS -X POST https://metrics.internal/api/v1/onboarding \
-H 'Content-Type: application/json' \
-d "{\"event\":\"${EVENT_TYPE}\",\"ts\":\"${TIMESTAMP}\",\"repo\":\"${REPO_URL}\"}"
Automating Setup From the README
Every friction point above collapses if the repository ships a single bootstrap command that provisions toolchains, generates .env, starts the stack, and self-verifies. The README stops being prose to read and becomes automation to run: one make target, one health check, zero tribal knowledge. Treat the documented commands as the source of truth and execute them in CI so they never rot. The patterns for this live under README-driven automation.
# Makefile — one-command onboarding
.PHONY: bootstrap
bootstrap:
@cp -n .env.example .env || true
@docker compose up -d --wait
@./scripts/health-check.sh
@echo "Environment ready. Open a PR."
Cross-Cutting Concerns
The same host-level caveats recur across every section above, so treat them as a shared checklist for any onboarding change.
macOS (Docker Desktop): bind-mounted files keep host ownership (commonly UID
501), so a container expecting UID1000hitsEACCESon writes; passuser: "${HOST_UID}:${HOST_GID}". VirtioFS I/O is slow for large dependency trees — allocate ≥8GB to the VM. WSL2: keep the repository on the Linux filesystem (~/code, not/mnt/c) ornpm ciand file-watchers degrade 40–60%. Normalize line endings withcore.autocrlf=falseso.envand shell scripts parse. Apple Silicon (ARM64): pinplatform: linux/amd64only for images lacking an arm64 manifest; otherwise prefer native multi-arch images, since QEMU emulation masks architecture-specific bugs and slows healthcheck polling.
Verification Suite
Wire every concern above into a single target so any contributor — or CI job — can confirm the environment is sound in seconds.
# Makefile — onboarding verification
.PHONY: verify-onboarding
verify-onboarding:
@set -e; \
docker compose config --quiet; \
docker compose ps --format '{{.Name}} {{.State}}' | grep -qv exited; \
./scripts/parity-check.sh --mode ci; \
./scripts/health-check.sh; \
echo "Onboarding baseline OK"