TDD for Infrastructure: Tests Aren't Just for Code
Most infrastructure work follows a familiar pattern:
1. Write configuration
2. Deploy
3. Check if it works
4. Debug when it doesn't
5. Repeat
This is manual testing. It works, but it has problems:

- No regression protection
- Knowledge leaves with the deployer
- "Works on my machine" syndrome
- Fear of changes
What if infrastructure had the same rigor as code?
1. Write a test that expects the infrastructure to work
2. Run it, watch it fail (infrastructure doesn't exist yet)
3. Build the minimal infrastructure to pass the test
4. Refactor if needed
5. Test still passes? Good. Commit.
Now you have:

- Regression protection
- Documentation in test form
- Confidence to make changes
- "Works everywhere the tests run"
```typescript
import { execSync } from 'child_process';

describe('Tailscale', () => {
  it('should resolve internal services via MagicDNS', () => {
    const result = execSync('tailscale ping griak-pi-hole').toString();
    expect(result).toMatch(/pong from griak-pi-hole/);
  });
});
```
Now run the test. It fails. Tailscale isn't installed yet.
Install Tailscale. Configure it. Run the test. It passes.
Six months later, someone changes the network config. The test fails. They know immediately something broke.
In TDD, the red phase is valuable. A failing test tells you:

- What you're trying to build
- That the test is actually testing something
- Where to focus your effort
For infrastructure, the red phase is the same:
```typescript
it("should have SSL certificate from Let's Encrypt", () => {
  // HTTP response headers never carry the certificate issuer,
  // so ask the TLS layer directly instead of grepping curl output.
  const issuer = execSync(
    'echo | openssl s_client -connect griak.net:443 2>/dev/null | openssl x509 -noout -issuer'
  ).toString();
  expect(issuer).toContain("Let's Encrypt");
});
```
This fails immediately. Good. Now I know I need to:

1. Configure Caddy for auto-HTTPS
2. Ensure domain resolves to server
3. Open port 443
4. Wait for certificate issuance
Each step brings me closer to green.
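Step 4 is checkable in the same spirit: once issuance completes, the issuer appears in the certificate itself. A small helper can pull the organization out of `openssl x509 -noout -issuer` output. This is a sketch: `parseIssuerOrg` is a hypothetical name, and the regex assumes OpenSSL's usual `issuer=C = US, O = ..., CN = ...` line format.

```typescript
// Hypothetical helper: extract the organization (O=) field from an
// `openssl x509 -noout -issuer` output line. Returns null if absent.
function parseIssuerOrg(issuerLine: string): string | null {
  const match = issuerLine.match(/O\s*=\s*([^,\n]+)/);
  return match ? match[1].trim() : null;
}

// Example: a Let's Encrypt R3 intermediate issuer line.
console.log(parseIssuerOrg("issuer=C = US, O = Let's Encrypt, CN = R3")); // → Let's Encrypt
```

Keeping the parsing in a pure function also means the string handling can be unit-tested without touching the network.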
| Layer | What to Test |
|-------|--------------|
| Network | Connectivity, DNS resolution, port availability |
| Services | Health endpoints, response times, resource usage |
| Security | Firewall rules, SSL certificates, auth flows |
| Configuration | Environment variables, file permissions, secrets |
| Integration | Service-to-service communication, API contracts |
| Deployment | Container health, process status, log output |
Six months from now, someone (maybe me) will ask: "How is this supposed to work?"
The tests answer:
```typescript
describe('Authentication Flow', () => {
  it('should redirect unauthenticated /ops requests to signin', async () => {
    const response = await fetch('https://griak.net/ops', { redirect: 'manual' });
    expect(response.status).toBe(307);
    expect(response.headers.get('location')).toContain('/auth/signin');
  });
});
```
This test documents the expected behavior. Reading it, I understand:

1. `/ops` routes require authentication
2. Unauthenticated requests redirect
3. The redirect goes to `/auth/signin`
4. HTTP 307 is used (preserves method)
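The decision the middleware makes can be sketched as a pure function, independent of the web framework. The names here (`protectRoute`, `protectedPrefixes`) are illustrative, not from the real code:

```typescript
// Hypothetical sketch of the route-protection decision behind the test above.
interface RouteDecision {
  status: number;    // 307 = temporary redirect that preserves the HTTP method
  location?: string; // where unauthenticated requests are sent
}

function protectRoute(path: string, authenticated: boolean): RouteDecision {
  const protectedPrefixes = ['/ops'];
  const needsAuth = protectedPrefixes.some(
    (prefix) => path === prefix || path.startsWith(prefix + '/')
  );
  if (needsAuth && !authenticated) {
    return { status: 307, location: '/auth/signin' };
  }
  return { status: 200 };
}

console.log(protectRoute('/ops', false));  // → { status: 307, location: '/auth/signin' }
console.log(protectRoute('/blog', false)); // → { status: 200 }
```

Factoring the rule out like this lets the redirect logic be unit-tested in isolation, while the integration test above verifies the deployed behavior end to end.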
The test is executable documentation.
For griak.net, I wrote 59 tests:
| Category | Tests | What They Verify |
|----------|-------|------------------|
| Tailscale | 4 | Connectivity, MagicDNS, Funnel |
| Auth | 5 | Configuration, providers |
| Middleware | 5 | Route protection, redirects |
| JMAP | 5 | Client can connect, fetch mail |
| Ops Dashboard | 6 | Page renders, widgets work |
| Webhooks | 5 | Endpoint exists, validates signatures |
| ISR | 3 | Revalidation works |
| Time-Machine | 4 | Page renders, shows timeline |
| Portfolio | 3 | Projects display |
| Blog | 3 | Posts display |
| Landing | 4 | Branding, navigation |
| Signin | 4 | Form exists, handles input |
| Credentials | 5 | Validation logic |
| **Total** | **59** | |
59 tests. All passing. All the time.
When I changed Docker networking, tests failed. I fixed it before deploying.
When I added Auth.js, tests guided me through configuration.
When I thought I was done, tests caught the edge cases I missed.
1. **Start with the test** - Every infrastructure task begins with "how will I know this works?"
2. **Fail first** - A test that never fails isn't testing anything.
3. **Test behavior, not implementation** - Check that the service responds, not that it uses a specific port internally.
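A tiny sketch of the distinction (`isHealthy` is a hypothetical helper): assert on the health payload the service promises, not on which port it happens to bind internally.

```typescript
// Behavior (test this): the service reports itself healthy via its public contract.
// Implementation detail (don't test): the internal port or process name.
function isHealthy(body: unknown): boolean {
  return (
    typeof body === 'object' &&
    body !== null &&
    (body as { status?: unknown }).status === 'ok'
  );
}

console.log(isHealthy({ status: 'ok', uptime: 12345 })); // → true
console.log(isHealthy({ port: 3000 }));                  // → false
```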
4. **Keep tests fast** - Slow tests don't get run. Optimize.
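One cheap speed win: wrap slow network checks in a timeout so a broken service fails fast instead of hanging the whole suite. This is a sketch; `withTimeout` is a hypothetical helper, not part of any test framework.

```typescript
// Hypothetical helper: reject a slow check instead of letting it hang the suite.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so a finished check doesn't leave one pending.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Usage: fail fast if a health check takes longer than two seconds.
// await withTimeout(fetch('https://griak.net/api/health'), 2000);
```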
5. **Trust the tests** - If tests pass, deploy with confidence. If tests fail, stop and fix.
---