Production CI/CD — GitHub Actions, testing strategies, deployment gates, rollbacks, feature flags, and release management.
A CI/CD pipeline isn't a YAML file — it's the immune system of your codebase. Every merge to main should be a non-event. If deploying makes you nervous, your pipeline is broken.
Core principles:
Structure your workflows as composable units. Don't copy-paste between repos.
.github/
├── workflows/
│   ├── ci.yml                  # Main CI pipeline
│   ├── deploy-staging.yml      # Staging deployment
│   ├── deploy-production.yml   # Production deployment
│   └── release.yml             # Release management
Create org-level reusable workflows in a .github repository:
# org/.github/.github/workflows/node-ci.yml
name: Node.js CI (Reusable)
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: '20'
      working-directory:
        type: string
        default: '.'
      run-e2e:
        type: boolean
        default: false
    secrets:
      NPM_TOKEN:
        required: false
      CODECOV_TOKEN:
        required: false
jobs:
  lint-and-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          cache: 'npm'
          cache-dependency-path: '${{ inputs.working-directory }}/package-lock.json'
      - name: Install dependencies
        working-directory: ${{ inputs.working-directory }}
        run: npm ci
      - name: Lint
        working-directory: ${{ inputs.working-directory }}
        run: npm run lint
      - name: Type check
        working-directory: ${{ inputs.working-directory }}
        run: npm run typecheck
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          cache: 'npm'
          cache-dependency-path: '${{ inputs.working-directory }}/package-lock.json'
      - run: npm ci
        working-directory: ${{ inputs.working-directory }}
      - name: Unit tests with coverage
        working-directory: ${{ inputs.working-directory }}
        run: npm run test:unit -- --coverage --reporter=junit --outputFile=junit.xml
      - name: Upload coverage
        if: inputs.working-directory == '.'
        uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          flags: unit
      - name: Upload test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: unit-test-results
          path: ${{ inputs.working-directory }}/junit.xml
  integration-tests:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          cache: 'npm'
          cache-dependency-path: '${{ inputs.working-directory }}/package-lock.json'
      - run: npm ci
        working-directory: ${{ inputs.working-directory }}
      - name: Run migrations
        working-directory: ${{ inputs.working-directory }}
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/testdb
        run: npm run db:migrate
      - name: Integration tests
        working-directory: ${{ inputs.working-directory }}
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/testdb
          REDIS_URL: redis://localhost:6379
          NODE_ENV: test
        run: npm run test:integration
  e2e-tests:
    if: inputs.run-e2e
    runs-on: ubuntu-latest
    # Run every step in the same directory as the other jobs
    defaults:
      run:
        working-directory: ${{ inputs.working-directory }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          cache: 'npm'
          cache-dependency-path: '${{ inputs.working-directory }}/package-lock.json'
      - run: npm ci
      - name: Install Playwright browsers
        run: npx playwright install --with-deps chromium
      - name: Build application
        run: npm run build
      - name: Run E2E tests
        run: npx playwright test
        env:
          CI: true
      - name: Upload Playwright report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-report
          path: ${{ inputs.working-directory }}/playwright-report/
          retention-days: 7
Consume it from any repo:
# your-repo/.github/workflows/ci.yml
name: CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  ci:
    uses: your-org/.github/.github/workflows/node-ci.yml@main
    with:
      node-version: '20'
      run-e2e: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
    secrets:
      NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
      CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
Use matrices for cross-version testing, but be smart about it:
jobs:
  test:
    runs-on: ${{ matrix.os }} # Must follow the matrix, or the macOS entry runs on Ubuntu
    strategy:
      fail-fast: false # Don't cancel other jobs if one fails
      matrix:
        node-version: [18, 20, 22]
        os: [ubuntu-latest]
        include:
          # Only test macOS on latest Node (saves minutes)
          - node-version: 22
            os: macos-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm ci
      - run: npm test
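To see which jobs a matrix like the one above produces, the expansion rules can be sketched in a few lines of TypeScript. This is a simplified model of the documented behaviour (cartesian product, then `exclude`, then `include`), not GitHub's implementation; `expandMatrix`, `MatrixSpec` and `matches` are hypothetical names introduced here.

```typescript
type Value = string | number;
type Combo = Record<string, Value>;

interface MatrixSpec {
  axes: Record<string, Value[]>; // e.g. node-version and os
  include?: Combo[];
  exclude?: Combo[];
}

// True when every key in `partial` has the same value in `combo`.
function matches(combo: Combo, partial: Combo): boolean {
  return Object.keys(partial).every((key) => combo[key] === partial[key]);
}

function expandMatrix(spec: MatrixSpec): Combo[] {
  const axisKeys = Object.keys(spec.axes);

  // Step 1: cartesian product of all axes.
  let combos: Combo[] = [{}];
  for (const key of axisKeys) {
    const next: Combo[] = [];
    for (const combo of combos) {
      for (const value of spec.axes[key]) {
        next.push({ ...combo, [key]: value });
      }
    }
    combos = next;
  }

  // Step 2: drop combinations matched by an exclude entry.
  const excludes = spec.exclude || [];
  combos = combos.filter((combo) => !excludes.some((ex) => matches(combo, ex)));

  // Step 3: merge each include into the combinations it fits without
  // overwriting an axis value; otherwise append it as an extra job.
  const extras: Combo[] = [];
  for (const inc of spec.include || []) {
    const axisPart: Combo = {};
    for (const key of axisKeys) {
      if (key in inc) axisPart[key] = inc[key];
    }
    const targets = combos.filter((combo) => matches(combo, axisPart));
    if (targets.length > 0) {
      for (const target of targets) Object.assign(target, inc);
    } else {
      extras.push({ ...inc });
    }
  }
  return combos.concat(extras);
}

const jobs = expandMatrix({
  axes: { 'node-version': [18, 20, 22], os: ['ubuntu-latest'] },
  include: [{ 'node-version': 22, os: 'macos-latest' }],
});
console.log(jobs.length); // three Ubuntu jobs plus one macOS job
```

Because the `os` axis only lists `ubuntu-latest`, an `exclude` entry for Node 18 on macOS would match nothing: that combination is never generated in the first place.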
- uses: actions/setup-node@v4
  with:
    node-version: '20'
    cache: 'npm'
# npm ci uses the cache automatically. Done.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ghcr.io/${{ github.repository }}:${{ github.sha }}
            ghcr.io/${{ github.repository }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
- name: Build with Turborepo
  run: npx turbo run build --filter=...[origin/main]
  env:
    TURBO_TOKEN: ${{ secrets.TURBO_TOKEN }}
    TURBO_TEAM: ${{ vars.TURBO_TEAM }}
        /   E2E   \          ← 5-10 critical user journeys. Main merges only.
       /-----------\
      / Integration \        ← API contracts, DB queries. All PRs.
     /---------------\
    /   Unit Tests    \      ← Pure logic, fast. Every push.
   /-------------------\
on: push
jobs:
  unit:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm run test:unit -- --bail
  e2e:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      - run: npm run build
      - run: npx playwright test --shard=${{ matrix.shard }}/4
# .github/workflows/deploy-production.yml
name: Deploy to Production
on:
  push:
    branches: [main]
concurrency:
  group: production-deploy
  cancel-in-progress: false # Never cancel a running production deploy
jobs:
  test:
    # ci.yml must declare on.workflow_call with a run-e2e input to be callable here
    uses: ./.github/workflows/ci.yml
    with:
      run-e2e: true
  build:
    needs: test
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/${{ github.repository }}
          # format=long tags the image with the full SHA, matching github.sha in the deploy jobs
          tags: type=sha,prefix=,format=long
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: |
          kubectl set image deployment/app \
            app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            --namespace=staging
          kubectl rollout status deployment/app --namespace=staging --timeout=300s
      - name: Smoke tests
        run: |
          sleep 10
          curl -sf https://staging.example.com/healthz || exit 1
          npm run test:smoke -- --base-url=https://staging.example.com
  approve-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production # Requires manual approval in GitHub settings
    steps:
      - run: echo "Production deployment approved"
  deploy-canary:
    needs: approve-production
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy canary (10% traffic)
        run: |
          kubectl set image deployment/app-canary \
            app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            --namespace=production
          kubectl rollout status deployment/app-canary --namespace=production --timeout=300s
      - name: Monitor canary (5 minutes)
        run: |
          # sum() aggregates across status codes; dividing the raw per-series rates
          # would only pair each 5xx series with itself and always yield 1.
          for i in $(seq 1 30); do
            ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
              --data-urlencode "query=sum(rate(http_requests_total{status=~\"5..\",deployment=\"canary\"}[1m])) / sum(rate(http_requests_total{deployment=\"canary\"}[1m]))" \
              | jq -r '.data.result[0].value[1] // "0"')
            if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
              echo "Canary error rate ${ERROR_RATE} exceeds 5% threshold"
              kubectl rollout undo deployment/app-canary --namespace=production
              exit 1
            fi
            echo "Canary healthy (error rate: ${ERROR_RATE})"
            sleep 10
          done
  deploy-production:
    needs: deploy-canary
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Full rollout
        run: |
          kubectl set image deployment/app \
            app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            --namespace=production
          kubectl rollout status deployment/app --namespace=production --timeout=600s
      - name: Post-deploy smoke tests
        run: |
          sleep 15
          npm run test:smoke -- --base-url=https://app.example.com
      - name: Auto-rollback on failure
        if: failure()
        run: |
          kubectl rollout undo deployment/app --namespace=production
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d '{"text":"Production deploy failed: auto-rolled back"}'
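The canary gate in the workflow above comes down to two small decisions: how to read an error rate out of a Prometheus instant-query response, and when to abort. A minimal TypeScript sketch of that logic follows, assuming the response shape the `jq` filter relies on; `parseErrorRate`, `evaluateCanary` and the type names are hypothetical.

```typescript
// Only the fields the gate reads from a Prometheus instant-query response.
interface PromResponse {
  data: { result: Array<{ value: [number, string] }> };
}

// Mirrors the jq filter: first series' sample value, or 0 when there is no result.
function parseErrorRate(body: PromResponse): number {
  const first = body.data.result[0];
  return first ? Number(first.value[1]) : 0;
}

interface CanaryVerdict {
  healthy: boolean;
  failedAtSample?: number; // zero-based index of the first breaching sample
}

// Fails on the first sample above the threshold, like the `exit 1` in the loop.
// Assumption: a NaN sample (no traffic, 0/0) never compares greater than the
// threshold, so it counts as healthy here; a stricter gate might fail instead.
function evaluateCanary(errorRates: number[], threshold = 0.05): CanaryVerdict {
  for (let i = 0; i < errorRates.length; i++) {
    if (errorRates[i] > threshold) {
      return { healthy: false, failedAtSample: i };
    }
  }
  return { healthy: true };
}

const sample: PromResponse = { data: { result: [{ value: [1700000000, '0.08'] }] } };
console.log(evaluateCanary([0.01, parseErrorRate(sample), 0]));
```

Keeping the decision in a pure function like this makes the threshold logic unit-testable, which is hard to do with the inline shell loop.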
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0 # Zero-downtime
  # selector, pod labels, and the container image are omitted here for brevity
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
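Manual rollback:
# Roll back to the previous revision
kubectl rollout undo deployment/app --namespace=production
# Roll back to a specific revision (list revisions with: kubectl rollout history deployment/app --namespace=production)
kubectl rollout undo deployment/app --to-revision=3 --namespace=production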
Rule: Every migration must be reversible.
// migrations/20240301_add_user_email_verified.ts
import { Knex } from 'knex';

export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable('users', (table) => {
    table.boolean('email_verified').nullable().defaultTo(null);
  });
  await knex.raw(`
    UPDATE users SET email_verified = true WHERE confirmed_at IS NOT NULL
  `);
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable('users', (table) => {
    table.dropColumn('email_verified');
  });
}
Expand-contract pattern for breaking schema changes:
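As an illustration of expand-contract, here is a hedged sketch that plans a column rename as three separately deployable phases. The `planColumnRename` helper, the `MigrationPhase` type and the SQL strings are hypothetical and only show the ordering; real migrations would go through your migration tool.

```typescript
interface MigrationPhase {
  name: 'expand' | 'migrate' | 'contract';
  description: string;
  sql: string[];
  destructive: boolean; // true when rolling back cannot restore the data
}

// Plans a rename as add-new, backfill, drop-old, so old and new code can
// run side by side between deploys.
function planColumnRename(
  table: string,
  oldColumn: string,
  newColumn: string,
  sqlType: string,
): MigrationPhase[] {
  return [
    {
      name: 'expand',
      description: 'Add the new column next to the old one; deploy code that writes both.',
      sql: [`ALTER TABLE ${table} ADD COLUMN ${newColumn} ${sqlType} NULL`],
      destructive: false,
    },
    {
      name: 'migrate',
      description: 'Backfill existing rows, then switch reads to the new column.',
      sql: [`UPDATE ${table} SET ${newColumn} = ${oldColumn} WHERE ${newColumn} IS NULL`],
      destructive: false,
    },
    {
      name: 'contract',
      description: 'Once no deployed code touches the old column, drop it.',
      sql: [`ALTER TABLE ${table} DROP COLUMN ${oldColumn}`],
      destructive: true,
    },
  ];
}

const plan = planColumnRename('users', 'name', 'full_name', 'TEXT');
for (const phase of plan) {
  console.log(`${phase.name}: ${phase.sql.join('; ')}`);
}
```

Only the final phase is destructive, so it is the one worth delaying until the new code path has run in production for a while.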
type FeatureFlag = {
  enabled: boolean;
  rolloutPercentage?: number;
  allowList?: string[];
};

const FLAGS: Record<string, FeatureFlag> = {
  'new-checkout-flow': {
    enabled: true,
    rolloutPercentage: 25,
  },
  'admin-analytics-v2': {
    enabled: true,
    allowList: ['user_123', 'user_456'],
  },
  'dark-mode': {
    enabled: process.env.ENABLE_DARK_MODE === 'true',
  },
};

export function isFeatureEnabled(flag: string, userId?: string): boolean {
  const f = FLAGS[flag];
  if (!f || !f.enabled) return false;

  // Allow-listed flags are only on for listed users; anonymous callers never match.
  if (f.allowList) {
    return userId !== undefined && f.allowList.includes(userId);
  }

  // Percentage rollouts need a user ID to bucket on; without one the flag stays off.
  if (f.rolloutPercentage !== undefined) {
    if (!userId) return false;
    const hash = simpleHash(userId + flag);
    return (hash % 100) < f.rolloutPercentage;
  }

  return f.enabled;
}

// Deterministic 32-bit string hash, so a user always lands in the same bucket.
function simpleHash(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i);
    hash |= 0; // Keep the value in 32-bit integer range
  }
  return Math.abs(hash);
}
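Two properties make hash-based percentage rollouts safe to ramp: a user always lands in the same bucket, and raising the percentage only adds users. The sketch below checks both using a copy of the hash above; `bucketFor` and `inRollout` are hypothetical helpers, and the printed share is whatever this particular hash yields, not a guaranteed value.

```typescript
// Same 32-bit string hash as in the flag module above.
function simpleHash(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i);
    hash |= 0;
  }
  return Math.abs(hash);
}

// Bucket in [0, 99]; salting with the flag name gives each flag its own split.
function bucketFor(userId: string, flag: string): number {
  return simpleHash(userId + flag) % 100;
}

function inRollout(userId: string, flag: string, percentage: number): boolean {
  return bucketFor(userId, flag) < percentage;
}

const flag = 'new-checkout-flow';
let enabledAt25 = 0;
let lostWhenRamping = 0;
for (let i = 0; i < 1000; i++) {
  const id = `user_${i}`;
  const at25 = inRollout(id, flag, 25);
  const at50 = inRollout(id, flag, 50);
  if (at25) enabledAt25++;
  // Anyone enabled at 25% must still be enabled at 50%.
  if (at25 && !at50) lostWhenRamping++;
}
console.log(`enabled at 25%: ${enabledAt25} of 1000, lost when ramping: ${lostWhenRamping}`);
```

If the measured share drifts far from the target for your ID format, that is a sign to swap in a stronger hash rather than to adjust the percentage.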
import * as LaunchDarkly from '@launchdarkly/node-server-sdk';

const client = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY!);
await client.waitForInitialization({ timeout: 5 });

async function handler(req: Request) {
  const user = {
    key: req.userId,
    email: req.userEmail,
    custom: { plan: req.userPlan, company: req.companyId },
  };
  const showNewCheckout = await client.variation('new-checkout-flow', user, false);
  return showNewCheckout ? renderNewCheckout() : renderOldCheckout();
}
npm install -D @changesets/cli
npx changeset init
# .github/workflows/release.yml
name: Release
on:
  push:
    branches: [main]
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - name: Create Release PR or Publish
        uses: changesets/action@v1
        with:
          publish: npx changeset publish
          version: npx changeset version
          commit: 'chore: version packages'
          title: 'chore: version packages'
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
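When several changesets target the same package, the version step applies the highest requested bump once rather than stacking them. A simplified sketch of that arithmetic, ignoring pre-releases and dependency-driven bumps, could look like this; `highestBump` and `nextVersion` are hypothetical names, not part of the Changesets API.

```typescript
type Bump = 'patch' | 'minor' | 'major';

const RANK: Record<Bump, number> = { patch: 0, minor: 1, major: 2 };

// Picks the most significant bump among the pending changesets.
function highestBump(bumps: Bump[]): Bump | undefined {
  let best: Bump | undefined;
  for (const bump of bumps) {
    if (best === undefined || RANK[bump] > RANK[best]) best = bump;
  }
  return best;
}

// Applies a single bump to a plain MAJOR.MINOR.PATCH version string.
function nextVersion(current: string, bumps: Bump[]): string {
  const parts = current.split('.').map(Number);
  if (parts.length !== 3 || parts.some((n) => isNaN(n) || n < 0)) {
    throw new Error(`Not a plain semver version: ${current}`);
  }
  const [major, minor, patch] = parts;
  switch (highestBump(bumps)) {
    case 'major':
      return `${major + 1}.0.0`;
    case 'minor':
      return `${major}.${minor + 1}.0`;
    case 'patch':
      return `${major}.${minor}.${patch + 1}`;
    default:
      return current; // no changesets, nothing to release
  }
}

console.log(nextVersion('1.2.3', ['patch', 'minor'])); // 1.3.0
```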
name: CI
on:
  pull_request:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - name: Build affected
        run: npx turbo run build test lint --filter=...[origin/main]
        env:
          TURBO_TOKEN: ${{ secrets.TURBO_TOKEN }}
          TURBO_TEAM: ${{ vars.TURBO_TEAM }}
- name: Derive SHAs
  uses: nrwl/nx-set-shas@v4
- name: Run affected
  run: npx nx affected -t lint test build --parallel=3
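Both the Turborepo filter and `nx affected` rest on the same idea: start from the packages that changed and walk the dependency graph to everything that depends on them. A minimal sketch of that traversal follows; the `affectedPackages` function and the example graph are hypothetical, and the real tools also account for things like lockfile and global config changes.

```typescript
// Maps each package to the workspace packages it depends on.
type DependencyGraph = Record<string, string[]>;

function affectedPackages(graph: DependencyGraph, changed: string[]): string[] {
  // Invert the graph: dependency -> packages that depend on it.
  const dependents: Record<string, string[]> = {};
  for (const pkg of Object.keys(graph)) {
    for (const dep of graph[pkg]) {
      (dependents[dep] = dependents[dep] || []).push(pkg);
    }
  }

  // Breadth-first walk from the changed packages through their dependents.
  const seen: Record<string, boolean> = {};
  const queue = changed.slice();
  while (queue.length > 0) {
    const pkg = queue.shift() as string;
    if (seen[pkg]) continue;
    seen[pkg] = true;
    for (const dependent of dependents[pkg] || []) queue.push(dependent);
  }
  return Object.keys(seen).sort();
}

const graph: DependencyGraph = {
  ui: [],
  utils: [],
  web: ['ui', 'utils'],
  api: ['utils'],
  docs: ['ui'],
};

console.log(affectedPackages(graph, ['utils'])); // [ 'api', 'utils', 'web' ]
```

In this example a change to `utils` rebuilds `api` and `web` but leaves `ui` and `docs` alone, which is where the CI time savings come from.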
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: us-east-1
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: 'projects/123/locations/global/workloadIdentityPools/github/providers/github'
          service_account: 'deploy@project.iam.gserviceaccount.com'
AWS IAM trust policy for GitHub OIDC:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
      }
    }
  }]
}
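The `sub` condition is what scopes the role to one repository and branch, so it helps to see how wildcard patterns behave before loosening it. The sketch below approximates IAM's `StringLike` operator (case-sensitive, `*` for any run of characters, `?` for a single character); `stringLike` is a hypothetical helper for reasoning about patterns, not an AWS API.

```typescript
// Approximates IAM StringLike: `*` matches any run of characters, `?` exactly one.
function stringLike(pattern: string, value: string): boolean {
  // Escape regex metacharacters except the two wildcards.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const source = escaped.replace(/\*/g, '.*').replace(/\?/g, '.');
  return new RegExp('^' + source + '$').test(value);
}

const mainOnly = 'repo:your-org/your-repo:ref:refs/heads/main';
const anyRefInRepo = 'repo:your-org/your-repo:*';
const anyRepoInOrg = 'repo:your-org/*';

const featureBranch = 'repo:your-org/your-repo:ref:refs/heads/feature';
const otherRepo = 'repo:your-org/other-repo:ref:refs/heads/main';

console.log(stringLike(mainOnly, featureBranch)); // false
console.log(stringLike(anyRefInRepo, featureBranch)); // true
console.log(stringLike(anyRepoInOrg, otherRepo)); // true
```

The last line is the one to watch: an org-wide wildcard lets a workflow in any repository of the organization assume the deploy role.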
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
on:
  push:
    paths-ignore: ['**.md', 'docs/**', '.vscode/**']
- uses: actions/cache@v4
  id: pw-cache
  with:
    path: ~/.cache/ms-playwright
    key: playwright-${{ hashFiles('package-lock.json') }}
- if: steps.pw-cache.outputs.cache-hit != 'true'
  run: npx playwright install --with-deps chromium
Use npm ci not npm install — faster and deterministic.
Set timeouts on every job — a hung test can burn your monthly minutes.
Anti-pattern: running npm install instead of npm ci in CI, which can change the lockfile and makes installs non-deterministic.