Art of SOPs

Payal Sinha

04 Oct 2023 • 5 min read

Standard Operating Procedures for technical operations often fall into two traps: either they're so vague they're useless ("restart the service if it fails"), or so rigid they become obsolete the moment your infrastructure changes. The best technical SOPs strike a balance—specific enough to be actionable, flexible enough to survive your next migration.

Start With the Incident Post-Mortem

Before writing a single procedure, look at your last three outages. What went wrong? More importantly, what took so long to fix? The answer usually isn't "we didn't know how to solve it." It's "we couldn't find the credentials," "we weren't sure which database was primary," or "we spent 20 minutes arguing about whether to fail over."

Your technical SOP isn't documentation for documentation's sake. It's armor against 3 AM panic.

Talk to your database administrators, your on-call engineers, your newest team member who just spent two hours figuring out how to provision a test environment. Their pain points are your roadmap.

Write for the Engineer at 2 AM

Technical SOPs have a specific user: someone under pressure who needs to act quickly and correctly. This changes everything about how you write:

Put the critical path first. "How to restore from backup" comes before "Backup architecture overview."
Use exact commands. Not "check the replication status": give them SHOW REPLICA STATUS\G (SHOW SLAVE STATUS\G on MySQL releases before 8.0.22) with the specific output fields they need to examine.
Include failure modes. "If you see Error 1205, this means a lock wait timeout was exceeded. Check for long-running transactions using this query..."

Write assuming the reader is competent but stressed. They know SQL, but they shouldn't have to figure out which database server is which from first principles.
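To make "the specific output fields they need to examine" concrete, here is a minimal sketch that parses the vertical `\G` output and flags the fields an on-call engineer would usually look at first. It assumes the MySQL 8.0.22+ field names (Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source); older releases use the Slave_* and Seconds_Behind_Master equivalents. The function names and the choice of fields are illustrative, not a standard.

```python
def parse_vertical_output(text: str) -> dict[str, str]:
    """Parse the 'Field: value' lines of mysql's vertical output into a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        # Skip the '*** 1. row ***' separator lines.
        if line.lstrip().startswith("*"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def replication_problems(fields: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    # Both replication threads should report 'Yes' (assumed 8.0.22+ field names).
    for thread in ("Replica_IO_Running", "Replica_SQL_Running"):
        state = fields.get(thread, "missing")
        if state != "Yes":
            problems.append(f"{thread} is {state}, expected Yes")
    # The lag field is NULL when replication is not running.
    lag = fields.get("Seconds_Behind_Source", "NULL")
    if not lag.isdigit():
        problems.append(f"Seconds_Behind_Source is {lag}, lag is unknown")
    return problems
```

In an SOP, the same information can simply be a short list ("check these three fields, expect these values"); the script form is useful when you want the check to be repeatable.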

The Anatomy of a Good Technical SOP

Every technical procedure should have:

Prerequisites: What access, credentials, or conditions are needed? "This requires SSH access to prod-db-01 and the vault token from the on-call rotation doc."

Context: One paragraph maximum. Why does this procedure exist? What problem does it solve? This helps engineers know when to use it—and when not to.

The Steps: Numbered, specific, testable. Each step should have a clear success indicator. Not "migrate the data," but "Run migration script, verify row count matches source (should be ~2.3M rows), check application logs for errors."

Rollback: For any destructive operation, include the undo path. Before someone runs that ALTER TABLE on production, they should know exactly how to reverse it.

Edge Cases: What does "normal" look like, and when should they escalate? "Replication lag under 5 seconds is expected during business hours. Over 30 seconds requires investigation."
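The "Steps" and "Rollback" parts of this anatomy can be sketched as a small data structure: each step carries its own success indicator and its own undo path. This is one possible shape, not a prescribed tool; the `Step` and `run_procedure` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]    # what the engineer runs
    verify: Callable[[], bool]    # the step's success indicator
    rollback: Callable[[], None]  # the documented undo path


def run_procedure(steps: List[Step]) -> bool:
    """Run steps in order; on a failed check, undo everything done so far."""
    completed: List[Step] = []
    for step in steps:
        print(f"Running: {step.name}")
        step.action()
        completed.append(step)
        if not step.verify():
            print(f"Verification failed: {step.name}; rolling back")
            # Undo in reverse order, including the step that just failed.
            for done in reversed(completed):
                done.rollback()
            return False
    return True
```

Even if you never run a procedure this way, writing a step in this form is a quick test of the SOP: if you cannot fill in `verify` or `rollback`, the written step is probably missing them too.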

Build It in the Trenches

The worst technical SOPs are written by people who haven't done the operation in years. Here's a better approach:

Pair with someone doing the task. Actually watch them. Not just the happy path—watch them hit errors, check three different wikis for the right connection string, sudo to the wrong account and have to start over.

Draft the SOP as they work, then have someone else follow it while the subject matter expert watches. The gaps become obvious immediately.

Handle the Abstraction Problem

Technical environments change constantly. New databases get provisioned, connection strings change, servers get renamed. Your SOP can't reference prod-db-01 if it'll be db-prod-postgres-primary-01 next month.

Solutions:

Reference central sources of truth: "See the database inventory (in Confluence, the runbook, or wherever yours lives) for current hostnames"
Use patterns, not specifics: "Connect to the primary database (identified by read_only=OFF in the config)"
Automate the lookup: Provide a script that finds the right server rather than hardcoding hostnames

The goal: someone should be able to follow your SOP six months from now without you having to update it constantly.
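As a rough illustration of "automate the lookup", the sketch below picks the primary by role rather than by name. It assumes you can supply a list of candidate hosts and a function that reports whether a host is read-only (for MySQL, that function might query `SELECT @@global.read_only`); both `find_primary` and `is_read_only` are hypothetical names.

```python
from typing import Callable, Iterable


def find_primary(hosts: Iterable[str], is_read_only: Callable[[str], bool]) -> str:
    """Return the single writable host, identified by role rather than hostname."""
    writable = [host for host in hosts if not is_read_only(host)]
    # Zero or several writable hosts is itself an incident; stop instead of guessing.
    if len(writable) != 1:
        raise RuntimeError(f"expected exactly one writable host, found {writable}")
    return writable[0]
```

Refusing to guess when the answer is ambiguous matters here: an SOP that silently picks the wrong server is worse than one that tells the engineer to escalate.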

Navigate the Politics Without Mentioning Them

You'll face resistance. Some engineers pride themselves on "just figuring it out" and see SOPs as training wheels. Others worry that documented procedures mean anyone can do their job.

Address the first group by showing efficiency gains. "Yes, you can figure out the backup restoration process from scratch. But do you want to spend 45 minutes on it at 3 AM? Here are 8 steps that take 10 minutes."

Address the second by involving them in creation. "I need your expertise to document this properly—you're the only one who really understands how the replication topology works." People rarely resist processes they helped design.

When leadership asks for more review steps or approval gates, push back with data. "Adding an approval step here adds 2 hours to our recovery time. Is that acceptable for a Severity 1 incident?" Usually the answer is no.

Test Your SOPs in Safe Environments

The worst time to discover that your disaster recovery SOP doesn't work is during an actual disaster. Schedule regular chaos engineering sessions or game days in which someone follows the documented procedure end to end:

Restore a database from backup in your staging environment
Fail over your Redis cluster
Revoke someone's access and have them regain it following the documented process

If the SOP fails in practice, fix it immediately. Document what went wrong in the SOP itself.

Version Control and Change Management

Technical SOPs should live where your code lives—in git, with pull requests and review. This gives you:

Change history (when did we add that step about checking replication lag?)
Review process (does this change match our actual infrastructure?)
Discoverability (engineers already know where to look)

When you update infrastructure, update the SOPs in the same pull request. Make it part of the definition of done.
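One way to enforce "same pull request" is a small CI check over the list of changed files (for example, the output of `git diff --name-only` against the target branch). The directory prefixes below are placeholders for wherever your infrastructure code and SOPs actually live.

```python
from typing import Iterable

# Hypothetical repository layout; adjust to your own paths.
INFRA_PREFIXES = ("infra/", "terraform/")
SOP_PREFIX = "docs/sops/"


def sop_update_missing(changed_files: Iterable[str]) -> bool:
    """True when a change touches infrastructure but no SOP file."""
    files = list(changed_files)
    touches_infra = any(path.startswith(INFRA_PREFIXES) for path in files)
    touches_sops = any(path.startswith(SOP_PREFIX) for path in files)
    return touches_infra and not touches_sops
```

A check like this is deliberately blunt. It cannot tell whether the SOP change is correct, only that someone looked at the docs; a warning or a required checkbox may fit better than a hard failure.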

Build Feedback Loops Into Operations

Every time someone follows an SOP, capture what happened:

Did it work? If not, why not?
What was unclear? Even if they figured it out, ambiguity is a bug.
What was missing? Did they have to check Slack history or ask someone?

Add a simple feedback mechanism: a comment in the doc, a Slack thread, a note in your incident management tool. Review these monthly and update accordingly.
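If the feedback ends up somewhere structured, the monthly review can start from a simple tally. The sketch below assumes one record per SOP run with the three questions above as fields; the `Feedback` record and its field names are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Feedback:
    sop: str            # which procedure was followed
    worked: bool        # did it work?
    unclear: str = ""   # what was unclear, if anything
    missing: str = ""   # what was missing, if anything


def sops_needing_attention(entries: Iterable[Feedback]) -> List[Tuple[str, int]]:
    """Count runs with any reported issue per SOP, most-flagged first."""
    flagged = Counter(
        entry.sop
        for entry in entries
        if not entry.worked or entry.unclear or entry.missing
    )
    return flagged.most_common()
```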

Automate What You Can, Document What You Can't

The best SOP is the one you don't need because the operation is automated. But you can't automate everything—especially edge cases, disaster recovery, or operations you do twice a year.

For common operations, provide both:

The manual steps (for understanding and troubleshooting)
The automation script (for execution)

When the script fails, engineers can fall back to manual execution. When manual execution is confusing, the script serves as executable documentation.
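A script can double as documentation if it is able to print exactly what it would run. The following is a minimal sketch of that idea with a dry-run mode that defaults to on; `run_runbook` is a hypothetical helper, and the commands passed to it would come from your own procedure.

```python
import shlex
import subprocess
from typing import List, Sequence


def run_runbook(steps: Sequence[Sequence[str]], dry_run: bool = True) -> List[str]:
    """Print each command as a numbered manual step; execute it unless dry_run."""
    rendered: List[str] = []
    for number, argv in enumerate(steps, start=1):
        # shlex.join produces a copy-pasteable shell command line.
        line = f"{number}. {shlex.join(argv)}"
        rendered.append(line)
        print(line)
        if not dry_run:
            # check=True stops the runbook at the first failing command.
            subprocess.run(list(argv), check=True)
    return rendered
```

In dry-run mode the output is the manual procedure, so the two cannot drift apart: the numbered list an engineer reads is generated from the same commands the script executes.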

Know Your Audience

Junior engineers need more context and explicit commands. Senior engineers need architecture diagrams and decision trees. Your on-call rotation includes both.

Solve this with layers:

Quick reference: The command sequence for someone who's done this before
Detailed walkthrough: Expanded explanation for someone learning
Troubleshooting: What to do when it doesn't work as expected

Let people enter at their competency level.

The Real Measure of Success

You've written good technical SOPs when:

Your on-call team resolves incidents faster
Your newest team member can complete complex operations safely
You go on vacation without fielding "how do I..." messages
Your team suggests improvements rather than ignoring the docs

Technical SOPs aren't about controlling engineers or creating bureaucracy. They're about encoding collective knowledge so it survives transitions, scales across teams, and turns potential disasters into routine operations.

When someone resolves a production incident at 2 AM using your SOP and goes back to sleep instead of staying up worried they missed something—that's when you know you got it right.