Ten central tenets of system administration

(Plus a few more)

Ade Rixon

Thurs 24th Sept 1998 (from 29 Nov 1996)

I don't claim these bear an exact relation to how I work, and I've probably got sloppier in the intervening years. Learn from my mistakes. In no particular order:

Be rigorous and do it right first time. Never be tempted to think, "That works, I'll clear up the mess later" - you won't get the chance. An extra hour spent doing it properly now will save confusion and distress later.
RTFM. 90% of the time, the information you want is in the manual (although not necessarily obvious, understandable or correct). Remember the old Unix slogan: "the source is the documentation". Make sure you use all available documentation - books, websites, newsgroups, local info, etc.
Be efficient with resources where you can. That means compiling with SPARCworks C with optimisation on, stripping binaries, avoiding obvious memory, disk or network drains where possible, etc. However, remember that the really big performance gains only come from major reconfiguration of the application and/or hardware.
Don't believe a user's problem until you see it yourself. Question any reports carefully to ensure that what you think they're doing matches what they think they're doing - from the fundamentals upwards.
Trust the users if they report a problem. There may well be something wrong even if you haven't come across it yourself yet.
Don't let error reports go by without checking them out. Often "temporary glitches" are nothing of the sort. But sometimes they are: don't spend too long investigating either. Anything that occurs more than once in a short space of time should be checked out.
Resist anything that requires future maintenance. If it needs updating then you'll probably forget it at the critical moment. Obviously you can't avoid this much of the time, but keep your maintenance requirements to the minimum to avoid becoming swamped in fire-fighting.
Automate maintenance where possible. That way the updates will be consistent, simple and regular. But remember to update your scripts if things change, otherwise the problems will be multiplied in one step.
Document everything. If other staff can't figure out what you've done then they'll probably break it or abandon it. Worse, they may not even be aware of it when disaster strikes. Often, you need the documentation to refresh your own memory. Be especially careful to note down any dependencies or gotchas.
Plan ahead. If you implement something immediately, it may become a minor irritation in a month's time and a major problem in a year, by which time it will be difficult to repair the damage. Take into account other activities and applications.
Complete trashing and reinstallation is always the last resort. It's particularly regrettable if you're not sure it will fix the problem.
Don't forget the backups. Don't ever forget the backups. The most depressing phrase you can hear in this line of work is, "What backups?" NB: if you're restoring to fix a problem, make sure the problem hasn't migrated to the dumps as well.
New applications: make them obey the FIFO rule - Fit In or Fuck Off. You don't want to end up maintaining separate, autonomous systems if possible. (FIFO sometimes goes for people too.)
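
The rule of thumb about error reports ("anything that occurs more than once in a short space of time should be checked out") is easy to automate. What follows is only a rough sketch in Python, not anything from my own toolkit: the function name `repeated_errors`, the ten-minute default window and the sample messages are all made up for illustration, and it assumes you have already parsed your logs into (timestamp, message) pairs.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def repeated_errors(events, window=timedelta(minutes=10)):
    """Return the messages that occur more than once within `window`.

    `events` is an iterable of (timestamp, message) pairs.
    The name and the default window are illustrative assumptions.
    """
    seen = defaultdict(list)
    for when, message in events:
        seen[message].append(when)

    flagged = set()
    for message, times in seen.items():
        times.sort()
        # Two consecutive occurrences no further apart than the window
        # count as a repeat worth checking out.
        if any(later - earlier <= window
               for earlier, later in zip(times, times[1:])):
            flagged.add(message)
    return sorted(flagged)


# Hypothetical sample data: one message repeats within four minutes.
sample = [
    (datetime(1998, 9, 24, 9, 0), "nfs server not responding"),
    (datetime(1998, 9, 24, 9, 4), "nfs server not responding"),
    (datetime(1998, 9, 24, 9, 30), "disk quota exceeded"),
]
suspects = repeated_errors(sample)
```

With the sample above, `suspects` holds only the NFS message; the one-off quota warning is left alone, which matches the advice not to spend too long on genuine glitches. If you do automate something like this, remember the other rule: update the script when the log format changes.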
