Ten central tenets of system administration
(Plus a few more)
Ade Rixon
Thurs 24th Sept 1998 (from 29 Nov 1996)
I don't claim these bear an exact relation to how I work, and I've
probably got sloppier in the intervening years. Learn from my mistakes.
In no particular order:
- Be rigorous and do it right first time. Never be tempted to think,
"That works, I'll clear up the mess later" - you won't get the chance. An
extra hour spent doing it properly now will save confusion and distress
later.
- RTFM. 90% of the time, the information you want is in the manual
(although not necessarily obvious, understandable or correct). Remember the
old Unix slogan: "the source is the documentation". Make sure you
use all available documentation - books, websites, newsgroups,
local info, etc.
- Be efficient with resources where you can. That means compiling with
SPARCworks C and optimisation on, stripping binaries, avoiding obvious
memory, disk or network drains where possible, etc. However, remember that
the really big performance gains only come from major reconfiguration of
the application and/or hardware.
- Don't believe a user's problem until you see it yourself. Question any
reports carefully to ensure that what you think they're doing matches what
they think they're doing - from the fundamentals upwards.
- Trust the users if they report a problem. There may well be something
wrong even if you haven't come across it yourself yet.
- Don't let error reports go by without checking them out. Often
"temporary glitches" are nothing of the sort. But sometimes they are: don't
spend too long investigating either. Anything that occurs more than once
in a short space of time should be checked out.
- Resist anything that requires future maintenance. If it needs
updating then you'll probably forget it at the critical moment. Obviously you
can't avoid this much of the time, but keep your maintenance requirements
to the minimum to avoid becoming swamped in fire-fighting.
- Automate maintenance where possible. That way the updates will be
consistent, simple and regular. But remember to update your scripts if
things change, otherwise the problems will be multiplied in one step.
- Document everything. If other staff can't figure out what
you've done then they'll probably break it or abandon it. Worse, they may
not even be aware of it when disaster strikes. Often, you need the
documentation to refresh your own memory. Be especially careful to note
down any dependencies or gotchas.
- Plan ahead. If you implement something immediately, it may become a minor
irritation in a month's time and a major problem in a year, by which
time it will be difficult to repair the damage. Take into account other
activities and applications.
- Complete trashing and reinstallation is always the last
resort. It's particularly regrettable if you're not sure it will fix the
problem.
- Don't forget the backups. Don't ever forget the backups. The
most depressing phrase you can hear in this line of work is, "What backups?"
NB. If you're restoring to fix a problem, make sure the problem hasn't
migrated to the dumps as well.
- New applications: make them obey the FIFO rule - Fit In or Fuck Off.
You don't want to end up maintaining separate, autonomous systems if
possible. (FIFO sometimes goes for people too.)
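
As a rough illustration of the "more than once in a short space of time"
rule for error reports, here is a minimal Python sketch. The function name
recurring_errors, the ten-minute window and the threshold of two are all
assumptions made for the example, not anything prescribed above; tune them
to your own site.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recurring_errors(events, window=timedelta(minutes=10), threshold=2):
    """Return the messages seen at least `threshold` times within `window`.

    `events` is an iterable of (timestamp, message) pairs, e.g. parsed
    from a log file. Window and threshold are illustrative defaults.
    """
    by_message = defaultdict(list)
    for when, message in events:
        by_message[message].append(when)

    flagged = set()
    for message, times in by_message.items():
        times.sort()
        # Check each run of `threshold` consecutive occurrences.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                flagged.add(message)
                break
    return flagged
```

A one-off glitch is left alone; anything that repeats inside the window is
returned for a human to look at.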