As you can probably guess, one day the hosts file on the central server got corrupted, and was diligently copied to all of the other systems. With corrupt hosts files, the systems could no longer communicate on the network, and could not even receive a valid hosts file again. It was a number of days before order was restored.
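A failure like this can often be caught with a basic sanity check run before the file is copied anywhere. The following is only a minimal sketch of such a check, not a description of how the site in question worked: the function names `is_valid_hosts_line` and `hosts_file_looks_sane`, and the rule that a `localhost` entry must be present, are assumptions made for illustration.

```python
import ipaddress


def is_valid_hosts_line(line: str) -> bool:
    """Accept blank lines, comments, and lines of the form 'IP name [alias...]'."""
    content = line.split("#", 1)[0].strip()  # drop any trailing comment
    if not content:
        return True  # blank line or comment-only line
    fields = content.split()
    if len(fields) < 2:
        return False  # an address with no host name, or stray garbage
    try:
        ipaddress.ip_address(fields[0])  # raises ValueError if not an IP address
    except ValueError:
        return False
    return True


def hosts_file_looks_sane(text: str, required_names=("localhost",)) -> bool:
    """Return True only if every line parses and all required names are present."""
    lines = text.splitlines()
    if not all(is_valid_hosts_line(line) for line in lines):
        return False
    names = set()
    for line in lines:
        fields = line.split("#", 1)[0].split()
        names.update(fields[1:])  # everything after the address is a name or alias
    return all(name in names for name in required_names)
```

A distribution job could call `hosts_file_looks_sane()` on the master copy and refuse to push it out when the check fails, leaving the last known good file in place on the other systems.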
Personally, I wish I had a dollar for each time a systems administrator has accidentally deleted the entire contents of a file server or database, and then realised that there was no backup, or that recovering from the backups would take an immense amount of time.
System administrators make mistakes, and they make them for a number of reasons:
Ignorance – Ignorance is the lack of education. It is very common in the IT industry for people to try to “fix” critical systems of which they have insufficient knowledge. This can be very dangerous, and managers should be aware of the importance of educating staff to perform the tasks required of them. Ignorance can often lead to stress.
Negligence – Negligence is the failure to perform a task which should be well known and/or obvious. It is common amongst IT people to skip practices (such as change control procedures, performing backups, etc.) when the workload becomes high, a habit also known as load shedding. Such shortcuts can lead to further problems. Negligence is best managed by educating staff and managing their workload effectively.
Stress – Stress is very common in the IT industry, particularly amongst those working in a helpdesk situation or in a technical support role. These people constantly receive nothing but the worst of calls: clients with problems that need to be understood and fixed quickly. Stress is caused by many factors, mostly by the desire to perform to perceived expectations. Stress often leads to fatigue and health problems.
Fatigue – Fatigue is the desire to rest. It often occurs when stress levels are high, or when the person has not been resting enough. Mistakes through fatigue are common in the IT industry: changes to critical systems must frequently be made at night, when they can be made without disrupting important users. These jobs place a lot of strain on technical people and can lead to fatigue, often to the point that the probability of making a mistake can outweigh the importance of the work being performed. Fatigue can build up over a period of days and weeks, and often leads to serious health problems.
In Security Operations, stress can be a major contributing factor in such mistakes. Many operations teams I have come across have a purely reactive role within the organisation. While they are responsible for correcting problems, they often don't have the authority to remove the root cause of the problem.