Managing human error in maintenance

Sandy Dunn 1 Dec 2014

Numerous research studies have shown that over 50% of all equipment fails prematurely after maintenance work has been performed on it. In the most embarrassing cases, the maintenance work performed was intended to prevent the very failures that occurred. Building on the latest academic research, and based on practical experience, this paper outlines the key things that maintenance managers can do to reduce or eliminate the impact of human error in maintenance.

The key points that will be covered include:

Human error is inevitable – we ignore it at our peril
The role of an optimum PM program in minimising the impact of human error
Maintenance Quality Management – essential elements for managing maintenance error

Introduction

In their ground-breaking work that led to the establishment of the technique that we now know as Reliability Centered Maintenance, Nowlan and Heap⁽ⁱ⁾ found, when analysing the failures of hundreds of mechanical, structural and electrical aircraft components, that these failures occurred with 6 distinct patterns, as illustrated below.

The interesting finding, in the context of this paper, is that more than two-thirds of all components demonstrated early-life failure. It has been estimated that maintenance errors ranked second only to controlled flight into terrain accidents in causing onboard aircraft fatalities between 1982 and 1991 (despite the application of RCM techniques in the airline industry during this period).⁽ⁱⁱ⁾

A study of coal-fired power stations indicated that 56% of forced outages occur less than a week after a planned or maintenance shutdown.⁽ⁱⁱⁱ⁾

Other studies have been conducted which confirm these findings, but, until recently, there has been little research performed that has investigated the reasons for this. Several plausible theories have been proposed – possible explanations that I have heard include:

“Human Error” – the repair/replace task was not successfully completed due to a lack of knowledge or skill on the part of the person performing the repair.
“System Error” – the equipment was returned to service after a high-risk maintenance task without the repair having been properly inspected/tested.
“Design Error” – the capability of the component being replaced is too close to the performance expected of it, and therefore lower capability (quality) parts fail during periods of high performance demand. The remaining higher capability (quality) parts are capable of withstanding all performance demands placed on them. This could be envisaged in the graph below.
“Parts Error” – the incorrect part or an inferior quality part has been supplied.

More recently, James Reason^(iv) has compiled a table summarising the results of three surveys – two performed by the Institute of Nuclear Power Operations (INPO) in the USA, and one by the Central Research Institute for the Electrical Power Industry (CRIEPI) in Japan. In all three of these studies, more than half of all identified performance problems were associated with maintenance, calibration and testing activities. In comparison, on average only 16% of problems occurred while these power stations were operating under normal conditions.

Reason also quoted the results of a Boeing Study^(v) which indicated that the top seven causes of inflight engine shutdowns (IFSDs) in Boeing aircraft were as follows:

Incomplete installation (33%)
Damaged on installation (14.5%)
Improper installation (11%)
Equipment not installed or missing (11%)
Foreign Object Damage (6.5%)
Improper fault isolation, inspection, test (6%)
Equipment not activated or deactivated (4%)

We can see from this that only one of these causes was unrelated to maintenance activities, and that maintenance activities contributed to approximately 80% of all IFSDs.
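
The arithmetic behind this figure can be checked directly from the percentages listed above. The short Python sketch below sums them. Treating Foreign Object Damage as the one cause unrelated to maintenance is our reading of the list, not something the Boeing study is quoted as stating.

```python
# Percentages of in-flight engine shutdown (IFSD) causes, as listed above.
ifsd_causes = {
    "Incomplete installation": 33.0,
    "Damaged on installation": 14.5,
    "Improper installation": 11.0,
    "Equipment not installed or missing": 11.0,
    "Foreign Object Damage": 6.5,
    "Improper fault isolation, inspection, test": 6.0,
    "Equipment not activated or deactivated": 4.0,
}

# Assumption: Foreign Object Damage is the cause unrelated to maintenance.
NON_MAINTENANCE = {"Foreign Object Damage"}

top_seven_total = sum(ifsd_causes.values())
maintenance_total = sum(
    share for cause, share in ifsd_causes.items() if cause not in NON_MAINTENANCE
)

print(f"Top seven causes: {top_seven_total:.1f}% of IFSDs")      # 86.0%
print(f"Maintenance-related: {maintenance_total:.1f}% of IFSDs")  # 79.5%
```

Under that assumption the maintenance-related causes come to 79.5% of all IFSDs, which rounds to the 80% quoted.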

If poor quality maintenance causes so many incidents in highly regulated and hazardous industries such as Nuclear Power Generation and Civil Aviation, what proportion of failures may be being caused by Maintenance within your organisation?
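
One rough way to start answering this question is to measure how many of your failures occur shortly after maintenance, in the spirit of the power station study quoted earlier. The Python sketch below is illustrative only: the function name, the record layout (equipment tag plus date) and the sample data are all made up, and the seven-day window simply mirrors the "less than a week" figure from that study.

```python
from datetime import date, timedelta


def share_of_failures_after_maintenance(failures, maintenance, window_days=7):
    """Return the fraction of failures that occurred within `window_days`
    after a maintenance job on the same equipment.

    Both arguments are lists of (equipment_tag, date) tuples.
    """
    if not failures:
        return 0.0
    window = timedelta(days=window_days)
    hits = 0
    for equipment, failed_on in failures:
        # A failure counts if any maintenance job on the same equipment
        # was completed between 0 and `window_days` days before it.
        if any(
            tag == equipment and timedelta(0) <= failed_on - done_on <= window
            for tag, done_on in maintenance
        ):
            hits += 1
    return hits / len(failures)


# Made-up records for illustration only.
maintenance = [("PUMP-01", date(2014, 3, 1)), ("FAN-07", date(2014, 3, 10))]
failures = [
    ("PUMP-01", date(2014, 3, 4)),   # 3 days after maintenance
    ("FAN-07", date(2014, 4, 20)),   # 41 days after maintenance
    ("PUMP-01", date(2014, 6, 2)),   # 93 days after maintenance
    ("FAN-07", date(2014, 3, 12)),   # 2 days after maintenance
]

share = share_of_failures_after_maintenance(failures, maintenance)
print(f"{share:.0%} of failures occurred within 7 days of maintenance")
```

A failure that follows maintenance closely was not necessarily caused by it, so a figure like this is a prompt for investigation rather than a measure of maintenance error.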

What are the outcomes of maintenance-induced failures? Clearly, depending on the industry in which you operate, there are potentially significant safety and environmental risks. There is a long list of catastrophic failures in which the inadequate performance of a maintenance task played a significant role. Some of these include:

Flixborough
Three Mile Island
Piper Alpha
American Airlines Flight 191
Bhopal
Japan Airlines Flight 123
Clapham Junction
Etc. etc.

But besides the obvious safety risks, perhaps the bigger consequences are economic. General Electric has estimated that each in-flight engine shutdown costs airlines in the region of US$500,000. What could maintenance-induced failures be costing your organisation?

Clearly, we need to do something to reduce the number of equipment failures that are being caused, not prevented, by maintenance. This paper suggests that the most appropriate approach is:

Admit that human error is inevitable (even in Maintenance!) and design our systems and processes around this inevitability
Use appropriate tools to ensure that we are not unnecessarily over-maintaining plant and equipment (and therefore increasing the risk associated with the fact that this work may not be performed correctly), and
Work to improve the quality with which maintenance activities are performed – including error-proofing where possible.

Human error is inevitable

Think of the traditional engineering approach to dealing with maintenance error, and most engineers tend to think along two lines – either discipline/counsel/train the individual(s) involved, and/or write a new procedure/work instruction to make sure that it doesn’t happen again. Unfortunately, recent research and experience by Behavioural Psychologists indicates that neither of these approaches is likely to be successful in eliminating maintenance error.

Work by Reason and Hobbs^(vi) explains why maintenance activities can be particularly error-provoking. In particular, it argues the futility of trying to change the human condition, when a more effective way of managing maintenance error is to treat errors as a normal, expected, and foreseeable aspect of maintenance work, and therefore, manage maintenance error by changing the conditions under which that work is carried out.

Reason and Hobbs identified a number of physiological and psychological factors which contribute to the inevitability of human error. These include:

Differences between the capabilities of our long-term memory and our conscious workspace. In particular, what we call “attention” is closely linked with the activities of the conscious workspace, and the conscious workspace has extremely limited capabilities including:
- Attention is an extremely limited commodity – if it is drawn to one thing, then it is, by necessity, withdrawn from other competing concerns
- These capacity limits give attention its selective properties – we can only attend to a very small proportion of the total available sensory data we receive
- Unrelated matters can capture attention – such as preoccupation with other sensory or emotional demands
- Attentional focus (concentration) is hard to maintain for any more than a few seconds
- The ability to concentrate depends strongly on the intrinsic interest of the current object of attention
- The more skilled or habitual our actions, the less attention they demand
- Correct performance requires the right balance of attention, neither too much nor too little.
The Vigilance Decrement – it is more common for inspectors to miss obvious faults the longer that they have been performing the inspection. This is particularly the case when the number of “hits” is few and far between.
The impact of fatigue – this could be due to:
- Time of day effects – our daily rhythms ensure that we are more likely to commit errors in the small hours of the morning
- Stresses – physical, social, drugs, pace of work, personal factors
The level of arousal – too much or too little arousal impairs work performance
Biases in thinking and decision making. There is no such thing as “common sense”. In particular we are subject to:
- Confirmation Bias – where we seek information that confirms our initial (and often incorrect) diagnosis of a problem
- Emotional Decision Making – if a situation keeps frustrating us, then we tend to move into “aggressive” mode, but this often clouds our better judgement

As a result of these contributing factors, the types of errors that occur most often in Maintenance include:

Recognition failures – these include
- Misidentification of objects, signals and messages, and
- Non-detection of problem states
Memory failures – this includes:
- Input failure – insufficient attention is paid to the to-be-remembered item. This in turn can include:
  - Losing our place in a series of actions
  - The “time-gap” experience
- Storage failure – remembered material decays or suffers interference. Most common in maintenance is the problem of forgetting the intention to do something
- Output failure – things we know cannot be recalled at the required time – the “what’s his name?” experience
- Omissions following interruptions – we rejoin a sequence of actions having omitted certain required steps
- Premature exits – we terminate a job before all the actions are complete
Skill-based slips. Generally associated with “automatic” routines, these can include:
- Branching errors – such as intending to drive to the golf course on a weekend, but missing the turnoff, and continuing on towards the office as you would every other day of the week
- Overshoot errors – intending to stop at the shops on the way home, but forgetting and continuing home without stopping
Rule-based Mistakes. Most maintenance work is highly proceduralised, and consists of many “rules”. These can be formally written, or exist only in people’s heads. Typical rule-based errors include:
- Misapplying a good rule – using a rule in a situation where it is not appropriate
- Applying a bad rule – the rule may get the job done in certain situations, but can have unwanted consequences. This is most common when people pick up others’ “bad habits”.
Knowledge-based errors. Generally the situation when someone is performing an unusual task for the first time. These need not necessarily be committed by inexperienced personnel.
Violations – deliberate acts which violate procedures. These can be:
- Routine violations – committed in order to avoid unnecessary effort, get the job done quickly, to demonstrate skill, or avoid what is seen as an unnecessarily laborious procedure
- Thrill-seeking violations – often committed in order to avoid boredom, or win peer praise
- Situational violations – those committed because it is not possible to get the job done if procedures are strictly adhered to.

Think of your own situation – have you never committed an error? For most of us, the consequences of our past errors are relatively minor – but that is largely due to luck, and the situation that we were in at the time. The traditional approach to dealing with human error – counselling and/or writing a procedure – cannot possibly effectively deal with all of the types of errors listed above. We need a more holistic approach for managing maintenance error, and assuring Maintenance Quality.

Avoid unnecessary “preventive” maintenance

Given the statistics mentioned earlier from Nowlan and Heap’s work, and others, it is clear that over-maintaining equipment is not only a waste of time and money, but also increases the risk of safety and environmental incidents, as well as potentially causing expensive and unnecessary failures.

Techniques based on the application of Reliability Centered Maintenance principles are an extremely effective way of weeding out this unnecessary maintenance, and streamlining and optimising equipment PM programs.

Our analysis of PM programs in place at our clients has indicated that in almost all organisations there is a huge amount of unnecessary routine maintenance being performed. In some situations, fewer than 10% of the existing PM tasks were optimal, and it is not unusual for us to identify that as much as half of the routine maintenance activities were, at best, a complete waste of time. In many cases, the performance of some of these “preventive” maintenance activities was potentially causing equipment failures, rather than preventing them – particularly where these activities involved intrusive, fixed interval inspections and overhauls. At one major offshore oil and gas platform in Western Australia, a comprehensive review of the Preventive Maintenance program led to a 25% reduction in the amount of routine PM being performed. It also led to a 25% reduction in the amount of Corrective maintenance being performed. Clearly, in this case, a fair proportion of the PM that had previously been performed was actually causing, rather than preventing, failures.

The starting point in eliminating unnecessary routine maintenance lies in ensuring that the need for all these routine maintenance tasks is defensibly justified. This is the objective of Assetivity’s Rapid Equipment Strategy Development process. This process is based on RCM principles and has ten steps as outlined below.

Determine Scope of Analysis
Verify Equipment Capability
Identify Failure Modes
Analyse Failure Modes, Effects and Consequences
Select Recommended Maintenance Tasks
Identify Additional Improvement Tasks
Consolidate Schedules and Integrate with Operational Strategies
Gain Approval and Implement Recommended Actions
Track Success
Beyond RCM and PMO

A detailed description of this process is beyond the scope of this paper. We would strongly suggest, however, that if you have not already done so, a critical review of your PM program is an essential first step to managing the impact of human error in maintenance.

Maintenance quality management: key principles

Following Reason and Hobbs^(vii), the following are the principles that a Maintenance Quality Management system must embrace:

Human error is both universal and inevitable. Human error is not a moral issue – making errors is as much a part of human life as eating and breathing
Errors are not intrinsically bad. Success and failure spring from the same roots. We are error-guided creatures. Errors mark the boundaries of the path to successful action
You cannot change the human condition, but you can change the conditions in which humans work. There are two parts to an error – a mental state and a situation. We have limited control over people’s mental states, but we can control the situations in which they have to work.
The best people can make the worst mistakes. No one is immune to error – if only a few people were responsible for most of the errors, then the solution would be simple, but some of the worst mistakes are made by the most experienced people.
People cannot easily avoid those actions they did not intend to commit. Blame and punishment are not appropriate when people’s intentions were good, but their actions did not go as planned. This does not mean, however, that people should not be accountable for their actions; they should be, and they should be given the opportunity to learn from their mistakes.
Errors are consequences, rather than causes. Errors are the product of a chain of actions and conditions which involve people, teams, tasks, workplace and organisational factors. Discovering a human error is the beginning of the search for causes, not the end.
Many errors fall into recurrent patterns. More than half of maintenance errors are recognised as having happened before – often many times. Targeting these recurrent errors is the most effective way of addressing human error issues.
Safety-significant errors can occur at all levels in the system. Indeed, the higher in an organisation that an error is made, the more significant the consequences.
Error Management is all about managing the manageable. Situations are manageable – human nature, in its broadest sense, is not.
Maintenance Quality Management is about making good people excellent. Maintenance Quality Management is not about making a few error-prone people better – rather it is a way of making the larger proportion of well-trained and motivated people excellent
There is no one best way. Different Maintenance Quality Management methods will apply in different situations, and in different organisations.
Effective Maintenance Quality Management aims at Continuous Reform rather than Local Fixes. The temptation is to resolve errors one at a time, as they arise, but as errors tend to be systemic in nature, a more appropriate method is to deal with human error systematically, and continuously.
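
The principle that many errors fall into recurrent patterns suggests a simple first analysis: count how often each type of error has been reported, and target the ones that repeat. The Python sketch below shows one way this could be done. The error categories and the sample reports are hypothetical, and a real analysis would depend on how errors are classified in your own reporting system.

```python
from collections import Counter


def recurrent_errors(error_reports, min_count=2):
    """Return (error_type, count) pairs for error types reported at least
    `min_count` times, most frequent first."""
    counts = Counter(error_reports)
    return [(kind, n) for kind, n in counts.most_common() if n >= min_count]


# Hypothetical error reports, one entry per reported error.
reports = [
    "step omitted",
    "wrong part fitted",
    "step omitted",
    "tool left in equipment",
    "step omitted",
    "wrong part fitted",
]

for kind, n in recurrent_errors(reports):
    print(f"{kind}: reported {n} times")
```

In this made-up sample, two recurring error types account for five of the six reports, which is the kind of concentration that makes recurrent errors an efficient target.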

There are a number of Maintenance Quality Management tools that can be applied. The exact combination of these that is most appropriate for any organisation varies, but they could include:

Person measures

Provide training in error-provoking factors. Training maintenance personnel in order to give them an understanding and awareness of the factors and situations that may lead them to be more error-prone is a starting point in successfully addressing human error. They should understand such factors as the limitations of human performance, the limitations of short term memory, the impact of fatigue, the impact of interruptions, the impact of pressure and stress, the types of errors that they can make, and the situations in which these errors are most likely to arise. Once maintainers are aware of their own limitations, then they can start to detect the warning signals that indicate a higher risk of an error being made, and can take steps to prevent this from happening.

Implement measures to reduce the number of deliberate violations. Traditional approaches to the avoidance of violations tend to focus on scaring people into compliance. This may have its place, but an additional, effective approach is to create a social environment within the workplace where deliberate violations bring disapproval from within people’s peer groups. There are a number of approaches that are being tried, both within and external to the workplace, which appear to be successfully creating this social environment, but overnight success stories are rare.

Encourage mental rehearsal of tasks before they are performed. There is significant evidence to suggest that achieving the right degree of mental readiness for a task before it begins has a significant positive impact on the quality and reliability with which this task is performed. This is based on studies of surgeons and Olympic athletes.^(viii)

Control Distractions. Anticipating the distractions that are likely to occur, and developing a strategy for dealing with them before they occur is most likely to enhance the quality of task performance.

Avoid Place-Losing Errors. This can be done through such techniques as inserting place-markers at appropriate points in the procedure.

Team measures

Provide teamwork training. Significant accidents have occurred as a result of poorly functioning teams. Most notable of these was an aircraft accident involving KLM and PanAm 747s at Tenerife, which resulted in the loss of more than 500 lives. Effective teamwork training will focus on:

Communication skills
Crew development and leadership skills
Workload management, and
Technical proficiency

Workplace and task measures

Ensure that personnel only perform tasks when they are properly trained, skilled and qualified. It goes without saying that quality work practices can only be put in place when maintenance personnel have the requisite technical skills and capabilities required to perform the work that is allocated to them.

Fatigue Management. Ensure that a well-designed shift roster is in place which minimises the impact of fatigue. Ensure also that there are adequate controls in place for managing overtime work.

Assign tasks appropriately. There is evidence to suggest that there is a link between the frequency with which a task is performed, and the likelihood that the task will be performed correctly. Both infrequently performed and very frequently performed tasks tend to be those at greatest risk of human error. Infrequently performed tasks are generally more at risk because of the lack of experience of the person performing the task, while very frequently performed tasks fall victim to skill-based slips and lapses, as the person performing the work operates on “auto-pilot”. Intelligent allocation of work to individuals takes this into account, and can assist in minimising human error.

Ensure that equipment, and tasks, are properly designed. In order to minimise the likelihood of error in performing maintenance tasks, the equipment should be designed for maintainability. This should include consideration of such factors as:

Easy access to components
Components that are functionally related should be grouped together
Components should be clearly labelled
There should be minimal requirement for special tools
It should not be necessary to perform high-precision work in the field
Equipment should be designed to permit easy fault diagnosis

Enforce good housekeeping standards. Housekeeping practices are a good indicator of attitudes and culture relating to quality. The correct standards are those that avoid dangerous slovenliness, without resorting to anally-retentive cleanliness.

Ensure Spare Parts and Tools are managed well. Maintenance cannot perform high quality work if the parts and tools that they need are not available when required. This leads to potentially dangerous short-cuts and workarounds being put in place. An important aspect of Maintenance Quality Management is ensuring that Tool Management and Spare Parts Management processes and practices support the achievement of high quality work.

Write, and Use, Effective Maintenance Work Instructions. Omission of necessary steps is the most common form of maintenance error. Some estimates suggest that omissions account for more than half of all human factors problems in maintenance. The development, and use, of effective maintenance work instructions is an important tool in managing these types of errors.

Organisational measures

Put in place effective processes for analysing, and learning from, past failures. It is vitally important that any significant failures should be investigated using an effective Root Cause Analysis process. This Root Cause Analysis process, to be effective, should fully investigate all of the contributing causes to the failure, whether these be physical causes, human causes, or organisational causes. The most effective solutions to preventing these failures from happening again will be those that deal effectively with the organisational causes of failures.

However, in order to effectively analyse those failures that are occurring as a result of human failures, it is also necessary to engender a “Reporting Culture” within the organisation – where all failures, no matter how seemingly insignificant, are reported. This, in turn, particularly when we are dealing with human errors, requires the development of a high level of trust between management and those at lower levels in the organisation. People must not feel that reporting human failures is likely to lead to adverse personal consequences. Those who have researched so-called “High Reliability Organisations” (HROs) have noted that a high level of failure reporting is a significant feature of those organisations.^(ix)

Put in place proactive processes for assessing the risk of future maintenance errors. Avoiding the recurrence of past failures is an admirable, but insufficient, goal for those seeking to achieve high quality maintenance outcomes. One proactive method that could be employed to manage Maintenance Quality is to perform a risk assessment of maintenance activities, in order to assess whether the likelihood of human error is high. Possible areas that could be assessed in this risk assessment would include:

The knowledge, skills and experience of maintenance personnel at all levels
Employee morale
The availability of tools, equipment and parts to perform maintenance tasks
Workforce fatigue, stress and time pressures
Shift rosters
The adequacy of maintenance procedures and work instructions
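
As a minimal sketch of how such an assessment might be recorded, the Python example below rates each of the areas listed above and flags those that fall below a threshold. The 1 (poor) to 5 (good) scale, the threshold, the function name and the sample ratings are all assumptions made for illustration; they are not taken from any published assessment method.

```python
# Assessment areas taken from the list above.
AREAS = [
    "Knowledge, skills and experience of maintenance personnel",
    "Employee morale",
    "Availability of tools, equipment and parts",
    "Workforce fatigue, stress and time pressures",
    "Shift rosters",
    "Adequacy of maintenance procedures and work instructions",
]


def areas_of_concern(ratings, threshold=3):
    """Return (area, rating) pairs rated below `threshold`, worst first.

    `ratings` maps each area to a score on a hypothetical scale from
    1 (poor) to 5 (good).
    """
    missing = [area for area in AREAS if area not in ratings]
    if missing:
        raise ValueError(f"No rating supplied for: {missing}")
    flagged = [(area, score) for area, score in ratings.items() if score < threshold]
    return sorted(flagged, key=lambda item: item[1])


# Made-up ratings for illustration only.
ratings = dict(zip(AREAS, [4, 2, 3, 1, 4, 2]))

for area, score in areas_of_concern(ratings):
    print(f"{score}/5  {area}")
```

The value of an exercise like this lies less in the numbers than in repeating it regularly, so that deteriorating conditions are noticed before they produce errors.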

One example of a risk assessment process that is used in the aviation industry is Managing Engineering Safety Health (MESH) which was developed initially by British Airways in the early 1990s, and has been further developed and adapted by Singapore Airlines.^(x)

In addition, more specific review and assessment of error detection and containment defences can be performed. This could ask questions such as:

Are there adequate processes in place for independent inspection of high-risk tasks?
Are functional tests and checks ever omitted or abbreviated, for any reason?
Have tasks ever been signed off as completed, when this was subsequently found not to be the case?
After maintenance, is equipment adequately tested before being returned to service?
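
Note that the desirable answer to these questions is not always "yes". The Python sketch below records, for each question, the answer that would indicate a sound defence, and lists the questions where the actual answer differs. The data structure, function name and sample answers are illustrative assumptions only.

```python
# Each entry pairs a review question from the list above with the answer
# that would indicate a sound defence.
DEFENCE_QUESTIONS = [
    ("Are there adequate processes in place for independent inspection "
     "of high-risk tasks?", True),
    ("Are functional tests and checks ever omitted or abbreviated, "
     "for any reason?", False),
    ("Have tasks ever been signed off as completed, when this was "
     "subsequently found not to be the case?", False),
    ("After maintenance, is equipment adequately tested before being "
     "returned to service?", True),
]


def weak_defences(answers):
    """Return the questions whose answer differs from the desired one.

    `answers` is a list of booleans in the same order as DEFENCE_QUESTIONS.
    """
    if len(answers) != len(DEFENCE_QUESTIONS):
        raise ValueError("one answer is required per question")
    return [
        question
        for (question, desired), given in zip(DEFENCE_QUESTIONS, answers)
        if given != desired
    ]


# Made-up answers for illustration only.
answers = [True, True, False, False]

for question in weak_defences(answers):
    print("Review needed:", question)
```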

Ultimately, even putting both proactive and reactive measures in place will not guarantee the absence of human error, but together, these strengthen the organisation’s intrinsic resistance to human error.

Conclusion

The impact of human error on maintenance quality and costs, safety and equipment reliability is huge. Yet we are only just starting to develop a better understanding of what causes error in maintenance activities, and to develop better tools and techniques to avoid or minimise the consequences of this error. This paper has attempted to outline some of the latest research findings, and provide you with some ideas that you may find useful in addressing maintenance error within your organisation.

[i] Nowlan FS & Heap H – Reliability-centered Maintenance. Springfield, Virginia: National Technical Information Service, US Department of Commerce, 1978.
[ii] Davis RA – Human Factors in the Global Marketplace – Keynote address, Annual Meeting of the Human Factors and Ergonomics Society, Seattle, 12 October 1993
[iii] Smith A – Reliability Centered Maintenance – Boston, McGraw Hill, 1992
[iv] Reason J – Managing the Risks of Organizational Accidents – Ashgate Publishing, 1997
[v] Boeing – Maintenance Error Decision Aid, Seattle: Boeing Commercial Airplane Group, 1994
[vi] Reason J & Hobbs A – Managing Maintenance Error, Ashgate Publishing, 2003
[vii] Reason J & Hobbs A – Managing Maintenance Error, Ashgate Publishing, 2003
[viii] Orlick T – In Pursuit of Excellence – Ottawa, Zone of Excellence, 2000
[ix] See for example, Karl E. Weick & Kathleen M. Sutcliffe, “Managing the Unexpected – Assuring High Performance in an Age of Complexity”, Jossey-Bass, 2001
[x] See Reason J – Managing the Risks of Organizational Accidents – Ashgate Publishing, 1997
