What is reliability centered maintenance (RCM)?

Sandy Dunn 21 Oct 2016

Reliability Centered Maintenance (RCM) is a very powerful methodology which, when properly applied, can drive significant improvements in equipment reliability and plant performance while, at the same time, making sure that the money being spent on Predictive and Preventive maintenance programs is optimised.

What is reliability centered maintenance?

One of the most common approaches for developing or improving a Preventive Maintenance program is to use Reliability Centered Maintenance (RCM). In this article we give a brief summary of the key elements of Reliability Centered Maintenance.

Reliability centered maintenance (RCM) standards

We are fortunate that there are several standards that define the Reliability Centered Maintenance approach. This includes the SAE standard JA1011: Evaluation Criteria for Reliability-Centered Maintenance (RCM) Processes, and IEC 60300.3.11-2011 Dependability management – Application guide – Reliability Centered Maintenance.

RCM is a structured process which sequentially asks the following seven questions about the asset or system under review:

Functions – what are the functions and associated performance standards of the asset in its present operating context?
Functional Failures – in what ways does it fail to fulfill its functions?
Failure Modes – what causes each functional failure?
Failure Effects – what happens when each failure occurs?
Failure Consequences – in what way does each failure matter?
Proactive Tasks – what can be done to predict or prevent each failure?
Default Actions – what should be done if a suitable proactive task cannot be found?

Each of these questions is discussed briefly below.

Functions and performance standards

A key concept of Reliability Centered Maintenance is the understanding that the primary purpose of a preventive maintenance program is to ensure that equipment continues to do what the business requires it to do, in its present operating context. Therefore, the first step is to ensure that we full understand what it is that we need the equipment to do (its Functions), and the level of performance that is required if the business is to meet its objectives.

The Functions of equipment consists of Primary Functions (normally associated with the reasons that the equipment was acquired in the first place), and Secondary Functions (additional requirements, often to do with Safety, Efficiency etc.).

For many equipment items, there can be several (potentially as much as 20 or 30 in some cases) functions. For each of these functions, we must, where possible, quantify the level of performance required. For example, it would not be sufficient to state that the Primary Function of a pump is to pump a liquid from A to B, we would also need to specify the flow rate required, if production output targets were to be met.

As a result, if done properly, answering this first question alone can take between 25% and 35% of the total time required for an entire RCM analysis.

Functional failures

Functional Failures (or Failed States) simply define, for each function, the states under which the equipment does not fulfil its Functions. For each function, there is a need to consider both complete failure (for example where the equipment fails to operate at all), and partial failures (where the equipment operates, but does not operate at a level sufficient to meet the performance standard associated with that Function.

Failure modes

Once each Functional Failure has been identified, the next step in the RCM process is to identify all the events which are reasonably likely to cause each failed state. These events are known as Failure Modes. “Reasonably likely” Failure Modes include those which have occurred on the same or similar equipment operating in the same context, failures which are currently being prevented by existing maintenance regimes, and failures which have not happened yet but which are considered to be real possibilities in the context in question. The list should include failures caused by human errors (on the part of operators and maintainers) and design flaws so that all reasonably likely causes of equipment failure can be identified and dealt with appropriately. It is also important to identify the cause of each failure in enough detail to ensure that time and effort are not wasted trying to treat symptoms instead of causes. On the other hand, it is equally important to ensure that time is not wasted on the analysis itself by going into too much detail.

Failure effects

The fourth step in the RCM process entails listing Failure Effects, which describe what happens when each failure mode occurs. These descriptions should include all the information needed to support the evaluation of the consequences of the failure (in the next step in the process), such as:

what evidence (if any) that the failure has occurred
in what ways (if any) it poses a threat to safety or the environment
in what ways (if any) it affects production or operations
what physical damage (if any) is caused by the failure
what must be done to repair the failure.

Identifying the relevant Failure Modes and Failure Effects is also a comparatively lengthy activity – typically requiring around 30-35% of the time required for the entire RCM analysis.

Failure consequences

Another key concept underpinning Reliability Centered Maintenance is that the primary objective of a Preventive Maintenance program is not necessarily to avoid or minimise failures themselves, but to avoid or minimise the consequences of those failures. There is little point in spending a lot of time and money preventing failures that have little or no consequences associated with them. On the other hand, if a failure has serious consequences, we may be able to justify going to great lengths to avoid those consequences. In this way, the RCM process focuses attention on the maintenance activities which have most effect on the performance of the organization, and diverts energy away from those which have little or no effect.

The fifth step in the RCM process classifies the consequences associated with each failure mode as belonging to one of the following four groups:

Hidden failure consequences: Hidden failures have no direct impact, but they expose the organization to multiple failures with serious, often catastrophic, consequences. (Most of these failures are associated with protective devices which are not fail-safe.)
Safety and environmental consequences: A failure has safety consequences if it could hurt or kill someone. It has environmental consequences if it could lead to a breach of any corporate, regional, national or international environmental standard.
Operational consequences: A failure has operational consequences if it affects production (output, product quality, customer service or operating costs in addition to the direct cost of repair)
Non-operational consequences: Evident failures which fall into this category affect neither safety nor production, so they involve only the direct cost of repair.

The consequence evaluation process also shifts emphasis away from the idea that all failures are bad and must be prevented. In so doing, it focuses attention on the maintenance activities which have most effect on the performance of the organization, and diverts energy away from those which have little or no effect.

Proactive tasks

In RCM, failure management techniques are divided into two categories:

Proactive Tasks: these are tasks undertaken before a failure occurs, in order to prevent the item from getting into a failed state. They embrace what is traditionally known as ‘predictive’ and ‘preventive’ maintenance, although we will see later that RCM uses the terms Scheduled Restoration, Scheduled Discard, and Condition-based Maintenance
Default Actions: these deal with the failed state, and are chosen when it is not possible to identify an effective proactive task. Default actions include failure-finding, redesign and run-to-failure.

Many people still believe that the best way to improve equipment reliability is to do some kind of proactive maintenance on a routine basis. Conventional wisdom suggested that this should consist of overhauls or component replacements at fixed intervals. Figure 1 illustrates the fixed interval view of failure.

Conditional Probability of Failure — Figure 1 – Traditional View of Equipment Failure

Figure 1 is based on the assumption that most items operate reliably for a period of time, and then wear out. Classical thinking suggests that extensive records about failure will enable us to determine this life and so make plans to take preventive action shortly before the item is due to fail in future.

This model is true for certain types of simple equipment, and for some complex items with dominant failure modes. In particular, wear-out characteristics are often found where equipment comes into direct contact with the product. Age-related failures are also often associated with fatigue, corrosion, abrasion and evaporation.

However, equipment in general is far more complex than it used to be. This has led to significant changes in the patterns of failure, as shown in Figure 2. The graphs show conditional probability of failure against operating age for a variety of electrical and mechanical items.

Pattern A is the well-known bathtub curve. It begins with a high incidence of failure (known as infant mortality) followed by a constant or gradually increasing conditional probability of failure, then by a wear-out zone. Pattern B shows constant or slowly increasing conditional probability of failure, ending in a wear-out zone (the same as Figure 1).

Pattern C shows slowly increasing conditional probability of failure, but there is no identifiable wear-out age. Pattern D shows low conditional probability of failure when the item is new or has just been refurbished, then a rapid increase to a constant level, while pattern E shows a constant conditional probability of failure at all ages (random failure). Pattern F starts with high infant mortality, which drops eventually to a constant or very slowly increasing conditional probability of failure.

Studies done by Nowlan and Heap in the 1960s on civil aircraft showed that 4% of the items conformed to pattern A, 2% to B, 5% to C, 7% to D, 14% to E and no fewer than 68% to pattern F. (The number of times these patterns occur in aircraft is not necessarily the same as in other industries. But there is no doubt that as assets become more complex, we see more and more of patterns E and F.)

These findings contradict the belief that there is always a connection between reliability and operating age. This belief led to the idea that the more often an item is overhauled, the less likely it is to fail. Nowadays, this is seldom true. Unless there is a dominant age-related failure mode, age limits do little or nothing to improve the reliability of complex items. In fact, scheduled overhauls can actually increase overall failure rates by introducing infant mortality into otherwise stable systems.

As mentioned earlier, RCM divides proactive tasks into three categories, as follows:

Scheduled Restoration tasks
Scheduled Discard tasks
Condition-based Maintenance tasks.
Scheduled Restoration and Scheduled Discard tasks

Scheduled restoration entails remanufacturing a component or overhauling an assembly at or before a specified age limit, regardless of its condition at the time. Similarly, scheduled discard entails discarding an item at or before a specified life limit, regardless of its condition at the time.

Collectively, these two types of tasks are now generally known as Preventive Maintenance. They used to be by far the most widely used form of proactive maintenance. However, for the reasons discussed above, they are much less widely used than they used to be.

The continuing need to prevent certain types of failure, and the growing inability of classical techniques to do so, are behind the growth of Condition-based approaches to failure management. The majority of these techniques rely on the fact that most failures give some warning of the fact that they are about to occur. These warnings are known as Potential Failures, and are defined as identifiable physical conditions which indicate that a Functional Failure is about to occur or is in the process of occurring.

The new techniques are used to detect potential failures so that action can be taken to avoid the consequences which could occur if they degenerate into Functional Failures. They are called Condition-based tasks because items are left in service on the condition that they continue to meet desired performance standards. Condition-based tasks can include the use of sophisticated technology, such as Vibration Analysis, Thermography, Oil Analysis, Ultrasonics and others, but they can also include simple techniques such as visual inspection. Used appropriately, on-condition tasks are a very good way of managing failures, but they can also, if not applied in the right way and at the right frequency, be an expensive waste of time.

RCM provides a structured decision-making process, with clear evaluation criteria, which enables decisions regarding the selection of the appropriate Proactive Task to be made with confidence.

Default actions

RCM recognizes three major categories of default actions, as follows:

Failure-Finding: Failure-finding tasks entail checking hidden functions periodically to determine whether they have failed (whereas condition-based tasks entail checking if something is failing).
Redesign: redesign entails making any one-off change to the built-in capability of a system. This includes modifications to the hardware and also covers once-off changes to procedures.
No Scheduled Maintenance: as the name implies, this default entails making no effort to anticipate or prevent failure modes to which it is applied, and so those failures are simply allowed to occur and then repaired. This default is also called run-to-failure.

Conclusion

I hope that this has given you a flavour of what RCM is all about. However, I have been training and implementing RCM in organisations for over 20 years, and I can tell you that there is a lot more to successfully applying this process than I have had time to cover in this article.

Want to learn more about RCM?

If you are interested in learning more, then why not consider attending one of our two-day courses in Reliability Centered Maintenance and PM Optimisation. These run in various locations around Australia, or we can deliver the course in-house for your organisation. Our consultants can also facilitate RCM studies and help you to embed the RCM process within your organisation. Contact us if you would like to discuss how we may be able to help you.