Much has been written on the subject of the “root cause” and “root cause analysis” (RCA) of failures, and it is a subject on which it is worth spending considerable time and effort. But first, let’s define a “root cause”.
If a root cause is defined as “the first action, decision or omission that eventually leads to a failure” it can lead to some conclusions that may not be particularly helpful for preventing recurrence of the problem. As an example, consider the failure of a knife-gate valve in an effluent system that shut down a large pulp and paper mill for three days. This valve isolated primary and backup effluent pumps from each other. Devices that separate primary and backup utilities (water, electrical, air, effluent, etc.) are often the most critical components in a large operation and the hardest to access for maintenance. The mode of failure of this valve was the fracture of a manufacturer’s weld that attached the valve yoke to the body. The pressure of the effluent forced the spade out of the body and the escaping 180°F effluent submerged all the effluent pumps, shutting down the plant.
One cause of this failure was the use of an inappropriate valve for the service, but was that the “root cause”? If the question is asked “Why was such a valve selected?” there are a number of possible answers, including:
– the project budget did not allow for the purchase of a more reliable valve.
– the design engineer did not appreciate the operating context to which the valve would be exposed (very high vibration levels).
– there was blind adherence to the plant’s valve standards, which did not consider the criticality of the service.
If any one of these factors is the root cause, then the conclusion would be that the valve should be replaced with a more suitable and reliable valve. While this would certainly prevent recurrence, the cost of such a project and its associated downtime, in this case, would be prohibitive.
While it is of value to perform an analysis and identify the root cause, it may not be of much use in solving any specific problem. The objective is usually to prevent a problem from occurring again and this can often be achieved without finding the real “root cause”. And, of course, any analysis of the root cause is of no value unless it leads to actions, some of which may be difficult and expensive, and the discipline required to ensure that this happens is covered later in this article.
In the valve example, the action that was taken to prevent recurrence was to redesign the attachment of the valve yoke to the valve body so that it would not fail when subject to high vibration and other actual operating conditions. This action was taken without any consideration of the real root cause (valve selection).
So it can be seen that identifying the “root cause” and achieving “elimination of the problem” may be two different things.
To complete the valve example, in an excellent maintenance organization the following two actions would be taken:
– “Eliminate the problem” – redesign the valve yoke-to-body connection to prevent further failures on the failed valve and similar valves in a similar operating context.
– “Root cause” – revise the plant standards and train the design engineers so that the selection of components will always take into account such factors as operating context, maintenance accessibility and the consequence of failure.
From the above example, a more practical definition of “Root Cause” is “The cause of a problem which, if adequately addressed, will prevent a recurrence of that problem”. Looking at another example:
Imagine that a bearing has failed, and that an investigation shows that it had not been lubricated. Asking the question “Why had it not been lubricated?” may lead to the discovery that the grease point for the bearing had been missed during a lubrication survey and it was not on the lubrication mechanic’s route sheet.
Using the new definition of “root cause” this problem can be prevented from happening again by simply adding this grease point to the lubrication route sheet.
But if the definition of “root cause” is changed slightly to “The cause of a problem which, if adequately addressed, will prevent recurrence of that problem and similar problems” it raises the bar to a higher level. The next question then would become “Why had the grease point been missed from the lubrication route?” The answer may be that lubrication routes were set up by a single person, with no checks or confirmation that the routes were complete. This may lead to an action to change the procedure for the development of lubrication routes to ensure that there are no other missed lubrication points in the plant, nor will there be in the future.
By asking the question “why” a few times, the root cause of a problem is often identified as a procedural, or management, shortcoming. In the valve and bearing examples, identifying the real root cause should result in such changes as revised plant component and procedure standards, improved training of design engineers and rigid procedures for establishing plant lubrication and inspection routines. Addressing true root causes often requires a change of thinking and some expense and effort, but the results will be much longer-lasting and higher-value than correcting individual failures because they will improve not only maintenance and operating procedures but will also lead to better plant engineering and design decisions.
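The repeated “why” questioning in the bearing example can be written down as a simple chain, which makes it easier to see that each level of cause has its own corrective action. The following is only a minimal sketch: the `WhyStep` record and the `deepest_action` helper are hypothetical names used for illustration, and the first action (“Replace the failed bearing”) is an assumption, since the example above does not state the immediate repair.

```python
from dataclasses import dataclass


@dataclass
class WhyStep:
    """One level of questioning: the question, its answer, and the action it suggests."""
    question: str
    answer: str
    action: str


# The bearing example, one entry per "why". The first action is assumed.
bearing_chain = [
    WhyStep(
        "Why did the bearing fail?",
        "It had not been lubricated.",
        "Replace the failed bearing.",
    ),
    WhyStep(
        "Why had it not been lubricated?",
        "The grease point was not on the lubrication route sheet.",
        "Add the grease point to the lubrication route sheet.",
    ),
    WhyStep(
        "Why had the grease point been missed from the route?",
        "Routes were set up by a single person, with no check for completeness.",
        "Change the procedure for developing lubrication routes.",
    ),
]


def deepest_action(chain):
    """Return the action attached to the last 'why', i.e. the one aimed at similar problems too."""
    return chain[-1].action


for level, step in enumerate(bearing_chain, start=1):
    print(f"Why #{level}: {step.question}")
    print(f"  Answer: {step.answer}")
    print(f"  Action: {step.action}")
```

Each step further down the chain addresses a wider class of failures: the second action prevents recurrence on this bearing, while the third is the procedural change that should also catch other missed grease points.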
One perception of an RCA programme that limits its value and may even lead to its demise is that a root cause investigation must involve a large number of people in a long meeting or meetings to reach a conclusion. Often the people required are the key operating, maintenance and engineering specialists who will always have other, pressing priorities. For some complex problems with serious consequences, such meetings may be necessary, but they should be few and far between. These investigation meetings may use any one of the many structured techniques for identifying the root cause, including fishbone diagrams, “Questioning to the Limit”, Kepner-Tregoe Problem Analysis® and others.
However, where possible, the RCA process should usually be very simple.
In our Planning and Scheduling training we stress that all tradespeople should provide useful feedback whenever a corrective-maintenance repair is completed. Supervisors should be in the habit of asking three questions of their tradespeople:
– Do you think this job was necessary?
– Can you think of a better way to do it?
– Can you suggest any action that will prevent another failure like the one that you’ve just repaired?
To assist tradespeople to provide the best answer to the last question, they should be encouraged to discuss repair work with operators. Between them, an experienced operator and an experienced tradesperson will have the knowledge required to suggest an action to prevent the recurrence of failures in the large majority of cases.
Of course, and this is the key, once the root cause has been determined, the recommended or agreed action to address it MUST be taken. The discipline required to ensure that this happens is often lacking, for a number of reasons. With limited resources, long-term solutions must be prioritized along with all the other day-to-day requests for work, and addressing the root cause may be well down the list of priorities and may stay there until it is forgotten or cancelled.
Consider another example. A tradesman has just changed the rotating assembly on a condensate pump and writes the following note on the back of the work order: “Changed the rotating assembly. Found that the coupling was out of line by 0.060″ and had to also change the coupling insert. Two of the pump base anchor bolts are loose and there seems to be a lot of pipe strain.”
On receiving such a note, the Supervisor (or Planner, or whoever has the responsibility for follow-up, which should be defined in their job description – see “Maintenance business processes and position descriptions”) must recognize that some action is required to prevent a recurrence of this problem. Their responsibility at this point is to enter a work request to repair or replace the pump base and to check the piping supports, cold spring, etc. Note that these jobs may require substantially more effort, and more downtime, than the initial job of changing the rotating assembly and may be put off for some time because of their complexity.
This is not unusual for root causes – they are often difficult to address, but the best organizations will make sure that they are given the priority they deserve and will make this work a cornerstone of a continuous improvement process. Some system to ensure the completion of work to address root causes is required, and one strong driver is to include Maintenance in an ISO 9000 quality assurance programme, where this is in effect. Another tool is to include a work order field that defines how the work was identified, where one value in the “drop-down” list is “Investigation” (see “Work order coding”). This allows for easy tracking of all root cause elimination work orders.
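To illustrate how such a coding field makes follow-up visible, here is a small sketch of filtering work orders by how the work was identified. It is not taken from any particular maintenance management system: the `WorkOrder` record, the `IdentifiedBy` list and the work order numbers are all hypothetical, and only the “Investigation” value comes from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class IdentifiedBy(Enum):
    """Hypothetical drop-down values; only 'Investigation' is named in the article."""
    INVESTIGATION = "Investigation"
    INSPECTION = "Inspection"
    OPERATOR_REQUEST = "Operator request"


@dataclass
class WorkOrder:
    number: str
    description: str
    identified_by: IdentifiedBy
    completed: bool = False


def open_root_cause_work(orders):
    """Return work orders raised from investigations that are not yet completed."""
    return [
        wo for wo in orders
        if wo.identified_by is IdentifiedBy.INVESTIGATION and not wo.completed
    ]


# Illustrative data based on the condensate pump example.
orders = [
    WorkOrder("WO-1001", "Change rotating assembly on condensate pump",
              IdentifiedBy.OPERATOR_REQUEST, completed=True),
    WorkOrder("WO-1002", "Repair or replace pump base",
              IdentifiedBy.INVESTIGATION),
    WorkOrder("WO-1003", "Check piping supports and cold spring",
              IdentifiedBy.INVESTIGATION),
]

for wo in open_root_cause_work(orders):
    print(wo.number, wo.description)
```

A list like this, reviewed regularly, is one way to keep root cause work from sinking to the bottom of the backlog and being forgotten.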
Bearing failures deserve a special mention, because they are a common cause of downtime in manufacturing. If rotating equipment is properly selected for the operating context to which it will be exposed, the operating load on its bearings will virtually always be in the theoretical “infinite life” range – i.e. the contact points of rolling elements will always be separated by an oil film. Bearing failures nearly always result from non-operating loads on bearing surfaces, which are caused by oil contaminants or corrosion products bridging the oil film, overloading, misalignment, out-of-balance, etc. However, the root cause of bearing failures can often be traced back to events that occurred before the equipment was put into service. These events include improper storage and handling of both bearings and equipment. I remember seeing a small mixer being unpacked from the crate in which the manufacturer had carefully supported and packed it to prevent damage. It was then put on a pallet and bounced across the rough roads in the plant on a hard-tired lift truck to its destination. Such treatment may reduce the bearing life from many years to a few months, but the root cause may never be recognized.
Excellent organizations will encourage and promote the elimination of root causes by training tradespeople and operators in the principles of operation of the equipment for which they are responsible and allowing them time to help to find the reasons for equipment failures. They will also ensure that the work required to eliminate repetitive problems is given a high priority and is closely followed up in the work management system. As in many other aspects of maintenance, root cause elimination may be perceived as a process for “Maintenance to work itself out of a job”, and it will, over time, reduce the number of breakdowns and therefore the total maintenance effort required to keep the operation reliable. A good manager will understand such concerns and will act appropriately (see “The Maintenance Cost-reduction Conundrum”).
One last word. I once attended an investigation into an accident where a mechanic had broken an ankle because he had used the top half of an extension ladder (without safety feet) and the ladder had slipped out from under him. The recommendation that resulted was to dispose of the ladder and to check the plant for any other unsafe ladders. Present at the investigation, as an observer, was our local government Workmen’s Compensation Board representative, and when the meeting was over he said, “I know that your recommendation is sound, but the question I have is ‘How can your management allow a work environment to exist where any employee accepts the use or even the existence of an unsafe ladder or any other unsafe equipment?’” This question is a real sleeper and goes right to the heart of the Root Cause philosophy.
Don Armstrong, P Eng
President, Veleda Services Ltd