Downloaded from Reliabilityweb.com on the web at http://www.reliabilityweb.com
Effective Maintenance Program Development/Optimization
Sammy Seifeddine HSB Reliability Technologies Senior Project Manager 800 Rockmead Drive Three Kingwood Place, Suite 180 Kingwood, TX 77339 (281) 358-1477 ext. 276 (281) 358-1871 fax
[email protected]
12th International Process Plant Reliability Conference October 22-23, 2003 Houston, Texas
Abstract
This paper describes a proven process for developing, optimizing, and managing effective maintenance programs for new and in-service assets based on risk and cost-benefit principles. The process calls for utilizing operational and maintenance experience as long as the experience is documented for the proper class of assets in the form of standard tasks. In the absence of standard tasks, a more comprehensive analysis is performed using Reliability-Centered Maintenance (RCM2) or Failure Modes and Effects Analysis (FMEA) to develop an optimum program. Asset performance data is used to continually adjust the maintenance program to meet user objectives.
1.0 Introduction
A maintenance program is effective when it targets critical production equipment and emphasizes minimizing risk, which leads to improved reliability, availability, and resource utilization. This paper focuses on a process for developing effective asset maintenance programs or optimizing existing ones. The process is a component of the asset’s overall Life Cycle Management (LCM).
2.0 Maintenance Program Development/Optimization
This process consists of the following steps (refer to Figure 1):
1. Identifying business objectives.
2. Development of plant/asset technical model.
3. Condition assessment of installed assets.
4. Criticality and risk assessment.
5. Maintenance program development/review.
6. Loading of maintenance tasks to the CMMS system.
7. Maintenance spares strategy (not covered in this document).
These steps are considered in more detail in the following sections.
3.0 Business Objective
Business objectives are set at the corporate and plant levels. They reflect market conditions, shareholder expectations, and regulatory compliance. Objectives at this level
include production levels, product quality, safe operation policies and requirements, environmental integrity requirements, and operating cost targets.
Objectives are then translated into specific performance expectations for major assets. Measures at this level might include availability, asset utilization, efficiency, specific product qualities, Overall Equipment Effectiveness (OEE), cost per unit produced, etc. Target values are set by plant operating departments and approved by plant and corporate management.
Major asset or system performance expectations are further refined to the individual equipment level. Here target values for measures such as Mean Time Between Failure (MTBF), Mean Time To Repair (MTTR), availability, etc., are set and approved.
This process is repeated periodically, and the objectives are changed to reflect the company’s position regarding the main business drivers. Figure 2 identifies the steps involved in developing asset performance expectations.
Business objectives and performance expectations set the stage for defining equipment performance standards for high-risk equipment, for which RCM2 is the method used to develop or optimize the maintenance programs.
4.0 Plant Technical Model
The plant technical model (also known as the asset hierarchy) is composed of a hierarchy of systems and sub-systems that represent progressively increasing levels of detail in describing the asset. The model reflects how systems and sub-systems fit together, interrelate, and operate to provide the intended business function. As such, the hierarchy reflects both the structural and process flow characteristics of the plant/asset.
The model starts with the process flow diagram representing the overall operation of a plant. This level consists of the major plant production units, utility systems (such as electricity, water, steam, air, fuel, etc.), feed and raw material preparation facilities, final product storage, plant control systems and local area network(s), infrastructure, etc. The next level breaks down each unit into systems and sub-systems as depicted on the unit process flow diagram and P&IDs. Examples at this level include systems such as feed filtration, feed pressurization, feed heating, atmospheric fractionation, etc.
At progressively lower levels of the model, the breakdown of the plant becomes more detailed. At the end, the plant is reduced to a set of systems and sub-sub-systems and the equipment items that support each one of the systems or sub-sub-systems.
Control and protective systems are incorporated in the hierarchy at the appropriate levels. In the case where a control or protective system is dedicated to one system or sub-system, it should be set up as a sub-element of that system. In the case that a
control/protective system is controlling/protecting multiple systems, it should be set up as an element at the same level in the hierarchy.
Every hierarchy element, whether it is a system, sub-system, or an equipment item, has a clearly defined boundary. Boundary definitions are standardized for classes of system/equipment items.
The steps involved in developing a plant technical model are as follows (see Figure 3):
1. Collect technical information and drawings (PFDs, P&IDs, line diagrams, datasheets, O&M manuals, etc.).
2. Establish a standard for defining systems’ boundaries. See references 4 and 6 for details.
3. Develop plant technical hierarchy.
4. Define systems’ functions (optional).
5. Load hierarchy into the plant maintenance information system (CMMS).
5.0 Criticality and Risk Assessment
Criticality and risk assessment is a qualitative analysis of asset failure events and the ranking of those events according to their impact on the business goals of the company. The process consists of the following main activities (see Figure 4):
1. Establish criticality assessment criteria.
2. Define for each assessment criterion the failure consequences and their scores.
3. Collect equipment condition assessment records or generic failure frequencies.
4. Determine failure frequencies and their ratings.
5. Define criticality ranking scores.
6. Define criticality ranking rules.
7. Select systems and/or equipment for assessment.
8. Perform the analysis.
9. Rank systems/equipment by criticality.
10. Rank systems/equipment by risk.
These steps are considered in more detail in the following sections.
5.1 Assessment Criteria
The first step in the analysis is to use the organizational business objectives to define the criticality assessment criteria. The following are some suggested criticality assessment criteria.
• Health and Safety.
• Environmental Integrity.
• Throughput.
• Customer Service.
• Operating Cost.
Each criterion is given a maximum score to reflect the consequences and relative importance. In Table 1, the safety criterion is given a maximum score of twenty (20) while the operating cost criterion is given a maximum score of ten (10).
5.2 Failure Consequences
Failure consequences within each criterion are defined and given an evaluation score. Table 2 provides examples for the safety, throughput/downtime, product quality, and maintenance and operating cost criteria, with their associated consequences of failure and scores.
5.3 Failure Frequencies
Failure frequencies are defined based on systems and equipment performance. When defining failure frequencies, consideration is given to aspects such as:
• Operational failure history (where available).
• Generic reliability data.
• Equipment redundancy.
• Mode of equipment operation.
• Equipment stress variations, etc.
The frequency of failure score is used in the calculation of relative risk to determine how likely the failure of the assessed system or equipment item will impact an organization’s business. Table 3 shows a sample of frequency scores.
5.4 Criticality Ranks and Rules
The criticality rank number of a system or equipment is a function of the system’s or equipment’s impact on the business when the system or equipment fails, regardless of how often the failure occurs. For example, a set of criticality ranking numbers might range from 1 to 10. Criticality rank number 10 represents the highest rank while number 1 represents the lowest. Criticality ranking rules are defined to assist in assigning criticality ranks to systems or equipment during the analysis. The rules are established by considering the combined consequence scores for all assessment criteria. For example, a rule can be defined as “Assign criticality of 10 to a system/equipment, if any of safety or environmental consequence scores are greater than 18, or any of throughput, product quality or maintenance and operating cost consequence scores are equal to 10”, and so forth.
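The rule-based assignment described above can be sketched in Python. This is a minimal illustration, not the author's tool: only the rank 10 rule is taken from the text, while the rank 5 rule, the catch-all rank 1 rule, and the dictionary keys are hypothetical names invented for the example. The sketch returns the highest rank whose rule matches.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, int]
Rule = Tuple[int, Callable[[Scores], bool]]


def rule_rank_10(s: Scores) -> bool:
    # The example rule quoted in the text.
    return (
        s["safety"] > 18
        or s["environmental"] > 18
        or 10 in (s["throughput"], s["quality"], s["cost"])
    )


def rule_rank_5(s: Scores) -> bool:
    # Hypothetical mid-level rule, invented for illustration only.
    return s["safety"] >= 14 or s["throughput"] >= 6


RULES: List[Rule] = [
    (10, rule_rank_10),
    (5, rule_rank_5),
    (1, lambda s: True),  # hypothetical catch-all so every item gets a rank
]


def assign_criticality(scores: Scores, rules: List[Rule] = RULES) -> int:
    """Return the highest criticality rank whose rule the scores match."""
    return max(rank for rank, matches in rules if matches(scores))
```

For instance, an item whose worst safety consequence scores 20 receives rank 10 even though it also matches the lower rules.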
The equipment criticality rank numbers, number range, and the rules for assigning the numbers to systems or equipment under assessment are defined before conducting the analysis. Criticality rank numbers are assigned to systems and/or equipment based on the rules developed. This is accomplished by comparing the equipment’s criteria consequence scores to the criticality rank number’s rules. If the equipment matches the rules, the equipment is assigned that criticality rank number. The equipment is always assigned the highest criticality rank number it matches.
5.5 Criticality and Risk Assessment
The assessment starts by analyzing the selected system and/or equipment failure consequences. The most serious failure consequence in each defined consequence criterion is identified and its score recorded. System and equipment failure consequences are analyzed in terms of the resultant effects on the asset as a whole, considering the impact of the failure on the safety of personnel and on the asset’s commercial performance. The latter requires consideration of both direct and indirect failure costs.
The analysis is conducted by answering a series of questions about each system or equipment item. These questions assess both the consequence of system or equipment failure and the frequency/probability of failure with respect to the assessment criteria. The criticality number and relative risk are calculated during the assessment from responses to the questions. Questions are formulated in the following form: “If the system/equipment fails, could it result in a safety consequence? If yes, how serious should the potential consequence be rated?”
5.6 Results of Criticality and Risk Assessment
5.6.1 Outcome of the Assessment
Criticality and risk assessment produces the following results:
1. Systems/equipment criticality ranks.
2. Relative risk.
3. Total consequence scores.
4. Individual system/equipment scores.
5.6.2 Relative Risk
The probability of failure is used in combination with the total failure consequence of a system/equipment to determine its relative risk (RR) value. CARA uses the concept of relative risk to identify the systems/equipment that have the greatest potential impact on the business goals of the company. The RR of a system or equipment item is the product of its Total Consequence Score (TC) and its Frequency/Probability (F/P) number. It is called “relative risk” because it only has meaning relative to the other equipment evaluated by the same method. The Total Consequence (TC) is the sum of the scores assigned to each of the criteria: Safety (S), Environmental (E), Quality (Q), Throughput (T), Customer Service (CS) and Operating Cost (OC).

TC = S + E + Q + T + CS + OC
RR = TC * F/P

Here F/P denotes the single frequency/probability score, not a quotient.
6.0 Maintenance Tasks Development/Optimization (MTD/O)
The MTD/O process described in this paper establishes a structured framework for developing or assessing maintenance programs for in-service or newly commissioned assets. The process emphasizes the use of operation and maintenance experience documented in the form of standard maintenance tasks (SMTs).
6.1 Maintenance Tasks Development/Optimization (MTD/O) Overview
The flowchart in Figure 5 describes the steps involved in carrying out the MTD/O process. The steps involved in the development/optimization of maintenance tasks are as follows:
1. A system is identified for review by selecting an element from the plant technical hierarchy. As described earlier, the selected system boundary should be clearly defined. The selected system includes all lower level elements.
2. A risk analysis is performed per Section 5.0 of this paper. If an analysis was conducted in the past, the failure frequencies are reviewed in light of the current condition of the system/equipment items, and the frequency scores are changed as necessary. The system/equipment items selected are then ranked by their risk ranking.
3. In the case that the system under review belongs to an equipment class group that has a Standard Maintenance Task (SMT) documented, it is only necessary to verify for low risk systems/equipment that any specific company, standards, and regulation
requirements are applicable and simple service activities are adequate and cost efficient. For high and medium risk systems/equipment, verification of all SMT elements is required.
4. When an applicable SMT is not available, a more detailed analysis is required for high and medium risk systems/equipment. For high risk items, a complete RCM2 analysis is recommended, while for medium risk items, RCM2 (FMEA) is sufficient to develop/optimize the maintenance program. The outcome of RCM2 or RCM2 (FMEA) is a set of proposed tasks, their frequencies, and the crafts and skill levels of individuals performing the work, or recommended actions in case suitable routine tasks cannot be found.
5. For low risk items not governed by any company, standard, or governmental requirements, a run-to-failure strategy is adopted. When requirements exist, routine tasks are developed and incorporated into work packages.
6. From the output of RCM2 or RCM2 (FMEA), detailed routine task descriptions are developed and then incorporated into work packages.
7. SMTs are developed to reduce task development time and effort, and to ensure consistency when dealing with equipment from the same equipment group. Developed SMTs are kept in a library for future reference. Routine updates are made to SMTs to reflect the current condition of equipment, gained maintenance and operating experience, and any new changes/modifications to systems and equipment.
8. The final step in the analysis is to upload the developed work packages into Plant Reliability Information Management Systems (PRIMS). PRIMS include maintenance systems such as MAXIMO, SAP Plant Maintenance, Document Management Systems, Inspection Systems, etc.
9. Monitoring developed/optimized maintenance programs is essential to ensure their effectiveness in meeting the objectives set by the organization.
An established method for recording failure modes, failure effects, and failure causes as well as the corrective actions taken to eliminate/reduce the failure effects is critical to the successful implementation of any maintenance program.
6.2 Standard Maintenance Task (SMT)
An SMT is a set of maintenance activities that constitutes a technically feasible and cost-effective maintenance strategy for a defined equipment group. An equipment group is a set of equipment of the same class that functions in an identical operating context. An equipment group has similar design, failure modes, and failure frequencies. Establishing a library of SMTs ensures consistent documentation of maintenance strategies, reduces the effort for developing maintenance programs for new systems,
ensures the application of uniform, consistent and cost-effective maintenance activities, and facilitates analysis of equipment groups. It is recommended to include the following information when documenting a standard maintenance task:
1. Applicable company requirements.
2. Applicable governing standards.
3. Governmental requirements/regulations.
4. Completed RCM2 analysis.
5. Description of equipment boundary and proper reference to drawings/isometrics.
6. Description of operating context (operational and environmental).
7. Assumptions/requirements for/from risk assessment.
8. Dominating failure modes with approximate probability.
9. The maintenance activities selected to reduce the probability that the identified failure mechanisms cause failure, along with the proper intervals (time-based or performance/condition-based).
10. All equipment monitored parameters (RCM2) with their sensitivity to faults/failures.
11. Established performance indicators.
12. Experience from using a known maintenance strategy along with periodic monitoring of established performance indicators.
13. For non-evident failure modes, the tests/inspections required to determine equipment expected availability.
14. Required experience and competency of maintenance personnel.
15. Estimated person-hours for maintenance activities.
16. Estimated repair time.
17. Essential spare parts, tools, equipment, and lead times.
The extent of documentation depends on the complexity and the risk assigned to the assets under review. For low risk assets, it is only required to document items one to three above and an assessment of whether simple service activities are adequate and cost effective. For high and medium risk assets, it is recommended that the SMT document all of the listed items.
6.3 Condition Monitoring
The MTD/O review may determine that the best maintenance strategy is to perform “on condition maintenance.” Equipment condition is determined by monitoring operational and non-operational parameters sensitive to failure modes. Since not all parameters are effective in detecting failure modes, a formal analysis is needed to select the right corroborative set of parameters. The analysis must identify the failure-sensitive parameters and how practical they are to monitor.
After establishing the technical feasibility of condition monitoring, the economic viability must be considered. The costs associated with the operation and ongoing support of the condition-monitoring program must be weighed against the potential cost savings and the cost of alternative maintenance strategies.
6.4 Monitoring Maintenance Program Effectiveness
Monitoring the effectiveness of the developed maintenance programs is accomplished by tracking and trending a set of key performance indicators. These indicators are established during the asset condition assessment phase. Progress reports are produced periodically. Modifications to maintenance tasks are made when necessary.
7.0 Application
This process was introduced and implemented at several plants in North America. Asset condition assessment studies were conducted and baselines established for each facility. The studies helped in developing the frequency score tables and provided points of reference for future analysis to assess the effectiveness of the devised maintenance programs. Areas of assessment included the following:
• Mean time between failures.
• Downtime due to unscheduled maintenance.
• Downtime for scheduled maintenance.
• Asset downtime due to failures of utilities, upstream, and downstream production assets.
• Slowdowns due to equipment failures.
• Slowdowns due to utilities, upstream and downstream failures.
• Quality problems due to equipment failures.
• Maintenance cost.
• Increased operating cost due to equipment failures.
• Safety incidents due to equipment failures.
• Environmental releases and damages due to equipment failures.
• Spares consumption.
• Survey of existing PM and PdM tasks.
Operational downtime and slowdown data were collected but not used for this analysis. The impact of adopting this process on asset performance and maintenance organizations is summarized in Table 4.
[Figure 1 flowchart. Box labels: Start; Develop Plant/Asset Technical Model; Perform Criticality & Risk Assessment; decision "New/Existing Plant/Asset?" (Existing: Assess Plant/Asset Condition; New: continue); Develop/Optimize MP; Develop/Optimize Spares Strategy; Modify/Load MP To PRIMS; Monitor MP Effectiveness; End.]
MP: Maintenance Program. PRIMS: Plant Reliability Information Management Systems.
Figure 1: Maintenance Program Development Process.
[Figure 2 flowchart. Box labels: Start; Corporate Objectives; Plant Objectives; Major Assets/Systems Performance Expectations; Equipment Item Performance Expectations; End.]
Figure 2: Setting Performance Expectations.
[Figure 3 flowchart. Box labels: Start; Collect Plant/Asset Technical Data; Establish Boundary Definition Standards; Develop Plant/Asset Technical Model; Describe Systems’ Functions; Load Plant/Asset Model & Equipment To PRIMS; End.]
PRIMS: Plant Reliability Information Management Systems.
Figure 3: Plant Technical Model Development.
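The hierarchy described in Section 4.0 can be sketched as a simple tree. This is an illustrative data structure only, not the layout of any particular CMMS; the class name `Element`, its fields, and the equipment tags are hypothetical, while the system names are taken from the examples in the text. A dedicated protective system is placed as a sub-element of the system it protects, and selecting an element yields all of its lower level elements.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Element:
    name: str
    kind: str            # e.g. "plant", "unit", "system", "equipment"
    boundary: str = ""   # free-text boundary definition
    children: List["Element"] = field(default_factory=list)

    def add(self, child: "Element") -> "Element":
        """Attach a lower level element and return it for chaining."""
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Element"]:
        """Yield this element and every element below it, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Hypothetical example hierarchy (equipment tags are invented).
plant = Element("Plant", "plant")
unit = plant.add(Element("Production unit", "unit"))
filtration = unit.add(Element("Feed filtration", "system"))
filtration.add(Element("Filter F-1", "equipment"))
pressurization = unit.add(Element("Feed pressurization", "system"))
pressurization.add(Element("Pump P-1", "equipment"))
# Protective system dedicated to one system: a sub-element of that system.
pressurization.add(Element("Pump P-1 trip system", "protective system"))
```

Calling `pressurization.walk()` returns the system together with its pump and trip system, which mirrors the statement that a selected system includes all lower level elements.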
[Figure 4 flowchart. Box labels: Start; Business Objectives; Develop Plant/Asset Technical Model; Assess Plant/Asset Condition; Generic Failure Data; Establish Assessment Criteria; Define Failure Consequences & Their Ratings; Determine Failure Frequencies & Their Ratings; Define Criticality Ranking Table; Define Criticality Assignment Rules; Select System; Perform the Analysis; Assign Criticality & Risk Ranks To System(s)/Equipment; decision "More Systems/Equipment?" (Yes: cycle through systems/equipment list; No: End).]
Figure 4: Criticality and Risk Assessment.
[Figure 5 flowchart. Box labels: Start; Select A System/Equipment; Perform Risk Assessment; decision "Risk?" (Low, Medium, High); decision "Regulatory Requirements?" (Yes/No); Planned Corrective Repair (Run-to-Failure); decision "Standard Maintenance Task Exist?" (Yes/No); Select Proper SMT From Library; Perform RCM2 Analysis; Perform RCM2 (FMEA); Routine Activities, Frequencies, Required Resources; decision "Relevant as SMT?" (Yes/No); Establish SMT; Add SMT to Library; Write Detailed Work Instructions; Determine Work Packages; Verify Work Packages; Load Work Packages To PRIMS; decision "More Systems/Equipment?" (Yes/No); End.]
Figure 5: Maintenance Tasks Development/Optimization.
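The selection logic of Section 6.1 (steps 3 to 5) can be sketched as a small decision function. This is one reading of the text, checking for an existing SMT first and then branching on risk; the function name, the enum, and the returned strings are hypothetical, and the flowchart in Figure 5 may order the decisions differently.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def select_approach(risk: Risk, smt_exists: bool, has_requirements: bool) -> str:
    """Pick the task development approach for one system/equipment item."""
    if smt_exists:
        # Step 3: an SMT is documented for the equipment class group.
        if risk is Risk.LOW:
            return "verify requirements and simple service activities"
        return "verify all SMT elements"
    # Step 4: no applicable SMT, so a detailed analysis is needed.
    if risk is Risk.HIGH:
        return "complete RCM2 analysis"
    if risk is Risk.MEDIUM:
        return "RCM2 (FMEA)"
    # Step 5: low risk items.
    if has_requirements:
        return "develop routine tasks"
    return "run-to-failure"
```

For example, `select_approach(Risk.LOW, smt_exists=False, has_requirements=False)` returns the run-to-failure strategy.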
Criterion                  Score
Health and Safety          20
Environmental Integrity    20
Production Throughput      10
Operating Cost             10
Table 1: Assessment Criteria Scores.
Score  Consequence

Safety
20     Fatalities.
18     Disabling injury.
14     Serious injury.
6      Minor or first aid injury.
0      No injury.

Throughput/Downtime
10     Production downtime equal to or greater than 7 days.
9      Production downtime from 3 to 7 days.
8      Production downtime from 1 to 3 days.
7      One day production downtime.
6      Production throughput at 25% of capacity.
4      Production throughput at 50% of capacity.
2      Production throughput at 75% of capacity.
0      No impact on throughput.

Product Quality
10     Unacceptable quality resulting in TOTAL product loss.
5      Unacceptable quality resulting in TOTAL product rework.
0      No effect on product quality.

Maintenance and Operating Cost
10     Incurred cost >$400K.
8      Incurred cost >$100K and <$400K.
6      Incurred cost >$50K and <$100K.
4      Incurred cost >$10K and <$50K.
2      Incurred cost >$1K and <$10K.
1      Incurred cost <$1K.
Table 2: Safety Criterion Consequence Table.
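The maintenance and operating cost bands in Table 2 can be expressed as a lookup. This is a sketch under two assumptions: the top band is read as costs greater than $400K (the score of 10 sits above the $100K to $400K band), and a cost exactly on a boundary falls into the lower band, since the table does not say where boundary values belong. The names `COST_BANDS` and `cost_score` are hypothetical.

```python
# (lower bound in dollars, score), highest band first.
COST_BANDS = [
    (400_000, 10),
    (100_000, 8),
    (50_000, 6),
    (10_000, 4),
    (1_000, 2),
]


def cost_score(incurred_cost: float) -> int:
    """Map an incurred cost to its consequence score per Table 2."""
    for lower_bound, score in COST_BANDS:
        if incurred_cost > lower_bound:
            return score
    return 1  # anything at or below $1K
```

A failure costing $250K would therefore score 8 on this criterion.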
Failure Frequency                                          Score
Failures occur daily                                       10
Failures occur weekly                                      9
Failures occur monthly                                     8
Failures occur between one month and one year intervals    7
Failures occur yearly                                      6
Failures occur between 1 and 5 years                       5
Failures occur between 5 and 10 years                      4
Failures occur less frequently than once in 10 years       1
Table 3: Failure Frequency Scores.
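Combining consequence scores with a frequency score from Table 3 gives the relative risk of Section 5.6.2, where TC = S + E + Q + T + CS + OC and RR is TC multiplied by the frequency/probability score. The sketch below is illustrative only: the class and field names are hypothetical, and the three equipment items and their scores are invented to show the arithmetic. Customer service scores are set to 0 because the paper gives no table for that criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Assessment:
    name: str
    safety: int
    environmental: int
    quality: int
    throughput: int
    customer_service: int
    operating_cost: int
    frequency: int  # frequency/probability score, e.g. from Table 3

    @property
    def total_consequence(self) -> int:
        # TC = S + E + Q + T + CS + OC
        return (self.safety + self.environmental + self.quality
                + self.throughput + self.customer_service
                + self.operating_cost)

    @property
    def relative_risk(self) -> int:
        # RR = TC multiplied by the frequency/probability score
        return self.total_consequence * self.frequency


def rank_by_risk(items: List[Assessment]) -> List[Assessment]:
    """Sort items from highest to lowest relative risk."""
    return sorted(items, key=lambda a: a.relative_risk, reverse=True)


# Hypothetical items and scores, for illustration only.
pump = Assessment("Pump A", 14, 0, 5, 8, 0, 4, frequency=7)
compressor = Assessment("Compressor B", 18, 0, 0, 10, 0, 8, frequency=4)
fan = Assessment("Cooling fan C", 0, 0, 0, 2, 0, 1, frequency=8)
ranking = rank_by_risk([fan, compressor, pump])
```

In this invented data set the compressor has the larger total consequence (36 against 31), yet the pump ranks first on relative risk (217 against 144) because it fails more often, which is why criticality and risk are ranked separately.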
           Availability (%) (1)   Downtime (%) (2)   RAV (3)   Product Quality Rejects (%) (4)
Before
Plant 1    88                     8                  4.1       6
Plant 2    89                     7                  3.5       6
Plant 3    92                     5                  3.1       4
Plant 4    93                     4                  2.5       2
After
Plant 1    92                     4                  3.25      4.5
Plant 2    91.5                   4.5                2.85      4.2
Plant 3    94.5                   2.5                2.4       2.4
Plant 4    94.5                   2.5                2.1       1.3

(1) Availability ([operating time - all downtimes including slowdowns]*100/operating time).
(2) Planned and unplanned downtime for maintenance (excluding TA).
(3) Percent of maintenance cost to asset replacement value.
(4) Percent reject due to equipment failure (includes startup and shutdown of spec products).
Table 4: Implementation Results.
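The availability measure defined in footnote 1 of Table 4 can be sketched as follows, together with the availability change per plant. The function names are hypothetical, and the before/after values are copied from Table 4 on the assumption that the first block of plant rows is the "Before" case and the second is the "After" case.

```python
from typing import Dict


def availability_pct(operating_time: float, total_downtime: float) -> float:
    """Availability per footnote 1; total_downtime includes slowdowns."""
    if operating_time <= 0:
        raise ValueError("operating_time must be positive")
    return (operating_time - total_downtime) * 100.0 / operating_time


# Availability (%) from Table 4 (assumed Before/After assignment).
BEFORE = {"Plant 1": 88.0, "Plant 2": 89.0, "Plant 3": 92.0, "Plant 4": 93.0}
AFTER = {"Plant 1": 92.0, "Plant 2": 91.5, "Plant 3": 94.5, "Plant 4": 94.5}


def availability_gain(before: Dict[str, float],
                      after: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point change in availability for each plant."""
    return {plant: after[plant] - before[plant] for plant in before}
```

As a usage example with made-up hours, 1,000 operating hours with 120 hours of downtime and slowdowns gives an availability of 88%.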
Appendix A: Definitions
Asset: May refer to a plant, system, or a piece of equipment.
Failure Mechanism: Physical, chemical, or other processes which lead or have led to failure.
Maintenance Program: A comprehensive set of maintenance activities, their intervals, and required resources along with the performed maintenance analysis documentation.
Maintenance Strategy: The means by which equipment is maintained. The maintenance strategy can be of four main types: run-to-failure, preventive, predictive (on condition maintenance), or redesign (of the equipment).
Standard Maintenance Task (SMT): A set of cost-effective maintenance actions for an equipment class group.
Equipment Group: A set of equipment of the same class that functions in an identical operating context.
Appendix B: References
1. AIChE/CCPS, Guidelines for Process Equipment Reliability Data, Center for Chemical Process Safety, American Institute of Chemical Engineers, New York, 1989.
2. Blanchard, Benjamin S., Logistics Engineering and Management, Prentice Hall, Inc., 1998.
3. EXP Training Documentation, IVARA Corporation, 2002.
4. Moubray, John, Reliability-Centered Maintenance (RCM II), 2nd Edition, Industrial Press, 1997.
5. ISO 14224, “Petroleum and Natural Gas Industries – Collection and Exchange of Reliability and Maintenance Data for Equipment,” International Organization for Standardization, First Edition, 1999.
6. Norsok Standard, “Criticality Analysis for Maintenance Purposes,” Z-008, Rev. 2, November 2001.
7. OREDA-97, Offshore Reliability Data, Det Norske Veritas, P.O. Box 300, N-1322 Hovik, Norway, 3rd Edition, 1997.
8. Seifeddine, Sammy, “Criticality and Risk Assessment,” HSB Reliability Technologies, Project Document, 2000.