Because of their highly visible role in the Information Technology hierarchy, Business Continuity professionals have a unique opportunity to help their Facilities Operations counterparts achieve long-term continuous operation. Facilities-related downtime events are increasingly rare, yet they often result in extended downtime because they can impact a significant portion of the operation.
A major interruption to data processing will cause much more than embarrassment for most organizations, particularly with today’s level of public visibility. Has your organization adequately prepared your Facilities team to minimize this risk? A carefully implemented strategy will significantly enhance your chance of success.
Over the last fifteen years, the industry has quantified the importance of devoting considerable attention to a Facilities Operations Plan. Even though electrical and cooling system designs are more robust than ever, typically allowing for system maintenance without a data center shutdown and often eliminating most single points of failure, these facility designs do not eliminate the risk of interruption.
Human error is consistently found to be the primary cause of facilities-related computer downtime events. As Robert McFarlane, principal and data center design expert at Shen Milsom and Wilke, stated in the June 2011 edition of SearchDataCenter.com: “Reputable studies have concluded that as much as 75% of downtime is the result of some sort of human error.”
Creating and implementing a thorough plan for Facilities Operations is the best means to apply this knowledge and achieve optimal systems performance. Based on experience gained through more than 150 data center facilities consulting engagements and in managing the start-up of a critical data center facility, the author recommends the following strategy.
The first step is the design and structure of the department that operates the electrical, cooling, and fire detection/suppression systems in the data center. For example, regardless of facility size, a minimum of two trained individuals per shift on a continuous shift schedule is required to effectively intervene when a generator or cooling system fails to start automatically. This number not only helps ensure safety, it also dramatically reduces the chance of error, as long as one person reads the steps of a procedure and the other repeats each step aloud before carrying it out. Whether your computer room is 5,000 square feet or 150,000 square feet, if it has redundant electrical and cooling systems, a well-trained continuous shift presence will enable you to avoid interruptions due to failed generator or cooling components. You will also recover much more quickly if an electrical system failure occurs.
Operating with two individuals or more on each shift also ensures each shift will be able to perform productive work, instead of simply serving as shift “watchmen.” Counter-intuitively, employing two individuals per shift will actually show a cost savings compared to a single-person-per-shift plan, due to the elimination of some contracted work. A sample organization chart for this level of support follows.
Annual objectives for this group should include collective goals for consistent facilities systems uptime and successful/safe completion of all assigned preventive maintenance (PM) tasks and customer requests. Individual objectives should vary by position, allowing ownership of specific systems, tasks, and projects to be clear. The Facilities team should report to the internal organization that will ensure it receives the best ongoing communications, funding, and support in order to meet the data center’s specific objectives. For many companies, this will mean the Information Technology organization; for some, the Finance department; for most others, the Real Estate group.
If your company can afford an interruption of computer operations due to facilities system failures roughly once a year, the industry average, you should be able to operate with a substantially smaller Facilities staff. Be aware that operating without two individuals will increase the risk of error (one person will easily miss a step when following a procedure), increase the risk of injury, delay response time to an event, and render the shift(s) with only one person less productive. Ultimately, you will spend more for the same amount of work for functions such as cable installation, which will have to be contracted or performed only on a shift where you have two individuals present. There are many organizations operating with less than continuous shift Facilities Operations coverage, fewer than the optimal number of procedures, and minimal training. They must be willing to absorb the impact of an interruption to data processing when these choices permit one to occur.
New Facility Hiring Schedule
Most owners fail to hire the Facilities team early enough in the design/construction process. Staff involvement in construction monitoring will pay off over the years you operate the new facility. For example, root cause analysis during a system failure will be greatly enhanced with detailed knowledge of system construction and configuration. This personal observation during construction can often make the difference between outage avoidance and the need to explain “what went wrong.” If you are planning a new facility, several of your Facilities team members should be hired in time to participate in factory witness testing of equipment, as well as systems commissioning once the equipment is installed at your new facility.
Procedures and Training
With a fully developed staff plan and a hiring schedule in place, your next objective should be defining site-specific procedures and training programs with detailed schedules. Just as airline pilots must be trained and certified on specific models of airplanes, data center facilities systems training must be customized to the unique systems configuration at each site. Many owners assume the general training provided by equipment manufacturers will enable the Facilities team to confidently operate infrastructure systems without error. Although critical facilities-experienced individuals should be sought as you make hiring decisions, they will need the benefit of procedures and training specific to the system configurations for which they will be responsible.
Depending on the complexity of your facilities systems configuration, the number of emergency response and system transfer procedures required will range from 50 to 200. If you have not contracted with your design engineer(s) or commissioning agent to develop these, an operations consultant should be engaged to develop the needed procedures, or to assist one of your staff with developing them, before the building is completed. Regardless of who is responsible for development, procedures should be “tested” individually with Facilities staff members for clarity before they are finalized. The process to create and test procedures is normally a three- to six-month endeavor.
A clear, concise and consistent procedure format should be employed; one which includes a means to “check off” each step as it is completed. One team member must read aloud the desired step and the second individual must repeat back what they are about to do before proceeding (as a pilot and co-pilot would). Failure to follow this simple process is the cause of an alarming number of downtime events.
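The read-aloud, repeat-back, and check-off discipline described above can be illustrated with a small sketch. This is only one possible way to model it, not a format prescribed by the author: the `Step` and `Procedure` classes, the `perform` method, and the case-insensitive comparison rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    text: str           # the instruction as written in the procedure
    done: bool = False  # the "check off" box for this step


@dataclass
class Procedure:
    title: str
    steps: List[Step] = field(default_factory=list)

    def next_step(self) -> Optional[Step]:
        """Return the first step not yet checked off, or None if all are done."""
        for step in self.steps:
            if not step.done:
                return step
        return None

    def perform(self, read_back: str) -> bool:
        """Check off the next step only if the read-back matches the step text.

        The reader announces next_step().text; the second person repeats it
        back. A mismatch leaves the step unchecked so it can be re-read.
        """
        step = self.next_step()
        if step is None:
            return False  # nothing left to perform
        # Assumed matching rule: ignore case and surrounding whitespace.
        if read_back.strip().lower() != step.text.strip().lower():
            return False
        step.done = True
        return True
```

In use, a step cannot be skipped: `perform` always compares against the earliest unchecked step, so repeating back a later step by mistake is rejected rather than silently accepted.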
Training programs for your Facilities team should include:
- Initial testing of new procedures as they are developed
- Systems overviews provided by design engineers
- Manufacturer-provided training on individual systems/components installed and participation in integrated systems commissioning (when constructing a new facility)
- Repetitive site-specific training (monthly sessions)
Your Facilities group should manage and dictate the schedule for each of these training programs. Testing of procedures should be spread evenly among your Facilities staff, who will each be working with the person responsible for creating and refining the documents. Systems overview training should be presented to your team as a group, so that “how all the pieces fit together” is understood first. Completed procedures will serve as additional content.
Monthly site-specific training sessions should be developed by your Facilities team with a focus on which emergencies, typically system failures, they wish to be most prepared for. Emergency response procedures and systems overview documents will be the basis for these sessions. Individual “system experts,” such as manufacturers’ installation technicians, design engineers, service providers, and some of your own staff members should serve as initial trainers.
This program will provide an annual chance for your Facilities team to simulate the desired response when an emergency occurs. Similarly, system transfer procedures will be the basis for another form of training, as planned preventive maintenance activities are conducted throughout the year. Confidence instilled through practice will pay off. Without repetitive training, your Facilities staff will be trying to “land the plane” from only the memory of the initial training provided when construction was completed.
Successful data center Facilities teams adhere to a rigid schedule for planned maintenance tasks. Your Facilities group should follow a work schedule generated by an automated program, which has been customized by your group for your specific site. Maintenance intervals as recommended by each infrastructure system manufacturer should be compared to actual experience in the field (the experience of your Facilities staff members and those from comparable facilities you have benchmarked with). The automated program will permit you to plan PM events which require change management approval well in advance. Individual work orders should be input with enough detail that they may be printed and followed step-by-step during the PM activity. The automated system will generate a late report to the department manager when the scheduled PM date has passed without closing out the work order.
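The late-report rule in the last sentence above is simple enough to sketch. The following is a minimal illustration, not a description of any particular maintenance management product: the `WorkOrder` fields and the `late_report` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class WorkOrder:
    order_id: str                  # hypothetical identifier format
    system: str                    # infrastructure system the PM applies to
    scheduled: date                # planned PM date
    closed: Optional[date] = None  # None means the work order is still open


def late_report(orders: List[WorkOrder], today: date) -> List[WorkOrder]:
    """Return open work orders whose scheduled date has passed, oldest first."""
    late = [o for o in orders if o.closed is None and o.scheduled < today]
    return sorted(late, key=lambda o: o.scheduled)
```

A work order scheduled for today is not yet reported as late under this sketch; whether the cutoff should be the scheduled date itself or a grace period afterward is a site policy decision.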
In addition to a well-implemented Facilities Operations strategy, every data center operation will benefit from carefully articulated control processes that apply to all who work in the facility. Business Continuity personnel should ensure their organizations have implemented the following policies:
- Specific data center work rules, which are thoroughly reviewed and signed by each individual before entering the facility the first time (and again annually)
- Limited access – minimize those permitted unescorted access
- Shipping/receiving only on a planned basis – unscheduled deliveries turned away
- Computer hardware installation planned in advance by a team of IT and Facilities individuals
- Power and network cabling connections performed only by designated and trained individuals
- Team development
  - Clearly defined IT and Facilities relationship, mutual expectations (or SLAs), shared incentives
  - Defined Data Center Facilities and Office Facilities relationship (if other buildings on campus)
A data center facility will operate successfully if the Facilities team is provided management support, appropriate resources and site-specific systems experience. Effectively deploying this Facilities Operations strategy, and the additional recommended control processes, will provide for a much higher reliability potential over the life of the facility. With these practices in place, you may realistically achieve multiple years of continuous facilities systems availability, which represents multimillion-dollar savings when compared to the average operating experience in the critical data center industry. Business Continuity can play a significant role in facilitating the successful implementation of these practices by bringing them to the attention of senior IT and Facilities management executives. These programs will only succeed with adequate funding and a full endorsement from the executive level.
About the Author
David Boston was Facilities Manager for GTE Data Services for 10 years and has assisted data center management teams for 17 years as an industry consortium director and consultant. Currently, he assists clients with staffing strategies and the development and testing of comprehensive training and procedures programs. He may be contacted at Brookfield Global Integrated Solutions (BGIS).