
Introduction
Recognizing the critical nature of file server backup and recovery procedures, as well as the lack of an institutional standard of expectation for LAN administrators, the Institutional Computing Standards Committee formed the LAN Backup and Disaster Recovery Subcommittee to develop a common set of guidelines and expectations concerning this area of LAN administration. The activity of the group has centered on the collection of information about LAN backup among the various departments, and the development of an internal JHMCIS system capable of backing up and restoring large numbers of file servers. In addition, this report makes recommendations concerning LAN backup and recovery for LANs not covered by an "enterprise" system.
Scope
LAN Backup and Disaster Recovery for the purposes of this report concerns the potential loss of a file server and/or the media used for backup in any single location. Primarily, this means a hardware or software failure, not loss of a facility. In order to provide the capability to restore the data, applications and functionality of a file server, this report provides standards for backup frequency and rotation, as well as off site storage and recording of file server configuration information. In addition, it addresses some of the special issues involved in data restoration and techniques used to ensure that complete file server recovery, including NDS and Domain information, is achievable. This report does not address the larger issue of the loss of a file server’s environment, although some of the recommendations, if followed, may make a relatively minor loss, such as a fire or flood in a file server room, easier to recover.
Larger disasters, such as the loss of a building, along with file servers, multiple workstations and network connectivity hardware, are not in the scope of this standard.
Survey of Information
According to recent estimates, the Institution possesses perhaps 300 file servers attached to the network backbone, and perhaps 50 others attached to local LANs. A survey was conducted to determine the software and hardware in use by LAN administrators and the level of preparedness among LAN administrators and those acting in that capacity for their departments.
During a survey of non-JHMCIS file servers, several trends became apparent. In those departments with at least 1 full time LAN support staff member, backups were more likely to occur regularly. The survey found that 80% of such LANs had regular backups being conducted. Twenty-five percent had conducted test restores sometime in the past year. Fifty percent had at least a rudimentary form of off-site storage, usually a tape taken home by the LAN administrator.
Of those departments with smaller LANs or no full time staff, file server backups were only occurring in 45% of the cases. Five percent had conducted a test restore. Ten percent were conducting some type of off-site storage. On the positive side, six of the file servers that were not being backed up regularly are now covered under the JHMCIS Enterprise Backup system.
Most departments had file servers servicing under 50 workstations. The majority of these systems were backed up using 4mm DAT systems, with a variety of software attached to servers and workstations. Next in order of predominance were DC2000 type systems attached to workstations. 8mm and DLT technologies were employed in some of the larger or better funded departments and groups.
JHMCIS Efforts
A summary has been generated on the internal efforts of JHMCIS to develop an enterprise backup and recovery system capable of protecting up to 40 file servers and 50GB of data, along with special considerations for databases, configuration recovery and server monitoring. At this time, 75% of the goals of this project have been met. An updated copy of this summary is included with this report. JHMCIS is also able to provide a hot spare server and backup devices in an emergency for those departments that pre-register through the ICSC. The current configuration of this system is a NetWare 4.1 file server with 4GB of storage, 64MB of RAM, a 15/30GB DLT drive, and 4mm DAT and 8mm tape drives. It is configured with Seagate Backup Exec software. The device is used for test restores and is available on a temporary basis to provide a platform for emergency restoration of data on a first come, first served basis. The offer of tape storage services has been made to ICSC members; however, interest in this service has been minimal.
JHMCIS will continue to develop its internal capabilities in order to offer LAN backup services for departments with servers managed by JHMCIS, for its own internal file servers, those used for capital projects and to ensure progress on disaster recovery standards in the institution.
Guidelines and Standards Proposals
In order to ensure that file servers are able to be restored to service in a reasonable amount of time, the following guidelines are proposed as standards for departments to follow when considering the purchase of backup devices and developing procedures for staff members responsible for file servers.
Level of Expectation
It is the goal of this report to set a standard level of expectation that a department can restore the functionality of a file server within 24 hours of loss of functionality due to hardware failure, data corruption or loss of data. However, recovery times may be modified downward based on the criticality of applications and data residing on the file server. For example, some systems may require additional fault tolerance and high availability features that allow the file server to recover from many hardware failures without loss of functionality. These factors must be carefully reviewed during the planning phases of system design and periodically reviewed as needs change. Exceptions to this standard must be fully explained and justified by departmental management.
Area of Responsibility
Departments must ensure that someone is assigned the responsibility for backing up and restoring file servers. Whether internal staff, another department or outside contractors, the need to allocate human resources to ensure adequate backup and restore capabilities is a requirement. A backup person must be available and sufficiently trained in the event that the primary person is not available. Documentation of backup and recovery procedures must be made available to the backup person.
Backup Frequency
File servers must be backed up daily. Full backups must be done weekly. Daily backups should consist of incremental or differential backups. These tend to result in shorter backup times and reduce traffic on the network. Special cases may exist where entire databases or servers need more frequent backup, or where a special purpose system, such as a message server or gateway, may be restored to functionality faster without restoring from tape. In such cases, it may be more useful to have a hot spare system and to document the procedures and time required to reconfigure a new system. In these special cases, it is important to note the reasons and contingency plans in case of deviation from the standard. Departments should seek advice from JHMCIS to ensure that backup decisions are properly supported.
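The daily and weekly schedule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard does not name a day for the weekly full backup, so Friday is assumed here; the function names are hypothetical; and each backup job is assumed to occupy one tape. The second function shows one practical consequence of choosing incremental backups: a restore needs the last full backup plus every incremental made since then.

```python
from datetime import date

# Assumption: the weekly full backup runs on Friday. The standard only
# requires that a full backup be done weekly.
FULL_BACKUP_WEEKDAY = 4  # date.weekday(): Monday is 0, so Friday is 4


def backup_type_for(day: date, full_weekday: int = FULL_BACKUP_WEEKDAY) -> str:
    """Return 'full' on the weekly full-backup day, otherwise 'incremental'."""
    return "full" if day.weekday() == full_weekday else "incremental"


def tapes_needed_for_restore(day: date, full_weekday: int = FULL_BACKUP_WEEKDAY) -> int:
    """Count the tapes needed to restore the server as of `day`.

    Assumes one tape per backup job and an incremental scheme: the last
    full backup plus one incremental for each day since that full backup.
    """
    days_since_full = (day.weekday() - full_weekday) % 7
    return 1 + days_since_full
```

Under a differential scheme the restore would need only the last full backup and the most recent differential, at the cost of longer daily backups later in the week.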
Scope of Backup
File servers must be backed up with software that effectively backs up and restores all data and applications residing on the server. In addition, file servers must be backed up with software that can effectively back up and restore bindery, domain and NDS information, as well as NDS schema and Windows NT registry information. Software should be able to effectively back up open files and have provision for special database issues such as SQL, Oracle, Btrieve and other "live" databases.
It is recommended that the LAN Administrator not be tasked with the responsibility of backing up and protecting individual workstations at this time. The load on network traffic and the increased level of responsibility do not appear warranted for an environment that has much work to do in protecting its file server resources. Individual workstation backup should be accomplished using locally attached devices, and the end user should be familiar with the operation and maintenance of such equipment. End users must be provided with, and encouraged to utilize, private and shared data storage areas on file servers. This will help minimize the need for workstation backup.
Software
Software selection is dependent on several factors. Price, features and ease of use are all important factors to consider, as well as the mode of backup (server or workstation based). For purposes of this report, departments must select software that has been proven to effectively back up and restore file server data, file systems and the special configuration information particular to the network operating system, including binderies, NDS, schema, domain database and registries.
While an institutional software standard is not proposed at this time, JHMCIS recommends the use of Seagate Backup Exec for its ease of use, reliability of restoration, reliability across the WAN and ease of installation. After a long period of testing, JHMCIS is also in a position to be a source of technical support information on this product’s application in the Hopkins environment. This recommendation extends to server based backups, especially those with multiple file servers. JHMCIS encourages continuing research and exchange of information on the various software products in use and being developed, especially for special considerations such as database backup. JHMCIS maintains its hot spare system with Windows NT and NetWare versions of the Backup Exec software. In addition, JHMCIS’s 3 backup systems (Enterprise, Public and Bayview Alpha Commons) have standardized on this software. As software products develop, the systems JHMCIS uses for backup and restore are expected to evolve and change.
Hardware and Configuration
For those departments with the need to back up large amounts of data, server based DLT backup is the current preferred choice. Up to 30GB of data can be backed up to a single tape with this technology. The systems are generally high performance and are more reliable than 8mm and 4mm technology. Current prices for these drives are about $3000.
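As a rough planning aid, the number of tapes a full backup will occupy follows directly from the tape capacity. The sketch below is an illustration only; the function name is hypothetical. The 30GB default is the "up to" figure quoted above, and the 15GB figure in the usage note assumes that the "15/30GB" drive rating mentioned earlier in this report refers to capacity without and with compression.

```python
import math


def tapes_required(data_gb: float, tape_capacity_gb: float = 30.0) -> int:
    """Return the number of tapes needed to hold data_gb of data.

    The default capacity is the best-case 30GB figure; data that does not
    compress well will need a lower capacity value and therefore more tapes.
    """
    if data_gb < 0 or tape_capacity_gb <= 0:
        raise ValueError("data size must be >= 0 and tape capacity must be > 0")
    return math.ceil(data_gb / tape_capacity_gb)
```

For example, 50GB of data fits on 2 tapes at 30GB per tape, but needs 4 tapes if only 15GB per tape can be relied on.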
Server based backups are recommended for those situations where high performance or multiple systems need to be protected. For those file servers where smaller amounts of data need to be backed up, workstation based backups offer the least expensive, easiest to use alternative. Most 4mm, Travan and DC2000 type systems can be attached to a workstation on a LAN and coupled with inexpensive software that effectively backs up and restores most file server configurations. Where a workstation is not located on the same hub or behind the same router as the file server it protects, issues of performance and network traffic may require that a server based backup be conducted.
Fault tolerance is any technology that employs redundancy or automatic recovery in the event of a hardware or system failure. Fault tolerance gives a file server the ability to recover from internal failures with little or no interruption in service. Repairs can then be effected during off hours. Fault tolerance must be considered a part of any file server system. For standard file servers with user data, email and other constantly changing information, at least drive mirroring must be employed. In cases where high availability is required, other fault tolerance technology should be considered and deployed.
Network Issues
For those systems that protect resources across the network, especially across routers, it is important to understand traffic patterns during proposed backup times. Tests should be conducted to record the extra load placed on routers and network segments before putting these procedures into production. Backups across the network should not be conducted during times of heavier traffic. As a general recommendation, backups should be scheduled after 7pm and before 6am. Assistance on network utilization issues is available from JHMCIS.
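The recommended window runs past midnight, which is easy to get wrong when a schedule is checked by a script. The following sketch shows one way to test whether a proposed start time falls inside the window. The function and constant names are hypothetical, and treating exactly 7pm as inside the window and exactly 6am as outside it is an assumption, not part of the recommendation.

```python
from datetime import time

# General recommendation from this section: after 7pm and before 6am.
WINDOW_START = time(19, 0)
WINDOW_END = time(6, 0)


def in_backup_window(t: time, start: time = WINDOW_START, end: time = WINDOW_END) -> bool:
    """Return True if t falls inside the window, which may wrap past midnight."""
    if start <= end:
        # Window within a single day, e.g. 01:00 to 05:00.
        return start <= t < end
    # Window wraps past midnight, e.g. 19:00 to 06:00.
    return t >= start or t < end
```

A job proposed for 11pm or 2:30am passes this check; one proposed for noon does not.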
Tape Retention and Off-Site Storage
Tapes should be made available to meet the needs of the department and the institution. Legal, regulatory and governmental standards for tape retention should determine the length of time yearly and monthly backups are retained. For example, the IRS requires that financial data be retained for 7 years. JCAHO and governmental policies regulate the retention of patient data. If tape backup media is the only source of information available, these policies must be adhered to. In the absence of a prevailing legal or regulatory requirement, full end of month backups must be retained for at least 3 months. In addition, weekly backups must be retained for the previous month and daily tapes for the previous week.
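The default retention rule (applicable only where no legal or regulatory requirement prevails) can be expressed as a simple check before a tape is returned to the rotation. This is a minimal sketch: the names are hypothetical, and the day counts used for "previous week", "previous month" and "3 months" are approximations chosen for the example.

```python
from datetime import date, timedelta

# Assumed day counts for the default rule: a week, a month and three
# months are approximated as 7, 31 and 93 days respectively.
MIN_RETENTION_DAYS = {"daily": 7, "weekly": 31, "monthly": 93}


def may_reuse(kind: str, written: date, today: date) -> bool:
    """Return True once a tape of the given kind is past its minimum retention.

    This covers only the default rule; tapes subject to legal or regulatory
    retention requirements must be handled separately.
    """
    return today - written > timedelta(days=MIN_RETENTION_DAYS[kind])
```

A daily tape written on the 1st of the month is not yet reusable on the 5th, but is reusable on the 10th.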
Off-site storage of backup media must be part of tape retention. The tapes required to fully restore the server from its most recent backup should be stored off-site. Off-site storage provides the department with access to data in the event of fire, theft of equipment or flood. JHMCIS offers its facilities to those departments needing off-site storage on a regular basis. JHMCIS has established procedures for removal and retrieval of tape media that can assure compliance with this issue.
Removal of backup media to someone’s home or automobile is not a valid procedure for off-site tape rotation.
Server Monitoring
In addition to backup, file servers should be monitored using products such as Compaq Insight Manager, or at least the built in facilities of the operating system to determine the "health" of the system. Monitoring a file server can help diagnose problems before they result in loss of data and allow the LAN administrator to take preventive, proactive action. JHMCIS staff are available for basic information on how to monitor departmental file servers and advice on procedures.
Some companies, such as Compaq, provide pre-failure replacement of parts based on information provided by monitoring products. Analysis of predictive indicators of system and hard drive failures is necessary to ensure that these additional warranties can be exploited.
Server Configuration Information & Documentation
Every file server should have a record of its configuration. A variety of utility programs can be used to "survey" a file server for a printout of important configuration information that is not included in backups. In addition, it is necessary to retain licensing diskettes and installation media for the life of the file server in a safe, secure location. Many backup software packages include survey utilities and emergency restoration procedures that save configuration and recovery information to a set of diskettes. These should be used if the software provides the feature.
Documentation of the hardware brands, models and serial numbers for the backup system and file servers should be part of a written inventory available to management and the personnel responsible for backup. Sources of spare media and parts should be identified, as well as written information concerning warranties and procedures for obtaining warranty replacement.
Backup and restore procedures must be documented, as well as the estimated length of time to restore functionality in the event of various failures. Tests should be conducted and documented to restore various applications, user data, shared data and databases in order to determine the length of time required to restore functionality.
Test Restores
Periodically, file servers, backup media and tape devices should be tested to ensure that a restore of information results in a successful replacement of data. A test restore procedure should occur at least monthly. On production systems, the restore should be redirected to a different directory or volume and in some cases a different server. It is recommended that a test restore of the entire file server be conducted as well. Test restores should be documented to record information about problems encountered, techniques and procedures employed. The amount of time needed to effect a restore after the loss of a file server should be determined in advance by the LAN administrator. Test restores offer the LAN administrator benchmarks that can be recorded to determine recovery time for a variety of data loss situations.
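When a test restore is redirected to a different directory or volume, one way to confirm that the data was replaced successfully is to compare the restored files against the originals. The sketch below is an illustration, not a feature of any particular backup product: the function names are hypothetical, and the use of SHA-256 checksums is an assumption made for the example. Files that changed on the production volume after the backup was taken will be reported as mismatched, so results need to be read with the backup time in mind.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(source: Path, restored: Path) -> dict:
    """Compare every file under source with its counterpart under restored.

    Returns the relative paths of files that are missing from the restored
    copy and of files whose contents differ.
    """
    missing, mismatched = [], []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = restored / rel
        if not dst.is_file():
            missing.append(str(rel))
        elif file_digest(src) != file_digest(dst):
            mismatched.append(str(rel))
    return {"missing": missing, "mismatched": mismatched}
```

The two lists, together with the elapsed time of the restore, give the LAN administrator a concrete record to file with the test restore documentation.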
Hardware Replacement / Service Contracts
It is recommended that file servers be covered by an on-site manufacturer’s warranty that provides for replacement of failed components within the same day. If this is not possible, spare equipment should be available to effect same day replacement of server components. If the server is retained beyond the manufacturer’s warranty, it is recommended that additional replacement or extended coverage be put into effect to ensure that the file server or a replacement can be put back into production within 24 hours of a failure, with all data intact.
Summary
Each department is responsible for providing equipment, facilities and human resources for system backup and recovery, the most important file server issue. Each department can use this document to set the level of expectation for its staff to follow. Once equipped with the proper equipment and assigned the job function, it is the responsibility of the LAN administrator, or the person acting in that capacity, to ensure that backups are completed, monitored and tested for effectiveness. The goal of this standard is to ensure that file server functionality can be restored within 24 hours of a hardware failure, or loss or corruption of data.