Appendix 77

IT RISK MANAGEMENT STANDARDS AND GUIDELINES
Area: IT Operations
(Appendix to Sec. 148 on Purpose and Scope, and IT Risk Management Systems)

1. INTRODUCTION

1.1. The evolving role IT plays in supporting the business function has become increasingly complex. IT operations – traditionally housed in a computer data center with user connections through terminals – have become more dynamic and include distributed environments, integrated applications, telecommunication options, internet connectivity, and an array of IT operating platforms1. With the advent of technology, even small BSFIs have now become increasingly reliant on IT to achieve operational efficiency and deliver innovative products and services. Although some of these BSFIs have developed their products and services in-house, many have relied on vendors and service providers to develop and operate these products and services.

1.2. The increasing dependency to IT of BSFIs has consequently resulted to heightened risk exposures arising from their reliance on a variety of IT solutions and services and third-party relationships as well. It is also emphasized that risks involve more than IT and that controls include sound processes and well-trained people. To many BSFIs, effective support and delivery from IT operations has become vital to the performance of most of their critical business lines. This necessitates the adoption of risk management processes that promote sound and controlled operation of IT environments to ensure that IT operations process and store information in a timely, reliable, secure, and resilient manner.

2. ROLES AND RESPONSIBILITIES

2.1. Board of Directors (Board) and Senior Management. The BSFI’s Board and senior management are responsible for overseeing a safe, sound, controlled and efficient IT operating environment that supports the institution’s goals and objective. Although they can delegate implementation and oversight of daily operations to IT management, final responsibility for these activities remains with the Board and senior management. Consequently, the Board and senior management are responsible for understanding the risks associated with existing and planned IT operations, determining the risk tolerance of the BSFI, and establishing and monitoring policies for risk management.

On the other hand, IT operations management is primarily responsible in ensuring the BSFI’s current and planned infrastructure is sufficient to accomplish the strategic plans of senior management and the Board. To accomplish this objective, operations management should ensure the BSFI has sufficient personnel (in knowledge, experience, and number), system capacity and availability, and storage capacity to achieve strategic objectives. Operations management should select or recommend IT solutions that can meet strategic requirements with reduced resources to control capital expenditures and operating costs.

3. IT OPERATIONS STANDARDS

3.1. Technology Inventory. To effectively identify, assess, monitor, and manage the risks associated with IT operations, management should have a comprehensive understanding of the BSFI’s operations universe. Regardless of size, BSFI management should perform and maintain an inventory of all its IT resources, recognize interdependencies of these systems and understand how these systems support the associated business lines. Management should ensure the inventory is updated on an on-going basis to reflect the BSFI’s IT environment at any point in time.

Appropriate documentation of infrastructure and data flow should be in place to facilitate risk identification, application of controls, and ongoing maintenance of information systems. At a minimum, said documentation should include among others, the following components:

a. Hardware – Inventory should be comprehensive to include BSFI’s owned assets and equipment owned by other parties but located within the environment. To the extent possible, hardware items should be marked with a unique identifier, such as a bar code, tamper-proof tag, or other label.

b. Software – There are at least three major categories of software the BSFI should include in the software inventory: operating systems, application software, and back- office and environmental applications.

c. Network Components and Topology2 – Network management should develop and maintain high-level topologies that depict local area networks (LANs3), metropolitan area networks (MANs4) and wide area networks (WANs5). The topologies should have sufficient detail to facilitate network maintenance and troubleshooting, facilitate recovery in the event of a disruption and plan for expansion, reconfiguration, or addition of new technology.

d. Data Flow Diagram – Management should also develop data flow diagrams to supplement its understanding of information flow within and between network segments as well as across the BSFI’s perimeter to external parties. Data flow diagrams are also useful for identifying the volume and type of data stored on various media. In addition, the diagrams should identify and differentiate between data in electronic format, and in other media, such as hard copy or optical images.

e. Media – Descriptive information should identify the type, capacity, and location of the media. It should also identify the location, type, and classification (public, private, confidential, or other) of data stored on the media. Additionally, management should document source systems, data ownership, back up frequency and methodology (tape, remote disk, compact disc (CD), or other), and the location of back-up media if other than at the primary off-site storage facility.

3.2. Risk Assessment. Once inventory is complete, management should employ a variety of risk assessment techniques to identify threats and vulnerabilities to its IT operations, covering among others, the following:

a. Internal and external risks;

b. Risks associated with individual platforms, systems, or processes as well as those of a systemic nature; and

c. The quality and quantity of controls. The risk assessment process should be appropriate to the BSFI’s IT risk profile. To the extent possible, the assessment process should quantify the probability of a threat or vulnerability and the financial consequences of such an event.

After the BSFI identifies and analyzes the universe of risks, management should prioritize risk mitigation actions based on the probability of occurrence and the financial, reputational or legal impact to the institution. Management should prioritize the risk assessment results based on the business importance of the associated systems. The probability of occurrence and magnitude of impact provide the foundation for establishing or expanding controls for safe, sound, and efficient operations appropriate to the risk tolerance of the BSFI.

3.3. Risk Mitigation & Control Implementation

3.3.1. Policies, Standards and Procedures. Board and management should enact policies, standards and procedures sufficient to address and mitigate the risk exposure of the BSFI. The BSFI should adopt minimum IT standards to establish measurable controls and requirements to achieve policy objectives. Procedures describe the processes used to meet the requirements of the BSFI’s IT policies and standards. Management should develop written procedures for critical operations, which procedures should be updated and reviewed regularly. The scope of required procedures depends on the size, complexity and the variety of functions performed by the BSFI’s IT operations.

3.3.2. Controls Implementation

3.3.2.1. Environmental Controls. IT equipment should have a continuous uninterruptible power supply (UPS6). Management should configure the UPS to provide sufficient electricity within milliseconds to power equipment until there is an orderly shutdown or transition to the back-up generator. The back-up generator should generate sufficient power to meet the requirements of mission critical IT and environmental support systems. Similarly, IT operations centers should have independent telecommunication feeds from different vendors. Wiring configurations should support rapid switching from one provider to another without burdensome rerouting or rewiring.

Even small IT operations centers with modest IT equipment can contain a significant amount of computer cabling. Management should physically secure these cables to avoid accidental or malicious disconnection or severing. In addition, management should document wiring strategies and organize cables with labels or color codes to facilitate easy troubleshooting, repair, and upgrade.

Every operations center should have adequate heating, ventilation, and air conditioning (HVAC) systems in order for personnel and equipment to function properly. Organizations should plan their HVAC systems with the requirements of their IT systems in mind. Also, operations personnel should be familiar with written emergency procedures in the event of HVAC system disruption.

Water leaks can cause serious damage to computer equipment and cabling under raised floors. For this reason, operations centers should be equipped with water detectors under raised flooring to alert management of leaks that may not be readily visible. Management should also consider installing floor drains to prevent water from collecting beneath raised floors or under valuable computer equipment.

A variety of strategies are available for fire suppression. Ideally, the fire suppression system should allow operators time to shut down computer equipment and cover it with waterproof covers before releasing the suppressant.

Lastly, Management should consider using video surveillance and recording equipment in all or parts of the facility to monitor activity and deter theft. Management should also use inventory labels, bar codes, and logging procedures to control the inventory of critical and valuable equipment.

3.3.2.2. Preventive Maintenance. All maintenance activities should follow a predetermined schedule. A record of all maintenance activities should be maintained to aid management in reviewing and monitoring employee and vendor performance. Management should schedule time and resources for preventive maintenance and coordinate such schedule with production. During scheduled maintenance, the computer operators should dismount all program and data files and work packs, leaving only the minimum software required for the specific maintenance task on the system. If this is impractical, management should review system activity logs to monitor access to programs or data during maintenance. Also, at least one computer operator should be present at all times when the service representative is in the computer room.

In case a vendor performs computer maintenance online, operators should be aware of the online maintenance schedule so that it does not interfere with normal operations and processing. Operators and information security personnel should adhere to established security procedures to ensure they grant remote access only to authorized maintenance personnel at predetermined times to perform specific tasks.

Operators should maintain a written log of all hardware problems and downtime encountered between maintenance sessions. A periodic report on the nature and frequency of those problems is a necessary management tool, and can be valuable for vendor selection, equipment benchmarking, replacement decisions, or planning increased equipment capacity.

3.3.2.3. Change Management7 & Control. Complex BSFIs should have a change management policy that defines what constitutes a “change” and establishes minimum standards governing the change process. Simple BSFIs may successfully operate with less formality, but should still have written change management policies and procedures.

All changes should flow through the oversight function, which may include appropriate representation from business lines, support areas, IT management, information security, and internal audit. In establishing a framework for managing change, a policy should be present describing minimum standards and including such factors as notification, oversight, and control. Control standards should address risk, testing, authorization and approval, timing of implementation, post installation validation, and back-out or recovery.

3.3.2.4. Patch Management8
Management should establish procedures to stay abreast of patches, to test them in a segregated environment, and to install them when appropriate. Change management procedures should require documentation of any patch installations. Management should develop a process for managing version control of operating and application software to ensure implementation of the latest releases. Management should also maintain a record of the versions in place and should regularly monitor the Internet and other resources for bulletins about product enhancements, security issues, patches or upgrades, or other problems with the current versions of the software.

3.3.2.5. Conversions. Conversions involve major changes to existing systems or applications, or the introduction of systems or data sets which may span multiple platforms. Consequently, they have a higher level of risk requiring additional, specialized controls. Conversions, if improperly handled, may result to corrupt data; hence, strong conversion policies, procedures, and controls are critical. Likewise, since the ramifications of conversion span IT operations, it is important for management to periodically re-evaluate all operations processes and consider the appropriateness of process re-engineering.

3.3.2.6. Network Management Controls. Network standards, design, diagrams and operating procedures should be formally documented, kept updated, communicated to all relevant network staff and reviewed periodically. Communications facilities that are critical to continuity of network services should be identified. Single points of failure should be minimized by automatic re-routing of communications through alternate routes should critical nodes or links fail.

The network should be monitored on a continuous basis to reduce the likelihood of network traffic overload and detect network intrusions. Powerful network analysis and monitoring tools, such as protocol analyzers, network scanning and sniffer tools, are normally used for monitoring network performance and detecting potential or actual intrusions. These powerful network tools should be protected from unauthorized usage (e.g., viewing of unencrypted sensitive information). The use of network tools should also be tightly restricted to authorized staff only and be subject to stringent approval and review procedures.

3.3.2.7. Disposal of Media. Management should have procedures for the destruction and disposal of media containing sensitive information. These procedures should be risk-based relative to the sensitivity of the information and the type of media used to store the information. Furthermore, disposal procedures should recognize that records stored on electronic media, including tapes, and disk drives present unique disposal problems in that residual data can remain on the media after erasure. Since data can be recovered, additional disposal techniques should be applied to remove sensitive information.

3.3.2.8. Imaging. Management should ensure there are adequate controls to protect imaging processes, as many of the traditional audit and controls for paper-based systems may be reduced. Management should also consider issues such as converting existing paper storage files, integration of the imaging system into the organization workflow, and business continuity planning needs to achieve and maintain business objectives.

3.3.2.9. Event/Problem Management. Management should ensure appropriate controls are in place to identify, log, track, analyze, and resolve problems that occur during day-to-day IT operations. The event/ problem management process should be communicated and readily available to all IT operations personnel. Management should ensure it trains all operations personnel to act appropriately during significant events. Employees should also receive training to understand event response escalation procedures.

Operations personnel should be properly trained to recognize events that could trigger implementation of the business continuity plan. Although an event may not initially invoke the plan, it may become necessary as conditions and circumstances change. Management should train and test BSFI personnel to implement and perform appropriate business continuity procedures within the timeframes of the BCP. Operations personnel should properly log and record any events that trigger BCP response and document their ultimate resolutions.

3.3.2.10. User Support/Help Desk. User support processes and activities should ensure end users continuously have the resources and services needed to perform their job functions in an efficient and effective manner. In complex BSFIs, the help desk function provides user support, which typically consists of dedicated staff trained in problem resolution, equipped with issue tracking software, and supported with knowledge-based systems that serve as a reference resource to common problems. In simple BSFIs, user support may consist of a single person, a very small group, or a contract with a support vendor.

The help desk should record and track incoming problem reports, whether handled by live operators or automated systems.

Documentation in the tracking system should include such data as user, problem description, affected system (platform, application, or other), prioritization code, current status toward resolution, party responsible for resolution, root cause (when identified), target resolution time, and a comment field for recording user contacts and other pertinent information. The help desk should evaluate and prioritize issues to ensure the most critical problems receive prompt attention.

Help desk functions may also be supported by knowledge based-systems that provide support staff with action responses to common problems. Strong support functions continually update the knowledge based-systems with information obtained from vendors and from the experiences of help desk staff. Because attrition rates in the help desk function can be high, a knowledge based-system can ensure the BSFI retains knowledge and facilitates the training and development of new employees.

Proper authentication of users is critical to risk management within the user support function. If the help desk uses a single authentication standard for all requests, it should be sufficiently rigorous to cover the highest risk scenarios. However, the BSFI may choose to use different levels of authentication depending upon the problem reported, the type of action requested, or the platform, system, or data involved. If the help desk function is outsourced, management should determine the service provider’s information access level, assign the functions it will perform, and ensure that security and confidentiality remain in place.

3.3.2.11. Scheduling. The BSFI should implement policies and procedures for creating and changing job schedules and should supplement them with automated tools when cost effective. Sound scheduling practices and controls prevent degraded processing performance that can affect response time, cause delays in completing tasks, and skew capacity planning. Automated scheduling tools are necessary for large, complex systems to support effective job processing. Smaller and less complex IT systems generally have a standard job stream with little need for change.

3.3.2.12. Systems and Data Back-up. The BSFI should ensure that sufficient number of backup copies of essential business information, software and related hardcopy documentations are available for restoration or critical operations. A copy of these information, documentation and software should also be stored in an off-site premise or backup site and any changes should be done periodically and reflected in all copies.

The BSFI should back-up and store its data and program files in a secure off-site location to allow restoration of systems, applications, and associated data in the event normal processing is disrupted by a disaster or other significant event. A full system backup should be periodically conducted and should at least consist of the updated version of the operating software, production programs, system utilities and all master and transaction files. The frequency of backup should depend on its criticality, but should be performed after critical modification or updates. Management should implement a storage solution that is manageable from an administrative perspective and usable and accessible from the customer and end-user perspectives to enable them to receive current, complete and accurate data. Storage solutions should be appropriately scalable to allow for future growth.

Written standards should document back-up methodologies, delineate responsibilities of appropriate personnel, and ensure uniform performance throughout the institution. Management should maintain inventories of back-up media stored off-site and periodically perform physical inventories to ensure all required back-up materials are available. Procedures should include verifying adherence to the back-up schedule and reviewing actual back-up copies for readability. Similarly, management should periodically test back-up copies by actually using them to restore programs and data.

All backup media should be properly labeled using standard naming conventions. Management should develop a rotation scheme that addresses varying storage durations as well as transportation and storage of multiple formats of media at the off-site storage location. Transportation to the backup site should be done in controlled and secured manner with proper authorization and record. Procedures for disposal of backup media should also be in place.

3.3.2.13. Systems Reliability, Availability and Recoverability.

a. System Availability. BSFIs should achieve high systems availability (or near zero system downtime) for critical systems which is associated with maintaining adequate capacity, reliable performance, fast response time, scalability and swift recovery capability. Built-in redundancies for single points of failure should be developed and contingency plans should be tested so that business and operating disruptions can be minimized.

b. Technology Recovery Plan. Business resumption very often relies on the recovery of IT resources that include applications, hardware equipment and network infrastructure as well as electronic records. The technology requirements that are needed during recovery for individual business and support functions should be specified when the recovery strategies for the functions are determined.

Appropriate personnel should be assigned with the responsibility for technology recovery. Alternate personnel needs to be identified for key technology recovery personnel in case of their unavailability to perform the recovery process.

As unavailability of systems may result to disruptive impact on its operations, the BSFI should develop an IT disaster recovery plan to ensure that critical application systems and technology services can be resumed in accordance with the business recovery requirements. In formulating an effective recovery plan, scenario analysis should be included to identify and address various types of contingency scenarios. Scenarios such as major system outages which may be caused by system faults, hardware malfunction, operating errors or security incidents as well as a total inaccessibility of the primary data centre should be considered. To strengthen recovery measures relating to large scale disruptions and to achieve risk diversification, rapid operational and backup capabilities at the individual system or application cluster level should be implemented. Recovery and business resumption priorities must be defined accordingly. Specific recovery objectives including recovery time objective9 (RTO) and recovery point objective10(RPO) should be established for systems and applications.

c. Alternate sites for technology recovery. The BSFI should make arrangements for alternate and recovery sites11for their business functions and technology in the event the business premises, key infrastructure and systems supporting critical business functions become unavailable. A recovery site geographically separate from the primary site must be established to enable the restoration of critical systems and resumption of business operations should a disruption occur at the primary site. The required speed of recovery will depend on the criticality of resuming business operations, the type of services and whether there are alternative ways and processing means to maintain adequate continuing service levels to satisfy customers. Recovery strategies and technologies such as on-site redundancy and real-time data replication could be explored to enhance the BSFI’s recovery capability.

The recovery site could either be an in-house backup premise that has a redundant hardware system located away from the computer center, or a third-party recovery facility provider that requires formal subscription to its service, or a combination of both solutions. The recovery facility should be at a distance that would protect it from damage from any incident occurring at the primary site. Ideally, it should be on different electrical power and telecommunication switches, and free from the same disaster. The BSFI should ensure that the IT systems at the recovery sites are:

a. Compatible with the BSFI’s primary systems (in terms of capacity and capability) to adequately support the critical business functions; and

b. Continuously updated with current version of systems and application software to reflect any changes to the BSFI’s system configurations (e.g. hardware or software upgrades or modifications).

In case where a third-party recovery facility is used, there should be a written contract agreement that is legally binding. The agreement should specifically identify the conditions under which the recovery facility may be used and specify how customers would be accommodated if simultaneous disaster conditions occur to several customers of the recovery facility provider. The recovery facility should allow the BSFI to use its services until it achieves a full recovery from the disaster and resumption of activity at the BSFI’s own facility.

The BSFI which outsources critical systems to offshore service providers is heavily dependent on the stability and availability of cross-border network links. To minimize impact to business operations in the event of a disruption (e.g., due to earthquake), cross-border network redundancy with strategies such as engagement of different network service providers and alternate network paths may be instituted.

d. Disaster Recovery Testing. The BSFI should always adopt pre-determined recovery actions that have been tested and endorsed by management. The effectiveness of recovery requirements and the ability of BSFI’s personnel in executing or following the necessary emergency and recovery procedures should be tested and validated at least annually.

Various scenarios which include total shutdown or inaccessibility of the primary data center, as well as component failure at the individual system or application cluster level should be included in disaster recovery tests. Inter-dependencies between and among critical systems should be included in the tests. BSFIs whose networks and systems are linked to specific service providers and vendors, should consider conducting bilateral or multilateral recovery testing.

Business users should be involved in the design and execution of comprehensive test cases so as to obtain assurance that recovered systems function accordingly. The BSFI should also participate in disaster recovery tests of systems hosted overseas. Periodic testing and validation of the recovery capability of backup media should be carried out and assessed for adequacy and effectiveness. Backup tapes and disks containing sensitive data should be encrypted before they are transported offsite for storage.

3.4. Risk Monitoring

3.4.1. Service Level Agreement (SLA). BSFI Management of IT functions should formulate an SLA with business units which will measure the effectiveness and efficiency of delivering IT services. Measurable performance factors include system availability and performance requirements, capacity for growth, and the level of support provided to users, resource usage, operations problems, capacity, response time, personnel activity, as well as business unit and external customer satisfaction. Adequate procedures should be in place to manage and monitor delivery of committed services.

3.4.2. Control Self-Assessments12(CSAs). The BSFI may consider the conduct of periodic CSAs to validate the adequacy and effectiveness of the IT control environment. They also facilitate early identification to allow management to gauge performance, as well as the criticality of systems and emerging risks. Depending on the complexity of the BSFI’s IT risk profile, the content and format of the CSAs may be standardized and comprehensive or highly customized, focusing on a specific process, system, or functional area. IT operations management may collaborate with the internal audit function in creating the templates used. Typically, the CSA form combines narrative responses with a checklist. The self-assessment form should identify the system, process, or functional area reviewed, and the person(s) completing and reviewing the form. CSA’s however, are not a substitute for a sound internal audit program. Management should base the frequency of CSA the risk assessment process and coordinate the same with the internal audit plan.

3.4.3. Performance Monitoring. The BSFI should implement a process to ensure that the performance of IT systems is continuously monitored and exceptions are reported in a timely and comprehensive manner. The performance monitoring process should include forecasting capability to enable problems to be identified and corrected before they affect system performance. Monitoring and reporting also support proactive systems management that can help the BSFI position itself to meet its current needs and plan for periods of growth, mergers, or expansion of products and services.

BSFI Management should also conduct performance monitoring for outsourced IT solutions as part of a comprehensive vendor management program. Reports from service providers should include performance metrics, and identify the root causes of problems. Where service providers are subject to SLAs, management should ensure the provider complies with identified action plans, remuneration, or performance penalties.

3.4.4. Capacity Planning. Management should monitor IT resources for capacity planning including platform processing speed, core storage for each platform’s central processing unit, data storage, and voice and data communication bandwidth13. Capacity planning should be closely integrated with the budgeting and strategic planning processes. It also should address personnel issues including staff size, appropriate training, and staff succession plans. This process should help the preparation of workload forecasts to identify trends and to provide information needed for the capacity plan, taking into account planned business initiatives. Capacity planning should be extended to cover back- up systems and related facilities in addition to the production environment.

4. ROLE OF IT AUDIT

4.1. The BSFI’s IT audit function should regularly assess the effectiveness of established controls within the IT operations environment through audits or other independent verification. Audits provide independent assessments rendered by qualified individuals regarding the effective functioning of operational controls.

(Circular No. 958 dated 25 April 2017)

Footnotes

  1. IT operating platform includes the underlying computer system on which application programs run. A platform consists of an operating system, the computer system’s coordinating program, which in turn is built on the instruction set for a processor or microprocessor, and the hardware that performs logic operations and manages data movement in the computer.
  2. A network is a group of two or more computers that are linked together. For example, networks allow users at different branches or different workstations to access the Internet, send and receive email, and share printers, applications, and data. A network topology pictorially describes the arrangement or architecture of a network, including its workstations and connecting communication lines.
  3. A LAN is a network that connects workstations in a relatively small geographic area, such as a building. Computers connected in a LAN are usually connected by cables, but they can also be connected wirelessly.
  4. A MAN is a network that usually spans a city or a large campus. A MAN usually interconnects a number of LANs using a high-capacity backbone technology, such as fiber-optical links, and provides up-link services to WAN and the internet.
  5. A WAN is a network that connects other networks together. WANs are typically complicated networks covering broad areas (i.e., any network that links across metropolitan, regional, or national boundaries) and allowing many computers and other devices to communicate and share data.
  6. UPS is a device that allows computer to keep running for at least a short time when the primary power source is lost. A UPS may also provide protection from power surges. A UPS contains a battery that “kicks in” when the device senses a loss of power from the primary source allowing the user time to save any data they are working on and to exit before the secondary power source (the battery) runs out. When power surges occur, a UPS intercepts the surge so that it doesn’t damage the computer.
  7. Change management refers to the broad processes for managing organizational change. Change management encompasses planning, oversight or governance, project management, testing and implementation.
  8. A patch is a piece of software designed to fix problems with, or update a computer program or its supporting data. This includes fixing security vulnerabilities and other bugs, and improving the usability or performance. Though meant to fix problems, poorly designed patches can sometimes introduce new problems. In some special cases, updates may knowingly break the functionality, for instance, by removing components for that the update provider is no longer licensed. Patch Management is the process of using a strategy and plan of what patches should be applied to which systems at a specified time.
  9. RTO refers to the required time taken to recover an IT system from the point of disruption.
  10. RPO refers to the acceptable amount of data loss for an IT system should a disaster occur.
  11. Recovery site is an alternate location for processing information (and possibly conducting business) in an emergency.
  12. CSA is a technique used to assess risk and control strength and weaknesses against a control framework.
  13. Bandwidth is a terminology used to indicate the transmission or processing capacity of a system or of a specific location in a system (usually a network system) for information (text, images, video, sound). It is usually defined in bits per second (bps)