IT RISK MANAGEMENT STANDARDS AND GUIDELINES
Area: IT Operations
(Appendix to Sec. 148 on Purpose and Scope, and IT Risk Management Systems)
1. INTRODUCTION
1.1. The evolving role IT plays in supporting the business function has become increasingly complex. IT operations – traditionally housed in a computer data center with user connections through terminals – have become more dynamic and include distributed environments, integrated applications, telecommunication options, internet connectivity, and an array of IT operating platforms[1]. With the advent of technology, even small BSFIs have now become increasingly reliant on IT to achieve operational efficiency and deliver innovative products and services. Although some of these BSFIs have developed their products and services in-house, many have relied on vendors and service providers to develop and operate these products and services.
1.2. The increasing dependence of BSFIs on IT has consequently resulted in heightened risk exposures arising from their reliance on a variety of IT solutions and services, as well as on third-party relationships. It is also emphasized that risks involve more than IT and that controls include sound processes and well-trained people. For many BSFIs, effective support and delivery from IT operations have become vital to the performance of most of their critical business lines. This necessitates the adoption of risk management processes that promote sound and controlled operation of IT environments to ensure that IT operations process and store information in a timely, reliable, secure, and resilient manner.
2. ROLES AND RESPONSIBILITIES
2.1. Board of Directors (Board) and Senior Management. The BSFI’s Board and senior management are responsible for overseeing a safe, sound, controlled and efficient IT operating environment that supports the institution’s goals and objectives. Although they can delegate implementation and oversight of daily operations to IT management, final responsibility for these activities remains with the Board and senior management. Consequently, the Board and senior management are responsible for understanding the risks associated with existing and planned IT operations, determining the risk tolerance of the BSFI, and establishing and monitoring policies for risk management.
3. IT OPERATIONS STANDARDS
3.1. Technology Inventory. To effectively identify, assess, monitor, and manage the risks associated with IT operations, management should have a comprehensive understanding of the BSFI’s operations universe. Regardless of size, BSFI management should perform and maintain an inventory of all its IT resources, recognize interdependencies of these systems and understand how these systems support the associated business lines. Management should ensure the inventory is updated on an on-going basis to reflect the BSFI’s IT environment at any point in time.
a. Hardware – The inventory should be comprehensive and include BSFI-owned assets as well as equipment owned by other parties but located within the environment. To the extent possible, hardware items should be marked with a unique identifier, such as a bar code, tamper-proof tag, or other label.
b. Software – There are at least three major categories of software the BSFI should include in the software inventory: operating systems, application software, and back-office and environmental applications.
c. Network Components and Topology[2] – Network management should develop and maintain high-level topologies that depict local area networks (LANs)[3], metropolitan area networks (MANs)[4] and wide area networks (WANs)[5]. The topologies should have sufficient detail to facilitate network maintenance and troubleshooting, facilitate recovery in the event of a disruption, and support planning for expansion, reconfiguration, or addition of new technology.
d. Data Flow Diagram – Management should also develop data flow diagrams to supplement its understanding of information flow within and between network segments as well as across the BSFI’s perimeter to external parties. Data flow diagrams are also useful for identifying the volume and type of data stored on various media. In addition, the diagrams should identify and differentiate between data in electronic format, and in other media, such as hard copy or optical images.
e. Media – Descriptive information should identify the type, capacity, and location of the media. It should also identify the location, type, and classification (public, private, confidential, or other) of data stored on the media. Additionally, management should document source systems, data ownership, back-up frequency and methodology (tape, remote disk, compact disc (CD), or other), and the location of back-up media if other than at the primary off-site storage facility.
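The inventory items above, together with the interdependencies mentioned in 3.1, can be illustrated with a minimal Python sketch. The record fields, the asset identifiers, and the function name are hypothetical and are not prescribed by this standard; the sketch only shows how an inventory that records dependencies could be used to see which business lines rely on a given asset.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One inventory record (hypothetical fields, for illustration only)."""
    asset_id: str                     # unique identifier, e.g. bar code or tag number
    category: str                     # "hardware", "software", "network", or "media"
    business_lines: set = field(default_factory=set)  # business lines directly supported
    depends_on: set = field(default_factory=set)      # asset_ids this asset relies on


def affected_business_lines(inventory: dict, failed_id: str) -> set:
    """Return business lines supported by the failed asset or by any asset
    that depends on it, directly or indirectly."""
    affected = set()
    seen = set()
    stack = [failed_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        affected |= inventory[current].business_lines
        # Follow the dependency links upward to assets that rely on `current`.
        stack.extend(a.asset_id for a in inventory.values() if current in a.depends_on)
    return affected


# Hypothetical example inventory.
inventory = {
    "SRV-001": Asset("SRV-001", "hardware"),
    "SRV-002": Asset("SRV-002", "hardware"),
    "CORE-BANKING": Asset("CORE-BANKING", "software", {"deposits", "loans"}, {"SRV-001"}),
    "ATM-SWITCH": Asset("ATM-SWITCH", "software", {"atm"}, {"CORE-BANKING"}),
    "HR-APP": Asset("HR-APP", "software", {"hr"}, {"SRV-002"}),
}

print(sorted(affected_business_lines(inventory, "SRV-001")))
```

In this example, a failure of the hypothetical server SRV-001 affects the deposits, loans, and ATM business lines, but not the HR application hosted on a different server.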
3.2. Risk Assessment. Once inventory is complete, management should employ a variety of risk assessment techniques to identify threats and vulnerabilities to its IT operations, covering among others, the following:
a. Internal and external risks;
b. Risks associated with individual platforms, systems, or processes as well as those of a systemic nature; and
c. The quality and quantity of controls.
The risk assessment process should be appropriate to the BSFI’s IT risk profile. To the extent possible, the assessment process should quantify the probability of a threat or vulnerability and the financial consequences of such an event.
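One common way to combine probability and financial consequence is an expected-loss figure (probability multiplied by impact). The Python sketch below assumes this simple model; the scenario names, probabilities, and amounts are invented for illustration and are not drawn from this standard, which does not prescribe a particular quantification method.

```python
from dataclasses import dataclass


@dataclass
class RiskScenario:
    name: str
    annual_probability: float  # estimated chance of occurrence within a year (0.0 to 1.0)
    financial_impact: float    # estimated loss per occurrence, in the BSFI's currency

    def expected_annual_loss(self) -> float:
        """Probability times impact: a simple, assumed quantification model."""
        if not 0.0 <= self.annual_probability <= 1.0:
            raise ValueError("annual_probability must be between 0 and 1")
        return self.annual_probability * self.financial_impact


def rank_by_expected_loss(scenarios):
    """Order scenarios from highest to lowest expected annual loss."""
    return sorted(scenarios, key=lambda s: s.expected_annual_loss(), reverse=True)


# Hypothetical scenarios with illustrative figures.
scenarios = [
    RiskScenario("Data center power failure", 0.10, 5_000_000),
    RiskScenario("Data corruption during conversion", 0.02, 40_000_000),
    RiskScenario("Vendor service outage", 0.25, 1_000_000),
]

for s in rank_by_expected_loss(scenarios):
    print(f"{s.name}: {s.expected_annual_loss():,.0f}")
```

A ranking of this kind is only as good as the underlying estimates, so it is best treated as one input to the assessment alongside qualitative judgment about the quality of controls.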
3.3. Risk Mitigation & Control Implementation
3.3.1. Policies, Standards and Procedures. The Board and management should enact policies, standards and procedures sufficient to address and mitigate the risk exposure of the BSFI. The BSFI should adopt minimum IT standards to establish measurable controls and requirements to achieve policy objectives. Procedures describe the processes used to meet the requirements of the BSFI’s IT policies and standards. Management should develop written procedures for critical operations and should review and update them regularly. The scope of required procedures depends on the size, complexity and the variety of functions performed by the BSFI’s IT operations.
3.3.2. Controls Implementation
3.3.2.1. Environmental Controls. IT equipment should have a continuous uninterruptible power supply (UPS)[6]. Management should configure the UPS to provide sufficient electricity within milliseconds to power equipment until there is an orderly shutdown or transition to the back-up generator. The back-up generator should generate sufficient power to meet the requirements of mission critical IT and environmental support systems. Similarly, IT operations centers should have independent telecommunication feeds from different vendors. Wiring configurations should support rapid switching from one provider to another without burdensome rerouting or rewiring.
3.3.2.2. Preventive Maintenance. All maintenance activities should follow a predetermined schedule. A record of all maintenance activities should be maintained to aid management in reviewing and monitoring employee and vendor performance. Management should schedule time and resources for preventive maintenance and coordinate such schedule with production. During scheduled maintenance, the computer operators should dismount all program and data files and work packs, leaving only the minimum software required for the specific maintenance task on the system. If this is impractical, management should review system activity logs to monitor access to programs or data during maintenance. Also, at least one computer operator should be present at all times when the service representative is in the computer room.
3.3.2.3. Change Management[7] & Control. Complex BSFIs should have a change management policy that defines what constitutes a “change” and establishes minimum standards governing the change process. Simple BSFIs may successfully operate with less formality, but should still have written change management policies and procedures.
3.3.2.4. Patch Management[8]. Management should establish procedures to stay abreast of patches, to test them in a segregated environment, and to install them when appropriate. Change management procedures should require documentation of any patch installations. Management should develop a process for managing version control of operating and application software to ensure implementation of the latest releases. Management should also maintain a record of the versions in place and should regularly monitor the Internet and other resources for bulletins about product enhancements, security issues, patches or upgrades, or other problems with the current versions of the software.
3.3.2.5. Conversions. Conversions involve major changes to existing systems or applications, or the introduction of systems or data sets which may span multiple platforms. Consequently, they carry a higher level of risk requiring additional, specialized controls. Conversions, if improperly handled, may result in corrupt data; hence, strong conversion policies, procedures, and controls are critical. Likewise, since the ramifications of conversion span IT operations, it is important for management to periodically re-evaluate all operations processes and consider the appropriateness of process re-engineering.
3.3.2.6. Network Management Controls. Network standards, design, diagrams and operating procedures should be formally documented, kept updated, communicated to all relevant network staff and reviewed periodically. Communications facilities that are critical to continuity of network services should be identified. Single points of failure should be minimized by automatic re-routing of communications through alternate routes should critical nodes or links fail.
3.3.2.7. Disposal of Media. Management should have procedures for the destruction and disposal of media containing sensitive information. These procedures should be risk-based relative to the sensitivity of the information and the type of media used to store the information. Furthermore, disposal procedures should recognize that records stored on electronic media, including tapes and disk drives, present unique disposal problems in that residual data can remain on the media after erasure. Since such data can be recovered, additional disposal techniques should be applied to remove sensitive information.
3.3.2.8. Imaging. Management should ensure there are adequate controls to protect imaging processes, as many of the traditional audits and controls for paper-based systems may be reduced. Management should also consider issues such as converting existing paper storage files, integrating the imaging system into the organization’s workflow, and business continuity planning needs to achieve and maintain business objectives.
3.3.2.9. Event/Problem Management. Management should ensure appropriate controls are in place to identify, log, track, analyze, and resolve problems that occur during day-to-day IT operations. The event/problem management process should be communicated and readily available to all IT operations personnel. Management should ensure it trains all operations personnel to act appropriately during significant events. Employees should also receive training to understand event response escalation procedures.
3.3.2.10. User Support/Help Desk. User support processes and activities should ensure end users continuously have the resources and services needed to perform their job functions in an efficient and effective manner. In complex BSFIs, the help desk function provides user support, which typically consists of dedicated staff trained in problem resolution, equipped with issue tracking software, and supported with knowledge-based systems that serve as a reference resource to common problems. In simple BSFIs, user support may consist of a single person, a very small group, or a contract with a support vendor.
3.3.2.11. Scheduling. The BSFI should implement policies and procedures for creating and changing job schedules and should supplement them with automated tools when cost effective. Sound scheduling practices and controls prevent degraded processing performance that can affect response time, cause delays in completing tasks, and skew capacity planning. Automated scheduling tools are necessary for large, complex systems to support effective job processing. Smaller and less complex IT systems generally have a standard job stream with little need for change.
3.3.2.12. Systems and Data Back-up. The BSFI should ensure that a sufficient number of back-up copies of essential business information, software and related hardcopy documentation are available for restoration of critical operations. A copy of this information, documentation and software should also be stored at an off-site premises or back-up site, and any changes should be made periodically and reflected in all copies.
3.3.2.13. Systems Reliability, Availability and Recoverability.
a. System Availability. BSFIs should achieve high systems availability (or near zero system downtime) for critical systems, which is associated with maintaining adequate capacity, reliable performance, fast response time, scalability and swift recovery capability. Built-in redundancies for single points of failure should be developed and contingency plans should be tested so that business and operating disruptions can be minimized.
b. Technology Recovery Plan. Business resumption very often relies on the recovery of IT resources that include applications, hardware equipment and network infrastructure as well as electronic records. The technology requirements that are needed during recovery for individual business and support functions should be specified when the recovery strategies for the functions are determined.
Appropriate personnel should be assigned responsibility for technology recovery. Alternate personnel need to be identified for key technology recovery personnel in case the latter are unavailable to perform the recovery process.
c. Alternate sites for technology recovery. The BSFI should make arrangements for alternate and recovery sites[11] for its business functions and technology in the event the business premises, key infrastructure and systems supporting critical business functions become unavailable. A recovery site geographically separate from the primary site must be established to enable the restoration of critical systems and resumption of business operations should a disruption occur at the primary site. The required speed of recovery will depend on the criticality of resuming business operations, the type of services and whether there are alternative ways and processing means to maintain adequate continuing service levels to satisfy customers. Recovery strategies and technologies such as on-site redundancy and real-time data replication could be explored to enhance the BSFI’s recovery capability.
a. Compatible with the BSFI’s primary systems (in terms of capacity and capability) to adequately support the critical business functions; and
b. Continuously updated with current version of systems and application software to reflect any changes to the BSFI’s system configurations (e.g. hardware or software upgrades or modifications).
d. Disaster Recovery Testing. The BSFI should always adopt pre-determined recovery actions that have been tested and endorsed by management. The effectiveness of recovery requirements and the ability of the BSFI’s personnel to execute or follow the necessary emergency and recovery procedures should be tested and validated at least annually.
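As an illustration of how availability and recovery test results might be measured, the Python sketch below computes an availability percentage from recorded downtime and compares the results of a recovery test against RTO and RPO targets (as defined in the footnotes). The function names, the 30-day period, and all figures are assumptions for illustration; this standard does not set specific targets.

```python
def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Percentage of the period during which the system was available."""
    if total_minutes <= 0 or not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime must be between 0 and the total period")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def recovery_test_passed(rto_minutes: float, rpo_minutes: float,
                         measured_recovery_minutes: float,
                         measured_data_loss_minutes: float) -> bool:
    """True when a test recovered within the RTO and lost no more data than the RPO allows."""
    return (measured_recovery_minutes <= rto_minutes
            and measured_data_loss_minutes <= rpo_minutes)


# Hypothetical figures: a 30-day month with 43.2 minutes of downtime.
minutes_in_month = 30 * 24 * 60
print(f"Availability: {availability_pct(minutes_in_month, 43.2):.3f}%")

# Hypothetical test: targets of 4 hours (RTO) and 15 minutes (RPO);
# the test took 3.5 hours to recover and lost 10 minutes of data.
print(recovery_test_passed(240, 15, 210, 10))
```

Recording the measured recovery time and data loss from each annual test makes it easier to show management whether recovery capability is keeping pace with the stated targets.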
3.4. Risk Monitoring
3.4.1. Service Level Agreement (SLA). Management of the BSFI’s IT functions should formulate an SLA with business units to measure the effectiveness and efficiency of IT service delivery. Measurable performance factors include system availability and performance requirements, capacity for growth, the level of support provided to users, resource usage, operations problems, capacity, response time, personnel activity, and business unit and external customer satisfaction. Adequate procedures should be in place to manage and monitor delivery of committed services.
3.4.2. Control Self-Assessments[12] (CSAs). The BSFI may consider the conduct of periodic CSAs to validate the adequacy and effectiveness of the IT control environment. CSAs also facilitate early identification of issues, allowing management to gauge performance as well as the criticality of systems and emerging risks. Depending on the complexity of the BSFI’s IT risk profile, the content and format of the CSAs may be standardized and comprehensive or highly customized, focusing on a specific process, system, or functional area. IT operations management may collaborate with the internal audit function in creating the templates used. Typically, the CSA form combines narrative responses with a checklist. The self-assessment form should identify the system, process, or functional area reviewed, and the person(s) completing and reviewing the form. CSAs, however, are not a substitute for a sound internal audit program. Management should base the frequency of CSAs on the risk assessment process and coordinate it with the internal audit plan.
3.4.3. Performance Monitoring. The BSFI should implement a process to ensure that the performance of IT systems is continuously monitored and exceptions are reported in a timely and comprehensive manner. The performance monitoring process should include forecasting capability to enable problems to be identified and corrected before they affect system performance. Monitoring and reporting also support proactive systems management that can help the BSFI position itself to meet its current needs and plan for periods of growth, mergers, or expansion of products and services.
3.4.4. Capacity Planning. Management should monitor IT resources for capacity planning including platform processing speed, core storage for each platform’s central processing unit, data storage, and voice and data communication bandwidth[13]. Capacity planning should be closely integrated with the budgeting and strategic planning processes. It also should address personnel issues including staff size, appropriate training, and staff succession plans. This process should support the preparation of workload forecasts to identify trends and to provide information needed for the capacity plan, taking into account planned business initiatives. Capacity planning should be extended to cover back-up systems and related facilities in addition to the production environment.
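A workload forecast can be as simple as fitting a straight-line trend to historical usage. The Python sketch below assumes equally spaced monthly observations and a linear growth pattern, and estimates how many months remain before a resource reaches its capacity. The function name and the storage figures are hypothetical, and a real capacity plan would also factor in planned business initiatives that a historical trend cannot capture.

```python
from typing import Optional, Sequence


def months_until_capacity(usage_history: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate months from the latest observation until usage reaches capacity,
    using an ordinary least-squares linear trend over monthly observations.
    Returns None when usage is flat or declining (no exhaustion forecast)."""
    n = len(usage_history)
    if n < 2:
        raise ValueError("at least two observations are needed to fit a trend")
    if usage_history[-1] >= capacity:
        return 0.0  # already at or above capacity

    mean_x = (n - 1) / 2
    mean_y = sum(usage_history) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage_history))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x

    # Month index at which the trend line crosses capacity, relative to the last observation.
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - (n - 1))


# Hypothetical data: storage used (in TB) over four consecutive months, 100 TB capacity.
print(months_until_capacity([50, 55, 60, 65], 100))
```

With the hypothetical figures shown, usage grows by 5 TB per month, so the trend reaches the 100 TB limit seven months after the latest observation.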
4. ROLE OF IT AUDIT
4.1. The BSFI’s IT audit function should regularly assess the effectiveness of established controls within the IT operations environment through audits or other independent verification. Audits provide independent assessments rendered by qualified individuals regarding the effective functioning of operational controls.
(Circular No. 958 dated 25 April 2017)
Footnotes
[1] IT operating platform includes the underlying computer system on which application programs run. A platform consists of an operating system, the computer system’s coordinating program, which in turn is built on the instruction set for a processor or microprocessor, and the hardware that performs logic operations and manages data movement in the computer.
[2] A network is a group of two or more computers that are linked together. For example, networks allow users at different branches or different workstations to access the Internet, send and receive email, and share printers, applications, and data. A network topology pictorially describes the arrangement or architecture of a network, including its workstations and connecting communication lines.
[3] A LAN is a network that connects workstations in a relatively small geographic area, such as a building. Computers connected in a LAN are usually connected by cables, but they can also be connected wirelessly.
[4] A MAN is a network that usually spans a city or a large campus. A MAN usually interconnects a number of LANs using a high-capacity backbone technology, such as fiber-optic links, and provides up-link services to WANs and the internet.
[5] A WAN is a network that connects other networks together. WANs are typically complicated networks covering broad areas (i.e., any network that links across metropolitan, regional, or national boundaries) and allowing many computers and other devices to communicate and share data.
[6] UPS is a device that allows a computer to keep running for at least a short time when the primary power source is lost. A UPS may also provide protection from power surges. A UPS contains a battery that “kicks in” when the device senses a loss of power from the primary source, allowing the user time to save any data they are working on and to exit before the secondary power source (the battery) runs out. When power surges occur, a UPS intercepts the surge so that it doesn’t damage the computer.
[7] Change management refers to the broad processes for managing organizational change. Change management encompasses planning, oversight or governance, project management, testing and implementation.
[8] A patch is a piece of software designed to fix problems with, or update, a computer program or its supporting data. This includes fixing security vulnerabilities and other bugs, and improving usability or performance. Though meant to fix problems, poorly designed patches can sometimes introduce new problems. In some special cases, updates may knowingly break functionality, for instance, by removing components for which the update provider is no longer licensed. Patch management is the process of using a strategy and plan to determine which patches should be applied to which systems at a specified time.
[9] RTO refers to the required time taken to recover an IT system from the point of disruption.
[10] RPO refers to the acceptable amount of data loss for an IT system should a disaster occur.
[11] Recovery site is an alternate location for processing information (and possibly conducting business) in an emergency.
[12] CSA is a technique used to assess risk and control strengths and weaknesses against a control framework.
[13] Bandwidth is a term used to indicate the transmission or processing capacity of a system or of a specific location in a system (usually a network system) for information (text, images, video, sound). It is usually defined in bits per second (bps).