Overview
L2 Disaster Recovery Engineer Jobs in Johannesburg at AKB Technology
Part-Time (80 hours/month) contract.
L2 Disaster Recovery Engineer
(Process Resilience & DR Operations – Top 100 Business Processes)
Position Summary
We are seeking a highly structured and technically capable Level 2 Disaster Recovery Engineer responsible for supporting and maintaining the organisation’s Disaster Recovery (DR) readiness across the Top 100 critical business processes.
The role focuses on documenting, mapping, testing, and maintaining recovery procedures for the organisation’s most critical operational and technology processes. The engineer will ensure that DR documentation is accurate, that recovery procedures are validated through testing, and that systems supporting critical services can be recovered within the agreed RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets.
This role works closely with Infrastructure, Security, ServiceDesk, Applications, and Business Process Owners to ensure operational resilience and business continuity.
Core Responsibilities
1. Disaster Recovery Process Mapping (Top 100 Processes)
- Identify and maintain documentation for the Top 100 critical operational and technical business processes.
- Map dependencies between applications, infrastructure, databases, networks, and services supporting each process.
- Document process flows, technical dependencies, recovery steps, and escalation paths.
- Maintain a Process-to-System Dependency Map.
- Work with business units to ensure accurate recovery prioritisation.
2. Disaster Recovery Documentation & Runbooks
- Develop and maintain Disaster Recovery Runbooks for all Tier 1 and Tier 2 systems.
- Document step-by-step recovery procedures, including:
  - Server recovery
  - Database recovery
  - Network recovery
  - Application recovery
  - User access restoration
- Maintain technical architecture diagrams supporting recovery plans.
- Ensure documentation is updated after changes, upgrades, or infrastructure modifications.
3. DR Infrastructure Support
Support and maintain DR capabilities across key platforms including:
- Backup platforms (e.g., Veeam or equivalent)
- Replication technologies
- Virtualisation platforms (VMware / Hyper-V / Nutanix)
- Storage replication
- Network failover configurations
- Cloud-based DR environments
Responsibilities include:
- Monitoring DR replication status
- Verifying backup integrity
- Ensuring DR environments remain operational and ready for activation
4. DR Testing & Simulation
- Plan and execute regular DR test exercises.
- Validate recovery procedures for critical systems.
- Conduct failover and failback testing where applicable.
- Document test results and improvement actions.
- Work with service owners to ensure recovery objectives are achievable.
Testing types include:
- Tabletop exercises
- Partial DR recovery tests
- Full DR simulation exercises
5. Incident Support & Recovery Coordination
- Act as L2 technical support during major incidents requiring DR activation.
- Assist in failover procedures and service restoration.
- Coordinate with infrastructure and application teams during recovery.
- Ensure communication flows correctly during recovery events.
6. Business Continuity Alignment
- Work closely with Business Continuity and Risk teams.
- Ensure DR plans align with Business Continuity Plans (BCP).
- Assist business units in identifying:
  - Critical systems
  - Recovery priorities
  - Operational recovery requirements
7. Compliance, Governance & Audit Support
Ensure DR practices comply with organisational and regulatory requirements including:
- ISO 27001
- ITIL Service Continuity Management
- POPIA data protection requirements
Responsibilities include:
- Maintaining DR evidence for audits
- Ensuring documentation standards are followed
- Supporting internal and external audit requests.
8. Continuous Improvement
- Identify gaps in recovery capability.
- Recommend improvements to DR architecture.
- Work with infrastructure teams to improve resilience and failover capability.
- Assist with developing automation for recovery processes where possible.
Deliverables
- Complete Top 100 Business Process DR Mapping Document.
- Fully documented Disaster Recovery Runbooks.
- Quarterly DR Testing Report.
- Updated System Dependency Maps.
- Recovery readiness reports for management.
- DR risk and improvement recommendations.
Required Experience & Skills
- 3–6 years’ experience in Infrastructure, DR, or Systems Engineering.
- Strong understanding of:
  - Virtualisation platforms (VMware / Hyper-V / Nutanix)
  - Backup solutions (Veeam or similar)
  - Enterprise storage platforms
  - Windows and Linux server environments
  - Network fundamentals
- Experience with Disaster Recovery planning and testing.
- Ability to document technical recovery procedures clearly.
- Strong analytical and troubleshooting skills.
Preferred Qualifications
- ITIL Foundation certification
- Disaster Recovery or Business Continuity certification
- VMware / Microsoft infrastructure certifications
- Experience working in Managed Service Provider (MSP) environments
Key Performance Indicators (KPIs)
- 100% documentation of the Top 100 critical processes.
- DR runbooks maintained and updated quarterly.
- Successful DR testing with documented recovery times.
- Reduction in recovery gaps and risk exposure.
- Compliance with internal governance and audit requirements.
Expanded Key Performance Indicators (KPIs)
L2 Disaster Recovery Engineer – Top 100 Business Processes
1. Documentation of Top 100 Critical Processes
Objective
Ensure that the organisation has complete visibility of all critical operational and technical processes required to recover the business during a disaster event.
Measurement Criteria
- 100% of the Top 100 critical business processes identified and documented.
- Each process must include:
  - Process description
  - Business owner
  - Supporting applications
  - Supporting infrastructure
  - Dependencies (network, storage, identity services)
  - Recovery sequence
  - RTO and RPO targets
  - Escalation contacts
Deliverables
- Top 100 Process Catalogue
- Process Dependency Map
- Recovery Priority Matrix
KPI Target
- 100% completion within the first 90 days of starting the role
Reporting
- Monthly progress report to Operations Management.
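One way to track the 100% completion target is to treat each catalogue entry as a record and check it against the required fields listed above. The Python sketch below assumes a simple in-house data model: the class `ProcessEntry`, its field names, `completion_percent` and the "Payroll run" example are hypothetical and not taken from any specific tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Required attributes, mirroring the Measurement Criteria list.
# Field names are hypothetical; the posting does not prescribe a schema.
REQUIRED_FIELDS = (
    "description",
    "business_owner",
    "applications",
    "infrastructure",
    "dependencies",         # network, storage, identity services
    "recovery_sequence",
    "rto_minutes",
    "rpo_minutes",
    "escalation_contacts",
)


@dataclass
class ProcessEntry:
    """One row of a hypothetical Top 100 Process Catalogue."""
    name: str
    description: str = ""
    business_owner: str = ""
    applications: list[str] = field(default_factory=list)
    infrastructure: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)
    recovery_sequence: list[str] = field(default_factory=list)
    rto_minutes: int | None = None
    rpo_minutes: int | None = None
    escalation_contacts: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the required fields that are still empty (None, "" or [])."""
        return [
            name for name in REQUIRED_FIELDS
            if getattr(self, name) in (None, "", [])
        ]


def completion_percent(catalogue: list[ProcessEntry], target_count: int = 100) -> float:
    """Percentage of the target process count that is fully documented."""
    complete = sum(1 for entry in catalogue if not entry.missing_fields())
    return 100.0 * complete / target_count


if __name__ == "__main__":
    draft = ProcessEntry(
        name="Payroll run",
        description="Monthly payroll processing",
        business_owner="HR Manager",
    )
    # Lists every required field that still needs to be captured.
    print(draft.missing_fields())
```

The same `missing_fields` output can feed the monthly progress report directly, since it names exactly what is outstanding for each process.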
2. Disaster Recovery Runbooks Maintained and Updated
Objective
Ensure all critical systems have step-by-step recovery procedures that engineers can follow during a disaster.
Measurement Criteria
Each DR runbook must contain:
- Recovery steps
- Required credentials / access levels
- Infrastructure dependencies
- Estimated recovery time
- Validation steps after recovery
Runbooks must be updated whenever:
- Infrastructure changes occur
- Software versions change
- Systems are migrated
- New dependencies are introduced
KPI Target
- 100% of Tier 1 systems documented
- 95% of Tier 2 systems documented
- Runbooks reviewed quarterly
Deliverables
- DR Runbook Repository
- Infrastructure Recovery Procedures
- Application Recovery Procedures
Reporting
Quarterly report showing:
- Runbooks reviewed
- Runbooks updated
- Systems missing recovery documentation
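The tier coverage targets and the quarterly review requirement lend themselves to a simple automated check. The sketch below assumes each system is recorded with its tier and the date its runbook was last reviewed; `SystemRecord`, `quarterly_report`, and the 92-day window standing in for "quarterly" are illustrative assumptions, not a prescribed tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta

# Coverage targets from the KPI above; the 92-day window approximating
# "reviewed quarterly" is an assumption.
COVERAGE_TARGETS = {1: 100.0, 2: 95.0}
REVIEW_WINDOW = timedelta(days=92)


@dataclass
class SystemRecord:
    name: str
    tier: int
    runbook_last_reviewed: date | None  # None means no runbook exists yet


def coverage_by_tier(systems: list[SystemRecord]) -> dict[int, float]:
    """Percentage of systems in each targeted tier that have a runbook."""
    result = {}
    for tier in COVERAGE_TARGETS:
        in_tier = [s for s in systems if s.tier == tier]
        if not in_tier:
            continue
        documented = sum(1 for s in in_tier if s.runbook_last_reviewed is not None)
        result[tier] = 100.0 * documented / len(in_tier)
    return result


def quarterly_report(systems: list[SystemRecord], today: date) -> dict[str, object]:
    """Collect coverage, missing runbooks and overdue reviews for the report."""
    missing = [s.name for s in systems if s.runbook_last_reviewed is None]
    overdue = [
        s.name for s in systems
        if s.runbook_last_reviewed is not None
        and today - s.runbook_last_reviewed > REVIEW_WINDOW
    ]
    coverage = coverage_by_tier(systems)
    targets_met = all(coverage[t] >= COVERAGE_TARGETS[t] for t in coverage)
    return {
        "coverage": coverage,
        "missing_documentation": missing,
        "review_overdue": overdue,
        "targets_met": targets_met,
    }
```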
3. Disaster Recovery Testing & Validation
Objective
Validate that documented recovery procedures actually work and meet defined recovery targets.
Testing Types
- Tabletop simulation
- Partial infrastructure recovery
- Full failover testing
Measurement Criteria
Each DR test must measure:
- Recovery Time Objective (RTO)
- Recovery Point Objective (RPO)
- Service restoration validation
KPI Target
- A minimum of two DR test exercises per year for critical systems
- 100% documentation of DR test results
- Recovery targets achieved for 90% of tested systems
Deliverables
- DR Test Plan
- DR Test Results Report
- Improvement Action Register
Reporting
Quarterly DR readiness report to management.
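The measurement criteria for DR testing come down to comparing the achieved recovery time and data-loss window against each system's targets. This Python sketch assumes achieved RTO is measured from the start of the simulated outage to validated service restoration, and achieved RPO as the gap between the outage start and the newest restorable recovery point; `DrTestResult`, `success_rate` and `meets_kpi` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestResult:
    """Timings captured for one system during a DR test (hypothetical record)."""
    system: str
    rto_target: timedelta
    rpo_target: timedelta
    outage_start: datetime         # when the simulated failure or failover began
    service_restored: datetime     # when restoration was validated
    last_recovery_point: datetime  # newest restorable backup or replica

    @property
    def achieved_rto(self) -> timedelta:
        # Downtime actually experienced during the test.
        return self.service_restored - self.outage_start

    @property
    def achieved_rpo(self) -> timedelta:
        # Data that would be lost: outage start minus last recovery point.
        return self.outage_start - self.last_recovery_point

    def met_targets(self) -> bool:
        return (self.achieved_rto <= self.rto_target
                and self.achieved_rpo <= self.rpo_target)


def success_rate(results: list[DrTestResult]) -> float:
    """Percentage of tested systems that met both RTO and RPO targets."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.met_targets())
    return 100.0 * passed / len(results)


def meets_kpi(results: list[DrTestResult], threshold: float = 90.0) -> bool:
    """True when at least `threshold` percent of tested systems met targets."""
    return success_rate(results) >= threshold
```

For example, a system with a 4-hour RTO and 1-hour RPO target that is restored 3 hours after the outage from a recovery point taken 30 minutes beforehand passes both checks.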
4. Reduction in Recovery Gaps & Risk Exposure
Objective
Continuously improve the organisation’s resilience and recovery capability by identifying weaknesses in systems, processes, or infrastructure.
Measurement Criteria
Identify and track:
- Systems without DR coverage
- Infrastructure without replication or backup
- Single points of failure
- Missing documentation
- Recovery steps that fail during testing
KPI Target
- All critical recovery gaps identified and documented within the first 90 days
- Reduction of DR risks by at least 50% within the first 12 months
Deliverables
- DR Risk Register
- Infrastructure Resilience Improvement Plan
- DR Gap Analysis Report
Reporting
- Monthly risk reduction report.
5. Governance, Compliance & Audit Readiness
Objective
Ensure DR capabilities align with regulatory, security, and operational standards, particularly where the organisation operates under formal governance frameworks.
Compliance Areas
- ISO 27001 (Business Continuity & Disaster Recovery controls)
- ITIL Service Continuity Management
- Internal operational governance standards
- Customer contractual SLAs
Measurement Criteria
- DR documentation available for audit
- Evidence of DR testing
- Recovery objectives defined and approved
- DR procedures aligned with business continuity plans
KPI Target
- Zero audit findings related to DR documentation or recovery readiness
- 100% DR documentation stored in controlled repository
Deliverables
- DR Compliance Evidence Pack
- Audit Response Documentation
- DR Governance Reports
Reporting
Audit readiness report submitted annually or when required.
Optional KPI (Highly Recommended)
6. Disaster Recovery Readiness Score
A maturity metric that measures overall readiness.
Components include:
- DR documentation coverage
- Runbook completeness
- Recovery test success rate
- Infrastructure redundancy
- Backup integrity
KPI Target
Maintain DR Readiness Score ≥ 85%
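The posting does not define how the components are combined into a single score. A simple approach is a weighted average of component percentages; the sketch below assumes each component is already expressed as a 0 to 100 score and uses equal weights by default. The component keys, weights and function names are hypothetical.

```python
from __future__ import annotations

# Component keys mirror the list above; equal default weights are an
# assumption, since the posting does not say how components are combined.
COMPONENTS = (
    "documentation_coverage",
    "runbook_completeness",
    "test_success_rate",
    "infrastructure_redundancy",
    "backup_integrity",
)


def readiness_score(scores: dict[str, float],
                    weights: dict[str, float] | None = None) -> float:
    """Weighted average of component scores, each on a 0-100 scale.

    If `weights` is given it must contain an entry for every component.
    """
    if weights is None:
        weights = {name: 1.0 for name in COMPONENTS}
    missing = [name for name in COMPONENTS if name not in scores]
    if missing:
        raise ValueError(f"missing component scores: {missing}")
    for name in COMPONENTS:
        if not 0.0 <= scores[name] <= 100.0:
            raise ValueError(f"{name} must be between 0 and 100")
    total_weight = sum(weights[name] for name in COMPONENTS)
    weighted = sum(scores[name] * weights[name] for name in COMPONENTS)
    return weighted / total_weight


def meets_target(scores: dict[str, float], target: float = 85.0) -> bool:
    """True when the equal-weighted readiness score reaches the target."""
    return readiness_score(scores) >= target
```

With component scores of 100, 95, 90, 70 and 85, the equal-weighted score is 88, which meets the ≥ 85% target even though infrastructure redundancy lags; raising that component's weight would let a weak area pull the overall score down harder.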
Work Location: In person
Title: L2 Disaster Recovery Engineer
Company: AKB Technology
Location: Johannesburg