Disaster Recovery Guide
Overview
This disaster recovery guide describes how to maintain business continuity for the Toto ecosystem after system failures, data loss, or security incidents. It combines strategic planning with actionable procedures.
Recovery Objectives
Recovery Time Objectives (RTO)
- Critical Systems: 4 hours
- Important Systems: 24 hours
- Non-Critical Systems: 72 hours
Recovery Point Objectives (RPO)
- Database: 1 hour
- User Data: 15 minutes
- Application Code: 0 minutes (Git)
- System Logs: 24 hours
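Each RPO target above is effectively a maximum acceptable backup age. The following is a minimal sketch of that check, not an existing Toto utility: the `RPO_MINUTES` map and the `isWithinRpo` helper are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper: RPO targets from the list above, in minutes.
const RPO_MINUTES: Record<string, number> = {
  database: 60,
  userData: 15,
  applicationCode: 0,
  systemLogs: 24 * 60,
};

// Returns true when the newest backup is recent enough to satisfy the RPO.
function isWithinRpo(dataClass: string, lastBackup: Date, now: Date): boolean {
  const limit = RPO_MINUTES[dataClass];
  if (limit === undefined) {
    throw new Error(`Unknown data class: ${dataClass}`);
  }
  const ageMinutes = (now.getTime() - lastBackup.getTime()) / 60_000;
  return ageMinutes <= limit;
}

const checkTime = new Date('2024-01-01T12:00:00Z');
const lastBackupAt = new Date('2024-01-01T11:30:00Z'); // 30 minutes old
console.log(isWithinRpo('database', lastBackupAt, checkTime)); // true
console.log(isWithinRpo('userData', lastBackupAt, checkTime)); // false
```

A check like this could run alongside monitoring so that a stale backup raises an alert before an incident exposes it.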
Risk Assessment
Potential Disasters
| Risk Category | Probability | Impact | Mitigation Priority |
|---|---|---|---|
| Data Loss | Medium | Critical | High |
| Service Outage | Medium | High | High |
| Security Breach | Low | Critical | High |
| Infrastructure Failure | Low | High | Medium |
| Human Error | Medium | Medium | Medium |
| Natural Disaster | Low | High | Low |
Critical Systems
- toto-app: Main pet rescue application
- toto-bo: Backoffice management system
- Database: Firestore data storage
- Authentication: User authentication system
- Payment Processing: Donation handling
Incident Classification
Severity Levels
| Level | Description | Response Time | Escalation |
|---|---|---|---|
| P1 - Critical | Complete service outage, data corruption, security breach | 15 minutes | Immediate |
| P2 - High | Major functionality affected, performance degradation > 50% | 1 hour | 2 hours |
| P3 - Medium | Minor functionality issues, performance degradation < 50% | 4 hours | 8 hours |
| P4 - Low | Cosmetic issues, non-critical features | 24 hours | 48 hours |
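The severity table can be read as a decision rule applied from the most severe condition down. The sketch below is one possible encoding, under stated assumptions: the `IncidentSignals` shape and `classify` function are hypothetical, and a degradation of exactly 50%, which the table leaves undefined, is treated as P3.

```typescript
type Severity = 'P1' | 'P2' | 'P3' | 'P4';

// Hypothetical summary of what responders know about an incident.
interface IncidentSignals {
  completeOutage: boolean;
  dataCorruption: boolean;
  securityBreach: boolean;
  majorFunctionalityAffected: boolean;
  minorFunctionalityAffected: boolean;
  degradationPercent: number; // 0 to 100
}

// Apply the table top-down: the first matching row wins.
function classify(s: IncidentSignals): Severity {
  if (s.completeOutage || s.dataCorruption || s.securityBreach) return 'P1';
  if (s.majorFunctionalityAffected || s.degradationPercent > 50) return 'P2';
  if (s.minorFunctionalityAffected || s.degradationPercent > 0) return 'P3';
  return 'P4';
}

const noSignals: IncidentSignals = {
  completeOutage: false,
  dataCorruption: false,
  securityBreach: false,
  majorFunctionalityAffected: false,
  minorFunctionalityAffected: false,
  degradationPercent: 0,
};

console.log(classify({ ...noSignals, securityBreach: true })); // P1
console.log(classify({ ...noSignals, degradationPercent: 70 })); // P2
console.log(classify({ ...noSignals, degradationPercent: 20 })); // P3
console.log(classify(noSignals)); // P4
```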
Backup Strategy
Firestore Database Backups
// scripts/backup-firestore.ts
import { initializeApp, getApps } from 'firebase/app';
import { getFirestore, collection, getDocs, query, where, Firestore } from 'firebase/firestore';
import { Storage } from '@google-cloud/storage';

export class FirestoreBackupService {
  private db: Firestore;
  private storage: Storage;
  private bucketName: string;

  constructor() {
    // Reuse an already-initialized Firebase app if one exists
    const app = getApps()[0] || initializeApp({
      // Firebase config
    });
    this.db = getFirestore(app);
    this.storage = new Storage();
    this.bucketName = 'toto-backups';
  }

  // Exports every document in the configured collections to one JSON file.
  async createFullBackup(): Promise<string> {
    const timestamp = new Date().toISOString();
    const backupId = `backup-${timestamp}`;
    console.log(`Starting full backup: ${backupId}`);
    try {
      const collections = await this.getCollections();
      const backupData: Record<string, unknown[]> = {};
      for (const collectionName of collections) {
        console.log(`Backing up collection: ${collectionName}`);
        const snapshot = await getDocs(collection(this.db, collectionName));
        // Snapshot metadata carries no timestamps, so only id and data are stored
        backupData[collectionName] = snapshot.docs.map(doc => ({
          id: doc.id,
          data: doc.data()
        }));
      }
      const fileName = `${backupId}/firestore-backup.json`;
      const file = this.storage.bucket(this.bucketName).file(fileName);
      await file.save(JSON.stringify(backupData, null, 2), {
        metadata: {
          contentType: 'application/json',
          metadata: {
            backupId,
            timestamp,
            type: 'full'
          }
        }
      });
      console.log(`Backup completed: ${backupId}`);
      return backupId;
    } catch (error) {
      console.error('Backup failed:', error);
      throw error;
    }
  }

  // Exports only documents changed since lastBackupTime.
  // Assumption: every document maintains an `updatedAt` timestamp field.
  async createIncrementalBackup(lastBackupTime: Date): Promise<string> {
    const timestamp = new Date().toISOString();
    const backupId = `incremental-${timestamp}`;
    console.log(`Starting incremental backup: ${backupId}`);
    try {
      const collections = await this.getCollections();
      const backupData: Record<string, unknown[]> = {};
      for (const collectionName of collections) {
        const changedSince = query(
          collection(this.db, collectionName),
          where('updatedAt', '>', lastBackupTime)
        );
        const snapshot = await getDocs(changedSince);
        backupData[collectionName] = snapshot.docs.map(doc => ({
          id: doc.id,
          data: doc.data()
        }));
      }
      const fileName = `${backupId}/firestore-incremental.json`;
      const file = this.storage.bucket(this.bucketName).file(fileName);
      await file.save(JSON.stringify(backupData, null, 2), {
        metadata: {
          contentType: 'application/json',
          metadata: {
            backupId,
            timestamp,
            type: 'incremental',
            lastBackupTime: lastBackupTime.toISOString()
          }
        }
      });
      console.log(`Incremental backup completed: ${backupId}`);
      return backupId;
    } catch (error) {
      console.error('Incremental backup failed:', error);
      throw error;
    }
  }

  // Hard-coded because the client SDK cannot enumerate collections;
  // add new collections here or they will not be backed up.
  private async getCollections(): Promise<string[]> {
    return [
      'cases',
      'users',
      'donations',
      'guardians',
      'notifications',
      'audit_logs'
    ];
  }
}
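// Automated backup scheduling
export class BackupScheduler {
  private backupService: FirestoreBackupService;

  constructor() {
    this.backupService = new FirestoreBackupService();
  }

  async scheduleBackups(): Promise<void> {
    // Daily full backup at 2 AM
    this.scheduleCron('0 2 * * *', () => {
      this.backupService.createFullBackup()
        .catch(error => console.error('Scheduled full backup failed:', error));
    });
    // Hourly incremental backup covering the previous 24 hours,
    // i.e. everything since the most recent daily full backup
    this.scheduleCron('0 * * * *', () => {
      const lastBackupTime = new Date(Date.now() - 24 * 60 * 60 * 1000);
      this.backupService.createIncrementalBackup(lastBackupTime)
        .catch(error => console.error('Scheduled incremental backup failed:', error));
    });
  }

  // Placeholder: only logs the schedule and never invokes the callback.
  // Wire this to a real scheduler before relying on it.
  private scheduleCron(cronExpression: string, callback: () => void): void {
    console.log(`Scheduled backup: ${cronExpression}`);
  }
}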
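An incremental backup is only smaller than a full one if it keeps just the documents changed since the previous backup. The following sketch shows that selection on plain data. It assumes each record carries an `updatedAt` ISO timestamp; that field, the `BackupRecord` type, and the `changedSince` helper are hypothetical and may not match the real Toto schema.

```typescript
// Hypothetical record shape: each document tracks its own last-modified time.
interface BackupRecord {
  id: string;
  updatedAt: string; // ISO 8601 timestamp
}

// Keep only records modified strictly after the previous backup.
function changedSince(records: BackupRecord[], lastBackupTime: Date): BackupRecord[] {
  const cutoff = lastBackupTime.getTime();
  return records.filter(record => new Date(record.updatedAt).getTime() > cutoff);
}

const sampleRecords: BackupRecord[] = [
  { id: 'case-1', updatedAt: '2024-01-01T01:00:00Z' },
  { id: 'case-2', updatedAt: '2024-01-01T03:00:00Z' },
  { id: 'case-3', updatedAt: '2024-01-01T05:00:00Z' },
];

const previousBackup = new Date('2024-01-01T02:00:00Z');
console.log(changedSince(sampleRecords, previousBackup).map(r => r.id)); // [ 'case-2', 'case-3' ]
```

Deletions are not captured by a filter like this, so a restore that replays incrementals would also need some record of removed documents.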
Code Repository Backups
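#!/bin/bash
# scripts/backup-repositories.sh
# Pushes each repository to its remotes, then archives it to Cloud Storage.
set -euo pipefail

repositories=("toto-app" "toto-bo" "toto-ai-hub" "toto-wallet" "toto-docs")
backup_date="$(date +%Y%m%d)"
mkdir -p backups

for repo in "${repositories[@]}"; do
  echo "Backing up repository: $repo"
  # Push from a subshell so a failed cd cannot leave the script in the wrong directory
  (
    cd "$repo"
    git push origin main
    git push backup main
  )
  archive="backups/${repo}-${backup_date}.tar.gz"
  tar -czf "$archive" "$repo"
  gsutil cp "$archive" gs://toto-backups/repositories/
done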
Emergency Response Procedures
Step 1: Incident Detection & Assessment
Automated Monitoring Alerts
# Check system health
curl -f https://app.betoto.pet/api/health
curl -f https://bo.betoto.pet/api/health
# Check monitoring dashboard
# Access: https://bo.betoto.pet/dashboard/monitoring
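The same health checks can be scripted so that both endpoints are probed together and a network error counts as unhealthy rather than crashing the check. This is a sketch under stated assumptions: `checkHealth`, `Probe`, and `stubProbe` are hypothetical names, and the probe is injected so the example runs offline.

```typescript
// Result of probing one endpoint.
interface HealthResult {
  url: string;
  healthy: boolean;
}

// Minimal fetch-like signature so the real network call can be swapped out.
type Probe = (url: string) => Promise<{ ok: boolean }>;

// Probe every endpoint in parallel; a thrown error (DNS failure, timeout) counts as unhealthy.
async function checkHealth(urls: string[], probe: Probe): Promise<HealthResult[]> {
  return Promise.all(
    urls.map(async (url) => {
      try {
        const response = await probe(url);
        return { url, healthy: response.ok };
      } catch {
        return { url, healthy: false };
      }
    }),
  );
}

const endpoints = [
  'https://app.betoto.pet/api/health',
  'https://bo.betoto.pet/api/health',
];

// Offline stand-in for a real HTTP call: pretends only the app endpoint is up.
const stubProbe: Probe = async (url) => ({ ok: url.includes('app.') });

checkHealth(endpoints, stubProbe).then((results) => console.log(results));
```

On Node 18 or later, passing `(url) => fetch(url)` as the probe would perform the real requests.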
Manual Health Checks
# Check Firebase services
firebase projects:list
firebase use toto-f9d2f
firebase firestore:indexes
# Check deployment status
firebase hosting:sites:list
Step 2: Incident Response Team Activation
On-Call Rotation
- Primary: [Primary Contact]
- Secondary: [Secondary Contact]
- Escalation: [Management Contact]
Communication Channels
- Slack: #incident-response
- Phone: [Emergency Hotline]
- Email: incident@betoto.pet
Step 3: Immediate Response Actions
For P1/P2 Incidents
- Acknowledge Incident (5 minutes)
  - Post in #incident-response channel
  - Create incident ticket
  - Notify stakeholders
- Assess Impact (15 minutes)
  - Determine affected systems
  - Estimate user impact
  - Identify root cause
- Implement Workaround (30 minutes)
  - Deploy hotfix if available
  - Activate backup systems
  - Communicate with users
- Full Resolution (4 hours)
  - Implement permanent fix
  - Verify system stability
  - Update documentation
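The time boxes in the P1/P2 steps above can be turned into concrete deadlines once an incident starts. The sketch below assumes each time box is measured from the start of the incident rather than from the end of the previous phase, which the guide does not state; `PHASES` and `phaseDeadlines` are hypothetical names.

```typescript
// Time boxes from the P1/P2 steps, in minutes from incident start (assumed, not cumulative).
const PHASES: Array<{ name: string; minutes: number }> = [
  { name: 'Acknowledge Incident', minutes: 5 },
  { name: 'Assess Impact', minutes: 15 },
  { name: 'Implement Workaround', minutes: 30 },
  { name: 'Full Resolution', minutes: 4 * 60 },
];

// Compute the due time of every phase for an incident that began at startedAt.
function phaseDeadlines(startedAt: Date): Array<{ name: string; due: string }> {
  return PHASES.map(phase => ({
    name: phase.name,
    due: new Date(startedAt.getTime() + phase.minutes * 60_000).toISOString(),
  }));
}

for (const phase of phaseDeadlines(new Date('2024-01-01T10:00:00Z'))) {
  console.log(`${phase.name}: ${phase.due}`);
}
// Acknowledge Incident: 2024-01-01T10:05:00.000Z
// Assess Impact: 2024-01-01T10:15:00.000Z
// Implement Workaround: 2024-01-01T10:30:00.000Z
// Full Resolution: 2024-01-01T14:00:00.000Z
```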
Data Recovery Procedures
Firestore Database Recovery
Full Database Restore
// scripts/restore-firestore.ts
import { initializeApp, getApps } from 'firebase/app';
import { getFirestore, collection, doc, writeBatch, Firestore } from 'firebase/firestore';
import { Storage } from '@google-cloud/storage';

export class FirestoreRestoreService {
  private db: Firestore;
  private storage: Storage;
  private bucketName: string;

  constructor() {
    // Reuse an already-initialized Firebase app if one exists
    const app = getApps()[0] || initializeApp({
      // Firebase config
    });
    this.db = getFirestore(app);
    this.storage = new Storage();
    this.bucketName = 'toto-backups';
  }

  // Downloads a full backup file and writes every document back to Firestore.
  async restoreFromBackup(backupId: string): Promise<void> {
    console.log(`Starting restore from backup: ${backupId}`);
    try {
      const fileName = `${backupId}/firestore-backup.json`;
      const file = this.storage.bucket(this.bucketName).file(fileName);
      const [backupData] = await file.download();
      const data = JSON.parse(backupData.toString());
      for (const [collectionName, documents] of Object.entries(data)) {
        console.log(`Restoring collection: ${collectionName}`);
        await this.restoreCollection(collectionName, documents as any[]);
      }
      console.log(`Restore completed: ${backupId}`);
    } catch (error) {
      console.error('Restore failed:', error);
      throw error;
    }
  }

  private async restoreCollection(collectionName: string, documents: any[]): Promise<void> {
    const batchSize = 500;
    for (let i = 0; i < documents.length; i += batchSize) {
      // A WriteBatch cannot be reused after commit(), so create one per chunk
      const batch = writeBatch(this.db);
      const batchDocs = documents.slice(i, i + batchSize);
      for (const docData of batchDocs) {
        const docRef = doc(collection(this.db, collectionName), docData.id);
        batch.set(docRef, docData.data);
      }
      await batch.commit();
    }
  }

  // Restores the latest full backup before targetTime, then replays incrementals.
  async restoreToPointInTime(targetTime: Date): Promise<void> {
    console.log(`Restoring to point in time: ${targetTime.toISOString()}`);
    const backupId = await this.findBackupBeforeTime(targetTime);
    if (!backupId) {
      throw new Error('No backup found before target time');
    }
    await this.restoreFromBackup(backupId);
    await this.applyIncrementalBackups(backupId, targetTime);
  }

  // Placeholder: always returns null, so restoreToPointInTime throws until this is implemented.
  private async findBackupBeforeTime(targetTime: Date): Promise<string | null> {
    return null;
  }

  // Placeholder: logs only; no incremental data is applied yet.
  private async applyIncrementalBackups(fromBackupId: string, toTime: Date): Promise<void> {
    console.log('Applying incremental backups...');
  }
}
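Point-in-time recovery depends on choosing the right full backup and the incrementals that follow it. The sketch below shows one way to make that choice from a list of backup IDs, relying on the `backup-<ISO timestamp>` and `incremental-<ISO timestamp>` naming used by the backup script. The helper names are hypothetical, and listing the IDs from the bucket is left out.

```typescript
// Parse the ISO timestamp embedded in an ID such as "backup-2024-01-01T02:00:00.000Z".
function backupTime(backupId: string): Date | null {
  const match = /^(?:backup|incremental)-(.+)$/.exec(backupId);
  if (!match) return null;
  const time = new Date(match[1]);
  return Number.isNaN(time.getTime()) ? null : time;
}

// Latest full backup taken at or before the target time, or null if none exists.
function findFullBackupBefore(backupIds: string[], target: Date): string | null {
  let best: { id: string; time: Date } | null = null;
  for (const id of backupIds) {
    if (!id.startsWith('backup-')) continue;
    const time = backupTime(id);
    if (!time || time.getTime() > target.getTime()) continue;
    if (!best || time.getTime() > best.time.getTime()) best = { id, time };
  }
  return best ? best.id : null;
}

// Incremental backups to replay, oldest first: after the full backup, up to the target.
function incrementalsBetween(backupIds: string[], fullBackupId: string, target: Date): string[] {
  const from = backupTime(fullBackupId);
  if (!from) return [];
  const candidates: Array<{ id: string; time: Date }> = [];
  for (const id of backupIds) {
    if (!id.startsWith('incremental-')) continue;
    const time = backupTime(id);
    if (time && time.getTime() > from.getTime() && time.getTime() <= target.getTime()) {
      candidates.push({ id, time });
    }
  }
  candidates.sort((a, b) => a.time.getTime() - b.time.getTime());
  return candidates.map(candidate => candidate.id);
}

const knownBackups = [
  'backup-2024-01-01T02:00:00.000Z',
  'incremental-2024-01-01T04:00:00.000Z',
  'incremental-2024-01-01T03:00:00.000Z',
  'backup-2024-01-02T02:00:00.000Z',
];
const restoreTarget = new Date('2024-01-01T04:30:00.000Z');
const fullBackup = findFullBackupBefore(knownBackups, restoreTarget);
console.log(fullBackup); // backup-2024-01-01T02:00:00.000Z
if (fullBackup) {
  console.log(incrementalsBetween(knownBackups, fullBackup, restoreTarget));
  // [ 'incremental-2024-01-01T03:00:00.000Z', 'incremental-2024-01-01T04:00:00.000Z' ]
}
```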
Command Line Recovery
# 1. Identify backup to restore
npm run backup:list production
# 2. Restore from backup
npm run backup:restore production [backup-id]
# 3. Confirm indexes are deployed, then spot-check the restored data
firebase firestore:indexes
# Collection-specific recovery
ts-node scripts/restore-collection.ts [collection-name] [backup-id]
# Point-in-time recovery
ts-node scripts/point-in-time-restore.ts [timestamp] [backup-id]
File Storage Recovery
# List available backups
gsutil ls gs://toto-backups-prod/
# Restore files
gsutil -m cp -r gs://toto-backups-prod/[backup-id]/files/ gs://toto-f9d2f.appspot.com/
Application Code Recovery
# Rollback to previous version
git log --oneline
git checkout [commit-hash]
npm run deploy:production
# Emergency hotfix deployment
git checkout -b emergency-fix
# Make minimal changes, then stage them
git add -A
git commit -m "Emergency fix: [description]"
git push origin emergency-fix
npm run deploy:production
System Recovery Procedures
Application Recovery
toto-app Recovery
# 1. Check deployment status
firebase hosting:sites:list --project toto-f9d2f
# 2. Redeploy if necessary
cd toto-app
npm run deploy:production
# 3. Verify deployment
curl -f https://app.betoto.pet/api/health
toto-bo Recovery
# 1. Check deployment status
firebase hosting:sites:list --project toto-bo
# 2. Redeploy if necessary
cd toto-bo
npm run deploy:production
# 3. Verify deployment
curl -f https://bo.betoto.pet/api/health
Infrastructure Recovery
# Check project status
firebase projects:list
# Verify billing (Google Cloud CLI; the Firebase CLI has no billing command)
gcloud billing accounts list
# Check quotas in the Google Cloud console under IAM & Admin > Quotas
# Check DNS settings
nslookup app.betoto.pet
nslookup bo.betoto.pet
# Verify SSL certificates
openssl s_client -connect app.betoto.pet:443 -servername app.betoto.pet
Security Incident Recovery
Security Breach Response
Immediate Actions
- Isolate Affected Systems
  # Export user accounts to identify compromised ones
  firebase auth:export users.json --project toto-f9d2f
  # Review and disable suspicious accounts
- Preserve Evidence
  # Export audit logs
  npm run export:audit-logs
  # Create forensic backup
  npm run backup:forensic
- Notify Stakeholders
  - Internal security team
  - Legal team
  - Affected users (if required)
Recovery Steps
- Patch Vulnerabilities
  - Deploy security updates
  - Update dependencies
  - Review access controls
- Reset Credentials
  - Force password resets
  - Rotate API keys
  - Regenerate certificates
- Monitor for Recurrence
  - Enhanced monitoring
  - Security scanning
  - User activity review
Failover Procedures
Database Failover
// scripts/database-failover.ts
export class DatabaseFailoverService {
  // Switches to the secondary database when the primary fails its health check.
  async activateFailover(): Promise<void> {
    console.log('Activating database failover...');
    try {
      const primaryHealth = await this.checkDatabaseHealth('primary');
      if (!primaryHealth.healthy) {
        await this.activateSecondaryDatabase();
        await this.updateDatabaseConfiguration('secondary');
        await this.verifyFailover();
        console.log('Database failover completed successfully');
      } else {
        console.log('Primary database is healthy; no failover needed');
      }
    } catch (error) {
      console.error('Database failover failed:', error);
      throw error;
    }
  }

  // Placeholder: always reports healthy, so failover never triggers until a real check replaces it.
  private async checkDatabaseHealth(database: string): Promise<{ healthy: boolean }> {
    return { healthy: true };
  }

  // Placeholder: logs only.
  private async activateSecondaryDatabase(): Promise<void> {
    console.log('Activating secondary database...');
  }

  // Placeholder: logs only.
  private async updateDatabaseConfiguration(database: string): Promise<void> {
    console.log(`Updating configuration to use ${database} database...`);
  }

  // Placeholder: logs only.
  private async verifyFailover(): Promise<void> {
    console.log('Verifying failover success...');
  }
}
Service Failover
// scripts/service-failover.ts
export class ServiceFailoverService {
  // Routes traffic to the secondary instance when the primary fails its health check.
  async activateServiceFailover(service: string): Promise<void> {
    console.log(`Activating failover for ${service}...`);
    try {
      const primaryHealth = await this.checkServiceHealth(service, 'primary');
      if (!primaryHealth.healthy) {
        await this.activateSecondaryService(service);
        await this.updateLoadBalancerConfiguration(service, 'secondary');
        await this.verifyServiceFailover(service);
        console.log(`${service} failover completed successfully`);
      } else {
        console.log(`${service} primary is healthy; no failover needed`);
      }
    } catch (error) {
      console.error(`${service} failover failed:`, error);
      throw error;
    }
  }

  // Placeholder: always reports healthy, so failover never triggers until a real check replaces it.
  private async checkServiceHealth(service: string, instance: string): Promise<{ healthy: boolean }> {
    return { healthy: true };
  }

  // Placeholder: logs only.
  private async activateSecondaryService(service: string): Promise<void> {
    console.log(`Activating secondary ${service} instance...`);
  }

  // Placeholder: logs only.
  private async updateLoadBalancerConfiguration(service: string, instance: string): Promise<void> {
    console.log(`Updating load balancer for ${service} to use ${instance}...`);
  }

  // Placeholder: logs only.
  private async verifyServiceFailover(service: string): Promise<void> {
    console.log(`Verifying ${service} failover success...`);
  }
}
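Both failover services act on a single health-check result, so one transient failure could trigger a switch. A common refinement is to require several consecutive failures first. The sketch below illustrates that idea only; the `FailoverMonitor` class and its threshold of 3 are hypothetical and not part of the scripts above.

```typescript
// Tracks consecutive failed health checks and signals when failover should start.
class FailoverMonitor {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Record one health-check result; returns true once the threshold is reached.
  record(healthy: boolean): boolean {
    this.consecutiveFailures = healthy ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.threshold;
  }
}

const monitor = new FailoverMonitor(3);
const decisions = [false, false, true, false, false, false].map(h => monitor.record(h));
console.log(decisions); // [ false, false, false, false, false, true ]
```

A successful check resets the counter, so only an uninterrupted run of failures leads to failover.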
Business Continuity Procedures
Service Degradation Response
Performance Issues
# 1. Check system metrics
curl https://bo.betoto.pet/api/monitoring/system-health
# 2. Scale resources if needed
# Raise runConfig.minInstances / runConfig.maxInstances (for example 2 and 10)
# in apphosting.yaml, then redeploy
# 3. Implement rate limiting
# (Already configured in middleware)
Partial Outage
# 1. Activate maintenance mode
echo "MAINTENANCE_MODE=true" >> .env.production
# 2. Deploy maintenance page
npm run deploy:maintenance
# 3. Communicate with users
# Send notifications via email/SMS
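How the application reacts to `MAINTENANCE_MODE` is not shown in this guide. The following sketch is one way a request handler might interpret the flag; it assumes the health endpoint should stay reachable so monitoring keeps working, and `isMaintenanceMode` and `maintenanceResponse` are hypothetical names.

```typescript
type Env = Record<string, string | undefined>;

// Interpret the MAINTENANCE_MODE flag as written to .env.production.
function isMaintenanceMode(env: Env): boolean {
  return (env.MAINTENANCE_MODE ?? '').trim().toLowerCase() === 'true';
}

// Returns the response to send while in maintenance, or null to handle the request normally.
function maintenanceResponse(path: string, env: Env): { status: number; body: string } | null {
  if (!isMaintenanceMode(env)) return null;
  // Assumption: keep the health endpoint reachable so monitoring still works.
  if (path === '/api/health') return null;
  return { status: 503, body: 'Service temporarily unavailable for maintenance' };
}

console.log(maintenanceResponse('/cases', { MAINTENANCE_MODE: 'true' })?.status); // 503
console.log(maintenanceResponse('/api/health', { MAINTENANCE_MODE: 'true' })); // null
console.log(maintenanceResponse('/cases', {})); // null
```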
Communication Procedures
User Communication
- Status Page Updates
  - Update status.betoto.pet
  - Provide estimated resolution time
  - Regular progress updates
- Direct Notifications
  - Email notifications
  - SMS alerts (for critical users)
  - In-app notifications
- Social Media
  - Twitter updates
  - LinkedIn posts
  - Community forum updates
Recovery Testing
Regular Testing Schedule
Monthly Tests
- Backup restoration tests
- Failover procedure tests
- Communication procedure tests
Quarterly Tests
- Full disaster recovery simulation
- Security incident response
- Business continuity testing
Test Scenarios
# Scenario 1: Database Corruption
# (the "--" separator passes the flag to the script instead of npm)
npm run test:disaster-recovery -- --scenario=database-corruption
npm run backup:restore test-backup
# Scenario 2: Application Failure
npm run test:disaster-recovery -- --scenario=application-failure
git checkout [previous-commit]
npm run deploy:production
# Scenario 3: Security Breach
npm run test:disaster-recovery -- --scenario=security-breach
npm run security:incident-response
Recovery Checklists
Pre-Recovery Checklist
- Incident severity assessed
- Response team activated
- Stakeholders notified
- Evidence preserved
- Recovery plan approved
- Identify affected systems
- Assess data loss extent
- Determine recovery strategy
- Prepare recovery environment
- Verify backup integrity
During Recovery Checklist
- Backup integrity verified
- Recovery procedures followed
- System functionality tested
- Data integrity confirmed
- Security measures implemented
- Restore database from backup
- Apply incremental backups
- Deploy applications
- Restore configurations
- Verify service health
Post-Recovery Checklist
- System fully operational
- Users notified of resolution
- Incident documented
- Root cause analysis completed
- Prevention measures implemented
- Recovery procedures updated
- Complete functionality testing
- Performance verification
- Security validation
- User acceptance testing
Recovery Tools & Scripts
Automated Recovery Scripts
# Full system recovery
./scripts/full-recovery.sh [environment]
# Database recovery
./scripts/database-recovery.sh [backup-id]
# Application recovery
./scripts/app-recovery.sh [version]
# Security recovery
./scripts/security-recovery.sh [incident-id]
Monitoring & Alerting
# Check recovery status
npm run recovery:status
# Monitor recovery progress
npm run recovery:monitor
# Send recovery notifications
npm run recovery:notify
Emergency Contacts
Internal Contacts
| Role | Name | Phone | Email |
|---|---|---|---|
| Incident Commander | [Name] | [Phone] | [Email] |
| Technical Lead | [Name] | [Phone] | [Email] |
| Security Lead | [Name] | [Phone] | [Email] |
| Communications Lead | [Name] | [Phone] | [Email] |
External Contacts
| Service | Contact | Phone | Email |
|---|---|---|---|
| Firebase Support | [Contact] | [Phone] | [Email] |
| Google Cloud Support | [Contact] | [Phone] | [Email] |
| Stripe Support | [Contact] | [Phone] | [Email] |
| Domain Registrar | [Contact] | [Phone] | [Email] |
Escalation Matrix
- Level 1: On-call engineer (15 minutes)
- Level 2: Technical lead (1 hour)
- Level 3: Engineering manager (2 hours)
- Level 4: CTO/VP Engineering (4 hours)
- Level 5: CEO/Executive team (8 hours)
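The escalation matrix can be evaluated mechanically from how long an incident has been open. The sketch below reads each time as "engage this level if the incident is still unresolved after this long", which is an assumption since the matrix does not say so explicitly; `ESCALATION` and `levelsToEngage` are hypothetical names.

```typescript
// Escalation matrix from the list above; afterMinutes is the assumed engagement threshold.
const ESCALATION: Array<{ level: number; role: string; afterMinutes: number }> = [
  { level: 1, role: 'On-call engineer', afterMinutes: 15 },
  { level: 2, role: 'Technical lead', afterMinutes: 60 },
  { level: 3, role: 'Engineering manager', afterMinutes: 2 * 60 },
  { level: 4, role: 'CTO/VP Engineering', afterMinutes: 4 * 60 },
  { level: 5, role: 'CEO/Executive team', afterMinutes: 8 * 60 },
];

// Roles that should already be engaged for an incident open this many minutes.
function levelsToEngage(elapsedMinutes: number): string[] {
  return ESCALATION
    .filter(entry => elapsedMinutes >= entry.afterMinutes)
    .map(entry => entry.role);
}

console.log(levelsToEngage(10)); // []
console.log(levelsToEngage(90)); // [ 'On-call engineer', 'Technical lead' ]
console.log(levelsToEngage(480).length); // 5
```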
Documentation Updates
Post-Incident Actions
- Incident Report
  - Timeline of events
  - Root cause analysis
  - Impact assessment
  - Lessons learned
- Procedure Updates
  - Update recovery procedures
  - Improve monitoring
  - Enhance automation
  - Train team members
- Prevention Measures
  - Implement additional safeguards
  - Update security measures
  - Improve testing procedures
  - Enhance documentation
This disaster recovery guide is intended to help the Toto ecosystem recover quickly and effectively from the disaster scenarios described above while maintaining business continuity.