
Disaster Recovery Guide

Overview​

This guide describes how to maintain business continuity for the Toto ecosystem after system failures, data loss, or security incidents. It combines strategic planning with actionable procedures.

Recovery Objectives​

Recovery Time Objectives (RTO)​

  • Critical Systems: 4 hours
  • Important Systems: 24 hours
  • Non-Critical Systems: 72 hours

Recovery Point Objectives (RPO)​

  • Database: 1 hour
  • User Data: 15 minutes
  • Application Code: 0 minutes (Git)
  • System Logs: 24 hours
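
The RPO targets above can be checked automatically against the age of the most recent backup. The following is a minimal sketch, not part of the Toto codebase: the `rpoMinutes` map and the `isWithinRpo` helper are hypothetical names, and the values are copied from the list above.

```typescript
// Hypothetical RPO targets in minutes, copied from the list above.
type DataClass = 'database' | 'userData' | 'applicationCode' | 'systemLogs';

const rpoMinutes: Record<DataClass, number> = {
  database: 60,
  userData: 15,
  applicationCode: 0, // every commit is already stored in Git
  systemLogs: 24 * 60,
};

// Returns true when the newest backup is recent enough to satisfy the RPO.
function isWithinRpo(dataClass: DataClass, lastBackupAt: Date, now: Date): boolean {
  const ageMinutes = (now.getTime() - lastBackupAt.getTime()) / 60000;
  return ageMinutes <= rpoMinutes[dataClass];
}

// Example: the last backup finished 40 minutes before the check.
const checkTime = new Date('2024-01-01T12:00:00Z');
const lastBackup = new Date('2024-01-01T11:20:00Z');
console.log(isWithinRpo('database', lastBackup, checkTime)); // true: 40 min <= 60 min
console.log(isWithinRpo('userData', lastBackup, checkTime)); // false: 40 min > 15 min
```

Under this check, an hourly backup schedule on its own would not satisfy the 15-minute user data target, so that data class would need a more frequent or continuous backup mechanism.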

Risk Assessment​

Potential Disasters​

Risk Category | Probability | Impact | Mitigation Priority
Data Loss | Medium | Critical | High
Service Outage | Medium | High | High
Security Breach | Low | Critical | High
Infrastructure Failure | Low | High | Medium
Human Error | Medium | Medium | Medium
Natural Disaster | Low | High | Low

Critical Systems​

  • toto-app: Main pet rescue application
  • toto-bo: Backoffice management system
  • Database: Firestore data storage
  • Authentication: User authentication system
  • Payment Processing: Donation handling

Incident Classification​

Severity Levels​

Level | Description | Response Time | Escalation
P1 - Critical | Complete service outage, data corruption, security breach | 15 minutes | Immediate
P2 - High | Major functionality affected, performance degradation > 50% | 1 hour | 2 hours
P3 - Medium | Minor functionality issues, performance degradation < 50% | 4 hours | 8 hours
P4 - Low | Cosmetic issues, non-critical features | 24 hours | 48 hours
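
The severity levels can be encoded as data so that tooling computes the response and escalation deadlines for an incident. This is a sketch under assumptions: `severityPolicies` and `incidentDeadlines` are hypothetical names, both windows are measured from the detection time, and "Immediate" escalation is modeled as 0 minutes.

```typescript
// Response and escalation windows in minutes, taken from the severity levels above.
// An escalation window of 0 means "escalate immediately".
type SeverityLevel = 'P1' | 'P2' | 'P3' | 'P4';

interface SeverityPolicy {
  responseMinutes: number;
  escalationMinutes: number;
}

const severityPolicies: Record<SeverityLevel, SeverityPolicy> = {
  P1: { responseMinutes: 15, escalationMinutes: 0 },
  P2: { responseMinutes: 60, escalationMinutes: 120 },
  P3: { responseMinutes: 240, escalationMinutes: 480 },
  P4: { responseMinutes: 1440, escalationMinutes: 2880 },
};

// Computes both deadlines relative to the moment the incident was detected.
function incidentDeadlines(level: SeverityLevel, detectedAt: Date) {
  const policy = severityPolicies[level];
  const addMinutes = (minutes: number) => new Date(detectedAt.getTime() + minutes * 60000);
  return {
    respondBy: addMinutes(policy.responseMinutes),
    escalateBy: addMinutes(policy.escalationMinutes),
  };
}

const p2Deadlines = incidentDeadlines('P2', new Date('2024-01-01T10:00:00Z'));
console.log(p2Deadlines.respondBy.toISOString());  // 2024-01-01T11:00:00.000Z
console.log(p2Deadlines.escalateBy.toISOString()); // 2024-01-01T12:00:00.000Z
```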

Backup Strategy​

Firestore Database Backups​

// scripts/backup-firestore.ts
import { initializeApp, getApps } from 'firebase/app';
import { getFirestore, collection, getDocs, writeBatch } from 'firebase/firestore';
import { Storage } from '@google-cloud/storage';

export class FirestoreBackupService {
private db: any;
private storage: Storage;
private bucketName: string;

constructor() {
const app = getApps()[0] || initializeApp({
// Firebase config
});
this.db = getFirestore(app);
this.storage = new Storage();
this.bucketName = 'toto-backups';
}

async createFullBackup(): Promise<string> {
const timestamp = new Date().toISOString();
const backupId = `backup-${timestamp}`;

console.log(`Starting full backup: ${backupId}`);

try {
const collections = await this.getCollections();
const backupData: any = {};

for (const collectionName of collections) {
console.log(`Backing up collection: ${collectionName}`);
const snapshot = await getDocs(collection(this.db, collectionName));
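backupData[collectionName] = snapshot.docs.map(doc => ({
id: doc.id,
data: doc.data(),
// SnapshotMetadata exposes only fromCache and hasPendingWrites (there is no
// serverTimestamp), so record whether this copy came from the local cache.
fromCache: doc.metadata.fromCache
}));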
}

const fileName = `${backupId}/firestore-backup.json`;
const file = this.storage.bucket(this.bucketName).file(fileName);

await file.save(JSON.stringify(backupData, null, 2), {
metadata: {
contentType: 'application/json',
metadata: {
backupId,
timestamp,
type: 'full'
}
}
});

console.log(`Backup completed: ${backupId}`);
return backupId;
} catch (error) {
console.error('Backup failed:', error);
throw error;
}
}

async createIncrementalBackup(lastBackupTime: Date): Promise<string> {
const timestamp = new Date().toISOString();
const backupId = `incremental-${timestamp}`;

console.log(`Starting incremental backup: ${backupId}`);

try {
const collections = await this.getCollections();
const backupData: any = {};

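for (const collectionName of collections) {
const snapshot = await getDocs(collection(this.db, collectionName));
// Keep only documents changed since the last backup. Assumption: documents
// carry an `updatedAt` Firestore Timestamp field. Documents without one are
// included so that nothing is silently dropped from the backup.
backupData[collectionName] = snapshot.docs
.filter(doc => {
const updatedAt = doc.data().updatedAt;
if (typeof updatedAt?.toDate !== 'function') return true;
return updatedAt.toDate().getTime() > lastBackupTime.getTime();
})
.map(doc => ({
id: doc.id,
data: doc.data(),
modifiedAt: doc.data().updatedAt?.toDate?.().toISOString() ?? null
}));
}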

const fileName = `${backupId}/firestore-incremental.json`;
const file = this.storage.bucket(this.bucketName).file(fileName);

await file.save(JSON.stringify(backupData, null, 2), {
metadata: {
contentType: 'application/json',
metadata: {
backupId,
timestamp,
type: 'incremental',
lastBackupTime: lastBackupTime.toISOString()
}
}
});

console.log(`Incremental backup completed: ${backupId}`);
return backupId;
} catch (error) {
console.error('Incremental backup failed:', error);
throw error;
}
}

private async getCollections(): Promise<string[]> {
return [
'cases',
'users',
'donations',
'guardians',
'notifications',
'audit_logs'
];
}
}

// Automated backup scheduling
export class BackupScheduler {
private backupService: FirestoreBackupService;

constructor() {
this.backupService = new FirestoreBackupService();
}

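async scheduleBackups(): Promise<void> {
// Daily full backup at 2 AM
this.scheduleCron('0 2 * * *', () => {
this.backupService.createFullBackup()
.catch(error => console.error('Scheduled full backup failed:', error));
});

// Hourly incremental backup, covering changes since the last daily full backup
// (approximated here as the previous 24 hours)
this.scheduleCron('0 * * * *', () => {
const lastBackupTime = new Date(Date.now() - 24 * 60 * 60 * 1000);
this.backupService.createIncrementalBackup(lastBackupTime)
.catch(error => console.error('Scheduled incremental backup failed:', error));
});
}

private scheduleCron(cronExpression: string, callback: () => void): void {
// Placeholder: this only logs the schedule and never invokes the callback.
// Wire it to a real scheduler (a cron library or a scheduled cloud job) before relying on it.
console.log(`Scheduled backup: ${cronExpression}`);
}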
}

Code Repository Backups​

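#!/bin/bash
# scripts/backup-repositories.sh
# Assumes each repository is checked out in the current directory
# and has a second Git remote named "backup".
set -euo pipefail

repositories=("toto-app" "toto-bo" "toto-ai-hub" "toto-wallet" "toto-docs")
stamp="$(date +%Y%m%d)"
mkdir -p backups

for repo in "${repositories[@]}"; do
  echo "Backing up repository: $repo"
  # Push from a subshell so a failed cd cannot leave the loop in the wrong directory
  (
    cd "$repo"
    git push origin main
    git push backup main
  )
  archive="backups/${repo}-${stamp}.tar.gz"
  tar -czf "$archive" "$repo"
  gsutil cp "$archive" gs://toto-backups/repositories/
done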

Emergency Response Procedures​

Step 1: Incident Detection & Assessment​

Automated Monitoring Alerts​

# Check system health
curl -f https://app.betoto.pet/api/health
curl -f https://bo.betoto.pet/api/health

# Check monitoring dashboard
# Access: https://bo.betoto.pet/dashboard/monitoring
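
The same probes can be run from a script so that results feed into alerting. This is a minimal sketch, assuming a hypothetical `checkEndpoints` helper; the fetch function is injected so the check can be exercised without network access.

```typescript
// Minimal shape of the fetch-like function the checker depends on.
type HealthFetch = (url: string) => Promise<{ ok: boolean; status: number }>;

interface HealthResult {
  url: string;
  healthy: boolean;
  detail: string;
}

// Probes every endpoint in parallel; a network error counts as unhealthy.
async function checkEndpoints(urls: string[], fetchFn: HealthFetch): Promise<HealthResult[]> {
  return Promise.all(
    urls.map(async (url): Promise<HealthResult> => {
      try {
        const response = await fetchFn(url);
        return { url, healthy: response.ok, detail: `HTTP ${response.status}` };
      } catch (error) {
        return { url, healthy: false, detail: String(error) };
      }
    })
  );
}

// Health endpoints taken from the curl commands above.
const healthEndpoints = [
  'https://app.betoto.pet/api/health',
  'https://bo.betoto.pet/api/health',
];
```

On Node 18 or later, `checkEndpoints(healthEndpoints, (url) => fetch(url))` uses the built-in `fetch`. Treating any non-2xx status as unhealthy roughly mirrors `curl -f`, which fails on HTTP error statuses.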

Manual Health Checks​

# Check Firebase services
firebase projects:list
firebase use toto-f9d2f
firebase firestore:indexes

# Check deployment status
firebase hosting:sites:list

Step 2: Incident Response Team Activation​

On-Call Rotation​

  • Primary: [Primary Contact]
  • Secondary: [Secondary Contact]
  • Escalation: [Management Contact]

Communication Channels​

Step 3: Immediate Response Actions​

For P1/P2 Incidents​

  1. Acknowledge Incident (5 minutes)

    • Post in #incident-response channel
    • Create incident ticket
    • Notify stakeholders
  2. Assess Impact (15 minutes)

    • Determine affected systems
    • Estimate user impact
    • Identify root cause
  3. Implement Workaround (30 minutes)

    • Deploy hotfix if available
    • Activate backup systems
    • Communicate with users
  4. Full Resolution (4 hours)

    • Implement permanent fix
    • Verify system stability
    • Update documentation

Data Recovery Procedures​

Firestore Database Recovery​

Full Database Restore​

// scripts/restore-firestore.ts
import { initializeApp, getApps } from 'firebase/app';
import { getFirestore, collection, doc, setDoc, writeBatch } from 'firebase/firestore';
import { Storage } from '@google-cloud/storage';

export class FirestoreRestoreService {
private db: any;
private storage: Storage;
private bucketName: string;

constructor() {
const app = getApps()[0] || initializeApp({
// Firebase config
});
this.db = getFirestore(app);
this.storage = new Storage();
this.bucketName = 'toto-backups';
}

async restoreFromBackup(backupId: string): Promise<void> {
console.log(`Starting restore from backup: ${backupId}`);

try {
const fileName = `${backupId}/firestore-backup.json`;
const file = this.storage.bucket(this.bucketName).file(fileName);
const [backupData] = await file.download();
const data = JSON.parse(backupData.toString());

for (const [collectionName, documents] of Object.entries(data)) {
console.log(`Restoring collection: ${collectionName}`);
await this.restoreCollection(collectionName, documents as any[]);
}

console.log(`Restore completed: ${backupId}`);
} catch (error) {
console.error('Restore failed:', error);
throw error;
}
}

private async restoreCollection(collectionName: string, documents: any[]): Promise<void> {
// Write in chunks to stay within Firestore's per-batch write limit
const batchSize = 500;

for (let i = 0; i < documents.length; i += batchSize) {
// A WriteBatch cannot be reused after commit(), so create one per chunk
const batch = writeBatch(this.db);
const batchDocs = documents.slice(i, i + batchSize);

for (const docData of batchDocs) {
const docRef = doc(collection(this.db, collectionName), docData.id);
batch.set(docRef, docData.data);
}

await batch.commit();
}
}

async restoreToPointInTime(targetTime: Date): Promise<void> {
console.log(`Restoring to point in time: ${targetTime.toISOString()}`);
const backupId = await this.findBackupBeforeTime(targetTime);

if (!backupId) {
throw new Error('No backup found before target time');
}

await this.restoreFromBackup(backupId);
await this.applyIncrementalBackups(backupId, targetTime);
}

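private async findBackupBeforeTime(targetTime: Date): Promise<string | null> {
// Placeholder: always returns null, so restoreToPointInTime() throws until
// this looks up the latest full backup taken at or before targetTime.
return null;
}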

private async applyIncrementalBackups(fromBackupId: string, toTime: Date): Promise<void> {
console.log('Applying incremental backups...');
}
}
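
One way to fill in the `findBackupBeforeTime` placeholder is to parse the timestamp embedded in each backup ID. The sketch below assumes IDs follow the `backup-<ISO timestamp>` pattern produced by `createFullBackup`; `parseBackupTime` and `latestBackupBefore` are hypothetical helpers, and listing the IDs from the bucket is left out.

```typescript
const FULL_BACKUP_PREFIX = 'backup-';

// Extracts the timestamp from an ID such as "backup-2024-01-01T02:00:00.000Z".
// Returns null for IDs that are not full backups or do not parse as a date.
function parseBackupTime(backupId: string): Date | null {
  if (!backupId.startsWith(FULL_BACKUP_PREFIX)) return null;
  const time = new Date(backupId.slice(FULL_BACKUP_PREFIX.length));
  return Number.isNaN(time.getTime()) ? null : time;
}

// Picks the most recent full backup taken at or before targetTime.
function latestBackupBefore(backupIds: string[], targetTime: Date): string | null {
  let best: { id: string; time: number } | null = null;
  for (const id of backupIds) {
    const time = parseBackupTime(id)?.getTime();
    if (time === undefined || time > targetTime.getTime()) continue;
    if (best === null || time > best.time) best = { id, time };
  }
  return best ? best.id : null;
}

const knownBackupIds = [
  'backup-2024-01-01T02:00:00.000Z',
  'incremental-2024-01-01T03:00:00.000Z',
  'backup-2024-01-02T02:00:00.000Z',
];
console.log(latestBackupBefore(knownBackupIds, new Date('2024-01-01T12:00:00Z')));
// backup-2024-01-01T02:00:00.000Z
```

Incremental IDs are ignored here because a point-in-time restore starts from a full backup and then applies the incrementals on top.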

Command Line Recovery​

# 1. Identify backup to restore
npm run backup:list production

# 2. Restore from backup
npm run backup:restore production [backup-id]

# 3. Confirm Firestore is reachable (this lists indexes; it does not validate document data)
firebase firestore:indexes

# Collection-specific recovery
ts-node scripts/restore-collection.ts [collection-name] [backup-id]

# Point-in-time recovery
ts-node scripts/point-in-time-restore.ts [timestamp] [backup-id]

File Storage Recovery​

# List available backups
gsutil ls gs://toto-backups-prod/

# Restore files
gsutil -m cp -r gs://toto-backups-prod/[backup-id]/files/ gs://toto-f9d2f.appspot.com/

Application Code Recovery​

# Rollback to previous version
git log --oneline
git checkout [commit-hash]
npm run deploy:production

# Emergency hotfix deployment
git checkout -b emergency-fix
# Make minimal changes, then stage and commit them
git add -A
git commit -m "Emergency fix: [description]"
git push origin emergency-fix
npm run deploy:production

System Recovery Procedures​

Application Recovery​

toto-app Recovery​

# 1. Check deployment status
firebase hosting:sites:list --project toto-f9d2f

# 2. Redeploy if necessary
cd toto-app
npm run deploy:production

# 3. Verify deployment
curl -f https://app.betoto.pet/api/health

toto-bo Recovery​

# 1. Check deployment status
firebase hosting:sites:list --project toto-bo

# 2. Redeploy if necessary
cd toto-bo
npm run deploy:production

# 3. Verify deployment
curl -f https://bo.betoto.pet/api/health

Infrastructure Recovery​

# Check project status
firebase projects:list

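# Verify billing (the Firebase CLI has no billing command; use gcloud)
gcloud billing projects describe toto-f9d2f

# Check quotas
# The Firebase CLI has no quota command; review usage in the Google Cloud console under IAM & Admin > Quotas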

# Check DNS settings
nslookup app.betoto.pet
nslookup bo.betoto.pet

# Verify SSL certificates
openssl s_client -connect app.betoto.pet:443 -servername app.betoto.pet

Security Incident Recovery​

Security Breach Response​

Immediate Actions​

  1. Isolate Affected Systems

    # Export the user list for review (exporting alone does not disable any account)
    firebase auth:export users.json --project toto-f9d2f
    # Review the export and disable suspicious accounts
  2. Preserve Evidence

    # Export audit logs
    npm run export:audit-logs
    # Create forensic backup
    npm run backup:forensic
  3. Notify Stakeholders

    • Internal security team
    • Legal team
    • Affected users (if required)

Recovery Steps​

  1. Patch Vulnerabilities

    • Deploy security updates
    • Update dependencies
    • Review access controls
  2. Reset Credentials

    • Force password resets
    • Rotate API keys
    • Regenerate certificates
  3. Monitor for Recurrence

    • Enhanced monitoring
    • Security scanning
    • User activity review

Failover Procedures​

Database Failover​

// scripts/database-failover.ts
export class DatabaseFailoverService {
async activateFailover(): Promise<void> {
console.log('Activating database failover...');

try {
const primaryHealth = await this.checkDatabaseHealth('primary');

if (!primaryHealth.healthy) {
await this.activateSecondaryDatabase();
await this.updateDatabaseConfiguration('secondary');
await this.verifyFailover();
console.log('Database failover completed successfully');
}
} catch (error) {
console.error('Database failover failed:', error);
throw error;
}
}

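private async checkDatabaseHealth(database: string): Promise<{ healthy: boolean }> {
// Placeholder: always reports healthy, so activateFailover() never fails over
// until a real health probe is implemented here.
return { healthy: true };
}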

private async activateSecondaryDatabase(): Promise<void> {
console.log('Activating secondary database...');
}

private async updateDatabaseConfiguration(database: string): Promise<void> {
console.log(`Updating configuration to use ${database} database...`);
}

private async verifyFailover(): Promise<void> {
console.log('Verifying failover success...');
}
}

Service Failover​

// scripts/service-failover.ts
export class ServiceFailoverService {
async activateServiceFailover(service: string): Promise<void> {
console.log(`Activating failover for ${service}...`);

try {
const primaryHealth = await this.checkServiceHealth(service, 'primary');

if (!primaryHealth.healthy) {
await this.activateSecondaryService(service);
await this.updateLoadBalancerConfiguration(service, 'secondary');
await this.verifyServiceFailover(service);
console.log(`${service} failover completed successfully`);
}
} catch (error) {
console.error(`${service} failover failed:`, error);
throw error;
}
}

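private async checkServiceHealth(service: string, instance: string): Promise<{ healthy: boolean }> {
// Placeholder: always reports healthy, so activateServiceFailover() never
// fails over until a real health probe is implemented here.
return { healthy: true };
}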

private async activateSecondaryService(service: string): Promise<void> {
console.log(`Activating secondary ${service} instance...`);
}

private async updateLoadBalancerConfiguration(service: string, instance: string): Promise<void> {
console.log(`Updating load balancer for ${service} to use ${instance}...`);
}

private async verifyServiceFailover(service: string): Promise<void> {
console.log(`Verifying ${service} failover success...`);
}
}
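
Both failover classes follow the same sequence: check the primary, activate the secondary, redirect traffic, then verify. A sketch of that sequence with injected steps is shown below; `FailoverSteps` and `runFailover` are hypothetical names, and the steps themselves still need real implementations. Injecting the steps lets a recovery drill exercise the ordering with fakes.

```typescript
// Hypothetical dependencies standing in for the placeholder methods above.
interface FailoverSteps {
  isPrimaryHealthy: () => Promise<boolean>;
  activateSecondary: () => Promise<void>;
  switchTraffic: () => Promise<void>;
  verify: () => Promise<boolean>;
}

type FailoverOutcome = 'not-needed' | 'failed-over';

// Runs the steps in order; any rejected step or failed verification aborts the sequence.
async function runFailover(steps: FailoverSteps): Promise<FailoverOutcome> {
  if (await steps.isPrimaryHealthy()) return 'not-needed';
  await steps.activateSecondary();
  await steps.switchTraffic();
  if (!(await steps.verify())) {
    throw new Error('Failover verification failed; secondary is not serving traffic');
  }
  return 'failed-over';
}
```

Unlike the placeholder classes, this version treats a failed verification as an error instead of logging success unconditionally.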

Business Continuity Procedures​

Service Degradation Response​

Performance Issues​

# 1. Check system metrics
curl https://bo.betoto.pet/api/monitoring/system-health

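# 2. Scale resources if needed
# Note: confirm this command exists in your Firebase CLI version. App Hosting instance limits
# are normally set with runConfig.minInstances and runConfig.maxInstances in apphosting.yaml.
firebase apphosting:instances:scale --min-instances 2 --max-instances 10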

# 3. Implement rate limiting
# (Already configured in middleware)

Partial Outage​

# 1. Activate maintenance mode
echo "MAINTENANCE_MODE=true" >> .env.production

# 2. Deploy maintenance page
npm run deploy:maintenance

# 3. Communicate with users
# Send notifications via email/SMS
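
On the application side, the `MAINTENANCE_MODE` flag has to be read somewhere in the request path. The sketch below shows one possible shape, assuming hypothetical `isMaintenanceMode` and `maintenanceResponse` helpers; how the response object is wired into the actual middleware is not covered here.

```typescript
// Reads the flag written to .env.production; anything other than "true" is off.
function isMaintenanceMode(env: Record<string, string | undefined>): boolean {
  return (env['MAINTENANCE_MODE'] ?? '').trim().toLowerCase() === 'true';
}

// Hypothetical response builder: HTTP 503 with Retry-After while maintenance is active,
// or null when requests should be handled normally.
function maintenanceResponse(env: Record<string, string | undefined>, retryAfterSeconds = 600) {
  if (!isMaintenanceMode(env)) return null;
  return {
    status: 503,
    headers: { 'Retry-After': String(retryAfterSeconds) },
    body: 'Service temporarily unavailable for maintenance.',
  };
}
```

In a Node process the first argument would typically be `process.env`. Returning 503 with `Retry-After` tells clients and crawlers that the outage is temporary.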

Communication Procedures​

User Communication​

  1. Status Page Updates

    • Update status.betoto.pet
    • Provide estimated resolution time
    • Regular progress updates
  2. Direct Notifications

    • Email notifications
    • SMS alerts (for critical users)
    • In-app notifications
  3. Social Media

    • Twitter updates
    • LinkedIn posts
    • Community forum updates

Recovery Testing​

Regular Testing Schedule​

Monthly Tests​

  • Backup restoration tests
  • Failover procedure tests
  • Communication procedure tests

Quarterly Tests​

  • Full disaster recovery simulation
  • Security incident response
  • Business continuity testing

Test Scenarios​

# Scenario 1: Database Corruption
# ("--" forwards the flag to the script instead of letting npm consume it)
npm run test:disaster-recovery -- --scenario=database-corruption
npm run backup:restore test-backup

# Scenario 2: Application Failure
npm run test:disaster-recovery -- --scenario=application-failure
git checkout [previous-commit]
npm run deploy:production

# Scenario 3: Security Breach
npm run test:disaster-recovery -- --scenario=security-breach
npm run security:incident-response

Recovery Checklists​

Pre-Recovery Checklist​

  • Incident severity assessed
  • Response team activated
  • Stakeholders notified
  • Evidence preserved
  • Recovery plan approved
  • Identify affected systems
  • Assess data loss extent
  • Determine recovery strategy
  • Prepare recovery environment
  • Verify backup integrity

During Recovery Checklist​

  • Backup integrity verified
  • Recovery procedures followed
  • System functionality tested
  • Data integrity confirmed
  • Security measures implemented
  • Restore database from backup
  • Apply incremental backups
  • Deploy applications
  • Restore configurations
  • Verify service health

Post-Recovery Checklist​

  • System fully operational
  • Users notified of resolution
  • Incident documented
  • Root cause analysis completed
  • Prevention measures implemented
  • Recovery procedures updated
  • Complete functionality testing
  • Performance verification
  • Security validation
  • User acceptance testing

Recovery Tools & Scripts​

Automated Recovery Scripts​

# Full system recovery
./scripts/full-recovery.sh [environment]

# Database recovery
./scripts/database-recovery.sh [backup-id]

# Application recovery
./scripts/app-recovery.sh [version]

# Security recovery
./scripts/security-recovery.sh [incident-id]

Monitoring & Alerting​

# Check recovery status
npm run recovery:status

# Monitor recovery progress
npm run recovery:monitor

# Send recovery notifications
npm run recovery:notify

Emergency Contacts​

Internal Contacts​

Role | Name | Phone | Email
Incident Commander | [Name] | [Phone] | [Email]
Technical Lead | [Name] | [Phone] | [Email]
Security Lead | [Name] | [Phone] | [Email]
Communications Lead | [Name] | [Phone] | [Email]

External Contacts​

Service | Contact | Phone | Email
Firebase Support | [Contact] | [Phone] | [Email]
Google Cloud Support | [Contact] | [Phone] | [Email]
Stripe Support | [Contact] | [Phone] | [Email]
Domain Registrar | [Contact] | [Phone] | [Email]

Escalation Matrix​

  1. Level 1: On-call engineer (15 minutes)
  2. Level 2: Technical lead (1 hour)
  3. Level 3: Engineering manager (2 hours)
  4. Level 4: CTO/VP Engineering (4 hours)
  5. Level 5: CEO/Executive team (8 hours)
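
The escalation matrix can drive automated paging. The sketch below assumes each time is the number of minutes an incident has been open and unresolved before that level is engaged; `escalationMatrix` and `currentEscalation` are hypothetical names.

```typescript
interface EscalationStep {
  level: number;
  role: string;
  afterMinutes: number;
}

// Thresholds copied from the escalation matrix above.
const escalationMatrix: EscalationStep[] = [
  { level: 1, role: 'On-call engineer', afterMinutes: 15 },
  { level: 2, role: 'Technical lead', afterMinutes: 60 },
  { level: 3, role: 'Engineering manager', afterMinutes: 120 },
  { level: 4, role: 'CTO/VP Engineering', afterMinutes: 240 },
  { level: 5, role: 'CEO/Executive team', afterMinutes: 480 },
];

// Returns the highest step whose threshold has elapsed, or null before the first one.
function currentEscalation(elapsedMinutes: number): EscalationStep | null {
  let reached: EscalationStep | null = null;
  for (const step of escalationMatrix) {
    if (elapsedMinutes >= step.afterMinutes) reached = step;
  }
  return reached;
}

console.log(currentEscalation(90)?.role); // Technical lead
```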

Documentation Updates​

Post-Incident Actions​

  1. Incident Report

    • Timeline of events
    • Root cause analysis
    • Impact assessment
    • Lessons learned
  2. Procedure Updates

    • Update recovery procedures
    • Improve monitoring
    • Enhance automation
    • Train team members
  3. Prevention Measures

    • Implement additional safeguards
    • Update security measures
    • Improve testing procedures
    • Enhance documentation

This guide is intended to help the Toto ecosystem recover quickly and effectively from the disaster scenarios described above while maintaining business continuity.