Fleet Operations Runbook

Quick Reference

Emergency Contacts

  • PagerDuty: (for Tier 1+ incidents)
  • Slack: #fleet-operations
  • Engineering On-Call: @on-call-engineer

Incident Severity Levels

  • P0-Critical: System down, >50% fleet affected, immediate response required
  • P1-High: Major degradation, 10-50% fleet affected, respond within 30 minutes
  • P2-Medium: Partial outage, <10% fleet affected, respond within 2 hours
  • P3-Low: Minor issues, isolated, respond within 24 hours
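
The thresholds above can be sketched as a small classifier. This is an illustrative sketch, not part of any existing tooling: the names `Severity`, `classify_severity`, and `RESPONSE_TARGET_MINUTES` are hypothetical, "immediate" is modeled as 0 minutes, and because the list gives no numeric boundary between P2 and P3, the `isolated` flag is assumed to be an operator judgment.

```python
from enum import Enum


class Severity(Enum):
    P0 = "P0-Critical"
    P1 = "P1-High"
    P2 = "P2-Medium"
    P3 = "P3-Low"


# Response targets in minutes, taken from the severity list.
# "Immediate" (P0) is modeled as 0 minutes (assumption).
RESPONSE_TARGET_MINUTES = {
    Severity.P0: 0,
    Severity.P1: 30,
    Severity.P2: 2 * 60,
    Severity.P3: 24 * 60,
}


def classify_severity(
    affected_units: int,
    total_units: int,
    system_down: bool = False,
    isolated: bool = False,
) -> Severity:
    """Map the share of affected fleet units to a severity level.

    Integer arithmetic is used so that exactly 10% and exactly 50%
    land in P1, matching the "10-50%" band.
    """
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    if not 0 <= affected_units <= total_units:
        raise ValueError("affected_units must be between 0 and total_units")

    if system_down or affected_units * 2 > total_units:  # more than 50%
        return Severity.P0
    if affected_units * 10 >= total_units:  # 10% up to and including 50%
        return Severity.P1
    # Below 10%: the operator decides whether the issue is isolated
    # (assumption, the runbook gives no numeric rule here).
    return Severity.P3 if isolated else Severity.P2


if __name__ == "__main__":
    level = classify_severity(affected_units=12, total_units=100)
    print(level.value, "respond within", RESPONSE_TARGET_MINUTES[level], "minutes")
```

Keeping the response targets in one table next to the classifier makes it harder for the two to drift apart when the levels are revised.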

Daily Operator Checklist

Start of Shift

  • Check fleet health dashboard
  • Review overnight incidents
  • Confirm all active fleet units are online
  • Check weather/natural disaster alerts for fleet zones
  • Review any scheduled maintenance

Ongoing Monitoring

  • Monitor fleet uptime metrics (hourly)
  • Check communication channels for user reports
  • Rotate through fleet units for inspection
  • Document any anomalies

End of Shift

  • Hand off a report to the next shift operator
  • Log all incidents and resolutions
  • Update runbook with new learnings
  • Verify fleet stability before signing off

Incident Response Procedures

P0-Critical Response

  1. Acknowledge immediately in #fleet-operations
  2. Page engineering on-call
  3. Isolate affected fleet units if possible
  4. Execute failover or recovery procedures
  5. Communicate status updates every 15 minutes
  6. Document root cause after resolution
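
The 15-minute update cadence in step 5 can be tracked with a short helper. This is a minimal sketch under stated assumptions: the function name `next_status_update_due` is hypothetical, and the clock is assumed to start when the incident is acknowledged (step 1), which the runbook does not specify.

```python
from datetime import datetime, timedelta, timezone

# Interval from step 5 of the P0-Critical response.
STATUS_UPDATE_INTERVAL = timedelta(minutes=15)


def next_status_update_due(acknowledged_at: datetime, now: datetime) -> datetime:
    """Return the first update slot strictly after `now`.

    Slots fall at acknowledged_at + 15 min, + 30 min, and so on.
    """
    if now < acknowledged_at:
        raise ValueError("now must not be earlier than acknowledged_at")
    elapsed = now - acknowledged_at
    # timedelta // timedelta yields the number of whole intervals elapsed.
    intervals_done = elapsed // STATUS_UPDATE_INTERVAL
    return acknowledged_at + (intervals_done + 1) * STATUS_UPDATE_INTERVAL


if __name__ == "__main__":
    ack = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    now = ack + timedelta(minutes=7)
    print(next_status_update_due(ack, now).isoformat())
```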

Common Incidents

Fleet Unit Offline

  1. Check unit health metrics
  2. Attempt remote restart
  3. If the restart fails, dispatch a field technician (if the unit is within the service area)
  4. Escalate if the unit is not restored within 2 hours
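
The offline-unit steps above can be read as a small decision function. The sketch below is illustrative only: the function name, its parameters, and the returned action labels are hypothetical, and the "keep_monitoring" branch is an assumption, since the runbook does not say what to do when a unit is outside the service area before the 2-hour mark.

```python
from datetime import timedelta

# Escalation deadline from step 4.
ESCALATION_DEADLINE = timedelta(hours=2)


def next_action(
    offline_for: timedelta,
    restart_attempted: bool,
    restart_succeeded: bool,
    in_service_area: bool,
) -> str:
    """Pick the next step for an offline unit, following steps 2-4."""
    if restart_succeeded:
        return "resolved"
    # Step 4: the deadline takes priority over the remaining steps.
    if offline_for >= ESCALATION_DEADLINE:
        return "escalate"
    # Step 2: try a remote restart first.
    if not restart_attempted:
        return "attempt_remote_restart"
    # Step 3: the restart failed; dispatch only inside the service area.
    if in_service_area:
        return "dispatch_field_technician"
    # Assumption: outside the service area, monitor until the deadline.
    return "keep_monitoring"


if __name__ == "__main__":
    print(next_action(timedelta(minutes=40), True, False, True))
```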

GPS/Connectivity Loss

  1. Verify network status in area
  2. Check unit configuration
  3. Coordinate with carrier if widespread
  4. Update affected users

Hardware Malfunction

  1. Run diagnostic tests
  2. Review error logs
  3. Determine if field service needed
  4. Order replacement parts if required

Escalation Matrix

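Issue Type         | Tier 1 Operator | Senior Operator      | Master Operator      | Engineering
-------------------|-----------------|----------------------|----------------------|------------
Unit offline       | Handle          | Escalate if >2 hours | Escalate if >4 hours | Immediate
System outage      | Notify          | Coordinate           | Lead                 | Own
Customer complaint | Resolve         | Escalate             | Escalate             | Notify
New feature issue  | Document        | Report               | Prioritize           | Fix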
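
For tooling that needs to look up responsibilities, the matrix above could be encoded as a nested mapping. This is a sketch only: the dictionary keys (`unit_offline`, `tier_1_operator`, and so on) and the `action_for` helper are hypothetical identifiers chosen for illustration; the cell values are copied from the matrix.

```python
# Issue type -> role -> action, transcribed from the escalation matrix.
ESCALATION_MATRIX = {
    "unit_offline": {
        "tier_1_operator": "Handle",
        "senior_operator": "Escalate if >2 hours",
        "master_operator": "Escalate if >4 hours",
        "engineering": "Immediate",
    },
    "system_outage": {
        "tier_1_operator": "Notify",
        "senior_operator": "Coordinate",
        "master_operator": "Lead",
        "engineering": "Own",
    },
    "customer_complaint": {
        "tier_1_operator": "Resolve",
        "senior_operator": "Escalate",
        "master_operator": "Escalate",
        "engineering": "Notify",
    },
    "new_feature_issue": {
        "tier_1_operator": "Document",
        "senior_operator": "Report",
        "master_operator": "Prioritize",
        "engineering": "Fix",
    },
}


def action_for(issue_type: str, role: str) -> str:
    """Return the matrix cell for an issue type and role."""
    try:
        return ESCALATION_MATRIX[issue_type][role]
    except KeyError:
        raise ValueError(f"unknown issue type or role: {issue_type!r}, {role!r}") from None


if __name__ == "__main__":
    print(action_for("system_outage", "master_operator"))
```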

Reporting Requirements

Daily

  • Fleet uptime report (automated)
  • Incident summary (if any)

Weekly

  • Operator performance metrics
  • Customer satisfaction scores
  • Maintenance activities

Monthly

  • Comprehensive operations review
  • Training needs assessment
  • Process improvement recommendations

Partner Program Coordination

  • Document all partner leads in CRM
  • Track lead source and conversion
  • Provide partner with fleet status updates (weekly)
  • Escalate partner-related issues to Partnership Manager

Safety & Compliance

  • Follow all local regulations for fleet operations
  • Maintain proper insurance documentation
  • Report accidents/incidents to authorities immediately
  • Complete safety training annually

Documentation Standards

All runbook entries must include:

  1. Timestamp (ISO 8601)
  2. Incident ID
  3. Actions taken (chronological)
  4. Resolution outcome
  5. Lessons learned
  6. Follow-up tasks
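
The six required fields can be captured in a small record type with a completeness check. This is a minimal sketch, not a prescribed schema: `RunbookEntry`, `utc_now_iso`, and `problems` are hypothetical names, the incident ID format in the usage example is invented for illustration, and treating an empty value as missing is an assumption (a team may prefer an explicit "none" for follow-up tasks).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


def utc_now_iso() -> str:
    """Current UTC time as an ISO 8601 string, e.g. 2024-01-01T12:00:00+00:00."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


@dataclass
class RunbookEntry:
    incident_id: str
    actions_taken: List[str]  # in chronological order
    resolution_outcome: str
    lessons_learned: str
    follow_up_tasks: List[str]
    timestamp: str = field(default_factory=utc_now_iso)

    def problems(self) -> List[str]:
        """Return the names of fields that are missing or malformed."""
        found = []
        try:
            # fromisoformat accepts the forms produced by isoformat();
            # older Python versions reject some other valid ISO 8601
            # strings, such as a trailing "Z".
            datetime.fromisoformat(self.timestamp)
        except ValueError:
            found.append("timestamp")
        for name in (
            "incident_id",
            "actions_taken",
            "resolution_outcome",
            "lessons_learned",
            "follow_up_tasks",
        ):
            if not getattr(self, name):  # empty string or empty list
                found.append(name)
        return found


if __name__ == "__main__":
    entry = RunbookEntry(
        incident_id="INC-0001",  # hypothetical ID format
        actions_taken=["Checked unit health metrics", "Restarted unit remotely"],
        resolution_outcome="Unit back online",
        lessons_learned="Check power status before restarting",
        follow_up_tasks=["Schedule battery inspection"],
    )
    print(entry.problems())
```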

This runbook is a living document. Operators are expected to suggest improvements after each incident.