Fleet Operations Runbook
Quick Reference
Emergency Contacts
- PagerDuty: (for Tier 1+ incidents)
- Slack: #fleet-operations
- Engineering On-Call: @on-call-engineer
Incident Severity Levels
- P0-Critical: System down, >50% fleet affected, immediate response required
- P1-High: Major degradation, 10-50% fleet affected, respond within 30 min
- P2-Medium: Partial outage, <10% fleet affected, respond within 2 hours
- P3-Low: Minor issues, isolated, respond within 24 hours
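
The thresholds above can be sketched as a small classifier. This is only an illustration under stated assumptions: the names `Severity`, `classify_severity`, and `RESPONSE_TARGET_MINUTES` are hypothetical, P0 is assumed to apply when either the system is down or more than 50% of the fleet is affected, and because the runbook does not define a numeric boundary between P2 and P3, the operator's judgment is passed in as an `isolated` flag.

```python
from enum import Enum


class Severity(Enum):
    P0_CRITICAL = "P0-Critical"
    P1_HIGH = "P1-High"
    P2_MEDIUM = "P2-Medium"
    P3_LOW = "P3-Low"


# Response targets in minutes, copied from the severity list.
# "Immediate" for P0 is represented as 0.
RESPONSE_TARGET_MINUTES = {
    Severity.P0_CRITICAL: 0,
    Severity.P1_HIGH: 30,
    Severity.P2_MEDIUM: 2 * 60,
    Severity.P3_LOW: 24 * 60,
}


def classify_severity(affected_units, total_units, system_down=False, isolated=False):
    """Map fleet impact to a severity level.

    Assumptions (not stated in the runbook): P0 applies when the system is
    down OR more than 50% of units are affected, and `isolated` is an
    operator judgment that separates P3 from P2 below the 10% mark.
    """
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    if not 0 <= affected_units <= total_units:
        raise ValueError("affected_units must be between 0 and total_units")

    # Integer comparisons avoid floating-point edge cases at 10% and 50%.
    if system_down or affected_units * 2 > total_units:
        return Severity.P0_CRITICAL
    if affected_units * 10 >= total_units:
        return Severity.P1_HIGH
    if isolated:
        return Severity.P3_LOW
    return Severity.P2_MEDIUM
```

For example, `classify_severity(50, 100)` falls in the 10-50% band and yields `Severity.P1_HIGH`, while `classify_severity(51, 100)` crosses the 50% mark and yields `Severity.P0_CRITICAL`.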
Daily Operator Checklist
Start of Shift
- Check fleet health dashboard
- Review overnight incidents
- Confirm all active fleet units are online
- Check weather/natural disaster alerts for fleet zones
- Review any scheduled maintenance
Ongoing Monitoring
- Monitor fleet uptime metrics (hourly)
- Check communication channels for user reports
- Rotate through fleet units for inspection
- Document any anomalies
End of Shift
- Deliver handoff report to next shift operator
- Log all incidents and resolutions
- Update runbook with new learnings
- Verify fleet stability before signing off
Incident Response Procedures
P0-Critical Response
- Acknowledge immediately in #fleet-operations
- Page engineering on-call
- Isolate affected fleet units if possible
- Execute failover or recovery procedures
- Communicate status updates every 15 minutes
- Document root cause after resolution
Common Incidents
Fleet Unit Offline
- Check unit health metrics
- Attempt remote restart
- If the restart fails, dispatch a field technician (if the unit is within the service area)
- Escalate if the unit is not restored within 2 hours
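
The restart, dispatch, and escalate steps above can be read as a simple decision function. The sketch below is one possible encoding, not an existing tool: `next_offline_action` and its parameters are hypothetical, the 2-hour limit is treated as "at or after 2 hours offline", and the "continue monitoring" result for units outside the service area is an assumption, since the runbook does not say what to do in that case before the escalation deadline.

```python
from datetime import datetime, timedelta

# Escalation deadline taken from the "Fleet Unit Offline" steps.
ESCALATION_DEADLINE = timedelta(hours=2)


def next_offline_action(offline_since, now, restart_attempted, in_service_area):
    """Return the next step for a unit that is still offline.

    Assumes the unit's health metrics have already been checked and that
    any attempted remote restart did not bring the unit back.
    """
    if now - offline_since >= ESCALATION_DEADLINE:
        return "escalate"
    if not restart_attempted:
        return "attempt remote restart"
    if in_service_area:
        return "dispatch field technician"
    # Assumption: no field dispatch is possible, so keep watching the unit
    # until it recovers or the escalation deadline is reached.
    return "continue monitoring"
```

Called with a unit that went offline 30 minutes ago, a failed restart, and `in_service_area=True`, the function returns `"dispatch field technician"`.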
GPS/Connectivity Loss
- Verify network status in area
- Check unit configuration
- Coordinate with carrier if widespread
- Update affected users
Hardware Malfunction
- Run diagnostic tests
- Review error logs
- Determine whether field service is needed
- Order replacement parts if required
Escalation Matrix
| Issue Type | Tier 1 Operator | Senior Operator | Master Operator | Engineering |
|---|---|---|---|---|
| Unit offline | Handle | Escalate if >2hrs | Escalate if >4hrs | Immediate |
| System outage | Notify | Coordinate | Lead | Own |
| Customer complaint | Resolve | Escalate | Escalate | Notify |
| New feature issue | Document | Report | Prioritize | Fix |
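
If the matrix needs to be consulted from tooling (for example a chat bot or dashboard), it could be held as a plain lookup table. The sketch below copies the cells verbatim from the table above; the names `ROLES`, `ESCALATION_MATRIX`, and `responsibility` are hypothetical.

```python
# Column order matches the escalation matrix header.
ROLES = ("Tier 1 Operator", "Senior Operator", "Master Operator", "Engineering")

# One row per issue type; each tuple is aligned with ROLES.
ESCALATION_MATRIX = {
    "Unit offline": ("Handle", "Escalate if >2hrs", "Escalate if >4hrs", "Immediate"),
    "System outage": ("Notify", "Coordinate", "Lead", "Own"),
    "Customer complaint": ("Resolve", "Escalate", "Escalate", "Notify"),
    "New feature issue": ("Document", "Report", "Prioritize", "Fix"),
}


def responsibility(issue_type, role):
    """Return the matrix cell for a given issue type and role."""
    if issue_type not in ESCALATION_MATRIX:
        raise ValueError(f"unknown issue type: {issue_type!r}")
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return ESCALATION_MATRIX[issue_type][ROLES.index(role)]
```

For instance, `responsibility("System outage", "Master Operator")` returns `"Lead"`. Keeping the table as data rather than branching logic means a change to the matrix only touches one dictionary.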
Reporting Requirements
Daily
- Fleet uptime report (automated)
- Incident summary (if any)
Weekly
- Operator performance metrics
- Customer satisfaction scores
- Maintenance activities
Monthly
- Comprehensive operations review
- Training needs assessment
- Process improvement recommendations
Partner Program Coordination
- Document all partner leads in CRM
- Track lead source and conversion
- Provide partner with fleet status updates (weekly)
- Escalate partner-related issues to Partnership Manager
Safety & Compliance
- Follow all local regulations for fleet operations
- Maintain proper insurance documentation
- Report accidents/incidents to authorities immediately
- Complete safety training annually
Documentation Standards
All runbook entries must include:
- Timestamp (ISO 8601)
- Incident ID
- Actions taken (chronological)
- Resolution outcome
- Lessons learned
- Follow-up tasks
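
As an illustration of the required fields, an entry could be modeled as a small record type that checks completeness and emits an ISO 8601 timestamp. This is a sketch under assumptions: the class `RunbookEntry`, its method names, and the sample values (including the `INC-0001` ID format) are made up for the example, and an empty follow-up list is assumed to be acceptable when no follow-up is needed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RunbookEntry:
    timestamp: datetime          # serialized as ISO 8601
    incident_id: str
    actions_taken: list          # chronological order
    resolution_outcome: str
    lessons_learned: str
    follow_up_tasks: list = field(default_factory=list)

    def missing_fields(self):
        """List required fields that are empty or blank."""
        missing = []
        if not self.incident_id.strip():
            missing.append("incident_id")
        if not self.actions_taken:
            missing.append("actions_taken")
        if not self.resolution_outcome.strip():
            missing.append("resolution_outcome")
        if not self.lessons_learned.strip():
            missing.append("lessons_learned")
        return missing

    def to_dict(self):
        """Serialize the entry, rendering the timestamp in ISO 8601."""
        return {
            "timestamp": self.timestamp.isoformat(),
            "incident_id": self.incident_id,
            "actions_taken": list(self.actions_taken),
            "resolution_outcome": self.resolution_outcome,
            "lessons_learned": self.lessons_learned,
            "follow_up_tasks": list(self.follow_up_tasks),
        }


# Illustrative values only; the ID format is hypothetical.
entry = RunbookEntry(
    timestamp=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    incident_id="INC-0001",
    actions_taken=["Checked unit health metrics", "Attempted remote restart"],
    resolution_outcome="Unit restored after remote restart",
    lessons_learned="Restart resolved the issue without field dispatch",
)
```

Using a timezone-aware `datetime` keeps the UTC offset in the serialized timestamp, which avoids ambiguity when shifts span time zones.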
This runbook is a living document. Operators are expected to suggest improvements after each incident.