Incident Response: Handling production incidents

Workflow

Step 1: Triage

Assess the severity (P1-P4), the impact scope, and the affected services.

P1: Service down or data loss (all hands on deck)
P2: Major feature broken (team lead + on-call)
P3: Minor degradation (on-call engineer)
P4: Cosmetic issue (handled in the normal sprint)
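
The severity rules above can be sketched as a small triage function. This is a minimal illustration, not part of any tool described here: the `Signals` fields and the `triage` function are hypothetical names chosen for the example, and a real triage also weighs impact scope and affected services.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"
    P4 = "P4"


@dataclass
class Signals:
    """Hypothetical triage inputs; the field names are illustrative only."""
    service_down: bool = False
    data_loss: bool = False
    major_feature_broken: bool = False
    minor_degradation: bool = False


def triage(signals: Signals) -> Severity:
    # Check the most severe conditions first so the highest matching level wins.
    if signals.service_down or signals.data_loss:
        return Severity.P1
    if signals.major_feature_broken:
        return Severity.P2
    if signals.minor_degradation:
        return Severity.P3
    # Anything left (for example a cosmetic issue) goes to the normal sprint.
    return Severity.P4


print(triage(Signals(major_feature_broken=True)).value)  # P2
```

Checking conditions from most to least severe means an incident that matches several levels is always escalated to the highest one.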

Step 2: Coordinate

Assign roles: Incident Commander, Tech Lead, Communicator. Create a war room (Lark group/thread).

Step 3: Mitigate

Reduce the impact first; fix the root cause afterwards. Options: rollback, turning a feature flag off, scaling up, hotfix.

Step 4: Root Cause Analysis

Once mitigation is done, find the root cause. Use the 5-Whys technique or fault tree analysis.

Step 5: Postmortem

Write a postmortem report: timeline, root cause, action items. Keep it blameless: focus on the process, not on blaming people.

Severity Levels

Level	Response Time	Who	Example
P1 — Critical	< 15 min	All hands	Service down, data breach
P2 — Major	< 1 hour	Team lead + on-call	Payment broken, auth fail
P3 — Minor	< 4 hours	On-call	Slow queries, minor UI bug
P4 — Low	Next sprint	Assignee	Cosmetic, non-blocking
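
As a rough sketch, the table above could be encoded as a lookup that turns a severity level into a response deadline. The names `Policy`, `POLICIES`, and `response_deadline` are assumptions made for this example, and the detection timestamp is arbitrary. P4 has no clock deadline because it is picked up in the next sprint.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Policy(NamedTuple):
    label: str
    response_within: Optional[timedelta]  # None means "next sprint", no clock deadline
    responders: str


# Values copied from the severity table; the structure itself is illustrative.
POLICIES = {
    "P1": Policy("Critical", timedelta(minutes=15), "All hands"),
    "P2": Policy("Major", timedelta(hours=1), "Team lead + on-call"),
    "P3": Policy("Minor", timedelta(hours=4), "On-call"),
    "P4": Policy("Low", None, "Assignee"),
}


def response_deadline(level: str, detected_at: datetime) -> Optional[datetime]:
    """Return the time by which a response should have started, or None for P4."""
    window = POLICIES[level].response_within
    return None if window is None else detected_at + window


detected = datetime(2026, 4, 1, 12, 30)  # arbitrary example timestamp
print(response_deadline("P2", detected))  # 2026-04-01 13:30:00
```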


Real-world example

INCIDENT REPORT
├── Severity: P2
├── Impact: Payment processing failed for 12% of users
├── Duration: 45 minutes (12:30 - 13:15)
├── Root Cause: Redis connection pool exhausted 
│   after config change reduced max from 50 → 5
├── Mitigation: Reverted config change
├── Action Items:
│   ├── [ ] Add connection pool monitoring alert
│   ├── [ ] Config change requires 2 approvals
│   └── [ ] Add integration test for Redis pool
└── Postmortem: Scheduled for 2026-04-08
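
For teams that want to keep such reports in a machine-readable form, the fields above might be modeled as follows. This is only a sketch: the `IncidentReport` and `ActionItem` classes are hypothetical, and the calendar date used for the start and end times is a placeholder, since the report gives only clock times.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ActionItem:
    description: str
    done: bool = False


@dataclass
class IncidentReport:
    severity: str
    impact: str
    started_at: datetime
    resolved_at: datetime
    root_cause: str
    mitigation: str
    action_items: List[ActionItem] = field(default_factory=list)

    @property
    def duration_minutes(self) -> int:
        # Whole minutes between start and resolution.
        return int((self.resolved_at - self.started_at).total_seconds() // 60)

    def open_items(self) -> List[ActionItem]:
        # Action items that still need follow-up after the postmortem.
        return [item for item in self.action_items if not item.done]


report = IncidentReport(
    severity="P2",
    impact="Payment processing failed for 12% of users",
    started_at=datetime(2026, 4, 1, 12, 30),   # placeholder date, time from the report
    resolved_at=datetime(2026, 4, 1, 13, 15),  # placeholder date, time from the report
    root_cause="Redis connection pool exhausted after config change reduced max from 50 to 5",
    mitigation="Reverted config change",
    action_items=[
        ActionItem("Add connection pool monitoring alert"),
        ActionItem("Config change requires 2 approvals"),
        ActionItem("Add integration test for Redis pool"),
    ],
)

print(report.duration_minutes)   # 45
print(len(report.open_items()))  # 3
```

Deriving the duration from the two timestamps, rather than storing it separately, keeps the report from drifting out of sync with its own timeline.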

When to use / when not to use

Use	Don't use
A production incident is in progress	A bug on staging (use `/fix`)
You need a structured response process	A known issue already being fixed in the sprint
Writing a postmortem after an incident	A regular retrospective (use `/retro`)

