Triage Playbook | Cloudaware Documentation

This playbook outlines a simple, repeatable process for validating anomalies, assigning owners, and implementing fixes using Cloudaware FinOps and related modules.

Confirm the Signal

When an anomaly alert arrives:

Verify the scope and time range (for example, which account, application, business unit (BU), or customer is affected, and which days are impacted).
Check FinOps dashboards and reports (for example, Reporting & Analytics) to confirm the spike or unusual pattern.
Compare against the cloud provider's billing console if needed to rule out data ingestion issues.

If the spike is not visible in billing data or dashboards, review Data Health & Freshness before proceeding.
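As an illustration of the confirmation step, the sketch below compares the flagged days against a trailing baseline computed from exported daily cost data. It is a minimal example under stated assumptions, not a Cloudaware API: the function name `confirm_spike`, the 14-day baseline, and the three-standard-deviation threshold are all hypothetical choices, and the input is assumed to be a mapping of ISO dates to daily cost.

```python
from statistics import mean, stdev

def confirm_spike(daily_costs, flagged_days, baseline_days=14, threshold=3.0):
    """Return {day: excess over baseline mean} for flagged days that stand out.

    daily_costs: dict of ISO date string -> cost (assumed export format).
    flagged_days: set of ISO date strings named in the alert.
    """
    first_flagged = min(flagged_days)
    # ISO dates sort lexicographically, so string comparison is safe here.
    history = [cost for day, cost in sorted(daily_costs.items()) if day < first_flagged]
    baseline = history[-baseline_days:]
    if len(baseline) < 2:
        raise ValueError("need at least two baseline days")
    mu, sigma = mean(baseline), stdev(baseline)
    confirmed = {}
    for day in sorted(flagged_days):
        cost = daily_costs.get(day)
        if cost is None:
            continue  # no data for this day: check data freshness instead
        if cost > mu + threshold * sigma:
            confirmed[day] = round(cost - mu, 2)
    return confirmed

costs = {
    "2024-05-01": 100.0, "2024-05-02": 102.0, "2024-05-03": 98.0,
    "2024-05-04": 101.0, "2024-05-05": 99.0, "2024-05-06": 180.0,
    "2024-05-07": 100.0,
}
confirmed = confirm_spike(costs, {"2024-05-06", "2024-05-07"})
print(confirmed)
```

A flagged day that is missing from the result either fell within the normal range or had no data, which is the cue to review data freshness before going further.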

Check for Expected Changes

Before treating an anomaly as an incident:

Look for known events in that period (launches, migrations, enablement of new services, one‑time data processing, promotions).
Ask the owning team whether they recently changed configuration, scaling policies, or deployment patterns.

If the behavior is expected, document it (for example, in an internal runbook or ticket) so that similar anomalies in the future can be interpreted correctly or suppressed where appropriate.
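If known events are kept in a structured log, the check can be partly automated. The sketch below assumes a hypothetical list of event records with `name`, `scope`, `start`, and `end` fields (none of these come from Cloudaware) and returns the events that overlap the anomaly window for the same scope.

```python
from datetime import date

def overlapping_events(events, scope, start, end):
    """Return names of known events for `scope` that overlap [start, end]."""
    return [
        event["name"]
        for event in events
        # Two date ranges overlap when each starts before the other ends.
        if event["scope"] == scope and event["start"] <= end and event["end"] >= start
    ]

known_events = [
    {"name": "checkout-v2 launch", "scope": "payments",
     "start": date(2024, 5, 5), "end": date(2024, 5, 8)},
    {"name": "warehouse migration", "scope": "analytics",
     "start": date(2024, 5, 1), "end": date(2024, 5, 10)},
]
matches = overlapping_events(known_events, "payments", date(2024, 5, 6), date(2024, 5, 7))
print(matches)
```

A match does not prove the spend is expected; it only tells you which team to ask first.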

Identify Drivers

If the anomaly is not clearly expected:

Break down spend by service, region, and usage type to find which components changed most.
Use allocation and business mapping views to see which applications, environments, or teams are responsible.
Look for:
- New or expanded services.
- Rapidly growing storage or data transfer.
- Sudden changes in unit costs (for example, cost per request or per user).

Document the primary drivers of the anomaly for use in post‑analysis and communication.
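The breakdown described above can be sketched as a comparison of two equal-length periods grouped by service, region, and usage type. This is an illustrative example only: the row format (dicts with `service`, `region`, `usage_type`, and `cost` keys) and the function name `top_drivers` are assumptions, not a Cloudaware export schema.

```python
from collections import defaultdict

def top_drivers(baseline_rows, anomaly_rows, limit=5):
    """Rank (service, region, usage_type) groups by cost increase."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [baseline cost, anomaly cost]
    for index, rows in enumerate((baseline_rows, anomaly_rows)):
        for row in rows:
            key = (row["service"], row["region"], row["usage_type"])
            totals[key][index] += row["cost"]
    deltas = [(key, after - before) for key, (before, after) in totals.items()]
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[:limit]

baseline = [
    {"service": "S3", "region": "us-east-1", "usage_type": "DataTransfer-Out", "cost": 40.0},
    {"service": "EC2", "region": "us-east-1", "usage_type": "BoxUsage", "cost": 200.0},
]
anomaly = [
    {"service": "S3", "region": "us-east-1", "usage_type": "DataTransfer-Out", "cost": 190.0},
    {"service": "EC2", "region": "us-east-1", "usage_type": "BoxUsage", "cost": 210.0},
    {"service": "Athena", "region": "us-east-1", "usage_type": "DataScanned", "cost": 25.0},
]
drivers = top_drivers(baseline, anomaly, limit=2)
print(drivers)
```

Groups that appear only in the anomaly period start from a baseline of zero, so newly enabled services surface alongside growth in existing ones.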

Classify and Assign Ownership

Based on the drivers:

Classify the anomaly (for example, misconfiguration, unintentional growth, expected growth, data issue).
Identify a clear owner (team or application) using allocation and tagging information.
Route the anomaly to that owner via your normal collaboration tools (ticketing, chat, email, incident system).

Include enough context (scope, services, magnitude, timeframe, dashboards) that owners can quickly act without re‑creating the analysis.
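One way to keep hand-offs consistent is to assemble the context into a fixed structure before posting it to a ticket or chat. The sketch below is a hypothetical template: the class names, field names, and the placeholder dashboard URL are illustrative, and the classification values simply mirror the examples listed above.

```python
from dataclasses import dataclass, field
from enum import Enum

class AnomalyClass(Enum):
    MISCONFIGURATION = "misconfiguration"
    UNINTENTIONAL_GROWTH = "unintentional growth"
    EXPECTED_GROWTH = "expected growth"
    DATA_ISSUE = "data issue"

@dataclass
class AnomalyHandoff:
    scope: str
    services: list
    magnitude_usd: float
    timeframe: str
    classification: AnomalyClass
    owner: str
    dashboards: list = field(default_factory=list)

    def summary(self) -> str:
        """Render the hand-off as plain text for a ticket or chat message."""
        lines = [
            f"Owner: {self.owner}",
            f"Classification: {self.classification.value}",
            f"Scope: {self.scope}",
            f"Timeframe: {self.timeframe}",
            f"Services: {', '.join(self.services)}",
            f"Magnitude: ${self.magnitude_usd:,.2f} above baseline",
        ]
        lines += [f"Dashboard: {url}" for url in self.dashboards]
        return "\n".join(lines)

handoff = AnomalyHandoff(
    scope="payments / production",
    services=["S3", "Athena"],
    magnitude_usd=1840.5,
    timeframe="2024-05-06 to 2024-05-07",
    classification=AnomalyClass.MISCONFIGURATION,
    owner="payments-team",
    dashboards=["https://example.com/dashboards/anomaly-123"],  # placeholder URL
)
text = handoff.summary()
print(text)
```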

Plan and Implement Response

For anomalies that indicate real issues:

Coordinate with owners to choose a response, such as:
- Reverting or adjusting a configuration change.
- Tightening scaling policies or limits.
- Cleaning up unintended resources.
- Accelerating rightsizing or commitment purchases.
Where appropriate, create or update Optimization tasks or playbooks so similar issues are addressed systematically.

Track actions and resolutions in your ticketing or work‑tracking system.

Learn and Tune

After resolving a significant anomaly:

Update documentation or runbooks with what happened, how it was detected, and what was done.
Adjust detector sensitivity, scopes, or exclusions if the alert was too noisy or fired too late.
Consider adding or updating Compliance Engine cost policies or waste detection rules if the anomaly revealed a class of issues whose detection can be automated.
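To decide whether a detector is too noisy, it can help to track how past alerts were classified during triage. The sketch below computes the share of alerts that turned out to be non-actionable; treating "expected growth" and "data issue" as non-actionable is an assumption for illustration, and any cut-off you act on is a local judgment call.

```python
from collections import Counter

def noise_ratio(outcomes, non_actionable=("expected growth", "data issue")):
    """Return the fraction of triaged alerts that needed no corrective action."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[label] for label in non_actionable) / total

# Classifications recorded for one detector's recent alerts (sample data).
recent = ["misconfiguration", "expected growth", "expected growth", "data issue"]
ratio = noise_ratio(recent)
print(f"{ratio:.0%} of alerts were non-actionable")
```

A consistently high ratio for one scope suggests loosening sensitivity or adding an exclusion there, rather than changing the detector globally.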

Over time, this feedback loop should reduce noise, surface genuine issues earlier, and turn one‑off investigations into repeatable FinOps practices.