Prometheus Rules - For Fun & Profit (& Loss) 18 August 2016 Tristan Colgate-McFarlane Infrastructure Engineer - Qubit

What I do Programmer

Monitoring nut

Developer and sole community member/user of Guile-SNMP 2

What are Qubit doing with Prometheus? Large and varied estate (physical + (AWS + GCE) * (Mesos + K8s))

Multiple autonomous development teams

Trying to bring more business focused monitoring. 3

AWS billing is difficult. Dozens of products

Multiple cost centers under one account

EC2 - Normal vs reserved vs spot instances

https://github.com/QubitProducts/aws_audit_exporter

https://github.com/prometheus/cloudwatch_exporter 4

CloudWatch - AWS/Billing Updated every 4 hours

Broken down by product and cost center

Estimate of cost to date since the start of the month aws_billing_estimated_charges_average 5

CloudWatch - AWS/Billing - cont. Great News Good estimate

Fine grained enough to spot issues

But can we take it further?

What will my bill be at the end of the month? 6

UTC, Unix epoch and time() Prometheus gives us time() since Unix epoch

since Unix epoch No UTC support 7

Up front truths Not accurate (some easily fixable some not)

Leap seconds are a thing 8

time_end_of_current_day_seconds RRD taught me 86400 seconds in a day End of the current day in seconds time_end_of_current_day_seconds = floor(vector(time() / 86400) +1) * 86400 So how many seconds are left today? time_current_day_remaining_seconds = time_end_of_current_day_seconds - time() 9

time_current_year Which year is it now? 86400 seconds in a day

365.242196 days in a year time_current_year = 1970 + round(vector(time()/ (86400 * 365.242196) -0.5)) note possibly just got lucky that this works in 2016 10

time_start_current_year_seconds Epoch time at the start of the year? time_start_current_year_seconds = (floor(vector(time() / (86400) / 365)) * 86400 * 365) + (floor(vector(((time()) + (86400 * 365))/ (86400) / (365 * 4))) * 86400) and the end? time_end_current_year_seconds = time_start_current_year_seconds + (365 * 86400) + (time_is_leap_year_bool * 86400) Should round these down to the nearest 86400, about 3 hours off of midnight now 11

time_is_leap_year_bool Is this a leap year?

- divisible by 4

- but not 100

- unless 400 time_is_leap_year_bool = (time_current_year % 4 == bool 0 ) and ((time_current_year % 400 == bool 0 ) or (time_current_year % 100 != bool 0 )) 12

time_end_of_current_MONTH_seconds time_end_of_current_jan_seconds = time_start_current_year_seconds + (31 * 86400) time_end_of_current_feb_seconds = time_end_of_current_jan_seconds + (28 * 86400) + (time_is_leap_year_bool * 86400) time_end_of_current_mar_seconds = time_end_of_current_feb_seconds + (31 * 86400) ... time_end_of_current_dec_seconds = time_end_of_current_nov_seconds + (31 * 86400) 13

time_current_month Which month is it? time_current_month = ( 1 + (time() > bool time_end_of_current_jan_seconds) + (time() > bool time_end_of_current_feb_seconds) + .... (time() > bool time_end_of_current_nov_seconds)) 14

time_end_of_current_month_seconds time_end_of_current_month_seconds = ( ((time_current_month == bool 1) * time_end_of_current_jan_seconds) + ((time_current_month == bool 2) * time_end_of_current_feb_seconds) + ... ((time_current_month == bool 12) * time_end_of_current_dec_seconds)) 15

time_current_month_remaining_seconds How long is left in the current month? time_current_month_remaining_seconds = time_end_of_current_month_seconds - time() 16

And finally... Current cost sum(aws_billing_estimated_charges_average) Last month's cost sum(max_over_time(aws_billing_estimated_charges_average[31d])) End of this month's cost sum( predict_linear( aws_billing_estimated_charges_average[2d], scalar(time_current_month_remaining_seconds))) Delta ... profit! Or loss? sum( predict_linear( aws_billing_estimated_charges_average[2d], scalar(time_current_month_remaining_seconds))) - sum(max_over_time(aws_billing_estimated_charges_average[31d])) 17

Result 18

Other possibilities Want higher priority alerts on the weekend? time_current_weekday = 1 + (((time() / 86400) -4) % 7) time_is_weekday_bool = time_current_day < bool 5 19

Improvements Better maths for start of year

Include leap seconds (26 since 1970)

It takes many evaluation_intervals to populate values, we could manually expand the rules to avoid the delay

Can't backfill rules

Could create a utc_exporter... TBC 20