Expose a general health indicator in stats
A suggestion from Chris Siebenmann's blog: https://utcc.utoronto.ca/~cks/space/blog/sysadmin/HaveGeneralHealthMetric
If your system is reasonably decent sized, it probably has some sort of logging framework that categorizes log messages by both subsystem and broad level of alarmingness. Add a hook into your logging system so that you track the last time a message was emitted for a given subsystem at a given priority level, and expose these times (with level and subsystem) as metrics. Then people like me can put together monitoring for things like 'the Prometheus TSDB has logged warnings or above within the last five minutes'.
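To make the suggestion concrete, here is a minimal sketch of such a hook using Python's standard `logging` module. It is only an illustration under stated assumptions, not how any particular project implements it: the class name `LastLogTimeHandler`, the metric name `log_last_message_timestamp_seconds`, and the `subsystem`/`level` label names are all hypothetical, and the logger name is treated as the subsystem.

```python
import logging
import threading


class LastLogTimeHandler(logging.Handler):
    """Remember when each (subsystem, level) pair last logged a message.

    Hypothetical helper for illustration; the logger name stands in for
    the subsystem.
    """

    def __init__(self):
        super().__init__()
        self._times = {}
        self._times_lock = threading.Lock()

    def emit(self, record):
        # record.created is the Unix timestamp at which the record was made.
        key = (record.name, record.levelname.lower())
        with self._times_lock:
            self._times[key] = record.created

    def snapshot(self):
        """Return a copy of the {(subsystem, level): timestamp} mapping."""
        with self._times_lock:
            return dict(self._times)

    def render(self, metric="log_last_message_timestamp_seconds"):
        """Render the timestamps as gauges in a Prometheus-style text format.

        The metric name is an assumption. Label values are not escaped here,
        so this sketch assumes logger names contain no quotes, backslashes,
        or newlines.
        """
        lines = [f"# TYPE {metric} gauge"]
        for (subsystem, level), ts in sorted(self.snapshot().items()):
            lines.append(
                f'{metric}{{subsystem="{subsystem}",level="{level}"}} {ts}'
            )
        return "\n".join(lines) + "\n"


# Attach the hook once at the root so every subsystem logger is covered.
handler = LastLogTimeHandler()
logging.getLogger().addHandler(handler)

logging.getLogger("tsdb").warning("compaction took longer than expected")
print(handler.render())
```

With metrics shaped like this, the monitoring described above could be an expression along the lines of `time() - log_last_message_timestamp_seconds{subsystem="tsdb",level=~"warning|error|critical"} < 300`, again assuming these metric and label names.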