Over the last couple of days I have been evaluating different solutions for monitoring Registry nodes and the Radicle network. I’d like to share my insights here, make a recommendation and move forward with a decision. I’ve tried to be brief but I’d be happy to share more details.
Our needs
- We want flexible and powerful analysis tools. Since we’re not sure yet what we need to monitor, we might need advanced capabilities to combine metrics and do calculations on them.
- The solution should integrate seamlessly with Prometheus metrics. Substrate, which our nodes are built on, provides Prometheus metrics out of the box.
- We want to operate as little monitoring infrastructure as possible.
- Current and future team members should require as little education on the solution as possible. Solutions that are widely used and documented are preferable.
- The solution should be affordable.
Baseline: Grafana/Prometheus
The self-hosted Grafana/Prometheus stack checks all the boxes for our needs except the one about operating as little infrastructure as possible. It’s powerful, flexible and widely used. Of all the solutions, this is the stack our team (and other Radicle teams) has the most experience with. This will likely hold true for future members as well.
The biggest drawback with this solution is that we would need to operate Grafana and Prometheus ourselves.
Grafana Cloud
Grafana Cloud is a service that provides a hosted version of Grafana and Prometheus.
It offers two options for integrating with Prometheus metrics: either via their own agent running on a node, or via remote write from an existing Prometheus instance. I chose the latter because I was familiar with setting up Prometheus on K8s but not with setting up the agent. I was able to set it up and get it working quickly.
Grafana Cloud charges $16 per 1,000 unique Prometheus series per month, with a minimum of $50 per month. We’re currently using about a third of what their basic plan includes, though we might need to scale this up later. Their basic plan also includes 10 users, which is more than enough for us.
Datadog
Datadog is an observability platform. I’ve integrated it into our stack by running their agent, which scrapes Prometheus metrics. The integration was fairly straightforward. Their feature set for analyzing metrics is comparable to Grafana/Prometheus but more limited in some cases. For example, it was not possible to calculate the rate of a counter metric over a configurable time window (e.g. block production rate over the last 10 minutes); only fixed time windows were available. Datadog has no stand-out feature that Grafana/Prometheus lacks.
Datadog charges $15 per host per month. This is not ideal since we want to be flexible with the number of hosts. Having to consider that we might pay more every time we spin up a host might be an issue. It is also unclear how ephemeral hosts are charged.
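To illustrate what “rate of a counter over a configurable time window” means, here is a minimal Python sketch. It is not how Prometheus or Datadog implement it; the function name, the sample format and the reset handling are assumptions made for illustration, and the samples are assumed to be sorted by timestamp.

```python
from typing import List, Tuple

# Hypothetical sample format: (unix timestamp in seconds, counter value)
Sample = Tuple[float, float]


def counter_rate(samples: List[Sample], now: float, window_s: float) -> float:
    """Average per-second increase of a counter over the last `window_s` seconds.

    Simplified: unlike PromQL's rate(), this does no extrapolation to the
    window boundaries. `samples` must be sorted by timestamp.
    """
    in_window = [s for s in samples if now - window_s <= s[0] <= now]
    if len(in_window) < 2:
        return 0.0  # not enough data points to compute a rate

    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop in value means the counter was reset (e.g. a node restart);
        # in that case count the new value as the increase since the reset.
        increase += cur - prev if cur >= prev else cur

    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Example: a block counter that grows by 10 every 60 seconds.
blocks = [(float(t), t / 6) for t in range(0, 601, 60)]
ten_minute_rate = counter_rate(blocks, now=600.0, window_s=600.0)
five_minute_rate = counter_rate(blocks, now=600.0, window_s=300.0)
```

In Prometheus the window is simply part of the query, e.g. `rate(some_counter[10m])`, where `some_counter` is a placeholder metric name.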
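To make the two pricing models easier to compare, here is a rough Python sketch based only on the list prices quoted in this post. It assumes Grafana Cloud bills pro rata per series above the minimum and that Datadog bills a flat fee for every host; the real billing rules (rounding, ephemeral hosts, discounts) may differ, and the function names are made up for this example.

```python
def grafana_cloud_monthly_cost(active_series: int) -> float:
    """$16 per 1,000 unique series per month, with a $50 monthly minimum.

    Assumption: usage is billed pro rata rather than in blocks of 1,000.
    """
    return max(50.0, 16.0 * active_series / 1000)


def datadog_monthly_cost(hosts: int) -> float:
    """$15 per host per month.

    Assumption: every host counts for the full month.
    """
    return 15.0 * hosts


# Example: the Grafana Cloud cost depends on the number of series,
# the Datadog cost on the number of hosts.
for series, hosts in [(1_000, 2), (5_000, 4), (10_000, 10)]:
    print(
        f"{series} series: ${grafana_cloud_monthly_cost(series):.2f} | "
        f"{hosts} hosts: ${datadog_monthly_cost(hosts):.2f}"
    )
```

The point of the comparison is that adding a host only raises the Grafana Cloud bill through the extra series it exposes, whereas it always raises the Datadog bill by a fixed amount.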
Google Stackdriver
Stackdriver is the monitoring solution for everything running on Google Cloud Platform. It integrates well with metrics from GCP but is more limited in its capabilities than the other solutions. I experimented with dashboards for some GCP resources but did not set it up for our custom network metrics.
Recommendation
Based on my research, I recommend we use Grafana Cloud for now. Stackdriver is not really an option since it does not satisfy our needs. Datadog offers no additional features and has some limitations compared to Grafana Cloud, while Grafana Cloud has no serious limitations. Grafana Cloud also has an advantage in openness and in our existing expertise, and its pricing seems reasonable.