Stakepools, the backbone of the Cardano Network

Markus from VITAL - Responsible Staking
Published in Cardano SPOs on Nov 2, 2021


This is the second article in a series in which we share our perspective on what makes a reliable Cardano stake pool and how the VITAL pool contributes to better network health.

Articles in this series:

01 VITAL Introduction
02 Stakepools, the backbone of the Cardano Network (this article)
03 How we ensure fair rewards
04 Decentralization taken seriously
05 Environmental and Social Responsibility
06 Trust needs to be earned

How reliable does a stake pool need to be?

I think that's a great question to start with, because someone who isn't deep into the technology might assume that the availability of a single pool is not very important for the network. Just think about Bitcoin: if some mining servers are down, someone else simply mines the block.

In Cardano, it’s a little different. If a pool's node misses a block that was assigned to that pool, the block is simply gone; no other pool can stand in for that slot. Of course, another block will follow soon (on average every 20 seconds), so for the network as a whole this is not a big risk, as a failure of many pools at the same time is unlikely. But from an individual pool's perspective, missing a block means missing the rewards that minting that block would have earned.
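
To put the 20-second figure and the weight of a single block into numbers, here is a minimal Python sketch. It assumes the Cardano mainnet parameters (1-second slots, active slot coefficient f = 0.05, 432,000-slot epochs) and the Ouroboros Praos leader-election probability φ(σ) = 1 − (1 − f)^σ for a pool with relative stake σ; it ignores details such as the stake snapshot lag, and the function names are illustrative only.

```python
# Cardano mainnet parameters (Shelley era onward).
SLOT_LENGTH_S = 1.0            # one slot per second
ACTIVE_SLOT_COEFF = 0.05       # f: expected fraction of slots with a leader
EPOCH_LENGTH_SLOTS = 432_000   # 5 days of 1-second slots


def mean_block_interval_s(f: float = ACTIVE_SLOT_COEFF) -> float:
    """Average time between blocks for the network as a whole.

    With all stake taken together (sigma = 1), a slot has at least one leader
    with probability f, so on average there is one block every 1 / f slots.
    """
    return SLOT_LENGTH_S / f


def leader_probability(sigma: float, f: float = ACTIVE_SLOT_COEFF) -> float:
    """Per-slot probability that a pool with relative stake sigma is elected
    slot leader (Ouroboros Praos: phi_f(sigma) = 1 - (1 - f) ** sigma)."""
    return 1.0 - (1.0 - f) ** sigma


def expected_blocks_per_epoch(sigma: float) -> float:
    """Expected number of leader slots assigned to the pool in one epoch."""
    return EPOCH_LENGTH_SLOTS * leader_probability(sigma)
```

For example, a pool holding 0.1% of the active stake (σ = 0.001) can expect roughly 22 blocks per epoch, so every single missed block is a noticeable share of that epoch's output.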

So from a delegator's perspective, the pool you delegate to should be highly available so that your rewards are not put at risk.

So what needs to be considered?

The previous section already points to the first major concern: high availability. But besides availability, there are other aspects that matter in their own right and can also affect availability. Security, maintenance and operations, and monitoring and alerting in particular are must-have technical concerns for any stake pool operator.

That’s all good to know, but as a delegator you have no real way of verifying whether an operator takes these topics seriously.

How is the VITAL Pool approaching high availability?

Since many delegators don’t want to dig into technical architecture details, we provide a block insurance: if we miss a block through our own fault or the fault of our hosting provider, we pay the missed rewards out of our own pocket. This is limited to a maximum of one block per epoch, which would cost us ~700 ADA in such an event. This is our way of demonstrating how confident we are in our technical setup. So what's behind it?

Our Failover Mechanism

We run two block-producing nodes in two different locations. Both nodes report their health status to an external service. If the current primary node stops reporting or its database falls behind, an automatic failover is triggered. The failover node is hosted by a different provider in a different location, so issues at every level (machine, hosting provider, location) are covered. We shared our approach on GitHub [1] so that anyone can validate it and other SPOs can make use of it; that way we hope to contribute to better overall network stability. Our mechanism triggers a failover after 10 minutes. This avoids failing over for short-term unavailability, but leaves a ~2% risk of missing a block during that timeframe, in which case our block insurance would cover the missed rewards.
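
The failover rule described above can be sketched as a small state machine running in the external monitoring service. This is a simplified illustration, not the code from the GitHub repository [1]: the `Heartbeat` and `FailoverState` types, the 60-second heartbeat timeout and the 5-minute tip-age limit are assumptions; only the 10-minute failover delay comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

FAILOVER_AFTER_S = 10 * 60    # from the article: fail over after 10 minutes
HEARTBEAT_TIMEOUT_S = 60      # assumption: a heartbeat older than this counts as missing
MAX_TIP_AGE_S = 5 * 60        # assumption: a tip older than this means "outdated DB"


@dataclass
class Heartbeat:
    node: str
    received_at: float   # Unix time the monitoring service received it
    tip_age_s: float     # age of the node's chain tip, as reported by the node


@dataclass
class FailoverState:
    primary: str
    standby: str
    unhealthy_since: Optional[float] = None  # when the primary first looked bad


def is_healthy(hb: Optional[Heartbeat], now: float) -> bool:
    """A node is healthy if it reported recently and its tip is current."""
    if hb is None:
        return False
    return now - hb.received_at <= HEARTBEAT_TIMEOUT_S and hb.tip_age_s <= MAX_TIP_AGE_S


def evaluate(state: FailoverState, primary_hb: Optional[Heartbeat],
             standby_hb: Optional[Heartbeat], now: float) -> FailoverState:
    """Return the next state; promote the standby only after the primary has
    been unhealthy for FAILOVER_AFTER_S seconds without interruption."""
    if is_healthy(primary_hb, now):
        # Any recovery resets the timer, so short blips never cause a failover.
        return FailoverState(state.primary, state.standby, None)
    since = state.unhealthy_since if state.unhealthy_since is not None else now
    if now - since >= FAILOVER_AFTER_S and is_healthy(standby_hb, now):
        # Swap roles; the old primary must stop producing blocks from now on.
        return FailoverState(state.standby, state.primary, None)
    return FailoverState(state.primary, state.standby, since)
```

Requiring the primary to stay unhealthy for the full window, rather than reacting to a single missed heartbeat, is what filters out short outages, and checking the standby before promoting it avoids failing over onto a broken node. A real setup also has to make sure the demoted node actually stops producing blocks.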

Please note that some pools simply run multiple block producers at the same time, which causes a slot battle every time the pool mints a block. There is no penalty for doing so, but we consider it bad practice: we would rather accept a ~2% risk that we have to cover ourselves than pollute the blockchain with garbage blocks.
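
How large the risk of missing a block during a 10-minute failover window is depends on how many blocks the pool expects per epoch. Assuming independent 1-second slots and 432,000 slots per epoch, the probability that at least one of the pool's leader slots falls inside an outage window can be sketched as follows; the function name and the example inputs are hypothetical.

```python
EPOCH_LENGTH_SLOTS = 432_000  # mainnet: 5 days of 1-second slots


def p_block_in_window(expected_blocks_per_epoch: float, window_s: int) -> float:
    """Probability that at least one of the pool's leader slots falls inside an
    outage window of window_s seconds.

    Assumes each 1-second slot independently makes the pool leader with
    probability expected_blocks_per_epoch / EPOCH_LENGTH_SLOTS.
    """
    p_slot = expected_blocks_per_epoch / EPOCH_LENGTH_SLOTS
    return 1.0 - (1.0 - p_slot) ** window_s
```

Under this model, a pool expecting 10 blocks per epoch has about a 1.4% chance of a leader slot in a 600-second window, and the figure grows roughly in proportion to pool size while it stays small.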

[1] Failover Approach on Github: https://github.com/ResponsibleStaking/Cardano-Heartbeat-Failover

Security, Maintenance, Operations, Monitoring, Alerting

All of these aspects can also affect availability, but more importantly, they can expose Cardano itself to threats. Our measures include VPN-only access, multi-factor authentication, unattended updates to close known vulnerabilities in standard software, basic Linux hardening such as no root access and Fail2Ban, and running Cardano under a minimally privileged service user.
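
As one concrete example of basic Linux hardening, a script can check that SSH does not allow root logins. The sketch below parses `sshd_config` text and reports deviations from an expected policy. `PermitRootLogin` and `PasswordAuthentication` are real OpenSSH options, and sshd uses the first value it obtains for a keyword; the expected values (in particular key-only login) are assumptions rather than VITAL's published configuration, and the parser deliberately ignores `Include` directives and stops at the first `Match` block.

```python
# Expected values for two real OpenSSH sshd_config options. "PermitRootLogin no"
# reflects the article's "no root access"; key-only login is an assumption.
EXPECTED = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",
}


def audit_sshd_config(text: str) -> dict:
    """Return {option: actual value or None} for each expected option that is
    missing or set differently.

    Keywords are case-insensitive, and sshd uses the first value it obtains for
    a keyword, so later duplicates are ignored. Simplification: Include
    directives are not followed and parsing stops at the first Match block.
    """
    seen = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)
        key = parts[0].lower()
        if key == "match":
            break
        value = parts[1].strip() if len(parts) > 1 else ""
        seen.setdefault(key, value)           # first occurrence wins
    return {k: seen.get(k) for k, want in EXPECTED.items() if seen.get(k) != want}
```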

From a networking perspective, the Cardano service port on our relays is the only publicly reachable endpoint. All other access, for maintenance and monitoring, is possible only through VPN connections.
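
This exposure rule amounts to a default-deny inbound policy. Here is a hedged sketch of the decision logic; port 3001 is only a commonly used cardano-node port and the VPN subnet `10.8.0.0/24` is a placeholder, so both are assumptions. In practice the same rule would be expressed as firewall rules (for example with nftables or ufw) rather than application code.

```python
import ipaddress

# Placeholders: 3001 is a commonly used cardano-node port, but every operator
# picks its own; the VPN subnet is a hypothetical example.
RELAY_NODE_PORT = 3001
VPN_SUBNET = ipaddress.ip_network("10.8.0.0/24")


def allow_inbound(src_ip: str, dst_port: int, proto: str = "tcp") -> bool:
    """Default-deny inbound policy for a relay.

    The cardano-node TCP port is open to everyone (other relays must reach it);
    everything else, such as SSH or metrics endpoints, is reachable only from
    addresses inside the VPN subnet.
    """
    if proto == "tcp" and dst_port == RELAY_NODE_PORT:
        return True
    return ipaddress.ip_address(src_ip) in VPN_SUBNET
```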

Alerting is based on the DB age (the age of the node's tip), a very comprehensive check of whether the node is working properly. Additionally, we use Grafana to inspect any anomalies. Time synchronization in particular is an important factor that needs to be part of any Cardano monitoring setup.
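
A DB-age alert can be derived from the slot number of the node's tip, for example the `slot` value shown by `cardano-cli query tip`. On mainnet, every slot since the Shelley hard fork (slot 4,492,800 at 2020-07-29 21:44:51 UTC) lasts one second, so a slot converts to Unix time by adding 1,591,566,291. The sketch below uses that conversion; the alert threshold and the clock-skew tolerance are assumptions. It also shows why time synchronization matters: the age is measured against the local clock, so a skewed clock makes a healthy node look stale or makes its tip appear to lie in the future.

```python
import time
from typing import Optional

# Cardano mainnet: since the Shelley hard fork (slot 4,492,800 at
# 2020-07-29 21:44:51 UTC) every slot lasts 1 s, so unix_time = slot + offset.
SHELLEY_START_SLOT = 4_492_800
MAINNET_SLOT_OFFSET = 1_591_566_291

TIP_AGE_ALERT_S = 5 * 60       # assumption: alert threshold
CLOCK_SKEW_TOLERANCE_S = 5     # assumption: tolerated "future" tip age


def slot_to_unix(slot: int) -> int:
    """Convert a post-Shelley mainnet slot number to a Unix timestamp."""
    if slot < SHELLEY_START_SLOT:
        raise ValueError("Byron-era slots lasted 20 s; not handled here")
    return slot + MAINNET_SLOT_OFFSET


def check_tip(tip_slot: int, now: Optional[float] = None) -> Optional[str]:
    """Return an alert message, or None if the tip age looks normal."""
    now = time.time() if now is None else now
    age = now - slot_to_unix(tip_slot)   # measured against the LOCAL clock
    if age < -CLOCK_SKEW_TOLERANCE_S:
        return f"tip is {-age:.0f}s in the future: local clock is probably behind"
    if age > TIP_AGE_ALERT_S:
        return f"tip is {age:.0f}s old: node may be stuck or out of sync"
    return None
```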

And finally, we run daily backups and take snapshots before and after any change, to be prepared for a worst-case disaster recovery scenario. All updates are rolled out one node at a time and, of course, timed so they never coincide with assigned slots.
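
Timing updates around assigned slots can be sketched as a search for the earliest maintenance window that keeps a safety margin around every leader slot. The leader slots are assumed to come from the pool's leader schedule, which this sketch does not compute, and the 15-minute margin and function names are assumptions.

```python
from typing import Iterable

SAFETY_MARGIN_S = 15 * 60   # assumption: stay 15 minutes clear of leader slots


def window_is_safe(start_slot: int, duration_s: int, leader_slots: Iterable[int],
                   margin_s: int = SAFETY_MARGIN_S) -> bool:
    """True if the padded window [start - margin, start + duration + margin]
    contains none of the pool's leader slots (1 slot = 1 s on mainnet)."""
    lo, hi = start_slot - margin_s, start_slot + duration_s + margin_s
    return not any(lo <= s <= hi for s in leader_slots)


def next_safe_start(earliest_slot: int, duration_s: int, leader_slots: Iterable[int],
                    margin_s: int = SAFETY_MARGIN_S) -> int:
    """Earliest start slot >= earliest_slot whose padded window is safe.

    Walks the sorted leader slots once and jumps past each conflicting one
    instead of testing every candidate slot.
    """
    start = earliest_slot
    for s in sorted(leader_slots):
        if s + margin_s < start:
            continue                      # lies before the padded window
        if s - margin_s <= start + duration_s:
            start = s + margin_s + 1      # conflict: restart just after its margin
        else:
            break                         # this and all later slots are after the window
    return start
```

With leader slots at 10,000 and 20,000, a 10-minute update requested at slot 9,000 is pushed to slot 10,901, just past the margin after the first leader slot. In a rolling update, each node would get its own window in turn.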

What's next?

Next time I’ll give some insight into our rewards approach!
