nanog mailing list archives

Re: Global Akamai Outage

From: Mark Tinka <mark@tinka.africa>
Date: Tue, 27 Jul 2021 16:10:12 +0200



On 7/26/21 19:04, Lukas Tribus wrote:

rpki-client can only remove outdated VRP's, if it a) actually runs and
b) if it successfully completes a validation cycle. It also needs to
do this BEFORE the RTR server distributes data.

If rpki-client for whatever reason doesn't complete a validation cycle
[doesn't start, crashes, cannot write to the file] it will not be able
to update the file, which stayrtr reads and distributes.

Have you had any odd experiences with rpki-client running? The fact that it's not a daemon suggests that it is less likely to bomb out (even though that could happen as a runtime binary, but one can reliably test for that with any effected changes).

Of course, rpki-client depends on Cron being available and stable, and over the years, I have not run into any major issues guaranteeing that.

So if you've seen some specific outage scenarios with it, I'd be keen to hear about them.

If your VM went down with both rpki-client and stayrtr, and it stays
down for 2 days (maybe a nasty storage or virtualization problem or
maybe this is just a PSU failure in a SPOF server), when the VM comes
back up, stayrtr will read and distribute 2 days old data - after all -
rpki-client is a periodic cronjob while stayrtr will start
immediately, so there will be plenty of time to distribute obsolete
VRP's. Just because you have another validator and RTR server in
another region that was always available, doesn't mean that the
erroneous and obsolete data served by this server will be ignored.


This is a good point.

So I know that one of the developers of StayRTR is working on having it use the "expires" values that rpki-client inherently possesses to ensure that StayRTR never delivers stale data to clients. If this works, while it does not eliminate the need for some degree of monitoring, it certainly makes it less of a hassle, going forward.
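
To illustrate the idea, here is a rough sketch of expiry-based filtering. This is not StayRTR's code. It assumes each VRP entry carries an "expires" field holding a Unix timestamp, and the function name and the sample entries (documentation prefixes and ASNs) are made up for the example.

```python
import time


def drop_expired_vrps(vrps, now=None):
    """Return only the VRPs whose 'expires' timestamp is still in the future.

    Assumption: 'expires' is a Unix timestamp in seconds. Entries without
    the field are kept, since nothing marks them as stale.
    """
    now = time.time() if now is None else now
    return [v for v in vrps if v.get("expires", float("inf")) > now]


# Hypothetical sample data, for illustration only.
sample = [
    {"asn": 64496, "prefix": "192.0.2.0/24", "maxLength": 24, "expires": 1000},
    {"asn": 64497, "prefix": "198.51.100.0/24", "maxLength": 24, "expires": 5000},
]

fresh = drop_expired_vrps(sample, now=2000)
print(fresh)
```

With a check like this in the RTR server, a VM that comes back after two days would withdraw the expired entries instead of serving them, even before the next validation run completes.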

There are more reasons and failure scenarios why this 2 piece setup
(periodic RPKI validation, separate RTR daemon) can become a "split
brain". As you implement more complicated setups (a single global RPKI
validation result is distributed to regional RTR servers - the
cloudflare approach), things get even more complicated. Generally I
prefer the all in one approach for these reasons (FORT validator).

At least if it crashes, it takes down the RTR server with it:

https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163


But I have to emphasize that all those are just examples. Unknown bugs
or corner cases can lead to similar behavior in "all in one" daemons
like Fort and Routinator. That's why specific improvements absolutely
do not mean we don't have to monitor the RTR servers.


Agreed.
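
As a minimal sketch of the kind of monitoring meant here, the check below flags the validator's output file as stale when it has not been rewritten recently, or when it is missing altogether. The function name, the path and the threshold are assumptions for illustration, not part of rpki-client or any RTR server.

```python
import os
import time


def output_is_stale(path, max_age_seconds, now=None):
    """Return True if the file is missing or older than max_age_seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as stale.
        return True
    return (now - mtime) > max_age_seconds


if __name__ == "__main__":
    # Hypothetical path and threshold; adjust to the local setup.
    if output_is_stale("/var/db/rpki-client/json", 2 * 3600):
        print("WARNING: validator output is stale or missing")
```

Run from the monitoring system rather than from the same Cron job as the validator, this catches the cases where the validator does not start, crashes, or cannot write its output.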

I've had my fair share of Fort issues in the past month, all of which have been fixed and a new release is imminent, so I'm happy.

I'm currently running both Fort and rpki-client + StayRTR. At a basic level, they both send the exact same number of VRP's toward clients, likely because they share a philosophy in validation schemes, and crypto libraries.
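
Matching counts do not by themselves prove the two sets are identical, so a set comparison is a slightly stronger check. The sketch below assumes both VRP lists have already been loaded as dictionaries with "asn", "prefix" and "maxLength" keys. Those field names, the helper names and the sample entries are assumptions for the example, not any validator's documented format.

```python
def vrp_key(vrp):
    """Normalise one VRP into a comparable (asn, prefix, maxLength) tuple."""
    asn = str(vrp["asn"]).upper()
    if asn.startswith("AS"):
        asn = asn[2:]  # accept both 64496 and "AS64496"
    return (int(asn), vrp["prefix"].lower(), int(vrp["maxLength"]))


def diff_vrps(a, b):
    """Return (only_in_a, only_in_b) as sets of normalised VRP tuples."""
    set_a = {vrp_key(v) for v in a}
    set_b = {vrp_key(v) for v in b}
    return set_a - set_b, set_b - set_a


# Hypothetical sample data: same count on both sides, different content.
validator_a = [
    {"asn": "AS64496", "prefix": "192.0.2.0/24", "maxLength": 24},
    {"asn": 64497, "prefix": "198.51.100.0/24", "maxLength": 24},
]
validator_b = [
    {"asn": 64496, "prefix": "192.0.2.0/24", "maxLength": 24},
    {"asn": 64498, "prefix": "203.0.113.0/24", "maxLength": 24},
]

only_a, only_b = diff_vrps(validator_a, validator_b)
print(only_a, only_b)
```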


We're getting there.

Mark.
