nanog mailing list archives

Re: Anycast but for egress


From: Glenn McGurrin via NANOG <nanog () nanog org>
Date: Wed, 28 Jul 2021 14:06:34 -0400

I'd had a similar thought/question, though one that keeps the geo diversity. You manage the crawlers, and from what you have stated you are making contact individually with these sites, so you don't need a one-size-fits-all list for public posting. So why not have a restricted subset of the crawlers handle the sites with these issues? That subset may be unique per site, which keeps even load balancing from becoming overly complex or limiting, especially as you are using NAT anyway: multiple servers can be behind each IP, and that number can vary. That lets you have geo diversity (or even multi-cloud diversity) for every site. Each site that needs this IP whitelisting only needs to allow 3-5 IPs, yet you can distribute load over a much larger overall set of machines and NAT gateways.
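A rough sketch of what I mean by a per-site subset, in Python. Everything here is made up for illustration: the region names, the gateway addresses (taken from the documentation ranges), and the function name `allowlist_for`. It uses rendezvous (highest-random-weight) hashing, which is only one way to get a stable per-site choice; a plain lookup table would work just as well.

```python
import hashlib

# Hypothetical NAT gateway egress IPs (documentation ranges), keyed by region.
GATEWAYS = {
    "us-east": ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
    "eu-west": ["198.51.100.10", "198.51.100.11"],
    "ap-south": ["203.0.113.10", "203.0.113.11", "203.0.113.12"],
}


def _score(site, ip):
    # Rendezvous hashing: every (site, gateway) pair gets a stable pseudo-random score.
    return hashlib.sha256(f"{site}|{ip}".encode()).hexdigest()


def allowlist_for(site, gateways=GATEWAYS, per_region=1):
    """Return region -> egress IPs that this site would be asked to allow."""
    return {
        region: sorted(ips, key=lambda ip: _score(site, ip), reverse=True)[:per_region]
        for region, ips in gateways.items()
    }


if __name__ == "__main__":
    for site in ("example.com", "example.org"):
        print(site, allowlist_for(site))
```

Each site gets one address per region (three in total here), the choice is the same on every run, and different sites land on different gateways, so the overall load still spreads across the whole pool.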

As I understand it, even CDNs that anycast TCP (externally, or internally via load balancing across routers and multipath) do something similar by first spreading load over multiple IPs at the DNS layer.

As the transition to IPv6 happens you may have it easier. On the v6 side it is much easier to get an allocation large enough to split into multiple subnets advertised from different locations without providers dropping the routes as too long a prefix. So you could give out one /36 or /40 or even /44 to whitelist, but have /48s at each location. For sites with IPv6 support that may help now, but it won't help all sites for quite some time, though the number that support v6 is slowly getting better. For the foreseeable future you still need to handle the v4 side one way or another, though.
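To make the v6 arithmetic concrete, here is a small sketch using Python's standard `ipaddress` module. The /40 is carved out of the 2001:db8::/32 documentation prefix and the location names are invented; a real allocation would come from your registry.

```python
import ipaddress

# Assumed aggregate handed to site owners for whitelisting (documentation space).
aggregate = ipaddress.ip_network("2001:db8::/40")

# Hypothetical locations, each announcing its own /48 out of the aggregate.
locations = ["us-east", "eu-west", "ap-south"]
per_location = dict(zip(locations, aggregate.subnets(new_prefix=48)))

# A /40 holds 2**(48 - 40) = 256 possible /48s.
available = sum(1 for _ in aggregate.subnets(new_prefix=48))


def is_ours(addr):
    """What a site owner's check amounts to: one prefix match on the aggregate."""
    return ipaddress.ip_address(addr) in aggregate


if __name__ == "__main__":
    for name, net in per_location.items():
        print(name, net)
    print("available /48s:", available)
    print(is_ours("2001:db8:2::1234"))
```

The site owner only ever sees the single /40, while each location announces its own /48, which is generally the longest v6 prefix providers will accept.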

On 7/28/2021 10:21 AM, William Herrin wrote:
On Wed, Jul 28, 2021 at 6:04 AM Vimal <j.vimal () gmail com> wrote:
My intention is to run a web-crawling service on a public cloud. This service
is geographically distributed, and therefore will run in multiple regions
around the world inside AWS... this means there will be multiple AWS VPCs,
each with their own NAT gateway, and traffic destined to websites
that we crawl will appear to come from this NAT gateway's IP address.

Hello,

AWS does not provide the ability to attach anycasted IP addresses to a
NAT gateway, regardless of whether it would work, so that's the end of
your quest.

The reason I want a predictable IP is to communicate this IP to website
owners so they can allow access from these IPs into their networks.
I chose IP as an example; it can also be a subnet, but what I don't want to
provide is a list of 100 different IP addresses without any predictability.

If you bring your own IP addresses, you can attach separate /24s of
them to your VPCs in each region, providing you with a single
predictable range of source addresses. You will find it difficult and
expensive to acquire that many IP addresses from the regional
registries for the purpose you describe.


Silly question but: for a web crawler, why do you care whether it has
the limited geographic distribution that a cloud service provides?
It's a parallel batch task. It doesn't exactly matter whether you have
minimum latency.

Regards,
Bill Herrin




