Latest oddities I ran into with Google Cloud products before I start to forget about them again.
e2 Compute Instances vs CloudNAT
Years ago I already had a surprising encounter with the Google Cloud e2 instances. Back then we observed CPU steal time of 20-60%, which made the instances unusable for anything remotely latency sensitive. Now someone started to run a workload which creates many outbound connections to the same destination IP:port. To connect to the internet we utilize the Google Cloud product "CloudNAT", which implements a NAT solution somewhere in the network layer.
Starting the workload led after a few seconds to all sorts of connection issues, and of course to logs from CloudNAT that it dropped connections.
The simplest reproducer I could find was

while true; do curl http://sven.stormbind.net; done

which already led to connection drops on CloudNAT.
We stared a bit at the output of gcloud compute routers get-nat-mapping-info our-default-natgw, but allocating additional ports still worked fine in general. Further investigation led to two differences between a project which was fine and those that failed:
- c2d or n2d machine types instead of e2 and
- usage of gVNIC.
Moving away from the e2 instances instantly fixed our issue. Connection drops could still be observed on CloudNAT only if we set the min_ports_per_vm value too low and it could not allocate new ports in time. Thus we did some additional optimizations (a terraform sketch of the result follows below):
- raised min_ports_per_vm to 256
- raised max_ports_per_vm to 32768 (the sensible maximum because CloudNAT will always double its allocation)
- set nat_tcp_timewait_sec to 30 (default is 120); reclaim of ports runs only every 30s, thus ports can be re-used after 30-60s
See also upstream documentation regarding timeouts.
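To give an idea how those settings look in terraform, here is a minimal sketch of a google_compute_router_nat resource; the resource, router, and region names are made up, and note that the timewait knob is called tcp_time_wait_timeout_sec in the google provider:

resource "google_compute_router_nat" "natgw" {
  name   = "our-default-natgw"
  router = google_compute_router.default.name
  region = "europe-west1" # assumption, use your region

  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"

  # max_ports_per_vm is only valid with dynamic port allocation,
  # which in turn requires endpoint independent mapping to be off
  enable_dynamic_port_allocation      = true
  enable_endpoint_independent_mapping = false
  min_ports_per_vm                    = 256
  max_ports_per_vm                    = 32768

  # 30s timewait instead of the 120s default
  tcp_time_wait_timeout_sec = 30
}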
To complete the setup alignment we also enabled gVNIC on all GKE pools. A noteworthy detail a colleague figured out: if you use terraform to provision GKE pools, make sure to use at least google provider v6.33.0 to avoid a re-creation of your node pool.
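A minimal sketch of the gVNIC part in terraform, assuming a google_container_node_pool resource; cluster and pool names are hypothetical:

# google provider >= 6.33.0, otherwise enabling gVNIC re-creates the pool
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 6.33.0"
    }
  }
}

resource "google_container_node_pool" "pool" {
  name    = "worker-pool" # hypothetical name
  cluster = google_container_cluster.gke.id

  node_config {
    machine_type = "n2d-standard-8" # moved away from e2, see above
    gvnic {
      enabled = true
    }
  }
}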
GKE LoadBalancer Force allPorts: true on Forwarding Rule
Technically it's possible to configure a forwarding rule to listen on some or all ports. That gets more complicated if you do not configure the forwarding rule via terraform or the gcloud cli, but use a GKE resource kind: Service with spec.type: LoadBalancer. The logic documented by Google Cloud is that the forwarding rule will have per-port configuration if it's five or less ports, and above that it will open all ports. Sadly that does not work, e.g. in cases where you have an internal load balancer and a serviceAttachment attached to the forwarding rule. In my experience reconfiguring was also unreliable in cases without a serviceAttachment and required a manual deletion of the service load balancer to have the operator reconcile it and create it correctly.
Given that we wanted to have all ports open, to allow us to dynamically add more ports on a specific load balancer, but there is no annotation for that, I worked around with this beauty:
ports:
- name: dummy-0
  port: 2342
  protocol: TCP
  targetPort: 2342
- name: dummy-1
  port: 2343
  protocol: TCP
  targetPort: 2343
- name: dummy-2
  port: 2344
  protocol: TCP
  targetPort: 2344
- name: dummy-3
  port: 2345
  protocol: TCP
  targetPort: 2345
- name: service-1
  port: 4242
  protocol: TCP
  targetPort: 4242
- name: service-2
  port: 4343
  protocol: TCP
  targetPort: 4343
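For comparison, if you manage such a forwarding rule yourself in terraform instead of via GKE, all_ports is the attribute the workaround above forces the operator to set; a sketch for an internal passthrough load balancer, all names hypothetical:

resource "google_compute_forwarding_rule" "ilb" {
  name                  = "my-internal-lb" # hypothetical name
  region                = "europe-west1"   # assumption
  load_balancing_scheme = "INTERNAL"
  backend_service       = google_compute_region_backend_service.be.id
  network               = google_compute_network.vpc.id
  subnetwork            = google_compute_subnetwork.subnet.id
  ip_protocol           = "TCP"

  # instead of listing up to five ports via the ports attribute,
  # open the forwarding rule for all ports
  all_ports = true
}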
If something in that area does not work out there are basically two things to check:
- Is the port open on the forwarding rule, i.e. is the forwarding rule configured with allPorts: true? gcloud compute forwarding-rules describe shows that quickly.
- Did the VPC firewall rule created by the service operator in GKE get updated to open all required ports? (See the sketch after this list.)
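For the second point, the rule in question is a plain VPC firewall rule maintained by the service operator. For illustration, roughly the equivalent in terraform; names and source ranges are made up, and in practice GKE manages this rule, not us:

resource "google_compute_firewall" "lb_ports" {
  name    = "allow-lb-service-ports" # hypothetical, GKE uses k8s-fw-... names
  network = google_compute_network.vpc.id

  allow {
    protocol = "tcp"
    # the ports from the Service above, dummies included
    ports = ["2342-2345", "4242", "4343"]
  }

  # assumption: restrict to the clients that should reach the load balancer
  source_ranges = ["10.0.0.0/8"]
}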
Rate Limiting with Cloud Armor on Global TCP Proxy Load Balancer
According to Google Cloud support, rate limiting on a TCP proxy is a preview feature. That seems to be the excuse for why it's all very inconsistent right now, but it works.
- The Google Cloud Web Console is 100% broken and unable to deal with it. Don't touch it via the web.
- If you configure an exceed_action in a google_compute_security_policy terraform resource you must use a value with a response code, e.g. exceed_action = "deny(429)". The response code will be ignored. In all other cases I know of you must use a deny without a response code if you want to be able to assign the policy to a L3/L4 load balancer.
- If you use config-connector (kcc) you can already use exceedAction: deny, albeit it's not documented, neither for config-connector itself nor for the API.
- If you use the gcloud cli you can use --exceed-action=deny, which is already documented if you call gcloud beta compute security-policies create --help, but it also works in the non-beta mode. Also export / import via the gcloud cli work with a deny without defining a response code.
Terraform Sample Snippet
rule {
  description = "L3-L4 Rate Limit"
  action      = "rate_based_ban"
  priority    = "2342"
  match {
    versioned_expr = "SRC_IPS_V1"
    config {
      src_ip_ranges = ["*"]
    }
  }
  rate_limit_options {
    enforce_on_key = "IP"
    # exceed_action only supports deny() with a response code
    exceed_action = "deny(429)"
    rate_limit_threshold {
      count        = 320
      interval_sec = 60
    }
    ban_duration_sec = 240
    ban_threshold {
      count        = 320
      interval_sec = 60
    }
    conform_action = "allow"
  }
}
Config-Connector Sample Snippet
- action: rate_based_ban
  description: L3-L4 Rate Limit
  match:
    config:
      srcIpRanges:
      - "*"
    versionedExpr: SRC_IPS_V1
  preview: false
  priority: 2342
  rateLimitOptions:
    banDurationSec: 240
    banThreshold:
      count: 320
      intervalSec: 60
    conformAction: allow
    enforceOnKey: IP
    exceedAction: deny
    rateLimitThreshold:
      count: 320
      intervalSec: 60