Why am I getting DNS lookup failures from outbound requests?
You're seeing intermittent issues where outbound requests throw errors similar to:
could not translate host name $HOSTNAME to address: Name or service not known
Error: getaddrinfo ENOTFOUND $HOSTNAME
Failed to open TCP connection to $HOSTNAME (getaddrinfo: Name or service not known)
These errors are typically caused by transient networking conditions, either on the host machine running the affected dyno or on the network in general (e.g., packet loss). Unfortunately, there's usually not much that can be done to prevent them. You can work around them, however.
Implementing retry behavior for all of your outbound requests and setting fairly short timeouts will help here. Note that most HTTP libraries have a general "timeout" setting as well as a separate, more specific "TCP connect timeout"; both should be configured. In some cases, retries like this will be sufficient to solve the problem.
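As a rough illustration of retries combined with short timeouts, here is a minimal Python sketch using only the standard library. The names `with_retries` and `open_connection`, the attempt count, the backoff delays, and the timeout values are all assumptions chosen for the example, not recommended settings; your HTTP library will have its own options for the same ideas.

```python
import socket
import time

# Errors treated as transient in this sketch (an assumption; adjust to taste):
# DNS lookup failures, timeouts, and refused/reset connections.
TRANSIENT_ERRORS = (socket.gaierror, socket.timeout, ConnectionError)


def with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation() up to `attempts` times, backing off between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise  # out of retries; let the caller see the last error
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
            sleep(base_delay * 2 ** (attempt - 1))


def open_connection(hostname, port=443, connect_timeout=2.0, read_timeout=5.0):
    """Open a TCP connection using a short connect timeout and retries."""
    sock = with_retries(
        lambda: socket.create_connection((hostname, port), timeout=connect_timeout)
    )
    # Separate timeout for reads and writes once the connection is open.
    sock.settimeout(read_timeout)
    return sock
```

Wrapping any outbound call in `with_retries` (for example `with_retries(lambda: open_connection("example.com"))`) keeps the retry policy in one place. Only errors listed in `TRANSIENT_ERRORS` are retried, so unrelated bugs still surface immediately.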
If that doesn't solve the problem, another workaround takes advantage of the fact that these issues are frequently local to the host machine your dyno is on: rescue these errors in your application code and keep a running count. Once the count exceeds a certain number of consecutive errors, you can force the affected dyno to "crash" by doing the equivalent of an exit(0) in whatever language you're using. Our dyno manager will detect this "crash" and start booting your application on another server. Repeatedly crashing apps can have restart throttling applied, but because this issue is localized to a host, it is relatively unlikely to occur multiple times in a row on separate servers. If you take this path, make sure your logging is sufficient for you to see what happened later if necessary.
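The counting approach could be sketched in Python roughly as follows. The class name `DnsFailureWatchdog`, the threshold of 5, and the logger name are hypothetical. The default exit status of 1 is also an assumption (a conventional failure status); pass `exit_status=0` if you want to match the text above literally.

```python
import logging
import socket
import sys

logger = logging.getLogger("dns_watchdog")  # hypothetical logger name


class DnsFailureWatchdog:
    """Count consecutive DNS lookup failures and exit once too many pile up."""

    def __init__(self, max_consecutive_failures=5, exit_status=1, exit_fn=sys.exit):
        self.max_consecutive_failures = max_consecutive_failures
        self.exit_status = exit_status  # assumption: non-zero by default
        self.exit_fn = exit_fn          # injectable so the logic can be tested
        self.consecutive_failures = 0

    def call(self, operation):
        """Run operation(), tracking DNS failures around it."""
        try:
            result = operation()
        except socket.gaierror as error:
            self._record_failure(error)
            raise
        # Any success resets the streak, so only consecutive errors count.
        self.consecutive_failures = 0
        return result

    def _record_failure(self, error):
        self.consecutive_failures += 1
        logger.warning(
            "DNS lookup failed (%d in a row): %s", self.consecutive_failures, error
        )
        if self.consecutive_failures > self.max_consecutive_failures:
            # Log before exiting so the reason is visible afterwards.
            logger.error("Too many consecutive DNS failures; exiting process")
            self.exit_fn(self.exit_status)
```

A single shared instance would wrap each outbound call, for example `watchdog.call(lambda: socket.getaddrinfo("example.com", 443))`. Note that `sys.exit` raises `SystemExit` only in the calling thread, so a multi-threaded server may need a different `exit_fn` (such as `os._exit`) to actually stop the process.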