Persistent http/s connection issues

cafecubano
Investigator

Hello,

I have been having this strange issue for over a month now. I have talked it through thoroughly with EE CS and have had EE and Openreach engineers out. Funnily enough, no one I have spoken to seems to know what HTTP is or why it is different from ICMP, bare TCP, etc.

I thought I would write out what is happening and the investigation steps here in case anyone can help or anyone else is having the same issue.

The issue manifests itself as HTTP/HTTPS requests timing out, which makes normal use of the internet impossible. Occasionally some requests go through normally, and the degree of degradation varies, but often more than 90% of requests fail over a given time period. This occurs on every device, connected via wifi or ethernet, and on every OS I've tried: Windows, macOS, Linux, Android. These devices all have different DNS setups and DNS resolution works fine, so I don't think DNS is the issue.

The interesting thing is that other application-level protocols work, such as: ssh, dns, wireguard.
ICMP ping works fine, and I suspect this is why EE cannot see any issue from their side. 
This means I can tunnel all traffic from one of my devices to a wireguard exit node and then the connection works as normal.
I can ping devices over the wan (icmp), and I can ssh to them also. Seeing as these protocols use both UDP and TCP, it doesn't seem to be an issue with one of those.

I have been through all of the basic troubleshooting steps such as restarting, factory reset, new router, etc.
Our hub is a Smart hub 6.
Interestingly, I think the start of this problem coincided with the firmware update to our router. (the update that forced https connection when using the hub's homepage)

The only non-standard thing on our network is that we have been using DMZ and now port forwarding, as we have a server we need to connect to over the WAN. Initially, we were using DMZ, but after the issue started, we tried port forwarding instead. This appeared to resolve the issue at the time, but then it came back a week later. It seems that any factory reset temporarily fixes the issue for a few days, then it comes back. A restart normally results in normal service for about 20 minutes, then relapses.

Things I'm going to try next:

  • use a different modem and router
  • try to inspect an http request from both ends (send and receive) and perhaps tcpdump

I will keep updating the post as I get more information.

1 SOLUTION

Accepted Solutions
cafecubano
Investigator

I have been running a separate modem (HG612) and my own router/WAP for a few weeks now and this seems to have resolved the problem. So it seems there may be a bug in the latest EE hub 6 firmware related to NAT tables, or something that interacts with DMZ/port forwarding. Thanks to Jacob for the detailed answers and insight. From what I could find, there doesn't seem to be any way to report a bug to EE, so I will just have to wait and see if they fix it. If anyone else is having the same problem, I recommend dumping the EE hub and using something else.


16 REPLIES
cafecubano
Investigator

Ok I have tried to inspect the traffic from both sides. It seems like this may actually be a tcp issue.

Example of the issue (no code formatting here sorry):
$ curl -Ivv 87.106.56.27:8080
13:16:45.760872 [0-0] * [SETUP] added
13:16:45.760906 [0-0] * Trying 87.106.56.27:8080...
13:16:45.760954 [0-0] * [SETUP] Curl_conn_connect(block=0) -> 0, done=0
13:16:46.761273 [0-0] * [SETUP] Curl_conn_connect(block=0) -> 0, done=0
13:16:47.761629 [0-0] * [SETUP] Curl_conn_connect(block=0) -> 0, done=0
13:16:48.761955 [0-0] * [SETUP] Curl_conn_connect(block=0) -> 0, done=0
13:16:49.762100 [0-0] * [SETUP] Curl_conn_connect(block=0) -> 0, done=0
13:16:50.762413 [0-0] * [SETUP] Curl_conn_connect(block=0) -> 0, done=0
13:16:51.762734 [0-0] * [SETUP] Curl_conn_connect(block=0) -> 0, done=0
13:16:52.762892 [0-0] * [SETUP] Curl_conn_connect(block=0) -> 0, done=0
13:16:53.763201 [0-0] * [SETUP] Curl_conn_connect(block=0) -> 0, done=0
13:16:54.763538 [0-0] * [SETUP] Curl_conn_connect(block=0) -> 0, done=0

This carries on until the connection eventually goes through (in this case after 35 seconds, long enough for most applications to call it a timeout). On the server at the other end, tcpdump shows nothing arriving until the very end, when it receives a SYN and replies with SYN-ACK, and the connection then goes through. So it seems lots of packets are being lost in between.
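
One way to put a number on this without involving HTTP at all is to time only the TCP handshake, over and over. The sketch below is a rough illustration rather than a proper diagnostic tool: `probe_connects` and `failure_rate` are names made up for this post, and the defaults (10 attempts, 5 second timeout) are arbitrary.

```python
import socket
import time


def probe_connects(host, port, attempts=10, timeout=5.0, pause=0.0):
    """Try to open a TCP connection `attempts` times.

    Returns a list with one entry per attempt: the handshake time in
    seconds, or None if the connect failed or timed out.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection returns once the three-way handshake is done
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:
            # covers timeouts, refusals and unreachable-network errors
            results.append(None)
        time.sleep(pause)
    return results


def failure_rate(results):
    """Fraction of attempts that never completed the handshake."""
    if not results:
        return 0.0
    return sum(1 for r in results if r is None) / len(results)
```

Something like `failure_rate(probe_connects("87.106.56.27", 8080, attempts=20))` against the same VPS as the curl test should then give a rough failure percentage that can be compared before and after a router restart.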

cafecubano
Investigator

I have swapped out the EE router for a separate modem and my own router, logged in via PPPoE, and this seems to be working currently.

I notice you mention "from both ends" in a couple of your posts, so is the destination address on the same network/host as your source?

You also mentioned it works with a VPN I think?
I hit a similar issue yesterday with a self hosted website and I believe my issue was related to hairpin NAT.

I was able to ping the destination IP without issues, but was unable to connect to the website using its IP address from the machine hosting the site or from another PC on the same network. It was accessible on mobile, though, and trying a VPN allowed the site to work flawlessly.
I tried multiple recycles of the router, checking firewall ports and nothing really seemed to resolve it. I think the DHCP lease on the external address of the router wasn't expiring for the first few, but I did eventually get a new external IP address.

I left it overnight then tried again this morning, and touch wood it now appears to work (mostly) for the site I host, other than for the Edge browser and I've no idea why that should be different from Google Chrome, but it is.
Other options that may be worth trying (assuming your situation is the same as or similar to mine): add a local hosts file entry that resolves to an internal IP address so traffic doesn't go via the router, or potentially set up the machine in your DMZ.
Other than that look at the hairpin NAT problem - it definitely sounds like it's router related though imho.
Good luck


MisterWoody_
Investigator

Oh - one other idea, if the firmware was updated (and this may not be a popular idea) - make sure you have the complete config of the smart hub documented and do a factory reset.
I've had issues with Netgear router stability in the past, even with the latest firmware, where connections would drop after about 20 minutes or so.
I eventually caved in and did a factory reset and reconfigured the router, and since then I've had a rock solid connection - and I think this was over 4 years ago now.

HTH

Hi MisterWoody,

Appreciate the answers! Factory reset was one of the earlier things we tried, and this appeared to resolve the problem for a couple of days, but it soon came back. As for the hairpin NAT, that is interesting, as we do often hit our external IP from inside the LAN as a way to access self-hosted services. Since the problem has not recurred so far with the EE hub removed from the equation, I suspect a firmware bug/regression introduced in a recent version that may interact with how we use the network (frequently hitting the external IP from within). Also, since removing the EE hub, we are no longer doing a double NAT, which we were before.

Sorry, "from both ends" meant I set up a VPS externally and was using that to help investigate the connection.

Jacob_
Investigator

This doesn’t sound like “HTTP being weird” at all — it sounds like new TCP connections are intermittently failing (specifically SYNs getting dropped), which shows up hardest as web browsing because browsers open loads of short-lived connections.
A few things jump out from what you’ve written:

  • The curl -vv hanging at Trying ... plus the server seeing nothing until a late SYN finally arrives strongly suggests the SYNs aren’t reaching the server. That’s usually CPE/router state/NAT/conntrack rather than the line itself.
  • The fact that WireGuard/SSH keep working fits that too: long-lived sessions often stay up while new sessions fail.
  • “Works for a bit after reboot / works longer after factory reset then degrades” is classic “something fills up or goes buggy over time” behaviour (conntrack table, DoS protection, NAT loopback edge case, etc.).
  • Given you mentioned you frequently access your own services via your external IP from inside the LAN, I’d specifically look at hairpin NAT / NAT loopback. Some ISP routers/hubs are terrible at that, and it can cause really odd state-table churn.
What I’d suggest trying:

  • Stop hairpinning as a test: use internal DNS overrides so internal clients resolve your domain to a LAN IP instead of your WAN IP. Even a quick hosts-file test on one machine can prove it.
  • If the hub has any “security / DoS / SYN flood protection / advanced firewall” toggles, try disabling them. Same for UPnP and anything like “smart firewall”.
  • If you can, swap the hub out entirely: ONT/modem > your own router via PPPoE (or router directly if you’re on FTTP). If the issue vanishes, that’s basically confirmation it’s the hub/firmware.

Your last update is the biggest clue though: if it’s stable with a separate modem + your own router, I’d honestly keep that setup and just put the EE hub back in only when you need EE support to run their standard checks. If you post what access type you’re on (FTTP/ONT or FTTC/VDSL) and whether IPv6 is enabled, it might help a little more too.
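
For the hairpin test suggested above, a quick check that doesn't even need a hosts-file edit is to connect straight to the server's LAN address while still presenting the public hostname. This is only a sketch under assumptions: `head_via_lan_ip` is a made-up helper, it handles plain HTTP only, and the address, port and hostname in the usage note are placeholders rather than anything from this thread.

```python
import http.client


def head_via_lan_ip(lan_ip, port, hostname, timeout=5.0):
    """Send a HEAD request straight to `lan_ip`, presenting `hostname`.

    The TCP connection goes to the LAN address, so it never hairpins
    through the router's WAN side; the Host header keeps name-based
    virtual hosts on the server working. Plain HTTP only.
    """
    conn = http.client.HTTPConnection(lan_ip, port, timeout=timeout)
    try:
        # Supplying Host ourselves stops http.client adding its own
        conn.request("HEAD", "/", headers={"Host": hostname})
        response = conn.getresponse()
        return response.status
    finally:
        conn.close()
```

For example, `head_via_lan_ip("192.168.1.50", 8080, "myserver.example")` (hypothetical values). If this answers instantly while the same request via the WAN IP hangs, that points at the router's NAT loopback rather than the server.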

Hi Jacob, thanks for the answer, although I'm catching a whiff of ChatGPT about this one 😉
Yes, agreed, it seems to be TCP related rather than just HTTP. From what I remember, the EE hub doesn't have any advanced settings for tuning the firewall. Also, it is not only connections to internal services that hang, but connections going anywhere. For outbound connections to some remote server, I thought the hub firewall would not be involved.

We are using FTTC with VDSL. I have set up a separate modem and my own router, which has been working OK for the last few days, but I want it to be stable for at least a week before I am sure.

You mention UPnP and ipv6, how would they be involved?

Jacob_
Investigator

I tend to throw some of my answers through ChatGPT to do a bit of fact checking before posting. Plus, I have terrible grammar at times and suffer with a little dyslexia, so don't hold me to it. I have EE, although I use Unifi hardware, so I'm unsure what's actually on the EE hub as it never left the box.

UPnP lets devices/apps automatically create and delete port forwards on the router. I'm unsure what the EE router is running, but I suspect it's using a Linux kernel. If you have things running like VoIP, CCTV apps etc. (I know some Hikvision NVRs will use UPnP when attempting to reach the Hik-Connect cloud services), they can make the NAT/conntrack unstable. Essentially, some routers tie UPnP mappings into the same state tracking system as normal outbound connections, so if it fills up you'll see new outbound TCP sessions start to fail first. You said you had a manual forward and DMZ, so UPnP may be trying to remap overlapping ports/forwards. That's usually not the case, but I've seen weirder things before.

Since you said you're using VDSL (FTTC), the hub is doing EVERYTHING: PPPoE session, NAT, firewall, sometimes QoS. So if you have lots of apps or devices attempting to use UPnP and a device with faulty firmware, it can lead to the NAT/conntrack table filling quickly or having spasms, SYN flood false positives etc.

IPv6 usually isn't the issue, but FTTC is usually dual stack IPv4+IPv6 depending on the configuration of the ISP (I'm unsure how EE actually does this), so if the hub's handling of IPv6 is flaky then it could be an issue. Most "clients"/browsers etc. will try IPv6 and IPv4 in parallel, so if IPv6 is having issues you'll see dropouts and timeouts, and ISPs tend to implement IPv6 badly. You said your tcpdump sees nothing until the end when a SYN arrives, so it sounds more like a router issue than an IPv6 issue, but it's worth ruling out.
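
On the conntrack point: the locked-down EE hub is unlikely to let you look at this, but on a Linux-based router you control, the kernel exposes how full the connection tracking table is. The sketch below assumes the `nf_conntrack` module is loaded so that the two standard files exist under `/proc/sys/net/netfilter`; `conntrack_usage` is just a name I've picked for illustration.

```python
from pathlib import Path


def conntrack_usage(base="/proc/sys/net/netfilter"):
    """Return (tracked, limit, fraction_used) from the conntrack counters.

    Assumes a Linux box with the nf_conntrack module loaded; the files
    are absent otherwise and this raises FileNotFoundError.
    """
    base = Path(base)
    # Current number of tracked connections
    tracked = int((base / "nf_conntrack_count").read_text().strip())
    # Maximum the kernel will track before it starts dropping new ones
    limit = int((base / "nf_conntrack_max").read_text().strip())
    fraction = tracked / limit if limit else 0.0
    return tracked, limit, fraction
```

If the fraction creeps towards 1.0 over the same sort of period in which new connections start failing, that would fit the "table fills up" theory. If it stays low, the cause is probably something else.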

You can check to see if IPv6 is having any issues by forcing each address family in turn:

curl -vv -6 -I https://google.com --connect-timeout 20
curl -vv -4 -I https://google.com --connect-timeout 20

or whatever host you want to test.
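
If you'd rather script the same comparison, here is a rough Python equivalent that times only the TCP handshake per address family. It's a sketch, not a replacement for curl: `connect_time` is a made-up helper, and the 20 second default just mirrors the `--connect-timeout 20` above.

```python
import socket
import time


def connect_time(host, port, family, timeout=20.0):
    """Time a TCP handshake to `host` using only one address family.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    Returns seconds taken, or None if resolution or the connect fails.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return None  # no address of this family for the host
    for fam, socktype, proto, _canon, sockaddr in infos:
        start = time.monotonic()
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return time.monotonic() - start
        except OSError:
            continue  # try the next address, if any
    return None
```

Calling it once with `socket.AF_INET6` and once with `socket.AF_INET` for the same host and port 443 gives the two numbers to compare. A `None` for IPv6 only would suggest the IPv6 path is the flaky one.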

It does sound like the EE hub is to blame. They have them so locked down that you can't access most advanced features. I would maybe use an internal DNS server for internal hosts to stop the NAT hairpinning.