It's driving me nuts, bfun, please help!

Discussion in 'Technology' started by Grim, May 21, 2015.

  1. I need your network expertise because I have been looking at this for the past 3 hours and got nowhere.

    Migrating from ESX 4 to ESX 6, used Veeam to migrate a test server from one vCenter to another; before migrating it was all working OK. Using a Windows 2K8 server to relay email traffic to Office 365 via IIS, as some devices such as copiers don't have a method for connecting to cloud Exchange. Need to use a specific WAN gateway, as email routing isn't set up with the provider on the other one and it will bounce; call this WAN 2.

    ESX host public IP 192.168.0.222, mask 255.255.255.0, gateway 192.168.0.251
    VM IP 192.168.0.229, mask 255.255.255.0, gateway 192.168.0.252

    In this setup I cannot ping other subnets in the network (other offices), 192.168.1.0 and 192.168.2.0 etc but it worked fine on the old ESX host.

    If I add persistent routes in Windows with Route add -p 192.168.1.0 mask 255.255.255.0 192.168.0.251 then it can ping the 1.0 subnet, but when I remove the route it stops working. If I set the default gateway on the Windows server to 0.251 then it can ping the other subnets, BUT although 0.252 is set to route all internet traffic from 0.229 to WAN2, it seems to ignore this when the traffic comes via 0.251 first and sends it via WAN1. If it is set with 0.252 as the gateway then it routes via WAN2, but I lose the routing to other subnets without the persistent routes.
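    A minimal sketch of why the persistent route makes the difference, assuming the host simply picks the most specific matching route (longest prefix match). The function name and the sample destination addresses are made up for illustration, and the on-link 192.168.0.0/24 route is left out for brevity:

```python
import ipaddress

def pick_next_hop(routes, destination):
    """Return the gateway of the most specific route covering destination."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in routes if dest in net]
    if not matches:
        return None
    # The route with the longest prefix wins over the default route.
    return max(matches, key=lambda match: match[0].prefixlen)[1]

# Default gateway 0.252 plus the persistent route to the 1.0 subnet via 0.251.
routes_with_static = [
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.0.252"),
    (ipaddress.ip_network("192.168.1.0/24"), "192.168.0.251"),
]
# Same host after the persistent route is removed.
routes_default_only = routes_with_static[:1]

# 192.168.1.10 and 8.8.8.8 are illustrative destinations only.
print(pick_next_hop(routes_with_static, "192.168.1.10"))
print(pick_next_hop(routes_with_static, "8.8.8.8"))
print(pick_next_hop(routes_default_only, "192.168.1.10"))
```

    With the persistent route in place, traffic for the 1.0 subnet goes straight to the 2811 and never depends on the Draytek forwarding it; without it, everything off-subnet is handed to 0.252.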

    192.168.0.251 is a Cisco 2811
    192.168.0.252 is a Draytek Vigor 2850

    If I hit the 2811 first I can't route via WAN2 on the Draytek but if I hit the Draytek first it doesn't route internal WAN traffic to the Cisco without persistent routes in Windows!

    As I say it all worked perfectly on the old ESX host but it is set up the same and even tried adding the 2 routes into ESX CLI but still won't work. On ESX 4 the routing table for the public vmk0 is simply 0.0.0.0 192.168.0.251 and on it the VMs all are able to route to the other subnets even if they have 0.252 as a gateway. At the moment I have the extra routes to 1.0 and 2.0 in the ESX routing table and it still won't work.

    I even have an old server 2003 server with gateway 0.251 and have that set to go via WAN2 on the router and that works fine so I don't get why this 2K8 server is being so annoying! Before we completely blame ESX6 know that this non routing to WAN2 from 2K8 on 0.251 was happening on ESX4 too.

    I am guessing that when the gateway is set to 0.252 then the draytek isn't routing the stuff to the other subnets properly but this is only happening when it comes from a VM, when it comes from a physical system it works fine. I have persistent routes in the Draytek to all other subnets.

    It isn't the end of the world as I only need this one server to have 0.252 as a gateway address, all the others can have 0.251 and go out via WAN1 but I just don't feel it is a very elegant solution having to add the persistent route in Windows, I just want it to work automatically as it should!

    None of this probably makes sense.... Perhaps Server 2K8 R2 is just crap when it comes to IPv4.
     
  2. I've had an idea, I'm gonna put Iperf on it tomorrow and connect it to an Iperf server somewhere else on the network.

    I want to check what IP it reports as connecting with in case for some mental reason the ESX server IP comes up, assigned to any of the VNICs.

    I can ping the 2K8 server from any VLAN with the routes in so I doubt it but at this point I'm willing to try anything because something stupid is going on.
     
  3. How come you have two gateway routers on the same subnet? How are the routers routing? Static routes?

    Multiple routers on the same subnet and persistent routes in Windows can definitely put a little chaos in your life.
     
  4. We never purchased ADSL cards or VPN licences for the 2811s at each office so a Draytek is used to get to the web. Cisco are expensive.

    The default gateway should be the 2811 in each office, which has static routes set up to get to the other offices, and 0.0.0.0 forwards to the Draytek at the relevant office with failover to other offices if one should go down. The Draytek also has static routes to all of the other offices via the 2811 if for any reason it should be used as the default gateway.

    We are going to move to MPLS and centralised breakout but it's been delayed by the provider being purchased.

    I attach a crap plan. [IMG]
     
  5. I'm reading this and it all makes sense. The problem you state is this: when the default router is .251, the traffic destined for WAN2 does not get routed to the Draytek. The question is, how does the 2811 know that the traffic needs to go out WAN2 and thus to the Draytek? You say the Draytek ignores the rule to route all traffic to WAN2, but I'm guessing it never even gets the traffic because .251 doesn't know to send it there. Perhaps add a static route on the 2811 to send 192.168.0.229 to the .252. Did your IPs change on the servers? Maybe those routes are already there for the old IPs.

    Is it possible you could move that Draytek to the 2811? The 2811 should be the only gateway for that VLAN. Once traffic hits the 2811 it should decide to route it to another office or out to the internet through the Draytek.
     
  6. That is in theory how it is set up and it works for everything else. The only 2 policies the Draytek has are to pass all internet traffic from 0.245 and 0.229 to WAN2 (ADSL2+) and everything else via WAN1. It works for 0.245 with 0.251 as its gateway, but 0.229 goes to WAN1 (VDSL) when 0.251 is set as its gateway.
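    Roughly, the two Draytek policies behave like the sketch below. This is only a Python model of the rule as described, not Draytek configuration, and the names `WAN2_SOURCES` and `choose_wan` are made up for illustration:

```python
# Source addresses that the policy sends out via WAN2 (ADSL2+).
WAN2_SOURCES = {"192.168.0.245", "192.168.0.229"}

def choose_wan(source_ip):
    """Pick the outgoing WAN link purely from the packet's source address."""
    return "WAN2" if source_ip in WAN2_SOURCES else "WAN1"

print(choose_wan("192.168.0.229"))
print(choose_wan("192.168.0.245"))
# Any other source address, e.g. the ESX host's, falls through to WAN1.
print(choose_wan("192.168.0.222"))
```

    The rule only matches on source address, so if the Draytek ever sees the traffic with a different source it falls through to WAN1.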

    It's as if the Draytek isn't realising that it is 0.229 when it gets traffic from it via the 2811 first. If the 2811 is bypassed it does as it is told. This is why I want to use Iperf to see what the Iperf server reports as far as IP when it goes via the 2811. I know it sounds stupid but it's as if the Draytek isn't seeing the request from that IP and so is just sending it out via WAN1.

    I should have said earlier, WAN1 and WAN2 are the 2 internet connections that the Draytek has; the 2811 passes all internet traffic to the Draytek. The 2811 is only there to connect the offices up, anything else it passes on. Everything goes to the 2811 first, but when this server (0.229) does it the Draytek does the wrong thing with its traffic. Everything else does as it is told. If I bypass the 2811 from 0.229 the Draytek sends it out on the correct interface.

    I can't make WAN2 the default interface as it is only 20down/1up and there as a backup. I did try to get the provider for WAN1 to allow mail relaying from our domain but I ended up on the phone with an idiot who couldn't find the details of the domains we owned so I gave up. Might just be easiest to try again.
     
  7. I guess I meant move the Draytek off the switch and onto the 2811 on a different subnet. Here is my design issue, which may or may not have anything to do with the problem. Routers route between VLANs and switches switch inside VLANs. When you follow the traffic from .229, it goes to the router and gets routed back to the same VLAN on the same interface. You obviously have this working so that's good, but it just makes me wonder if it's introducing any problems.

    From what I'm reading, when the 2811 gets a packet and then has to send it back out the same interface, it sends an ICMP redirect to the original machine telling it to forward all future packets for that destination to the better next hop rather than the default gateway. ESX should update its routing table when it gets the redirect. Maybe see if there are any changes in how ESX 6 handles ICMP redirects. Also, if it's a feature that can be turned off, it might need to be turned on.

    http://www.cisco.com/c/en/us/support/docs/ip/routing-information-protocol-rip/13714-43.html
    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2081185
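    The redirect condition described above can be sketched like this. It is a simplified model, assuming the router sends a redirect only when the packet leaves on the interface it arrived on and the sender sits on the same subnet as the next hop; the interface name "Fa0/0" and the function name are hypothetical:

```python
import ipaddress

def should_send_redirect(in_iface, out_iface, source_ip, next_hop_ip,
                         iface_network, redirects_enabled=True):
    """Model of when a router would send an ICMP redirect to the sender."""
    net = ipaddress.ip_network(iface_network)
    return (
        redirects_enabled
        and in_iface == out_iface                        # hairpin on one interface
        and ipaddress.ip_address(source_ip) in net       # sender is on that subnet
        and ipaddress.ip_address(next_hop_ip) in net     # so is the better next hop
    )

# 0.229 sends internet traffic to the 2811, which hands it to the Draytek
# (0.252) back out of the same LAN interface.
print(should_send_redirect("Fa0/0", "Fa0/0", "192.168.0.229",
                           "192.168.0.252", "192.168.0.0/24"))
```

    If redirects are disabled on the router, or ignored by the receiving host, the traffic keeps taking the extra hop through the 2811.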
     
  8. Throw all that shit away and get yourself a Nighthawk. You'll get a promotion for sure.
     
  9. Draytek? More like Dumpstertek. Seconding the suggestion to just get a Nighthawk, bro.
     
  10. Sooo was it the ICMP redirects failing?
     
  11. Sorry, spent all day Friday reprogramming a Nortel BCM 400 and all the phones (system failed and replacement wouldn't restore from backup) so didn't get to try.

    Off this week so gonna look next week.
     
  12. Well it ended up being a really stupid issue and a really stupid mistake.

    Set the wrong VLAN ID for the public side of the new switch stack, didn't match the public VLAN on the core switch. Both on the same VLAN now and now routes perfectly, the routers it seems were not happy getting requests from a different VLAN ID with the same subnet.
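    For what it's worth, a small check along these lines would flag this kind of mismatch. It is only a sketch: the switch names and the VLAN IDs 10 and 20 are made up, not the real values, and the function name is hypothetical:

```python
def find_vlan_mismatches(switch_configs):
    """Return {network_name: {switch: vlan_id}} for names with differing IDs."""
    by_name = {}
    for switch, vlans in switch_configs.items():
        for name, vlan_id in vlans.items():
            by_name.setdefault(name, {})[switch] = vlan_id
    # Keep only the network names that map to more than one VLAN ID.
    return {name: ids for name, ids in by_name.items()
            if len(set(ids.values())) > 1}

# Hypothetical VLAN IDs: "public" differs between the core and the new stack.
configs = {
    "core": {"public": 10},
    "new-stack": {"public": 20},
}
print(find_vlan_mismatches(configs))
```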
     
  13. I thought it would be that.