Linux networking crash: best steps to find out the cause?

Posted by Aron Rotteveel on Server Fault See other posts from Server Fault or by Aron Rotteveel
Published on 2010-11-10T09:17:55Z Indexed on 2010/12/28 17:55 UTC
Read the original article Hit count: 461

Filed under:
|
|

One of our Linux (CentOS) servers was unreachable last night.

The server was not reachable in any way except for the remote console. After logging in with the remote console, it turned out I could not ping any outside hosts either.

A simple service network restart solved the issue, but I am still wondering what could have caused this. My log files seem to indicate no error at all (except for the various daemons that need a network connection and failed after the network failure).

Are there any additional steps I can take to find out the cause of this problem?

EDIT: this just happened again. The server was completely unresponsive until I issued a networking service restart. Any advise is welcome. Could this be caused by a faulty hardware component?

As per Madhatters request, here are some excerpts from the log at the time (the network crashed at 20:13):

/var/log/messages:

Dec  2 20:01:05 graviton kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=<stripped> SRC=<stripped> DST=<stripped> LEN=40 TOS=0x00 PREC=0x00 TTL=101 ID=256 PROTO=TCP SPT=6000 DPT=3306 WINDOW=16384 RES=0x00 SYN URGP=0
Dec  2 20:01:05 graviton kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=<stripped> SRC=<stripped> DST=<stripped> LEN=40 TOS=0x00 PREC=0x00 TTL=100 ID=256 PROTO=TCP SPT=6000 DPT=3306 WINDOW=16384 RES=0x00 SYN URGP=0
Dec  2 20:01:05 graviton kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=<stripped> SRC=<stripped> DST=<stripped> LEN=40 TOS=0x00 PREC=0x00 TTL=101 ID=256 PROTO=TCP SPT=6000 DPT=3306 WINDOW=16384 RES=0x00 SYN URGP=0
Dec  2 20:13:34 graviton junglediskserver: Connection to gateway failed: xGatewayTransport - Connection to gateway failed.

The first three messages are simple responses to iptables rules I have set up through the LFD firewall. The last message indicates that JungleDisk, which I use for backups can no longer connect to the gateway. Apart from this, there are no interesting messages around this time.

EDIT 4 dec: as per Mattdm's request, here is the output of ethtool eth0:

(Please not that these are the settings that currently work. If things go wrong again, I will be sure to post this again if necessary.

Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes

As per Joris' request, here is also the output of route -n:

aron@graviton [~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
xx.xx.xx.58    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.42    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.43    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.41    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.46    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.47    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.44    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.45    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.50    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.51    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.48    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.49    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.54    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.52    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.53    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
xx.xx.xx.0     0.0.0.0         255.255.255.192 U     0      0        0 eth0
xx.xx.xx.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     0      0        0 eth0
0.0.0.0         xx.xx.xx.62    0.0.0.0         UG    0      0        0 eth0

The bottom xx.62 is my gateway.

EDIT december 28th: the problem occurred again and I got the chance to compare some of the outputs of the above tests. What I found out is that arp -an returns an incomplete MAC address for my gateway (which is not under my control; the server is in a shared rack):

During failure:

? (xx.xx.xx.62) at <incomplete> on eth0

After service network restart:

? (xx.xx.xx.62) at 00:00:0C:9F:F0:30 [ether] on eth0

Is this something I can fix or is it time for me to contact the data centre?

© Server Fault or respective owner

Related posts about linux

Related posts about networking