neighboring subsystem by Rami Rosen

Linux Kernel Neighboring Subsystem Overview by Rami Rosen

 


 

  Last updated: 2.12.12

This wiki is based on my practical experience with Linux kernel networking and on a series of lectures I gave at the Technion.
See: Rami Rosen lectures

Please feel free to send any feedback or questions to Rami Rosen by email: ramirose@gmail.com

 

Introduction

● Why do we need the neighboring subsystem?

● “The world is a jungle in general, and the networking game contributes many animals.” (from RFC 826, ARP, 1982)

● The best-known protocol: ARP (in IPv6: ND, Neighbour Discovery)

● The Ethernet header is 14 bytes long:

– Source MAC address and destination MAC address are 6 bytes each.

– Type (2 bytes). For example (include/linux/if_ether.h):

  • 0x0800 is the type for an IP packet (ETH_P_IP)

  • 0x0806 is the type for an ARP packet (ETH_P_ARP)

  • 0x8035 is the type for a RARP packet (ETH_P_RARP)
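The header layout described above can be sketched in user space. This is a minimal Python illustration, not kernel code: the function name parse_eth_header and the sample frame are hypothetical, and only the EtherType constants and the 14-byte length come from include/linux/if_ether.h. On the wire the destination address comes first, then the source address, then the type.

```python
import struct

ETH_HLEN = 14          # total Ethernet header length in bytes
ETH_P_IP = 0x0800      # IP packet
ETH_P_ARP = 0x0806     # ARP packet
ETH_P_RARP = 0x8035    # RARP packet


def format_mac(raw: bytes) -> str:
    """Render 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_eth_header(frame: bytes):
    """Return (destination MAC, source MAC, EtherType) of a frame."""
    if len(frame) < ETH_HLEN:
        raise ValueError("frame is shorter than an Ethernet header")
    # "!" = network byte order; two 6-byte addresses, one 16-bit type
    dst, src, eth_type = struct.unpack("!6s6sH", frame[:ETH_HLEN])
    return format_mac(dst), format_mac(src), eth_type


# Hypothetical frame: broadcast destination, made-up source, ARP type.
sample = bytes.fromhex("ffffffffffff" "000102030405" "0806")
dst, src, eth_type = parse_eth_header(sample)
```

A frame whose type field is ETH_P_ARP is the kind that ends up in the ARP code discussed in the rest of this page.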

Neighboring Subsystem – struct neighbour

● neighbour (an instance of struct neighbour) is embedded in dst, which in turn is embedded in sk_buff.

● Implementation: important data structures

struct neighbour (include/net/neighbour.h)

– ha is the hardware address (MAC address when dealing with Ethernet) of the neighbour. This field is filled when an ARP response arrives.

  •  primary_key – The IP address (L3) of the neighbour.

● lookup in the arp table is done with the primary_key.

  •  nud_state represents the Neighbour Unreachability Detection state of the neighbour (for example, NUD_REACHABLE).

● int (*output)(struct sk_buff *skb);

– output() can be assigned to different methods according to the state of the neighbour, for example neigh_resolve_output() and neigh_connected_output().

– Initially, it is neigh_blackhole().

– When the state changes, the output function may also be assigned to a different function.
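The idea of an output pointer that follows the neighbour state can be modelled in a few lines. This is a simplified user-space sketch, not the kernel implementation: the Neighbour class and set_state() are hypothetical, the three output functions are stand-ins that only return a label, and the rule "connected states use the fast path, everything else resolves first" is an assumption made for illustration.

```python
# NUD bit values as defined in the kernel's neighbour header.
NUD_NONE = 0x00
NUD_REACHABLE = 0x02
NUD_STALE = 0x04
NUD_NOARP = 0x40
NUD_PERMANENT = 0x80
NUD_CONNECTED = NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE


def neigh_blackhole(pkt):
    return "dropped"            # stand-in: the packet goes nowhere


def neigh_resolve_output(pkt):
    return "resolve, then send"  # stand-in: L2 address must be resolved


def neigh_connected_output(pkt):
    return "send directly"       # stand-in: L2 address is known


class Neighbour:
    """Hypothetical model of the output pointer of struct neighbour."""

    def __init__(self):
        self.nud_state = NUD_NONE
        self.output = neigh_blackhole   # initial value, as noted above

    def set_state(self, new_state):
        self.nud_state = new_state
        if new_state & NUD_CONNECTED:
            self.output = neigh_connected_output
        else:
            self.output = neigh_resolve_output


n = Neighbour()
initial = n.output("pkt")
n.set_state(NUD_REACHABLE)
when_reachable = n.output("pkt")
n.set_state(NUD_STALE)
when_stale = n.output("pkt")
```

Callers always invoke n.output(); only the state change decides which function that is.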

● refcnt – incremented by neigh_hold(); decremented by neigh_release().

– We don't free a neighbour when the refcnt is higher than 1; instead, we set dead (a member of neighbour) to 1.

● timer (The callback method is neigh_timer_handler()).

● struct hh_cache *hh (defined in include/linux/netdevice.h)

● confirmed – confirmation timestamp.

– Confirmation can also be done from L4 (the transport layer).

– For example, dst_confirm() calls neigh_confirm().

– dst_confirm() is called from tcp_ack() (net/ipv4/tcp_input.c), by udp_sendmsg() (net/ipv4/udp.c), and more.

– neigh_confirm() does NOT change the state; that is the job of neigh_timer_handler().

● dev (net_device)

● arp_queue – every neighbour has a small ARP queue of its own.

– By default there can be only 3 elements in an arp_queue.

– This is configurable: /proc/sys/net/ipv4/neigh/default/unres_qlen
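A bounded per-neighbour queue like this can be illustrated with a short sketch. The PendingQueue class and its discards counter are hypothetical names, and the policy shown (when the queue is full, the oldest pending packet is dropped to make room) is an assumption for illustration; check the kernel source for the exact behaviour of your version.

```python
from collections import deque

UNRES_QLEN = 3  # default queue length mentioned above


class PendingQueue:
    """Hypothetical model of a neighbour's queue of unresolved packets."""

    def __init__(self, limit=UNRES_QLEN):
        # deque(maxlen=...) silently discards the oldest item when full
        self.packets = deque(maxlen=limit)
        self.discards = 0

    def enqueue(self, pkt):
        if len(self.packets) == self.packets.maxlen:
            self.discards += 1   # the oldest packet is about to be dropped
        self.packets.append(pkt)


q = PendingQueue()
for i in range(5):
    q.enqueue(f"pkt{i}")
```

With the default limit of 3, sending five packets to an unresolved neighbour leaves only the last three queued.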

struct neigh_table

● struct neigh_table represents a neighboring table (include/net/neighbour.h).

– The arp table (arp_tbl) is a neigh_table (include/net/arp.h).

– In IPv6, nd_tbl (the Neighbor Discovery table) is also a neigh_table (include/net/ndisc.h).

– There are also dn_neigh_table (DECnet) (net/decnet/dn_neigh.c) and clip_tbl (for ATM) (net/atm/clip.c).

– gc_timer: neigh_periodic_timer() is the callback for garbage collection.

– neigh_periodic_timer() deletes FAILED entries from the ARP table.

Neighboring Subsystem – arp

● When there is no entry in the ARP cache for the destination IP address of a packet, a broadcast is sent (an ARP request, ARPOP_REQUEST: who has IP address x.y.z...). This is done by a method called arp_solicit() (net/ipv4/arp.c).

– In IPv6, the parallel mechanism is called ND (Neighbor Discovery) and is implemented as part of ICMPv6.

– In IPv6 a multicast is sent (and not a broadcast).

● If there is no answer in time to this ARP request, we end up sending back an ICMP error (Destination Host Unreachable).

● This is done by arp_error_report(), which indirectly calls ipv4_link_failure(); see net/ipv4/route.c.

● You can see the contents of the ARP table by running "cat /proc/net/arp" or by running "arp" from the command line.

  • You can view statistics of arp cache (IPV4) by: cat /proc/net/stat/arp_cache
  • You can view statistics of ndisc cache (IPV6) by: cat /proc/net/stat/ndisc_cache

"ip neigh show" is the new method to show arp (from IPROUTE2)

  • In IPv6 it is "ip -6 neigh show".

● You can delete and add entries to the arp table; see man arp/man ip.

● When using “ip neigh add” you can specify the state of the entry which you are adding (like permanent, stale, reachable, etc).

● The arp command does not show reachability states except the incomplete state and the permanent state. Permanent entries are marked with M in Flags.

Example: arp output

Address      HWtype  HWaddress          Flags Mask  Iface
10.0.0.2             (incomplete)                   eth0
10.0.0.3     ether   00:01:02:03:04:05  CM          eth0
10.0.0.138   ether   00:20:8F:0C:68:03  C           eth0

Neighboring Subsystem – ip neigh show.

● We can see the current neighbour states:

● Example :

ip neigh show

192.168.0.254 dev eth0 lladdr 00:03:27:f1:a1:31 REACHABLE
192.168.0.152 dev eth0 lladdr 00:00:00:cc:bb:aa STALE
192.168.0.121 dev eth0 lladdr 00:10:18:1b:1c:14 PERMANENT
192.168.0.54 dev eth0 lladdr aa:ab:ac:ad:ae:af STALE
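Output in this format is easy to post-process, for example to list all STALE neighbours. The following is a minimal sketch that parses the sample output shown above; parse_ip_neigh is a hypothetical helper, and it assumes the simple "address dev X lladdr Y STATE" layout of this example (entries without an lladdr, such as FAILED ones, get None).

```python
SAMPLE = """\
192.168.0.254 dev eth0 lladdr 00:03:27:f1:a1:31 REACHABLE
192.168.0.152 dev eth0 lladdr 00:00:00:cc:bb:aa STALE
192.168.0.121 dev eth0 lladdr 00:10:18:1b:1c:14 PERMANENT
192.168.0.54 dev eth0 lladdr aa:ab:ac:ad:ae:af STALE
"""


def parse_ip_neigh(text):
    """Turn 'ip neigh show' style lines into a list of dictionaries."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The address is the first field and the state is the last one.
        entry = {"ip": fields[0], "dev": None, "lladdr": None,
                 "state": fields[-1]}
        for key in ("dev", "lladdr"):
            if key in fields[1:-1]:
                entry[key] = fields[fields.index(key) + 1]
        entries.append(entry)
    return entries


entries = parse_ip_neigh(SAMPLE)
stale = [e["ip"] for e in entries if e["state"] == "STALE"]
```

The same approach works on live output if you feed it the text returned by running ip neigh show.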

● arp_process() handles both ARP requests and ARP responses (net/ipv4/arp.c).

– If the target IP address (tip) in the ARP header is a loopback (or multicast) address, arp_process() drops the packet, since loopback does not need ARP:

...
if (LOOPBACK(tip) || MULTICAST(tip))
        goto out;

out:
        ...
        kfree_skb(skb);
        return 0;

(see: #define LOOPBACK(x) (((x) & htonl(0xff000000)) == htonl(0x7f000000)) in linux/in.h)
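The mask test done by the LOOPBACK() macro can be tried out in user space. This is a small sketch, not kernel code: the helper names are hypothetical, addresses are converted to host-order integers (so no htonl() is needed), and the multicast mask 0xf0000000/0xe0000000 is the standard 224.0.0.0/4 range, which is what the kernel's MULTICAST() macro is assumed to check.

```python
import socket
import struct


def ip_to_int(addr: str) -> int:
    """Convert dotted-quad notation to a host-order 32-bit integer."""
    return struct.unpack("!I", socket.inet_aton(addr))[0]


def is_loopback(addr: str) -> bool:
    # Same test as LOOPBACK(x): the top byte must be 127 (127.0.0.0/8).
    return (ip_to_int(addr) & 0xFF000000) == 0x7F000000


def is_multicast(addr: str) -> bool:
    # Top four bits 1110 => 224.0.0.0/4.
    return (ip_to_int(addr) & 0xF0000000) == 0xE0000000


def arp_target_is_dropped(tip: str) -> bool:
    """Mirror of the 'goto out' condition shown above."""
    return is_loopback(tip) or is_multicast(tip)
```

For example, arp_target_is_dropped("127.0.0.1") is true, while an ordinary LAN address such as 192.168.0.10 passes the check.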

● If it is an ARP request (ARPOP_REQUEST), we call ip_route_input().

● Why?

● In case it is for us (RTN_LOCAL), we send an ARP reply.

– arp_send(ARPOP_REPLY,ETH_P_ARP,sip,dev,tip,sha,dev->dev_addr,sha);

– We also update our arp table with the sender entry (ip/mac).

● Special case: ARP proxy server.

● In case we receive an ARP reply (ARPOP_REPLY):

– We perform a lookup in the arp table (by calling __neigh_lookup()).

– If we find an entry, we update the arp table by neigh_update().

● If there is no entry and there is NO support for unsolicited ARP, we don't create an entry in the arp table.

– Support for unsolicited ARP is enabled by setting /proc/sys/net/ipv4/conf/all/arp_accept to 1.

– The corresponding macro is IPV4_DEVCONF_ALL(ARP_ACCEPT).

– In older kernels, support for unsolicited ARP was enabled by CONFIG_IP_ACCEPT_UNSOLICITED_ARP.

Neighboring Subsystem – lookup

● Lookup in the neighboring subsystem is done via neigh_lookup(). Its parameters:

– neigh_table (arp_tbl)

– pkey (IP address, the primary_key of the neighbour struct)

– dev (net_device)

● There are 2 wrappers:

– __neigh_lookup(): just one more parameter, creat (a flag: whether to create a neighbour with neigh_create() or not)

– __neigh_lookup_errno()
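The relationship between the plain lookup and the "lookup or create" wrapper can be sketched as follows. This is a hypothetical user-space model: NeighTable, its dictionary keyed by (pkey, dev), and the dictionary used for each neighbour are illustration only and do not reflect the kernel's hash table or locking.

```python
class NeighTable:
    """Hypothetical model of a neigh_table keyed by (pkey, dev)."""

    def __init__(self, name):
        self.name = name
        self.entries = {}

    def lookup(self, pkey, dev):
        """Model of neigh_lookup(): return the entry or None."""
        return self.entries.get((pkey, dev))

    def create(self, pkey, dev):
        """Model of neigh_create(): new entries start in NUD_NONE."""
        neigh = {"primary_key": pkey, "dev": dev,
                 "nud_state": "NUD_NONE", "ha": None}
        self.entries[(pkey, dev)] = neigh
        return neigh

    def lookup_or_create(self, pkey, dev, creat):
        """Model of __neigh_lookup(): create only if 'creat' is set."""
        neigh = self.lookup(pkey, dev)
        if neigh is not None or not creat:
            return neigh
        return self.create(pkey, dev)


arp_tbl = NeighTable("arp_tbl")
miss = arp_tbl.lookup_or_create("10.0.0.3", "eth0", creat=False)
created = arp_tbl.lookup_or_create("10.0.0.3", "eth0", creat=True)
hit = arp_tbl.lookup("10.0.0.3", "eth0")
```

The first call returns nothing because creat is not set; the second creates the entry, and the third finds it.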

Neighboring Subsystem – static entries

● Adding a static entry is done by:

arp -s ipAddress MacAddress

● Alternatively, this can be done by:

ip neigh add ipAddress dev eth0 lladdr MacAddress nud permanent

● The state (nud_state) of this entry will be NUD_PERMANENT

  • ip neigh show will show it as PERMANENT.

● Why do we need PERMANENT entries ?

arp_bind_neighbour() method

● Suppose we are sending a packet to a host for the first time.

● A dst_entry is added to the routing cache by rt_intern_hash().

● We should know the L2 address of that host, so rt_intern_hash() calls arp_bind_neighbour().

– This is done only for RTN_UNICAST (not for multicast/broadcast).

– arp_bind_neighbour(): net/ipv4/arp.c

– dst->neighbour is NULL, so it calls __neigh_lookup_errno().

– There is no such entry in the arp table.

– So we create a neighbour with neigh_create() and add it to the arp table.

● neigh_create() creates a neighbour in the NUD_NONE state.

– Setting nud_state to NUD_NONE is done in neigh_alloc().

The IFF_NOARP flag

● Disabling and enabling arp

ifconfig eth1 -arp

– You will now see the NOARP flag in ifconfig -a

ifconfig eth1 arp (to enable arp of the device)

● In fact, this sets the IFF_NOARP flag of net_device.

● There are cases where the interface has the IFF_NOARP flag by default (for example, a ppp interface; see ppp_setup() in drivers/net/ppp_generic.c).

Changing IP address

● Suppose we try to set eth1 to an IP address of a different machine on the LAN:

● First, we set an IP address for eth1 in (in Fedora Core 8, for example)

● /etc/sysconfig/network-scripts/ifcfg-eth1

● ... IPADDR=192.168.0.122 ...

and then run:

ifup eth1

● we will get:

  • Error, some other host already uses address 192.168.0.122.

● But:

ifconfig eth0 192.168.0.122

● works OK!

● Why is this so?

Duplicate Address Detection (DAD)

● Duplicate Address Detection mode (DAD)

● arping -I eth0 -D 192.168.0.10

– sends a broadcast packet whose source address is 0.0.0.0.

– 0.0.0.0 is not a valid IP address (for example, you cannot set an IP address to 0.0.0.0 with ifconfig).

● The MAC address of the sender is the real one.

● The -D flag is for Duplicate Address Detection mode.

Code (from arp_process(); see net/ipv4/arp.c):

/* Special case: IPv4 duplicate address detection packet (RFC2131) */
if (sip == 0) {
        if (arp->ar_op == htons(ARPOP_REQUEST) &&
            inet_addr_type(tip) == RTN_LOCAL &&
            !arp_ignore(in_dev,dev,sip,tip))
                arp_send(ARPOP_REPLY,ETH_P_ARP,tip,dev,tip,sha,dev->dev_addr,dev->dev_addr);
        goto out;
}
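To make the "sip == 0" special case concrete, here is a sketch of what a DAD probe looks like as an ARP payload for Ethernet/IPv4 (28 bytes). The function names build_dad_probe and is_dad_probe are hypothetical, the sender MAC is made up, and the field layout follows the standard ARP packet format; this only builds and inspects bytes, it does not send anything.

```python
import socket
import struct

ARPHRD_ETHER = 1      # hardware type: Ethernet
ETH_P_IP = 0x0800     # protocol type: IPv4
ARPOP_REQUEST = 1     # ARP opcode: request
ARP_FMT = "!HHBBH6s4s6s4s"   # htype, ptype, hlen, plen, op, sha, sip, tha, tip


def build_dad_probe(sender_mac: bytes, target_ip: str) -> bytes:
    """ARP request with the real sender MAC but sender IP 0.0.0.0."""
    return struct.pack(
        ARP_FMT, ARPHRD_ETHER, ETH_P_IP, 6, 4, ARPOP_REQUEST,
        sender_mac, socket.inet_aton("0.0.0.0"),
        b"\x00" * 6, socket.inet_aton(target_ip))


def is_dad_probe(arp_payload: bytes) -> bool:
    """Same condition as the kernel snippet: a request whose sip is 0."""
    fields = struct.unpack(ARP_FMT, arp_payload)
    op, sip = fields[4], fields[6]
    return op == ARPOP_REQUEST and sip == b"\x00" * 4


probe = build_dad_probe(bytes.fromhex("000102030405"), "192.168.0.10")
```

A host that owns the target address and receives such a payload is the one that answers with the ARPOP_REPLY shown in the kernel code above.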

Neighboring Subsystem – Garbage Collection

● Garbage Collection

– neigh_periodic_timer()

– neigh_timer_handler()

– neigh_periodic_timer() removes entries which are in the NUD_FAILED state. This is done by setting dead to 1 and calling neigh_release(). The refcnt must be 1 to ensure no one else uses this neighbour. Expired entries are also removed.

– NUD_FAILED entries don't have a MAC address (see ip neigh show).

Neighboring Subsystem – Asynchronous Garbage Collection

● neigh_forced_gc() performs asynchronous Garbage Collection.

● It is called from neigh_alloc() when the number of entries in the arp table exceeds a limit.

● This limit is configurable (gc_thresh2, gc_thresh3):

/proc/sys/net/ipv4/neigh/default/gc_thresh2

/proc/sys/net/ipv4/neigh/default/gc_thresh3

– The default for gc_thresh3 is 1024.

● Candidates for cleanup: entries whose reference count is 1 and whose state is NOT permanent.
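The candidate selection can be expressed as a simple filter. This sketch assumes that both conditions must hold at once (reference count of exactly 1 and a non-permanent state), which is how neigh_forced_gc() is understood to test entries; the entry dictionaries and the function name forced_gc_candidates are hypothetical.

```python
NUD_REACHABLE = 0x02
NUD_STALE = 0x04
NUD_PERMANENT = 0x80


def forced_gc_candidates(entries):
    """Return entries nobody else holds and that are not permanent."""
    return [e for e in entries
            if e["refcnt"] == 1 and not (e["nud_state"] & NUD_PERMANENT)]


# Hypothetical table contents for illustration.
table = [
    {"ip": "192.168.0.152", "refcnt": 1, "nud_state": NUD_STALE},
    {"ip": "192.168.0.121", "refcnt": 1, "nud_state": NUD_PERMANENT},
    {"ip": "192.168.0.254", "refcnt": 17, "nud_state": NUD_REACHABLE},
]
victims = [e["ip"] for e in forced_gc_candidates(table)]
```

Only the STALE entry qualifies: the permanent one is protected by its state and the REACHABLE one is still referenced by other users.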

● Changing the neighbour state is done only in neigh_timer_handler().

LVS (Linux Virtual Server)

http://www.linuxvirtualserver.org/

● Integrated into the Linux kernel (in 2.4 kernel it was a patch).

● Located in: net/netfilter/ipvs in the kernel tree.

● LVS has eight scheduling algorithms.

● LVS/DR is LVS with direct routing (a load balancing solution).

● ipvsadm is the user space management tool (available in most distros).

● Direct Routing is the packet forwarding method.

● -g, --gatewaying => Use gatewaying (direct routing)

● See man ipvsadm.

LVS/DR

● Example: 3 Real Servers and the Director all have the same VirtualIP (VIP).

● There is an ARP problem in this configuration.

● When you send an ARP broadcast and the receiving machine has two or more NICs, each of them responds to this ARP request.

● Example: a machine with two NICs; eth0 is 192.168.0.151 and eth1 is 192.168.0.152.

LVS and ARP

● Solutions

1) Set ARP_IGNORE to 1:

  • echo "1" > /proc/sys/net/ipv4/conf/eth0/arp_ignore
  • echo "1" > /proc/sys/net/ipv4/conf/eth1/arp_ignore

2) Use arptables.

– There are 3 points in the arp walkthrough (include/linux/netfilter_arp.h):

– NF_ARP_IN (in arp_rcv(), net/ipv4/arp.c)

– NF_ARP_OUT (in arp_xmit(), net/ipv4/arp.c)

– NF_ARP_FORWARD (in br_nf_forward_arp(), net/bridge/br_netfilter.c)

http://ebtables.sourceforge.net/download.html

Ebtables is in fact the parallel of netfilter, but at L2.

LVS example (ipvsadm)

● An example for setting LVS/DR on TCP port 80 with three real servers:

● ipvsadm -C // clear the LVS table

● ipvsadm -A -t DirectorIPAddress:80

● ipvsadm -a -t DirectorIPAddress:80 -r RealServer1 -g

● ipvsadm -a -t DirectorIPAddress:80 -r RealServer2 -g

● ipvsadm -a -t DirectorIPAddress:80 -r RealServer3 -g

● This example deals with TCP connections (for UDP we should use -u instead of -t in the -A and -a lines).

LVS example:

● ipvsadm -Ln // list the LVS table

/proc/sys/net/ipv4/ip_forward should be set to 1

● In this example, packets sent to the VIP are sent to the load balancer, which delegates them to a real server according to its scheduler. The destination MAC address in the L2 header will be the MAC address of the real server to which the packet is sent. The destination IP address in the IP header will be the VIP.

● This is done with NF_IP_LOCAL_IN.

ARPD arp user space daemon

● ARPD is a user space daemon; it can be used if we want to remove some work from the kernel.

● The user space daemon is part of iproute2 (/misc/arpd.c)

● ARPD has support for negative entries and for dead hosts.

– The kernel arp code does NOT support these types of entries!

● The kernel by default is not compiled with ARPD support; we should set CONFIG_ARPD for using it:

● Networking Support-> Networking Options-> IP: ARP daemon support.

● see: /usr/share/doc/iproute2.6.22/arpd.ps (Alexey Kuznetsov). 

● We should also set app_probes to a value greater than 0 by setting /proc/sys/net/ipv4/neigh/eth0/app_solicit.

– This can also be done with the -a (active_probes) parameter.

– The value of this parameter tells how many ARP requests to send before the neighbour is considered dead.

● The -k parameter tells the kernel not to send ARP broadcasts; in such a case, the arpd daemon not only listens to ARP requests but also sends ARP broadcasts.

● Activation:

● arpd -a 1 -k eth0 &

● On some distros, you will get the error "db_open: No such file or directory" unless you first run mkdir /var/lib/arpd/ (for the arpd.db file).

● Pay attention: you can start the arpd daemon even when there is no support in the kernel (CONFIG_ARPD is not set).

● In this case, arp packets are still caught by the arpd daemon in get_arp_pkt() (misc/arpd.c).

● But you don't get messages from the kernel.

● get_kern_msg() is not called (misc/arpd.c).

● Tip: to check if CONFIG_ARPD is set, simply see if there are any results from

– cat /proc/kallsyms | grep neigh_app

Mac addresses

● MAC address (Media Access Control)

● According to specs, MAC address should be unique.

● The first 3 bytes specify the hardware manufacturer of the card.

● They are allocated by the IEEE.

 There are exceptions to this rule. 

 – Ethernet HWaddr 00:16:3E:3F:6E:5D 
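Extracting the manufacturer prefix (the first 3 bytes) from an address is straightforward. The helper name oui below is hypothetical; it also pads single-digit octets, since some tools print addresses without leading zeros. Mapping the prefix to a vendor name would need a vendor database, which is not included here.

```python
def oui(mac: str) -> str:
    """Return the first three octets of a MAC address, normalised."""
    octets = mac.replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError(f"not a 6-octet MAC address: {mac!r}")
    # zfill(2) turns '0' into '00' for addresses printed without padding
    return ":".join(o.upper().zfill(2) for o in octets[:3])


prefix = oui("00:16:3E:3F:6E:5D")
unpadded = oui("0:20:18:61:e5:e0")
```

Two cards with the same prefix normally come from the same manufacturer, but the prefix alone says nothing about uniqueness of the full address.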

ARPwatch (detect ARP cache poisoning)

● A changed MAC address can be the result of a security attack (ARP cache poisoning).

● Arpwatch can help detect such an attack.

● Activation: arpwatch -d -i eth0 (output to stderr)

● Arpwatch keeps a table of ip/mac addresses and senses when there is a change.

● -d is for redirecting the log to stderr (no syslog, no mail).

● In case someone changed a MAC address on the same network, you will get a message like this:

ARPwatch Example

From: root (Arpwatch)
To: root
Subject: changed ethernet address (jupiter)
hostname: jupiter
ip address: 192.168.0.54
ethernet address: aa:bb:cc:dd:ee:ff
ethernet vendor: <unknown>
old ethernet address: 0:20:18:61:e5:e0
old ethernet vendor: ...

Neighbour states

● Neighbour states

[Figure: neighbour state diagram. Labels: neigh_alloc(), None, Incomplete, Reachable, Stale, Delay, Probe]

– NUD_NONE

– NUD_REACHABLE

– NUD_STALE

– NUD_DELAY

– NUD_PROBE

– NUD_FAILED

– NUD_INCOMPLETE

● Special states:

● NUD_NOARP

● NUD_PERMANENT

● No state transitions are allowed from these states to another state.

Neighboring Subsystem – states

● NUD state combinations:

● NUD_IN_TIMER (NUD_INCOMPLETE|NUD_REACHABLE| NUD_DELAY|NUD_PROBE)

● NUD_VALID (NUD_PERMANENT|NUD_NOARP| NUD_REACHABLE|NUD_PROBE|NUD_STALE|NUD_DELAY)

● NUD_CONNECTED (NUD_PERMANENT|NUD_NOARP| NUD_REACHABLE)
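Because each state is a single bit, membership in one of these combinations is a plain bitwise AND. The sketch below uses the bit values from the kernel's neighbour header as I know them (verify them against your own tree); the helper name in_group is hypothetical.

```python
NUD_NONE = 0x00
NUD_INCOMPLETE = 0x01
NUD_REACHABLE = 0x02
NUD_STALE = 0x04
NUD_DELAY = 0x08
NUD_PROBE = 0x10
NUD_FAILED = 0x20
NUD_NOARP = 0x40
NUD_PERMANENT = 0x80

NUD_IN_TIMER = NUD_INCOMPLETE | NUD_REACHABLE | NUD_DELAY | NUD_PROBE
NUD_VALID = (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE |
             NUD_PROBE | NUD_STALE | NUD_DELAY)
NUD_CONNECTED = NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE


def in_group(state: int, group: int) -> bool:
    """True if the single-bit state belongs to the combined mask."""
    return bool(state & group)


stale_is_valid = in_group(NUD_STALE, NUD_VALID)
stale_is_connected = in_group(NUD_STALE, NUD_CONNECTED)
failed_is_valid = in_group(NUD_FAILED, NUD_VALID)
```

A STALE neighbour is therefore still valid (its MAC address may be used) but no longer connected, while a FAILED neighbour is neither.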

● When a neighbour is in the STALE state, it remains in this state until one of two things occurs:

– A packet is sent to this neighbour.

– Its state changes to FAILED.

neigh_resolve_output() and neigh_connected_output().

net/core/neighbour.c

● A neighbour in the INCOMPLETE state does not have a MAC address set yet (the ha member of neighbour).

● So when neigh_resolve_output() is called, the neighbour state is changed to INCOMPLETE.

● When neigh_connected_output() is called, the MAC address of the neighbour is known, so we end up calling dev_queue_xmit(), which calls the ndo_start_xmit() callback method of the NIC device driver.

● The ndo_start_xmit() method actually puts the frame on the wire.

Change of IP address/Mac address

● Change of IP address does not trigger notifying its neighbours.

● Change of MAC address (NETDEV_CHANGEADDR) also does not trigger notifying its neighbours.

● It does update the local arp table by neigh_changeaddr().

– An exception to this is irlan eth: irlan_eth_send_gratuitous_arp() (net/irda/irlan/irlan_eth.c).

– Some NICs don't permit changing the MAC address; you get: SIOCSIFHWADDR: Device or resource busy.

Flushing the arp table

● Flushing the arp table:

● ip -statistics neigh flush dev eth0

      *** Round 1, deleting 7 entries ***

      *** Flush is complete after 1 round ***

● Specifying -statistics twice will also show which entries were deleted, their MAC addresses, etc.:

ip -statistics -statistics neigh flush dev eth0

192.168.0.254 lladdr 00:04:27:fd:ad:30 ref 17 used 0/0/0 REACHABLE

*** Round 1, deleting 1 entries ***

*** Flush is complete after 1 round ***

● The flush calls neigh_delete() in net/core/neighbour.c

● It changes the state to NUD_FAILED
