Commit 35a9ad8a authored by Linus Torvalds

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
 "Most notable changes in here:

   1) By far the biggest accomplishment, thanks to a large range of
      contributors, is the addition of multi-send for transmit.  This is
      the result of discussions back in Chicago, and the hard work of
      several individuals.

      Now, when the ->ndo_start_xmit() method of a driver sees
      skb->xmit_more as true, it can choose to defer the doorbell
      telling the driver to start processing the new TX queue entries.

      skb->xmit_more means that the generic networking is guaranteed to
      call the driver immediately with another SKB to send.

      There is logic added to the qdisc layer to dequeue multiple
      packets at a time, and the handling of mis-predicted offloads in
      software is now done with no locks held.

      Finally, pktgen is extended to have a "burst" parameter that can
      be used to test a multi-send implementation.

      Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
      virtio_net

      Adding support is almost trivial, so expect more drivers to
      support this optimization soon.

      I want to thank, in no particular or implied order, Jesper
      Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
      Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
      David Tat, Hannes Frederic Sowa, and Rusty Russell.

   2) PTP and timestamping support in bnx2x, from Michal Kalderon.

   3) Allow adjusting the rx_copybreak threshold for a driver via
      ethtool, and add rx_copybreak support to enic driver.  From
      Govindarajulu Varadarajan.

   4) Significant enhancements to the generic PHY layer and the bcm7xxx
      driver in particular (EEE support, auto power down, etc.) from
      Florian Fainelli.

   5) Allow raw buffers to be used for flow dissection, allowing drivers
      to determine the optimal "linear pull" size for devices that DMA
      into pools of pages.  The objective is to get exactly the
      necessary amount of headers into the linear SKB area pre-pulled,
      but no more.  The new interface drivers use is eth_get_headlen().
      From WANG Cong, with driver conversions (several had their own
      by-hand duplicated implementations) by Alexander Duyck and Eric
      Dumazet.

   6) Support checksumming more smoothly and efficiently for
      encapsulations, and add "foo over UDP" facility.  From Tom
      Herbert.

   7) Add Broadcom SF2 switch driver to DSA layer, from Florian
      Fainelli.

   8) eBPF now can load programs via a system call and has an extensive
      testsuite.  Alexei Starovoitov and Daniel Borkmann.

   9) Major overhaul of the packet scheduler to use RCU in several major
      areas such as the classifiers and rate estimators.  From John
      Fastabend.

  10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
      Duyck.

  11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
      Dumazet.

  12) Add Datacenter TCP congestion control algorithm support, from
      Florian Westphal.

  13) Reorganize sk_buff so that __copy_skb_header() is significantly
      faster.  From Eric Dumazet"
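The xmit_more behaviour described in item 1 can be illustrated with a small
userspace model. This is only a sketch, not kernel code: toy_ring, toy_xmit()
and toy_send_burst() are hypothetical names, and a real driver would make this
decision inside its ->ndo_start_xmit() and write a device register where the
model merely increments a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical TX ring model; not a kernel structure. */
struct toy_ring {
    unsigned int queued;    /* descriptors posted but not yet announced */
    unsigned int doorbells; /* number of (expensive) doorbell writes */
    unsigned int sent;      /* descriptors the device has been told about */
};

/* Model of a driver transmit routine: post one descriptor, and ring the
 * doorbell only when the stack says no further packet follows at once. */
static void toy_xmit(struct toy_ring *ring, bool xmit_more)
{
    ring->queued++;
    if (!xmit_more) {
        ring->sent += ring->queued;
        ring->queued = 0;
        ring->doorbells++;
    }
}

/* Send a burst of n packets; every packet but the last has xmit_more set. */
static void toy_send_burst(struct toy_ring *ring, size_t n)
{
    for (size_t i = 0; i < n; i++)
        toy_xmit(ring, i + 1 < n);
}
```

In this model a burst of eight packets costs one doorbell write instead of
eight, which is the saving the multi-send work is after.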

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
  netlabel: directly return netlbl_unlabel_genl_init()
  net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
  net: description of dma_cookie cause make xmldocs warning
  cxgb4: clean up a type issue
  cxgb4: potential shift wrapping bug
  i40e: skb->xmit_more support
  net: fs_enet: Add NAPI TX
  net: fs_enet: Remove non NAPI RX
  r8169:add support for RTL8168EP
  net_sched: copy exts->type in tcf_exts_change()
  wimax: convert printk to pr_foo()
  af_unix: remove 0 assignment on static
  ipv6: Do not warn for informational ICMP messages, regardless of type.
  Update Intel Ethernet Driver maintainers list
  bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
  tipc: fix bug in multicast congestion handling
  net: better IFF_XMIT_DST_RELEASE support
  net/mlx4_en: remove NETDEV_TX_BUSY
  3c59x: fix bad split of cpu_to_le32(pci_map_single())
  net: bcmgenet: fix Tx ring priority programming
  ...
parents d5935b07 64b1f00a
Showing with 1092 additions and 93 deletions
Driver for ARM AXI Bus with Broadcom Plugins (bcma)
Required properties:
- compatible : brcm,bus-axi
- reg : iomem address range of chipcommon core
The cores on the AXI bus are automatically detected by bcma with the
memory ranges they are using and they get registered afterwards.
The top-level axi bus may contain children representing attached cores
(devices). This is needed since some hardware details can't be auto
detected (e.g. IRQ numbers). Also some of the cores may be responsible
for extra things, e.g. ChipCommon providing access to the GPIO chip.
Example:
axi@18000000 {
compatible = "brcm,bus-axi";
reg = <0x18000000 0x1000>;
ranges = <0x00000000 0x18000000 0x00100000>;
#address-cells = <1>;
#size-cells = <1>;
chipcommon {
reg = <0x00000000 0x1000>;
gpio-controller;
#gpio-cells = <2>;
};
};
* Broadcom UniMAC MDIO bus controller
Required properties:
- compatible: should be one of "brcm,genet-mdio-v1", "brcm,genet-mdio-v2",
"brcm,genet-mdio-v3", "brcm,genet-mdio-v4" or "brcm,unimac-mdio"
- reg: address and length of the register set for the device, first one is the
base register, and the second one is optional and for indirect accesses to
larger than 16-bits MDIO transactions
- reg-names: name(s) of the register must be "mdio" and optional "mdio_indir_rw"
- #size-cells: must be 1
- #address-cells: must be 0
Optional properties:
- interrupts: must be one if the interrupt is shared with the Ethernet MAC or
Ethernet switch this MDIO block is integrated from, or must be two, if there
are two separate interrupts, first one must be "mdio done" and second must be
for "mdio error"
- interrupt-names: must be "mdio_done_error" when there is a shared interrupt fed
to this hardware block, or must be "mdio_done" for the first interrupt and
"mdio_error" for the second when there are separate interrupts
Child nodes of this MDIO bus controller node are standard Ethernet PHY device
nodes as described in Documentation/devicetree/bindings/net/phy.txt
Example:
mdio@403c0 {
compatible = "brcm,unimac-mdio";
reg = <0x403c0 0x8 0x40300 0x18>;
reg-names = "mdio", "mdio_indir_rw";
#size-cells = <1>;
#address-cells = <0>;
...
phy@0 {
compatible = "ethernet-phy-ieee802.3-c22";
reg = <0>;
};
};
* Broadcom Starfighter 2 integrated switch
Required properties:
- compatible: should be "brcm,bcm7445-switch-v4.0"
- reg: addresses and length of the register sets for the device, must be 6
pairs of register addresses and lengths
- interrupts: interrupts for the devices, must be two interrupts
- dsa,mii-bus: phandle to the MDIO bus controller, see dsa/dsa.txt
- dsa,ethernet: phandle to the CPU network interface controller, see dsa/dsa.txt
- #size-cells: must be 0
- #address-cells: must be 2, see dsa/dsa.txt
Subnodes:
The integrated switch subnode should be specified according to the binding
described in dsa/dsa.txt.
Optional properties:
- reg-names: literal names for the device base register addresses, when present
must be: "core", "reg", "intrl2_0", "intrl2_1", "fcb", "acb"
- interrupt-names: literal names for the device interrupt lines, when present
must be: "switch_0" and "switch_1"
- brcm,num-gphy: specify the maximum number of integrated gigabit PHYs in the
switch
- brcm,num-rgmii-ports: specify the maximum number of RGMII interfaces supported
by the switch
- brcm,fcb-pause-override: boolean property, if present indicates that the switch
supports Failover Control Block pause override capability
- brcm,acb-packets-inflight: boolean property, if present indicates that the switch
Admission Control Block supports reporting the number of packets in-flight in a
switch queue
Example:
switch_top@f0b00000 {
compatible = "simple-bus";
#size-cells = <1>;
#address-cells = <1>;
ranges = <0 0xf0b00000 0x40804>;
ethernet_switch@0 {
compatible = "brcm,bcm7445-switch-v4.0";
#size-cells = <0>;
#address-cells = <2>;
reg = <0x0 0x40000
0x40000 0x110
0x40340 0x30
0x40380 0x30
0x40400 0x34
0x40600 0x208>;
interrupts = <0 0x18 0
0 0x19 0>;
brcm,num-gphy = <1>;
brcm,num-rgmii-ports = <2>;
brcm,fcb-pause-override;
brcm,acb-packets-inflight;
...
switch@0 {
reg = <0 0>;
#size-cells = <0>;
#address-cells = <1>;
port@0 {
label = "gphy";
reg = <0>;
};
...
};
};
};
Bosch MCAN controller Device Tree Bindings
-------------------------------------------------
Required properties:
- compatible : Should be "bosch,m_can" for M_CAN controllers
- reg : physical base address and size of the M_CAN
registers map and Message RAM
- reg-names : Should be "m_can" and "message_ram"
- interrupts : Should be the interrupt number of M_CAN interrupt
line 0 and line 1, could be same if sharing
the same interrupt.
- interrupt-names : Should contain "int0" and "int1"
- clocks : Clocks used by controller, should be host clock
and CAN clock.
- clock-names : Should contain "hclk" and "cclk"
- pinctrl-<n> : Pinctrl states as described in bindings/pinctrl/pinctrl-bindings.txt
- pinctrl-names : Names corresponding to the numbered pinctrl states
- bosch,mram-cfg : Message RAM configuration data.
Multiple M_CAN instances can share the same Message
RAM and each element (e.g. Rx FIFO or Tx Buffer, etc.)
number in Message RAM is also configurable,
so this property is telling driver how the shared or
private Message RAM are used by this M_CAN controller.
The format should be as follows:
<offset sidf_elems xidf_elems rxf0_elems rxf1_elems
rxb_elems txe_elems txb_elems>
The 'offset' is an address offset of the Message RAM
where the following elements start from. This is
usually set to 0x0 if you're using a private Message
RAM. The remaining cells are used to specify how many
elements are used for each FIFO/Buffer.
M_CAN includes the following elements according to user manual:
11-bit Filter 0-128 elements / 0-128 words
29-bit Filter 0-64 elements / 0-128 words
Rx FIFO 0 0-64 elements / 0-1152 words
Rx FIFO 1 0-64 elements / 0-1152 words
Rx Buffers 0-64 elements / 0-1152 words
Tx Event FIFO 0-32 elements / 0-64 words
Tx Buffers 0-32 elements / 0-576 words
Please refer to 2.4.1 Message RAM Configuration in
Bosch M_CAN user manual for details.
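As a rough aid for choosing bosch,mram-cfg values, the sketch below adds up
the Message RAM words a configuration would occupy. The per-element sizes are
derived from the maximum figures in the table above (for example 1152 words /
64 elements = 18 words per Rx FIFO element), so it assumes every element is
sized for the largest case. struct mram_cfg and mram_words_used() are
hypothetical names, not driver symbols.

```c
#include <stdint.h>

/* Cells of bosch,mram-cfg that follow the leading 'offset' cell. */
struct mram_cfg {
    uint32_t sidf_elems; /* 11-bit filters */
    uint32_t xidf_elems; /* 29-bit filters */
    uint32_t rxf0_elems; /* Rx FIFO 0 */
    uint32_t rxf1_elems; /* Rx FIFO 1 */
    uint32_t rxb_elems;  /* Rx buffers */
    uint32_t txe_elems;  /* Tx event FIFO */
    uint32_t txb_elems;  /* Tx buffers */
};

/* Worst-case words per element: max words / max elements from the table. */
enum {
    SIDF_WORDS = 128 / 128, /* 1 */
    XIDF_WORDS = 128 / 64,  /* 2 */
    RX_WORDS   = 1152 / 64, /* 18 */
    TXE_WORDS  = 64 / 32,   /* 2 */
    TXB_WORDS  = 576 / 32   /* 18 */
};

/* Total number of 32-bit words the configuration needs in Message RAM. */
static uint32_t mram_words_used(const struct mram_cfg *c)
{
    return c->sidf_elems * SIDF_WORDS +
           c->xidf_elems * XIDF_WORDS +
           (c->rxf0_elems + c->rxf1_elems + c->rxb_elems) * RX_WORDS +
           c->txe_elems * TXE_WORDS +
           c->txb_elems * TXB_WORDS;
}
```

For a configuration of <0x0 0 0 32 0 0 0 1> this gives 32 * 18 + 1 * 18 = 594
words (2376 bytes) under the worst-case sizing assumed here.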
Example:
SoC dtsi:
m_can1: can@020e8000 {
compatible = "bosch,m_can";
reg = <0x020e8000 0x4000>, <0x02298000 0x4000>;
reg-names = "m_can", "message_ram";
interrupts = <0 114 0x04>,
<0 114 0x04>;
interrupt-names = "int0", "int1";
clocks = <&clks IMX6SX_CLK_CANFD>,
<&clks IMX6SX_CLK_CANFD>;
clock-names = "hclk", "cclk";
bosch,mram-cfg = <0x0 0 0 32 0 0 0 1>;
status = "disabled";
};
Board dts:
&m_can1 {
pinctrl-names = "default";
pinctrl-0 = <&pinctrl_m_can1>;
status = "okay";
};
Renesas R-Car CAN controller Device Tree Bindings
-------------------------------------------------
Required properties:
- compatible: "renesas,can-r8a7778" if CAN controller is a part of R8A7778 SoC.
"renesas,can-r8a7779" if CAN controller is a part of R8A7779 SoC.
"renesas,can-r8a7790" if CAN controller is a part of R8A7790 SoC.
"renesas,can-r8a7791" if CAN controller is a part of R8A7791 SoC.
- reg: physical base address and size of the R-Car CAN register map.
- interrupts: interrupt specifier for the sole interrupt.
- clocks: phandles and clock specifiers for 3 CAN clock inputs.
- clock-names: 3 clock input name strings: "clkp1", "clkp2", "can_clk".
- pinctrl-0: pin control group to be used for this controller.
- pinctrl-names: must be "default".
Optional properties:
- renesas,can-clock-select: R-Car CAN Clock Source Select. Valid values are:
<0x0> (default) : Peripheral clock (clkp1)
<0x1> : Peripheral clock (clkp2)
<0x3> : Externally input clock
Example
-------
SoC common .dtsi file:
can0: can@e6e80000 {
compatible = "renesas,can-r8a7791";
reg = <0 0xe6e80000 0 0x1000>;
interrupts = <0 186 IRQ_TYPE_LEVEL_HIGH>;
clocks = <&mstp9_clks R8A7791_CLK_RCAN0>,
<&cpg_clocks R8A7791_CLK_RCAN>, <&can_clk>;
clock-names = "clkp1", "clkp2", "can_clk";
status = "disabled";
};
Board specific .dts file:
&can0 {
pinctrl-0 = <&can0_pins>;
pinctrl-names = "default";
status = "okay";
};
......@@ -24,15 +24,17 @@ Optional properties:
- ti,hwmods : Must be "cpgmac0"
- no_bd_ram : Must be 0 or 1
- dual_emac : Specifies Switch to act as Dual EMAC
- syscon : Phandle to the system control device node, which is
the control module device of the am33x
Slave Properties:
Required properties:
- phy_id : Specifies slave phy id
- phy-mode : See ethernet.txt file in the same directory
- mac-address : See ethernet.txt file in the same directory
Optional properties:
- dual_emac_res_vlan : Specifies VID to be used to segregate the ports
- mac-address : See ethernet.txt file in the same directory
Note: "ti,hwmods" field is used to fetch the base address and irq
resources from TI, omap hwmod data base during device registration.
......@@ -57,6 +59,7 @@ Examples:
active_slave = <0>;
cpts_clock_mult = <0x80000000>;
cpts_clock_shift = <29>;
syscon = <&cm>;
cpsw_emac0: slave@0 {
phy_id = <&davinci_mdio>, <0>;
phy-mode = "rgmii-txid";
......@@ -85,6 +88,7 @@ Examples:
active_slave = <0>;
cpts_clock_mult = <0x80000000>;
cpts_clock_shift = <29>;
syscon = <&cm>;
cpsw_emac0: slave@0 {
phy_id = <&davinci_mdio>, <0>;
phy-mode = "rgmii-txid";
......
......@@ -39,6 +39,22 @@ Optionnal property:
This property is only used when switches are being
chained/cascaded together.
- phy-handle : Phandle to a PHY on an external MDIO bus, not the
switch internal one. See
Documentation/devicetree/bindings/net/ethernet.txt
for details.
- phy-mode : String representing the connection to the designated
PHY node specified by the 'phy-handle' property. See
Documentation/devicetree/bindings/net/ethernet.txt
for details.
Optional subnodes:
- fixed-link : Fixed-link subnode describing a link to a non-MDIO
managed entity. See
Documentation/devicetree/bindings/net/fixed-link.txt
for details.
Example:
dsa@0 {
......@@ -58,6 +74,7 @@ Example:
port@0 {
reg = <0>;
label = "lan1";
phy-handle = <&phy0>;
};
port@1 {
......
* ARC EMAC 10/100 Ethernet platform driver for Rockchip RK3066/RK3188 SoCs
Required properties:
- compatible: Should be "rockchip,rk3066-emac" or "rockchip,rk3188-emac"
according to the target SoC.
- reg: Address and length of the register set for the device
- interrupts: Should contain the EMAC interrupts
- rockchip,grf: phandle to the syscon grf used to control speed and mode
for emac.
- phy: see ethernet.txt file in the same directory.
- phy-mode: see ethernet.txt file in the same directory.
Optional properties:
- phy-supply: phandle to a regulator if the PHY needs one
Clock handling:
- clocks: Must contain an entry for each entry in clock-names.
- clock-names: Shall be "hclk" for the host clock needed to calculate and set
polling period of EMAC and "macref" for the reference clock needed to transfer
data to and from the phy.
Child nodes of the driver are the individual PHY devices connected to the
MDIO bus. They must have a "reg" property given the PHY address on the MDIO bus.
Examples:
ethernet@10204000 {
compatible = "rockchip,rk3188-emac";
reg = <0xc0fc2000 0x3c>;
interrupts = <6>;
mac-address = [ 00 11 22 33 44 55 ];
clocks = <&cru HCLK_EMAC>, <&cru SCLK_MAC>;
clock-names = "hclk", "macref";
pinctrl-names = "default";
pinctrl-0 = <&emac_xfer>, <&emac_mdio>, <&phy_int>;
rockchip,grf = <&grf>;
phy = <&phy0>;
phy-mode = "rmii";
phy-supply = <&vcc_rmii>;
#address-cells = <1>;
#size-cells = <0>;
phy0: ethernet-phy@0 {
reg = <1>;
};
};
......@@ -16,6 +16,12 @@ Optional properties:
- phy-handle : phandle to the PHY device connected to this device.
- fixed-link : Assume a fixed link. See fixed-link.txt in the same directory.
Use instead of phy-handle.
- fsl,num-tx-queues : The property is valid for enet-avb IP, which supports
hw multi queues. Should specify the tx queue number, otherwise set tx queue
number to 1.
- fsl,num-rx-queues : The property is valid for enet-avb IP, which supports
hw multi queues. Should specify the rx queue number, otherwise set rx queue
number to 1.
Optional subnodes:
- mdio : specifies the mdio bus in the FEC, used as a container for phy nodes
......
* Marvell PXA168 Ethernet Controller
Required properties:
- compatible: should be "marvell,pxa168-eth".
- reg: address and length of the register set for the device.
- interrupts: interrupt for the device.
- clocks: pointer to the clock for the device.
Optional properties:
- port-id: Ethernet port number. Should be '0','1' or '2'.
- #address-cells: must be 1 when using sub-nodes.
- #size-cells: must be 0 when using sub-nodes.
- phy-handle: see ethernet.txt file in the same directory.
- local-mac-address: see ethernet.txt file in the same directory.
Sub-nodes:
Each PHY can be represented as a sub-node. This is not mandatory.
Sub-nodes required properties:
- reg: the MDIO address of the PHY.
Example:
eth0: ethernet@f7b90000 {
compatible = "marvell,pxa168-eth";
reg = <0xf7b90000 0x10000>;
clocks = <&chip CLKID_GETH0>;
interrupts = <GIC_SPI 24 IRQ_TYPE_LEVEL_HIGH>;
#address-cells = <1>;
#size-cells = <0>;
phy-handle = <&ethphy0>;
ethphy0: ethernet-phy@0 {
reg = <0>;
};
};
* Amlogic Meson DWMAC Ethernet controller
The device inherits all the properties of the dwmac/stmmac devices
described in the file net/stmmac.txt with the following changes.
Required properties:
- compatible: should be "amlogic,meson6-dwmac" along with "snps,dwmac"
and any applicable more detailed version number
described in net/stmmac.txt
- reg: should contain a register range for the dwmac controller and
another one for the Amlogic specific configuration
Example:
ethmac: ethernet@c9410000 {
compatible = "amlogic,meson6-dwmac", "snps,dwmac";
reg = <0xc9410000 0x10000
0xc1108108 0x4>;
interrupts = <0 8 1>;
interrupt-names = "macirq";
clocks = <&clk81>;
clock-names = "stmmaceth";
};
......@@ -26,7 +26,7 @@ Example (for ARM-based BeagleBoard xM with ST21NFCB on I2C2):
clock-frequency = <400000>;
interrupt-parent = <&gpio5>;
interrupts = <2 IRQ_TYPE_LEVEL_LOW>;
interrupts = <2 IRQ_TYPE_LEVEL_HIGH>;
reset-gpios = <&gpio5 29 GPIO_ACTIVE_HIGH>;
};
......
......@@ -13,6 +13,11 @@ Optional SoC Specific Properties:
- pinctrl-names: Contains only one value - "default".
- pinctrl-0: Specifies the pin control groups used for this controller.
- autosuspend-delay: Specify autosuspend delay in milliseconds.
- vin-voltage-override: Specify voltage of VIN pin in microvolts.
- irq-status-read-quirk: Specify that the trf7970a being used has the
"IRQ Status Read" erratum.
- en2-rf-quirk: Specify that the trf7970a being used has the "EN2 RF"
erratum.
Example (for ARM-based BeagleBone with TRF7970A on SPI1):
......@@ -30,7 +35,10 @@ Example (for ARM-based BeagleBone with TRF7970A on SPI1):
ti,enable-gpios = <&gpio2 2 GPIO_ACTIVE_LOW>,
<&gpio2 5 GPIO_ACTIVE_LOW>;
vin-supply = <&ldo3_reg>;
vin-voltage-override = <5000000>;
autosuspend-delay = <30000>;
irq-status-read-quirk;
en2-rf-quirk;
status = "okay";
};
};
* Qualcomm QCA7000 (Ethernet over SPI protocol)
Note: The QCA7000 is useable as a SPI device. In this case it must be defined
as a child of a SPI master in the device tree.
Required properties:
- compatible : Should be "qca,qca7000"
- reg : Should specify the SPI chip select
- interrupts : The first cell should specify the index of the source interrupt
and the second cell should specify the trigger type as rising edge
- spi-cpha : Must be set
- spi-cpol: Must be set
Optional properties:
- interrupt-parent : Specify the phandle of the source interrupt
- spi-max-frequency : Maximum frequency of the SPI bus the chip can operate at.
Numbers smaller than 1000000 or greater than 16000000 are invalid. Missing
the property will set the SPI frequency to 8000000 Hertz.
- local-mac-address: 6 bytes, MAC address
- qca,legacy-mode : Set the SPI data transfer of the QCA7000 to legacy mode.
In this mode the SPI master must toggle the chip select between each data
word. In burst mode these gaps aren't necessary, which is faster.
This setting depends on how the QCA7000 is set up via GPIO pin strapping.
If the property is missing the driver defaults to burst mode.
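The spi-max-frequency rules above can be summarised in a few lines of C. This
is a sketch of the documented constraints only; qca7000_pick_spi_hz() and the
QCA7000_SPI_HZ_* macros are hypothetical names and are not taken from the
driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define QCA7000_SPI_HZ_MIN     1000000u  /* smaller values are invalid */
#define QCA7000_SPI_HZ_MAX     16000000u /* larger values are invalid */
#define QCA7000_SPI_HZ_DEFAULT 8000000u  /* used when the property is missing */

/* Pick the SPI clock to use. 'present' is false when spi-max-frequency is
 * absent from the node. Returns false if the given value is out of range,
 * otherwise stores the frequency in *hz_out and returns true. */
static bool qca7000_pick_spi_hz(bool present, uint32_t value, uint32_t *hz_out)
{
    if (!present) {
        *hz_out = QCA7000_SPI_HZ_DEFAULT;
        return true;
    }
    if (value < QCA7000_SPI_HZ_MIN || value > QCA7000_SPI_HZ_MAX)
        return false;
    *hz_out = value;
    return true;
}
```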
Example:
/* Freescale i.MX28 SPI master*/
ssp2: spi@80014000 {
#address-cells = <1>;
#size-cells = <0>;
compatible = "fsl,imx28-spi";
pinctrl-names = "default";
pinctrl-0 = <&spi2_pins_a>;
status = "okay";
qca7000: ethernet@0 {
compatible = "qca,qca7000";
reg = <0x0>;
interrupt-parent = <&gpio3>; /* GPIO Bank 3 */
interrupts = <25 0x1>; /* Index: 25, rising edge */
spi-cpha; /* SPI mode: CPHA=1 */
spi-cpol; /* SPI mode: CPOL=1 */
spi-max-frequency = <8000000>; /* freq: 8 MHz */
local-mac-address = [ A0 B0 C0 D0 E0 F0 ];
};
};
......@@ -12,6 +12,10 @@ Required properties:
- altr,sysmgr-syscon : Should be the phandle to the system manager node that
encompasses the glue register, the register offset, and the register shift.
Optional properties:
altr,emac-splitter: Should be the phandle to the emac splitter soft IP node if
DWMAC controller is connected to emac splitter.
Example:
gmac0: ethernet@ff700000 {
......
DCTCP (DataCenter TCP)
----------------------
DCTCP is an enhancement to the TCP congestion control algorithm for data
center networks and leverages Explicit Congestion Notification (ECN) in
the data center network to provide multi-bit feedback to the end hosts.
To enable it on end hosts:
sysctl -w net.ipv4.tcp_congestion_control=dctcp
All switches in the data center network running DCTCP must support ECN
marking and be configured for marking when reaching defined switch buffer
thresholds. The default ECN marking threshold heuristic for DCTCP on
switches is 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps,
but might need further careful tweaking.
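To make the "multi-bit feedback" idea concrete, the following sketch shows
the sender-side estimator as described in the DCTCP SIGCOMM paper referenced
in this file: a running average alpha of the fraction of ECN-marked packets,
and a window reduction proportional to alpha. It is a floating-point
illustration only; struct dctcp_sketch and both function names are
hypothetical, and the gain value is an assumption left to the caller.

```c
/* Hypothetical per-connection state for this sketch. */
struct dctcp_sketch {
    double alpha; /* running estimate of the marked fraction, 0..1 */
    double g;     /* estimation gain, 0 < g < 1, chosen by the caller */
};

/* Once per window of data: fold the observed fraction of ECN-marked
 * packets into alpha, alpha = (1 - g) * alpha + g * F. */
static void dctcp_update_alpha(struct dctcp_sketch *s,
                               unsigned int acked, unsigned int marked)
{
    double frac = acked ? (double)marked / (double)acked : 0.0;

    s->alpha = (1.0 - s->g) * s->alpha + s->g * frac;
}

/* On congestion: shrink the window in proportion to alpha,
 * cwnd = cwnd * (1 - alpha / 2). */
static double dctcp_reduce_cwnd(const struct dctcp_sketch *s, double cwnd)
{
    return cwnd * (1.0 - s->alpha / 2.0);
}
```

With alpha close to 0 the window is barely reduced, while alpha equal to 1
halves it as conventional TCP would.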
For more details, see the documents below:
Paper:
The algorithm is further described in detail in the following two
SIGCOMM/SIGMETRICS papers:
i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan:
"Data Center TCP (DCTCP)", Data Center Networks session
Proc. ACM SIGCOMM, New Delhi, 2010.
http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
http://www.sigcomm.org/ccr/papers/2010/October/1851275.1851192
ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar:
"Analysis of DCTCP: Stability, Convergence, and Fairness"
Proc. ACM SIGMETRICS, San Jose, 2011.
http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
IETF informational draft:
http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
DCTCP site:
http://simula.stanford.edu/~alizade/Site/DCTCP.html
......@@ -951,7 +951,7 @@ Size modifier is one of ...
Mode modifier is one of:
BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */
BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
BPF_ABS 0x20
BPF_IND 0x40
BPF_MEM 0x60
......@@ -995,6 +995,275 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
2 byte atomic increments are not supported.
eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
of two consecutive 'struct bpf_insn' 8-byte blocks and is interpreted as a single
instruction that loads a 64-bit immediate value into a dst_reg.
Classic BPF has a similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
a 32-bit immediate value into a register.
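The split of a 64-bit immediate across two 8-byte blocks can be sketched as
follows. struct toy_insn is a local stand-in declared so that the example is
self-contained (it is not the kernel's struct bpf_insn), and the helper names
are hypothetical. The sketch assumes the low 32 bits go into the first block
and the high 32 bits into the second.

```c
#include <stdint.h>

/* Local stand-in for one 8-byte eBPF instruction block. */
struct toy_insn {
    uint8_t code;
    uint8_t dst_reg:4;
    uint8_t src_reg:4;
    int16_t off;
    int32_t imm;
};

#define TOY_LD_DW_IMM 0x18 /* BPF_LD (0x00) | BPF_DW (0x18) | BPF_IMM (0x00) */

/* Encode "dst = value" as two consecutive blocks: the first carries the
 * opcode, the destination register and the low 32 bits; the second carries
 * only the high 32 bits. */
static void toy_ld_imm64(struct toy_insn out[2], uint8_t dst, uint64_t value)
{
    out[0] = (struct toy_insn){ .code = TOY_LD_DW_IMM, .dst_reg = dst & 0xf,
                                .imm = (int32_t)(uint32_t)value };
    out[1] = (struct toy_insn){ .imm = (int32_t)(uint32_t)(value >> 32) };
}

/* Reassemble the 64-bit immediate from the two blocks. */
static uint64_t toy_imm64_value(const struct toy_insn in[2])
{
    return (uint64_t)(uint32_t)in[0].imm |
           ((uint64_t)(uint32_t)in[1].imm << 32);
}
```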
eBPF verifier
-------------
The safety of the eBPF program is determined in two steps.
The first step does a DAG check to disallow loops and performs other CFG validation.
In particular, it detects programs that have unreachable instructions
(though the classic BPF checker allows them).
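The first step can be pictured as a depth-first walk over the control flow
graph. The sketch below is a simplified model and not the code in
kernel/bpf/verifier.c: the instruction kinds, the result codes and the
TOY_MAX_INSNS limit are all hypothetical, and jump targets are assumed to be
pc + 1 + off.

```c
enum toy_kind { TOY_ALU, TOY_JMP_COND, TOY_JMP_ALWAYS, TOY_EXIT };

struct toy_cfg_insn {
    enum toy_kind kind;
    int off; /* jump offset relative to the next instruction */
};

enum toy_cfg_result {
    TOY_CFG_OK, TOY_CFG_LOOP, TOY_CFG_UNREACHABLE, TOY_CFG_BAD_JUMP
};

#define TOY_MAX_INSNS 64

/* color: 0 = unvisited, 1 = on the current DFS path, 2 = fully explored */
static enum toy_cfg_result toy_visit(const struct toy_cfg_insn *p, int n,
                                     int i, unsigned char *color)
{
    int succ[2], nsucc = 0;

    if (i < 0 || i >= n)
        return TOY_CFG_BAD_JUMP; /* jump or fall-through out of range */
    if (color[i] == 1)
        return TOY_CFG_LOOP;     /* back-edge: the graph is not a DAG */
    if (color[i] == 2)
        return TOY_CFG_OK;
    color[i] = 1;

    switch (p[i].kind) {
    case TOY_ALU:        succ[nsucc++] = i + 1; break;
    case TOY_JMP_COND:   succ[nsucc++] = i + 1;
                         succ[nsucc++] = i + 1 + p[i].off; break;
    case TOY_JMP_ALWAYS: succ[nsucc++] = i + 1 + p[i].off; break;
    case TOY_EXIT:       break;
    }
    for (int k = 0; k < nsucc; k++) {
        enum toy_cfg_result r = toy_visit(p, n, succ[k], color);
        if (r != TOY_CFG_OK)
            return r;
    }
    color[i] = 2;
    return TOY_CFG_OK;
}

/* Walk from insn 0; anything never visited afterwards is unreachable. */
static enum toy_cfg_result toy_check_cfg(const struct toy_cfg_insn *p, int n)
{
    unsigned char color[TOY_MAX_INSNS] = {0};
    enum toy_cfg_result r;

    if (n <= 0 || n > TOY_MAX_INSNS)
        return TOY_CFG_BAD_JUMP;
    r = toy_visit(p, n, 0, color);
    if (r != TOY_CFG_OK)
        return r;
    for (int i = 0; i < n; i++)
        if (color[i] == 0)
            return TOY_CFG_UNREACHABLE;
    return TOY_CFG_OK;
}
```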
Second step starts from the first insn and descends all possible paths.
It simulates execution of every insn and observes the state change of
registers and stack.
At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If verifier sees an insn that does R2=R1, then R2 has now type
PTR_TO_CTX as well and can be used on the right hand side of expression.
If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=UNKNOWN_VALUE,
since addition of two valid pointers makes invalid pointer.
(In 'secure' mode verifier will reject any type of pointer arithmetic to make
sure that kernel addresses don't leak to unprivileged users)
If register was never written to, it's not readable:
bpf_mov R0 = R2
bpf_exit
will be rejected, since R2 is unreadable at the start of the program.
After kernel function call, R1-R5 are reset to unreadable and
R0 has a return type of the function.
Since R6-R9 are callee saved, their state is preserved across the call.
bpf_mov R6 = 1
bpf_call foo
bpf_mov R0 = R6
bpf_exit
is a correct program. If there was R1 instead of R6, it would have
been rejected.
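The register rules above can be modelled in a few lines. This is a toy state
tracker written for illustration, not the verifier's data structures: the
TOY_* names and functions are hypothetical, and an immediate move is simply
treated as producing a readable scalar.

```c
#include <stdbool.h>

/* Simplified register types; the names follow the text above. */
enum toy_reg_type {
    TOY_NOT_INIT, TOY_UNKNOWN_VALUE, TOY_PTR_TO_CTX, TOY_FRAME_PTR
};

#define TOY_NR_REGS 11 /* R0..R10 */

struct toy_regs {
    enum toy_reg_type type[TOY_NR_REGS];
};

/* Program entry: R1 holds the context pointer, R10 the frame pointer,
 * every other register is unreadable. */
static void toy_regs_init(struct toy_regs *st)
{
    for (int i = 0; i < TOY_NR_REGS; i++)
        st->type[i] = TOY_NOT_INIT;
    st->type[1] = TOY_PTR_TO_CTX;
    st->type[10] = TOY_FRAME_PTR;
}

/* dst = immediate: dst becomes a readable scalar. R10 is read-only. */
static bool toy_mov_imm(struct toy_regs *st, int dst)
{
    if (dst < 0 || dst >= 10)
        return false;
    st->type[dst] = TOY_UNKNOWN_VALUE;
    return true;
}

/* dst = src: rejected if src was never written, else dst inherits its type. */
static bool toy_mov_reg(struct toy_regs *st, int dst, int src)
{
    if (dst < 0 || dst >= 10 || src < 0 || src >= TOY_NR_REGS)
        return false;
    if (st->type[src] == TOY_NOT_INIT)
        return false;
    st->type[dst] = st->type[src];
    return true;
}

/* Function call: R1-R5 become unreadable, R0 takes the return type,
 * R6-R9 (callee saved) are left untouched. */
static void toy_call(struct toy_regs *st, enum toy_reg_type ret)
{
    for (int i = 1; i <= 5; i++)
        st->type[i] = TOY_NOT_INIT;
    st->type[0] = ret;
}

/* exit: R0 must be readable. */
static bool toy_exit(const struct toy_regs *st)
{
    return st->type[0] != TOY_NOT_INIT;
}
```

Replaying the two programs above against this model gives the same verdicts:
the R6 variant passes, while reading R1 after the call is refused.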
load/store instructions are allowed only with registers of valid types, which
are PTR_TO_CTX, PTR_TO_MAP, FRAME_PTR. They are bounds and alignment checked.
For example:
bpf_mov R1 = 1
bpf_mov R2 = 2
bpf_xadd *(u32 *)(R1 + 3) += R2
bpf_exit
will be rejected, since R1 doesn't have a valid pointer type at the time of
execution of instruction bpf_xadd.
At the start R1 type is PTR_TO_CTX (a pointer to generic 'struct bpf_context')
A callback is used to customize verifier to restrict eBPF program access to only
certain fields within ctx structure with specified size and alignment.
For example, the following insn:
bpf_ld R0 = *(u32 *)(R6 + 8)
intends to load a word from address R6 + 8 and store it into R0
If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
If R6=FRAME_PTR, then access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
so it will fail verification, since it's out of bounds.
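The FRAME_PTR rule can be written down as a small predicate. This is a sketch
under stated assumptions: TOY_MAX_BPF_STACK is set to 512 on the assumption
that it matches the kernel's MAX_BPF_STACK, only access sizes of 1, 2, 4 and
8 bytes are considered, and toy_stack_access_ok() is a hypothetical name.

```c
#include <stdbool.h>

#define TOY_MAX_BPF_STACK 512 /* assumed value of MAX_BPF_STACK */

/* Check an access of 'size' bytes at frame pointer + 'off'. */
static bool toy_stack_access_ok(int off, int size)
{
    if (size != 1 && size != 2 && size != 4 && size != 8)
        return false;
    /* The offset must lie in [-TOY_MAX_BPF_STACK, 0). */
    if (off >= 0 || off < -TOY_MAX_BPF_STACK)
        return false;
    /* The whole access must stay below the frame pointer. */
    if (off + size > 0)
        return false;
    /* The access must be naturally aligned. */
    return off % size == 0;
}
```

With this predicate an 8-byte access at offset 8 is refused, while the same
access at offset -8 is accepted.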
The verifier will allow eBPF program to read data from stack only after
it wrote into it.
Classic BPF verifier does similar check with M[0-15] memory slots.
For example:
bpf_ld R0 = *(u32 *)(R10 - 4)
bpf_exit
is an invalid program.
Though R10 is correct read-only register and has type FRAME_PTR
and R10 - 4 is within stack bounds, there were no stores into that location.
Pointer register spill/fill is tracked as well, since four (R6-R9)
callee saved registers may not be enough for some programs.
Allowed function calls are customized with bpf_verifier_ops->get_func_proto()
The eBPF verifier will check that registers match argument constraints.
After the call register R0 will be set to return type of the function.
Function calls are the main mechanism to extend the functionality of eBPF programs.
Socket filters may let programs call one set of functions, whereas tracing
filters may allow a completely different set.
If a function is made accessible to eBPF programs, it needs to be thought through
from a safety point of view. The verifier will guarantee that the function is
called with valid arguments.
seccomp vs socket filters have different security restrictions for classic BPF.
Seccomp solves this by two stage verifier: classic BPF verifier is followed
by seccomp verifier. In case of eBPF one configurable verifier is shared for
all use cases.
See details of eBPF verifier in kernel/bpf/verifier.c
eBPF maps
---------
'maps' is a generic storage of different types for sharing data between kernel
and userspace.
The maps are accessed from user space via BPF syscall, which has commands:
- create a map with given type and attributes
map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
returns process-local file descriptor or negative error
- lookup key in a given map
err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
using attr->map_fd, attr->key, attr->value
returns zero and stores found elem into value or negative error
- create or update key/value pair in a given map
err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
using attr->map_fd, attr->key, attr->value
returns zero or negative error
- find and delete element by key in a given map
err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
using attr->map_fd, attr->key
- to delete map: close(fd)
Exiting process will delete maps automatically
userspace programs use this syscall to create/access maps that eBPF programs
are concurrently updating.
maps can have different types: hash, array, bloom filter, radix-tree, etc.
The map is defined by:
. type
. max number of elements
. key size in bytes
. value size in bytes
Understanding eBPF verifier messages
------------------------------------
The following are a few examples of invalid eBPF programs and verifier error
messages as seen in the log:
Program with unreachable instructions:
static struct bpf_insn prog[] = {
BPF_EXIT_INSN(),
BPF_EXIT_INSN(),
};
Error:
unreachable insn 1
Program that reads uninitialized register:
BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
BPF_EXIT_INSN(),
Error:
0: (bf) r0 = r2
R2 !read_ok
Program that doesn't initialize R0 before exiting:
BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
BPF_EXIT_INSN(),
Error:
0: (bf) r2 = r1
1: (95) exit
R0 !read_ok
Program that accesses stack out of bounds:
BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
BPF_EXIT_INSN(),
Error:
0: (7a) *(u64 *)(r10 +8) = 0
invalid stack off=8 size=8
Program that doesn't initialize stack before passing its address into function:
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
BPF_EXIT_INSN(),
Error:
0: (bf) r2 = r10
1: (07) r2 += -8
2: (b7) r1 = 0x0
3: (85) call 1
invalid indirect read from stack off -8+0 size 8
Program that uses invalid map_fd=0 while calling to map_lookup_elem() function:
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
BPF_EXIT_INSN(),
Error:
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (b7) r1 = 0x0
4: (85) call 1
fd 0 is not pointing to valid bpf_map
Program that doesn't check return value of map_lookup_elem() before accessing
map element:
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
BPF_EXIT_INSN(),
Error:
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (b7) r1 = 0x0
4: (85) call 1
5: (7a) *(u64 *)(r0 +0) = 0
R0 invalid mem access 'map_value_or_null'
Program that correctly checks map_lookup_elem() returned value for NULL, but
accesses the memory with incorrect alignment:
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
BPF_EXIT_INSN(),
Error:
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (b7) r1 = 1
4: (85) call 1
5: (15) if r0 == 0x0 goto pc+1
R0=map_ptr R10=fp
6: (7a) *(u64 *)(r0 +4) = 0
misaligned access off 4 size 8
Program that correctly checks map_lookup_elem() returned value for NULL and
accesses memory with correct alignment in one side of 'if' branch, but fails
to do so in the other side of 'if' branch:
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
BPF_EXIT_INSN(),
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),
Error:
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (b7) r1 = 1
4: (85) call 1
5: (15) if r0 == 0x0 goto pc+2
R0=map_ptr R10=fp
6: (7a) *(u64 *)(r0 +0) = 0
7: (95) exit
from 5 to 8: R0=imm0 R10=fp
8: (7a) *(u64 *)(r0 +0) = 1
R0 invalid mem access 'imm'
Testing
-------
......
......@@ -65,6 +65,12 @@ neigh/default/gc_thresh1 - INTEGER
purge entries if there are fewer than this number.
Default: 128
neigh/default/gc_thresh2 - INTEGER
Threshold when garbage collector becomes more aggressive about
purging entries. Entries older than 5 seconds will be cleared
when over this number.
Default: 512
neigh/default/gc_thresh3 - INTEGER
Maximum number of neighbor entries allowed. Increase this
when using large numbers of interfaces and when communicating
......@@ -757,8 +763,21 @@ icmp_ratelimit - INTEGER
icmp_ratemask (see below) to specific targets.
0 to disable any limiting,
otherwise the minimal space between responses in milliseconds.
Note that another sysctl, icmp_msgs_per_sec limits the number
of ICMP packets sent on all targets.
Default: 1000
icmp_msgs_per_sec - INTEGER
Limit maximal number of ICMP packets sent per second from this host.
Only messages whose type matches icmp_ratemask (see below) are
controlled by this limit.
Default: 1000
icmp_msgs_burst - INTEGER
icmp_msgs_per_sec controls number of ICMP packets sent per second,
while icmp_msgs_burst controls the burst size of these packets.
Default: 50
icmp_ratemask - INTEGER
Mask made of ICMP types for which rates are being limited.
Significant bits: IHGFEDCBA9876543210
......@@ -832,6 +851,11 @@ igmp_max_memberships - INTEGER
conf/all/* is special, changes the settings for all interfaces
igmp_qrv - INTEGER
Controls the IGMP query robustness variable (see RFC2236 8.1).
Default: 2 (as specified by RFC2236 8.1)
Minimum: 1 (as specified by RFC6636 4.5)
log_martians - BOOLEAN
Log packets with impossible addresses to kernel log.
log_martians for the interface will be enabled if at least one of
......@@ -935,14 +959,9 @@ accept_source_route - BOOLEAN
FALSE (host)
accept_local - BOOLEAN
Accept packets with local source addresses. In combination
with suitable routing, this can be used to direct packets
between two local interfaces over the wire and have them
accepted properly.
rp_filter must be set to a non-zero value in order for
accept_local to have an effect.
Accept packets with local source addresses. In combination with
suitable routing, this can be used to direct packets between two
local interfaces over the wire and have them accepted properly.
default FALSE
route_localnet - BOOLEAN
......@@ -1140,6 +1159,11 @@ anycast_src_echo_reply - BOOLEAN
FALSE: disabled
Default: FALSE
mld_qrv - INTEGER
Controls the MLD query robustness variable (see RFC3810 9.1).
Default: 2 (as specified by RFC3810 9.1)
Minimum: 1 (as specified by RFC6636 4.5)
IPv6 Fragmentation:
ip6frag_high_thresh - INTEGER
......
......@@ -99,6 +99,9 @@ Examples:
pgset "clone_skb 1" sets the number of copies of the same packet
pgset "clone_skb 0" use single SKB for all transmits
pgset "burst 8" uses xmit_more API to queue 8 copies of the same
packet and update HW tx queue tail pointer once.
"burst 1" is the default
pgset "pkt_size 9014" sets packet size to 9014
pgset "frags 5" packet will consist of 5 fragments
pgset "count 200000" sets number of packets to send, set to zero
......
The existing interfaces for getting network packages time stamped are:
1. Control Interfaces
The interfaces for receiving network packet timestamps are:
* SO_TIMESTAMP
Generate time stamp for each incoming packet using the (not necessarily
monotonous!) system time. Result is returned via recv_msg() in a
control message as timeval (usec resolution).
Generates a timestamp for each incoming packet in (not necessarily
monotonic) system time. Reports the timestamp via recvmsg() in a
control message as struct timeval (usec resolution).
* SO_TIMESTAMPNS
Same time stamping mechanism as SO_TIMESTAMP, but returns result as
timespec (nsec resolution).
Same timestamping mechanism as SO_TIMESTAMP, but reports the
timestamp as struct timespec (nsec resolution).
* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
Only for multicasts: approximate send time stamp by receiving the looped
packet and using its receive time stamp.
Only for multicast: approximate transmit timestamp obtained by
reading the looped packet receive timestamp.
The following interface complements the existing ones: receive time
stamps can be generated and returned for arbitrary packets and much
closer to the point where the packet is really sent. Time stamps can
be generated in software (as before) or in hardware (if the hardware
has such a feature).
* SO_TIMESTAMPING
Generates timestamps on reception, transmission or both. Supports
multiple timestamp sources, including hardware. Supports generating
timestamps for stream sockets.
SO_TIMESTAMPING:
Instructs the socket layer which kind of information should be collected
and/or reported. The parameter is an integer with some of the following
bits set. Setting other bits is an error and doesn't change the current
state.
1.1 SO_TIMESTAMP:
Four of the bits are requests to the stack to try to generate
timestamps. Any combination of them is valid.
This socket option enables timestamping of datagrams on the reception
path. Because the destination socket, if any, is not known early in
the network stack, the feature has to be enabled for all packets. The
same is true for all early receive timestamp options.
SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamps in hardware
SOF_TIMESTAMPING_TX_SOFTWARE: try to obtain send time stamps in software
SOF_TIMESTAMPING_RX_HARDWARE: try to obtain receive time stamps in hardware
SOF_TIMESTAMPING_RX_SOFTWARE: try to obtain receive time stamps in software
For interface details, see `man 7 socket`.
1.2 SO_TIMESTAMPNS:
This option is identical to SO_TIMESTAMP except for the returned data type.
Its struct timespec allows for higher resolution (ns) timestamps than the
timeval of SO_TIMESTAMP (us).
1.3 SO_TIMESTAMPING:
Supports multiple types of timestamp requests. As a result, this
socket option takes a bitmap of flags, not a boolean. In
err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) &val, sizeof(val));
val is an integer with any of the following bits set. Setting other
bits returns EINVAL and does not change the current state.
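The call above can be sketched as a small helper. This is a minimal illustration, not code from the patch: the function name set_timestamping() is hypothetical, and the sketch assumes a Linux system where <linux/net_tstamp.h> provides the SOF_TIMESTAMPING_* flags.

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/*
 * Hypothetical helper: install a SO_TIMESTAMPING flag bitmap on fd.
 * Returns 0 on success, or -1 with errno set (EINVAL for unknown bits).
 */
int set_timestamping(int fd, int val)
{
    /* The option takes a pointer to the bitmap and the bitmap's size. */
    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
}
```

A caller would typically pass something like SOF_TIMESTAMPING_RX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE, which requests software receive timestamps and asks for them to be reported.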
The other three bits control which timestamps will be reported in a
generated control message. If none of these bits are set or if none of
the set bits correspond to data that is available, then the control
message will not be generated:
SOF_TIMESTAMPING_SOFTWARE: report systime if available
SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available (deprecated)
SOF_TIMESTAMPING_RAW_HARDWARE: report hwtimeraw if available
1.3.1 Timestamp Generation
It is worth noting that timestamps may be collected for reasons other
than being requested by a particular socket with
SOF_TIMESTAMPING_[TR]X_(HARD|SOFT)WARE. For example, most drivers that
can generate hardware receive timestamps ignore
SOF_TIMESTAMPING_RX_HARDWARE. It is still a good idea to set that flag
in case future drivers pay attention.
Some bits are requests to the stack to try to generate timestamps. Any
combination of them is valid. Changes to these bits apply to newly
created packets, not to packets already in the stack. As a result, it
is possible to selectively request timestamps for a subset of packets
(e.g., for sampling) by embedding a send() call within two setsockopt
calls, one to enable timestamp generation and one to disable it.
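The sampling pattern described above might look roughly like the following. This is a sketch under assumptions: send_sampled() is a hypothetical name, the caller supplies the reporting flags it wants kept in base_flags, and only SOF_TIMESTAMPING_TX_SOFTWARE is toggled.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/*
 * Hypothetical helper: request a software tx timestamp for one send()
 * only. base_flags holds the flags that should stay set before and
 * after the call (for example SOF_TIMESTAMPING_SOFTWARE).
 * Returns the send() result, or -1 if a setsockopt() call fails.
 */
ssize_t send_sampled(int fd, const void *buf, size_t len, int base_flags)
{
    int on = base_flags | SOF_TIMESTAMPING_TX_SOFTWARE;
    int off = base_flags & ~SOF_TIMESTAMPING_TX_SOFTWARE;
    ssize_t ret;

    /* Enable generation: applies to packets created from now on. */
    if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &on, sizeof(on)))
        return -1;

    ret = send(fd, buf, len, 0);

    /* Disable generation again so later sends are not timestamped. */
    if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &off, sizeof(off)))
        return -1;

    return ret;
}
```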
Timestamps may also be generated for reasons other than being
requested by a particular socket, such as when receive timestamping is
enabled system wide, as explained earlier.
If timestamps are reported, they will appear in a control message with
cmsg_level==SOL_SOCKET, cmsg_type==SO_TIMESTAMPING, and a payload like
this:
SOF_TIMESTAMPING_RX_HARDWARE:
Request rx timestamps generated by the network adapter.
SOF_TIMESTAMPING_RX_SOFTWARE:
Request rx timestamps when data enters the kernel. These timestamps
are generated just after a device driver hands a packet to the
kernel receive stack.
SOF_TIMESTAMPING_TX_HARDWARE:
Request tx timestamps generated by the network adapter.
SOF_TIMESTAMPING_TX_SOFTWARE:
Request tx timestamps when data leaves the kernel. These timestamps
are generated in the device driver as close as possible, but always
prior to, passing the packet to the network interface. Hence, they
require driver support and may not be available for all devices.
SOF_TIMESTAMPING_TX_SCHED:
Request tx timestamps prior to entering the packet scheduler. Kernel
transmit latency is, if long, often dominated by queuing delay. The
difference between this timestamp and one taken at
SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent
of protocol processing. The latency incurred in protocol
processing, if any, can be computed by subtracting a userspace
timestamp taken immediately before send() from this timestamp. On
machines with virtual devices where a transmitted packet travels
through multiple devices and, hence, multiple packet schedulers,
a timestamp is generated at each layer. This allows for fine
grained measurement of queuing delay.
SOF_TIMESTAMPING_TX_ACK:
Request tx timestamps when all data in the send buffer has been
acknowledged. This only makes sense for reliable protocols. It is
currently only implemented for TCP. For that protocol, it may
over-report measurement, because the timestamp is generated when all
data up to and including the buffer at send() was acknowledged: the
cumulative acknowledgment. The mechanism ignores SACK and FACK.
1.3.2 Timestamp Reporting
The other three bits control which timestamps will be reported in a
generated control message. Changes to the bits take immediate
effect at the timestamp reporting locations in the stack. Timestamps
are only reported for packets that also have the relevant timestamp
generation request set.
SOF_TIMESTAMPING_SOFTWARE:
Report any software timestamps when available.
SOF_TIMESTAMPING_SYS_HARDWARE:
This option is deprecated and ignored.
SOF_TIMESTAMPING_RAW_HARDWARE:
Report hardware timestamps as generated by
SOF_TIMESTAMPING_TX_HARDWARE when available.
1.3.3 Timestamp Options
The interface supports one option:
SOF_TIMESTAMPING_OPT_ID:
Generate a unique identifier along with each packet. A process can
have multiple concurrent timestamping requests outstanding. Packets
can be reordered in the transmit path, for instance in the packet
scheduler. In that case timestamps will be queued onto the error
queue out of order from the original send() calls. This option
embeds a counter that is incremented at send() time, to order
timestamps within a flow.
This option is implemented only for transmit timestamps. There, the
timestamp is always looped along with a struct sock_extended_err.
The option modifies field ee_info to pass an id that is unique
among all possibly concurrently outstanding timestamp requests for
that socket. In practice, it is a monotonically increasing u32
(that wraps).
In datagram sockets, the counter increments on each send call. In
stream sockets, it increments with every byte.
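Because the identifier is a u32 that wraps, a process that orders timestamps by id cannot compare the raw values directly. One common approach, shown here as a sketch rather than anything the patch prescribes, is serial-number style comparison. The name tskey_before() is hypothetical, and the sketch assumes the two ids are less than 2^31 apart.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if id a was issued before id b, tolerating
 * u32 wraparound. Assumes the ids are less than 2^31 apart.
 */
bool tskey_before(uint32_t a, uint32_t b)
{
    /* The unsigned difference wraps; its sign bit gives the order. */
    return (int32_t)(a - b) < 0;
}
```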
1.4 Bytestream Timestamps
The SO_TIMESTAMPING interface supports timestamping of bytes in a
bytestream. Each request is interpreted as a request for when the
entire contents of the buffer has passed a timestamping point. That
is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record
when all bytes have reached the device driver, regardless of how
many packets the data has been converted into.
In general, bytestreams have no natural delimiters and therefore
correlating a timestamp with data is non-trivial. A range of bytes
may be split across segments, any segments may be merged (possibly
coalescing sections of previously segmented buffers associated with
independent send() calls). Segments can be reordered and the same
byte range can coexist in multiple segments for protocols that
implement retransmissions.
It is essential that all timestamps implement the same semantics,
regardless of these possible transformations, as otherwise they are
incomparable. Handling "rare" corner cases differently from the
simple case (a 1:1 mapping from buffer to skb) is insufficient
because performance debugging often needs to focus on such outliers.
In practice, timestamps can be correlated with segments of a
bytestream consistently, if both semantics of the timestamp and the
timing of measurement are chosen correctly. This challenge is no
different from deciding on a strategy for IP fragmentation. There, the
definition is that only the first fragment is timestamped. For
bytestreams, we chose that a timestamp is generated only when all
bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to
implement and reason about. An implementation that has to take into
account SACK would be more complex due to possible transmission holes
and out of order arrival.
On the host, TCP can also break the simple 1:1 mapping from buffer to
skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The
implementation ensures correctness in all cases by tracking the
individual last byte passed to send(), even if it is no longer the
last byte after an skbuff extend or merge operation. It stores the
relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff
has only one such field, only one timestamp can be generated.
In rare cases, a timestamp request can be missed if two requests are
collapsed onto the same skb. A process can detect this situation by
enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at
send time with the value returned for each timestamp. It can prevent
the situation by always flushing the TCP stack in between requests,
for instance by enabling TCP_NODELAY and disabling TCP_CORK and
autocork.
These precautions ensure that the timestamp is generated only when all
bytes have passed a timestamp point, assuming that the network stack
itself does not reorder the segments. The stack indeed tries to avoid
reordering. The one exception is under administrator control: it is
possible to construct a packet scheduler configuration that delays
segments from the same stream differently. Such a setup would be
unusual.
2 Data Interfaces
Timestamps are read using the ancillary data feature of recvmsg().
See `man 3 cmsg` for details of this interface. The socket manual
page (`man 7 socket`) describes how timestamps generated with
SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved.
2.1 SCM_TIMESTAMPING records
These timestamps are returned in a control message with cmsg_level
SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type
struct scm_timestamping {
struct timespec systime;
struct timespec hwtimetrans;
struct timespec hwtimeraw;
struct timespec ts[3];
};
recvmsg() can be used to get this control message for regular incoming
packets. For send time stamps the outgoing packet is looped back to
the socket's error queue with the send time stamp(s) attached. It can
be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the
original outgoing packet data including all headers prepended down to
and including the link layer, the scm_timestamping control message and
a sock_extended_err control message with ee_errno==ENOMSG and
ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending
bounced packet is ready for reading as far as select() is concerned.
If the outgoing packet has to be fragmented, then only the first
fragment is time stamped and returned to the sending socket.
All three values correspond to the same event in time, but were
generated in different ways. Each of these values may be empty (= all
zero), in which case no such value was available. If the application
is not interested in some of these values, they can be left blank to
avoid the potential overhead of calculating them.
systime is the value of the system time at that moment. This
corresponds to the value also returned via SO_TIMESTAMP[NS]. If the
time stamp was generated by hardware, then this field is
empty. Otherwise it is filled in if SOF_TIMESTAMPING_SOFTWARE is
set.
hwtimeraw is the original hardware time stamp. Filled in if
SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its
relation to system time should be made.
hwtimetrans is always zero. This field is deprecated. It used to hold
hw timestamps converted to system time. Instead, expose the hardware
clock device on the NIC directly as a HW PTP clock source, to allow
time conversion in userspace and optionally synchronize system time
with a userspace PTP stack such as linuxptp. For the PTP clock API,
see Documentation/ptp/ptp.txt.
SIOCSHWTSTAMP, SIOCGHWTSTAMP:
The structure can return up to three timestamps. This is a legacy
feature. Only one field is non-zero at any time. Most timestamps
are passed in ts[0]. Hardware timestamps are passed in ts[2].
ts[1] used to hold hardware timestamps converted to system time.
Instead, expose the hardware clock device on the NIC directly as
a HW PTP clock source, to allow time conversion in userspace and
optionally synchronize system time with a userspace PTP stack such
as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt.
2.1.1 Transmit timestamps with MSG_ERRQUEUE
For transmit timestamps the outgoing packet is looped back to the
socket's error queue with the send timestamp(s) attached. A process
receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE
set and with a msg_control buffer sufficiently large to receive the
relevant metadata structures. The recvmsg call returns the original
outgoing data packet with two ancillary messages attached.
A message of cmsg_level SOL_IP(V6) and cmsg_type IP(V6)_RECVERR
embeds a struct sock_extended_err. This defines the error type. For
timestamps, the ee_errno field is ENOMSG. The other ancillary message
will have cmsg_level SOL_SOCKET and cmsg_type SCM_TIMESTAMPING. This
embeds the struct scm_timestamping.
2.1.1.2 Timestamp types
The semantics of the three struct timespec are defined by field
ee_info in the extended error structure. It contains a value of
type SCM_TSTAMP_* to define the actual timestamp passed in
scm_timestamping.
The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_*
control fields discussed previously, with one exception. For legacy
reasons, SCM_TSTAMP_SND is equal to zero and can be set for both
SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It
is the first if ts[2] is non-zero, the second otherwise, in which
case the timestamp is stored in ts[0].
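The SCM_TSTAMP_SND rule above (hardware if ts[2] is non-zero, software in ts[0] otherwise) can be expressed as a small selector. This is an illustrative sketch; pick_snd_timestamp() and ts_is_set() are hypothetical names, not part of the interface.

```c
#include <stdbool.h>
#include <time.h>

/* A timespec counts as "set" if either field is non-zero. */
bool ts_is_set(const struct timespec *ts)
{
    return ts->tv_sec != 0 || ts->tv_nsec != 0;
}

/*
 * Hypothetical helper for SCM_TSTAMP_SND records: returns the hardware
 * timestamp in ts[2] if it is non-zero, else the software timestamp in
 * ts[0]. *is_hw reports which of the two was chosen.
 */
const struct timespec *pick_snd_timestamp(const struct timespec ts[3],
                                          bool *is_hw)
{
    *is_hw = ts_is_set(&ts[2]);
    return *is_hw ? &ts[2] : &ts[0];
}
```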
2.1.1.3 Fragmentation
Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
then only the first fragment is timestamped and returned to the sending
socket.
2.1.1.4 Packet Payload
The calling application is often not interested in receiving the whole
packet payload that it passed to the stack originally: the socket
error queue mechanism is just a method to piggyback the timestamp on.
In this case, the application can choose to read datagrams with a
smaller buffer, possibly even of length 0. The payload is truncated
accordingly. Until the process calls recvmsg() on the error queue,
however, the full packet is queued, taking up budget from SO_RCVBUF.
2.1.1.5 Blocking Read
Reading from the error queue is always a non-blocking operation. To
block waiting on a timestamp, use poll or select. poll() will return
POLLERR in pollfd.revents if any data is ready on the error queue.
There is no need to pass this flag in pollfd.events. This flag is
ignored on request. See also `man 2 poll`.
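Blocking until a timestamp is queued might be sketched as follows. The helper name wait_errqueue() is hypothetical. The sketch leaves pollfd.events at zero because, as noted above, POLLERR is reported regardless of the requested events.

```c
#include <poll.h>

/*
 * Hypothetical helper: wait up to timeout_ms for data on the error
 * queue of fd. Returns 1 if POLLERR was reported, 0 if the wait ended
 * without it, and -1 if poll() itself failed.
 */
int wait_errqueue(int fd, int timeout_ms)
{
    /* events stays 0: POLLERR is always reported in revents. */
    struct pollfd pfd = { .fd = fd, .events = 0 };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0)
        return -1;
    return (ret > 0 && (pfd.revents & POLLERR)) ? 1 : 0;
}
```

After a return value of 1, the process would call recvmsg() with MSG_ERRQUEUE to fetch the queued timestamp.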
2.1.2 Receive timestamps
On reception, there is no reason to read from the socket error queue.
The SCM_TIMESTAMPING ancillary data is sent along with the packet data
on a normal recvmsg(). Since this is not a socket error, it is not
accompanied by a message SOL_IP(V6)/IP(V6)_RECVERR. In this case,
the meaning of the three fields in struct scm_timestamping is
implicitly defined. ts[0] holds a software timestamp if set, ts[1]
is again deprecated and ts[2] holds a hardware timestamp if set.
3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
Hardware time stamping must also be initialized for each device driver
that is expected to do hardware time stamping. The parameter is defined in
......@@ -167,8 +372,7 @@ enum {
*/
};
DEVICE IMPLEMENTATION
3.1 Hardware Timestamping Implementation: Device Drivers
A driver which supports hardware time stamping must support the
SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
......