26 Linux Network Framework

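The Linux kernel network stack is a complex, layered system that carries network packets between the physical layer and the application layer.
The layers, shown here against the OSI reference model, are as follows: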

+-------------------------------------------------------------+
|                      Application Layer                      |
|               e.g. HTTP, FTP, DNS, SMTP, SSH                |
|        network services and communication interfaces        |
+-------------------------------------------------------------+
                               ↑
                               |
+-------------------------------------------------------------+
|                     Presentation Layer                      |
|               e.g. SSL/TLS, JPEG, MPEG, ASCII               |
|        data encryption, compression, and formatting         |
+-------------------------------------------------------------+
                               ↑
                               |
+-------------------------------------------------------------+
|                        Session Layer                        |
|                     e.g. NetBIOS, PPTP                      |
|           session setup, management, and teardown           |
+-------------------------------------------------------------+
                               ↑
                               |
+-------------------------------------------------------------+
|                       Transport Layer                       |
|                        e.g. TCP, UDP                        |
|      end-to-end transport; TCP adds reliable delivery       |
+-------------------------------------------------------------+
                               ↑
                               |
+-------------------------------------------------------------+
|                        Network Layer                        |
|                 e.g. IPv4, IPv6, ICMP, ARP                  |
|           routing and forwarding across networks            |
+-------------------------------------------------------------+
                               ↑
                               |
+-------------------------------------------------------------+
|                       Data Link Layer                       |
|               e.g. Ethernet, Frame Relay, PPP               |
|            frame transfer between adjacent nodes            |
+-------------------------------------------------------------+
                               ↑
                               |
+-------------------------------------------------------------+
|                       Physical Layer                        |
|       e.g. Ethernet PHY, optical fiber, coaxial cable       |
|              bit transmission over the medium               |
+-------------------------------------------------------------+

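26.1 Physical Layer

The physical layer handles the hardware signaling of the network interface. It involves the PHY (Physical Layer Device) chip and its driver.

The PHY layer mainly involves:

  • PHY chip: converts the electrical or optical signals on the medium into digital signals that the network controller can receive and interpret.

  • PHY driver: manages communication between the PHY chip and the MAC (Media Access Control) layer. The driver talks to the PHY chip over the MDIO (Management Data Input/Output) bus to detect link status and to negotiate speed and half/full duplex.

Key PHY-layer data structure:

struct phy_device: the main kernel data structure representing a PHY device. It stores the PHY's state, features, and capabilities.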
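
As a rough illustration of what a PHY driver learns over MDIO, the sketch below decodes the Clause 22 Basic Mode Status Register (register 1). The bit masks mirror the BMSR_* values in the kernel's include/uapi/linux/mii.h. The struct link_summary type and the decode_bmsr() function are hypothetical names introduced here, and the register value is passed in as an argument rather than read from real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit masks of the Clause 22 Basic Mode Status Register (register 1).
 * The values mirror BMSR_* in include/uapi/linux/mii.h. */
#define BMSR_LSTATUS      0x0004  /* link is up */
#define BMSR_ANEGCAPABLE  0x0008  /* PHY is able to auto-negotiate */
#define BMSR_ANEGCOMPLETE 0x0020  /* auto-negotiation has finished */

/* Hypothetical summary type, for illustration only. */
struct link_summary {
	bool link_up;
	bool autoneg_capable;
	bool autoneg_done;
};

/* Decode a raw BMSR value as an MDIO read would return it. */
struct link_summary decode_bmsr(uint16_t bmsr)
{
	struct link_summary s = {
		.link_up         = (bmsr & BMSR_LSTATUS) != 0,
		.autoneg_capable = (bmsr & BMSR_ANEGCAPABLE) != 0,
		.autoneg_done    = (bmsr & BMSR_ANEGCOMPLETE) != 0,
	};
	return s;
}
```

For example, a register value of 0x796d decodes as link up with auto-negotiation complete, while 0x7949 decodes as link down.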

/**
 * struct phy_device - An instance of a PHY
 *
 * @mdio: MDIO bus this PHY is on
 * @drv: Pointer to the driver for this PHY instance
 * @phy_id: UID for this device found during discovery
 * @c45_ids: 802.3-c45 Device Identifiers if is_c45.
 * @is_c45:  Set to true if this PHY uses clause 45 addressing.
 * @is_internal: Set to true if this PHY is internal to a MAC.
 * @is_pseudo_fixed_link: Set to true if this PHY is an Ethernet switch, etc.
 * @is_gigabit_capable: Set to true if PHY supports 1000Mbps
 * @has_fixups: Set to true if this PHY has fixups/quirks.
 * @suspended: Set to true if this PHY has been suspended successfully.
 * @suspended_by_mdio_bus: Set to true if this PHY was suspended by MDIO bus.
 * @sysfs_links: Internal boolean tracking sysfs symbolic links setup/removal.
 * @loopback_enabled: Set true if this PHY has been loopbacked successfully.
 * @downshifted_rate: Set true if link speed has been downshifted.
 * @state: State of the PHY for management purposes
 * @dev_flags: Device-specific flags used by the PHY driver.
 * @irq: IRQ number of the PHY's interrupt (-1 if none)
 * @phy_timer: The timer for handling the state machine
 * @phylink: Pointer to phylink instance for this PHY
 * @sfp_bus_attached: Flag indicating whether the SFP bus has been attached
 * @sfp_bus: SFP bus attached to this PHY's fiber port
 * @attached_dev: The attached enet driver's device instance ptr
 * @adjust_link: Callback for the enet controller to respond to changes: in the
 *               link state.
 * @phy_link_change: Callback for phylink for notification of link change
 * @macsec_ops: MACsec offloading ops.
 *
 * @speed: Current link speed
 * @duplex: Current duplex
 * @port: Current port
 * @pause: Current pause
 * @asym_pause: Current asymmetric pause
 * @supported: Combined MAC/PHY supported linkmodes
 * @advertising: Currently advertised linkmodes
 * @adv_old: Saved advertised while power saving for WoL
 * @lp_advertising: Current link partner advertised linkmodes
 * @eee_broken_modes: Energy efficient ethernet modes which should be prohibited
 * @autoneg: Flag autoneg being used
 * @link: Current link state
 * @autoneg_complete: Flag auto negotiation of the link has completed
 * @mdix: Current crossover
 * @mdix_ctrl: User setting of crossover
 * @interrupts: Flag interrupts have been enabled
 * @interface: enum phy_interface_t value
 * @skb: Netlink message for cable diagnostics
 * @nest: Netlink nest used for cable diagnostics
 * @ehdr: Netlink header for cable diagnostics
 * @phy_led_triggers: Array of LED triggers
 * @phy_num_led_triggers: Number of triggers in @phy_led_triggers
 * @led_link_trigger: LED trigger for link up/down
 * @last_triggered: last LED trigger for link speed
 * @master_slave_set: User requested master/slave configuration
 * @master_slave_get: Current master/slave advertisement
 * @master_slave_state: Current master/slave configuration
 * @mii_ts: Pointer to time stamper callbacks
 * @lock:  Mutex for serialization access to PHY
 * @state_queue: Work queue for state machine
 * @shared: Pointer to private data shared by phys in one package
 * @priv: Pointer to driver private data
 *
 * interrupts currently only supports enabled or disabled,
 * but could be changed in the future to support enabling
 * and disabling specific interrupts
 *
 * Contains some infrastructure for polling and interrupt
 * handling, as well as handling shifts in PHY hardware state
 */
struct phy_device {
	struct mdio_device mdio;

	/* Information about the PHY type */
	/* And management functions */
	struct phy_driver *drv;

	u32 phy_id;

	struct phy_c45_device_ids c45_ids;
	unsigned is_c45:1;
	unsigned is_internal:1;
	unsigned is_pseudo_fixed_link:1;
	unsigned is_gigabit_capable:1;
	unsigned has_fixups:1;
	unsigned suspended:1;
	unsigned suspended_by_mdio_bus:1;
	unsigned sysfs_links:1;
	unsigned loopback_enabled:1;
	unsigned downshifted_rate:1;

	unsigned autoneg:1;
	/* The most recently read link state */
	unsigned link:1;
	unsigned autoneg_complete:1;

	/* Interrupts are enabled */
	unsigned interrupts:1;

	enum phy_state state;

	u32 dev_flags;

	phy_interface_t interface;

	/*
	 * forced speed & duplex (no autoneg)
	 * partner speed & duplex & pause (autoneg)
	 */
	int speed;
	int duplex;
	int port;
	int pause;
	int asym_pause;
	u8 master_slave_get;
	u8 master_slave_set;
	u8 master_slave_state;

	/* Union of PHY and Attached devices' supported link modes */
	/* See ethtool.h for more info */
	__ETHTOOL_DECLARE_LINK_MODE_MASK(supported);
	__ETHTOOL_DECLARE_LINK_MODE_MASK(advertising);
	__ETHTOOL_DECLARE_LINK_MODE_MASK(lp_advertising);
	/* used with phy_speed_down */
	__ETHTOOL_DECLARE_LINK_MODE_MASK(adv_old);

	/* Energy efficient ethernet modes which should be prohibited */
	u32 eee_broken_modes;

#ifdef CONFIG_LED_TRIGGER_PHY
	struct phy_led_trigger *phy_led_triggers;
	unsigned int phy_num_led_triggers;
	struct phy_led_trigger *last_triggered;

	struct phy_led_trigger *led_link_trigger;
#endif

	/*
	 * Interrupt number for this PHY
	 * -1 means no interrupt
	 */
	int irq;

	/* private data pointer */
	/* For use by PHYs to maintain extra state */
	void *priv;

	/* shared data pointer */
	/* For use by PHYs inside the same package that need a shared state. */
	struct phy_package_shared *shared;

	/* Reporting cable test results */
	struct sk_buff *skb;
	void *ehdr;
	struct nlattr *nest;

	/* Interrupt and Polling infrastructure */
	struct delayed_work state_queue;

	struct mutex lock;

	/* This may be modified under the rtnl lock */
	bool sfp_bus_attached;
	struct sfp_bus *sfp_bus;
	struct phylink *phylink;
	struct net_device *attached_dev;
	struct mii_timestamper *mii_ts;

	u8 mdix;
	u8 mdix_ctrl;

	void (*phy_link_change)(struct phy_device *phydev, bool up);
	void (*adjust_link)(struct net_device *dev);

#if IS_ENABLED(CONFIG_MACSEC)
	/* MACsec management functions */
	const struct macsec_ops *macsec_ops;
#endif

	ANDROID_KABI_RESERVE(1);
	ANDROID_KABI_RESERVE(2);
	ANDROID_KABI_RESERVE(3);
	ANDROID_KABI_RESERVE(4);
};
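
Key fields and related functions:

  • state: the state of the PHY state machine (an enum phy_state value such as PHY_UP, PHY_RUNNING, or PHY_NOLINK).

  • link: whether the physical link is currently up.

  • speed: the current link speed, for example 10 Mbps, 100 Mbps, or 1000 Mbps.

  • duplex: half-duplex or full-duplex mode.

  • phy_attach(): binds a PHY device to a network device (the MAC), including the sysfs links between the two.

  • phy_start() / phy_stop(): start or stop the PHY state machine, which tracks and reports link state changes.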

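Passing data from the PHY layer to the MAC layer:

Once the PHY has decoded the physical signal, the resulting bit stream is passed to the MAC over an interface such as MII, GMII, or RGMII. Separately, the PHY driver reports the link state (link up or down, speed, duplex) to the MAC driver so that the MAC can be configured to match and can handle received frames correctly.

Physical medium (RJ45) ──> PHY (electrical signal to bit stream) ──> RGMII ──> MAC (bit stream to Ethernet frames)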
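
The link-state notification can be pictured with a small user-space model. This is not kernel code: every mock_* type and function below is a hypothetical stand-in. Only the idea follows the text, namely that the PHY side invokes an adjust_link-style callback when link or speed changes, and the MAC side reconfigures itself in response.

```c
#include <stdbool.h>
#include <stddef.h>

struct mock_phy;

/* Hypothetical stand-in for the MAC driver's private state. */
struct mock_mac {
	bool carrier_on;
	int speed;          /* Mb/s currently programmed into the MAC */
	int reconfigured;   /* how many times the MAC was reprogrammed */
};

/* Hypothetical stand-in for the few phy_device fields used here. */
struct mock_phy {
	bool link;
	int speed;
	struct mock_mac *mac;
	void (*adjust_link)(struct mock_phy *phy);
};

/* MAC side: react to the state the PHY reports. */
void mock_mac_adjust_link(struct mock_phy *phy)
{
	struct mock_mac *mac = phy->mac;

	mac->carrier_on = phy->link;
	if (phy->link && mac->speed != phy->speed) {
		mac->speed = phy->speed;
		mac->reconfigured++;
	}
}

/* PHY side: record the new state and notify the MAC only on a change. */
void mock_phy_update(struct mock_phy *phy, bool link, int speed)
{
	bool changed = phy->link != link || (link && phy->speed != speed);

	phy->link = link;
	if (link)
		phy->speed = speed;
	if (changed && phy->adjust_link != NULL)
		phy->adjust_link(phy);
}
```

Calling mock_phy_update() twice with the same link state triggers the callback only once, which is the behavior a MAC driver relies on to avoid needless reconfiguration.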

26.2 Data Link Layer

The MAC driver manages the network interface controller (NIC). It sends and receives data and interacts with the PHY layer. On Rockchip platforms the MAC driver code is under kernel/drivers/net/ethernet/stmicro/*.

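26.2.1 Main Parts of the MAC Driver

  • Initialization and registration: the driver registers the device with the kernel so that the kernel can manage it.

  • Receive path: takes data arriving from the PHY layer and passes it up to the network protocol stack.

  • Transmit path: sends data from the upper layers through the MAC to the PHY layer and on to the physical medium.

  • Interrupt handling: services the NIC's hardware interrupts, which typically signal receive and transmit events.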

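26.2.2 Registration

The MAC driver registers a network device with the kernel through a net_device structure. The steps are:

  • Device initialization: allocate and initialize the net_device structure, set the MAC address, and initialize hardware resources.

  • Device registration: call register_netdev() to register the device with the Linux kernel.

int register_netdev(struct net_device *dev);

This function registers the net_device structure with the kernel's networking subsystem and sets up the corresponding network interface.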

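26.2.3 Key Data Structures

  • struct net_device: the core kernel data structure representing a network device. It holds the device's basic information, and its netdev_ops member points to the driver's operation callbacks (open, stop, transmit, and so on). A simplified view:

    /* Simplified for illustration; not the full kernel definitions. */
    struct net_device_ops {
        int (*ndo_open)(struct net_device *dev);  // bring the interface up
        int (*ndo_stop)(struct net_device *dev);  // bring the interface down
        netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb, struct net_device *dev); // transmit a packet
        int (*ndo_do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);    // I/O control
        // other callbacks ...
    };

    struct net_device {
        char name[IFNAMSIZ];                      // interface name
        unsigned char *dev_addr;                  // MAC address
        const struct net_device_ops *netdev_ops;  // driver callbacks
        // other fields ...
    };

  • struct sk_buff: the structure Linux uses to store and process a network packet. All data received from or sent to the MAC layer is managed through sk_buff. A simplified view:

    struct sk_buff {
        struct net_device *dev;   // associated network device
        unsigned char *data;      // pointer to the start of the packet data
        unsigned int len;         // data length in bytes
        // other fields ...
    };

  • struct phy_device: represents the physical-layer device (PHY). The MAC driver interacts with the PHY layer through it. A simplified view (the full definition appears in 26.1):

    struct phy_device {
        struct mdio_device mdio;  // MDIO bus device this PHY sits on
        u32 phy_id;               // PHY identifier found during discovery
        unsigned link:1;          // most recently read link state
        // other fields ...
    };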
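
To make the data and len fields of sk_buff more concrete, here is a minimal user-space sketch of stripping a 14-byte Ethernet header from the front of a packet. The struct mini_skb type and mini_skb_pull() are hypothetical names. The function is modeled on the kernel's skb_pull(), but it is a simplified stand-in and not the kernel implementation.

```c
#include <stddef.h>

/* Hypothetical two-field stand-in for struct sk_buff. */
struct mini_skb {
	unsigned char *data;  /* start of the current packet data */
	unsigned int len;     /* number of valid bytes at data */
};

/* Remove n bytes from the front of the packet, as a protocol layer does
 * when it has finished with its own header. Returns the new data pointer,
 * or NULL (leaving the packet unchanged) if n exceeds the data length. */
unsigned char *mini_skb_pull(struct mini_skb *skb, unsigned int n)
{
	if (n > skb->len)
		return NULL;
	skb->len -= n;
	skb->data += n;
	return skb->data;
}
```

After pulling 14 bytes from a 64-byte packet, data points at the payload and len is 50. No bytes are copied, which is why each layer can consume its header cheaply.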

26.2.4 Receiving Data

1.) Receiving data from the PHY:

  • The MAC driver uses receive interrupts or polling to detect newly arrived data.

  • Received data is placed in an sk_buff structure, normally as an Ethernet frame (MAC addresses, frame type, payload, CRC, and so on).

  • Ethernet data frame

    +-----------------+------------+---------+--------------+------------+---------+
    | Destination MAC | Source MAC | Type    | Payload      | Padding    | FCS     |
    | 6 bytes         | 6 bytes    | 2 bytes | 0-1500 bytes | 0-46 bytes | 4 bytes |
    +-----------------+------------+---------+--------------+------------+---------+

    FCS is the Frame Check Sequence. Payload plus padding is at least 46 bytes, so an untagged Ethernet frame is 64 to 1518 bytes long including the FCS.
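
The frame-size arithmetic above can be checked with a short sketch. ETH_ALEN, ETH_HLEN, ETH_FCS_LEN, and ETH_DATA_LEN use the same names and values as the kernel's if_ether.h. ETH_MIN_DATA and the two functions are hypothetical names added for this example, and VLAN tags are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN      6     /* bytes in a MAC address */
#define ETH_HLEN      14    /* destination + source + type */
#define ETH_FCS_LEN   4     /* frame check sequence */
#define ETH_DATA_LEN  1500  /* maximum payload */
#define ETH_MIN_DATA  46    /* hypothetical name: minimum payload + padding */

/* On-wire length of an untagged frame carrying payload_len bytes.
 * Short payloads are padded up to ETH_MIN_DATA bytes. */
size_t eth_frame_len(size_t payload_len)
{
	size_t data;

	assert(payload_len <= ETH_DATA_LEN);
	data = payload_len < ETH_MIN_DATA ? ETH_MIN_DATA : payload_len;
	return ETH_HLEN + data + ETH_FCS_LEN;
}

/* Read the big-endian type field that follows the two MAC addresses. */
uint16_t eth_type(const uint8_t *frame)
{
	return (uint16_t)((frame[2 * ETH_ALEN] << 8) | frame[2 * ETH_ALEN + 1]);
}
```

With these definitions, a 1500-byte payload gives 14 + 1500 + 4 = 1518 bytes, and any payload of 46 bytes or less gives the 64-byte minimum frame.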

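2.) Passing data to the upper layers

After the MAC driver has received a frame, it hands the packet to the kernel network stack, classically by calling netif_rx(skb). NAPI-based drivers such as stmmac deliver packets from their poll function instead (for example with napi_gro_receive()).

netif_rx(skb); // hand the packet to the protocol stack
  • netif_rx(skb) is asynchronous: it places the skb on a per-CPU backlog queue and raises the NET_RX_SOFTIRQ softirq instead of processing the packet in the caller's context.

  • When the softirq runs, its handler net_rx_action() processes the queued packets.

  • After the upper layers (the IP layer, then TCP/UDP, and finally delivery to user space) have finished with the packet, the skb's memory is freed.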
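
The split between a fast enqueue step and deferred processing can be sketched in user space as follows. Everything here is a simplified model under stated assumptions: the mock_* names, the fixed queue size BACKLOG_MAX, and the softirq_pending flag are hypothetical and stand in for the kernel's per-CPU backlog queue and softirq machinery.

```c
#include <stdbool.h>
#include <stddef.h>

#define BACKLOG_MAX 4  /* hypothetical queue limit for this model */

struct mock_pkt {
	int id;
};

struct backlog {
	struct mock_pkt *slot[BACKLOG_MAX];  /* ring buffer of queued packets */
	size_t head;                         /* index of the oldest packet */
	size_t count;                        /* packets currently queued */
	bool softirq_pending;                /* models a raised NET_RX_SOFTIRQ */
	unsigned long dropped;
};

/* "Interrupt" side: queue the packet and request deferred processing.
 * A full queue drops the packet instead of blocking. */
bool mock_netif_rx(struct backlog *q, struct mock_pkt *pkt)
{
	if (q->count == BACKLOG_MAX) {
		q->dropped++;
		return false;
	}
	q->slot[(q->head + q->count) % BACKLOG_MAX] = pkt;
	q->count++;
	q->softirq_pending = true;
	return true;
}

/* "Softirq" side: deliver at most budget packets, oldest first. */
size_t mock_net_rx_action(struct backlog *q, size_t budget,
			  void (*deliver)(struct mock_pkt *pkt))
{
	size_t done = 0;

	while (q->count > 0 && done < budget) {
		struct mock_pkt *pkt = q->slot[q->head];

		q->head = (q->head + 1) % BACKLOG_MAX;
		q->count--;
		if (deliver != NULL)
			deliver(pkt);
		done++;
	}
	if (q->count == 0)
		q->softirq_pending = false;
	return done;
}
```

The enqueue side does a constant amount of work, so it is cheap enough to run in interrupt context. The budget on the drain side keeps one busy interface from monopolizing the CPU.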

3.) Receive code

static void stmmac_napi_add(struct net_device *dev)
{
	struct stmmac_priv *priv = netdev_priv(dev);
	u32 queue, maxq;

	maxq = max(priv->plat->rx_queues_to_use, priv->plat->tx_queues_to_use);

	for (queue = 0; queue < maxq; queue++) {
		struct stmmac_channel *ch = &priv->channel[queue];
		int rx_budget = ((priv->plat->dma_rx_size < NAPI_POLL_WEIGHT) &&
				 (priv->plat->dma_rx_size > 0)) ?
				 priv->plat->dma_rx_size : NAPI_POLL_WEIGHT;
		int tx_budget = ((priv->plat->dma_tx_size < NAPI_POLL_WEIGHT) &&
				 (priv->plat->dma_tx_size > 0)) ?
				 priv->plat->dma_tx_size : NAPI_POLL_WEIGHT;

		ch->priv_data = priv;
		ch->index = queue;
		spin_lock_init(&ch->lock);

		if (queue < priv->plat->rx_queues_to_use) {
			netif_napi_add(dev, &ch->rx_napi, stmmac_napi_poll_rx,
				       rx_budget);
		}
		if (queue < priv->plat->tx_queues_to_use) {
			netif_tx_napi_add(dev, &ch->tx_napi,
					  stmmac_napi_poll_tx, tx_budget);
		}
	}
}

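This function adds NAPI (New API) handlers for the device's receive and transmit queues. NAPI is the Linux networking mechanism that processes packets by polling, which reduces interrupt load, especially under heavy traffic. netif_napi_add() and netif_tx_napi_add() register the poll functions stmmac_napi_poll_rx and stmmac_napi_poll_tx for the receive and transmit queues respectively. The last argument is the per-poll budget, which is the DMA ring size when that is nonzero and below NAPI_POLL_WEIGHT, and NAPI_POLL_WEIGHT otherwise.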
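
The budget rule in stmmac_napi_add() and the way a poll function uses its budget can be isolated in a small sketch. NAPI_POLL_WEIGHT has the kernel's value of 64. napi_budget(), struct mock_ring, and mock_poll_rx() are hypothetical names, and the ring is reduced to a counter of pending packets.

```c
#define NAPI_POLL_WEIGHT 64  /* same value as the kernel's default weight */

/* Same selection rule as in stmmac_napi_add(): use the ring size as the
 * budget only when it is nonzero and smaller than the default weight. */
int napi_budget(unsigned int ring_size)
{
	return (ring_size > 0 && ring_size < NAPI_POLL_WEIGHT) ?
		(int)ring_size : NAPI_POLL_WEIGHT;
}

/* Hypothetical model of a receive ring. */
struct mock_ring {
	int pending;      /* packets waiting in the ring */
	int irq_enabled;  /* 1 once polling has handed control back to IRQs */
};

/* Model of one poll call: process at most budget packets. If the ring
 * drained before the budget was used up, polling stops and the receive
 * interrupt is re-enabled; otherwise the caller will poll again. */
int mock_poll_rx(struct mock_ring *ring, int budget)
{
	int done = 0;

	while (ring->pending > 0 && done < budget) {
		ring->pending--;
		done++;
	}
	if (done < budget)
		ring->irq_enabled = 1;
	return done;
}
```

With 100 packets pending and a budget of 64, the first call processes 64 packets and leaves interrupts off, and the second call processes the remaining 36 and re-enables them.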

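26.2.5 Transmitting Data

1.) Transmit flow

  • When the kernel protocol stack needs to send data through the MAC, it calls the driver's transmit function, ndo_start_xmit(), which handles the data in the sk_buff.

  • The MAC driver hands the sk_buff data to the MAC hardware, which passes it over RGMII or a similar interface to the PHY. The PHY then converts the data into physical signals on the network.

2.) Key transmit function

The transmit function pointer ndo_start_xmit() is defined in struct net_device_ops, which net_device references. The network layer calls it whenever it has data to send.

The main MAC driver source is kernel/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c. Its net_device_ops table is: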

static const struct net_device_ops stmmac_netdev_ops = {
        .ndo_open = stmmac_open,
        .ndo_start_xmit = stmmac_xmit,
        .ndo_stop = stmmac_release,
        .ndo_change_mtu = stmmac_change_mtu,
        .ndo_fix_features = stmmac_fix_features,
        .ndo_set_features = stmmac_set_features,
        .ndo_set_rx_mode = stmmac_set_rx_mode,
        .ndo_tx_timeout = stmmac_tx_timeout,
        .ndo_do_ioctl = stmmac_ioctl,
        .ndo_setup_tc = stmmac_setup_tc,
        .ndo_select_queue = stmmac_select_queue,
#ifdef CONFIG_NET_POLL_CONTROLLER
        .ndo_poll_controller = stmmac_poll_controller,
#endif
        .ndo_set_mac_address = stmmac_set_mac_address,
        .ndo_vlan_rx_add_vid = stmmac_vlan_rx_add_vid,
        .ndo_vlan_rx_kill_vid = stmmac_vlan_rx_kill_vid,
};
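
The reason for the callback table is that the protocol stack only ever sees the table and never calls a driver function by name. The user-space sketch below models that dispatch. The mock_* and demo_* names are hypothetical, the table is reduced to two callbacks, and the return value 0 merely plays the role of NETDEV_TX_OK.

```c
#include <stddef.h>

struct mock_netdev;

struct mock_skb {
	size_t len;
};

/* Hypothetical, reduced version of a net_device_ops table. */
struct mock_netdev_ops {
	int (*ndo_open)(struct mock_netdev *dev);
	int (*ndo_start_xmit)(struct mock_skb *skb, struct mock_netdev *dev);
};

struct mock_netdev {
	const struct mock_netdev_ops *ops;
	int up;
	unsigned long tx_packets;
	unsigned long tx_bytes;
};

/* Driver side: the callbacks a driver would supply. */
int demo_open(struct mock_netdev *dev)
{
	dev->up = 1;
	return 0;
}

int demo_xmit(struct mock_skb *skb, struct mock_netdev *dev)
{
	dev->tx_packets++;
	dev->tx_bytes += skb->len;
	return 0;  /* plays the role of NETDEV_TX_OK */
}

const struct mock_netdev_ops demo_ops = {
	.ndo_open       = demo_open,
	.ndo_start_xmit = demo_xmit,
};

/* Stack side: knows only the ops table, not the driver behind it. */
int mock_stack_send(struct mock_netdev *dev, struct mock_skb *skb)
{
	if (!dev->up)
		return -1;  /* interface is down */
	return dev->ops->ndo_start_xmit(skb, dev);
}
```

Swapping demo_ops for another table changes the driver without touching mock_stack_send(), which is the same decoupling that stmmac_netdev_ops provides between the kernel stack and the stmmac driver.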

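26.3 Summary

  • The MAC driver handles data processing, framing, and transfer. It keeps data moving efficiently through the network interface and implements functions such as flow control and error detection.

  • The PHY driver handles the physical transmission and reception of signals. It manages the interaction with the physical medium and ensures that signals are reliably transferred and converted.

Together the two form the basic functional framework of a network device and support the complete data transfer path.

The network layer (TCP/IP) and the application layer are not covered here.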