How is epoll implemented?


Source: http://www.quora.com/Network-Programming/How-is-epoll-implemented

I'll start from "How is select implemented?", because most of the infrastructure involved is the same.

The main difference between epoll and select: with select, the list of file descriptors to monitor exists only for the duration of a single select call, and the calling thread sits on the fds' wait queues only for that call. With epoll, you create a single fd that aggregates events from the multiple file descriptors you want to monitor (attached to their open file descriptions), so the monitored list is persistent, or long-lived: threads remain on the fds' wait queues across multiple system calls. Furthermore, an epfd can be shared among multiple threads, so what sits on an fd's wait queue is no longer a single thread but a structure holding all threads currently waiting on the epfd. (In the Linux implementation, the socket fd's wait queue entry holds a function pointer and a void * data pointer to pass to that function; this feels rather awkward, but if you don't dwell on the implementation, you can simply regard it as an association between the epfd and the socket fd.)
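
A minimal sketch (my own illustration, not from the original answer) of that persistence difference: the interest list is registered once with epoll_ctl and survives across epoll_wait calls, whereas select() re-submits its fd set on every call. A pipe stands in for a socket here.

/* Register once, wait many times: the kernel keeps the interest list. */
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
	int pipefd[2];
	if (pipe(pipefd) == -1) { perror("pipe"); return 1; }

	int epfd = epoll_create1(0);             /* the long-lived epoll fd */
	if (epfd == -1) { perror("epoll_create1"); return 1; }

	struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
	/* One-time registration; no need to repeat it before each wait. */
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1) {
		perror("epoll_ctl"); return 1;
	}

	write(pipefd[1], "x", 1);                /* make the read end ready */

	struct epoll_event events[8];
	/* Note: no fd list is passed here, only the epoll fd itself. */
	int n = epoll_wait(epfd, events, 8, -1);
	for (int i = 0; i < n; i++)
		printf("fd %d is ready\n", events[i].data.fd);

	close(epfd); close(pipefd[0]); close(pipefd[1]);
	return 0;
}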

So, to explain the mechanics in a bit more detail:

1. An epoll fd has a private struct eventpoll that records which fds are registered on the epfd. The eventpoll also has a wait queue recording all waiting threads, plus a ready list of fds that are currently readable or writable.

2. When you add an fd to the epfd with epoll_ctl, epoll adds the eventpoll (a pointer, or something like it) to that fd's wait queue. It also checks whether the fd is already ready (readable or writable) and, if so, adds it to the ready list.

3. When a thread calls epoll_wait, the kernel first checks the ready list and returns immediately if any fd is ready. If the ready list is empty, the thread adds itself to the epfd's wait queue and goes to sleep.

4. When an event occurs on a socket monitored by the epfd, the fd is added to the ready list and all threads waiting on the epfd are woken up (a simplified user-space model of steps 3 and 4 is sketched below).
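
None of that kernel machinery is user-visible, but the interplay of steps 3 and 4 can be modeled in user space. The sketch below is only an analogy (names such as ready_list and ep_wait are invented for this example): a mutex-protected array plays the role of eventpoll's ready list, and a condition variable plays the role of its wait queue.

/* User-space model of steps 3 and 4 (an analogy, not the kernel code). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_READY 64

static int ready_list[MAX_READY];            /* stand-in for the ready list */
static int ready_count = 0;
static pthread_mutex_t ep_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ep_waitq = PTHREAD_COND_INITIALIZER;  /* "wait queue" */

/* Step 4: an event source marks an fd ready and wakes all waiters. */
static void ep_event_occurred(int fd)
{
	pthread_mutex_lock(&ep_lock);
	if (ready_count < MAX_READY)
		ready_list[ready_count++] = fd;
	pthread_cond_broadcast(&ep_waitq);   /* wake everyone on the queue */
	pthread_mutex_unlock(&ep_lock);
}

/* Step 3: check the ready list first; sleep only if it is empty. */
static int ep_wait(void)
{
	pthread_mutex_lock(&ep_lock);
	while (ready_count == 0)
		pthread_cond_wait(&ep_waitq, &ep_lock);  /* go to sleep */
	int fd = ready_list[--ready_count];          /* consume one ready fd */
	pthread_mutex_unlock(&ep_lock);
	return fd;
}

static void *waiter(void *arg)
{
	(void)arg;
	printf("fd %d became ready\n", ep_wait());
	return NULL;
}

int main(void)
{
	pthread_t t;
	pthread_create(&t, NULL, waiter, NULL);
	sleep(1);                 /* let the waiter block first */
	ep_event_occurred(42);    /* simulate an event on fd 42 */
	pthread_join(t, NULL);
	return 0;
}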

Obviously, struct eventpoll needs plenty of careful locking around its various lists and wait queues, but those are implementation details.

The important thing to remember is that none of the steps described above loops over the list of monitored fds. By being entirely event-driven and keeping a persistent fd set and a ready list, epoll avoids operations that cost O(n) time, where n is the number of monitored file descriptors.

Summary: networking can finally move forward again. What I want to know next is how the kernel protocol stack reads data, writes to the socket buffer, notifies the fd, and schedules the process.

Original answer:

I'd start off by reading my answer to "How is select implemented?" (Network Programming: How is select implemented?), because most of the infrastructure involved is the same.

The main difference between epoll and select is that in select(), the list of file descriptors to wait on only exists for the duration of a single select() call, and the calling task only stays on the sockets' wait queues for the duration of a single call. In epoll, on the other hand, you create a single file descriptor that aggregates events from multiple other file descriptors you want to wait on, and so the list of monitored fd's is long-lasting, and tasks stay on socket wait queues across multiple system calls. Furthermore, since an epoll fd can be shared across multiple tasks, it is no longer a single task on the wait queue, but a structure that itself contains another wait queue, containing all processes currently waiting on the epoll fd. (In terms of implementation, this is abstracted over by the sockets' wait queues holding a function pointer and a void * data pointer to pass to that function).

So, to explain the mechanics a little more:

    1. An epoll file descriptor has a private struct eventpoll that keeps track of which fd's are attached to this fd. struct eventpoll also has a wait queue that keeps track of all processes that are currently epoll_waiting on this fd. struct eventpoll also has a list of all file descriptors that are currently available for reading or writing.
    2. When you add a file descriptor to an epoll fd using epoll_ctl, epoll adds the struct eventpoll to that fd's wait queue. It also checks if the fd is currently ready for processing and adds it to the ready list, if so.
    3. When you wait on an epoll fd using epoll_wait, the kernel first checks the ready list, and returns immediately if any file descriptors are already ready. If not, it adds itself to the single wait queue inside struct eventpoll, and goes to sleep.
    4. When an event occurs on a socket that is being epoll()ed, it calls the epoll callback, which adds the file descriptor to the ready list, and also wakes up any waiters that are currently waiting on that struct eventpoll.

Obviously, a lot of careful locking is needed on struct eventpoll and the various lists and wait queues, but that's an implementation detail.

The important thing to note is that at no point above did I describe a step that loops over all file descriptors of interest. By being entirely event-based and by using a long-lasting set of fd's and a ready list, epoll can avoid ever taking O(n) time for an operation, where n is the number of file descriptors being monitored.

An excerpt of the eventpoll struct from the Linux kernel source (fs/eventpoll.c):

struct eventpoll {
	/* Protect the access to this structure */
	spinlock_t lock;

	/*
	 * This mutex is used to ensure that files are not removed
	 * while epoll is using them. This is held during the event
	 * collection loop, the file cleanup path, the epoll file exit
	 * code and the ctl operations.
	 */
	struct mutex mtx;

	/* Wait queue used by sys_epoll_wait() */
	wait_queue_head_t wq;

	/* Wait queue used by file->poll() */
	wait_queue_head_t poll_wait;

	/* List of ready file descriptors */
	struct list_head rdllist;

	/* RB tree root used to store monitored fd structs */
	struct rb_root rbr;

	/*
	 * This is a single linked list that chains all the "struct epitem" that
	 * happened while transferring ready events to userspace w/out
	 * holding ->lock.
	 */
	struct epitem *ovflist;

	/* The user that created the eventpoll descriptor */
	struct user_struct *user;
};
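
The rbr field above is the red-black tree that indexes one struct epitem per monitored fd; every epoll_ctl operation begins with a lookup in that tree. A small user-space sketch of the three ctl operations (illustrative only, with a pipe as the stand-in target fd):

/* ADD inserts an epitem into the rb-tree, MOD updates it, DEL removes it. */
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
	int p[2];
	if (pipe(p) == -1) { perror("pipe"); return 1; }

	int epfd = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };

	epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);  /* insert the epitem */

	ev.events = EPOLLIN | EPOLLOUT;
	epoll_ctl(epfd, EPOLL_CTL_MOD, p[0], &ev);  /* look it up and update */

	epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], NULL); /* unlink and free it */

	close(epfd); close(p[0]); close(p[1]);
	return 0;
}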

Reposted from: https://my.oschina.net/astute/blog/92683

